CSv3 cluster operations

Overview

This guide covers four cluster-level operations on a Cloud 66 Skycap v3 (CSv3) K3s cluster:

  • Adding a node
  • Resizing the cluster
  • Cordoning a node
  • Draining a node

All four operations are asynchronous: triggering one creates a timeline operation that you can watch, and it doesn't block the Dashboard.

Reduced vs High Availability

A CSv3 cluster runs in one of two availability modes, set when you create it and shown on the cluster page:

  • Reduced Availability (RA): your manager and worker roles share the same nodes, so the managers also run application workloads. It needs fewer servers and costs less, which suits development, testing, and smaller workloads, but it offers limited redundancy: a single shared manager is a single point of failure, and as the cluster grows the control plane competes with your app pods for resources. RA is not recommended for production.
  • High Availability (HA): three or more dedicated managers (running only the control plane) plus one or more workers. This is the production-grade topology: it survives a manager node failure and keeps control-plane work off your application nodes.

When to move from RA to HA

Move to HA when either is true:

  • You need to survive a node failure. A reduced-availability cluster can't — losing the shared manager takes the whole cluster down.
  • The shared managers are under pressure. Because managers also run workloads in RA, the control plane can end up competing with your app — showing up as slow kubectl or Dashboard responses, scheduling lag, or etcd slow-write warnings. Dedicated managers remove that contention.

To upgrade, use Upgrade to High Availability on the cluster page (see Upgrading to high availability). Cloud 66 provisions the dedicated managers and migrates your workloads onto the worker nodes; it's a guided flow with some manual steps.

Adding a node

Adding a node means provisioning a new server in your cloud provider and joining it to your K3s cluster — most commonly a worker (which runs your application workloads). Adding manager nodes (the control-plane nodes — CSv3's term for what older Maestro docs call masters) works differently; see Upgrading to high availability below.

How to add a worker

Open your application in the Dashboard → cluster page → Workers tab → select the relevant server pool → click Add servers and set the new pool size.

Upgrading to high availability

A CSv3 cluster starts with a single shared manager. You don't add managers one at a time — moving to a fault-tolerant three-manager control plane is a separate Upgrade to High Availability action on the cluster page (not Scale up). It's a guided flow with both automated and manual steps; follow the prompts in the dashboard.

What happens when you add a node

The whole flow runs as a Scale up timeline operation you can watch:

  1. Cloud 66 checks that no other scale operation is already running on the pool, and that the new size is compatible with your database replication requirements.
  2. A new server is provisioned with your cloud provider.
  3. Kubernetes (K3s) is installed on it.
  4. The new server joins your existing cluster as a worker.
  5. Once it reports healthy, the node is available for scheduling.
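
Once the Scale up operation completes, you can confirm the last step from the command line. The following is a minimal sketch, assuming kubectl is pointed at the kubeconfig you downloaded from Cloud 66. The helper name wait_for_node_ready and the 300-second default are illustrative and not part of Cloud 66.

```shell
# Sketch: block until a newly added node reports Ready.
# Assumption: kubectl already uses your downloaded kubeconfig.
# $1 = node name, $2 = optional timeout (defaults to 300s).
wait_for_node_ready() {
  # kubectl wait returns non-zero if the condition is not met in time.
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-300s}"
}
```

Call it as wait_for_node_ready <node-name> 600s, using a node name taken from kubectl get nodes.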

Common failures

If add-node fails, the timeline operation will show one of these errors:

Error message | What it means
Cloud 66 cannot connect to at least one of your stack servers (with sudo permissions), deployment aborted, unable to continue | SSH to one of your existing servers (where we need to fetch the join token) failed. Most commonly a firewall change, key rotation, or a server already in a bad state.
Cloud 66 cannot create all of your required servers | The cloud provider rejected the server creation call. Quota, region availability, or credential problems.
Cannot fetch agent_join_token from the server (file not present) or agent_join_token on the server is empty | The existing manager isn't running K3s correctly, so the token file is missing or empty. The cluster itself is in a degraded state, so adding a node is not the fix; investigate the manager first.
Unable to create any servers in your cloud | Every server allocation attempt failed at the cloud provider level.
We have created your servers, however there was an issue installing server components. | Servers came up, but the post-install scaffolding step failed. The full underlying error is appended to this message.

Resizing the cluster

In CSv3, "resize" means changing the number of nodes in a server pool, not changing the size (CPU/RAM) of existing nodes.

Scale up

Scaling up uses the same procedure and code path as Adding a node.

Scale down

Cluster page → Workers tab → select the pool → reduce the server count, or remove individual servers from the pool.

Cloud 66 marks the selected servers for removal and deletes them in a single operation. Servers running database workloads are excluded from automatic scale-down to protect data; if the servers you're removing host databases, you'll see:

Can't scale down because there are still N servers running database workloads

To remove a database-hosting server you need to first migrate or remove the database workload from it.
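
To see what is scheduled on a server before you remove it, you can list its pods. This is a sketch that assumes kubectl uses your downloaded kubeconfig; the helper name pods_on_node is hypothetical. Which pods count as database workloads depends on your application, so the output still needs a manual read.

```shell
# Sketch: list every pod currently scheduled on one node.
# $1 = node name as shown by `kubectl get nodes`.
pods_on_node() {
  # The field selector filters server-side on the pod's assigned node.
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}
```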

Changing node size (vertical resize)

In-place vertical resize is not supported. You can't grow an existing node from, say, 2 GB to 4 GB through Cloud 66. To increase node capacity:

  1. Add new nodes at the larger size to the relevant server pool.
  2. Drain the smaller nodes one at a time.
  3. Remove the smaller nodes from the pool.

This horizontal pattern is the supported path for capacity upgrades.
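
To tell the smaller nodes apart from the new, larger ones, it can help to print each node's reported capacity. The sketch below assumes kubectl uses your downloaded kubeconfig; the helper name node_capacity and the column headings are illustrative.

```shell
# Sketch: print CPU and memory capacity for every node in the cluster.
node_capacity() {
  # custom-columns reads the values straight from each Node object's status.
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory'
}
```

Nodes showing the lower CPU or memory figures are the ones to drain and remove.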

Replication guard

If any database service using the pool has replication enabled with a minimum-server requirement of 3, you cannot scale that pool below 3 servers, regardless of which servers you pick to remove. You'll see:

You must first disable replication on all database services using this server pool

Disable replication on the affected services first, scale down, then re-enable replication.

Cordoning a node

Cordoning marks a Kubernetes node as unschedulable: existing pods keep running, but no new pods will be scheduled onto it. This is the standard prelude to draining a node, or a way to take a node out of rotation temporarily without disturbing what's already on it.

How to cordon

  • Per node: cluster page → server detail → Cordon.
  • Per pool: cluster page → pool detail → Cordon pool (cordons every server in the pool).

What Cloud 66 does

The Dashboard enqueues a Cordon "<server-name>" timeline operation, which runs kubectl cordon <node-name> against your cluster from Cloud 66's control plane. The operation has a 5-minute timeout.

Preconditions

The cluster must have a healthy control-plane manager reachable. If no healthy manager can be found, the operation fails with:

Unable to perform actions on the node as no healthy kubernetes control-plane could be found

Fix any unhealthy managers (typically by restoring SSH, restarting K3s, or replacing the manager) before retrying.
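
As a rough cross-check from your own machine, you can ask the Kubernetes API server for its readiness report. This is a sketch only: it assumes kubectl uses your downloaded kubeconfig, the helper name control_plane_readyz is hypothetical, and it tests reachability from where you run it, which may differ from what Cloud 66's control plane sees.

```shell
# Sketch: query the API server's readiness endpoint with per-check detail.
control_plane_readyz() {
  # Fails with a non-zero exit status if the API server cannot be reached.
  kubectl get --raw='/readyz?verbose'
}
```

If the command cannot connect at all, that is itself a sign the manager needs attention before you retry.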

Live node status

After the cordon completes, the Dashboard updates to show the node as cordoned — so you can verify it took effect there, or by running kubectl get nodes with your downloaded kubeconfig.
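
For a scripted check, the node's spec.unschedulable field can be read directly. The sketch below assumes kubectl uses your downloaded kubeconfig; the helper name is_cordoned is illustrative.

```shell
# Sketch: succeed (exit 0) if the node is cordoned, fail otherwise.
# $1 = node name.
is_cordoned() {
  # .spec.unschedulable is "true" on a cordoned node and empty otherwise.
  [ "$(kubectl get node "$1" -o jsonpath='{.spec.unschedulable}')" = "true" ]
}
```

For example: is_cordoned <node-name> && echo "cordoned".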

Draining a node

Draining a node evicts the workloads running on it (subject to PodDisruptionBudgets and grace periods) and cordons it in the same step. Drain when you want to take a node out of service before removing it from the pool.

How to drain

  • Per node: cluster page → server detail → Drain.
  • Per pool: cluster page → pool detail → Drain pool (drains every server in the pool independently).

What Cloud 66 does

The Dashboard enqueues a Drain "<server-name>" timeline operation, which runs kubectl drain against the node with a 30-minute timeout (the full operation can run up to ~35 minutes counting overhead).

The drain follows standard Kubernetes semantics:

  • Pods covered by a PodDisruptionBudget will only be evicted if the PDB allows it.
  • Pods without controllers (i.e. bare Pod objects, not from a Deployment/StatefulSet/etc.) can block drain.
  • DaemonSet-managed pods are skipped by default.
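
Before draining, you can look for the two usual blockers listed above. This is a sketch, assuming kubectl uses your downloaded kubeconfig and that awk is available; the helper names bare_pods_on_node and list_pdbs are hypothetical.

```shell
# Sketch: print namespace/name for pods on a node that have no owning controller.
# $1 = node name.
bare_pods_on_node() {
  # Pods without ownerReferences show "<none>" in the OWNER column.
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' \
    --no-headers | awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Sketch: list all PodDisruptionBudgets, including their allowed disruptions.
list_pdbs() {
  kubectl get pdb --all-namespaces
}
```

A PodDisruptionBudget that currently allows 0 disruptions will hold up eviction of the pods it covers.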

Preconditions

Same as cordon: a healthy control-plane manager must be reachable.

Common failures

Error message | What to check
Drain "<name>" Failed (timeline title; the detail line reads Failed to drain server <name>: <underlying error>) | The kubectl drain command surfaced an error. Most often a PDB violation or a pod that won't terminate. Open the timeline entry and inspect the detail line for the underlying error.
Drain "<name>" Timed Out | A pod refused to evict within 30 minutes. Often a stuck terminating pod or a tight PDB. For your own application pods, reduce the workload's terminationGracePeriodSeconds or temporarily widen the PDB.
Unable to perform actions on the node as no healthy kubernetes control-plane could be found | No reachable manager. Fix the manager before retrying.

If a drain times out you can also drop to kubectl directly with your downloaded kubeconfig and use kubectl drain --force --grace-period=... to override the defaults.
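
A manual drain along those lines might look like the sketch below. It assumes kubectl uses your downloaded kubeconfig. The helper name force_drain, the 30-second default grace period, and the 30m timeout are illustrative choices, not values Cloud 66 prescribes.

```shell
# Sketch: drain a node directly, overriding the default eviction behaviour.
# $1 = node name, $2 = optional grace period in seconds (defaults to 30).
force_drain() {
  # --force also removes bare pods, which nothing will recreate afterwards.
  kubectl drain "$1" \
    --ignore-daemonsets \
    --force \
    --grace-period="${2:-30}" \
    --timeout=30m
}
```

If the drain reports pods using emptyDir volumes, kubectl drain also accepts --delete-emptydir-data, at the cost of losing that local data. To return a node to service later, run kubectl uncordon <node-name>.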