CSv3 cluster operations
Overview
This guide covers four cluster-level operations on a Cloud 66 Skycap v3 (CSv3) K3s cluster:
- Adding a node — joining a fresh server to an existing cluster
- Resizing the cluster — increasing or decreasing the node count of a server pool
- Cordoning a node — marking a node unschedulable while keeping running pods in place
- Draining a node — evicting workloads from a node before removal
At present, CSv3 node management is exposed exclusively through the Cloud 66 Dashboard. There is no cx CLI command and no public REST API endpoint for add / resize / cordon / drain. If you need to script around them, you can still apply kubectl cordon / kubectl drain directly against your cluster using the kubeconfig you can download from the Dashboard, but the Cloud 66-side bookkeeping (timeline operations, scale-down deletion) only happens when triggered from the Dashboard.
All four operations are asynchronous. Triggering one creates a timeline operation that you can watch; it doesn't block the Dashboard.
Reduced vs High Availability
A CSv3 cluster runs in one of two availability modes, set when you create it and shown on the cluster page:
- Reduced Availability (RA) — your manager and worker roles share the same nodes, so the managers also run application workloads. It needs fewer servers and costs less, which suits development, testing, and smaller workloads, but it offers limited redundancy: a single shared manager is a single point of failure, and as the cluster grows the control plane competes with your app pods for resources. RA is not recommended for production.
- High Availability (HA) — three or more dedicated managers (running only the control plane) plus one or more workers. This is the production-grade topology: it survives a manager node failure and keeps control-plane work off your application nodes.
When to move from RA to HA
Move to HA when either is true:
- You need to survive a node failure. A reduced-availability cluster can't — losing the shared manager takes the whole cluster down.
- The shared managers are under pressure. Because managers also run workloads in RA, the control plane can end up competing with your app, which shows up as slow kubectl or Dashboard responses, scheduling lag, or etcd slow-write warnings. Dedicated managers remove that contention.
To upgrade, use Upgrade to High Availability on the cluster page (see Upgrading to high availability). Cloud 66 provisions the dedicated managers and migrates your workloads onto the worker nodes; it's a guided flow with some manual steps.
Adding a node
Adding a node means provisioning a new server in your cloud provider and joining it to your K3s cluster — most commonly a worker (which runs your application workloads). Adding manager nodes (the control-plane nodes — CSv3's term for what older Maestro docs call masters) works differently; see Upgrading to high availability below.
How to add a worker
Open your application in the Dashboard → cluster page → Workers tab → select the relevant server pool → click Add servers and set the new pool size.
Upgrading to high availability
A CSv3 cluster starts with a single shared manager. You don't add managers one at a time: moving to a fault-tolerant three-manager control plane is a separate Upgrade to High Availability action on the cluster page (not Scale up). It's a guided flow with both automated and manual steps; follow the prompts in the Dashboard.
What happens when you add a node
The whole flow runs as a Scale up timeline operation you can watch:
- Cloud 66 checks that no other scale operation is already running on the pool, and that the new size is compatible with your database replication requirements.
- A new server is provisioned with your cloud provider.
- Kubernetes (K3s) is installed on it.
- The new server joins your existing cluster as a worker.
- Once it reports healthy, the node is available for scheduling.
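If you want to confirm the last step from a script rather than the timeline, the following is a minimal sketch. It assumes kubectl is installed and KUBECONFIG points at the kubeconfig downloaded from the Dashboard; the function name, node name, and default timeout are illustrative, not part of Cloud 66.

```shell
# Sketch: block until a newly joined node reports Ready.
# Assumptions: kubectl is installed, KUBECONFIG points at the kubeconfig
# downloaded from the Dashboard, and the node name is a placeholder.
wait_for_node_ready() {
  node="$1"
  wait_for="${2:-300s}"   # illustrative default, not a Cloud 66 setting
  # Returns non-zero if the Ready condition is not True before the timeout
  kubectl wait --for=condition=Ready "node/$node" --timeout="$wait_for"
}

# Usage (hypothetical node name):
#   wait_for_node_ready worker-4 && echo "worker-4 is schedulable"
```

This only checks the Kubernetes side; the Scale up timeline operation remains the source of truth for whether Cloud 66 considers the operation complete.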
Common failures
If add-node fails, the timeline operation will show one of these errors:
| Error message | What it means |
|---|---|
| Cloud 66 cannot connect to at least one of your stack servers (with sudo permissions), deployment aborted, unable to continue | SSH to one of your existing servers (where we need to fetch the join token) failed. Most commonly a firewall change, key rotation, or a server already in a bad state. |
| Cloud 66 cannot create all of your required servers | The cloud provider rejected the server creation call. Quota, region availability, or credential problems. |
| Cannot fetch agent_join_token from the server (file not present) or agent_join_token on the server is empty | The existing manager isn't running K3s correctly, so the token file is missing or empty. The cluster itself is in a degraded state. Adding a node is not the fix; investigate the manager first. |
| Unable to create any servers in your cloud | Every server allocation attempt failed at the cloud provider level. |
| We have created your servers, however there was an issue installing server components. | Servers came up, but the post-install scaffolding step failed. The full underlying error is appended to this message. |
A failed scale-up does not retry automatically, and any partially created servers may need to be cleaned up before you try again. Open a support ticket if the timeline shows a half-finished scale-up.
Resizing the cluster
In CSv3, "resize" means changing the number of nodes in a server pool, not changing the size (CPU/RAM) of existing nodes.
Scale up
Same procedure and code path as Adding a node.
Scale down
Cluster page → Workers tab → select the pool → reduce the server count, or remove individual servers from the pool.
Cloud 66 marks the selected servers for removal and deletes them in a single operation. Servers running database workloads are excluded from automatic scale-down to protect data; if the servers you're removing host databases, you'll see:
Can't scale down because there are still N servers running database workloads
To remove a database-hosting server you need to first migrate or remove the database workload from it.
Changing node size (vertical resize)
In-place vertical resize is not supported. You can't grow an existing node from, say, 2 GB to 4 GB through Cloud 66. To increase node capacity:
- Add new nodes at the larger size to the relevant server pool.
- Drain the smaller nodes one at a time.
- Remove the smaller nodes from the pool.
This horizontal pattern is the supported path for capacity upgrades.
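If you script the drain step with kubectl rather than the Dashboard, the "one at a time" rule can be sketched as a loop that stops at the first failure. This is an illustration only: the function name and node names are hypothetical, the 30m timeout simply mirrors the Dashboard's drain timeout, and removing the drained nodes from the pool still has to happen in the Dashboard.

```shell
# Sketch: drain a list of nodes strictly one at a time, stopping on the
# first failure so the remaining nodes keep serving traffic.
# Assumptions: kubectl is installed and KUBECONFIG points at the kubeconfig
# downloaded from the Dashboard. Node names are placeholders.
drain_in_sequence() {
  for node in "$@"; do
    echo "draining $node"
    # --ignore-daemonsets: without it, kubectl refuses to drain a node
    # that runs DaemonSet-managed pods
    kubectl drain "$node" --ignore-daemonsets --timeout=30m || {
      echo "drain failed on $node; stopping" >&2
      return 1
    }
  done
}

# Usage (hypothetical node names):
#   drain_in_sequence small-worker-1 small-worker-2
```

Stopping at the first failure is a deliberate choice here: it avoids having several nodes half-drained at once if a PodDisruptionBudget blocks eviction.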
Replication guard
If any of your database services has replication enabled with a minimum-server requirement of 3, you cannot scale a pool below 3 servers — this applies whenever replication needs a three-server floor, regardless of which servers you pick. You'll see:
You must first disable replication on all database services using this server pool
Disable replication on the affected services first, scale down, then re-enable replication.
Cordoning a node
Cordoning marks a Kubernetes node as unschedulable: existing pods keep running, but no new pods will be scheduled onto it. This is the standard prelude to draining a node, or a way to take a node out of rotation temporarily without disturbing what's already on it.
How to cordon
- Per node: cluster page → server detail → Cordon.
- Per pool: cluster page → pool detail → Cordon pool (cordons every server in the pool).
What Cloud 66 does
The Dashboard enqueues a Cordon "<server-name>" timeline operation, which runs kubectl cordon <node-name> against your cluster from Cloud 66's control plane. The operation has a 5-minute timeout.
Preconditions
The cluster must have a healthy control-plane manager reachable. If no healthy manager can be found, the operation fails with:
Unable to perform actions on the node as no healthy kubernetes control-plane could be found
Fix any unhealthy managers (typically by restoring SSH, restarting K3s, or replacing the manager) before retrying.
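To see which managers Kubernetes itself reports as not Ready, a rough sketch follows. It assumes managers carry the node-role.kubernetes.io/control-plane label (which K3s applies to server nodes), and it is not necessarily the same health check Cloud 66 runs. If no manager is reachable at all, kubectl itself fails and the function returns 2.

```shell
# Sketch: print the names of control-plane nodes whose Ready condition
# is anything other than True.
# Assumptions: kubectl is installed, KUBECONFIG points at the kubeconfig
# downloaded from the Dashboard, and managers carry the
# node-role.kubernetes.io/control-plane label.
unready_managers() {
  nodes="$(kubectl get nodes -l node-role.kubernetes.io/control-plane \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}')" || return 2
  # Each line is "<name> <Ready status>"; keep names that are not True
  printf '%s\n' "$nodes" | awk 'NF && $2 != "True" { print $1 }'
}

# Usage:
#   unready_managers   # prints nothing when every manager is Ready
```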
Live node status
After the cordon completes, the Dashboard updates to show the node as cordoned — so you can verify it took effect there, or by running kubectl get nodes with your downloaded kubeconfig.
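For a scripted check, the following sketch reads the node's spec.unschedulable field, which Kubernetes sets to true on a cordoned node. The function name, node name, and kubeconfig path are placeholders.

```shell
# Sketch: succeed if a node is cordoned (unschedulable).
# Assumptions: kubectl is installed and KUBECONFIG points at the kubeconfig
# downloaded from the Dashboard. The node name is a placeholder.
is_cordoned() {
  node="$1"
  # .spec.unschedulable is "true" on a cordoned node and empty otherwise
  state="$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}')" || return 2
  [ "$state" = "true" ]
}

# Usage (hypothetical path and node name):
#   export KUBECONFIG="$HOME/Downloads/kubeconfig.yaml"
#   if is_cordoned worker-2; then echo "cordoned"; else echo "schedulable"; fi
```

In kubectl get nodes output, a cordoned node shows SchedulingDisabled in the STATUS column.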
Draining a node
Draining a node evicts the workloads running on it (subject to PodDisruptionBudgets and grace periods) and cordons it in the same step. Drain when you want to take a node out of service before removing it from the pool.
How to drain
- Per node: cluster page → server detail → Drain.
- Per pool: cluster page → pool detail → Drain pool (drains every server in the pool independently).
What Cloud 66 does
The Dashboard enqueues a Drain "<server-name>" timeline operation, which runs kubectl drain against the node with a 30-minute timeout (the full operation can run up to ~35 minutes counting overhead).
The drain follows standard Kubernetes semantics:
- Pods covered by a PodDisruptionBudget will only be evicted if the PDB allows it.
- Pods without controllers (i.e. bare Pod objects, not from a Deployment/StatefulSet/etc.) can block drain.
- DaemonSet-managed pods are skipped by default.
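Before draining, it can help to look for bare pods on the node. The sketch below lists pods scheduled on a node that have no owner reference; it relies on kubectl printing <none> for a missing field in custom-columns output. The function and node names are illustrative.

```shell
# Sketch: list pods on a node that have no owning controller.
# Assumptions: kubectl is installed and KUBECONFIG points at the kubeconfig
# downloaded from the Dashboard. The node name is a placeholder.
bare_pods_on_node() {
  node="$1"
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=$node" \
    --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' |
    # Pods without an ownerReference show "<none>" in the third column
    awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Usage (hypothetical node name):
#   bare_pods_on_node worker-2
#   kubectl get pdb --all-namespaces   # review budgets that may limit eviction
```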
Preconditions
Same as cordon: a healthy control-plane manager must be reachable.
Common failures
| Error message | What to check |
|---|---|
| Drain "<name>" Failed (timeline title; the detail line reads Failed to drain server <name>: <underlying error>) | The kubectl drain command surfaced an error. Most often a PDB violation or a pod that won't terminate. Open the timeline entry and inspect the detail line for the underlying error. |
| Drain "<name>" Timed Out | A pod refused to evict within 30 minutes. Often a stuck terminating pod or a tight PDB. For your own application pods, reduce the workload's terminationGracePeriodSeconds or temporarily widen the PDB. |
| Unable to perform actions on the node as no healthy kubernetes control-plane could be found | No reachable manager. Fix the manager before retrying. |
If a drain times out you can also drop to kubectl directly with your downloaded kubeconfig and use kubectl drain --force --grace-period=... to override the defaults.
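A manual drain along those lines might look like the sketch below. The grace period and timeout values are placeholders chosen for illustration, not Cloud 66 defaults. Note that --force also deletes pods that are not managed by a controller, and those pods are not recreated elsewhere.

```shell
# Sketch: drain a node directly with kubectl, overriding the grace period.
# Assumptions: kubectl is installed and KUBECONFIG points at the kubeconfig
# downloaded from the Dashboard. The node name and values are placeholders.
drain_node() {
  node="$1"
  grace="${2:-30}"   # seconds each pod gets to terminate (illustrative default)
  # --ignore-daemonsets: needed when the node runs DaemonSet-managed pods
  # --force: also deletes bare pods, which will NOT be rescheduled
  kubectl drain "$node" \
    --ignore-daemonsets \
    --force \
    --grace-period="$grace" \
    --timeout=10m
}

# Usage (hypothetical node name):
#   drain_node worker-2 15
```

Because this bypasses the Dashboard, no timeline operation is recorded for the drain.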
Related
- Database replication — replication requirements that affect scale-down
- Kubernetes documentation — Safely drain a node — upstream reference for drain semantics