Migrating a node from Proxmox to Talos

The goal for the new setup is an Omni-controlled, bare-metal Talos stack.

https://www.talos.dev/v1.7/talos-guides/install/omni/

Reclaiming PVE node in k3s cluster

Removing VM from Cluster

First, list the nodes in the cluster:
thaynes@HaynesHyperion:~$ kubectl get nodes
NAME      STATUS   ROLES                       AGE   VERSION
kubem01   Ready    control-plane,etcd,master   65d   v1.30.3+k3s1
kubem02   Ready    control-plane,etcd,master   65d   v1.30.3+k3s1
kubem03   Ready    control-plane,etcd,master   65d   v1.30.3+k3s1
kubew01   Ready    <none>                      65d   v1.30.3+k3s1
kubew02   Ready    <none>                      65d   v1.30.3+k3s1
kubew03   Ready    <none>                      65d   v1.30.3+k3s1
kubew04   Ready    <none>                      65d   v1.30.3+k3s1

Find the node you want and drain it with:

kubectl drain kubew04 --ignore-daemonsets --delete-emptydir-data

Then just delete it:

kubectl delete node kubew04

And now it's gone!

thaynes@HaynesHyperion:~$ kubectl get nodes
NAME      STATUS   ROLES                       AGE   VERSION
kubem01   Ready    control-plane,etcd,master   65d   v1.30.3+k3s1
kubem02   Ready    control-plane,etcd,master   65d   v1.30.3+k3s1
kubem03   Ready    control-plane,etcd,master   65d   v1.30.3+k3s1
kubew01   Ready    <none>                      65d   v1.30.3+k3s1
kubew02   Ready    <none>                      65d   v1.30.3+k3s1
kubew03   Ready    <none>                      65d   v1.30.3+k3s1

Reclaim Nodes from PVE

Now that k3s isn't relying on the node, we can shut down or delete that VM. Then take the node out of the HA config and migrate all HA VMs off to other nodes.
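
A rough CLI sketch of that, assuming the kubew04 VM was ID 104 and another HA VM 105 needs to move to pve01 (both IDs hypothetical):

ha-manager remove vm:104        # drop the VM from HA before deleting it
qm stop 104
qm destroy 104 --purge          # --purge also removes it from backup/replication jobs
qm migrate 105 pve01 --online   # repeat for each HA VM still on the node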

NOTE: I also have a load balancer in front of the Proxmox UI, so I'll clean up that config to drop this node as well.
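
If your load balancer is something HAProxy-shaped, that's just deleting the retired node's server line from the backend (backend name and addresses below are hypothetical):

backend proxmox_ui
    server pve01 192.168.1.11:8006 check ssl verify none
    server pve02 192.168.1.12:8006 check ssl verify none
    # server pve05 192.168.1.15:8006 check ssl verify none   <- remove this entry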

Ceph

  1. Set OSDs to "out" and wait for Ceph to rebalance
  2. Destroy MGR, MON, and MDS from the UI
  3. Once OSDs are empty destroy them
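
From a shell, the same flow looks roughly like this, assuming the node's Ceph services are named pve05 and it hosts osd.4 (substitute your own IDs and names):

ceph osd out osd.4               # stop placing data on this OSD
ceph -s                          # wait here until rebalancing finishes
pveceph mgr destroy pve05
pveceph mon destroy pve05
pveceph mds destroy pve05        # only if you run CephFS
ceph osd safe-to-destroy osd.4   # confirm the OSD is empty
systemctl stop ceph-osd@4        # on the node being removed
pveceph osd destroy 4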

PVE

Once nothing is running on the node, we can follow these steps to remove it from Proxmox.

Delete the node with:

pvecm delnode <NODE>
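
In my case that's:

pvecm delnode pve05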

Then clean it out here:

root@pve01:~# cd /etc/pve/nodes
root@pve01:/etc/pve/nodes# rm -R pve05/

You can also clean up the removed node itself by tearing down its cluster config:

systemctl stop pve-cluster corosync   # stop the cluster services
pmxcfs -l                             # mount /etc/pve in local mode
rm /etc/corosync/*                    # remove the corosync config
rm /etc/pve/corosync.conf
killall pmxcfs
systemctl start pve-cluster           # start back up as a standalone node

And rm -R /etc/pve/nodes, but I didn't get that far.

Update MS-01 BIOS

While I'm at it, there's a BIOS update to apply.

First, upgrade the BIOS for the MS-01 - tutorial - download.

Clean Drives

Boot into GParted Live and delete any data on the drives we will be using. This is especially important for Ceph, which is picky about leftover data, and fixing up drives from inside Talos is hard.
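
If you'd rather not boot GParted, wiping the disk signatures from any Linux live shell does the same job; a minimal sketch, assuming the disk is /dev/nvme0n1 as below (verify with lsblk first):

wipefs --all /dev/nvme0n1       # clear filesystem/RAID/LVM signatures
sgdisk --zap-all /dev/nvme0n1   # destroy the GPT and MBR partition tables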

Once you add the node to the cluster, but before configuring Ceph, run this to wipe the partition table:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-wipe
  namespace: rook-ceph
spec:
  restartPolicy: Never
  nodeName: talosm01
  containers:
  - name: disk-wipe
    image: busybox
    securityContext:
      privileged: true
    command: ["/bin/sh", "-c", "dd if=/dev/zero bs=1M count=100 oflag=direct of=/dev/nvme0n1"]
EOF
pod/disk-wipe created

$ kubectl wait --timeout=900s --for=jsonpath='{.status.phase}=Succeeded' -n rook-ceph pod disk-wipe
pod/disk-wipe condition met

$ kubectl delete pod -n rook-ceph disk-wipe
pod "disk-wipe" deleted