
Reboot hangs for UKI images #2384

Closed
vipsharm opened this issue Mar 23, 2024 · 21 comments
Labels: bug, prio: high

@vipsharm (Collaborator)

Kairos version:
3.0.1

CPU architecture, OS, and Version:
Ubuntu 23.10

After the cluster is set up, rebooting the OS hangs in the QEMU VM.

Attached videos: IMG_1572.MOV, IMG_1571.MOV
@vipsharm added the bug, triage, and unconfirmed labels on Mar 23, 2024
@mudler removed the triage label on Mar 25, 2024
@Itxaka (Member) commented Mar 25, 2024

With Longhorn installed and rebooting:

[screenshot]

@Itxaka (Member) commented Mar 25, 2024

It took a while, but it restarted and then everything came up properly after a few minutes.

[screenshot]

@mudler (Member) commented Mar 27, 2024

@vipsharm as we can't reproduce this, can you share the ISO you are using so we can try to reproduce with it? Also, did you give the machine enough RAM/CPU resources?

@mudler added the question label on Mar 27, 2024
@jimmykarily (Contributor) commented Mar 29, 2024

It turns out the state partition gets filled up. In my case it was already full after the upgrade, before Kubernetes even started (one needs to hold ScrLk while the errors scroll by to take a screenshot :D):

[screenshot]

I increased the size of the state partition with:

#cloud-config
stylus:
  site:
    name: dimitris-edge-host-9
    edgeHostToken: <reducted>
    paletteEndpoint: api.dev.spectrocloud.com

install:
  device: "/dev/vda"
  auto: true
  partitions:
    state:
      size: 5000
      fs: ext4
stages:
  initramfs:
    - users:
        kairos:
          groups:
            - sudo
          passwd: kairos

This allowed me to boot and see the Pods coming up. After everything was running, I did a cold reboot and waited until it came back up. Then the space issue returned:

[screenshot]

I think containerd somehow ends up writing the container filesystems to the state partition, which fills it up. Maybe there is some directory we should be mounting elsewhere and we don't?
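For anyone debugging the same thing, here is a minimal Go sketch (assuming a Linux system with /proc mounted; this is not part of immucore) that lists the mount points under /usr/local, to check whether .state is really a separate mount or just a directory sitting on the filesystem that backs /usr/local:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Print every mount point under /usr/local by parsing /proc/self/mountinfo.
// If /usr/local/.state does not show up here, containerd's data is living
// directly on whatever filesystem backs /usr/local.
func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 {
			continue
		}
		mountPoint := fields[4] // the 5th mountinfo field is the mount point
		if strings.HasPrefix(mountPoint, "/usr/local") {
			fmt.Println(mountPoint)
		}
	}
}
```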

@jimmykarily (Contributor)

The last thing printed before the "No space left on device" errors started, afaict, was "simple directory" (about /usr), from here:

https://github.com/kairos-io/immucore/blob/8a142fe41f046098eb5676e055f50de02d342c13/pkg/state/steps_uki.go#L402

Given the errors refer to directories in /sysroot/usr, I guess the copying here is what fails: https://github.com/kairos-io/immucore/blob/8a142fe41f046098eb5676e055f50de02d342c13/pkg/state/steps_uki.go#L410

@jimmykarily (Contributor)

We only check whether the directory is a mount point or not at the top level (in this case /usr). What happens if a subdirectory is a mount point? cp -a will happily copy the whole mounted dir. Is that the case here as well? Is /sysroot/usr/local/.state/ where the COS_STATE partition gets mounted?
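To illustrate why a top-level-only check misses this, here is a hedged Go sketch (the helper name is made up, not the immucore function): it flags a directory as a mount point by comparing its device ID with its parent's. On a system where /usr is a plain directory, as in the logs above, this reports false for /usr but true for /usr/local/.state, so a check on /usr alone never sees the nested mount that cp -a then copies.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// isMountpoint reports whether path sits on a different device than its
// parent directory, i.e. whether a filesystem is mounted there.
// Hypothetical helper for illustration only.
func isMountpoint(path string) (bool, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	parent, err := os.Stat(filepath.Dir(path))
	if err != nil {
		return false, err
	}
	return fi.Sys().(*syscall.Stat_t).Dev != parent.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
	// A top-level check sees /usr as a plain directory and copies it wholesale,
	// even though /usr/local/.state (COS_STATE) is a mount hiding underneath.
	for _, d := range []string{"/usr", "/usr/local/.state"} {
		mp, err := isMountpoint(d)
		fmt.Println(d, mp, err)
	}
}
```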

@jimmykarily (Contributor)

What was the reason we only scanned top-level directories again? (https://github.com/kairos-io/immucore/blob/8a142fe41f046098eb5676e055f50de02d342c13/pkg/state/steps_uki.go#L379) @Itxaka, do you remember?

@Itxaka (Member) commented Apr 1, 2024

We didn't have anything in submounts that is private, no? So any submounts in the subdirs of the root dirs would be propagated when moving the mountpoints?

Something like that, I think.

@jimmykarily (Contributor)

> We didn't have anything in submounts that is private, no? So any submounts in the subdirs of the root dirs would be propagated when moving the mountpoints?
>
> Something like that, I think.

Maybe it's the order of things then? Maybe it starts copying /usr before it moves any mounts?

@Itxaka (Member) commented Apr 1, 2024

Could it be that your issue comes from RAM? After all, the new sysroot is mounted as tmpfs, so if the VM has low memory or the image is big it can lead to RAM exhaustion, and that would cause the copy to fail.

I just wonder why we aren't seeing the same issue anywhere else when testing.

@jimmykarily (Contributor) commented Apr 1, 2024

I thought the same and increased my VM's RAM to 15GB, but the error is still there. We probably don't see the error in testing because we are not running much on Kubernetes. If you look at the screenshot with the errors, it's writing complete container filesystems there. That could easily get huge.

Given we are only trying to "fake" a chroot environment, I don't think it was ever the intention to copy/duplicate such huge files. Also, the directory they reside in (/sysroot/usr/local/.state) should probably be a mount itself, not copied over.
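As a sketch of what "a mount itself, not copied over" could look like, the snippet below bind-mounts the state directory into the new root using golang.org/x/sys/unix. The paths are the ones discussed above, but this is only an assumed direction for a fix, not immucore's code:

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// Instead of letting cp -a duplicate everything under /usr/local/.state into
// the tmpfs sysroot, bind-mount the state directory into the new root.
func main() {
	target := "/sysroot/usr/local/.state"
	if err := os.MkdirAll(target, 0o755); err != nil {
		log.Fatal(err)
	}
	// MS_REC also carries along any submounts containerd may have created.
	if err := unix.Mount("/usr/local/.state", target, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
		log.Fatal(err)
	}
}
```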

@jimmykarily (Contributor)

I tried to clean up my cluster before rebooting by removing Helm releases and deployments, to see whether removing as many containers as I could would avoid the failure. Either I didn't delete enough containers or it doesn't make any difference, but the error is still there after reboot.

@jimmykarily (Contributor)

I increased the VM's RAM to 30GB and, after taking quite a bit more time, it eventually booted with no disk errors. So this confirms that the problem is the huge amount of data we are copying to the in-memory filesystem.

@Itxaka (Member) commented Apr 1, 2024

blergh.

We tried to make it really simple, but then it didn't work. So maybe we need to rework it and mount things directly under the new fake sysroot to avoid all the copying?

Currently:

  • / is the sysroot
  • we do everything on it (agent stages, mounts, etc.)
  • we create the fake /sysroot and move everything in there
  • boot

We probably should do:

  • /sysroot is the sysroot from the start (we start on /, mount everything to /sysroot, and chroot into it)
  • then run everything under the new fake sysroot as usual

The only reason we needed to move from / to /sysroot was that / was of type rootfs, and that broke the Kubernetes thingie.
We created /sysroot as tmpfs and remounted everything in there as a workaround, and that made containerd and such behave.

So now we would need to rework it to be on the tmpfs from scratch: keep a minimal / rootfs, move /proc, /dev, /run and such into /sysroot, and chroot into it from the first immucore step, so we run everything in the proper final system.

We will still need to do the copy there, BUT we do it before all the mounting of state and such, so it's simpler as it's basically a 1-to-1 copy with no extra mounts in there.
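A rough Go sketch of that proposed order of operations (tmpfs root first, copy, move the API filesystems, chroot, then continue), using golang.org/x/sys/unix; the tmpfs size and the exact list of moved mounts are illustrative assumptions, not what immucore does today:

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// 1. Create the final root on a tmpfs before anything else runs.
	if err := os.MkdirAll("/sysroot", 0o755); err != nil {
		log.Fatal(err)
	}
	if err := unix.Mount("tmpfs", "/sysroot", "tmpfs", 0, "size=50%"); err != nil {
		log.Fatal(err)
	}

	// 2. The minimal 1-to-1 copy of the rootfs would happen here, before any
	//    state/persistent partitions are mounted, so nothing big is duplicated.

	// 3. Move the API filesystems into the new root (assumes they are mounted
	//    private; MS_MOVE is refused for shared mounts).
	for _, d := range []string{"/proc", "/sys", "/dev", "/run"} {
		if err := os.MkdirAll("/sysroot"+d, 0o755); err != nil {
			log.Fatal(err)
		}
		if err := unix.Mount(d, "/sysroot"+d, "", unix.MS_MOVE, ""); err != nil {
			log.Fatal(err)
		}
	}

	// 4. Chroot so every later step (agent stages, state mounts) already runs
	//    inside the real sysroot.
	if err := unix.Chroot("/sysroot"); err != nil {
		log.Fatal(err)
	}
	if err := unix.Chdir("/"); err != nil {
		log.Fatal(err)
	}
}
```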

@Itxaka (Member) commented Apr 1, 2024

I think it's easier like that, doing all the preparation before the copying. Another approach would be to make the cp/mount function recursive... or to use rsync or something smarter for the copying.

@jimmykarily (Contributor)

If it's possible to work on the tmpfs from the beginning, I agree it sounds like the better option. The rest sounds complicated and might result in different problems down the line.

@mudler changed the title from "Reboot in QEMU VM hangs for UKI images" to "Reboot hangs for UKI images" on Apr 2, 2024
@mudler (Member) commented Apr 2, 2024

This seems to be the case on real HW as well (not only affecting VMs).

@jimmykarily removed the unconfirmed and question labels on Apr 2, 2024
@mudler (Member) commented Apr 3, 2024

kairos-io/immucore#271

@antongisli (Contributor)

Does this issue only occur on VMs?

@jimmykarily (Contributor) commented Apr 4, 2024

> Does this issue only occur on VMs?

No, it shouldn't matter whether it's a VM or not. It's now fixed on master.

@mudler mentioned this issue on Apr 4, 2024
@mudler (Member) commented Apr 4, 2024

Closing now, as this is fixed in master and will be part of the upcoming 3.0.4 release (see #2428).

@mudler closed this as completed on Apr 4, 2024
@mudler mentioned this issue on Apr 4, 2024