Oh, and mailing lists are a bliss to use compared to (barely loading) forges, at least to me, and especially with public-inbox and tools like b4 and lei for reviewing, managing, and applying patches. On the sending side it's basically a git send-email command to pve-devel@lists.proxmox.com; see https://git-send-email.io for a simple tutorial.
But as for the discussion of how much time QEMU spends "in grub" and "probing legacy devices": maybe my use case is different, but my VMs aren't constantly being rebooted, and once a VM is up it runs at near-native speed, so...
[1] https://docs.redhat.com/en/documentation/red_hat_enterprise_...
The killer feature still missing from microVMs for me is the ability to enable CUDA support without passing through the entire GPU. vfio is just too much of a pain and too limiting. Sometimes I want to use my GPU on the host. Vulkan works fairly well with virtio-gpu and Venus, but I need CUDA. Venus is also still missing some important things like accelerated video encoding.
CUDA support looks possible based on a cursory dive. I'll keep you posted on it.
ref: https://thevirtualhorizon.com/2024/05/31/how-to-configure-th...
I’ve also been wanting a setup like this but don’t have the courage to use pve-microvm. First-class microVM support would be very nice.
We’ve been on a similar journey, but came at it from the opposite direction. We started SlicerVM in 2022 after seeing how slow Multipass felt when launching more than one Linux VM, even though it is relatively lean. Tearing them down was even slower. We got both down to seconds for a 30-node cluster and kept it internal until August last year.
With Slicer, microVMs are the native primitive: API launch, guest-agent exec/shell/cp/forward workflows, isolated networking, and agent sandboxes are built into the control plane.
That was not our first use case. Back then we were standing up Kubernetes clusters quickly for OpenFaaS e2e testing and customer scale-out support across multiple machines. The agent/sandbox workflows came naturally after that.
We do see people come over from Proxmox when they want something more directly driven from code, especially with a deeper guest-agent model: exec, file copy, port forwarding, fs watches, etc. When you string it all together it becomes very powerful. It's what we've gradually dogfooded for our code review bot, which started out using SSH/SFTP and now uses a completely native SDK (Go/TS).
One thing I’d separate in the benchmarks is in-guest boot time vs. actual time-to-interactive/useful. For agent-style workloads, the number that tends to matter is: API request made -> VM created/cloned -> network policy applied -> guest agent reachable -> exec/shell/cp/forward works. Snapshot cloning, network device setup, and control-plane readiness all show up there.
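To make that split measurable, each stage can be timed separately instead of reporting one boot number. A minimal Python sketch of the idea; the stage names and the sleeps standing in for real work are placeholders, not any product's API:

```python
import time
from typing import Callable, Dict, List, Tuple

def measure_tti(stages: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run each stage in order and record its wall-clock duration in seconds.

    The 'total' entry is the full time-to-interactive across all stages.
    """
    timings: Dict[str, float] = {}
    start = time.perf_counter()
    for name, run in stages:
        t0 = time.perf_counter()
        run()  # must block until this stage is actually complete
        timings[name] = time.perf_counter() - t0
    timings["total"] = time.perf_counter() - start
    return timings

# Hypothetical stages; replace the sleeps with real calls to your control plane.
stages = [
    ("vm_created", lambda: time.sleep(0.02)),
    ("network_policy_applied", lambda: time.sleep(0.01)),
    ("guest_agent_reachable", lambda: time.sleep(0.03)),
    ("first_exec_ok", lambda: time.sleep(0.01)),
]

if __name__ == "__main__":
    for name, seconds in measure_tti(stages).items():
        print(f"{name}: {seconds * 1000:.1f} ms")
```

Reporting the per-stage numbers next to the total makes it obvious whether the guest boot or the control plane dominates.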
TTI can also be moved around depending on tradeoffs: no real init system, snapshot resume, CrosVM-style lower-level primitives, or a VMM built for one narrow job. We use systemd in the guest, so we’re intentionally carrying some weight there.
I also liked that you retained module support for Docker. Supporting Docker, Kubernetes-ish workloads, and eBPF tends to add a lot of useful weight back in.
There’s room for several tools here. The space is moving quickly, and I’m looking forward to seeing which approaches consolidate.
If folks are looking to scratch that microVM itch, or want a programmable / bash / agent / SDK-driven primitive, you're welcome to check us out and join the Discord.
Shame you did not mention once in your long post that you are based on Firecracker, because I'm sure I'm not the first who was about to post "why is this better than Firecracker".
Also it is a shame you've adopted the subscription billing model instead of allowing people to buy perpetual licenses.
I dislike the subscription model in a pure sense, but I also dislike the "but it's 'only' $x a month" argument oft-used by developers. Sure, in theory that's the case. But like everyone else in the world, I also have $x a month of other monthly expenses in my life, and I simply do not need or want N+1 software subscriptions. It all adds up.
The same applies to business environments, except the cost multiplies even faster because you have (X-employees * N-subscriptions)/month.
Given some similarities, I’d like to briefly mention `krun` here. Although it’s an OCI-compatible container runtime, it uses MicroVMs with a similar approach. Perhaps we can exchange ideas here? I recall that GPU passthrough is also a recurring topic there.
I'm also a bit confused about how to use libkrun. It seems to be implemented in Rust but provides a C API. Can it be used in Rust projects?
Also, it made me curious if it would be possible to create a Linux distribution where every process runs in a microvm.
In my own microVM experiments I’ve actually managed to get the machine to boot from a plain folder (some virtiofs setup, I can look around if anyone’s interested, but there should be more documentation about it now) - I find that pretty awesome.
anyway, the author posted the sources on GitHub and got in touch with the Proxmox people, maybe they want to absorb that into the product (which would be very very cool).
back when I used to use Cursor I built this MCP, but it should work for Codex or Claude
it lets me easily spin up vms with specs
it's tough to create boxes now due to RAM prices but I got mine at a great time when it was very cheap; i just wish i had bought more then
These are my impressions.
First of all it's a very competent product, mainly thanks to Ceph making it HCI. Without Ceph, I'm not sure what we would do.
It's as effective as you design it to be: make sure to separate storage and cluster traffic for robustness and speed, and use at least a 10GbE switch for storage so migrations are fast.
And managing Ceph is very important. It basically boils down to 1) never let it run out of space, and 2) the more devices you have, the easier it is to manage.
Automating against Proxmox definitely is the biggest pain point, and this needs the most work done.
I've spent countless hours, pre-AI, building our automation setup using both Terraform and Ansible. I sort of wish I had tried AI earlier because it does make things easier.
Some things, like automating the creation of templates, will forever be a complex procedure in Ansible. And I abandoned Terraform completely because the API was too unpredictable for Terraform's strict state; Ansible was a much better fit.
Their AuthZ takes some getting used to: if you select "Privilege Separation", it counts the user's permissions AND the token's permissions, and the token's permissions can never exceed the user's.
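Put differently, a privilege-separated token can only do what both the user and the token were granted, i.e. the intersection of the two sets. A toy Python model of that rule (not Proxmox code; it ignores per-path ACLs, and the privilege sets assigned here are made up for illustration):

```python
from typing import FrozenSet, Iterable

def effective_privileges(user_privs: Iterable[str],
                         token_privs: Iterable[str],
                         privilege_separation: bool = True) -> FrozenSet[str]:
    """Model of how an API token's effective privileges are derived.

    With privilege separation the token gets only what BOTH the user and
    the token were granted; without it the token inherits the user's set.
    """
    user = frozenset(user_privs)
    if not privilege_separation:
        return user
    return user & frozenset(token_privs)

# Illustrative assignments only.
user = {"VM.Audit", "VM.PowerMgmt", "VM.Snapshot"}
token = {"VM.Audit", "VM.Allocate"}  # VM.Allocate is not held by the user

# VM.Allocate is dropped because the user lacks it.
print(sorted(effective_privileges(user, token)))  # -> ['VM.Audit']
```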
Templates exist on a single node but still take a VM ID that is unique across the cluster, which is also a bit confusing. It means in practice we're always deploying VMs on the same node, before migrating them somewhere else.
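That turns into a two-step flow: clone on the node that holds the template, then migrate the new VM to the node that should run it. A rough Python sketch that only builds the `qm` command lines; it assumes it runs on the template's node, and the IDs, VM name, and node name are made-up placeholders:

```python
from typing import List

def clone_then_migrate_cmds(template_id: int, new_id: int, name: str,
                            target_node: str) -> List[List[str]]:
    """Return the qm invocations to clone a template locally, then migrate."""
    return [
        # Full clone so the new VM does not depend on the template's disks.
        ["qm", "clone", str(template_id), str(new_id), "--name", name, "--full", "1"],
        # Move the (still stopped) clone to the node that should run it.
        ["qm", "migrate", str(new_id), target_node],
    ]

# Placeholder values for illustration only.
for cmd in clone_then_migrate_cmds(9000, 123, "web-01", "pve2"):
    print(" ".join(cmd))
```

On a real node each list could be handed to `subprocess.run(cmd, check=True)`.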
I haven't touched backups with Ansible yet.
The backup restore and the VM startup are done through Ansible > PVE CLI.
I also have a testing VM with a "CLEAN" snapshot that I restore multiple times a day, using Ansible > PVE CLI. Once the VM snapshot is restored I turn it back on as well using the PVE CLI.
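For anyone who wants the same reset loop without Ansible, here is a small Python sketch around `qm rollback` and `qm start`. The VMID is a placeholder, and the function defaults to a dry run so nothing is touched by accident:

```python
import subprocess
from typing import List

def reset_vm_cmds(vmid: int, snapshot: str = "CLEAN") -> List[List[str]]:
    """Commands to roll a VM back to a snapshot and power it on again."""
    return [
        ["qm", "rollback", str(vmid), snapshot],
        ["qm", "start", str(vmid)],
    ]

def reset_vm(vmid: int, snapshot: str = "CLEAN", dry_run: bool = True) -> None:
    """Print the commands, or run them on a PVE node when dry_run is False."""
    for cmd in reset_vm_cmds(vmid, snapshot):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if qm exits non-zero

reset_vm(100)  # VMID 100 is a placeholder; dry run by default
```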
I would love to use this in production, but don't know how much it can break things. Proxmox should just implement this in mainline.
Also replacing network access with AF_VSOCK is actually interesting if you want to simplify bring-up. SSH does some magic with vsocks these days too.
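For a feel of what that looks like from the host side, Python's standard `socket` module exposes `AF_VSOCK` on Linux. A minimal sketch, assuming something inside the guest is already listening on the chosen vsock port (the helper names are mine, and the CID and port you pass are up to your setup):

```python
import socket

def vsock_available() -> bool:
    """True if this Python build exposes AF_VSOCK (Linux only)."""
    return hasattr(socket, "AF_VSOCK")

def vsock_send(cid: int, port: int, payload: bytes, timeout: float = 5.0) -> bytes:
    """Connect to (cid, port) over vsock, send payload, return one reply chunk."""
    if not vsock_available():
        raise OSError("AF_VSOCK is not available on this platform")
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect((cid, port))  # vsock addresses are (CID, port) tuples
        sock.sendall(payload)
        return sock.recv(4096)
```

Something like `vsock_send(3, 5000, b"ping\n")` would then talk to a guest with CID 3 without any IP configuration on either side; both numbers are placeholders.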
Creating a single VM and running VMs inside that VM (the performance hit would be negligible for the orchestration work of agents) might offer an alternative that doesn't require customizing Proxmox as much?
Just be careful with the virtualized file systems to not create write amplification issues.
(inb4 "Google it" -- I'd really appreciate a recommendation from a human, and not just a random blog post that might well be slop.)