Discussion:
DRBD inside KVM virtual machine
Nick Morrison
14 years ago
Hello,

My apologies if this is a frequently-asked question, but I hope it can be answered easily. I'm a bit new to this, so forgive my n00bness.

I have two physical servers, reasonably specced. I have a 250GB LVM volume spare on each physical server (/dev/foo/storage). I would like to build a KVM/QEMU virtual machine on each physical server, connect /dev/foo/storage to the virtual machines, and run DRBD inside the two VM guests.
From there, I plan to run heartbeat/pacemaker to provide a HA/failover NFS server to other VM guests residing on the same physical servers.
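
For reference, attaching the host LV to a guest with libvirt would look roughly like the snippet below; the target name and cache mode are only illustrative, not taken from an actual config:

  <disk type='block' device='disk'>
    <!-- cache='none' avoids caching the data twice (guest + host) -->
    <driver name='qemu' type='raw' cache='none'/>
    <source dev='/dev/foo/storage'/>
    <target dev='vdb' bus='virtio'/>
  </disk>
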
Rationale:

I started this project by doing the DRBD and heartbeat/pacemaker/NFS on the physical machines, and nfs-mounting a folder containing the VM guests' hard disk .img files, but I ran into problems when I tried switching primary/secondary and moving the NFS server: under some circumstances, I couldn't unmount my /dev/drbd0, because the kernel said something still had it locked (even though the NFS server had supposedly been killed). I assume this is a complication of mounting an NFS share on the same server that exports it. So: I decided to think about doing the NFS serving from inside a KVM guest instead.
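
For what it's worth, the failover sequence I was attempting looked roughly like this (the mount point and resource name are just examples):

  # what is still holding the mount / the drbd device?
  fuser -vm /srv/nfs            # or: lsof /dev/drbd0
  exportfs -ua                  # stop exporting before unmounting
  umount /srv/nfs
  drbdadm secondary r0          # fails if anything still holds /dev/drbd0 open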

I've also toyed with OCFS2 and Gluster; I thought perhaps doing an active/passive DRBD (+NFS server) would create less risk of split-brain.

Am I mad? Should it work? Will performance suck compared with running DRBD directly on the physical machines? I understand I will probably have high CPU usage during DRBD syncing, as QEMU's IO (even virtio) will probably load up the CPU, but perhaps this will be minimal, or perhaps I can configure QEMU to let the VM guest talk very directly to the physical host's block device..

Your thoughts are welcomed!

Cheers,
Nick
Andreas Hofmeister
14 years ago
Post by Nick Morrison
Am I mad? Should it work?
It does. We run DRBD + Pcmk + OCFS2 for testing. I would not use such a
setup in production though.
Post by Nick Morrison
Will performance suck compared with running DRBD directly on the physical machines? I understand I will probably have high CPU usage during DRBD syncing, as QEMU's IO (even virtio) will probably load up the CPU, but perhaps this will be minimal,
You will see noticeably higher latencies with both disk and network.
"macvtap" may help a bit with the latter; without it, your network
throughput would be limited too.
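
A macvtap interface in libvirt is only a few lines; the physical NIC
name here is an example:

  <interface type='direct'>
    <source dev='eth1' mode='bridge'/>
    <model type='virtio'/>
  </interface>

Note that in bridge mode the guest cannot talk to its own host over
that interface, which does not matter for DRBD traffic between guests
on different hosts.
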
Post by Nick Morrison
or perhaps I can configure QEMU to let the VM guest talk very directly to the physical host's block device..
PCI Passthrough is somewhat problematic. Some
Chipset/Board/BIOS/Kernel/PCIe-Card/Driver combinations may even work,
but there is a good chance that you will see sucking performance and/or
unexplainable crashes. Unless you have a LOT of time to track these
problems, do not use PCI Passthrough.

Ciao
Andi
Florian Haas
14 years ago
Post by Andreas Hofmeister
Post by Nick Morrison
Am I mad? Should it work?
It does. We run DRBD + Pcmk + OCFS2 for testing. I would not use such a
setup in production though.
Agree.
Post by Andreas Hofmeister
Post by Nick Morrison
Will performance suck compared with running DRBD directly on the
physical machines? I understand I will probably have high CPU usage
during DRBD syncing, as QEMU's IO (even virtio) will probably load up
the CPU, but perhaps this will be minimal,
I've never seen a DRBD box, physical or virtual, being CPU bound while
syncing. If you're verifying or using data-integrity-alg with an
expensive shash algorithm then perhaps, but it's much more likely the
machine would still be I/O bound.
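
For context, that would be something like the following in drbd.conf
(8.3-style sections assumed; the resource name is an example, and
crc32c is far cheaper than sha1):

  resource r0 {
    syncer {
      verify-alg crc32c;          # used by "drbdadm verify"
    }
    net {
      data-integrity-alg crc32c;  # per-packet checksum, CPU cost on every write
    }
  }
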
Post by Andreas Hofmeister
PCI Passthrough is somewhat problematic. Some
Chipset/Board/BIOS/Kernel/PCIe-Card/Driver combinations may even work,
but there is a good chance that you will see sucking performance and/or
unexplainable crashes. Unless you have a LOT of time to track these
problems, do not use PCI Passthrough.
SCSI pass-through may be an option (a little-known Qemu/KVM feature).
Unlikely to be much better than virtio, though.
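
If anyone wants to experiment: I believe this corresponds to libvirt's
"lun" disk type, roughly as below, but it needs a real SCSI device
behind it (e.g. an iSCSI or SAS LUN), so it would not apply to a plain
LV such as /dev/foo/storage:

  <disk type='block' device='lun'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/sdb'/>        <!-- must be an actual SCSI LUN -->
    <target dev='sda' bus='scsi'/>
  </disk>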

Hope this helps,
Florian
--
Need help with DRBD?
http://www.hastexo.com/knowledge/drbd
Nick Morrison
14 years ago
Post by Florian Haas
Post by Andreas Hofmeister
Post by Nick Morrison
Am I mad? Should it work?
It does. We run DRBD + Pcmk + OCFS2 for testing. I would not use such a
setup in production though.
Agree.
Thanks for your feedback.

FYI, I set it up to test it out, and it seemed to work. The concerning part (aside from your advice not to use this architecture in production) was the CPU load during sync: the sync itself satisfyingly ran at a little over 900,000,000 bps (according to tcpstat), but the CPU utilisation of the associated kvm process shot up to 150-200% (on a 4-CPU VM) and stayed there for the duration of the synchronisation. I suppose this is a QEMU/virtio/kvm IO mapping thing. Perhaps SCSI passthrough would help here.
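
One thing I may try (assuming a DRBD 8.3-style config; the number is arbitrary) is capping the resync rate so a full sync doesn't hammer the virtio path quite so hard:

  resource r0 {
    syncer {
      rate 40M;   # throttle background resync to roughly 40 MB/s
    }
  }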

But based on your advice, I'll take another look at OCFS2 and build a test environment inside the VMs, with the aim of installing it on the physical hosts for production. I will do everything possible to avoid split-brain.

Is there any easy and well-defined way to set up an HA shared storage system with just two hosts, without using DRBD active/active? NFS server+client on the same machine is problematic, but perhaps there's another method I haven't thought of.

Cheers!

N
Nick Morrison
14 years ago
Post by Nick Morrison
Is there any easy and well-defined way to set up an HA shared storage system with just two hosts, without using DRBD active/active? NFS server+client on the same machine is problematic, but perhaps there's another method I haven't thought of.
Maybe I could use LXC to separate the NFS server and client :-)

(Specifically I'd like to run, on my two hosts, DRBD/??? to provide shared storage, and mount this shared storage locally for VM guests' disk image files to live on. This enables live migration, and would allow quick and complete failover should we lose one node.)

N
Arnold Krille
14 years ago
Post by Nick Morrison
My apologies if this is a frequently-asked question, but I hope it can be
answered easily. I'm a bit new to this, so forgive my n00bness.
I have two physical servers, reasonably specced. I have a 250GB LVM volume
spare on each physical server (/dev/foo/storage). I would like to build a
KVM/QEMU virtual machine on each physical server, connect /dev/foo/storage
to the virtual machines, and run DRBD inside the two VM guests.
Why would you want to do that?
...
I toyed with ocfs2 for about two days. Then I just went for gfs2, and since
then our dual-primary system has been providing the storage for the virtual
systems just fine. That is, we mostly use files as disks for the VMs. If you
are okay with clustered lvm, you can also do that. But as far as I see it,
that is harder to extend unless you export the underlying drbd resource via
iscsi/AoE to the third and following nodes.
With gfs2 (or ocfs2) the third node can just mount the same dir via nfs.
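
For anyone following along, the dual-primary part is only a few lines of
drbd.conf; the resource name is an example and the after-sb policies are
one reasonable choice, not the only one:

  resource r0 {
    startup {
      become-primary-on both;
    }
    net {
      allow-two-primaries;
      after-sb-0pri discard-zero-changes;
      after-sb-1pri discard-secondary;
      after-sb-2pri disconnect;
    }
  }
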
Post by Nick Morrison
Am I mad? Should it work? Will performance suck compared with running
DRBD directly on the physical machines?
It is one additional layer of complexity and of weak points: for your
cluster to work, you not only need the host to work but also some special
virtual machines. And of course you add one more layer of buffering and
CPU load.

Have fun,

Arnold
Nick Morrison
14 years ago
...
If you don't mind my asking: Which distro are you using for your GFS2 cluster?

Nick
Arnold Krille
14 years ago
...
Debian squeeze with the cluster stuff from backports. And the kernel from
backports because of the e1000e network cards, which makes drbd complain
because kernel-space is one micro-version smaller than userspace. I plan to
downgrade the kernel once I create or find dkms packages for the e1000e...

I have to admit that I haven't yet decided whether the third and following
nodes will receive the virt-pool via iscsi+gfs or nfs; iscsi+gfs has the
advantage that I could do multipath-iscsi.
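
A sketch of the iscsi variant, assuming the iSCSI Enterprise Target that
squeeze ships; the iqn and device are examples:

  # /etc/iet/ietd.conf on both storage nodes
  Target iqn.2011-01.local.cluster:drbd.r0
      Lun 0 Path=/dev/drbd0,Type=blockio

The initiators on the extra nodes would then log in to both portals and
run multipathd on top.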

That being said, I can't wait for drbd to support more than two active mirrors!

Have fun,

Arnold

Nick Morrison
14 years ago
Post by Arnold Krille
Post by Nick Morrison
I have two physical servers, reasonably specced. I have a 250GB LVM volume
spare on each physical server (/dev/foo/storage). I would like to build a
KVM/QEMU virtual machine on each physical server, connect /dev/foo/storage
to the virtual machines, and run DRBD inside the two VM guests.
Why would you want to do that?
I have two servers; I would like to provide shared storage, and run virtual machines. I don't want to care which server the VM runs on.

I would prefer to use primary/secondary+nfs rather than dual-primary+gfs2/ocfs2, because of the hassle of split-brain. I mount nfs-floating-ip:/data/share into /var/lib/libvirt/images/nfs, and it doesn't matter whether the NFS server / floating IP is local or on the other machine.

Except it does, because it's NFS.

This setup seems to work cleanly if I mount the NFS shares from a third machine. Pacemaker can fail the NFSD/IP/DRBD from machine to machine while files are open on the client, and it all just works. Lovely. And no risk of split brain.
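
For reference, the Pacemaker side of that is roughly the following crm configuration; resource names, device, mount point, filesystem and IP are all just examples, and the NFS daemon is driven through the distro's init script:

  primitive p_drbd ocf:linbit:drbd \
      params drbd_resource="r0" \
      op monitor interval="29s" role="Master" \
      op monitor interval="31s" role="Slave"
  ms ms_drbd p_drbd \
      meta master-max="1" clone-max="2" notify="true"
  primitive p_fs ocf:heartbeat:Filesystem \
      params device="/dev/drbd0" directory="/data" fstype="ext4"
  primitive p_nfsd lsb:nfs-kernel-server
  primitive p_ip ocf:heartbeat:IPaddr2 \
      params ip="192.168.0.100" cidr_netmask="24"
  group g_nfs p_fs p_nfsd p_ip
  colocation c_nfs_on_master inf: g_nfs ms_drbd:Master
  order o_drbd_before_nfs inf: ms_drbd:promote g_nfs:start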

But I don't have three machines :-( Only two.

If I try to mount an NFS share from the *same server* that's exporting it, I run into trouble when I try to fail over: nfs-kernel-server seems to hold onto some file or other (lsof hangs, so I can't find out what). Unmounting the filesystem sometimes works and sometimes doesn't; regardless, drbd still can't demote itself to secondary, saying someone is holding the device open.

My idea was therefore to run the NFS server from inside a virtual machine, static to a single physical server. Ugly, perhaps :-) I thought it was weirdly elegant.

I believe there is a product called the Gluster virtual appliance, which (I think) uses a similar technique (though using Gluster, obviously, and not DRBD).

I've had hassle after hassle trying to make pacemaker drive OCFS2 and O2CB using stock Ubuntu server installations - missing the ocf:ocfs2:o2cb resource agent, for example, and having no supported way of installing it.

I'm sure I'm missing something terribly obvious in all this. I've been working on it for quite a few man-hours now, and I'm probably not seeing the wood for the trees...

Any advice or words of encouragement warmly accepted! :-)

Cheers and beers,
Nick