linuxwebcluster.com

Linux Data Replication Methods for High Availability

Many Linux servers have critical data that must be available from a redundant source in case the primary server fails. Sometimes this requires a high availability cluster, and sometimes it just requires a copy of the data somewhere else. When configuring some form of data replication on a Linux server, an important point to consider is the nature of the data itself. Sometimes this can severely limit the replication methods available to you.

I recently spoke with a System Administrator who wanted a general purpose replication scheme that could be used on any kind of data. He wasn't sure what kinds of data he would eventually want to replicate, and really just wanted to cover his bases and make sure that if he chose our product it would have broad applicability within his datacenter. As it turns out, this led to an interesting discussion of what replication methods are best for different types of data, and why.

 

Asynchronous vs. Synchronous

Generally speaking, there are two choices. The first and easiest is to replicate files asynchronously with a tool like rsync. This can be as simple as running rsync manually or from cron every so often, or as complicated as writing custom tools that watch for changes in your data and synchronize machines automatically when changes are detected. The kernel's inotify feature can help with the latter if you want to write your own application that acts when specified files or directories change. Whether you keep it simple or not, this method is nice because replicating data won't slow down your application. On the downside, the data is not replicated immediately. Whether the delay is one second or one day, there is still a delay, and after a failover there is no guarantee that your secondary server will pick up the data where the first server left off.
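As a rough illustration of the simple end of that spectrum, here is a minimal Python sketch of a one-way rsync push that a cron job could call. The host name backup1 and the /var/www paths are made up for the example, and the flag choice (-a plus --delete) is just one reasonable assumption, not a recommendation for every data set.

```python
import subprocess

def build_rsync_command(source_dir, dest_host, dest_dir, delete=True):
    """Return the argv list for a one-way rsync push of source_dir to dest_host."""
    # A trailing slash tells rsync to copy the directory's contents,
    # not the directory itself.
    src = source_dir.rstrip("/") + "/"
    cmd = ["rsync", "-a"]        # -a: archive mode (recursive, keeps perms and times)
    if delete:
        cmd.append("--delete")   # drop files on the mirror that are gone on the source
    cmd += [src, f"{dest_host}:{dest_dir}"]
    return cmd

def replicate_once(source_dir, dest_host, dest_dir):
    """Run one replication pass; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(source_dir, dest_host, dest_dir), check=True)

# Hypothetical example: show the command a scheduled job would run.
print(" ".join(build_rsync_command("/var/www", "backup1", "/var/www")))
```

Anything written between two runs of replicate_once exists only on the primary, which is exactly the replication delay described above.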

The second choice is to use a replicated cluster filesystem or DRBD, both of which are typically used in synchronous mode. This can ensure that two current copies of the data exist at all times. If one server goes down, the data on the other will pick up at the exact same spot and no data will be lost. All filesystem writes are guaranteed to be flushed at both servers before either one considers the transaction complete. While this sounds far better than asynchronous replication, it is much more complicated and therefore comes at a cost.
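The ordering guarantee is easier to see in a toy model. The Python sketch below is not DRBD or any real cluster filesystem. Replica and SyncMirror are hypothetical names, the "disks" are dictionaries, and the model simply refuses writes when a peer is unreachable, whereas real systems typically have policies for degraded operation that are not modelled here.

```python
class Replica:
    """Stand-in for one server's disk: a dict of block number -> bytes."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True

    def flush(self, block_no, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.blocks[block_no] = data


class SyncMirror:
    """Acknowledge a write only after every replica has flushed it."""
    def __init__(self, primary, secondary):
        self.replicas = [primary, secondary]

    def write(self, block_no, data):
        # Check every replica first so a failure leaves no copy half-updated.
        for r in self.replicas:
            if not r.online:
                raise ConnectionError(f"{r.name} is unreachable; write not acknowledged")
        for r in self.replicas:
            r.flush(block_no, data)
        return True  # the application sees success only at this point


primary, secondary = Replica("primary"), Replica("secondary")
mirror = SyncMirror(primary, secondary)
mirror.write(0, b"hello")
print(primary.blocks == secondary.blocks)
```

Because write returns only after both flushes, the application never holds an acknowledgement for data that exists on just one server. The price is that every write waits on the slower of the two.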

Because cluster filesystems put an additional layer of software between the block device and the application, and because they often have to perform many additional tasks when writing data (like talking to remote servers, storing version information, etc.), they can be orders of magnitude slower than normal Linux filesystems like ext3. If the setup is optimized, and especially if it uses a high-speed, low-latency interconnect like InfiniBand or Myrinet, the delays can be minimized. But these interconnects are expensive compared to traditional gigabit Ethernet, so solving the speed problem raises the cost of even very small clusters by a few thousand dollars at a minimum. The complexity is still much greater as well, which can reduce reliability.

 

Problem Summary

Is either method necessarily better? Not really. It depends on what's most important: speed or not losing your most recent data updates. The first kind of replication (asynchronous) gives us fast writes and is easy to implement, but lacks strict data integrity guarantees. The second method (synchronous) will guarantee that both copies are identical, but it's slow, more complex, and possibly very expensive as well. As it turns out, the type of data you are replicating is also a critical factor in choosing the correct method. You can't just choose either one based on your preference.

 

Large Files

If you are replicating large binary files like the ones used by MySQL, there is an inherent problem with rsync'ing the data to another server. If you change even one byte of that file and then update the mirror (i.e., run rsync again), the whole file has to be scanned to find what changed. It's the nature of rsync. It first checks whether the file's size or modification time differs from the copy on the mirror, and if so, it reads through the file computing block checksums to locate the differences and replicate them. On a huge file, just scanning for the change takes a long time and quickly bogs the system down. In practice, rsync-style asynchronous replication generally fails for large files that change continuously. We are left with two alternatives: synchronous replication and/or some kind of replication capability that is built in to the application itself.
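To make the scanning cost concrete, here is a deliberately simplified Python sketch. It is not rsync's actual rolling-checksum algorithm: it assumes fixed-size blocks, two files of equal length, and a made-up BLOCK_SIZE. What it does show is that locating a one-byte change means hashing every block of both copies.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative value, not rsync's real block-size heuristic

def block_digests(data, block_size=BLOCK_SIZE):
    """Hash every block of data; the whole input has to be read to do this."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return (indices of differing blocks, total number of blocks hashed)."""
    old_d = block_digests(old, block_size)
    new_d = block_digests(new, block_size)
    diff = [i for i, (a, b) in enumerate(zip(old_d, new_d)) if a != b]
    return diff, len(old_d) + len(new_d)

original = bytes(1024 * 1024)        # 1 MiB of zeros standing in for a database file
modified = bytearray(original)
modified[500_000] = 1                # change a single byte
diff, hashed = changed_blocks(original, bytes(modified))
print("blocks that differ:", diff, "blocks hashed:", hashed)
```

Only one block needs to be sent, but all of them had to be read and hashed to find it. Scale the stand-in file up to tens of gigabytes and repeat that on every run, and the problem described above appears.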

Synchronous replication using a cluster filesystem is advantageous here because no lengthy scanning needs to take place. A filesystem write is captured before it is written to disk and the small change is immediately sent to the replication partner. Then both systems flush the change at once and report back to the application that the write was completed. Compared to the rsync method, flushing the change is incredibly fast.

The other choice for large files is to rely on replication that's built in to the application itself. MySQL has replication mechanisms, but not all applications do. In addition, these replication mechanisms have pitfalls of their own. What happens when a transaction fails to complete on a slave node? Generally, replication will stop and wait for you to manually resolve the problem. Will you be notified? It's possible, but probably not unless you set up a special facility for this. If the slave node is now out of sync, can it be re-synced? Again, probably not. The slave needs to be re-created from a fresh copy of the master database, and replication restarted from a known point. If this happens frequently it can be very time consuming. Can you be certain that the slaves are identical to the master? After all, a slave is not a copy per se, but a set of data on another server that is supposed to have undergone the same transactions as the master. You can check externally to make sure they are identical, but this is not built in, and over time divergence due to missed or failed transactions is a real administrative concern. All of these issues need to be considered when you are trying to decide the best way to implement database or large file replication.
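One way to picture such an external check is a digest comparison over the rows of each table. The Python sketch below works on plain lists of tuples. table_digest and in_sync are hypothetical helpers, fetching the rows from a real master and slave is left out, and the result is only meaningful if both sides are read at the same replication position.

```python
import hashlib

def table_digest(rows):
    """Order-independent digest of a table given as an iterable of row tuples."""
    h = hashlib.sha256()
    for line in sorted(repr(r) for r in rows):
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

def in_sync(master_rows, slave_rows):
    """True if both sides hold exactly the same set of rows."""
    return table_digest(master_rows) == table_digest(slave_rows)

# Made-up data: the slave missed the transaction that inserted row 3.
master = [(1, "alice"), (2, "bob"), (3, "carol")]
slave = [(2, "bob"), (1, "alice")]
print(in_sync(master, slave))
```

A check like this only reports that divergence exists. Repairing it still means the rebuild procedure described above.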

 

Small Files

Small and/or rarely changing files require an entirely different set of strategies. As a general rule, rsync and other asynchronous methods are very efficient in this case. Say you have a thousand smaller files and you randomly change 20 of them. Because everything is local and low overhead, the writes are fast and the filesystems are simple. Rsync will see very quickly that 980 of the files have not changed and that only 20 need to be updated. It will calculate the differences in those 20 (which is very fast because they are small) and send them to the replication partner. The differences will be applied and the mirror will be updated.
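The reason the 980 unchanged files are cheap is that they can be ruled out from metadata alone. The following Python sketch imitates that idea by comparing size and modification time. The function names are invented for the example, and it illustrates the principle rather than reimplementing rsync.

```python
import os

def snapshot(root):
    """Map each file's relative path to (size, mtime) without reading file contents."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return state

def files_to_update(source_state, mirror_state):
    """Paths that are new, or whose size or mtime differ from the mirror's record."""
    return sorted(path for path, meta in source_state.items()
                  if mirror_state.get(path) != meta)

# Simulated metadata for 1000 files, 20 of which changed since the last sync.
mirror = {f"file{i:04d}": (1024, 1_000_000) for i in range(1000)}
source = dict(mirror)
for i in range(20):
    source[f"file{i:04d}"] = (2048, 2_000_000)
print(len(files_to_update(source, mirror)), "of", len(source), "files need work")
```

Only the files returned by files_to_update ever have their contents read, so the cost of a sync pass tracks the number of changes rather than the size of the data set.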

Asynchronous replication is probably the method of choice here because it is the simplest effective solution. The overhead of a cluster filesystem is not likely to be worth the tradeoff in speed and cost, unless those issues are unimportant in your environment or unless you need both copies of the mirror to be identical in real time. This situation, where there are large numbers of small and relatively static files, is commonly seen in fileservers and web servers. In fact, whenever you download something from the Internet you may notice that it comes from one of many "mirror" sites that all have a copy of the same file. A nightly rsync is the traditional way that these mirrors are maintained. If a new file is to be made available, the web developer will upload it to a master server and every night the slave servers will use rsync to update their copies of the available files. The next morning, all the mirrors will be identical thanks to asynchronous replication.

 

Conclusion

When choosing the best data replication method for your particular situation, there are two things that must take center stage. First, understand the strengths and weaknesses of asynchronous and synchronous replication. Asynchronous replication is fast and simple, but does not replicate in real time and thus cannot guarantee identical sets of data at the moment of failover. It prioritizes speed and ease of implementation. Synchronous replication always guarantees identical data sets, but it's usually much slower and possibly expensive. It prioritizes data integrity above all else.

Second, think about the kind of data you are going to be replicating. This could determine that only one method or the other will be practical. If large database files need to be replicated then a synchronous cluster filesystem or application specific tool is likely to be the best choice. If you have mostly small files or large files that rarely if ever change, asynchronous replication may have the edge by virtue of its speed and simplicity.

 

Scalability and High Availability: In the Cloud vs. In Your Datacenter

Cloud computing opens up huge opportunities for deploying identical servers quickly and efficiently. The challenge is that most applications are not built for distributed or parallel operation across many servers, leaving us with some underlying problems that remain to be solved:

  • Shared Data: How will all nodes access shared data like application configuration files? Do I need a cluster filesystem?
  • Load Balancing: I can start 20 nodes, but how will I distribute inbound connections across them?
  • Health Checking: When a node fails, how will it get removed from load balancing until it becomes healthy again?
  • Failover / High Availability: What if the database server or load balancer fails? How will my 20 application servers continue to operate if this single point of failure (SPOF) has a problem?

 

Today we can deploy more nodes than ever before in record time. What the aforementioned challenges show is that simply having more nodes does not mean you have an effective cluster that allows your application to scale. You need to consider these additional factors if you want scalability and high availability that works in the real world.
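Two of the items on that list, load balancing and health checking, fit together in a very small model. The Python sketch below is a hypothetical round-robin balancer that skips nodes marked unhealthy. The node names are invented and the health checks themselves (what calls mark_down and mark_up) are left out.

```python
class RoundRobinBalancer:
    """Rotate through nodes, skipping any that are currently marked unhealthy."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        """Return the next healthy node, or raise if none are available."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = RoundRobinBalancer(["app1", "app2", "app3"])
lb.mark_down("app2")                  # a health check has failed for app2
print([lb.pick() for _ in range(4)])  # connections go only to app1 and app3
```

Note that the balancer object itself is a single point of failure in this model, which is the fourth item on the list.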

A Linux system administrator may be surprised to look at this list and observe that these are the very same issues that they face with "In House" clusters. But this is as it should be; after all, cloud computing is really about creating servers. The fact that they are virtual instead of physical doesn't make all the networking challenges go away.

Yesterday I saw a solution from RightScale that addressed this very issue. It was refreshing to see another company that understands true scalability and doesn't try to mislead their customers by telling them that they can scale their performance instantly by just switching on more nodes.

The RightScale solution is called RightScale Website Edition and they provide a diagram that neatly captures all the plumbing required to make a service scalable and highly available. There are two load balancers, two database servers with replication, and multiple app servers.

But all this comes with a cost. Their offering is weighed down by a $2500 setup fee and a $500 monthly payment. You are getting good technology and their approach is well thought out. What is needed is an application that provides these services in your local environment.

ClusterMaker software does exactly that. It is the only way to deliver cloud style ease of use into your own datacenter. After installing it on a Linux server along with your applications, you can add application nodes (or compute nodes, for the HPC crowd) by simply setting them for PXE boot and turning them on. The master server running ClusterMaker will take care of the rest:

  • Deploying an operating system
  • Configuring a cluster filesystem
  • Creating a shared root cluster where all nodes see the same data
  • Setting up load balancing and automatic health monitoring
  • Web based performance monitoring and node management
  • Creating a backup master by cloning the original server
  • Setting up data replication and synchronization between the two
  • Creating a failover virtual IP address that is always active so the nodes always have access to the shared root despite inevitable failures
  • Installing easy to use system state snapshotting with system restore points
  • Separate snapshotting for data / databases that can be used without taking the database offline

 

While the cloud is an amazing platform with many scalability advantages over traditional hardware, it is not the right choice for all applications. Our goal is to provide an alternative that lets you quickly and easily implement similar levels of Linux high availability and scalability in your own datacenter.

ClusterMaker is easy to use, extremely cost effective, and has one more big advantage: It provides a consistent framework for cluster building. In the open source world, there are so many different cluster tools that Linux system administrators each tend to have their own favorite ways to approach the problem. This reduces productivity every time one admin has to pick up where another left off. Using ClusterMaker, "brain drain" from employee turnover or transfer in this area is virtually eliminated, saving the company hundreds or thousands of dollars in administrator time.

 

Resolved: kernel panic when creating point in time mysql database snapshot

At some point in the CentOS / RHEL kernel-2.6.18-164.x kernel tree, a bug was introduced that caused an LVM snapshot of a root volume to panic the kernel (see LVM discussion list post here). As a result, anyone who snapshotted a database that was not on a separate logical volume immediately hung their server (while trying to back it up in the name of high availability - oh, the irony). Actually, it works once, then hangs the server every time after that.

However, this bug has been examined and resolved as of the kernel-2.6.18-178 tree. If this issue affects you, there are a few things you can do while waiting for the newer kernel version to appear in your favorite yum repository. The latest test kernel RPMs can usually be downloaded here, courtesy of John Linville of Red Hat:

http://people.redhat.com/linville/kernels/rhel5/

All of these should have the patch installed, plus anything else that has recently been fixed or added.

Alternatively, we provide a working kernel and the corresponding kernel-devel package on our site as well:

http://www.linuxwebcluster.com/download/kernel-2.6.18-182.el5.i686.rpm

http://www.linuxwebcluster.com/download/kernel-devel-2.6.18-182.el5.i686.rpm

Just download and install on RHEL / CentOS 5.x with rpm. After installation a quick reboot should load up the new kernel and your snapshots will work again just fine.
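If you want to check a batch of servers before scheduling reboots, the version range above can be turned into a small test. This Python sketch assumes only what this post states: the bug appeared somewhere in the 164.x tree and is fixed as of 178. Because the exact 164.x sub-release is not known, a True result means "possibly affected", not "definitely affected". The function names are made up for the example.

```python
import platform
import re

FIRST_AFFECTED = 164   # per this post: bug appeared somewhere in the 164.x tree
FIRST_FIXED = 178      # per this post: resolved as of the 178 tree

def build_number(release):
    """Extract the build number from a release string like '2.6.18-164.2.1.el5'."""
    m = re.match(r"2\.6\.18-(\d+)", release)
    return int(m.group(1)) if m else None

def possibly_affected(release):
    """True if the release falls in the range this post describes as buggy."""
    n = build_number(release)
    return n is not None and FIRST_AFFECTED <= n < FIRST_FIXED

# Check the kernel this script is running on.
print(platform.release(), possibly_affected(platform.release()))
```

Kernels outside the 2.6.18 series are reported as not affected simply because this post says nothing about them.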

 


 

Here's the original kernel oops and LVM-discuss post so Google searchers can hopefully find this information:

BUG: scheduling while atomic: java/0x00000001/2959
[<c061637f>] <3>BUG: scheduling while atomic: java/0x00000001/2867
[<c061637f>] schedule+0x43/0xa55
[<c042c40d>] lock_timer_base+0x15/0x2f
[<c042c46b>] try_to_del_timer_sync+0x44/0x4a
[<c0437dd2>] futex_wake+0x3c/0xa5
[<c0434d5f>] prepare_to_wait+0x24/0x46
[<c0461ea7>] do_wp_page+0x1b3/0x5bb
[<c0438b01>] do_futex+0x239/0xb5e
[<c0434c13>] autoremove_wake_function+0x0/0x2d
[<c0463876>] __handle_mm_fault+0x9a9/0xa15
[<c041e727>] default_wake_function+0x0/0xc
[<c046548d>] unmap_region+0xe1/0xf0
[<c061954f>] do_page_fault+0x233/0x4e1
[<c061931c>] do_page_fault+0x0/0x4e1
[<c0405a89>] error_code+0x39/0x40
=======================
schedule+0x43/0xa55
[<c042c40d>] <0>------------[ cut here ]------------
kernel BUG at arch/i386/mm/highmem.c:43!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
ip6t_REJECTdCPU: 3 ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo cry
EIP: 0060:[<c041cb08>] Not tainted VLI
EFLAGS: 00010206 (2.6.18-164.2.1.el5 #1)
EIP is at kmap_atomic+0x5c/0x7f
eax: c0012d6c ebx: fff5b000 ecx: c1fb8760 edx: 00000180
esi: f7be8580 edi: f7fa7000 ebp: 00000004 esp: f5c54f0c
ds: 007b es: 007b ss: 0068
Process mpath_wait (pid: 3273, ti=f5c54000 task=f5c50000 task.ti=f5c54000)ne
Stack: c073a4e0 c0462f7f f7b0eb30 f7b40780 f5c54f3c 0029c3f0 f63b5ef0 f7be8580
f7b40780 f7fa7000 00008802 c0472d75 f7b0eb30 f7c299c0 00001000 00001000
00001000 00000101 00000001 00000000 00000000 f5c5007b 0000007b ffffffff
Call Trace:
[<c0462f7f>] __handle_mm_fault+0xb2/0xa15
[<c0472d75>] do_filp_open+0x2b/0x31
[<c061954f>] do_page_fault+0x233/0x4e1
[<c061931c>] do_page_fault+0x0/0x4e1
[<c0405a89>] error_code+0x39/0x40
=======================
Code: 00 89 e0 25 00 f0 ff ff 6b 50 10 1b 8d 14 13 bb 00 f0 ff ff 8d 42 44 c1 e
EIP: [<c041cb08>] kmap_atomic+0x5c/0x7f SS:ESP 0068:f5c54f0c
<0>Kernel panic - not syncing: Fatal exception

0c 29 c3 a1 54 12 79 c0 c1 e2 02 29 d0 83 38 00 74 08 <0f> 0b 2b

 

Mental Ray frame buffer issue

We just went through a render farm tech support call with a customer using Maya / Mental Ray on RenderFarmer 1.1, and I thought this information might be useful to others. The problem was that Mental Ray likes to use /usr/tmp for its local frame buffer, but /usr/tmp is shared by all nodes because of the shared root filesystem. So not only is it slow to access because it's across the network, but the nodes are writing all over each other's temp files.

First we tried a bind mount that hooks /usr/tmp into /tmp, which was mounted on a local ramdisk. The access speed was great, but processing time was very slow. Why? The ramdisk was tiny, just 10 MB by default. Nowhere near enough for Mental Ray scratch space.

Solution two was to remove the symlink on each node to /usr/tmp, then create a local /usr/tmp (inside the rootfs, which is also a ramdisk). Presto!  We now have a /usr/tmp that is not limited to an arbitrary size, and because it's in memory, it's lightning fast.

Write speeds observed on a RenderFarmer 1.1 cluster using gigabit Ethernet:
Compute Node --> remote filesystem          70 MB/s
Master Server --> local disk filesystem    720 MB/s
Compute Node --> local ramdisk /usr/tmp   2900 MB/s
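If you want to compare your own nodes, a rough sequential write test is easy to script. The Python sketch below is not the tool used to produce the numbers above. The function name, file name, and sizes are arbitrary choices, and on a ramdisk the file must of course fit in memory.

```python
import os
import time

def write_speed_mb_s(directory, total_mb=64, chunk_mb=1):
    """Write roughly total_mb of zeros to a scratch file in directory; return MB/s."""
    chunk = bytes(chunk_mb * 1024 * 1024)
    chunks = total_mb // chunk_mb
    path = os.path.join(directory, "write_speed_test.tmp")
    start = time.perf_counter()
    try:
        with open(path, "wb") as f:
            for _ in range(chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())   # count the time to push data out of the page cache
        elapsed = time.perf_counter() - start
    finally:
        if os.path.exists(path):
            os.remove(path)        # never leave the scratch file behind
    return (chunks * chunk_mb) / elapsed
```

Calling write_speed_mb_s("/usr/tmp") on a compute node and on the master gives figures that can be compared with each other, though a single short run like this is noisy and should be repeated a few times.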
 

atl1e attansic linux driver

Several customers have had some issues getting the atl1e driver to work correctly on CentOS Linux 5.2 and 5.3. There is a kmod-atl1e package floating around some RPM repositories, but that doesn't always work right. After spending a fair amount of time helping customers work around problems caused by kmod (drivers are not in the normal place, updating a kernel breaks it and then it won't let you remove / reinstall, etc.), I found the easy way to get these drivers.

Start here. This gentleman was kind enough to rework the Makefile that comes with the atl1e driver source and post the new one with lots of fixes. If that ever goes down you can download it from us here.

Just untar it with 'tar xvf l1e.tar' and cd into the src/ subdirectory, and run 'make install'. The modules are now in the usual place, /lib/modules/<kernel-version>/kernel/drivers/net/atl1e.
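To confirm where the build landed, you can look for module files in that directory for the running kernel. This is a small Python sketch with made-up helper names. It assumes the built modules carry the usual .ko extension and makes no claim about what the files are called.

```python
import os
import platform

def atl1e_module_dir(kernel_release=None, modules_root="/lib/modules"):
    """Build the directory path named above for a given (or the running) kernel."""
    release = kernel_release or platform.release()
    return os.path.join(modules_root, release, "kernel", "drivers", "net", "atl1e")

def installed_modules(module_dir):
    """List .ko files in module_dir, or an empty list if it does not exist."""
    if not os.path.isdir(module_dir):
        return []
    return sorted(f for f in os.listdir(module_dir) if f.endswith(".ko"))

path = atl1e_module_dir()
print(path, installed_modules(path))
```

An empty list after 'make install' suggests the modules were built against a different kernel version than the one currently running.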

 

Copyright 2010    RapidScale Clusters, LLC