PoC Deduplication of Logical Volume Image
From Wirespeed
This is a proof of concept (PoC) I wrote to minimize the storage space needed for images of a Logical Volume (LV). The idea is that only a limited part of an LV changes between two images, so recreating a full image every time stores a lot of duplicate data. I run my virtual machines from LVs, and my disaster recovery consists of a full LV image combined with a file-based backup. The image is recreated only at relatively long intervals, whereas the filesystem backup is a daily thing. A new image should be created every time the internal LVM configuration changes or grub changes (for example, after a new kernel). The code could use more comments, but being simply a PoC, it still needs a lot of thought:
Issues
The problems with this little script are:
- The concept itself has had little or no sanity checking.
- I never tried a full restore.
- A proper command line interface and input validation are simply nonexistent.
- I have no clue whether the MD5-SHA512 combination is strong enough to identify identical blocks. Maybe a more detailed collision check should be included.
- Proper error checking is another missing feature.
- Using /dev/shm for intermediate results (or proper Perl libraries instead of system commands) may speed things up.
- There is no housekeeping. Consider smart use of a directory per image and hardlinking the .bz2 files.
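The collision concern in the list above could be handled by comparing the actual bytes before a block is treated as a duplicate. What follows is a minimal sketch, not part of the original script: the helper name `same_content` is hypothetical, and it assumes the `IO::Uncompress::Bunzip2` module (shipped with recent Perl versions) is installed and that one extent fits comfortably in memory.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Hypothetical helper: returns 1 if the uncompressed file $raw holds exactly
# the same bytes as the bzip2-compressed file $bz2, and 0 otherwise.
sub same_content {
    my ( $raw , $bz2 ) = @_;

    # Slurp the freshly extracted extent.
    open my $in , '<:raw' , $raw or die "Cannot open $raw: $!\n";
    my $plain = do { local $/; <$in> };
    close $in;
    $plain = '' unless defined $plain;

    # Decompress the stored block into memory.
    my $stored = '';
    bunzip2( $bz2 => \$stored ) or die "Cannot unpack $bz2: $Bunzip2Error\n";

    return $plain eq $stored ? 1 : 0;
}
```

In the script, the duplicate test could then become something like `if ( -e "${filename}.bz2" && same_content( "part.$part" , "${filename}.bz2" ) )`. The price is one extra decompression for every duplicate extent found.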
Use
- At boot time of the physical host, create an LVM snapshot of the LV that needs to be imaged.
- Start the VM.
- Start imaging the LV snapshot (consider using 'at' to schedule this).
- Remove the LV snapshot.
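The steps above can be sketched as a small wrapper around `lvcreate` and `lvremove`. This is only an illustration under stated assumptions: the snapshot name, the snapshot size and the `--run` switch are made up for this example, and the VM start and the imaging call are left as comments because they depend on the local setup. Without `--run` the commands are only printed.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Build the command that creates the snapshot and the one that removes it.
# Returns two array references plus the device path of the snapshot.
sub snapshot_commands {
    my ( $origin , $snap_name , $snap_size ) = @_;
    # The snapshot lives in the same volume group as its origin.
    ( my $snap_path = $origin ) =~ s{[^/]+$}{$snap_name};
    my @create = ( 'lvcreate' , '--snapshot' , '--size' , $snap_size , '--name' , $snap_name , $origin );
    my @remove = ( 'lvremove' , '-f' , $snap_path );
    return ( \@create , \@remove , $snap_path );
}

# Dry run unless --run is given (real execution needs root).
my $for_real = grep { $_ eq '--run' } @ARGV;

sub run {
    my @cmd = @_;
    print "+ @cmd\n";
    return 1 unless $for_real;
    return system( @cmd ) == 0;
}

# Assumed values: the name and size are placeholders, not from the original setup.
my ( $create , $remove , $snap_path ) =
    snapshot_commands( '/dev/vg_diablo/vm_localserver' , 'vm_localserver_snap' , '2G' );

run( @$create ) or die "Snapshot creation failed\n";
# Start the VM here (site-specific, not shown).
# Run the imaging script here with $lv set to $snap_path.
print "Image source would be $snap_path\n";
run( @$remove ) or die "Snapshot removal failed\n";
```

The snapshot must be large enough to absorb everything the VM writes while imaging runs; if it fills up, it becomes invalid and the image is useless.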
BTW: It is no coincidence that an almost identical article is on linuxforms.org. I wrote that article too.
The script
#!/usr/bin/perl
use warnings;
use strict;

# Logical volume to image (preferably a snapshot of it).
my $lv = '/dev/vg_diablo/vm_localserver';

# 'lvdisplay -c' prints one colon-separated record; only the volume group
# name and the number of logical extents are used below.
my ( $logical_volume_name , $volume_group_name , $logical_volume_access , $logical_volume_status , $internal_logical_volume_number , $open_count_of_logical_volume , $logical_volume_size_in_sectors , $current_logical_extents_associated_to_logical_volume , $allocated_logical_extents_of_logical_volume , $allocation_policy_of_logical_volume , $read_ahead_sectors_of_logical_volume , $major_device_number_of_logical_volume , $minor_device_number_of_logical_volume ) = split( /:/ , `lvdisplay -c $lv` );
print "current_logical_extents_associated_to_logical_volume $current_logical_extents_associated_to_logical_volume\n";

# 'vgdisplay -c' supplies the extent size, which is passed to dd in kilobytes.
my ( $volume_group_name_too , $volume_group_access , $volume_group_status , $internal_volume_group_number , $maximum_number_of_logical_volumes , $current_number_of_logical_volumes , $open_count_of_all_logical_volumes_in_this_volume_group , $maximum_logical_volume_size , $maximum_number_of_physical_volumes , $current_number_of_physical_volumes , $actual_number_of_physical_volumes , $size_of_volume_group_in_kilobytes , $physical_extent_size , $total_number_of_physical_extents_for_this_volume_group , $allocated_number_of_physical_extents_for_this_volume_group , $free_number_of_physical_extents_for_this_volume_group , $uuid_of_volume_group ) = split( /:/ , `vgdisplay -c $volume_group_name` );
print "physical extent size $physical_extent_size\n";

# rebuild.info lists, in extent order, which stored block holds each extent.
open REBUILDINFO , "> rebuild.info" or die "Cannot create file: $!\n";
for ( my $count = 0; $count < $current_logical_extents_associated_to_logical_volume; $count++) {
    my $part = sprintf "%08i" , $count;

    # Copy exactly one extent out of the volume.
    system( "dd if=$lv of=part.$part bs=${physical_extent_size}k skip=$count count=1" ) == 0
        or die "dd failed for extent $count\n";

    # The block is stored under the name <md5>-<sha512>.
    my $md5sum = `md5sum part.$part -b`;
    chomp $md5sum;
    $md5sum =~ s/^([0-9a-f]{32}).*$/$1/;
    my $sha512sum = `sha512sum part.$part -b`;
    chomp $sha512sum;
    $sha512sum =~ s/^([0-9a-f]{128}).*$/$1/;
    my $filename = "$md5sum-$sha512sum";
    print REBUILDINFO "part.$part $filename\n";

    if ( -e "${filename}.bz2" ) {
        # A block with the same hashes is already stored; drop this copy.
        print "Duplicate!\n";
        unlink( "part.$part" );
    } else {
        # New content: keep it under its hash name and compress it.
        rename( "part.$part" , $filename ) or die "Cannot rename part.$part: $!\n";
        system( "bzip2 --best $filename" ) == 0 or die "bzip2 failed for $filename\n";
    }
}
close REBUILDINFO;
exit;
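Since a full restore was never tried, here is an untested-in-practice sketch of what one could look like, based only on the format the script writes: each line of rebuild.info names an extent and the hash-named block that holds its content. The helper name `restore_image` and the three command line arguments are hypothetical, and the sketch assumes the `IO::Uncompress::Bunzip2` module is available.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Hypothetical helper: rebuild an image from rebuild.info and the block files.
#   $info   - path to rebuild.info as written by the imaging script
#   $dir    - directory holding the <md5>-<sha512>.bz2 files
#   $target - file (or block device) that receives the image
# Returns the number of extents written.
sub restore_image {
    my ( $info , $dir , $target ) = @_;
    open my $list , '<' , $info or die "Cannot read $info: $!\n";
    open my $out , '>:raw' , $target or die "Cannot write $target: $!\n";
    my $extents = 0;
    while ( my $line = <$list> ) {
        chomp $line;
        # Each line looks like: part.00000000 <md5>-<sha512>
        my ( $part , $name ) = split ' ' , $line;
        next unless defined $name;
        my $data = '';
        bunzip2( "$dir/$name.bz2" => \$data )
            or die "Cannot unpack $name.bz2 for $part: $Bunzip2Error\n";
        # Extents are listed in order, so appending rebuilds the volume.
        print {$out} $data or die "Write to $target failed: $!\n";
        $extents++;
    }
    close $out or die "Closing $target failed: $!\n";
    close $list;
    return $extents;
}

# Example call: restore.pl rebuild.info /path/to/blocks /path/to/target
if ( @ARGV == 3 ) {
    my $count = restore_image( @ARGV );
    print "Restored $count extents\n";
}
```

Restoring to a scratch file first and comparing its checksum with that of the original volume would be a cheap way to gain some confidence before relying on this.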
Results
My webserver is in a logical volume. The largest gain should come once more images of the server are stored, e.g. on a monthly basis.
- Initial size: 20GB
- Straightforward bzip2 -9 archive: 6.4GB
- Initial result with this script: 6.5GB

