PoC Deduplication of Logical Volume Image


This is a proof of concept (PoC) I wrote to minimize the storage space needed for images of a Logical Volume (LV). The idea is that only a limited part of an LV changes between images, so recreating a full image every time stores a lot of duplicate data. I run my virtual machines from an LV, and disaster recovery consists of a full LV image combined with a file-based backup. The image is recreated on a relatively long-term basis, whereas the filesystem backup is made daily. A new image should be created every time the internal LVM configuration or grub changes (for example, after a new kernel). I should insert some comments in the code, but as this is simply a PoC, it still needs a lot of thought:



The problems with this little script are:

  • The concept itself has had little or no sanity checking.
  • A full restore has never been tried.
  • A proper command line interface and input validation are simply nonexistent.
  • I have no clue whether the md5-sha512 combination is strong enough to determine identical blocks. Maybe a more detailed collision check should be included.
  • Proper error checking is another missing feature.
  • Using /dev/shm for intermediate results (or proper Perl libraries instead of system commands) may speed things up.
  • There is no housekeeping. Consider smart use of a directory per image and hardlinking the .bz2 files.
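
The "more detailed collision check" mentioned in the list could be a plain byte-for-byte comparison whenever the checksums match. The following is only a minimal sketch of that idea, not part of the script: the helper name same_block is hypothetical, and it assumes the core module IO::Uncompress::Bunzip2 is available and that one extent fits comfortably in memory.

```perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Hypothetical helper: returns true when the raw block in $part_file is
# byte-for-byte identical to the block stored compressed in $stored_bz2.
sub same_block {
    my ( $part_file, $stored_bz2 ) = @_;

    # Read the freshly extracted block in binary mode.
    open my $fh, '<:raw', $part_file or die "Cannot read $part_file: $!\n";
    my $new = do { local $/; <$fh> };
    close $fh;

    # Decompress the already stored block into memory.
    my $old;
    bunzip2( $stored_bz2 => \$old )
        or die "Cannot decompress $stored_bz2: $Bunzip2Error\n";

    return $new eq $old;
}
```

In the script, such a check would sit in the branch that currently prints "Duplicate!", before the part file is deleted. It costs one extra decompression per duplicate block.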


A possible imaging procedure:

  • At boot time of the physical host, create an LVM snapshot of the LV that needs to be imaged.
  • Start the VM.
  • Start imaging the LV snapshot (consider using 'at').
  • Remove the LV snapshot.
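
The snapshot steps above could be prepared from Perl roughly as follows. This is a sketch under assumptions: the helper snapshot_commands, the snapshot name and the 2G snapshot size are made up for illustration, and the commands are only printed, not executed. Starting the VM is left out because it depends on the hypervisor in use.

```perl
use strict;
use warnings;

# Hypothetical helper: build the LVM commands for a snapshot of $lv.
# The snapshot name and size are illustrative assumptions.
sub snapshot_commands {
    my ( $lv, $snap_name, $snap_size ) = @_;

    # Strip the LV name to get the volume group path, e.g. /dev/vg_diablo.
    my ($vg_path) = $lv =~ m{^(.*)/[^/]+$}
        or die "Unexpected LV path: $lv\n";

    return (
        create => [ 'lvcreate', '--snapshot', '--size', $snap_size, '--name', $snap_name, $lv ],
        remove => [ 'lvremove', '--force', "$vg_path/$snap_name" ],
        device => "$vg_path/$snap_name",
    );
}

# Dry run: show what would be executed, in order.
my %cmd = snapshot_commands( '/dev/vg_diablo/vm_localserver', 'vm_localserver_snap', '2G' );
print "@{ $cmd{create} }\n";
print "image $cmd{device}\n";
print "@{ $cmd{remove} }\n";
```

To run the commands for real, each array could be passed to system() as a list, which avoids shell quoting issues. The imaging script would then read from the snapshot device instead of the live LV.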

BTW: It is no coincidence that an almost identical article is on linuxforms.org. I wrote that article too.

The script


use warnings;
use strict;

my $lv = '/dev/vg_diablo/vm_localserver';

# lvdisplay -c prints one colon-separated record for the LV; split it into named fields.
my ( $logical_volume_name , $volume_group_name , $logical_volume_access , $logical_volume_status , $internal_logical_volume_number , $open_count_of_logical_volume , $logical_volume_size_in_sectors , $current_logical_extents_associated_to_logical_volume , $allocated_logical_extents_of_logical_volume , $allocation_policy_of_logical_volume , $read_ahead_sectors_of_logical_volume , $major_device_number_of_logical_volume , $minor_device_number_of_logical_volume ) = split( /:/ , `lvdisplay -c $lv` );

print "current_logical_extents_associated_to_logical_volume $current_logical_extents_associated_to_logical_volume\n";

# vgdisplay -c does the same for the volume group; only the physical extent size is used below.
my ( $volume_group_name_too , $volume_group_access , $volume_group_status , $internal_volume_group_number , $maximum_number_of_logical_volumes , $current_number_of_logical_volumes , $open_count_of_all_logical_volumes_in_this_volume_group , $maximum_logical_volume_size , $maximum_number_of_physical_volumes , $current_number_of_physical_volumes , $actual_number_of_physical_volumes , $size_of_volume_group_in_kilobytes , $physical_extent_size , $total_number_of_physical_extents_for_this_volume_group , $allocated_number_of_physical_extents_for_this_volume_group , $free_number_of_physical_extents_for_this_volume_group , $uuid_of_volume_group ) = split( /:/ , `vgdisplay -c $volume_group_name` );

print "physical extent size                                 $physical_extent_size\n";

# rebuild.info records, in order, which stored block belongs to which extent.
open my $rebuildinfo , '>' , 'rebuild.info' or die "Cannot create file: $!\n";

for ( my $count = 0; $count < $current_logical_extents_associated_to_logical_volume; $count++ ) {
        my $part = sprintf "%08i" , $count;
        # Copy one extent from the LV into a temporary part file.
        system( "dd if=$lv of=part.$part bs=${physical_extent_size}k skip=$count count=1" );
        # Name the block after its md5 and sha512 checksums.
        my $md5sum = `md5sum -b part.$part`;
        chomp $md5sum;
        $md5sum =~ s/^([0-9a-f]{32}).*$/$1/;
        my $sha512sum = `sha512sum -b part.$part`;
        chomp $sha512sum;
        $sha512sum =~ s/^([0-9a-f]{128}).*$/$1/;
        my $filename = "$md5sum-$sha512sum";
        print $rebuildinfo "part.$part $filename\n";
        if ( -e "${filename}.bz2" ) {
                # A block with the same checksums is already stored.
                print "Duplicate!\n";
                unlink( "part.$part" );
        } else {
                # New block: keep it under its checksum name and compress it.
                rename( "part.$part" , $filename );
                system( "bzip2 --best $filename" );
        }
}

close $rebuildinfo;




My webserver runs in a logical volume. The largest gain will come once more images of the server are stored, e.g. on a monthly basis.

Initial size:                      20GB
Straightforward bzip2 -9 archive:  6.4GB
Initial result with this script:   6.5GB
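
Since a full restore was never tried, here is an untested-in-production sketch of what the reverse path might look like. It assumes the rebuild.info format written by the script (one "part.NNNNNNNN checksum-name" pair per line, in extent order) and that the .bz2 blocks live in a single directory. The helper name restore_image and its arguments are hypothetical.

```perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Hypothetical helper: rebuild an image from the block store in $dir,
# following the order recorded in $info_file, and write it to $target.
# Returns the number of blocks written.
sub restore_image {
    my ( $info_file, $dir, $target ) = @_;

    open my $info, '<', $info_file or die "Cannot read $info_file: $!\n";
    open my $out, '>:raw', $target or die "Cannot write $target: $!\n";

    my $blocks = 0;
    while ( my $line = <$info> ) {
        chomp $line;
        my ( $part, $name ) = split ' ', $line;
        next unless defined $name;    # skip empty or malformed lines

        # Decompress the stored block and append it to the target.
        my $data;
        bunzip2( "$dir/$name.bz2" => \$data )
            or die "Cannot decompress $name.bz2 ($part): $Bunzip2Error\n";
        print {$out} $data;
        $blocks++;
    }

    close $out or die "Cannot finish $target: $!\n";
    close $info;
    return $blocks;
}
```

A call such as restore_image( 'rebuild.info', '.', '/dev/vg_diablo/vm_localserver' ) would write the blocks straight back to an LV of the same size. Writing to a scratch file first and comparing checksums seems the safer way to gain confidence.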