Reinventing the wheel: Automatic Bidirectional Directory Synchronization

21. March 2011 - 19:11 — Tilman

As already mentioned, I am the proud owner of an online hard drive and I love it. Enough space for everything that's worth keeping, secure access over lots of different protocols (including rsync), an automatic backup function... It would be perfect if I had an Internet connection of at least 100 Mbit/s. But I don't. (Actually, the thing almost feels like a local disk when I use it from my computer at work. But that's through a university network.) So I had to come up with a slightly more sophisticated architecture: I bought a GuruPlug server and attached an external USB hard drive to it. The basic idea is to use the USB drive over the LAN while at home and still have all data synchronized to the online hard drive, which gives me automatic backup and fast access from everywhere else. I thought the synchronization part would be the easiest, as there is a plethora of software available for this task, but when I took a closer look at the tools, I ran into problems:

rsync, the best-known synchronization software, only syncs in one direction. (So it's a mirroring tool, not a synchronization tool, to be correct.) But I want to be able to change files at home on the local drive and on the go on the online drive and have them automatically synchronized overnight.
Unison does bidirectional synchronization, also using the heavily traffic-optimized rsync protocol. Sounds perfect. But Unison does not really work on locally mounted remote drives, as it apparently calculates a hash sum for each and every file, which results in downloading the entire online hard drive. (Currently 135 GB in my case.) Usually one would run a Unison server for that, calculating the hashes locally on both sides, but I do not have shell access to the server: it's only a hard drive.
JFileSync, my personal favorite, does bidirectional synchronization and can run from the shell. But when I tried to use it for my "135 GB, 70000 files" directory, it blew up my server's memory, because as far as I can tell it does a complete comparison in the first step and puts the pending actions for all files in an XML structure. (I have to add that the directories were almost in sync at that time. So it's not only because of the huge number of actions for the first synchronization.)

Is it really that hard to keep two directories synchronized when the files are compared only by their size and their timestamp? I ended up writing my own SyncTool in Java. And while I was at it, I added the option to receive the synchronization log via Jabber.
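To make the "size and timestamp only" idea concrete, here is a minimal sketch of such a comparison. It is not taken from the SyncTool sources; the class name, the method name and the two-second tolerance are assumptions for illustration (the tolerance only accounts for file systems with coarse timestamp resolution, such as FAT).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of SyncTool.
final class QuickCompare {
    // Assumed tolerance for file systems that store coarse timestamps.
    static final long TOLERANCE_MILLIS = 2000;

    /** Returns true if both files have the same size and (nearly) the same modification time. */
    static boolean looksIdentical(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false; // different size: contents must differ
        }
        long ta = Files.getLastModifiedTime(a).toMillis();
        long tb = Files.getLastModifiedTime(b).toMillis();
        return Math.abs(ta - tb) <= TOLERANCE_MILLIS;
    }
}
```

Only metadata is read here, so nothing has to be downloaded from a mounted remote drive. The price is that two files with equal size and timestamp but different contents would go unnoticed.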

Libraries used

I am standing on the shoulders of giants here, I have to say. The tool uses H2 to keep track of the files, Apache Commons and log4j for the actual copy procedure and log output, JSAP to parse the command line and Smack to connect to Jabber. So all I did was implement the actual synchronization algorithm: Read the file list, compare and synchronize, recurse. Done using Eclipse and Maven.
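The "read the file list, compare, recurse" loop can be sketched roughly as follows. This is an illustration under assumptions, not the SyncTool code: all names are made up, the sketch only collects a list of planned actions instead of copying anything, and it keeps no record of earlier runs, so it cannot tell a deleted file from a new one.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not part of SyncTool.
final class TreeWalkSketch {
    /** Returns planned actions; "->" means left to right, "<-" right to left. */
    static List<String> plan(File left, File right) {
        List<String> actions = new ArrayList<>();
        walk(left, right, "", actions);
        return actions;
    }

    private static void walk(File left, File right, String rel, List<String> actions) {
        // Step 1: read the file list of both sides (sorted union of names).
        TreeSet<String> names = new TreeSet<>();
        addNames(left, names);
        addNames(right, names);

        for (String name : names) {
            File l = new File(left, name);
            File r = new File(right, name);
            String path = rel + name;
            // Step 2: compare and decide what to do with this entry.
            if (l.isDirectory() && r.isDirectory()) {
                walk(l, r, path + "/", actions); // Step 3: recurse
            } else if (!r.exists()) {
                actions.add("copy -> " + path);
            } else if (!l.exists()) {
                actions.add("copy <- " + path);
            } else if (l.length() != r.length() || l.lastModified() != r.lastModified()) {
                actions.add("differs " + path); // which side wins is left open here
            }
        }
    }

    private static void addNames(File dir, TreeSet<String> names) {
        String[] list = dir.list(); // null if the directory is missing or unreadable
        if (list != null) {
            for (String n : list) {
                names.add(n);
            }
        }
    }
}
```

Because each directory is handled on its own and then forgotten, memory use depends on the size of a single directory listing rather than on the whole tree.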

Where to get it

If you want to do something similar and don't have the time to reinvent the wheel once again, you can download the SyncTool from my homepage. Available as open source under the Apache License 2.0.

How to use it

The software is intended to work in batch mode and thus performs bidirectional synchronization without conflict resolution of any kind: if a file was modified on both sides, the older version is discarded and overwritten by the newer one. (It should be no problem, however, to add a feature like --dont-overwrite-conflicting-files with a few lines of additional code.) That being said, the second thing to keep in mind is that you need a Java runtime (JRE). The OpenJDK 6 JRE works pretty well with the program on my server; just run apt-get install default-jre.
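The "newer one wins" rule, together with the not-yet-existing --dont-overwrite-conflicting-files idea, could look roughly like this. It is only a sketch: the enum, the method and the lastSync parameter are assumptions, and detecting "modified on both sides" this way presumes that the time of the previous run is stored somewhere.

```java
// Hypothetical sketch, not part of SyncTool.
final class ConflictRule {
    enum Action { NONE, COPY_LEFT_TO_RIGHT, COPY_RIGHT_TO_LEFT, SKIP_CONFLICT }

    /**
     * Decides what to do with one file that exists on both sides.
     * All timestamps are milliseconds since the epoch; lastSync is the
     * (assumed) time of the previous synchronization run.
     */
    static Action decide(long leftModified, long rightModified,
                         long lastSync, boolean dontOverwriteConflicting) {
        if (leftModified == rightModified) {
            return Action.NONE; // nothing changed, or both sides already agree
        }
        boolean bothChanged = leftModified > lastSync && rightModified > lastSync;
        if (bothChanged && dontOverwriteConflicting) {
            return Action.SKIP_CONFLICT; // leave both versions untouched
        }
        // Default batch behaviour: the newer file overwrites the older one.
        return leftModified > rightModified
                ? Action.COPY_LEFT_TO_RIGHT
                : Action.COPY_RIGHT_TO_LEFT;
    }
}
```

With the flag switched off, the older of two conflicting versions is lost without further notice, which is why a backup before the first run is a good idea.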

Remember to back up your files; this software comes with ABSOLUTELY NO WARRANTY.

This is the help page:

Usage: java -jar synctool-1.0.8.jar <source path> <destination path> [(-f|--dbfile) <database file>] [(-l|--logfile) <logfile>] [(-j|--jabber) <jabber address>] [(-r|--server) <jabber server>] [(-u|--user) <jabber user>] [(-p|--password) <jabber password>] [-d|--dry-run] [-h|--hashing] [-i|--ignore-directory-attributes] [-s|--silent] [--debug] [-?|--help]

  <source path>
        the source path

  <destination path>
        the destination path

  [(-f|--dbfile) <database file>]
        the path to the database file to use (default: synctool)

  [(-l|--logfile) <logfile>]
        the path for a logfile to write

  [(-j|--jabber) <jabber address>]
        send logging output as jabber message to the given address

  [(-r|--server) <jabber server>]
        the jabber server to connect to

  [(-u|--user) <jabber user>]
        the jabber user name used for logging in to the server

  [(-p|--password) <jabber password>]
        the jabber password used for logging in to the server

  [-d|--dry-run]
        perform a trial run with no changes made

  [-h|--hashing]
        generate MD5 file hashes for exact comparison

  [-i|--ignore-directory-attributes]
        do not copy attributes for directories

  [-s|--silent]
        do not print "Entering directory" and "No operation" messages

  [--debug]
        print debug messages

  [-?|--help]
        print help and exit
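For the --hashing option, an MD5 hash of a file can be computed with the standard Java library as sketched below. This is not the SyncTool implementation, and the class and method names are made up for illustration. Note that hashing has to read every byte of the file, so on a mounted remote drive the whole file is transferred, which is exactly the cost the size-and-timestamp comparison avoids.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not part of SyncTool.
final class Md5Sketch {
    /** Streams the file through MD5 and returns the digest as lowercase hex. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            // Read in chunks so large files never have to fit into memory.
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Two files are then treated as identical only if their hex strings are equal.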
