Nicolas LIMARE / pro / notes / 2013 / Integral Mail Archive with Git

Brief: It is possible to keep your complete mail history in git control. Complete history means that every single message that ever entered your mailbox will be tracked and preserved. It can be completely automated and does not take much disk space or CPU. It is a good solution to recover old mails deleted by mistake.

Almost one year ago I started storing my mail history in a git repository, as a real-life experiment to see how this setup could work. The idea was to keep the whole history of my mail communication under git control as a way to be able to recover messages deleted by mistake. Here is how it works:

  • mail is stored in a Maildir (one file per message) in ~/mail/, accessed with Mutt and updated with Offlineimap
  • the git dir is ~/mail.git/; the work dir is ~/mail/ (configured in ~/mail.git/config); git files will not mess with Mutt and Offlineimap
  • a simple backup script in ~/mail.git/hooks/backup:
      #! /bin/sh
      GIT_WORK_TREE=/home/user/mail
      GIT_DIR=/home/user/mail.git
      cd $GIT_DIR
      git add $GIT_WORK_TREE/*
      git commit -a -m "auto backup"
  • this script is called by Offlineimap as presync and postsync hooks
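
The layout above (git dir kept apart from the work dir) can be tried in a scratch directory before touching real mail. This is only a sketch under assumptions: the temporary paths, the mgit helper, the identity values and the sample message are all made up for illustration; only the roles of the two directories and the use of the repository config come from the setup described above.

```shell
#!/bin/sh
# Sketch: a git dir kept apart from its work dir, tried in a scratch location.
# All paths, the mgit helper and the sample message are hypothetical.
set -eu

scratch=$(mktemp -d)
work="$scratch/mail"          # stands in for ~/mail/
gitdir="$scratch/mail.git"    # stands in for ~/mail.git/

mkdir -p "$work/INBOX/cur"
git init -q --bare "$gitdir"
# Turn the bare repository into one that uses an external work dir.
git --git-dir="$gitdir" config core.bare false
git --git-dir="$gitdir" config core.worktree "$work"
# Identity for the automatic commits (placeholder values).
git --git-dir="$gitdir" config user.name "mail backup"
git --git-dir="$gitdir" config user.email "backup@localhost"

# Helper so that every command names both directories explicitly.
mgit() { git --git-dir="$gitdir" --work-tree="$work" "$@"; }

printf 'From: alice@example.org\nSubject: hello\n\nbody\n' > "$work/INBOX/cur/msg1"
mgit add -A
mgit commit -q -m "auto backup"

# The work dir holds only mail: there is no .git entry for Mutt or
# Offlineimap to trip on.
ls -A "$work"
```

Because nothing git-related lives inside the work dir, the mail tools see an ordinary Maildir tree.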

The idea is to record in git every new mail (committed from the INBOX by the postsync hook) and everything I do with this mail (move, delete, edit, committed by the presync hook). In addition to manual invocations, Offlineimap is run every 10 minutes by a cronjob.
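
To see what the two hooks end up recording, here is a small sketch run against a scratch repository. It is not the script I use: the backup function is a hypothetical variant that stages additions and deletions with git add -A and skips the commit when nothing changed, and the paths, the mgit helper and the sample message are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: what presync/postsync commits record, in a scratch repository.
# Paths, the mgit helper and the backup function are hypothetical.
set -eu

scratch=$(mktemp -d)
work="$scratch/mail"
gitdir="$scratch/mail.git"
mkdir -p "$work/INBOX/cur"
git init -q --bare "$gitdir"
git --git-dir="$gitdir" config core.bare false
git --git-dir="$gitdir" config user.name "mail backup"
git --git-dir="$gitdir" config user.email "backup@localhost"
mgit() { git --git-dir="$gitdir" --work-tree="$work" "$@"; }

# Hook body: stage new, changed and deleted files, commit only if needed.
backup() {
    mgit add -A
    if [ -z "$(mgit status --porcelain)" ]; then
        return 0                  # nothing changed since the last run
    fi
    mgit commit -q -m "auto backup"
}

# postsync: a new message has just been fetched.
printf 'From: alice@example.org\nSubject: hello\n\nbody\n' > "$work/INBOX/cur/msg1"
backup

# presync: the message was deleted in the mail client since the last sync.
rm "$work/INBOX/cur/msg1"
backup

# An idle run (no mail activity) creates no commit.
backup

# One commit per change, with the files it added (A) or deleted (D).
mgit log --format='%h %s' --name-status
```

The deleted message is gone from the work dir but still reachable from the first commit, which is the whole point of the setup.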

After 10 months, the git repository has 16000 commits and weighs 2G, while my mail folder weighs 2.3G for 64000 messages (includes 6 years of mail archives). Since git recorded every piece of spam that reached me and the multitude of automated mails I read once and delete, it seems quite good at compressing and deduplicating the message files.
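
Figures like these can be collected with a few standard commands. The sketch below runs them on a scratch repository filled with made-up, near-identical messages, so the numbers it prints say nothing about a real mailbox; the paths, the mgit helper and the message contents are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: measuring commit count and repository size on a scratch repository.
# Paths, the mgit helper and the generated messages are hypothetical.
set -eu

scratch=$(mktemp -d)
work="$scratch/mail"
gitdir="$scratch/mail.git"
mkdir -p "$work/INBOX/cur"
git init -q --bare "$gitdir"
git --git-dir="$gitdir" config core.bare false
git --git-dir="$gitdir" config user.name "mail backup"
git --git-dir="$gitdir" config user.email "backup@localhost"
mgit() { git --git-dir="$gitdir" --work-tree="$work" "$@"; }

# Fifty similar messages, committed and then deleted again.
i=1
while [ "$i" -le 50 ]; do
    printf 'From: list@example.org\nSubject: digest %s\n\nsame body text\n' "$i" \
        > "$work/INBOX/cur/msg$i"
    i=$((i + 1))
done
mgit add -A
mgit commit -q -m "auto backup"
rm "$work"/INBOX/cur/msg*
mgit add -A
mgit commit -q -m "auto backup"

# Repack so that the size reflects delta compression, then report.
mgit gc -q
mgit rev-list --count HEAD      # number of commits
mgit count-objects -v           # object and pack statistics (sizes in KiB)
du -sk "$gitdir" "$work"        # disk usage of git dir and work dir, in KiB
```

On a real mailbox the same three reporting commands give the commit count and the sizes to compare.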

And once the mail folders have been accessed once and are fresh in the system memory (i.e. after the first run, or when Mutt is open), the git backup is quick, faster than the IMAP synchronization. So, for no big cost, I have the peace of mind of having more data under some sort of extensive control.

And it's useful. I was looking for a mail today and couldn't find it. I supposed it had been deleted by mistake and looked for a copy in my git history. First I cloned the git repo, because you don't want your exploration of the history to mess with the IMAP synchronizations. Then I just went back in time until I got the mail I wanted in my INBOX:

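while true; do
    # stop as soon as the message shows up in this revision's INBOX
    test -d INBOX && grep -R From.* INBOX/ && break;
    # otherwise step back one commit (stop if there is no parent) and print its date
    git checkout -q HEAD^ || break; git log -1 --format=format:"%ai" HEAD;
done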

And in a few seconds I went back one week and got the message I needed.
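
Walking back commit by commit is not the only way. As an alternative sketch, git's pickaxe search (git log -S) can point directly at the commit that deleted a file containing a given string, and git show can then restore the file from the parent commit. This is a swapped-in technique, not what I did above. The scratch repository, the mgit helper, the search string and the output file name are assumptions for illustration, and the sketch assumes that a single deleted file matches.

```shell
#!/bin/sh
# Sketch: recover a deleted message with a pickaxe search instead of a loop.
# Paths, the mgit helper, the needle and recovered.eml are hypothetical.
set -eu

scratch=$(mktemp -d)
work="$scratch/mail"
gitdir="$scratch/mail.git"
mkdir -p "$work/INBOX/cur"
git init -q --bare "$gitdir"
git --git-dir="$gitdir" config core.bare false
git --git-dir="$gitdir" config user.name "mail backup"
git --git-dir="$gitdir" config user.email "backup@localhost"
mgit() { git --git-dir="$gitdir" --work-tree="$work" "$@"; }

# Build a tiny history: a message arrives, then gets deleted.
printf 'From: alice@example.org\nSubject: hello\n\nbody\n' > "$work/INBOX/cur/msg1"
mgit add -A
mgit commit -q -m "auto backup"
rm "$work/INBOX/cur/msg1"
mgit add -A
mgit commit -q -m "auto backup"

# Explore in a clone so that the live repository is left alone.
git clone -q "$gitdir" "$scratch/history"
cd "$scratch/history"

needle='alice@example.org'      # placeholder search string

# Most recent commit that deleted a file containing the needle.
commit=$(git log -1 --diff-filter=D -S"$needle" --format=%H)
# Path of that file (assumes exactly one deleted file matches).
path=$(git diff-tree -r --no-commit-id --name-only --diff-filter=D -S"$needle" "$commit")
# Restore the content as it was just before the deletion.
git show "$commit^:$path" > "$scratch/recovered.eml"

echo "recovered $path from the parent of $commit"
```

On a long history this avoids checking out thousands of revisions, at the cost of needing a string that appears literally in the message file.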