Three Non-trivial Use Cases for Git

Nicolas Limare, 2012

This note was kindly published by Greg Wilson on the Software Carpentry Blog; comments and answers can be posted there.

Here are three problems I encountered while introducing fellow grad students and post-docs to Git. These are situations where Git does not seem to provide the solution we need, because it requires an external service provider, because it requires unusual configuration steps that are not suited to beginners, or simply because the functionality is missing from Git. Or maybe I just overlooked the perfect command or option that would have answered all my wishes?

1. Simple Collaboration

Two or three people want to work together on a single repository, as they are used to doing with Subversion. No branches, no pull requests, no GitHub, just a shared bare repository on the lab server, cloned and synced onto everyone's own laptop via SSH. The problem is to give everyone proper read/write permissions. Basic UNIX groups are not practical, for example because they require going through the local sysadmin for every new project and defining a new group per project. In that case, my solution is to use POSIX extended ACLs:

setfacl -Rm u:USER:rwX ~/code/project.git
setfacl -Rm d:u:USER:rwX ~/code/project.git

It is fairly clean, standard, and works perfectly, but these ACLs are usually not enabled by default on filesystems, and sysadmins may not want to allow them. Moreover, non-admin users are not used to working with this kind of permission, and having to touch them in order to collaborate feels somewhat abnormal.
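The two setfacl commands can be wrapped in a small helper, run once per collaborator. This is only a sketch under assumptions: the function name grant_repo_access is made up for illustration, the filesystem must be mounted with ACL support, and the setfacl and getfacl tools must be installed on the server.

```shell
#!/bin/sh
# grant_repo_access USER REPO
# Hypothetical helper (not part of Git): give USER read/write access to the
# bare repository REPO through POSIX ACLs.
grant_repo_access() {
    acl_user=$1
    acl_repo=$2
    # Access ACL for everything that already exists. The capital X sets the
    # execute bit only on directories and on files that are already executable.
    setfacl -R -m "u:$acl_user:rwX" "$acl_repo" || return 1
    # Default ACL, inherited by files and directories created later.
    setfacl -R -m "d:u:$acl_user:rwX" "$acl_repo" || return 1
    # Print the resulting ACL of the top-level directory for a visual check.
    getfacl "$acl_repo"
}
```

A call would look like `grant_repo_access alice ~/code/project.git`, where alice is a placeholder login. Files created later by a collaborator belong to that collaborator, so it is probably worth adding an entry for the repository owner as well.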

2. Sharing Binary Data

There was a group of students working on a shared dataset. One person would produce the input data, and the others would process it in a chain, each with their own program. Everything is in a Git repository: everyone's program is stored as source code and built with Make, the data is processed in a chain with Make too, and so far everything is fine. But the input data is regularly updated, and it consists of a hundred 10-megapixel JPEG images. So the repository quickly becomes quite heavy on everyone's machine (JPEG is already compressed and doesn't shrink any further in Git, and binary diff of JPEG files is, or was, inefficient).

This huge weight is the problem. We want to keep the history and be able to go back to previous versions of the data, but we absolutely don't want everyone in the group to have 25 different versions of a 100 MB dataset on their own machine. The solution may be git-annex, but it didn't exist at the time and it is not part of the standard Git toolset. It is certainly also possible to squash the history with git rebase every time the input data is updated, but this git-fu is too complex for our simple needs. I want people to be able to configure their local repo so that it never keeps more than a configurable amount of data locally, something like a permanent shallow clone.
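The closest approximation I can sketch with standard commands is a shallow clone that is truncated again at every update. It limits the number of commits kept, not the amount of data, so it is not the configurable size limit wished for above. The function names shallow_clone and shallow_update are made up for illustration, the remote is assumed to be called origin, and the update step assumes the local copy is only used to read the data, because it throws away local commits and modifications.

```shell
#!/bin/sh
# shallow_clone URL DIR [DEPTH]
# Hypothetical helper: clone only the DEPTH most recent commits (default 1).
# For a repository on the same machine, URL must use the file:// form,
# otherwise Git ignores --depth.
shallow_clone() {
    git clone --depth "${3:-1}" "$1" "$2"
}

# shallow_update DIR [DEPTH]
# Hypothetical helper: bring a shallow clone up to date and cut its history
# back to DEPTH commits. WARNING: discards local commits and changes.
shallow_update() {
    (
        cd "$1" || exit 1
        depth=${2:-1}
        branch=$(git symbolic-ref --short HEAD) || exit 1
        # Fetch the newest commits and move the shallow boundary forward.
        git fetch --quiet --depth "$depth" origin || exit 1
        # Point the local branch and the working tree at the fetched tip.
        git reset --quiet --hard "origin/$branch" || exit 1
        # Forget reflog entries that still reference the old commits, then
        # delete the objects that are no longer reachable.
        git reflog expire --expire=now --all || exit 1
        git gc --quiet --prune=now
    )
}
```

Older versions of the data remain available on the server, and a deeper history can be fetched on demand with a larger depth.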

3. Mirroring

I was also trying to sell Git as a good solution to replicate files between two machines, either for website or web app deployment, or to maintain a workspace in different places. A sort of improved Rsync, with diff transfers and the full history.

But to do that with Git, pushing updates from one machine to another, the receiving Git repository needs to perform a checkout automatically. This step requires some hook configuration (non-trivial for beginners), and the checkout must be very carefully configured so that it is performed in the right place and behaves correctly when some files need to be deleted or local changes overwritten. This is too esoteric for a simple need.
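For reference, the hook configuration in question might look like the following sketch. It assumes that the receiving repository is bare and that the files are checked out into a separate directory given as an absolute path. The function name install_deploy_hook and the default branch name master are assumptions made for illustration.

```shell
#!/bin/sh
# install_deploy_hook REPO TARGET [BRANCH]
# Hypothetical helper: install a post-receive hook in the bare repository
# REPO that checks out BRANCH (default: master) into the directory TARGET.
# TARGET must be an absolute path without single quotes.
install_deploy_hook() {
    hook_repo=$1
    hook_target=$2
    hook_branch=${3:-master}
    mkdir -p "$hook_target" "$hook_repo/hooks" || return 1
    # The variables are expanded now, so the generated hook contains the
    # literal path and branch name.
    cat > "$hook_repo/hooks/post-receive" <<EOF
#!/bin/sh
# Run by Git inside the bare repository after every push.
# -f overwrites local modifications of tracked files in the target.
git --work-tree='$hook_target' checkout -f '$hook_branch'
EOF
    chmod +x "$hook_repo/hooks/post-receive"
}
```

Untracked files in the target directory are left alone, while `-f` silently discards local edits to tracked files. That is precisely the kind of behaviour that has to be chosen with care.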