Managing large binary files with git

0 votes
asked Feb 12, 2009 by pi

I am looking for opinions on how to handle large binary files on which my source code (web application) is dependent. We are currently discussing several alternatives:

  1. Copy the binary files by hand.
    • Pro: Not sure.
    • Contra: I am strongly against this, as it increases the likelihood of errors when setting up a new site or migrating the old one, and it adds another hurdle to clear.
  2. Manage them all with git.
    • Pro: Removes the possibility of 'forgetting' to copy an important file.
    • Contra: Bloats the repository, decreases flexibility in managing the code-base, and makes checkouts/clones/etc. take quite a while.
  3. Separate repositories.
    • Pro: Checking out/cloning the source code is fast as ever, and the images are properly archived in their own repository.
    • Contra: Removes the simplicity of having one and only one git repository for the project. It surely introduces some other issues I haven't thought about.

What are your experiences/thoughts regarding this?

Also: Does anybody have experience with multiple git repositories and managing them in one project?

Update: The files are images for a program which generates PDFs that include them. The files will not change very often (as in years), but they are very relevant to the program: it will not work without them.

11 Answers

0 votes
answered Feb 12, 2009 by claf

In my opinion, if you're likely to modify those large files often, or if you intend to do a lot of git clones or git checkouts, then you should seriously consider using another git repository (or maybe another way to access those files).

But if you work like we do, and if your binary files are not modified often, then the first clone/checkout will be long, but after that it should be as fast as you want (as long as your users keep using the first cloned repo they got).

0 votes
answered Feb 12, 2009 by pat-notz

If the program won't work without the files it seems like splitting them into a separate repo is a bad idea. We have large test suites that we break into a separate repo but those are truly "auxiliary" files.

However, you may be able to manage the files in a separate repo and then use git-submodule to pull them into your project in a sane way. So, you'd still have the full history of all your source but, as I understand it, you'd only have the one relevant revision of your images submodule. The git-submodule facility should help you keep the correct version of the code in line with the correct version of the images.

Here's a good introduction to submodules from the Git Book.
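
If you go the submodule route, a minimal sketch (the repository URL and the assets/images path are made up) could look like this:

git submodule add https://example.com/project-images.git assets/images   # register the images repo as a submodule
git commit -m "Track the images repository as a submodule"

# a fresh clone then needs one extra step to pull in the submodule contents:
git clone https://example.com/project.git
cd project
git submodule update --init

The superproject only records which commit of the images repository it expects, so checking out an old revision of the code also gives you the matching revision of the images.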

0 votes
answered Feb 12, 2009 by daniel-fanjul

I would use submodules (as Pat Notz suggests) or two distinct repositories. If you modify your binary files too often, I would try to minimize the impact of the huge repository by cleaning its history:

I had a very similar problem several months ago: ~21Gb of mp3's, unclassified (bad names, bad id3's, don't know if I like that mp3 or not...), and replicated on three computers.

I used an external hard disk with the main git repo and I cloned it onto each computer. Then, I started to classify them in the usual way (pushing, pulling, merging... deleting and renaming many times).

At the end, I had only ~6Gb of mp3's and ~83Gb in the .git dir. I used git-write-tree and git-commit-tree to create a new commit, without commit ancestors, and started a new branch pointing to that commit. The "git log" for that branch only showed one commit.

Then, I deleted the old branch, kept only the new branch, deleted the ref-logs, and ran "git prune": after that, my .git folders weighed only ~6Gb...

You could "purge" the huge repository from time to time in the same way, and your "git clone"s will be faster.
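
In concrete terms, that purge could look roughly like this (branch names are examples, and since it rewrites history you should have a backup first):

tree=$(git write-tree)                              # snapshot the current index as a tree object
commit=$(git commit-tree -m "fresh start" "$tree")  # create a parentless commit for that tree
git branch fresh-start "$commit"                    # new branch whose log shows a single commit
git checkout fresh-start
git branch -D master                                # drop the old, history-heavy branch (example name)
git branch -m master                                # rename the new branch to take its place
git reflog expire --expire=now --all                # drop reflog entries that still reference the old commits
git gc --prune=now                                  # repack and prune the now-unreachable history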

0 votes
answered Feb 3, 2010 by tony-diep

SVN seems to handle binary deltas more efficiently than git

I had to decide on a versioning system for documentation (JPGs, PDFs, ODTs). I just tested adding a JPEG and rotating it 90 degrees four times (to check the effectiveness of binary deltas). Git's repository grew by 400%; SVN's repository grew by only 11%.

So it looks like SVN is much more efficient with binary files.

So my choice is git for source code and SVN for binary files like documentation.

0 votes
answered Feb 9, 2011 by rafak

I discovered git-annex recently, which I find awesome. It was designed for managing large files efficiently. I use it for my photo/music (etc.) collections. The development of git-annex is very active. The content of the files can be removed from the git repo; only the tree hierarchy is tracked by git (through symlinks). However, to get the content of a file, a second step is necessary after pulling/pushing, e.g.:

$ git annex add mybigfile
$ git commit -m'add mybigfile'
$ git push myremote 
$ git annex copy --to myremote mybigfile ## this command copies the actual content to myremote 
$ git annex drop mybigfile ## remove content from local repo
...
$ git annex get mybigfile ## retrieve the content
## or to specify the remote from which to get:
$ git annex copy --from myremote mybigfile

There are many commands available, and there is great documentation on the website. A package is available on Debian.

0 votes
answered Feb 21, 2011 by sehe

Have a look at git bup, which is a git extension for smartly storing large binaries in a git repo.

You'd want to have it as a submodule, but you won't have to worry about the repo getting hard to handle. One of their sample use cases is storing VM images in git.
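
For context, a bare-bones bup session (the path is made up) looks roughly like this:

bup init                                 # create the bup repository (defaults to ~/.bup)
bup index /data/vm-images                # index the files you want to store
bup save -n vm-images /data/vm-images    # save a deduplicated snapshot under the name "vm-images"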

I haven't actually seen better compression rates but my repos don't have really large binaries in them.

YMMV

0 votes
answered Feb 26, 2013 by carl

You can also use git-fat. What I like about it is that it only depends on stock Python and rsync. It also supports the usual git workflow, with the following self-explanatory commands:

git fat init
git fat push
git fat pull

In addition, you need to check a .gitfat file into your repo and modify your .gitattributes to specify the file extensions you want git fat to manage.
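
For illustration, the two files could look something like this (the rsync host, path, and file patterns are invented; check the git-fat README for the exact format):

# .gitfat
[rsync]
remote = storage.example.com:/srv/git-fat-store

# .gitattributes
*.png filter=fat -crlf
*.zip filter=fat -crlf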

You add a binary using the normal git add, which in turn invokes git fat based on your .gitattributes rules.

Finally, it has the advantage that the location where your binaries are actually stored can be shared across repositories and users and supports anything rsync does.

UPDATE: Do not use git-fat if you're using a git-svn bridge. It will end up removing the binary files from your subversion repository. However, if you're using a pure git repository, it works beautifully.

0 votes
answered Feb 3, 2014 by hernan

Have you looked at camlistore? It is not really git-based, but I find it more appropriate for what you have to do.

0 votes
answered Feb 9, 2015 by vonc

Another solution, since April 2015, is Git Large File Storage (LFS) (by GitHub).

It uses git-lfs (see git-lfs.github.com) and can be tested with a server supporting it, lfs-test-server: you store only metadata in the git repo, and the large files elsewhere.

https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif
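
A typical git-lfs session (the *.psd pattern and file name are just examples) looks like this:

git lfs install                 # set up the LFS filters, once per machine
git lfs track "*.psd"           # tell LFS which files to manage (writes .gitattributes)
git add .gitattributes
git add design.psd
git commit -m "Add design file via LFS"
git push origin master          # pushes the pointer to git and the content to the LFS server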

0 votes
answered Feb 13, 2015 by adam-kurkiewicz

The solution I'd like to propose is based on orphan branches and a slight abuse of the tag mechanism, henceforth referred to as Orphan Tags Binary Storage (OTABS).

TL;DR 12-01-2017 If you can use github's LFS or some other 3rd party, by all means you should. If you can't, then read on. Be warned, this solution is a hack and should be treated as such.

Desirable properties of OTABS

  • it is a pure git and git only solution -- it gets the job done without any 3rd party software (like git-annex) or 3rd party infrastructure (like github's LFS).
  • it stores the binary files efficiently, i.e. it doesn't bloat the history of your repository.
  • git pull and git fetch, including git fetch --all are still bandwidth efficient, i.e. not all large binaries are pulled from the remote by default.
  • it works on Windows.
  • it stores everything in a single git repository.
  • it allows for deletion of outdated binaries (unlike bup).

Undesirable properties of OTABS

  • it makes git clone potentially inefficient (but not necessarily, depending on your usage). If you deploy this solution you might have to advise your colleagues to use git clone -b master --single-branch <url> instead of git clone. This is because git clone by default literally clones the entire repository, including things you wouldn't normally want to waste your bandwidth on, like unreferenced commits. Taken from SO 4811434.
  • it makes git fetch <remote> --tags bandwidth inefficient, but not necessarily storage inefficient. You can always advise your colleagues not to use it.
  • you'll have to periodically use a git gc trick to clean your repository from any files you don't want any more.
  • it is not as efficient as bup or git-bigfiles. But it is more suitable for what you're trying to do and more off-the-shelf. You are likely to run into trouble with hundreds of thousands of small files or with files in the range of gigabytes, but read on for workarounds.

Adding the Binary Files

Before you start, make sure that you've committed all your changes, your working tree is up to date, and your index doesn't contain any uncommitted changes. It might be a good idea to push all your local branches to your remote (github etc.) in case any disaster should happen.

  1. Create a new orphan branch. git checkout --orphan binaryStuff will do the trick. This produces a branch that is entirely disconnected from any other branch, and the first commit you'll make in this branch will have no parent, which will make it a root commit.
  2. Clean your index using git rm --cached * .gitignore.
  3. Take a deep breath and delete the entire working tree using rm -fr * .gitignore. The internal .git directory will stay untouched, because the * wildcard doesn't match it.
  4. Copy in your VeryBigBinary.exe, or your VeryHeavyDirectory/.
  5. Add it && commit it.
  6. Now it becomes tricky -- if you push it into the remote as a branch, all your developers will download it the next time they invoke git fetch, clogging their connection. You can avoid this by pushing a tag instead of a branch. This can still impact your colleagues' bandwidth and filesystem storage if they have a habit of typing git fetch <remote> --tags, but read on for a workaround. Go ahead and git tag 1.0.0bin
  7. Push your orphan tag git push <remote> 1.0.0bin.
  8. Just so you never push your binary branch by accident, you can delete it with git branch -D binaryStuff. Your commit will not be marked for garbage collection, because the orphan tag 1.0.0bin pointing at it is enough to keep it alive. (The whole procedure is consolidated in the sketch below.)
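
Put together, the steps above boil down to roughly this (file and tag names are the examples used above; the source path for the binary is hypothetical):

git checkout --orphan binaryStuff              # 1. parentless branch
git rm --cached -r * .gitignore                # 2. clean the index (-r in case the tree contains directories)
rm -fr * .gitignore                            # 3. clean the working tree; .git is untouched
cp /somewhere/VeryBigBinary.exe .              # 4. bring in the binary
git add VeryBigBinary.exe                      # 5. add && commit
git commit -m "Add VeryBigBinary.exe"
git tag 1.0.0bin                               # 6. tag instead of pushing the branch
git push <remote> 1.0.0bin                     # 7. push only the tag
git checkout master
git branch -D binaryStuff                      # 8. the tag keeps the commit alive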

Checking out the Binary File

  1. How do I (or my colleagues) get the VeryBigBinary.exe checked out into the current working tree? If your current working branch is for example master you can simply git checkout 1.0.0bin -- VeryBigBinary.exe.
  2. This will fail if you don't have the orphan tag 1.0.0bin downloaded, in which case you'll have to git fetch <remote> 1.0.0bin beforehand.
  3. You can add VeryBigBinary.exe to your master's .gitignore, so that no one on your team will pollute the main history of the project with the binary by accident. (A short consolidated sketch follows this list.)
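
In command form (assuming the example names above), that is:

git fetch <remote> tag 1.0.0bin               # only needed if the tag isn't available locally yet
git checkout 1.0.0bin -- VeryBigBinary.exe    # copy the binary into the current working tree
echo VeryBigBinary.exe >> .gitignore          # keep the binary out of master's history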

Completely Deleting the Binary File

If you decide to completely purge VeryBigBinary.exe from your local repository, your remote repository and your colleagues' repositories, you can just do the following (the local steps are consolidated in a sketch after the list):

  1. Delete the orphan tag on the remote git push <remote> :refs/tags/1.0.0bin
  2. Delete the orphan tag locally, which also drops any other tags that no longer exist on the remote: git tag -l | xargs git tag -d && git fetch --tags. Taken from SO 1841341 with slight modification.
  3. Use a git gc trick to delete your now unreferenced commit locally. git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc "$@". It will also delete all other unreferenced commits. Taken from SO 1904860
  4. If possible, repeat the git gc trick on the remote. It is possible if you're self-hosting your repository, and might not be possible with some git providers, like github, or in some corporate environments. If you're hosting with a provider that doesn't give you ssh access to the remote, just let it be. It is possible that your provider's infrastructure will clean your unreferenced commit in their own sweet time. If you're in a corporate environment you can advise your IT department to run a cron job garbage-collecting your remote once per week or so. Whether they do or don't will not have any impact on your team in terms of bandwidth and storage, as long as you advise your colleagues to always git clone -b master --single-branch <url> instead of git clone.
  5. All your colleagues who want to get rid of outdated orphan tags need only to apply steps 2-3.
  6. You can then repeat steps 1-8 of Adding the Binary Files to create a new orphan tag 2.0.0bin. If you're worried about your colleagues typing git fetch <remote> --tags, you can actually name it 1.0.0bin again. This will make sure that the next time they fetch all the tags, the old 1.0.0bin will be unreferenced and marked for subsequent garbage collection (using step 3). When you overwrite a tag on the remote you have to use -f, like this: git push -f <remote> <tagname>
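
For reference, the local part of the purge (steps 1-3 above) boils down to:

git push <remote> :refs/tags/1.0.0bin              # 1. delete the tag on the remote
git tag -l | xargs git tag -d && git fetch --tags  # 2. delete local tags, then restore the remote's remaining tags
git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 \
    -c gc.rerereresolved=0 -c gc.rerereunresolved=0 \
    -c gc.pruneExpire=now gc                       # 3. garbage collect the now-unreferenced commit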

Afterword

  • OTABS doesn't touch your master or any other source code/development branches. The commit hashes, all of the history, and the small size of these branches are unaffected. If you've already bloated your source code history with binary files, you'll have to clean it up as a separate piece of work.
