Git offers many advantages over older source control systems and has become the de facto industry standard. However, it does not offer fine-grained access controls, instead assuming that developers should have access to the whole repository or none of it. In this post we look at how mirroring can be used to maintain a separate repository that provides public access to part of the repository, while ensuring the remainder stays private.
It should be said that the examples in this post are all Java/Maven related, but the same principles apply to other technologies too.
Historically Alfresco used SVN for source control and one of the main blockers to migration was that we wanted to continue to use SVN's path-based access controls. Now our code is in git, and for the most part we have side-stepped the issue of path-based access controls by splitting the old monolithic codebase into a multitude of smaller repositories.
However, for some repositories the public and private code is very tightly coupled, with changes to private code frequently needing a small change to public code too. In SVN this was simple because a single commit could contain changes to all affected directories. With separate Git repositories we now need a commit per repository as well as a library upgrade. This library upgrade can be avoided by working with SNAPSHOT releases, but SNAPSHOT dependencies on external projects cause traceability issues (particularly in CI systems). This is easy to see by considering two feature branches that each affect both the public and private repositories.
Example showing a private repository depending on a snapshot release from a public repository, and two feature branches getting into a race condition during the build.
Since it's not possible to predict whether public-SNAPSHOT.jar contains the code from Feature A or Feature B (or maybe even another branch like master), the build results are unpredictable, and even if the builds pass there is no confidence that they verified the right code. These builds are likely to fail frequently because the required changes in the public repository are not used when building the private repository.
It's possible to work around this issue by using version modifiers on feature branches of libraries and updating the private repository to use these custom versions. However this approach is not particularly practical in projects like Insight Engine and Search Services, where almost half of all recent feature branches have made changes to both projects.
If you're reading this post because you have the same problem with your own codebase then feel free to skip this section. I've included it here partially to document what approaches we tried, but mainly because I found it interesting to delve into some lesser used features of git.
The first solution we came up with was to mirror the changes by effectively copy/pasting the filtered source code to the target repository and then creating a commit with the correct metadata. This used a persistent clone of the target repository, which was created like this:
# Create an empty repository and point it at the community repository.
git init
git remote add origin [COMMUNITY_REPO]
# Ensure dangling references aren't garbage collected.
git config --local gc.auto 0
Once that is done, new commits can be added to the community repository with:
# Preserve the .git directory and delete everything else
mv targetRepo/.git temp/.git
rm -rf targetRepo
# Copy the filtered code across, but not the .git directory
cp -R sourceRepo targetRepo
rm -rf targetRepo/.git
mv temp/.git targetRepo/.git
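# Stage every addition, modification and deletion, then write the index out as a tree object.
cd targetRepo
git add .
git add -u
treeId=$(git write-tree)
# Create the actual commit.
git commit-tree -m [COMMIT_MESSAGE] [PARENTS_LIST] $treeId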
The interesting part of this is the write-tree and commit-tree commands. By using them (rather than simply committing) we can create commits with any number of parents in a consistent way. That is - we can use the same commands to create normal commits and to create merge commits.
Also noteworthy is setting gc.auto to 0, which disables automatic garbage collection and so ensures that git doesn't delete anything while we're working.
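To illustrate how the same two plumbing commands cover both cases, here is a minimal sketch that builds a root commit, a normal commit and a merge commit in a throwaway repository. The file name Lib.java, the identity settings, the refs/heads/mirror branch and the final update-ref step are assumptions made for the example, not part of our mirroring script.

```shell
set -e
# Throwaway repository standing in for the persistent clone (hypothetical path).
work=$(mktemp -d)
git init -q "$work/target"
cd "$work/target"
git config user.name "Mirror Bot"
git config user.email "mirror@example.invalid"
git config --local gc.auto 0

# Root commit: stage the files, write the tree, commit with no parents.
echo "v1" > Lib.java
git add .
git add -u
tree1=$(git write-tree)
root=$(git commit-tree -m "Initial public code" "$tree1")

# Normal commit: exactly one -p option.
echo "v2" > Lib.java
git add .
git add -u
tree2=$(git write-tree)
normal=$(git commit-tree -m "Update library" -p "$root" "$tree2")

# A second child of the root, standing in for work on a side branch.
side=$(git commit-tree -m "Side branch work" -p "$root" "$tree1")

# Merge commit: the same command, just with two -p options.
merge=$(git commit-tree -m "Merge side branch" -p "$normal" -p "$side" "$tree2")

# commit-tree does not move any branch, so point one at the result
# (assumption: a real script may record the new commit differently).
git update-ref refs/heads/mirror "$merge"
git --no-pager log --oneline --graph refs/heads/mirror
```

In this sketch [PARENTS_LIST] is simply zero, one or two "-p <commit>" options, which is why a single code path can handle every kind of commit.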
Although this produces the repository that we want, it has a downside of being long and a little complex. It's been used in the RM project to mirror code for many years without an issue, but shuffling the .git directories around and copying the code is not a neat solution.
It turns out that git already provides a filtering mechanism in the form of git filter-branch. This can be used to rewrite the history of a branch and update every commit within it. In our case we use it to delete the private code from the branch history before pushing the result to the community mirror.
The --prune-empty option ensures that commits which only touch private code do not pollute the community codebase. The --index-filter option makes the filtering much faster, because it rewrites the index directly instead of checking out every commit.
The final command we get looks like:
git filter-branch -f --prune-empty --index-filter 'git rm -r --cached --ignore-unmatch [DIRECTORY_TO_EXCLUDE]'
We run this command using our CI every time a change is merged to the mirrored branch of the source repository. The filter-branch command produces reproducible output - i.e. it will always produce the same commit ids for older code. As a consequence we are repeating a lot of filtering, but this is very quick compared to the duration of our tests. If there was an issue with the timings then the community mirror could be run in parallel with the tests to reduce the elapsed time.
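A minimal sketch of such a CI step, run against throwaway local repositories, might look like the following. The directory names (public, private), the main branch, the local mirror.git repository and the filter_and_push helper are all hypothetical. Running the filter twice on separate clones shows the reproducible commit ids described above.

```shell
set -e
# Skip the warning (and pause) that recent git versions show for filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
# Hypothetical source repository with one public and one private directory.
git init -q "$work/source"
cd "$work/source"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
mkdir public private
echo "public v1" > public/Lib.java
echo "secret v1" > private/Secret.java
git add .
git commit -q -m "Add public and private code"
echo "secret v2" > private/Secret.java
git commit -q -a -m "Private-only change"
echo "public v2" > public/Lib.java
git commit -q -a -m "Public-only change"

# Empty bare repository standing in for the community mirror.
git init -q --bare "$work/mirror.git"

# What a CI job might do: fresh clone, filter, push. Prints the new HEAD id.
filter_and_push() {
    git clone -q "$work/source" "$1"
    (
        cd "$1"
        git filter-branch -f --prune-empty --index-filter \
            'git rm -r -q --cached --ignore-unmatch private' >/dev/null
        git push -q "$work/mirror.git" HEAD:refs/heads/main
        git rev-parse HEAD
    )
}

first_run=$(filter_and_push "$work/ci-1")
second_run=$(filter_and_push "$work/ci-2")

# The private-only commit is pruned, so two of the three commits remain.
git -C "$work/ci-1" rev-list --count HEAD
# Only the public file is left in the filtered tree.
git -C "$work/ci-1" ls-tree -r --name-only HEAD
```

Because the second run produces the same commit id as the first, the second push has nothing new to send and no force push is needed.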
Git Submodule and Git Subtree
Those of you more familiar with git might be shouting at the monitor "Why aren't you using submodules?", or if you're more familiar still then "Why aren't you using subtrees?" Submodule and subtree are git methods of including one repository inside another.
When we discussed mirroring options as a team we ruled out using submodules fairly quickly because they come with some serious issues. Firstly we had existing codebases, and so we only wanted to include the public library code in some branches. This is possible with submodules, but when checking out an older branch git will not remove the submodule for you. This can easily lead to a situation where you accidentally commit newer code to a legacy branch. Secondly submodules require you to issue an extra update command every time you change branch. If you forget to do this then your project will build with the wrong version of the library - even though the commit you have checked out already records which version of the library to use.
We spent more time looking into using git subtree and overall it seemed far more appropriate for our use case. Rather than including a reference to the library code, subtree includes the entire library codebase in the private repository. This might sound wasteful, but our source code is not really very large, and since git compresses everything anyway the overhead is negligible. Having all the source code present in every commit makes searching through the history far easier too.
Using git subtree to include one project inside another is entirely possible. Here is an example of how "git subtree pull" might be used to include an existing public library in the codebase for a private repository (solving the SNAPSHOT dependency issue).
Using git subtree to include an existing public library in a private repository. With this workflow then all changes to public code must be made in the public repository.
Git subtree also has a push option which allows for a mirroring workflow. If there are pull requests made directly in the public repository then they can be incorporated in the private repository by pulling as before. This would look like:
A mirror workflow using git subtree. Changes can be pulled from the library to the private repository, and pushed from the private repository back.
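The two subtree workflows can be sketched with throwaway local repositories as follows. The repository paths, the public-lib prefix, the main branch and the file names are all hypothetical, and the sketch assumes that git subtree (which ships as a contrib command) is installed alongside git.

```shell
set -e
work=$(mktemp -d)

# Bare repository standing in for the public library, seeded with one commit.
git init -q --bare "$work/public.git"
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
echo "public v1" > Lib.java
git add .
git commit -q -m "Initial library code"
git push -q "$work/public.git" main

# Private repository with some code of its own.
git init -q "$work/private"
cd "$work/private"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
echo "secret v1" > Secret.java
git add .
git commit -q -m "Initial private code"

# Include the public library, with its history, under public-lib/.
git subtree add --prefix=public-lib "$work/public.git" main

# A change lands directly in the public repository...
cd "$work/seed"
echo "public v2" > Lib.java
git commit -q -a -m "Public fix"
git push -q "$work/public.git" main

# ...and is pulled into the private repository.
cd "$work/private"
git subtree pull --prefix=public-lib -m "Merge public library" "$work/public.git" main

# Mirror workflow: one private commit touches both parts, and only the
# public-lib/ part of the history is pushed back to the public repository.
echo "public v3" > public-lib/Lib.java
echo "secret v2" > Secret.java
git add .
git commit -q -m "Change public and private code together"
git subtree push --prefix=public-lib "$work/public.git" main
```

After the final push the public repository contains Lib.java at its root with the new content, while Secret.java never leaves the private repository.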
The second of these approaches is incredibly close to the filter-branch mirroring we're currently using. However, git subtree (and submodule) has the limitation that you can only include repositories as whole directories. It's not possible to mirror/include a single file in the root directory of your project. Unfortunately we wanted our private repository to be structured as a multi-module Maven project, and to share a root pom file across the two projects. This requires a public root pom file and so is not possible with either approach.
There are other third-party tools too that you might be shouting about - for example git subrepo and git slave. We looked into subrepo as an option, but the main issue was needing to install it on our CI system. This would make the builds less portable, which is something we've tried to avoid.
Future investigation - git filter-repo
Assuming we can work around the issue of having a third-party tool available to our CI, filter-repo looks like a very interesting replacement for filter-branch. It provides some features that we could definitely make use of:
- Specify paths to include (in our case quite a small list), rather than paths to exclude.
- Specify paths to files as well as directories (not possible with submodule/subtree).
- Automatically remove old cruft and repack the repository for the user after filtering, potentially resulting in a smaller output project.
- Avoid confusing users (and prevent accidental re-pushing of old stuff) due to mixing old repo and rewritten repo together.
- Provide the user the ability to extend the tool.
- Provide a way for users to use old commit IDs with the new repository using refs/replace/ references.
- Rewrite any commit messages that refer to commit ids.
- Automatically create new root commits if necessary.
On top of these, filter-repo promises to be faster than filter-branch.
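As a rough idea of what the include-list approach could look like, here is an untested sketch of a helper built on filter-repo's --path option. The function name, its arguments and the public/ and pom.xml paths are hypothetical, and it assumes git-filter-repo is installed and on the PATH. We have not run this in our CI.

```shell
# Hypothetical helper: mirror only the listed paths from a source repository.
# Usage: mirror_with_filter_repo <source-url> <mirror-url> <branch>
mirror_with_filter_repo() {
    source_url=$1
    mirror_url=$2
    branch=$3
    clone_dir=$(mktemp -d)

    # filter-repo expects a fresh clone; --no-local keeps clones of local
    # paths from being treated as not fresh.
    git clone -q --no-local --branch "$branch" "$source_url" "$clone_dir/repo" || return 1
    (
        cd "$clone_dir/repo" || exit 1
        # Keep one directory and one root-level file (both paths are assumptions).
        git filter-repo --path public/ --path pom.xml || exit 1
        # filter-repo removes the origin remote, so push to the mirror by URL.
        git push "$mirror_url" "HEAD:refs/heads/$branch"
    )
}
```

It could then be called as mirror_with_filter_repo [SOURCE_REPO] [COMMUNITY_REPO] master. Listing a file such as pom.xml next to a directory is exactly the case that submodule and subtree cannot handle.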
We have looked at several ways to maintain a codebase that consists of public and private parts. The simplest of these is to treat the public library and private project as two separate repositories, but this requires a large amount of switching between projects and runs into issues when working with SNAPSHOTs and feature branches.
We then discussed an automated approach using some plumbing commands to mirror the appropriate commits. We looked at how this can be greatly simplified by using the git filter-branch command.
We compared our approach with git submodule and git subtree, and saw how these ran into issues when using a multi-module Maven project.
Finally we looked at potential improvements that might be had by using git filter-repo.