33.35 Git Remove all Traces of a File

20240225

GitHub maintains archival copies of everything, so it takes some effort to remove something completely from a repository. You cannot guarantee removing all traces of a file you pushed to a git repository unless it is a private repository to which only you have access. For a public repository, someone may already have downloaded the files, or cloned or forked the repository. However, if you are quick enough you can limit the risk. The typical use case is a file containing secret information, such as a password, that was uploaded by accident. Even after removing the file from your git repository, change any exposed passwords.

To begin with, be sure to back up your local clone of the repository in case anything fails during the process.
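One way to take that backup is to copy the whole working tree, including the .git folder, to a dated sibling folder. This is a minimal sketch: backup_repo is a hypothetical helper name and the .backup-YYYYMMDD suffix is an assumption, not a git convention.

```shell
# Copy a repository folder (working tree plus .git) to a dated sibling folder.
# backup_repo is a hypothetical helper, not a git command.
backup_repo() {
  src="${1%/}"
  dest="${src}.backup-$(date +%Y%m%d)"
  # Refuse to overwrite an earlier backup made on the same day.
  if [ -e "$dest" ]; then
    echo "backup already exists: $dest" >&2
    return 1
  fi
  # -R copies recursively, -p preserves modes and timestamps.
  cp -Rp "$src" "$dest" && printf '%s\n' "$dest"
}
```

For example, backup_repo myrepo prints the name of the new backup folder on success.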

The current [20240225] suggested approach comes from the GitHub docs and uses the BFG Repo-Cleaner, a utility that offers considerable flexibility for removing files from a repository. For example, it can also remove folders, or files above a certain size.

To remove files and their history from a repository, follow the steps below. First delete any branches on origin that are no longer needed, because the rewritten history will not apply to PRs, already checked out branches, or forks. Merge any open PRs first for the same reason.
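Stale branches on origin can be listed and deleted from the command line. The sketch below assumes the remote is named origin; list_remote_branches and delete_remote_branch are hypothetical helper names.

```shell
# Print the name of every branch that exists on the remote called origin.
list_remote_branches() {
  git ls-remote --heads origin | sed 's#.*refs/heads/##'
}

# Delete a single branch on origin. The local branch, if any, is untouched.
delete_remote_branch() {
  git push origin --delete "$1"
}
```

For example, run list_remote_branches to review what is there, then delete_remote_branch old-feature for each branch that is no longer needed (old-feature is a made-up branch name).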

First delete the files (e.g., the file assets/secrets.csv and the folder secrets) from your repository.

cd myrepo
git rm assets/secrets.csv
git rm -r secrets/
git commit -am 'Cleanup'
git push
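Deleting and committing only removes the file from the latest snapshot. The earlier commits still carry its contents, which is why the history has to be rewritten. A quick way to see this is to list the commits that touched the path. In this sketch commits_touching is a hypothetical helper name.

```shell
# List every commit, on any branch or tag, that touched the given path.
# Works even when the path no longer exists in the working tree.
commits_touching() {
  git log --all --oneline -- "$1"
}
```

Running commits_touching assets/secrets.csv after the Cleanup commit still lists both the commit that added the file and the commit that deleted it.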

Now clone a bare mirror of the repository:

git clone --mirror https://github.com/myuser/myrepo.git

Then rewrite the history in the local mirror with:

cd myrepo.git
bfg --delete-files secrets.csv --delete-folders secrets

Both options match on the file or folder name rather than on the full path, and each accepts a wildcard pattern to match several names in a single call.
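Before handing a wildcard to BFG it can help to preview which paths in the history it would catch. The sketch below only approximates BFG's matching with a plain shell pattern applied to the file name. preview_matches is a hypothetical helper, BFG itself is not called, and BFG's brace syntax such as '*.{txt,log}' is not handled.

```shell
# Print every path recorded in history whose file name matches a shell pattern.
preview_matches() {
  pattern="$1"
  git log --all --pretty=format: --name-only |
    sort -u |
    while IFS= read -r path; do
      # Skip the blank separator lines that git log emits.
      [ -n "$path" ] || continue
      # Compare the file name only, ignoring the folder part of the path.
      case "${path##*/}" in
        $pattern) printf '%s\n' "$path" ;;
      esac
    done
}
```

For example, preview_matches '*.csv' lists paths such as assets/secrets.csv. Quote the pattern so that the shell passes it through unexpanded.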

Expire the reflog and garbage collect so that the data no longer referenced is physically removed:

git reflog expire --expire=now --all
git gc --prune=now --aggressive
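To confirm that the garbage collection really dropped the secret, note the object ID of the file before the rewrite and test for it afterwards. This is a sketch only: object_gone is a hypothetical helper, and it checks the local repository, not whatever the remote may still hold.

```shell
# BEFORE the rewrite, while some commit still contains the file, record its ID.
# For example, from the parent of the Cleanup commit (adjust the revision):
#   blob=$(git rev-parse HEAD~1:assets/secrets.csv)

# AFTER git gc: succeed only if the object is no longer stored locally.
object_gone() {
  if git cat-file -e "$1" 2>/dev/null; then
    echo "object $1 is still present" >&2
    return 1
  fi
}
```

Calling object_gone "$blob" then returns success once the object has been pruned.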

Forcefully push to remote:

git push --force

Follow through with the remaining steps to avoid accidentally reintroducing the removed files into the repository.

Now delete your old local clone and take a fresh clone. Also tell collaborators to take a fresh clone of the repository, or to rebase their branches onto the rewritten history before submitting a pull request.

ToDo GitHub is currently (20210830) responding that git-filter-repo should be used instead of git-filter-branch. THIS NEEDS TO BE UPDATED AND IS NOT RECOMMENDED IN ITS CURRENT FORM

Here we demonstrate the removal of all traces of a file named test/private.py from a git repository.


DO NOT USE THIS UNTIL THE RECIPE IS UPDATED

WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.
Proceeding with filter-branch...


$ git rm test/private.py
$ git commit -m "Permanently remove this file."
$ git push
$ git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch test/private.py' \
  --prune-empty --tag-name-filter cat -- --all

Rewrite 51c5....a070 (143/204) (...)    rm 'test/private.py'
Rewrite caf4....9d47 (143/204) (...)    rm 'test/private.py'
[...]

$ git push --all --force

In this process we (optionally) first rm the file of interest, then commit that change and push it to the remote repository. The filter-branch command then does the actual work: it finds every commit that includes the file and removes the file from that commit with git rm --cached.

After this process all traces of the contents of test/private.py are removed from the repository. They will not be removed from any clones or forks.
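As the warning above suggests, the same removal can be attempted with git filter-repo. This is an untested sketch rather than an updated recipe. It assumes git-filter-repo has been installed separately and that it is run inside a fresh clone, which filter-repo expects by default. remove_path_from_history is a hypothetical helper name.

```shell
# Remove one path from every commit using git filter-repo.
# ASSUMPTION: git-filter-repo is installed and this is a fresh clone.
remove_path_from_history() {
  # Bail out early if the filter-repo subcommand is not available.
  if ! git filter-repo -h >/dev/null 2>&1; then
    echo "git filter-repo is not installed" >&2
    return 127
  fi
  # --invert-paths keeps everything except the paths given with --path.
  git filter-repo --invert-paths --path "$1"
}
```

For example, remove_path_from_history test/private.py. As with the other approaches, a forced push is still needed afterwards and existing clones and forks are unaffected.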



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0