diff --git a/content/posts/renames-in-git-explained.md b/content/posts/renames-in-git-explained.md
new file mode 100644
index 0000000..9bdcd12
--- /dev/null
+++ b/content/posts/renames-in-git-explained.md
@@ -0,0 +1,219 @@
+---
+title: "Renames in Git explained"
+date: 2020-11-28T12:07:00Z
+draft: false
+toc: true
+tags: ['tech', 'git', 'rename']
+---
+
+## Introduction
+
+One of the questions I'm often asked when teaching or explaining Git is how Git
+handles file and/or directory renames. The short answer to this is: **It**
+**doesn't**.
+
+The slightly longer answer is: **It does, but probably not in the way you**
+**envision it**.
+
+To help you understand this topic a bit more, we first have to go back to the
+basics: What actually is a file or directory name? The answer to this question
+is highly dependent on the underlying file system, but in general it can be
+boiled down to this:
+
+> A file (or directory) name is an index used by the file system to look up the
+> contents of the file. (Note: from now on I will only refer to file names, but
+> the same applies to directory names as well.)
+
+What you should note from this is that a filename is actually not a property of
+the file content itself, but part of the meta-data regarding the content. In
+Linux, for instance, the filename of a file is stored in the directory, which
+is basically an associative array which maps filenames to inodes (the object
+which stores the meta-data of a file).
+
+When renaming a file, what you are actually doing is updating a lookup table.
+In Linux, this would be updating the associative array of the directory. If you
+move a file, then you remove the element from one directory and add it to
+another directory.
+
+How this all works internally depends on the OS and the underlying file system,
+but more importantly, it is seldom related to the content of a file. Which
+brings us to the next section.
+
+## Git stores content, not files
+
+When you commit to a Git repository, Git basically does the following:
+
+For each directory (including the top one), create a **tree** object. This is
+done by looking at every file and directory to be committed and creating **blob**
+objects for the files and tree objects for the directories. The hash of each
+such object is added to this tree object together with the filename if the
+object type is blob and the directory name if the object type is tree. The tree
+is then prepended with a header and its SHA-1 hash is calculated. The object is
+compressed and stored in the object store (`.git/objects`), using the first two
+characters of the hash as a directory and the rest as the filename.
+
+It then creates a commit object which points to the top level tree's hash.
+
+> Note: it of course only really does this for files which were part of the
+> staging area. That's the most efficient. Of course if the content of a file
+> was changed, its hash will change and thus the tree object it was part of will
+> change and its hash will also change and so on until the top level tree
+> object.
+
+As an example, suppose you have the following structure:
+
+```bash
+.
+├── README.md
+├── bar
+│   ├── bar.md
+│   └── baz
+│       └── baz.md
+└── foo
+    └── foo.md
+```
+
+If you commit this structure to Git, you will have (simplified):
+
+- 4 blob objects (README.md, bar.md, foo.md, baz.md)
+- 4 tree objects (., ./foo, ./bar and ./bar/baz)
+- 1 commit object
+
+In my case:
+
+```bash
+gael@Aviendha:~/git/tmp$ git commit -m "First commit"
+[master (root-commit) 8be3cf0] First commit
+ 4 files changed, 4 insertions(+)
+ create mode 100644 README.md
+ create mode 100644 bar/bar.md
+ create mode 100644 bar/baz/baz.md
+ create mode 100644 foo/foo.md
+gael@Aviendha:~/git/tmp$ find .git/objects/ -type f
+.git/objects/52/01cdd884658a103819d66f910ea25ba1dad2e0
+.git/objects/be/e527307ae70706c20eb89f205f444c3bb385e9
+.git/objects/6b/dd34e3e9ab26062ab881adb1024923923b5f8e
+.git/objects/8b/e3cf05d01320a124991a8e7c10fe83ec9cd5e3
+.git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
+.git/objects/57/16ca5987cbf97d6bb54920bea6adde242d87e6
+.git/objects/f9/07d059fcdc9b594c6e14dc0c3826f26ab47832
+.git/objects/e8/45566c06f9bf557d35e8292c37cf05d97a9769
+.git/objects/0c/7d27db1f575263efdcab3dc650f4502a2dbcbf
+```
+
+To get the top level tree object, just look at the commit:
+
+```bash
+gael@Aviendha:~/git/tmp$ git cat-file -p 8be3cf0
+tree 5201cdd884658a103819d66f910ea25ba1dad2e0
+author Gaël Depreeuw 1606569688 +0100
+committer Gaël Depreeuw 1606569688 +0100
+
+First commit
+```
+
+And if we look at the tree:
+
+```bash
+gael@Aviendha:~/git/tmp$ git cat-file -p 5201cdd
+100644 blob e845566c06f9bf557d35e8292c37cf05d97a9769    README.md
+040000 tree f907d059fcdc9b594c6e14dc0c3826f26ab47832    bar
+040000 tree 0c7d27db1f575263efdcab3dc650f4502a2dbcbf    foo
+```
+
+The content of `README.md` is:
+
+```bash
+gael@Aviendha:~/git/tmp$ git cat-file -p e845566
+README
+```
+
+So what does this all mean when we rename a file?
+
+## Renaming a file
+
+If we're just looking at renaming a file, then the contents of the file will
+not change.
+This means the blob object representing the file does not change.
+What does change is:
+
+1. The old file's name is removed from the tree object it belonged to.
+2. The new file's name is added to the tree object it belongs to (with the same
+   hash in this case).
+
+As such Git is not aware of any name changes. This is why the short answer is:
+Git doesn't handle file renames. The repository itself has no notion of this
+action. It just has content and a structure for that content.
+
+However, that does not mean you lose your history when you rename a file.
+
+### How to see the history of a renamed file
+
+When you remove and add a file (which is what a rename is for Git), Git will
+analyze this and when the files are X% alike (with X defaulting to 50),
+it will assume a rename occurred. You can show the log of a file including
+renames using:
+
+```bash
+git log --follow -- <file>
+```
+
+If you want to adjust the threshold you can use the `-MX%` option, where X is
+the percentage you want (0-100).
+
+Because there is a percentage threshold, the recommendation is that you do not
+combine renaming a file with modifying a file. If files are 100% identical when
+adding/removing, it is much easier to see them as renames. If, on the other
+hand, you rename a file and start modifying it heavily, Git might not detect
+this as a rename, unless you lower the threshold.
+
+You can also turn off rename detection by passing `--no-renames`.
+
+### Can I fix my commit if I did change a lot of content after renaming?
+
+First, to prevent this: always check using `git status` whether or not the
+rename is being detected. Now, how to solve it?
+
+It depends. If your commit is local only and it is the last commit, then you can
+fix this easily.
+There are many ways to do it, but one option is:
+
+```bash
+git mv <new-name> <old-name>  # Undo the file rename
+git commit --amend            # Commit the changes to the file
+git mv <old-name> <new-name>  # Rename the file
+git commit                    # Commit the rename
+```
+
+If the commit is already a couple of commits ago, you can do the same with an
+interactive rebase and amending the commit at the right time.
+
+If you already pushed your commits you will have to check with the team if you
+can rewrite the history and push it. If this is not possible, you might need to
+find the right threshold to have Git mark it as a rename.
+
+### Why did Git do it this way?
+
+This is pure speculation, but dealing with renames is not as easy as it first
+looks. For instance, you could add a Git command to do a rename (like Subversion
+has), which could create a new type of object, a rename object, which links two
+objects (old and new). But what if the user forgets to do this and just uses
+`mv fileA fileB` and commits this? Should Git automatically assume this is a
+rename? It could use the same threshold discussed earlier to determine so. That
+would make it easier. But then what is the point of having a dedicated rename
+command? I think for ease of use, they just decided not to add such a command,
+because it is not a solution for all instances. Instead, the rename detection
+works well enough for everything and they leave it up to the committer to make
+sure their renames are detected properly.
+
+## Summary
+
+So in summary: no, Git does not store renames in its repository. Instead, for
+every add/delete pair that is part of a commit, Git will do a similarity
+analysis and when they are X% alike (default 50%), it will assume a rename
+occurred.
+
+Some commands influenced by this are: `git log`, `git diff` and `git merge`.
+Options related to renames are:
+
+```txt
+-M<n>, --find-renames=<n>  # where n is the threshold percentage.
+--no-renames               # don't do any rename detection
+```
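
To make the summary concrete, here is a minimal sketch that is not part of the example repository above. It assumes a throwaway repository created with `mktemp`; the file names `old.md` and `new.md` and the commit identity are hypothetical and only there for illustration. It checks that a pure rename keeps the blob hash, and that the rename only shows up when Git compares the two trees.

```shell
#!/bin/sh
set -eu

# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"                   # hypothetical identity
git config user.email "example@example.invalid"

# First commit: a single file (old.md is a hypothetical name).
printf 'line 1\nline 2\nline 3\n' > old.md
git add old.md
git commit -q -m "Add old.md"
blob_before=$(git rev-parse HEAD:old.md)

# Second commit: a pure rename, content untouched.
git mv old.md new.md
git commit -q -m "Rename old.md to new.md"
blob_after=$(git rev-parse HEAD:new.md)

# Same content, so the same blob: only the tree entry's name changed.
echo "blob before: $blob_before"
echo "blob after:  $blob_after"

# Rename detection happens when diffing; -M asks for it explicitly.
status=$(git diff -M --name-status HEAD~1 HEAD)
echo "$status"

# --follow keeps walking the history across the detected rename.
commits=$(git log --follow --format=%s -- new.md | wc -l | tr -d ' ')
echo "commits found with --follow: $commits"
```

For a pure rename like this one, the two blob hashes should be identical and the status line should start with `R100`, meaning the files are 100% alike. If you edit `new.md` heavily before the second commit, the score drops and, below the threshold, the same diff reports a delete and an add instead.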