Add "Renames in Git explained"

5 years ago · cb0178ee1d
parent 34c505640f
commit cb0178ee1d
1 changed files with 219 additions and 0 deletions
--- a/content/posts/renames-in-git-explained.md
+++ b/content/posts/renames-in-git-explained.md
@ -0,0 +1,219 @@
+---
+title: "Renames in Git explained"
+date: 2020-11-28T12:07:00Z
+draft: false
+toc: true
+tags: ['tech', 'git', 'rename']
+---
+
+## Introduction
+
+One of the questions I'm often asked when teaching or explaining Git is how Git
+handles file and/or directory renames. The short answer to this is: **It**
+**doesn't**.
+
+The slightly longer answer is: **It does, but probably not in the way you**
+**envision it**.
+
+To help you understand this topic a bit more, we first have to go back to the
+basics: What actually is a file or directory name? The answer to this question
+is highly dependent on the underlying file system, but in general it can be
+boiled down to this:
+
+> A file (or directory) name is an index used by the file system to look up the
+> contents of the file. (Note: from now on I will only refer to file names, but
+> the same applies to directory names as well.)
+
+What you should note from this is that a filename is actually not a property of
+the file content itself, but part of the meta-data regarding the content. In
+Linux, for instance, the filename of a file is stored in the directory, which
+is basically an associative array which maps filenames to inodes (the object
+which stores the meta-data of a file).
+
+When renaming a file, what you are actually doing is updating a lookup table.
+In Linux, this would be updating the associative array of the directory. If you
+move a file, then you remove the element from one directory and add it to
+another directory.
+
+How this all works internally depends on the OS and the underlying file system,
+but, more importantly, it is seldom related to the content of a file. This
+brings us to the next chapter.
+
+## Git stores content, not files
+
+When you commit to a Git repository, it basically does the following:
+
+For each directory (including the top one), create a **tree** object. This is
+done by looking at every file and directory to be committed and creating
+**blob** objects for the files and tree objects for the directories. The hash
+of each such object is added to this tree object, together with the filename if
+the object type is blob and the directory name if the object type is tree. The
+result is prepended with a header and its SHA-1 hash is calculated. The object
+is then compressed and stored in the object store (.git/objects), using the
+first two characters of the hash as a directory and the rest as the filename.
+
+It then creates a commit object which points to the top level tree's hash.
+
+> Note: of course, Git only really does this for files which were part of the
+> staging area, as that is the most efficient. If the content of a file was
+> changed, its hash will change, and thus the hash of the tree object it is
+> part of will also change, and so on up to the top level tree object.
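+
+The hashing step can be reproduced by hand for a blob. The following is a
+minimal sketch, assuming a Linux shell with `sha1sum` available and Git's
+default SHA-1 object format; the variable names are only illustrative. It
+hashes the header plus the content, and compares the result with what
+`git hash-object` reports for the same content:
+
+```bash
+# Content of the file, without the trailing newline that echo would add.
+content='README'
+# Size in bytes of the content plus its trailing newline (ASCII assumed).
+size=$(( ${#content} + 1 ))
+# SHA-1 over the header "blob <size>\0" followed by the content itself.
+manual=$(printf 'blob %d\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)
+# Let Git compute the hash for the same bytes (nothing is written to disk).
+git_hash=$(printf '%s\n' "$content" | git hash-object --stdin)
+echo "$manual"
+echo "$git_hash"
+```
+
+Both printed lines should be identical. Tree and commit objects are hashed the
+same way, with `tree` or `commit` in the header instead of `blob`.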
+
+As an example, suppose you have the following structure:
+
+```bash
+.
+├── README.md
+├── bar
+│   ├── bar.md
+│   └── baz
+│       └── baz.md
+└── foo
+    └── foo.md
+```
+
+If you were to commit this structure to Git, you would have (simplified):
+
+- 4 blob objects (README.md, bar.md, foo.md, baz.md)
+- 4 tree objects (., ./foo, ./bar and ./bar/baz)
+- 1 commit object
+
+In my case:
+
+```bash
+gael@Aviendha:~/git/tmp$ git commit -m "First commit"
+[master (root-commit) 8be3cf0] First commit
+ 4 files changed, 4 insertions(+)
+ create mode 100644 README.md
+ create mode 100644 bar/bar.md
+ create mode 100644 bar/baz/baz.md
+ create mode 100644 foo/foo.md
+gael@Aviendha:~/git/tmp$ find .git/objects/ -type f
+.git/objects/52/01cdd884658a103819d66f910ea25ba1dad2e0
+.git/objects/be/e527307ae70706c20eb89f205f444c3bb385e9
+.git/objects/6b/dd34e3e9ab26062ab881adb1024923923b5f8e
+.git/objects/8b/e3cf05d01320a124991a8e7c10fe83ec9cd5e3
+.git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
+.git/objects/57/16ca5987cbf97d6bb54920bea6adde242d87e6
+.git/objects/f9/07d059fcdc9b594c6e14dc0c3826f26ab47832
+.git/objects/e8/45566c06f9bf557d35e8292c37cf05d97a9769
+.git/objects/0c/7d27db1f575263efdcab3dc650f4502a2dbcbf
+```
+
+To get the top level tree object, just look at the commit:
+
+```bash
+gael@Aviendha:~/git/tmp$ git cat-file -p 8be3cf0
+tree 5201cdd884658a103819d66f910ea25ba1dad2e0
+author Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100
+committer Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100
+
+First commit
+```
+
+And if we look at the tree:
+
+```bash
+gael@Aviendha:~/git/tmp$ git cat-file -p 5201cdd
+100644 blob e845566c06f9bf557d35e8292c37cf05d97a9769    README.md
+040000 tree f907d059fcdc9b594c6e14dc0c3826f26ab47832    bar
+040000 tree 0c7d27db1f575263efdcab3dc650f4502a2dbcbf    foo
+```
+
+The content of `README.md` is:
+
+```bash
+gael@Aviendha:~/git/tmp$ git cat-file -p e845566
+README
+```
+
+So what does this all mean when we rename a file?
+
+## Renaming a file
+
+If we're just looking at renaming a file, then the contents of the file will
+not change. This means the blob object representing the file does not change.
+What does change is:
+
+1. The old file's name is removed from the tree object it belonged to.
+2. The new file's name is added to the tree object it belongs to (with the same
+   hash in this case).
+
+As such, Git is not aware of any name changes. This is why the short answer is:
+Git doesn't handle file renames. The repository itself has no notion of this
+action. It just has content and a structure for that content.
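+
+A quick way to check this yourself is sketched below. It assumes a throwaway
+repository in a temporary directory; the file names `old.md` and `new.md` and
+the user settings are made up for the example. The blob hash of the file is
+looked up before and after the rename:
+
+```bash
+# Create a throwaway repository (the names used here are hypothetical).
+repo=$(mktemp -d)
+cd "$repo"
+git init -q
+git config user.name "Example"
+git config user.email "example@example.com"
+echo "some content" > old.md
+git add old.md
+git commit -q -m "Add old.md"
+# Rename the file without touching its content.
+git mv old.md new.md
+git commit -q -m "Rename old.md to new.md"
+# Look up the blob hash under the old name and under the new name.
+before=$(git rev-parse HEAD~1:old.md)
+after=$(git rev-parse HEAD:new.md)
+echo "$before"
+echo "$after"
+```
+
+Both hashes should be the same: only the name stored in the tree object
+changed, while the blob was reused.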
+
+However, that does not mean you lose your history when you rename a file.
+
+### How to see history of a renamed file
+
+When you remove and add a file (which is what a rename is for Git), Git will
+analyze this and when the files are at least X% alike (with X defaulting to
+50), it will assume a rename occurred. You can show the log of a file including
+renames using:
+
+```bash
+git log --follow -- <file>
+```
+
+If you want to adjust the threshold, you can use the `-MX%` option, where X is
+the percentage you want (0-100).
+
+Because there is a percentage threshold, the recommendation is that you do not
+combine renaming a file with modifying it. If files are 100% identical when
+adding/removing, it is much easier to see them as renames. If, on the other
+hand, you rename a file and start modifying it heavily, Git might not detect
+this as a rename unless you lower the threshold.
+
+You can also turn off rename detection with the `--no-renames` option.
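+
+The effect of the threshold can be seen with a small experiment. This is only
+a sketch, again assuming a throwaway repository with made-up file names and
+user settings. A file is renamed and slightly modified in one commit, and the
+commit is then inspected with the default threshold and with `-M100%` (exact
+renames only):
+
+```bash
+# Create a throwaway repository (the names used here are hypothetical).
+repo=$(mktemp -d)
+cd "$repo"
+git init -q
+git config user.name "Example"
+git config user.email "example@example.com"
+seq 1 20 > old.md
+git add old.md
+git commit -q -m "Add old.md"
+# Rename the file and make a small change in the same commit.
+git mv old.md new.md
+echo "one extra line" >> new.md
+git commit -q -a -m "Rename and slightly modify"
+# Default threshold (50%): the change is reported as a rename (status R).
+default_status=$(git diff --name-status -M HEAD~1 HEAD)
+# Exact renames only: the same change is reported as a delete plus an add.
+exact_status=$(git diff --name-status -M100% HEAD~1 HEAD)
+# Number of commits found for new.md when following renames.
+follow_count=$(git log --follow --oneline -- new.md | wc -l)
+echo "$default_status"
+echo "$exact_status"
+echo "$follow_count"
+```
+
+With the default threshold the status line should start with `R` followed by a
+similarity score, and `git log --follow` should list both commits.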
+
+### Can I fix my commit if I did change a lot of content after renaming
+
+First, to prevent this: always check using `git status` whether or not the
+rename is being detected. Now, how to solve it?
+
+It depends. If your commit is local only and it is the last commit, then you can
+fix this easily. There are many ways to do it, but one option is:
+
+```bash
+git mv <newname> <oldname> # Undo the file rename
+git commit --amend # Commit the changes to the file
+git mv <oldname> <newname> # Rename the file
+git commit # Commit the rename
+```
+
+If the commit is already a couple of commits ago, you can do the same with an
+interactive rebase and amending the commit at the right time.
+
+If you already pushed your commits you will have to check with the team if you
+can rewrite the history and push it. If this is not possible, you might need to
+find the right threshold to have Git mark it as a rename.
+
+### Why did Git do it this way
+
+This is pure speculation, but dealing with renames is not as easy as it first
+looks. For instance, you could add a Git command to do a rename (like Subversion
+has), which could create a new type of object, a rename object, which links two
+objects (old and new). But what if the user forgets to do this and just uses
+`mv fileA fileB` and commits this? Should Git automatically assume this is a
+rename? It could use the same threshold discussed earlier to determine so. That
+would make it easier. But then what is the point of having a dedicated rename
+command? I think for ease of use, they just decided not to add such a command,
+because it is not a solution for all instances. Instead, the rename detection
+works well enough for everything and they leave it up to the committer to make
+sure their renames are detected properly.
+
+## Summary
+
+So in summary: no, Git does not store renames in its repository. Instead, for
+every add/delete pair that is part of a commit, Git will do a likeness analysis
+and when they are at least X% alike (default 50%), it will assume a rename
+occurred.
+
+Some commands influenced by this are: git log, git diff and git merge. Options
+related to renames are:
+
+```txt
+-M<n>, --find-renames=<n> # where n is the threshold percentage, e.g. 90%.
+--no-renames # don't do any rename detection
+```