Improve renames in git blog post

5 years ago · 2d13360bd7
parent 2d2b9afe6c
commit 2d13360bd7
1 changed files with 142 additions and 131 deletions
--- a/content/post/renames-in-git-explained.md
+++ b/content/post/renames-in-git-explained.md
@@ -14,168 +14,176 @@ handles file and/or directory renames. The short answer to this is: **It**
 **doesn't**.

 The slightly longer answer is: **It does, but probably not in the way you**
-**envision it**.
+**envision it?**.

-To help you understand this topic a bit more, we first have to go back to the
-basics: What actually is a file or directory name? The answer to this question
-is highly dependent on the underlying file system, but in general it can be
-boiled down to this:
-
-> A file (or directory) name is an index used by the file system to look up the
-> contents of the file. (Note: from now on I will only refer to file names, but
-> the same applies to directory names as well.)
-
-What you should note from this is that a filename is actually not a property of
-the file content itself, but part of the meta-data regarding the content. In
-Linux, for instance, the filename of a file is stored in the directory, which
-is basically a associative array which maps filenames to inodes (the object
-which stores the meta-data of a file).
-
-When renaming a file, what you are actually doing is updating a look up table.
-In Linux, this would be updating the associative array of the directory. If you
-move a file, then you remove the element from one directory and add it to
-another directory.
-
-How this all works internally depends on the OS and the underlying file system,
-but more importantly is seldom related to the content of a file. Which brings us
-to the next chapter.
+Let's first take a look at how Git works internally. If you don't quite
+understand everything that follows, I can recommend reading chapter 10 of
+the [Pro Git book, 2nd edition](https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain).

 ## Git stores content, not files

 When you commit to a Git repository it basically does the following:

-For each directory (including the top one), create a **tree** object. This is
-done by looking at every file and directory to be commited and create **blob**
-objects for the files and tree objects for the directories. The hash of each
-such object is added to this tree object together with the filename if the
-object type is blob and the directory name if the object type is tree. This is
-then prepended with a header and compressed. The SHA-1 hash is calculated and
-the object is stored in the object store (.git/objects), using the first two
-characters as a directory and the rest as filename.
+Create a **blob** object for every file in the index (a.k.a. the staging area).
+A blob object is created by taking the content of the file and prepending a
+header. A SHA-1 hash is calculated over this header and content, and it is used
+to identify the object. The result is then compressed and stored in the aptly
+named object store (found in `.git/objects`). The first 2 characters of the
+hash (in hex format) are used as a directory within this store, while the
+remaining characters are the filename of the blob object.

-It then creates a commit object which points to the top level tree's hash.
-
-> Note: it of course only really does this for files which were part of the
-> staging area. That's the most efficient. Of course if the content of a file
-> was changed, it hash will change and thus the tree object it was part of will
-> change and its hash will also change and so on until the top level tree
-> object.
-
-As an example, suppose you have the following structure:
+Let's look at an example. Create a Git repo somewhere and create a file in it.

 ```bash
-.
-├── README.md
-├── bar
-│   ├── bar.md
-│   └── baz
-│       └── baz.md
-└── foo
-    └── foo.md
+git init foo
+cd foo
+echo "foo" >> foo.txt
 ```

-If you were to commit this structure to git, you will have (simplified):
+If you look into your `.git/objects` directory, it will be empty, aside from
+two empty subdirectories. Let's create a blob out of this file now.

- 4 blob objects (README.md, bar.md, foo.md, baz.md)
- 4 tree objects (., ./foo, ./bar and ./bar/baz)
- 1 commit object
+```bash
+$ git hash-object -w foo.txt
+257cc5642cb1a054f08cc83f2d943e56fd3ebe99
+```

-In my case:
+You will now find an object in the store at:

 ```bash
-gael@Aviendha:~/git/tmp$ git commit -m "First commit"
-[master (root-commit) 8be3cf0] First commit
- 4 files changed, 4 insertions(+)
- create mode 100644 README.md
- create mode 100644 bar/bar.md
- create mode 100644 bar/baz/baz.md
- create mode 100644 foo/foo.md
-gael@Aviendha:~/git/tmp$ find .git/objects/ -type f
-.git/objects/52/01cdd884658a103819d66f910ea25ba1dad2e0
-.git/objects/be/e527307ae70706c20eb89f205f444c3bb385e9
-.git/objects/6b/dd34e3e9ab26062ab881adb1024923923b5f8e
-.git/objects/8b/e3cf05d01320a124991a8e7c10fe83ec9cd5e3
+$ find .git/objects -type f
 .git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
-.git/objects/57/16ca5987cbf97d6bb54920bea6adde242d87e6
-.git/objects/f9/07d059fcdc9b594c6e14dc0c3826f26ab47832
-.git/objects/e8/45566c06f9bf557d35e8292c37cf05d97a9769
-.git/objects/0c/7d27db1f575263efdcab3dc650f4502a2dbcbf
 ```

-To get the top level tree object, just look at the commit:
+Here's an interesting exercise: what happens if you rename the file and create
+the blob with the renamed file?

 ```bash
-gael@Aviendha:~/git/tmp$ git cat-file -p 8be3cf0
-tree 5201cdd884658a103819d66f910ea25ba1dad2e0
-author Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100
-committer Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100
-
-First commit
+$ mv foo.txt bar.txt
+$ git hash-object -w bar.txt
+257cc5642cb1a054f08cc83f2d943e56fd3ebe99
 ```

-And if we look at the tree:
+That's right, nothing changed. This makes sense, as we're only adding the
+content to the object store! So how does Git remember the file names?

-```bash
-gael@Aviendha:~/git/tmp$ git cat-file -p 5201cdd
-100644 blob e845566c06f9bf557d35e8292c37cf05d97a9769    README.md
-040000 tree f907d059fcdc9b594c6e14dc0c3826f26ab47832    bar
-040000 tree 0c7d27db1f575263efdcab3dc650f4502a2dbcbf    foo
+## Filenames are part of tree objects
+
+Aside from **blob** objects, Git also creates **tree** objects. You can sort of
+compare it to the directories in your worktree, i.e. for each directory in your
+worktree, you will have a tree object. A tree object's content looks like:
+
+```code
+<mode> <type> <hash>    <name>
 ```

-The contents of `README.md` is:
+You can create one yourself by doing:

 ```bash
-gael@Aviendha:~/git/tmp$ git cat-file -p e845566
-README
+$ git update-index --add --cacheinfo 100644 \
+  257cc5642cb1a054f08cc83f2d943e56fd3ebe99 foo.txt
+$ git write-tree
+fcf0be4d7e45f0ef9592682ad68e42270b0366b4
+$ git cat-file -p fcf0be4d7e45f0ef9592682ad68e42270b0366b4
+100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99    foo.txt
 ```

-So what does this all mean, when we rename a file?
+There are 3 different types (that I know of) which can be referred to in a tree
+object: blob, tree, commit. Blobs represent file content, trees represent other
+trees (i.e. subdirectories) and commits represent submodules (i.e. the commit
+at which they are included). A commit is a type of object which is also present
+outside the tree objects. It contains a reference to the top tree object
+(representing the top level of your repository), a link to zero or more parent
+commits and some meta data (author, commit msg, date, ...). Finally, and for
+completeness' sake, there is also an object for annotated tags, which contains
+a reference to the commit it is pointing to as well as some meta data.

-## Renaming file
+## Renaming

-If we're just looking at renaming a file, then the contents of the file will
-not change. This means the blob object representing the file does not change.
-What does change is:
+Armed with the knowledge about trees and blobs, it should be fairly easy to
+understand what happens if you rename a file. To make it easier to
+understand, consider a simple example: we just rename a file at the top level.

-1. The old file's name is removed from the tree object it belong to.
-2. The new file's name is added to the tree object it belongs to (with the same
-   hash in this case).
+> Note: more complex examples are just more time consuming to explain, but
+> not to understand. The same principles apply.

-As such Git is not aware of any name changes. This is why the short answer is:
-Git doesn't handle file renames. The repository itself has no notion of this
-action. It's just has content and a structure for that content.
+When you commit such a rename, your repository will be impacted as follows:
+
+- The blob representing the file remains unchanged.
+- The top level tree object changes as it now has a different file name.
+- The commit object will point to the new tree. (Its parent still points to the
+  old tree.)
+
+Nowhere is there any special mention of a rename occurring. Remember, we're just
+storing content! As such Git is not aware of any name changes. This is why the
+short answer was: Git doesn't handle file renames. The repository itself has no
+notion of this action. It just has content and a structure for that content.

 However, that does not mean you lose your history when you rename a file.

 ### How to see history of a renamed file

-When you remove and add a file (which is what a rename is for Git), Git will
-analyze this and when the files are X% alike (with X being defaulted to 50),
-it will assume a rename occured. You can show the log of a file including
-renames using:
+Git might not store information on renames in its repository, but it does come
+packed with an algorithm that detects file renames. For every add/delete pair
+added to the index, it determines how alike the paired files are. If they are
+at least 50% alike, it considers the pair to have been a rename. If there are
+multiple possibilities, it takes the one with the highest percentage. If
+multiple files have the same percentage, it picks one depending on the
+implementation.

-```bash
-git log --follow -- <file>
-```
+> **Note**: I believe, but am not sure, it basically takes the first
+> alphabetical match in the last case.

-If you want to adjust the treshold you can use the `-MX%` option, where X is the
-percentage you want (0-100).
+By default `git log -- <file>` does not track across renames. If you want to
+see the history across renames, you will need to add the `--follow` option.

-Because there is a percentage treshold, the recommendation is that you do not
-combine renaming a file, with modifying a file. If files are 100% identical when
-adding/removing it makes it much easier to see them as renames. If on the other
-hand, you rename a file and start modifying it heavily, Git might not detect
-this as a rename, unless you lower the treshold.
+You can also define the threshold percentage to be different from 50%. This is
+done via the `-M<n>` or `--find-renames=<n>` option. See the git documentation
+for the correct syntax.

 You can also turn off rename detection by passing `--no-renames`.

+### Rename best practice
+
+Because of the threshold and the cheapness of commits, it is recommended that
+when you rename a file/directory, you commit the rename first, before you
+continue working on the renamed file. This basically makes it so you can use
+a threshold of 100% all the time.
+
+### Why did Git do it this way
+
+This is pure speculation, but here are my thoughts on it:
+
+Filenames are actually part of the underlying file system, so for a version
+control system to support multiple file systems, it has to handle filenames
+in its own way. This includes renames. If you think about what this would
+require for Git, it would not be very straightforward: Git could have chosen to
+provide a command to store rename data, let's say: `git rename fileA fileB`, but
+what should this command do? We can imagine it could create a new **rename**
+object, which would hold the blob hash and the name of the previous file. Now,
+every time you walk through history and encounter this object type, you would
+need to remember this redirection. There are probably a lot of little nuances
+which are not immediately apparent though, and it does not deal with one of the
+major drawbacks of this new command: What happens if the user forgets it and
+just does `mv fileA fileB`?
+
+Well, we'd actually want to have some mechanism to detect this as a rename, as
+once this is committed it becomes more difficult to undo this change
+(especially if we already pushed the commit!). So it sure would be nice if Git
+could somehow figure out that it was a rename. Which is exactly what they did.
+But now that we have this functionality, what actually is the point of the
+new command we wanted to implement? This is probably highly subjective, but to
+me it seems completely irrelevant now. Instead of having a command which can be
+forgotten and for which we need a contingency, just use the contingency as the
+solution! It makes the behaviour a lot more consistent!
+
 ### Can I fix my commit if I did change a lot of content after renaming

 First, to prevent this: always check using `git status` whether or not the
 rename is being detected. Now, how to solve it?

 It depends. If your commit is local only and it is the last commit, then you can
-fix this easily. There are many ways to to it, but a couple options are:
+fix this easily. There are many ways to do it, but one option is:

 ```bash
 git mv <newname> <oldname> # Undo the file rename
@@ -184,37 +192,40 @@ git mv <oldname> <newname> # Rename the file
 git commit # Commit the rename
 ```

+If you want the rename first and the changes second, you can also do this, but
+it is a bit more complex:
+
+```bash
+git reset --soft HEAD~ # Go back one commit, but keep the changes
+git restore --staged <oldname> <newname> # unstage the deletion and addition
+git restore <oldname> # undelete the old file
+mv <newname> <newname.tmp> # make a temp backup of the new file
+git mv <oldname> <newname> # Rename the old file
+git commit # commit the rename
+mv <newname.tmp> <newname> # apply the new changes (leaves no temp file behind)
+git commit -a # Commit the changes
+```
+
 If the commit is already a couple of commits ago, you can do the same with an
-interactive rebase and amending the commit at the right time.
+interactive rebase and doing either of the above at the correct time.

 If you already pushed your commits you will have to check with the team if you
 can rewrite the history and push it. If this is not possible, you might need to
 find the right threshold to have Git mark it as a rename.

-### Why did Git do it this way
-
-This is pure speculation but dealing with renames is not as easy as it first
-looks. For instance you could add a git command to do a rename (like subversion
-has), which could create a new type of object a rename object which links two
-objects (old and new). But what if the user forgets to do this and just uses
-`mv fileA fileB` and commits this? Should Git automatically assume this is a
-rename? It could use the same treshold discused earlier to determine so. That
-would make it easier. But then what is the point of having a dedicated rename
-command? I think for easy of use, they just decided not to add such a command,
-because it is not a solution for all instances. Instead, the rename detection
-works good enough for everything and they leave it up to the commiter to make
-sure his renames are detected properly.
-
 ## Summary

-So in summary: no, Git does not store renames in its repository. Instead, it
-for every add/delete pair part of a commit, Git will do a likeness analysis and
+So in summary: no, Git does not store renames in its repository. Instead, for
+every add/delete pair in a commit, Git will do a similarity analysis and
 when they are X% alike (default 50%), it will assume a rename occurred.

-Some commands influenced by this are: git log, git diff and git merge. Options
-related to renames are:
+Some commands influenced by this are: `git log`, `git diff` and `git merge`.
+Options related to renames are:

 ```txt
 -M<n>, --find-renames=<n> # where n is the threshold percentage.
 --no-renames # don't do any rename detection
 ```
+
+It is best practice to handle renames in their own commits. Try to avoid
+renaming and modifying a file within the same commit.
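
The behaviour summarized above can be tried out with a small script. This is only a sketch and not part of the post: it assumes `git` and a POSIX shell are available, and the file names, commit messages and identity values are placeholders chosen for illustration.

```shell
#!/bin/sh
set -e

# Work in a throwaway repository (location chosen by mktemp).
repo=$(mktemp -d)
cd "$repo"
git init -q
# Local identity so the commits succeed on a fresh machine (placeholder values).
git config user.name "Example"
git config user.email "example@example.com"

# Commit a file, then rename it in its own commit without touching the content.
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > old.txt
git add old.txt
git commit -q -m "Add old.txt"
git mv old.txt new.txt
git commit -q -m "Rename old.txt to new.txt"

# With rename detection: a single entry with a similarity score
# (R100 means the content is identical).
with_detection=$(git diff -M --name-status HEAD~1 HEAD)
# Without rename detection: the same change is reported as a delete and an add.
without_detection=$(git diff --no-renames --name-status HEAD~1 HEAD)
printf 'with -M:\n%s\n\nwith --no-renames:\n%s\n' \
  "$with_detection" "$without_detection"

# History of new.txt, followed across the rename.
followed=$(git log --follow --format=%s -- new.txt)
printf '\nlog --follow:\n%s\n' "$followed"
```

Because the rename is committed on its own, the two files are identical and the rename is detected no matter which threshold is used.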