title	date	draft	toc	tags
Renames in Git explained	2020-11-28T12:07:00Z	false	true	[tech git rename]

Introduction

One of the questions I'm often asked when teaching or explaining Git is how Git handles file and/or directory renames. The short answer to this is: It doesn't.

The slightly longer answer is: It does, but probably not in the way you envision it.

To help you understand this topic a bit more, we first have to go back to the basics: What actually is a file or directory name? The answer to this question is highly dependent on the underlying file system, but in general it can be boiled down to this:

A file (or directory) name is an index used by the file system to look up the contents of the file. (Note: from now on I will only refer to file names, but the same applies to directory names as well.)

What you should note from this is that a filename is not a property of the file content itself, but part of the metadata about that content. In Linux, for instance, the filename is stored in the directory, which is basically an associative array that maps filenames to inodes (the objects that store the metadata of a file).

When renaming a file, what you are actually doing is updating a lookup table. In Linux, this means updating the associative array of the directory. If you move a file, you remove the entry from one directory and add it to another.
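To make this concrete, here is a minimal sketch that renames a file and compares its inode number before and after. It assumes a POSIX-style file system such as those typically used on Linux; the function name and file names are made up for illustration.

```python
import os
import tempfile


def inode_before_and_after_rename():
    """Create a file, rename it, and return its inode number before and after."""
    with tempfile.TemporaryDirectory() as directory:
        old_path = os.path.join(directory, "old-name.txt")
        new_path = os.path.join(directory, "new-name.txt")
        with open(old_path, "w") as handle:
            handle.write("some content\n")

        before = os.stat(old_path).st_ino
        # Only the directory entry changes; the file content is not touched.
        os.rename(old_path, new_path)
        after = os.stat(new_path).st_ino
        return before, after


before, after = inode_before_and_after_rename()
print("same inode after rename:", before == after)
```

If the two numbers are equal, the rename only changed the name-to-inode mapping in the directory, which is the behaviour described above.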

How this all works internally depends on the OS and the underlying file system but, more importantly, it is seldom related to the content of a file. Which brings us to the next chapter.

Git stores content, not files

When you commit to a Git repository it basically does the following:

For each directory (including the top one), create a tree object. This is done by looking at every file and directory to be committed and creating blob objects for the files and tree objects for the directories. The hash of each such object is added to the tree object, together with its mode and its name (the filename if the object type is blob, the directory name if the object type is tree). The content of every object is prepended with a header, and the SHA-1 hash of header plus content becomes the object's ID. The result is compressed and stored in the object store (.git/objects), using the first two characters of the hash as the directory name and the rest as the filename.

It then creates a commit object which points to the top level tree's hash.

Note: Git of course only really does this for files which are part of the staging area, since unchanged objects can simply be reused. That's the most efficient. If the content of a file changes, its hash changes, so the tree object it is part of changes and gets a new hash as well, and so on up to the top-level tree object.

As an example, suppose you have the following structure:

.
├── README.md
├── bar
│   ├── bar.md
│   └── baz
│       └── baz.md
└── foo
    └── foo.md

If you were to commit this structure to Git, you would have (simplified):

4 blob objects (README.md, bar.md, foo.md, baz.md)
4 tree objects (., ./foo, ./bar and ./bar/baz)
1 commit object

In my case:

gael@Aviendha:~/git/tmp$ git commit -m "First commit"
[master (root-commit) 8be3cf0] First commit
 4 files changed, 4 insertions(+)
 create mode 100644 README.md
 create mode 100644 bar/bar.md
 create mode 100644 bar/baz/baz.md
 create mode 100644 foo/foo.md
gael@Aviendha:~/git/tmp$ find .git/objects/ -type f
.git/objects/52/01cdd884658a103819d66f910ea25ba1dad2e0
.git/objects/be/e527307ae70706c20eb89f205f444c3bb385e9
.git/objects/6b/dd34e3e9ab26062ab881adb1024923923b5f8e
.git/objects/8b/e3cf05d01320a124991a8e7c10fe83ec9cd5e3
.git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
.git/objects/57/16ca5987cbf97d6bb54920bea6adde242d87e6
.git/objects/f9/07d059fcdc9b594c6e14dc0c3826f26ab47832
.git/objects/e8/45566c06f9bf557d35e8292c37cf05d97a9769
.git/objects/0c/7d27db1f575263efdcab3dc650f4502a2dbcbf

To get the top level tree object, just look at the commit:

gael@Aviendha:~/git/tmp$ git cat-file -p 8be3cf0
tree 5201cdd884658a103819d66f910ea25ba1dad2e0
author Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100
committer Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100

First commit

And if we look at the tree:

gael@Aviendha:~/git/tmp$ git cat-file -p 5201cdd
100644 blob e845566c06f9bf557d35e8292c37cf05d97a9769    README.md
040000 tree f907d059fcdc9b594c6e14dc0c3826f26ab47832    bar
040000 tree 0c7d27db1f575263efdcab3dc650f4502a2dbcbf    foo

The contents of README.md are:

gael@Aviendha:~/git/tmp$ git cat-file -p e845566
README

So what does this all mean, when we rename a file?

Renaming a file

If we're just looking at renaming a file, then the contents of the file will not change. This means the blob object representing the file does not change. What does change is:

The old file's name is removed from the tree object it belonged to.
The new file's name is added to the tree object it belongs to (with the same hash in this case).

As such, Git is not aware of any name changes. This is why the short answer is: Git doesn't handle file renames. The repository itself has no notion of this action. It just has content and a structure for that content.
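The effect of a rename on the object IDs can be sketched as follows. This is a simplified model of how tree objects are built, under a few assumptions: every file is a regular file with mode 100644, the file contents are invented for the example, and the helper names are hypothetical.

```python
import hashlib


def object_id(object_type, body):
    """Raw 20-byte SHA-1 of '<type> <size>\\0<body>'."""
    header = f"{object_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).digest()


def write_tree(directory, ids, path="."):
    """Compute the tree ID for a nested dict (bytes = file, dict = directory).

    Every blob and tree ID is recorded in `ids`, keyed by its path.
    """
    entries = []
    for name, value in directory.items():
        child_path = f"{path}/{name}"
        if isinstance(value, dict):
            mode, child_id = b"40000", write_tree(value, ids, child_path)
            sort_key = name.encode() + b"/"  # directories sort as "name/"
        else:
            mode, child_id = b"100644", object_id("blob", value)
            ids[child_path] = child_id.hex()
            sort_key = name.encode()
        entries.append((sort_key, mode + b" " + name.encode() + b"\x00" + child_id))
    body = b"".join(entry for _, entry in sorted(entries))
    tree_id = object_id("tree", body)
    ids[path] = tree_id.hex()
    return tree_id


def example_tree(foo_name):
    """The example structure from above, with made-up file contents."""
    return {
        "README.md": b"README\n",
        "bar": {"bar.md": b"bar\n", "baz": {"baz.md": b"baz\n"}},
        "foo": {foo_name: b"foo\n"},
    }


ids_before, ids_after = {}, {}
write_tree(example_tree("foo.md"), ids_before)
write_tree(example_tree("renamed.md"), ids_after)

print("blob unchanged:    ", ids_before["./foo/foo.md"] == ids_after["./foo/renamed.md"])
print("./bar unchanged:   ", ids_before["./bar"] == ids_after["./bar"])
print("./foo tree changed:", ids_before["./foo"] != ids_after["./foo"])
print("root tree changed: ", ids_before["."] != ids_after["."])
```

Only the trees on the path from the renamed file up to the root get new IDs. Nothing in either snapshot records that foo.md became renamed.md.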

However, that does not mean you lose your history when you rename a file.

How to see history of a renamed file

When you remove and add a file (which is what a rename is for Git), Git will analyze this, and when the files are X% alike (with X defaulting to 50), it will assume a rename occurred. You can show the log of a file including renames using:

git log --follow -- <file>

If you want to adjust the threshold you can use the -MX% option, where X is the percentage you want (0-100).

Because there is a percentage threshold, the recommendation is that you do not combine renaming a file with modifying it. If the removed and added files are 100% identical, it is much easier to recognize them as a rename. If, on the other hand, you rename a file and start modifying it heavily, Git might not detect this as a rename unless you lower the threshold.

You can also turn off rename detection by passing --no-renames.
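The threshold idea can be sketched like this. Note that this is not Git's similarity algorithm: the sketch swaps in Python's difflib line-matching ratio as a stand-in score, and the function names and file contents are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(old_text, new_text):
    """Line-based similarity between 0.0 and 1.0 (a stand-in for Git's score)."""
    return SequenceMatcher(None, old_text.splitlines(), new_text.splitlines()).ratio()


def detect_renames(deleted, added, threshold=0.5):
    """Pair each deleted file with its most similar added file above the threshold."""
    renames = []
    remaining = dict(added)
    for old_name, old_text in deleted.items():
        best = None
        for new_name, new_text in remaining.items():
            score = similarity(old_text, new_text)
            if score >= threshold and (best is None or score > best[1]):
                best = (new_name, score)
        if best is not None:
            renames.append((old_name, best[0], best[1]))
            del remaining[best[0]]  # an added file is matched at most once
    return renames


deleted = {"foo/foo.md": "line 1\nline 2\nline 3\nline 4\n"}
pure_rename = {"foo/renamed.md": "line 1\nline 2\nline 3\nline 4\n"}
heavy_edit = {
    "foo/renamed.md": "line 1\nsomething else\nentirely new\ntext here\nand more\nand more again\n"
}

print(detect_renames(deleted, pure_rename))                # one rename, score 1.0
print(detect_renames(deleted, heavy_edit))                 # no rename at the default 50%
print(detect_renames(deleted, heavy_edit, threshold=0.1))  # found once the threshold is lowered
```

The heavily edited file shares only one of its lines with the original, so it falls below the default threshold and shows up as an unrelated delete and add.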

Can I fix my commit if I did change a lot of content after renaming?

First, to prevent this: always check using git status whether or not the rename is being detected. Now, how to solve it?

It depends. If your commit is local only and it is the last commit, then you can fix this easily. There are many ways to do it, but one option is:

git mv <newname> <oldname> # Undo the file rename
git commit --amend # Commit the changes to the file
git mv <oldname> <newname> # Rename the file
git commit # Commit the rename

If the commit is already a couple of commits ago, you can do the same with an interactive rebase and amending the commit at the right time.

If you already pushed your commits, you will have to check with the team whether you can rewrite the history and push it. If this is not possible, you might need to find the right threshold to have Git mark it as a rename.

Why did Git do it this way?

This is pure speculation, but dealing with renames is not as easy as it first looks. For instance, you could add a Git command that records a rename (like Subversion has), which could create a new type of object, a rename object, that links two objects (old and new). But what if the user forgets to do this and just uses mv fileA fileB and commits the result? Should Git automatically assume this is a rename? It could use the same threshold discussed earlier to decide. That would make it easier, but then what is the point of having a dedicated rename command? I think that, for ease of use, they just decided not to add such a command, because it is not a solution for all cases. Instead, rename detection works well enough in general, and it is left up to the committer to make sure their renames are detected properly.

Summary

So in summary: no, Git does not store renames in its repository. Instead, for every add/delete pair that is part of a commit, Git does a similarity analysis, and when the two files are X% alike (default 50%), it assumes a rename occurred.

Some commands influenced by this are: git log, git diff and git merge. Options related to renames are:

-M<n>, --find-renames=<n> # where n is the threshold percentage.
--no-renames # don't do any rename detection
