Add "Renames in Git explained"

pull/1/head
Gaël Depreeuw 4 years ago
parent 34c505640f
commit cb0178ee1d
Signed by: Mithror
GPG Key ID: 8AB218ABA4867F78
  1. 219
      content/posts/renames-in-git-explained.md

@ -0,0 +1,219 @@
---
title: "Renames in Git explained"
date: 2020-11-28T12:07:00Z
draft: false
toc: true
tags: ['tech', 'git', 'rename']
---
## Introduction
One of the questions I'm often asked when teaching or explaining Git is how Git
handles file and/or directory renames. The short answer to this is: **It**
**doesn't**.
The slightly longer answer is: **It does, but probably not in the way you**
**envision it**.
To help you understand this topic a bit more, we first have to go back to the
basics: What actually is a file or directory name? The answer to this question
is highly dependent on the underlying file system, but in general it can be
boiled down to this:
> A file (or directory) name is an index used by the file system to look up the
> contents of the file. (Note: from now on I will only refer to file names, but
> the same applies to directory names as well.)
What you should note from this is that a filename is not a property of the
file content itself, but part of the metadata about that content. In Linux,
for instance, the filename of a file is stored in the directory, which is
basically an associative array that maps filenames to inodes (the objects
which store the metadata of a file).
When renaming a file, what you are actually doing is updating a lookup table.
In Linux, this means updating the associative array of the directory. If you
move a file, then you remove the entry from one directory and add it to
another directory.
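
To make this concrete, here is a minimal Python sketch (the file names are made up for illustration) that renames a file and compares its inode number before and after. On a typical Linux file system, a rename within the same file system leaves the inode untouched; only the directory entry changes.

```python
import os
import tempfile


def rename_keeps_inode():
    """Rename a file and return (inode before, inode after, directory listing)."""
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "old.txt")  # hypothetical file names
        new = os.path.join(d, "new.txt")
        with open(old, "w") as f:
            f.write("hello\n")
        inode_before = os.stat(old).st_ino
        os.rename(old, new)  # only the directory entry is updated
        inode_after = os.stat(new).st_ino
        return inode_before, inode_after, os.listdir(d)


before, after, names = rename_keeps_inode()
print(before == after)  # True on typical POSIX file systems
print(names)            # ['new.txt']
```

The content and its metadata stay where they are; the directory simply maps a different name to the same inode.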
How this all works internally depends on the OS and the underlying file system,
but, more importantly, it is seldom related to the content of a file. Which
brings us to the next section.
## Git stores content, not files
When you commit to a Git repository, it basically does the following:
For each directory (including the top one), it creates a **tree** object. This
is done by looking at every file and directory to be committed and creating
**blob** objects for the files and tree objects for the directories. Each entry
in a tree object holds the hash of such an object together with its mode and
its name: the filename if the object is a blob, the directory name if it is a
tree. Every object is prepended with a header, and the SHA-1 hash is calculated
over the header plus the content. The object is then compressed and stored in
the object store (`.git/objects`), using the first two characters of the hash
as the directory and the rest as the filename.
It then creates a commit object which points to the top-level tree's hash.
> Note: it of course only really does this for files which were part of the
> staging area, as that is the most efficient. If the content of a file
> changed, its hash changes, and thus the tree object it is part of changes
> and gets a new hash as well, and so on up to the top-level tree object.
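
The hashing and storing step can be sketched in a few lines of Python. This is a simplified model, not Git's actual implementation, and it assumes a repository that uses the default SHA-1 object format. The helper names `git_object` and `object_path` are my own.

```python
import hashlib
import zlib


def git_object(obj_type, body):
    """Return (hex SHA-1, zlib-compressed bytes) for a Git object."""
    # The header is "<type> <size in bytes>" followed by a NUL byte.
    store = f"{obj_type} {len(body)}".encode() + b"\0" + body
    # The hash covers the uncompressed header plus content.
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def object_path(sha):
    """First two hex characters form the directory, the rest the filename."""
    return f".git/objects/{sha[:2]}/{sha[2:]}"


sha, compressed = git_object("blob", b"")  # an empty file
print(sha)               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(object_path(sha))  # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Note that the filename plays no role in the hash of a blob: two files with the same content end up as one and the same object.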
As an example, suppose you have the following structure:
```bash
.
├── README.md
├── bar
│   ├── bar.md
│   └── baz
│       └── baz.md
└── foo
    └── foo.md
```
If you were to commit this structure to Git, you would have (simplified):
- 4 blob objects (README.md, bar.md, foo.md, baz.md)
- 4 tree objects (., ./foo, ./bar and ./bar/baz)
- 1 commit object
In my case:
```bash
gael@Aviendha:~/git/tmp$ git commit -m "First commit"
[master (root-commit) 8be3cf0] First commit
4 files changed, 4 insertions(+)
create mode 100644 README.md
create mode 100644 bar/bar.md
create mode 100644 bar/baz/baz.md
create mode 100644 foo/foo.md
gael@Aviendha:~/git/tmp$ find .git/objects/ -type f
.git/objects/52/01cdd884658a103819d66f910ea25ba1dad2e0
.git/objects/be/e527307ae70706c20eb89f205f444c3bb385e9
.git/objects/6b/dd34e3e9ab26062ab881adb1024923923b5f8e
.git/objects/8b/e3cf05d01320a124991a8e7c10fe83ec9cd5e3
.git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
.git/objects/57/16ca5987cbf97d6bb54920bea6adde242d87e6
.git/objects/f9/07d059fcdc9b594c6e14dc0c3826f26ab47832
.git/objects/e8/45566c06f9bf557d35e8292c37cf05d97a9769
.git/objects/0c/7d27db1f575263efdcab3dc650f4502a2dbcbf
```
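
The object count can be reproduced with a small Python model of how trees are built bottom-up. This is a sketch under assumptions: the file contents below are invented (so the hashes will not match the listing above), all files use mode `100644`, and `write_tree` is my own helper, not Git's command of the same name.

```python
import hashlib


def hash_object(obj_type, body):
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def write_tree(node, objects):
    """Hash a nested dict: name -> bytes for files, name -> dict for directories.

    Every object created along the way is recorded in `objects` as sha -> type.
    """
    entries = []
    for name, child in node.items():
        if isinstance(child, dict):
            mode, sha = "40000", write_tree(child, objects)
        else:
            mode, sha = "100644", hash_object("blob", child)
            objects[sha] = "blob"
        entries.append((name, mode, sha))
    # Git sorts entries by name, comparing directories as if they ended in "/".
    entries.sort(key=lambda e: (e[0] + "/" if e[1] == "40000" else e[0]).encode())
    body = b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(sha)
        for name, mode, sha in entries
    )
    tree_sha = hash_object("tree", body)
    objects[tree_sha] = "tree"
    return tree_sha


worktree = {  # assumed contents, for illustration only
    "README.md": b"README\n",
    "bar": {"bar.md": b"bar\n", "baz": {"baz.md": b"baz\n"}},
    "foo": {"foo.md": b"foo\n"},
}
objects = {}
root = write_tree(worktree, objects)
kinds = list(objects.values())
print(kinds.count("blob"), "blobs,", kinds.count("tree"), "trees")  # 4 blobs, 4 trees
```

Together with the commit object, that gives the nine files found in `.git/objects` above.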
To get the top level tree object, just look at the commit:
```bash
gael@Aviendha:~/git/tmp$ git cat-file -p 8be3cf0
tree 5201cdd884658a103819d66f910ea25ba1dad2e0
author Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100
committer Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100

First commit
```
And if we look at the tree:
```bash
gael@Aviendha:~/git/tmp$ git cat-file -p 5201cdd
100644 blob e845566c06f9bf557d35e8292c37cf05d97a9769 README.md
040000 tree f907d059fcdc9b594c6e14dc0c3826f26ab47832 bar
040000 tree 0c7d27db1f575263efdcab3dc650f4502a2dbcbf foo
```
The content of `README.md` is:
```bash
gael@Aviendha:~/git/tmp$ git cat-file -p e845566
README
```
So what does this all mean when we rename a file?
## Renaming a file
If we're just looking at renaming a file, then the contents of the file will
not change. This means the blob object representing the file does not change.
What does change is:
1. The old file's name is removed from the tree object it belonged to.
2. The new file's name is added to the tree object it belongs to (with the same
hash in this case).
As such, Git is not aware of any name changes. This is why the short answer is:
Git doesn't handle file renames. The repository itself has no notion of this
action. It just has content and a structure for that content.
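
A minimal Python sketch of this, assuming a single directory with one regular file and the default SHA-1 object format (the file names and the helper `hash_flat_tree` are made up for illustration): the blob hash survives the rename, while the tree hash does not.

```python
import hashlib


def hash_object(obj_type, body):
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def hash_flat_tree(files):
    """Hash a single directory given as {filename: content bytes}."""
    body = b"".join(
        b"100644 " + name.encode() + b"\0"
        + bytes.fromhex(hash_object("blob", content))
        for name, content in sorted(files.items())
    )
    return hash_object("tree", body)


before = {"foo.md": b"foo\n"}
after = {"renamed.md": b"foo\n"}  # same content, different name

same_blob = hash_object("blob", before["foo.md"]) == hash_object("blob", after["renamed.md"])
same_tree = hash_flat_tree(before) == hash_flat_tree(after)
print(same_blob)  # True: the content did not change
print(same_tree)  # False: the tree now lists a different name
```

Nothing in either tree says that `renamed.md` used to be `foo.md`; the two snapshots merely share a blob.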
However, that does not mean you lose your history when you rename a file.
### How to see history of a renamed file
When you remove and add a file (which is what a rename is for Git), Git will
analyze this, and when the files are at least X% alike (with X defaulting to
50), it will assume a rename occurred. You can show the log of a file,
including renames, using:
```bash
git log --follow -- <file>
```
If you want to adjust the threshold you can use the `-MX%` option, where X is
the percentage you want (0-100), for example `-M90%`.
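
The idea behind the threshold can be illustrated with a small Python sketch. To be clear, this is not Git's actual algorithm: I am using `difflib.SequenceMatcher` on lines as a stand-in similarity measure, and `detect_renames` as well as the sample files are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Line-based similarity between two texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


def detect_renames(deleted, added, threshold=0.5):
    """Pair each deleted path with its most similar added path, if similar enough."""
    renames = []
    unmatched = dict(added)
    for old_path, old_content in deleted.items():
        best_path, best_score = None, 0.0
        for new_path, new_content in unmatched.items():
            score = similarity(old_content, new_content)
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= threshold:
            renames.append((old_path, best_path, best_score))
            del unmatched[best_path]  # an added file is matched at most once
    return renames


deleted = {"notes.md": "a\nb\nc\nd\n"}
added = {"docs/notes.md": "a\nb\nc\nx\n", "other.md": "p\nq\n"}

print(detect_renames(deleted, added))                 # [('notes.md', 'docs/notes.md', 0.75)]
print(detect_renames(deleted, added, threshold=0.8))  # []
```

With three of four lines unchanged the pair scores 75%, so it counts as a rename at the default threshold but not at 80%.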
Because there is a percentage threshold, the recommendation is that you do not
combine renaming a file with modifying it. If the removed and added files are
100% identical, it is much easier to recognize them as a rename. If, on the
other hand, you rename a file and start modifying it heavily, Git might not
detect this as a rename unless you lower the threshold.
You can also turn off rename detection by passing `--no-renames`.
### Can I fix my commit if I did change a lot of content after renaming?
First, to prevent this: after staging your changes, always check with
`git status` whether or not the rename is being detected. Now, how to solve it?
It depends. If your commit is local only and it is the last commit, then you can
fix this easily. There are many ways to do it, but one option is:
```bash
git mv <newname> <oldname> # Undo the file rename
git commit --amend # Commit the changes to the file
git mv <oldname> <newname> # Rename the file
git commit # Commit the rename
```
If the commit is a couple of commits back, you can do the same with an
interactive rebase, amending the commit at the right time.
If you already pushed your commits, you will have to check with the team whether
you can rewrite the history and push it. If this is not possible, you might
need to find the right threshold to have Git mark it as a rename.
### Why did Git do it this way?
This is pure speculation, but dealing with renames is not as easy as it first
looks. For instance, you could add a Git command that records a rename (like
Subversion has), which could create a new type of object, a rename object,
linking two objects (old and new). But what if the user forgets to do this and
just uses `mv fileA fileB` and commits this? Should Git automatically assume
this is a rename? It could use the same threshold discussed earlier to
determine so. That would make it easier. But then what is the point of having a
dedicated rename command? I think for ease of use they just decided not to add
such a command (`git mv` exists, but it is only a shortcut for moving the file
and staging the result), because it is not a solution for all instances.
Instead, the rename detection works well enough for everything, and they leave
it up to the committer to make sure their renames are detected properly.
## Summary
So in summary: no, Git does not store renames in its repository. Instead, for
every add/delete pair that is part of a commit, Git will do a likeness analysis,
and when the files are at least X% alike (default 50%), it will assume a rename
occurred.
Some commands influenced by this are `git log`, `git diff` and `git merge`.
Options related to renames for `git log` and `git diff` are:
```txt
-M<n>, --find-renames=<n> # where n is the threshold percentage.
--no-renames # don't do any rename detection
```