|
|
|
@ -14,168 +14,176 @@ handles file and/or directory renames. The short answer to this is: **It** |
|
|
|
|
**doesn't**. |
|
|
|
|
|
|
|
|
|
The slightly longer answer is: **It does, but probably not in the way you** |
|
|
|
|
**envision it**. |
|
|
|
|
**envision it?**. |
|
|
|
|
|
|
|
|
|
To help you understand this topic a bit more, we first have to go back to the |
|
|
|
|
basics: What actually is a file or directory name? The answer to this question |
|
|
|
|
is highly dependent on the underlying file system, but in general it can be |
|
|
|
|
boiled down to this: |
|
|
|
|
|
|
|
|
|
> A file (or directory) name is an index used by the file system to look up the |
|
|
|
|
> contents of the file. (Note: from now on I will only refer to file names, but |
|
|
|
|
> the same applies to directory names as well.) |
|
|
|
|
|
|
|
|
|
What you should note from this is that a filename is actually not a property of |
|
|
|
|
the file content itself, but part of the meta-data regarding the content. In |
|
|
|
|
Linux, for instance, the filename of a file is stored in the directory, which |
|
|
|
|
is basically an associative array which maps filenames to inodes (the object |
|
|
|
|
which stores the meta-data of a file). |
|
|
|
|
|
|
|
|
|
When renaming a file, what you are actually doing is updating a look up table. |
|
|
|
|
In Linux, this would be updating the associative array of the directory. If you |
|
|
|
|
move a file, then you remove the element from one directory and add it to |
|
|
|
|
another directory. |
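A minimal sketch of the point above, assuming a POSIX shell and a file system that exposes inode numbers (the temporary directory and the file names `a.txt` and `b.txt` are made up for illustration). It renames a file and compares the inode number before and after:

```shell
# Assumption: POSIX shell, file system with inode numbers (e.g. ext4).
tmp=$(mktemp -d)
echo "hello" > "$tmp/a.txt"

# `ls -i` prints the inode number in front of the file name.
inode_before=$(ls -i "$tmp/a.txt" | awk '{print $1}')
mv "$tmp/a.txt" "$tmp/b.txt"
inode_after=$(ls -i "$tmp/b.txt" | awk '{print $1}')

echo "before: $inode_before, after: $inode_after"
```

On a typical Linux file system both numbers should be identical: only the directory entry changed, not the inode or the content it points to.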
|
|
|
|
|
|
|
|
|
How this all works internally depends on the OS and the underlying file system, |
|
|
|
|
but more importantly is seldom related to the content of a file. Which brings us |
|
|
|
|
to the next chapter. |
|
|
|
|
Let's first take a look at how Git works internally. If you don't quite |
|
|
|
|
understand everything which follows, I can recommend reading chapter 10 of |
|
|
|
|
the [Pro Git book, 2nd Edition](https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain). |
|
|
|
|
|
|
|
|
|
## Git stores content, not files |
|
|
|
|
|
|
|
|
|
When you commit to a Git repository it basically does the following: |
|
|
|
|
|
|
|
|
|
For each directory (including the top one), create a **tree** object. This is |
|
|
|
|
done by looking at every file and directory to be committed and creating **blob** |
|
|
|
|
objects for the files and tree objects for the directories. The hash of each |
|
|
|
|
such object is added to this tree object together with the filename if the |
|
|
|
|
object type is blob and the directory name if the object type is tree. This is |
|
|
|
|
then prepended with a header and compressed. The SHA-1 hash is calculated and |
|
|
|
|
the object is stored in the object store (.git/objects), using the first two |
|
|
|
|
characters as a directory and the rest as filename. |
|
|
|
|
Create a **blob** object for every file in the index (a.k.a. the staging area). |
|
|
|
|
A blob object is created by taking the content of the file, prepending a header |
|
|
|
|
and compressing the result. A SHA-1 hash is calculated over the header and |
|
|
|
|
content (before compression) and used to identify the object. The object is stored in the aptly |
|
|
|
|
named object store (found in `.git/objects`). The first 2 characters of the |
|
|
|
|
hash (in hex format) are used as a directory within this store, while the |
|
|
|
|
remaining characters are the filename of the blob object. |
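The blob hash can be reproduced by hand. The sketch below assumes `git` and GNU `sha1sum` are installed and that Git uses its default SHA-1 object format; it hashes the header plus content of a file containing `foo` and a newline, and compares that with what `git hash-object` reports:

```shell
# Assumption: `git` and `sha1sum` are available; default SHA-1 object format.
# The header is "blob <size in bytes>" followed by a NUL byte.
# The content "foo\n" is 4 bytes long.
manual=$(printf 'blob 4\000foo\n' | sha1sum | cut -d' ' -f1)

# Let Git compute the hash of the same content (nothing is written to a repo).
via_git=$(printf 'foo\n' | git hash-object --stdin)

echo "$manual"
echo "$via_git"
```

Both lines should print the same hash. Note that the file name appears nowhere in the input, which is why two files with identical content share one blob.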
|
|
|
|
|
|
|
|
|
It then creates a commit object which points to the top level tree's hash. |
|
|
|
|
|
|
|
|
|
> Note: it of course only really does this for files which were part of the |
|
|
|
|
> staging area. That's the most efficient. Of course if the content of a file |
|
|
|
|
> was changed, its hash will change and thus the tree object it was part of will |
|
|
|
|
> change and its hash will also change and so on until the top level tree |
|
|
|
|
> object. |
|
|
|
|
|
|
|
|
|
As an example, suppose you have the following structure: |
|
|
|
|
Let's look at an example. Create a git repo somewhere and create a file. |
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
. |
|
|
|
|
├── README.md |
|
|
|
|
├── bar |
|
|
|
|
│ ├── bar.md |
|
|
|
|
│ └── baz |
|
|
|
|
│ └── baz.md |
|
|
|
|
└── foo |
|
|
|
|
└── foo.md |
|
|
|
|
git init foo |
|
|
|
|
cd foo |
|
|
|
|
echo "foo" >> foo.txt |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
If you were to commit this structure to git, you will have (simplified): |
|
|
|
|
If you look into your `.git/objects` directory, it will be empty, aside from |
|
|
|
|
two empty subdirectories. Let's create a blob out of this file now. |
|
|
|
|
|
|
|
|
|
- 4 blob objects (README.md, bar.md, foo.md, baz.md) |
|
|
|
|
- 4 tree objects (., ./foo, ./bar and ./bar/baz) |
|
|
|
|
- 1 commit object |
|
|
|
|
```bash |
|
|
|
|
$ git hash-object -w foo.txt |
|
|
|
|
257cc5642cb1a054f08cc83f2d943e56fd3ebe99 |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
In my case: |
|
|
|
|
You will now find an object in the store at: |
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
gael@Aviendha:~/git/tmp$ git commit -m "First commit" |
|
|
|
|
[master (root-commit) 8be3cf0] First commit |
|
|
|
|
4 files changed, 4 insertions(+) |
|
|
|
|
create mode 100644 README.md |
|
|
|
|
create mode 100644 bar/bar.md |
|
|
|
|
create mode 100644 bar/baz/baz.md |
|
|
|
|
create mode 100644 foo/foo.md |
|
|
|
|
gael@Aviendha:~/git/tmp$ find .git/objects/ -type f |
|
|
|
|
.git/objects/52/01cdd884658a103819d66f910ea25ba1dad2e0 |
|
|
|
|
.git/objects/be/e527307ae70706c20eb89f205f444c3bb385e9 |
|
|
|
|
.git/objects/6b/dd34e3e9ab26062ab881adb1024923923b5f8e |
|
|
|
|
.git/objects/8b/e3cf05d01320a124991a8e7c10fe83ec9cd5e3 |
|
|
|
|
$ find .git/objects -type f |
|
|
|
|
.git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99 |
|
|
|
|
.git/objects/57/16ca5987cbf97d6bb54920bea6adde242d87e6 |
|
|
|
|
.git/objects/f9/07d059fcdc9b594c6e14dc0c3826f26ab47832 |
|
|
|
|
.git/objects/e8/45566c06f9bf557d35e8292c37cf05d97a9769 |
|
|
|
|
.git/objects/0c/7d27db1f575263efdcab3dc650f4502a2dbcbf |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
To get the top level tree object, just look at the commit: |
|
|
|
|
Here's an interesting exercise: what happens if you rename the file and create |
|
|
|
|
the blob with the renamed file? |
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
gael@Aviendha:~/git/tmp$ git cat-file -p 8be3cf0 |
|
|
|
|
tree 5201cdd884658a103819d66f910ea25ba1dad2e0 |
|
|
|
|
author Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100 |
|
|
|
|
committer Gaël Depreeuw <gael@depreeuw.dev> 1606569688 +0100 |
|
|
|
|
|
|
|
|
|
First commit |
|
|
|
|
$ mv foo.txt bar.txt |
|
|
|
|
$ git hash-object -w bar.txt |
|
|
|
|
257cc5642cb1a054f08cc83f2d943e56fd3ebe99 |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
And if we look at the tree: |
|
|
|
|
That's right, nothing changed, this makes sense as we're only adding the content |
|
|
|
|
to the object store! So how does Git remember the file names? |
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
gael@Aviendha:~/git/tmp$ git cat-file -p 5201cdd |
|
|
|
|
100644 blob e845566c06f9bf557d35e8292c37cf05d97a9769 README.md |
|
|
|
|
040000 tree f907d059fcdc9b594c6e14dc0c3826f26ab47832 bar |
|
|
|
|
040000 tree 0c7d27db1f575263efdcab3dc650f4502a2dbcbf foo |
|
|
|
|
## Filenames are part of tree objects |
|
|
|
|
|
|
|
|
|
Aside from **blob** objects, Git also creates **tree** objects. You can sort of |
|
|
|
|
compare it to the directories in your worktree, i.e. for each directory in your |
|
|
|
|
worktree, you will have a tree object. A tree object's content looks like: |
|
|
|
|
|
|
|
|
|
```code |
|
|
|
|
<mode> <type> <hash> <name> |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
The contents of `README.md` is: |
|
|
|
|
You can create this one for yourself by doing: |
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
gael@Aviendha:~/git/tmp$ git cat-file -p e845566 |
|
|
|
|
README |
|
|
|
|
$ git update-index --add --cacheinfo 100644 \ |
|
|
|
|
257cc5642cb1a054f08cc83f2d943e56fd3ebe99 foo.txt |
|
|
|
|
$ git write-tree |
|
|
|
|
fcf0be4d7e45f0ef9592682ad68e42270b0366b4 |
|
|
|
|
$ git cat-file -p fcf0be4d7e45f0ef9592682ad68e42270b0366b4 |
|
|
|
|
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 foo.txt |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
So what does this all mean, when we rename a file? |
|
|
|
|
There are 3 different types (that I know of) which can be referred to in a tree |
|
|
|
|
object: blob, tree, commit. Blobs represent file content, trees represent other |
|
|
|
|
trees (i.e. subdirectories) and commits represent submodules (i.e. the commit |
|
|
|
|
at which they are included). A commit is a type of object which is also present |
|
|
|
|
outside the tree objects. They contain the top tree object (representing the |
|
|
|
|
top level of your repository), a link to one or more parent commits and some |
|
|
|
|
meta data (author, commit msg, date, ...). Finally, and for completeness' sake, |
|
|
|
|
there is also an object for annotated tags, which contains the commit it is |
|
|
|
|
pointing to as well as some meta data. |
|
|
|
|
|
|
|
|
|
## Renaming file |
|
|
|
|
## Renaming |
|
|
|
|
|
|
|
|
|
If we're just looking at renaming a file, then the contents of the file will |
|
|
|
|
not change. This means the blob object representing the file does not change. |
|
|
|
|
What does change is: |
|
|
|
|
Armed with the knowledge about trees and blobs, it should be fairly easy to |
|
|
|
|
understand what happens if you rename a file. To make it easier to |
|
|
|
|
understand, consider a simple example: we just rename a file at the top level. |
|
|
|
|
|
|
|
|
|
1. The old file's name is removed from the tree object it belongs to. |
|
|
|
|
2. The new file's name is added to the tree object it belongs to (with the same |
|
|
|
|
hash in this case). |
|
|
|
|
> Note: more complex examples are just more time consuming to explain, but |
|
|
|
|
> not to understand. The same principles apply. |
|
|
|
|
|
|
|
|
|
As such Git is not aware of any name changes. This is why the short answer is: |
|
|
|
|
Git doesn't handle file renames. The repository itself has no notion of this |
|
|
|
|
action. It just has content and a structure for that content. |
|
|
|
|
In case of such a rename, when you commit this rename, your repository will |
|
|
|
|
be impacted as follows: |
|
|
|
|
|
|
|
|
|
- The blob representing the file remains unchanged. |
|
|
|
|
- The top level tree object changes as it now has a different file name. |
|
|
|
|
- The commit object will point to the new tree. (Its parent will point to the |
|
|
|
|
old tree.) |
|
|
|
|
|
|
|
|
|
Nowhere is there any special mention of a rename occurring. Remember, we're just |
|
|
|
|
storing content! As such Git is not aware of any name changes. This is why the |
|
|
|
|
short answer was: Git doesn't handle file renames. The repository itself has no |
|
|
|
|
notion of this action. It just has content and a structure for that content. |
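The effect of a rename on the object store can be sketched with the same plumbing commands used earlier. This assumes `git` is installed and works in a throwaway temporary repository; the file names `foo.txt` and `bar.txt` are only illustrative:

```shell
# Assumption: `git` is installed; everything happens in a temp directory.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "foo" > foo.txt
blob=$(git hash-object -w foo.txt)

# Tree with the blob registered under the old name.
git update-index --add --cacheinfo 100644 "$blob" foo.txt
tree_old=$(git write-tree)

# "Rename": drop the old index entry, add the same blob under a new name.
git update-index --force-remove foo.txt
git update-index --add --cacheinfo 100644 "$blob" bar.txt
tree_new=$(git write-tree)

git ls-tree "$tree_old"   # lists the blob under foo.txt
git ls-tree "$tree_new"   # lists the same blob hash under bar.txt
```

The two tree hashes differ while both trees reference the same blob hash, and neither tree records that one name replaced the other.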
|
|
|
|
|
|
|
|
|
However, that does not mean you lose your history when you rename a file. |
|
|
|
|
|
|
|
|
|
### How to see history of a renamed file |
|
|
|
|
|
|
|
|
|
When you remove and add a file (which is what a rename is for Git), Git will |
|
|
|
|
analyze this and when the files are X% alike (with X defaulting to 50), |
|
|
|
|
it will assume a rename occurred. You can show the log of a file including |
|
|
|
|
renames using: |
|
|
|
|
Git might not store information on renames in its repository but it does come |
|
|
|
|
packed with an algorithm that detects file renames. For every add/delete pair |
|
|
|
|
added to the index, it determines how alike the paired files are. If they are |
|
|
|
|
at least 50% alike, it considers the pair to have been a rename. If there |
|
|
|
|
are multiple possibilities it takes the highest percentage one. If multiple files |
|
|
|
|
have the same percentage, it picks one depending on the implementation. |
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
git log --follow -- <file> |
|
|
|
|
``` |
|
|
|
|
> **Note**: I believe, but am not sure, it basically takes the first |
|
|
|
|
> alphabetical match in the last case. |
|
|
|
|
|
|
|
|
|
If you want to adjust the threshold you can use the `-MX%` option, where X is the |
|
|
|
|
percentage you want (0-100). |
|
|
|
|
By default `git log -- <file>` does not track across renames. If you want to |
|
|
|
|
do see the history across renames, you will need to add the `--follow` option. |
|
|
|
|
|
|
|
|
|
Because there is a percentage threshold, the recommendation is that you do not |
|
|
|
|
combine renaming a file with modifying a file. If files are 100% identical when |
|
|
|
|
adding/removing it makes it much easier to see them as renames. If on the other |
|
|
|
|
hand, you rename a file and start modifying it heavily, Git might not detect |
|
|
|
|
this as a rename, unless you lower the threshold. |
|
|
|
|
You can also define the threshold percentage to be different from 50%. This is |
|
|
|
|
done via the `-M<n>` or `--find-renames=<n>` option. See the git documentation |
|
|
|
|
for the correct syntax. |
|
|
|
|
|
|
|
|
|
You can also turn off rename detection by passing `--no-renames`. |
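The options above can be tried out in a scratch repository. The sketch below assumes `git` is installed; the `commit` helper, the identity passed to it, and the file names are made up for illustration. It commits a file, renames it in a second commit, and then compares the output with and without rename detection:

```shell
# Assumption: `git` is installed; everything happens in a temp directory.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Hypothetical helper: passes an identity inline so no global config is needed.
commit() {
  git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "$1"
}

printf 'line 1\nline 2\nline 3\n' > foo.txt
git add foo.txt
commit "Add foo.txt"

git mv foo.txt bar.txt
commit "Rename foo.txt to bar.txt"

# With rename detection the add/delete pair is reported as one rename (R100).
with_renames=$(git diff --name-status -M HEAD~1 HEAD)
# Without it, the same change shows up as a separate add and delete.
without_renames=$(git diff --name-status --no-renames HEAD~1 HEAD)

# History of bar.txt without and with following the rename.
plain_count=$(git log --oneline -- bar.txt | wc -l)
follow_count=$(git log --oneline --follow -- bar.txt | wc -l)

echo "$with_renames"
echo "$without_renames"
echo "plain: $plain_count, follow: $follow_count"
```

Because the content is untouched, the similarity is 100%; `git log` without `--follow` only sees the commit that introduced `bar.txt`, while `--follow` also reaches the commit that created `foo.txt`.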
|
|
|
|
|
|
|
|
|
### Rename best practice |
|
|
|
|
|
|
|
|
|
Because of the threshold and the cheapness of commits, it is recommended that |
|
|
|
|
when you rename a file/directory, you commit those renames first, before you |
|
|
|
|
continue working on the renamed file. This basically makes it so you can use |
|
|
|
|
a threshold of 100% all the time. |
|
|
|
|
|
|
|
|
|
### Why did Git do it this way |
|
|
|
|
|
|
|
|
|
This is pure speculation, but here are my thoughts on it: |
|
|
|
|
|
|
|
|
|
Filenames are actually part of the underlying file systems, so for a version |
|
|
|
|
control system to support multiple file systems, it has to handle filenames |
|
|
|
|
in its own way. This includes renames. If you think about what this would |
|
|
|
|
require for Git, it would not be very straightforward: Git could have chosen to |
|
|
|
|
provide a command to store rename data, let's say: `git rename fileA fileB`, but |
|
|
|
|
what should this command do? We can imagine it could create a new **rename** |
|
|
|
|
object, which would hold the blob hash and the name of the previous file. Now, |
|
|
|
|
every time you would walk through history, when you encounter this object type, |
|
|
|
|
you would need to remember these redirections. There's probably a lot of little |
|
|
|
|
nuances which are not immediately apparent though and it does not deal with one |
|
|
|
|
of the major drawbacks of this new command: What happens if the user forgets it |
|
|
|
|
and just does `mv fileA fileB`? |
|
|
|
|
|
|
|
|
|
Well, we'd actually want to have some mechanism to detect this as a rename as |
|
|
|
|
once this is committed it becomes more difficult to undo this change |
|
|
|
|
(especially if we already pushed the commit!). So it sure would be nice if Git |
|
|
|
|
could somehow figure out that it was a rename. Which is exactly what they did. |
|
|
|
|
But now that we have this functionality, what actually is the point of the |
|
|
|
|
new command we wanted to implement? This is probably highly subjective, but to |
|
|
|
|
me it seems completely irrelevant now. Instead of having a command which can be |
|
|
|
|
forgotten and for which we need a contingency, just use the contingency as the |
|
|
|
|
solution! It makes the behaviour a lot more consistent! |
|
|
|
|
|
|
|
|
|
### Can I fix my commit if I did change a lot of content after renaming |
|
|
|
|
|
|
|
|
|
First, to prevent this: always check using `git status` whether or not the |
|
|
|
|
rename is being detected. Now, how to solve it? |
|
|
|
|
|
|
|
|
|
It depends. If your commit is local only and it is the last commit, then you can |
|
|
|
|
fix this easily. There are many ways to do it, but a couple of options are: |
|
|
|
|
fix this easily. There are many ways to do it, but one option is: |
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
git mv <newname> <oldname> # Undo the file rename |
|
|
|
@ -184,37 +192,40 @@ git mv <oldname> <newname> # Rename the file |
|
|
|
|
git commit # Commit the rename |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
If you want to rename first and the changes second you can also do this, but |
|
|
|
|
it is a bit more complex: |
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
git reset --soft HEAD~ # Go back one commit, but keep the changes |
|
|
|
|
git restore --staged <oldname> <newname> # unstage the deletion and addition |
|
|
|
|
git restore <oldname> # undelete the old file |
|
|
|
|
mv <newname> <newname.tmp> # make a temp backup of the new file |
|
|
|
|
git mv <oldname> <newname> # Rename the old file |
|
|
|
|
git commit # commit the rename |
|
|
|
|
cp <newname.tmp> <newname> # apply the new changes |
|
|
|
|
git commit -a # Commit the changes |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
If the commit is already a couple of commits ago, you can do the same with an |
|
|
|
|
interactive rebase and amending the commit at the right time. |
|
|
|
|
interactive rebase and doing either of the above at the correct time. |
|
|
|
|
|
|
|
|
|
If you already pushed your commits you will have to check with the team if you |
|
|
|
|
can rewrite the history and push it. If this is not possible, you might need to |
|
|
|
|
find the right threshold to have Git mark it as a rename. |
|
|
|
|
|
|
|
|
|
### Why did Git do it this way |
|
|
|
|
|
|
|
|
|
This is pure speculation but dealing with renames is not as easy as it first |
|
|
|
|
looks. For instance, you could add a git command to do a rename (like Subversion |
|
|
|
|
has), which could create a new type of object, a rename object, which links two |
|
|
|
|
objects (old and new). But what if the user forgets to do this and just uses |
|
|
|
|
`mv fileA fileB` and commits this? Should Git automatically assume this is a |
|
|
|
|
rename? It could use the same threshold discussed earlier to determine so. That |
|
|
|
|
would make it easier. But then what is the point of having a dedicated rename |
|
|
|
|
command? I think for ease of use, they just decided not to add such a command, |
|
|
|
|
because it is not a solution for all instances. Instead, the rename detection |
|
|
|
|
works well enough for everything and they leave it up to the committer to make |
|
|
|
|
sure their renames are detected properly. |
|
|
|
|
|
|
|
|
|
## Summary |
|
|
|
|
|
|
|
|
|
So in summary: no, Git does not store renames in its repository. Instead, it |
|
|
|
|
for every add/delete pair part of a commit, Git will do a likeness analysis and |
|
|
|
|
So in summary: no, Git does not store renames in its repository. Instead, for |
|
|
|
|
every add/delete pair in a commit, Git will do a similarity analysis and |
|
|
|
|
when they are X% alike (default 50%), it will assume a rename occurred. |
|
|
|
|
|
|
|
|
|
Some commands influenced by this are: git log, git diff and git merge. Options |
|
|
|
|
related to renames are: |
|
|
|
|
Some commands influenced by this are: `git log`, `git diff` and `git merge`. |
|
|
|
|
Options related to renames are: |
|
|
|
|
|
|
|
|
|
```txt |
|
|
|
|
-M<n>, --find-renames=<n>    # where n is the threshold percentage. |
|
|
|
|
--no-renames # don't do any rename detection |
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
It is best practice to handle renames in their own commits. Try to avoid |
|
|
|
|
renaming and modifying a file within the same commit. |
|
|
|
|