18.5 Duplicate Files
20230603
A common challenge is finding duplicate files, such as photos, music, or documents. When available disk space becomes tight, removing duplicates is a good place to start a clean-up.
A simple trick to find duplicates is to calculate an MD5 signature, or hash, for each file and then to use that signature to find other files with the same content. This works because the mapping from file contents to signature is, in practice, unique: different files almost always have different signatures. Collisions are possible, though, so a matching signature alone is strong evidence rather than proof.
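The signature idea can be sketched in a few lines of Python. This is only an illustration using the standard hashlib module, not how any particular tool is implemented; the function names md5_of and group_by_md5 and the 64 KiB chunk size are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(root):
    """Map each MD5 digest to the files under root that share it."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[md5_of(path)].append(path)
    # Keep only digests shared by two or more files: the candidate duplicates.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Calling group_by_md5(".") returns a dictionary whose values are lists of files with identical signatures. Every file is read in full, which is simple but slow on large collections.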
The fdupes package provides the fdupes command that incorporates the use of the MD5 signature within a more thorough pipeline to guarantee the files are duplicates. The pipeline for checking for duplicate files begins with a file size comparison, followed by a partial MD5 signature comparison, a full MD5 signature comparison, and finally a byte-by-byte comparison.
With the --delete option fdupes will begin an interactive session to list all duplicated files in the given directory, such as the current directory (written as . on the command line). With the --recurse option duplicates are searched for in the current directory and below. The interactive session will list all duplicates and provide options for their resolution. Running fdupes --delete --recurse . combines the two and is a quick and effective way to reduce duplicated files.
The interactive session will look something like the following example:
Set 40 of 1919:
[+] ./Camera/PXL_20230307_025903772.MP.jpg
[-] ./Camera/20230307_155903_00.jpg
Set 41 of 1919:
[ ] ./Camera/20210812_151633_00.jpg
[ ] ./Camera/PXL_20210812_051633163.PORTRAIT.jpg
Set 42 of 1919:
[ ] ./Camera/20200516_141257_00.jpg
[ ] ./Camera/IMG_20200516_141257.jpg
( Preserve files [1 - 2, all, help] ):
Ready Set 41 of 1919
You can choose to preserve the first of the duplicated files by entering 1, the second by 2, or preserve all files. In the above example 1 was typed followed by Enter. When a selection has been made you can type prune to perform the actions, which will delete the files marked for pruning with the - symbol.
The interactive session of fdupes provides quite a comprehensive set of commands to mark duplicates automatically. The functionality can be reviewed through the help command.
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0