rdfind - Man Page

finds duplicate files

Examples (TL;DR)

Identify all duplicates in a given directory and output a summary: rdfind -dryrun true path/to/directory
Replace all duplicates with hardlinks: rdfind -makehardlinks true path/to/directory
Replace all duplicates with symlinks/soft links: rdfind -makesymlinks true path/to/directory
Delete all duplicates and do not ignore empty files: rdfind -deleteduplicates true -ignoreempty false path/to/directory

Synopsis

rdfind [ options ] directory1 | file1 [ directory2 | file2 ] ...

Description

rdfind finds duplicate files across and/or within several directories. It calculates checksums only when necessary. rdfind runs in O(N log(N)) time, where N is the number of files.
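One common way to avoid unnecessary checksums is to compare file sizes first, since files of different sizes can never be equal. The sketch below illustrates that idea only; it is not taken from rdfind's source, and the function names candidate_duplicates and checksum_candidates are made up for this example. It assumes GNU find (-printf), GNU xargs (-r), sha1sum, and file names without newlines.

```shell
# Illustrative sketch, not rdfind's implementation.
# Print "size<TAB>path" for every non-empty regular file under $1
# whose size is shared with at least one other file.
candidate_duplicates() {
    find "$1" -type f -size +0c -printf '%s\t%p\n' |
    awk -F '\t' '
        { count[$1]++; size[NR] = $1; line[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++)
                if (count[size[i]] > 1) print line[i]
        }'
}

# Checksum only the candidates; files with a unique size are never read.
checksum_candidates() {
    candidate_duplicates "$1" | cut -f2- | tr '\n' '\0' | xargs -0 -r sha1sum
}
```

Calling checksum_candidates path/to/directory prints one checksum line per candidate; lines sharing a checksum point at files that are very likely equal.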

If two (or more) equal files are found, the program decides which of them is the original and the rest are considered duplicates. This is done by ranking the files against each other and deciding which has the highest rank. See section Ranking for details.

By default, no action is taken besides creating a file with the detected files and showing the possible amount of saved space.

If you need better control over the ranking than given, you can use some preprocessor which sorts the file names in desired order and then run the program using xargs. See examples below for how to use find and xargs in conjunction with rdfind.

To include files or directories whose names start with -, prefix them with ./ (for example, rdfind ./-) so that they are not mistaken for options.
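The preprocessing approach described above can be sketched as follows. The helper name rdfind_sorted is hypothetical, and sorting by name is only one possible ordering. The sketch assumes GNU sort (-z) and an xargs that supports -0. Files listed earlier on the command line are scanned earlier and therefore rank higher.

```shell
# Hypothetical helper, not part of rdfind: pass files in sorted name order.
rdfind_sorted() {
    # $1: directory to search.
    # -print0 / -z / -0 keep file names with spaces or newlines intact.
    # -dryrun true only reports what would be done.
    find "$1" -type f -print0 | sort -z | xargs -0 rdfind -dryrun true
}
```

Note that xargs may split a very long file list into several rdfind invocations, in which case each invocation only sees its own batch of files.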

Ranking

Given two or more equal files, the one with the highest rank is selected as the original and the rest are treated as duplicates. The ranking rules are given below; they are applied in order until an original has been found. Given two files A and B which have equal size and content, the ranking is as follows:

If A was found while scanning an input argument earlier than B, A is higher ranked.

If A was found at a directory depth lower than B, A is higher ranked (A is closer to the root).

If A and B were found while scanning the same input argument and share the same directory depth, the ranking depends on whether deterministic operation is enabled (it is by default; see option -deterministic). If enabled, which one ranks highest is unspecified but deterministic. If disabled, the one that was reported first by the file system is ranked highest.
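The rules above behave like a multi-key sort. The sketch below models them on a made-up input format of "argument-index depth path" lines, one per equal file; both the format and the function name pick_original are assumptions for illustration and do not correspond to anything rdfind reads or writes. Ties are broken by byte order of the whole line, which is one arbitrary but deterministic choice.

```shell
# Illustrative model of the ranking rules, not rdfind's code.
# stdin: lines of "<input argument index> <directory depth> <path>"
# stdout: the path that would be treated as the original
pick_original() {
    # Key 1: earlier input argument wins. Key 2: smaller depth wins.
    # Remaining ties fall back to sort's whole-line comparison.
    LC_ALL=C sort -k1,1n -k2,2n | head -n 1 | cut -d' ' -f3-
}
```

For example, feeding it the three lines "2 0 /b/x", "1 3 /a/d/e/x" and "1 1 /a/x" selects /a/x: it comes from the first input argument and sits closest to the root.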

Options

Searching options etc:

-ignoreempty true|false: Ignore empty files. Setting this to true (the default) is equivalent to -minsize 1, false is equivalent to -minsize 0.
-minsize N: Ignores files with fewer than N bytes. Default is 1, meaning empty files are ignored.
-maxsize N: Ignores files with N bytes or more. Default is 0, which means this check is disabled.
-followsymlinks true|false: Follow symlinks. Default is false.
-removeidentinode true|false: Removes items found which have identical inode and device ID. Default is true.
-checksum md5|sha1|sha256|sha512: The type of checksum to use: md5, sha1, sha256 or sha512. The default is sha1 since version 1.4.0.
-deterministic true|false: If set (the default), sort files of equal rank in an unspecified but deterministic order. This makes the behaviour independent of in which order files are listed when querying the file system.
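Since -minsize takes a byte count, a small wrapper can do the unit conversion. The following is a sketch only; the function name preview_large_duplicates and the choice of sha256 are illustrative assumptions, not recommendations from the rdfind author.

```shell
# Hypothetical wrapper: preview duplicates of at least N MiB.
preview_large_duplicates() {
    # $1: minimum size in MiB (whole number); remaining arguments: paths to scan.
    min_mib=$1
    shift
    # -minsize ignores files with fewer than N bytes, so files of exactly
    # N bytes are still considered. -dryrun true makes no changes.
    rdfind -dryrun true -minsize "$((min_mib * 1024 * 1024))" -checksum sha256 "$@"
}
```

For instance, preview_large_duplicates 2 path/to/directory runs rdfind with -minsize 2097152.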

Action options:

-makesymlinks true|false: Replace duplicate files with symbolic links. Default is false.
-makehardlinks true|false: Replace duplicate files with hard links. Default is false.
-makeresultsfile true|false: Make a results file in the current directory. Default is true. If the file exists, it is overwritten. This does not affect whether items are deleted. See -dryrun for how to disable deletions.
-outputname name: Name the results file "name" instead of the default results.txt.
-deleteduplicates true|false: Delete (unlink) files. Default is false.

General options:

-sleep Xms: Sleeps X milliseconds between reading each file, to reduce load. Default is 0 (no sleep). Note that only a few values are supported at present: 0, 1-5, 10, 25, 50 and 100 milliseconds.
-n, -dryrun true|false: Displays what would have been done without actually deleting or linking anything. Default is false.
-h, -help, --help: Displays a brief help message.
-v, -version, --version: Displays the version number.
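Because -sleep accepts only the values listed above, a script may want to validate the delay before starting a long scan. The sketch below is one way to do that; the function names is_supported_sleep and rdfind_throttled are hypothetical, and the return code 2 is an arbitrary choice.

```shell
# Succeeds only for the delays (in milliseconds) listed under -sleep.
is_supported_sleep() {
    case "$1" in
        0|1|2|3|4|5|10|25|50|100) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: validate the delay, then run rdfind with it.
rdfind_throttled() {
    # $1: delay in milliseconds; remaining arguments are passed to rdfind.
    ms=$1
    shift
    if ! is_supported_sleep "$ms"; then
        echo "unsupported -sleep value: ${ms}ms" >&2
        return 2
    fi
    rdfind -sleep "${ms}ms" "$@"
}
```

A call such as rdfind_throttled 10 -dryrun true path/to/directory would then run rdfind -sleep 10ms -dryrun true path/to/directory.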

Examples

Search for duplicate files in the home directory and a backup directory: rdfind ~ /mnt/backup
Delete duplicates in a backup directory: rdfind -deleteduplicates true /mnt/backup
Search for duplicate files in directories called foo: find . -type d -name foo -print0 | xargs -0 rdfind

Files

results.txt (the default name, which can be changed with option -outputname, see above): The results file contains one row per duplicate file found, along with a header row explaining the columns. Each row includes a text describing why the file is considered a duplicate:

DUPTYPE_UNKNOWN: some internal error.

DUPTYPE_FIRST_OCCURRENCE: the file that is considered to be the original.

DUPTYPE_WITHIN_SAME_TREE: files in the same tree (found when processing the directory in the same input argument as the original).

DUPTYPE_OUTSIDE_TREE: the file was found while processing a different input argument than the original.
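A results file can be summarized with standard text tools. The sketch below counts rows per duplicate type. It assumes that header and comment lines start with # and that the duplicate type is the first whitespace-separated field of each row; neither detail is stated in this page, so check the header row of your own results file first. The function name count_by_duptype is hypothetical.

```shell
# Count rows per duplicate type in a results file (default: results.txt).
count_by_duptype() {
    # Skip lines starting with "#" and blank lines; tally the first field.
    awk '!/^#/ && NF { n[$1]++ } END { for (t in n) print t, n[t] }' \
        "${1:-results.txt}" | LC_ALL=C sort
}
```

Running count_by_duptype after a -dryrun true pass gives a quick overview of how many files would be affected.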

Environment

Diagnostics

Exit Values

0 on success, nonzero otherwise.

Bugs/Features

When specifying the same directory twice, it keeps the first encountered as the most important (original), and the rest as duplicates. This might not be what you want.

The -makesymlinks option creates absolute links. This might not be what you want. To create relative links instead, you may use the symlinks command, which is able to convert absolute links to relative links.
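Where that tool is not available, a single link can also be rewritten by hand. The following sketch assumes GNU coreutils (readlink, and ln with the -r option); the function name make_link_relative is hypothetical and is not something rdfind provides.

```shell
# Rewrite one absolute symbolic link as a relative one (GNU ln -r assumed).
make_link_relative() {
    # $1: path to an existing symbolic link
    [ -L "$1" ] || return 1
    target=$(readlink -- "$1") || return 1
    case "$target" in
        /*) ;;            # absolute target: convert it below
        *) return 0 ;;    # already relative: leave untouched
    esac
    # -s symbolic, -f replace the existing link, -n do not follow the
    # existing link, -r store the target relative to the link's location
    ln -sfnr -- "$target" "$1"
}
```

It could be applied to every link in a tree with something like find path/to/directory -type l, one link at a time.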

Older versions unfortunately contained a misspelling of the word occurrence. This is now corrected (since 1.3), which might affect user scripts parsing the output file written by rdfind.
