Shell script for removing duplicate files
The following shell script finds duplicate files (two or more with identical content) and writes a new shell script containing commented-out rm commands for deleting them.
You then edit that file to choose which copies to keep; the script cannot safely make that choice automatically!
OUTF=rem-duplicates.sh
echo "#! /bin/sh" > $OUTF
find "$@" -type f -print0 \
  | xargs -0 -n1 md5sum \
  | sort --key=1,32 \
  | uniq -w 32 -d --all-repeated=separate \
  | sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' \
  >> $OUTF
chmod a+x $OUTF
ls -l $OUTF
Example output (rem-duplicates.sh)
#! /bin/sh
#rm ./gdc2001/113-1303_IMG.JPG
#rm ./reppulilta/gdc2001/113-1303_IMG.JPG

#rm ./lissabon/01-01-2001/108-0883_IMG.JPG
#rm ./kuvat\ reppulilta/lissabon/01-01-2001/108-0883_IMG.JPG

#rm ./gdc2001/113-1328_IMG.JPG
#rm ./kuvat\ reppulilta/gdc2001/113-1328_IMG.JPG
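To finish the job, you uncomment the rm lines for the copies you want removed and run the script. If a whole subtree is disposable, a pattern-based edit can do the uncommenting for you. The snippet below is only an illustration: it recreates a fragment of the example output first, and it assumes GNU sed for the -i flag.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# Recreate a fragment of the example output shown above.
cat > rem-duplicates.sh <<'EOF'
#! /bin/sh
#rm ./gdc2001/113-1303_IMG.JPG
#rm ./kuvat\ reppulilta/gdc2001/113-1303_IMG.JPG
EOF

# Uncomment only the rm lines whose path starts with ./kuvat\ reppulilta/
sed -i 's|^#\(rm \./kuvat\\ reppulilta/\)|\1|' rem-duplicates.sh
cat rem-duplicates.sh   # inspect before actually running it with sh
```

Running `sh rem-duplicates.sh` afterwards would then delete only the copies under ./kuvat\ reppulilta, leaving one copy of each photo in place.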
Explanation
- write output script header
- list all files recursively under current directory
- pass the NUL-separated file names safely through xargs -0, so spaces and other odd characters cannot break the pipeline
- calculate MD5 sums
- find duplicate sums
- strip off MD5 sums and leave only file names
- escape shell metacharacters in the file names
- write out commented-out delete commands
- make the output script executable and ls -l it
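The steps above can be exercised end to end in a scratch directory. The snippet below is a sketch, not part of the original recipe: the file and directory names are invented for the demo, and GNU versions of md5sum, sort, uniq, and sed are assumed (as the script itself requires).

```shell
#!/bin/sh
set -e
DEMO=$(mktemp -d)
cd "$DEMO"

# Two identical files and one unique file (names invented for the demo).
mkdir a b
printf 'same content\n' > a/photo.jpg
printf 'same content\n' > b/photo.jpg   # identical to a/photo.jpg
printf 'different\n'    > a/unique.txt  # has no duplicate

# The pipeline from the article; the output script itself is excluded
# from the scan since here it lives inside the searched directory.
OUTF=rem-duplicates.sh
echo "#! /bin/sh" > $OUTF
find . -type f ! -name "$OUTF" -print0 | xargs -0 -n1 md5sum \
  | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate \
  | sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' \
  >> $OUTF
chmod a+x $OUTF
cat $OUTF   # the photo.jpg pair should be the only group listed
```

Only the two identical photo.jpg files end up in rem-duplicates.sh, each behind a commented-out rm; the unique file is never mentioned.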