Pario TechnoBlob

A chronological documentation test project, nothing serious, really!

Shell script for removing duplicate files

Posted on 15 May 2007, updated 18 May 2009, by Hans-Henry Jakobsen

The following shell script finds duplicate files (two or more files with the same MD5 sum) and writes a new shell script containing commented-out rm statements for deleting them.
You then edit that script and uncomment the rm lines for the copies you want to delete; the script can’t safely make that choice automatically!

# Relies on GNU extensions (uniq -w / --all-repeated, sed -r).
OUTF=rem-duplicates.sh
echo "#! /bin/sh" > "$OUTF"
find "$@" -type f -print0 |
  xargs -0 -n1 md5sum |
    sort --key=1,1 | uniq -w 32 --all-repeated=separate |
    sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> "$OUTF"
chmod a+x "$OUTF"; ls -l "$OUTF"

Example output (rem-duplicates.sh)

#! /bin/sh
#rm ./gdc2001/113-1303_IMG.JPG
#rm ./reppulilta/gdc2001/113-1303_IMG.JPG

#rm ./lissabon/01-01-2001/108-0883_IMG.JPG
#rm ./kuvat\ reppulilta/lissabon/01-01-2001/108-0883_IMG.JPG

#rm ./gdc2001/113-1328_IMG.JPG
#rm ./kuvat\ reppulilta/gdc2001/113-1328_IMG.JPG
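
Uncommenting lines by hand gets tedious when one directory holds nothing but redundant copies. As a rough sketch, the helper below (enable_rm_under is a made-up name, not part of the original script) prints a copy of the generated script with the rm lines enabled for every path that starts with a given prefix. The prefix has to be written the way it appears in the file, backslashes included. The sketch assumes that no duplicate group lives entirely under that prefix; if one did, every copy in that group would be deleted.

```shell
# Hypothetical helper, not part of the original script.
# Usage: enable_rm_under PREFIX SCRIPT
# Prints SCRIPT with the leading "#" removed from every "#rm" line whose
# path starts with PREFIX. SCRIPT itself is left untouched.
enable_rm_under() {
  # The prefix travels through the environment so that awk does not
  # reinterpret the backslashes in it (as it would with -v).
  RM_PREFIX="#rm $1" awk '
    index($0, ENVIRON["RM_PREFIX"]) == 1 { sub(/^#/, "") }
    { print }
  ' "$2"
}
```

For the example output above, enable_rm_under './kuvat\ reppulilta/' rem-duplicates.sh > rem-selected.sh would enable only the rm lines for the copies under "kuvat reppulilta". Read through rem-selected.sh before running it.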

Explanation

write output script header
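list all files recursively under the directories given as arguments (GNU find falls back to the current directory when none are given)
hand the file names to xargs separated by NUL bytes (-print0 / -0), so spaces and other odd characters arrive intact
calculate the MD5 sum of each file
sort by sum and keep only the lines whose sum occurs more than once, with a blank line between groups
strip off MD5 sums and leave only file names
backslash-escape every character in the file names other than letters, digits, ".", "/", "_" and "-"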
write out commented-out delete commands
make the output script executable and ls -l it
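
To watch the middle steps work without generating any rm commands, the detection part of the pipeline can be wrapped in a small function. This is only a sketch: list_duplicate_groups is a hypothetical name, it assumes GNU find, xargs, md5sum, uniq and sed, and it forces the C locale for sort, which the original script does not do. File names containing newlines or backslashes are not handled.

```shell
# Hypothetical helper, not part of the original script.
# Usage: list_duplicate_groups DIR...
# Prints the names of files that share an MD5 sum, one per line,
# with a blank line between groups of identical files.
list_duplicate_groups() {
  find "$@" -type f -print0 |
    xargs -0 -n1 md5sum |
    LC_ALL=C sort |
    uniq -w 32 --all-repeated=separate |
    # md5sum prints "<32 hex digits>  <name>" (or " *<name>" in binary
    # mode); drop the sum and the two-character separator.
    sed -r 's/^[0-9a-f]{32} [ *]//'
}
```

Running list_duplicate_groups . in a photo directory prints the same groups as the generated script, just without the "#rm" prefix and without the backslash escaping.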