Author Topic: bash script to find duplicate files - recursive  (Read 514 times)

Offline travisN000

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1758
bash script to find duplicate files - recursive
« on: February 18, 2013, 03:49:25 PM »
I need a script to search through my audio collection and find duplicate files (after trying various tools to "automatically" clean up my library, I now have many copies of almost everything and I'm running out of hard drive space).

It's been a while since I've done any code hacking, so I figured I would see if anyone has a quick answer for me.

What I have in mind is a recursive find command that lists every file under /share/Audio with a string like (1).m in its name (the "1" could be any integer), then checks whether the same file exists in the same directory without the (#).m in its name. If the original exists, the full path/filename of the file with (#).m in its name is written to a text list for my review, which I could later feed into an rm command.

Simple enough, right?  ::) ;D

As I work out the script I'll post my progress; if anyone can save me the effort, it would be greatly appreciated.  Here is the general idea:


..I'm off to see what I can re-learn about the find command, regular expressions, substitution, etc.

thanks for any help!!
« Last Edit: February 20, 2013, 01:11:56 PM by travisN000 »

Online Just17

  • PCLinuxOS Tester
  • Super Villain
  • *******
  • Posts: 10601
  • MLUs Forever!
Re: bash script to find duplicate files - recursive
« Reply #1 on: February 18, 2013, 04:32:39 PM »
... wondering if you would need to test for more than just the name of the file ....  size maybe .... in case a part copy exists ......
MLUs rule the roost!

Linux XPS 3.2.18-pclos2.pae.bfs  32 bit
Intel Core2 Quad CPU Q9450 @ 2.66GHz
4 GB RAM
MCP51 High Def Audio
GeForce GTX 550 Ti
PHILIPS DVD+-RW DVD8701
Logitech BT Mini-Receiver
Afatech DTT

Offline travisN000

Re: bash script to find duplicate files - recursive
« Reply #2 on: February 18, 2013, 04:41:35 PM »
From what I can tell after going through a hundred or so directories by hand, all duplicates match the pattern I described, but it doesn't hurt to check file size also.
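
Size alone can still be fooled by two different files that happen to be the same length, so a byte-for-byte comparison is a stricter test. A minimal sketch, assuming GNU `stat` and `cmp` are available; the function name `same_content` is made up for illustration and is not part of any script in this thread:

```shell
#!/bin/bash
# same_content ORIG DUP  (hypothetical helper name)
# Succeeds only when both files exist, have the same size, and are byte-identical.
same_content() {
  local orig=$1 dup=$2
  [ -f "$orig" ] && [ -f "$dup" ] || return 1
  # cheap check first: files of different size can never be identical
  [ "$(stat -c%s "$orig")" -eq "$(stat -c%s "$dup")" ] || return 1
  # cmp -s prints nothing and exits 0 only when the contents match
  cmp -s -- "$orig" "$dup"
}
```

It could be used as `same_content "Song.mp3" "Song (1).mp3" && echo identical`. The size test runs first because it is much cheaper than reading both files.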

The following command seems to find the duplicate files based on my original idea:

Code: [Select]
find . -iname "*([0-9]).m*"
..now I need to figure out how to check if the original file is still there (same path/filename without " (#).m"). This should be possible with a regex, but I don't know enough about them to do this off the top of my head.

I know I could use the cut command to chop out the undesired part of the filename and then piece it back together, and use this to check whether the original exists, but I was hoping for something more elegant (i.e. a regex).
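
One option that needs neither cut nor a regex is bash parameter expansion, which can strip a suffix matching a glob pattern. A small sketch; the function name `orig_name_for` is hypothetical, and like the find pattern above it only recognises a single digit, so " (10)" and higher would be left alone:

```shell
#!/bin/bash
# orig_name_for DUP  (hypothetical helper name)
# Prints the name DUP would have without its " (N)" copy marker.
orig_name_for() {
  local dup=$1
  local ext=${dup##*.}   # text after the last dot, e.g. "mp3"
  local stem=${dup%.*}   # everything before the last dot
  # drop a trailing " (N)" from the stem, where N is a single digit
  printf '%s.%s\n' "${stem% ([0-9])}" "$ext"
}
```

For example, `orig_name_for "/share/Audio/Artist/Song (1).mp3"` prints `/share/Audio/Artist/Song.mp3`, and the result can be tested with `[ -f ... ]` to see whether the original is still there.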

EDIT:

..to give an idea of the size of the problem and why I want to automate..
Code: [Select]
find . -iname "*\ ([0-9]).m*"| wc -l
..returns 2845 duplicate files!
« Last Edit: February 18, 2013, 04:53:00 PM by travisN000 »

Offline travisN000

Re: bash script to find duplicate files - recursive
« Reply #3 on: February 18, 2013, 06:26:03 PM »
I have a partial solution, which I added in my original post.  It only looks for .mp3 files; I will have to modify it and re-run it for the .m4a files in my library.  I did include a check for file size; at the moment it does NOT delete the files -- it just outputs the list to stdout.
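
For readers who want the same kind of list-only pass, here is a rough sketch of the idea. It is not the script referred to above; the function name `list_dups` is hypothetical, and GNU `find` and `stat` are assumed:

```shell
#!/bin/bash
# list_dups DIR EXT  (hypothetical helper name)
# Prints, one per line, every "name (N).EXT" under DIR whose "name.EXT"
# sibling exists with the same size. Nothing is deleted.
list_dups() {
  local dir=$1 ext=$2 dup orig
  while IFS= read -r -d '' dup; do
    # strip " (N).ext" and put the file's own extension back
    orig="${dup% ([0-9]).*}.${dup##*.}"
    [ -f "$orig" ] || continue
    [ "$(stat -c%s "$orig")" -eq "$(stat -c%s "$dup")" ] || continue
    printf '%s\n' "$dup"
  done < <(find "$dir" -type f -iname "* ([0-9]).$ext" -print0)
}
```

Something like `list_dups /share/Audio mp3 > dups_to_review.txt` would write the review list (the file name is only an example). After checking it by hand, and assuming no file name contains a newline, GNU xargs can feed it to rm: `xargs -d '\n' rm -v -- < dups_to_review.txt`.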

 ;D
« Last Edit: February 20, 2013, 01:12:27 PM by travisN000 »

Offline travisN000

Re: bash script to find duplicate files - recursive
« Reply #4 on: February 20, 2013, 11:30:22 AM »
Here is my working solution; as posted, it asks the user on every "match" whether to "rm" (remove/delete) the duplicate file.  There is also a sleep of a few seconds built into the script, so that after answering y/n at the rm prompt the user has time to terminate the script with Ctrl-C.

Like all things, use at your own risk.. I make NO promises to its fitness for this or any other purpose :D

For my own use I kept the prompts for the first twenty or so files to make sure it was working as expected, then I removed the prompts to fully automate the process.

I also plan on re-running it for m4a, etc, then modifying it slightly to clean up those files whose file size doesn't match (delete the smaller / lower bitrate file and if needed remove the number from the name of the duplicate).
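
Instead of editing the extension and re-running, a single find can cover several extensions at once. A sketch only, assuming GNU find; the function name `find_numbered_copies` is made up for illustration:

```shell
#!/bin/bash
# find_numbered_copies DIR EXT...  (hypothetical helper name)
# Prints NUL-separated paths of "name (N).EXT" files for every EXT given.
find_numbered_copies() {
  local dir=$1 ext
  shift
  [ "$#" -gt 0 ] || return 2   # at least one extension is required
  local args=()
  for ext in "$@"; do
    # builds: -iname "* (N).mp3" -o -iname "* (N).m4a" ...
    if [ "${#args[@]}" -gt 0 ]; then
      args+=(-o)
    fi
    args+=(-iname "* ([0-9]).$ext")
  done
  find "$dir" -type f \( "${args[@]}" \) -print0
}
```

For example, `find_numbered_copies /share/Audio mp3 m4a | xargs -0 ls -l` would list the numbered copies of both types. The output is NUL-separated so that names with spaces pass through xargs intact.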

Here is what I currently have:
« Last Edit: February 20, 2013, 01:13:07 PM by travisN000 »

Online Just17

Re: bash script to find duplicate files - recursive
« Reply #5 on: February 20, 2013, 11:42:04 AM »
Looks like you got there  :D

congrats!  ;)


Offline travisN000

Re: bash script to find duplicate files - recursive
« Reply #6 on: February 20, 2013, 12:58:44 PM »
thanks!

One more update..this one handles files of matched and mismatched size (keeping the largest):

Code: [Select]
#!/bin/bash

file_ext="m4a"
base_dir="/share/Audio/MusicBrainz_OUT"
dup_match_log="${file_ext}_dup_match.log"
dup_NO_match_log="${file_ext}_dup_NOT_match.log"

# start each run with empty log files
: > "$dup_match_log"
: > "$dup_NO_match_log"

count_name_match=0
counter=0

echo -e "\nWorking...\n\n"

# collect the numbered copies first; NUL separators keep names with spaces
# intact, and stdin stays free for the rm/mv prompts below
# (mapfile -d needs bash 4.4 or newer)
mapfile -d '' -t dup_files < <(find "$base_dir" -iname "*\ ([0-9]).$file_ext" -print0 2>/dev/null)

for dup_file in "${dup_files[@]}"; do
  # strip " (N).ext" and re-attach the file's own extension, so names that
  # -iname matched in a different case (e.g. .M4A) still resolve correctly
  orig_file="${dup_file% ([0-9]).*}.${dup_file##*.}"
  count_name_match=$((count_name_match+1))

  # without this check a missing original makes stat fail and the tests below error out
  if [ ! -f "$orig_file" ]; then
    echo -e "-ERROR- original not found\n$dup_file\n"
    continue
  fi

  orig_size=$(stat -c%s "$orig_file")
  dup_size=$(stat -c%s "$dup_file")

  if [ "$orig_size" -eq "$dup_size" ]; then
    # if size DOES match
    counter=$((counter+1))
    echo -e "-MATCH-\n$orig_size $orig_file\n$dup_size $dup_file\n"
    echo -e "$orig_size $orig_file\n$dup_size $dup_file\n" >> "$dup_match_log"
    # prompt to remove duplicate file with 'rm -iv' option
    rm -vi "$dup_file"
    echo; sleep 4
  elif [ "$orig_size" -gt "$dup_size" ]; then
    echo -e "-original -gt- duplicate-\n$orig_size $orig_file\n$dup_size $dup_file\n"
    echo -e "$orig_size $orig_file\n$dup_size $dup_file\n" >> "$dup_NO_match_log"
    # prompt to remove duplicate file with 'rm -iv' option
    rm -vi "$dup_file"
    echo; sleep 4
  elif [ "$orig_size" -lt "$dup_size" ] && [ "$orig_size" -ne 0 ]; then
    echo -e "-duplicate -gt- original-\n$orig_size $orig_file\n$dup_size $dup_file\n"
    echo -e "$orig_size $orig_file\n$dup_size $dup_file\n" >> "$dup_NO_match_log"
    # prompt to overwrite original file with 'mv -iv' option
    mv -vi "$dup_file" "$orig_file"
    echo; sleep 4
  else
    # reached when the original is an empty (0-byte) file
    echo -e "-ERROR-\n$orig_size $orig_file\n$dup_size $dup_file\n"
  fi

done

# print the totals and append them to both logs
summary="Total files matched by name: $count_name_match
Files matched name & size: $counter"
echo "$summary" | tee -a "$dup_match_log" "$dup_NO_match_log"

« Last Edit: February 20, 2013, 01:07:20 PM by travisN000 »