Are you worried about the many duplicate files on your drive eating up storage space, but finding that manually searching for and deleting them is too tedious? Continuing the Python series, today I will show you how to delete duplicate files and free up disk space using Python.
Solution
Instead of searching the drive for duplicate files by hand, you can automate the process with a script: a program that recursively scans the drive and removes every duplicate it finds.
Working principle
If we read each file in its entirety and then compare it against every other file, it will take a very long time. So what should we do?
The answer is hashing. With hashing we can generate a fixed sequence of letters and numbers that acts as the identity of a given file; if we find any other file with the same identity, we delete it.
There are many different hashing algorithms to choose from, such as the following (a short snippet after this list compares their digests):
- md5
- sha1
- sha224, sha256, sha384 and sha512
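To make the list concrete, here is a minimal sketch (using only the standard hashlib module) that hashes the same input with several of these algorithms. Note how the digest length grows with the stronger algorithms:

import hashlib

text = "Duplython is amazing".encode('utf-8')

# Same input, different algorithms: each digest has a different fixed length
for name in ('md5', 'sha1', 'sha256', 'sha512'):
    digest = hashlib.new(name, text).hexdigest()
    print(name, len(digest), digest)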
Code to remove duplicate files in Python
Hashing in Python is quite simple; we will use hashlib, which ships with Python's standard library.
Here is an example of hashing content with hashlib: we will hash a string using the md5 algorithm.
Example:
>>> import hashlib
>>> example_text = "Duplython is amazing".encode('utf-8')
>>> hashlib.md5(example_text).hexdigest()
'73a14f46eadcc04f4e04bec8eb66f2ab'
A little explanation: you import hashlib, use the md5() method to create a hash object, and finally call hexdigest() to get the hash as a string.
The example above showed how to hash a string, but for the upcoming project we should be more concerned with files than strings, right? That raises another question:
How do we hash files?
Hashing a file is similar to hashing a string, with one slight difference: we first open the file in binary mode and then hash its binary content.
File hash
Let’s say you have a plain-text document named learn.txt in your project directory. This is how we hash it:
>>> import hashlib
>>> file = open('learn.txt', 'rb').read()
>>> hashlib.md5(file).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'
Files with identical content produce identical hash values, which makes duplicates easy to find and remove with Python. Note that two files with different names but the same content will still return the same hash value.
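To see this for yourself, here is a minimal sketch (the file names copy_a.txt and copy_b.txt are just hypothetical examples) that writes the same content to two differently named files and confirms their hashes match:

import hashlib

# Two hypothetical files: different names, identical content
for name in ('copy_a.txt', 'copy_b.txt'):
    with open(name, 'w') as f:
        f.write('Duplython is amazing')

hashes = [hashlib.md5(open(name, 'rb').read()).hexdigest()
          for name in ('copy_a.txt', 'copy_b.txt')]
print(hashes[0] == hashes[1])  # True: same content, same hash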
The challenge arises when we try to read a rather large file, which takes a while to load. Instead of waiting for the entire file to be read into memory, we can compute the hash while reading the file.
Computing the hash while reading requires us to read the file in blocks of a fixed size and keep updating the hash until the whole file has been processed. Simply put: split the file into chunks, read one chunk at a time, and feed each chunk into the running hash.
Doing it this way saves us from having to hold the entire file in memory before hashing can even start.
Example:
>>> import hashlib
>>> block_size = 1024
>>> hash = hashlib.md5()
>>> with open('learn.txt', 'rb') as file:
...     block = file.read(block_size)
...     while len(block) > 0:
...         hash.update(block)
...         block = file.read(block_size)
...     print(hash.hexdigest())
...
0534cf6d5816c4f1ace48fff75f616c9
But hashing only identifies the duplicates; to actually delete them we will use the remove() function from the os module.
Using the os module to delete the learn.txt file:
>>> import os
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'learn.txt', 'app.py', 'README.md']
>>> os.remove('learn.txt')
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'app.py', 'README.md']
Now that we can hash files and delete them with remove(), we can start building the application.
How to create an application to delete duplicate files
Required libraries:
import time
import os
from hashlib import sha256
I love object-oriented programming, so in this article I will build the tool as a single class. The code below is just the skeleton of the program.
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON        ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def main(self) -> None:
        self.welcome()


if __name__ == '__main__':
    App = Duplython()
    App.main()
That’s just the shell of the program; when you run it, it only prints the welcome message to the screen.
$ python3 app.py
******************************************************************
****************        DUPLYTHON        ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................
Now we will create a simple function that hashes a file at a given path, using the hashing technique we learned above.
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON        ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # unreadable file (permissions, broken link, ...): skip it
            return False

    def main(self) -> None:
        self.welcome()


if __name__ == '__main__':
    App = Duplython()
    App.main()
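As a quick sanity check (assuming the class above is saved in app.py and learn.txt still exists; missing.txt is a hypothetical non-existent file), hashing the same file twice returns the same digest, while an unreadable path returns False thanks to the except clause:

>>> cleaner = Duplython()
>>> cleaner.generate_hash('learn.txt') == cleaner.generate_hash('learn.txt')
True
>>> cleaner.generate_hash('missing.txt')
False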
Implementing the program logic
After generating the file hashes, we need the logic that compares those hashes and discards any duplicates it finds.
I will create a simple function called clean() as shown below.
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON        ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # unreadable file: skip it
            return False

    def clean(self) -> None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if filehash and filehash not in self.File_hashes:
                    # first file with this content: remember its hash
                    self.File_hashes.append(filehash)
                elif filehash:
                    # duplicate content: record the space saved and delete it
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    print(file, '.. cleaned ')
            # return home so the relative paths from os.walk stay valid
            os.chdir(self.home_dir)

    def main(self) -> None:
        self.welcome()
        self.clean()


if __name__ == '__main__':
    App = Duplython()
    App.main()
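A note on the design: clean() changes into each directory so that os.listdir() and os.remove() can work with bare file names, then changes back to home_dir after each pass so the relative paths produced by os.walk('.') stay valid. If you would rather not change directories at all, here is a hypothetical equivalent sketch (not part of the final program) that builds full paths with os.path.join instead:

import os
from hashlib import sha256

seen = set()  # hashes of files we have already kept
for dirpath, dirnames, filenames in os.walk('.'):
    for name in filenames:
        full_path = os.path.join(dirpath, name)
        file_hash = sha256()
        try:
            with open(full_path, 'rb') as f:
                # read in 64 KiB blocks, updating the hash as we go
                for block in iter(lambda: f.read(65536), b''):
                    file_hash.update(block)
        except OSError:
            continue  # unreadable file: skip it
        digest = file_hash.hexdigest()
        if digest in seen:
            os.remove(full_path)   # duplicate content: delete it
            print(name, '.. cleaned')
        else:
            seen.add(digest)       # first copy of this content: keep it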
Now that our program is almost done, the last thing to do is to show the results of the cleaning process to the user.
I created a function, cleaning_summary(), just for this: it prints the results of the cleaning process to the screen, completing the program.
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON        ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # unreadable file: skip it
            return False

    def clean(self) -> None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if filehash and filehash not in self.File_hashes:
                    # first file with this content: remember its hash
                    self.File_hashes.append(filehash)
                elif filehash:
                    # duplicate content: record the space saved and delete it
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    print(file, '.. cleaned ')
            # return home so the relative paths from os.walk stay valid
            os.chdir(self.home_dir)

    def cleaning_summary(self) -> None:
        mb_saved = self.Total_bytes_saved / 1048576   # bytes -> megabytes
        mb_saved = round(mb_saved, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('File cleaned  : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')

    def main(self) -> None:
        self.welcome()
        self.clean()
        self.cleaning_summary()


if __name__ == '__main__':
    App = Duplython()
    App.main()
Our Python duplicate-file remover is now complete. To use it, run it from the specific directory you want to clean; it will recursively walk through that directory, hash every file it finds, and delete the duplicates.
Result
$ python3 app.py
******************************************************************
****************        DUPLYTHON        ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................
0(copy).jpeg .. cleaned 
0 (1)(copy).jpeg .. cleaned 
0 (2)(copy).jpeg .. cleaned 

--------------FINISHED CLEANING ------------
File cleaned  :  3
Total Space saved :  0.38 MB
-----------------------------------------------