Worried about duplicate files eating up space on your drive, but finding and deleting them by hand is too tedious? Continuing the Python series, today I will show you how to delete duplicate files and free up drive space with Python.
Solution
Instead of combing through the drive yourself to delete duplicate files, you can automate the process with a script: a program that recursively walks the drive and removes every duplicate file it finds.
How it works
Reading each file in full and then comparing it against every other file would take far too long. So what do we do instead?
The answer is hashing. A hash turns a file's content into a fixed string of letters and digits that acts as that file's identity; if we find any other file with the same identity, we can delete it.
There are several different hashing algorithms to choose from (a quick comparison follows the list):
- MD5
- SHA-1
- SHA-224, SHA-256, SHA-384 and SHA-512
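All of these are available through Python's hashlib module. As a minimal sketch (the sample string is made up for the demo), you can see how the digest size differs between algorithms:

```python
import hashlib

data = "Duplython is amazing".encode('utf-8')

# Each algorithm maps the same input to a digest of a different fixed length.
for name in ('md5', 'sha1', 'sha256', 'sha512'):
    digest = hashlib.new(name, data).hexdigest()
    print(f'{name}: {len(digest) * 4} bits')
```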
Deleting duplicate files with Python
Hashing in Python is quite simple: we will use hashlib, which ships with Python's standard library.
Here is an example of hashing content with hashlib: we will hash a string in Python using the MD5 algorithm.
For example
```python
>>> import hashlib
>>> example_text = "Duplython is amazing".encode('utf-8')
>>> hashlib.md5(example_text).hexdigest()
'73a14f46eadcc04f4e04bec8eb66f2ab'
```
To explain briefly: you just import hashlib, call the md5() method to create the hash object, and finally call hexdigest() to get the hash string.
The example above shows how to hash a string, but for the project we are about to build we care about files rather than strings, right? That raises another question.
How do we hash files?
Hashing a file is similar to hashing a string, with one small difference: we first need to open the file in binary mode and then hash its binary content.
Hashing a file
Suppose you have a simple text document named learn.txt in your project directory. This is how we hash it:
```python
>>> import hashlib
>>> file = open('learn.txt', 'rb').read()
>>> hashlib.md5(file).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'
```
This returns the same hash value for any files with identical content, which makes it easy to find and delete duplicates with Python. Note: files with different names but the same content still produce the same hash value.
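You can verify this yourself with a quick sketch; the file names a.txt and b.txt are made up for the demo:

```python
import hashlib

# Create two files with different names but identical content.
with open('a.txt', 'w') as f:
    f.write('same content')
with open('b.txt', 'w') as f:
    f.write('same content')

hash_a = hashlib.md5(open('a.txt', 'rb').read()).hexdigest()
hash_b = hashlib.md5(open('b.txt', 'rb').read()).hexdigest()
print(hash_a == hash_b)  # True: identical content gives an identical hash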
The challenge arises with large files: reading one in full takes a while just to load it into memory. So instead of waiting for the entire file, we can compute the hash as we read.
Computing the hash while reading means reading the file in blocks of a fixed size and updating the hash object with each block until we reach the end of the file. Simply put: split the file into parts, read each part, hash it, and fold each result into the running hash.
Working this way saves the time we would otherwise spend waiting for the whole file to be loaded into memory.
For example
```python
>>> import hashlib
>>> block_size = 1024
>>> hash = hashlib.md5()
>>> with open('learn.txt', 'rb') as file:
...     block = file.read(block_size)
...     while len(block) > 0:
...         hash.update(block)
...         block = file.read(block_size)
...     print(hash.hexdigest())
...
0534cf6d5816c4f1ace48fff75f616c9
```
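For reuse, the same loop can be wrapped in a small helper. This is just a sketch of the pattern above; the name file_md5 is made up for illustration:

```python
import hashlib

def file_md5(path: str, block_size: int = 1024) -> str:
    """Hash a file block by block so large files never sit fully in memory."""
    file_hash = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() keeps calling f.read(block_size) until it returns b'' at EOF.
        for block in iter(lambda: f.read(block_size), b''):
            file_hash.update(block)
    return file_hash.hexdigest()

print(file_md5('learn.txt'))
```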
But hashing is only one step; to actually remove the duplicates we will use the os module.
We will use the os.remove() function from the os module to delete duplicate files.
Using the os module to delete learn.txt
For example:
```python
>>> import os
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'learn.txt', 'app.py', 'README.md']
>>> os.remove('learn.txt')
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'app.py', 'README.md']
```
Now that we know how to delete a file with os.remove(), we can start building the application.
Building the app to delete duplicate files
The necessary libraries:
```python
import time
import os
from hashlib import sha256
```
I really like object-oriented programming, so in this article I will build the tool as a single class. The code below is just the skeleton of the program.
```python
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('**************** DUPLYTHON ****************************')
        print('********************************************************************\n\n')
        print('---------------- WELCOME ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def main(self) -> None:
        self.welcome()


if __name__ == '__main__':
    App = Duplython()
    App.main()
```
This is just the program's interface; when you run it, it only prints a greeting to the screen.
```
$ python3 app.py
******************************************************************
**************** DUPLYTHON ****************************
********************************************************************


---------------- WELCOME ----------------------------

Cleaning .................
```
Now we will create a simple function that hashes the file at a given path, using the chunked hashing technique we learned above.
```python
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('**************** DUPLYTHON ****************************')
        print('********************************************************************\n\n')
        print('---------------- WELCOME ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # Unreadable files are skipped rather than crashing the run.
            return False

    def main(self) -> None:
        self.welcome()


if __name__ == '__main__':
    App = Duplython()
    App.main()
```
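If you want to sanity-check generate_hash on its own before wiring up the rest, a quick call works. This assumes the class above is saved as app.py in the same folder as the learn.txt sample from earlier:

```python
from app import Duplython  # hypothetical: assumes the class is saved as app.py

App = Duplython()
digest = App.generate_hash('learn.txt')
print(digest)  # a 64-character sha256 hex string, or False if the file can't be read
```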
Implementing the program logic
After creating the file-hashing function, we need the part that compares those hashes and removes any copy it finds.
I will create a simple function called clean(), as shown below.
```python
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('**************** DUPLYTHON ****************************')
        print('********************************************************************\n\n')
        print('---------------- WELCOME ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            return False

    def clean(self) -> None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if filehash not in self.File_hashes:
                    if filehash:
                        self.File_hashes.append(filehash)
                else:
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    filename = file.split('/')[-1]
                    print(filename, '.. cleaned ')
            os.chdir(self.home_dir)

    def main(self) -> None:
        self.welcome()
        self.clean()


if __name__ == '__main__':
    App = Duplython()
    App.main()
```
Our program is now nearly complete; the last step is to show the user a summary of the cleaning process.
I have created a cleaning_summary() function to do exactly that: print the results of the cleaning run to the screen and finish the program.
```python
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('**************** DUPLYTHON ****************************')
        print('********************************************************************\n\n')
        print('---------------- WELCOME ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            return False

    def clean(self) -> None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if filehash not in self.File_hashes:
                    if filehash:
                        self.File_hashes.append(filehash)
                else:
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    filename = file.split('/')[-1]
                    print(filename, '.. cleaned ')
            os.chdir(self.home_dir)

    def cleaning_summary(self) -> None:
        mb_saved = self.Total_bytes_saved / 1048576
        mb_saved = round(mb_saved, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('File cleaned : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')

    def main(self) -> None:
        self.welcome()
        self.clean()
        self.cleaning_summary()


if __name__ == '__main__':
    App = Duplython()
    App.main()
```
Our Python app for deleting duplicate files is now complete. To use it, run it inside the folder you want to clean; it will walk that folder recursively, find all files, and delete the duplicates.
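One caution: clean() deletes a file the moment it sees a repeated hash, and which copy survives depends on the order os.walk() visits them. If you want to preview what would be removed before deleting anything, here is a minimal dry-run sketch of the same hashing scheme; find_duplicates is a name I made up, it is not part of the class above:

```python
import os
from hashlib import sha256

def find_duplicates(root: str = '.') -> list:
    """Report files whose content duplicates an earlier file, without deleting."""
    seen = set()        # a set gives O(1) membership checks on large trees
    duplicates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            file_hash = sha256()
            try:
                with open(path, 'rb') as f:
                    for block in iter(lambda: f.read(65536), b''):
                        file_hash.update(block)
            except OSError:
                continue  # skip unreadable files instead of crashing
            digest = file_hash.hexdigest()
            if digest in seen:
                duplicates.append(path)
            else:
                seen.add(digest)
    return duplicates

print(find_duplicates('.'))
```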
Result
```
$ python3 app.py
******************************************************************
**************** DUPLYTHON ****************************
********************************************************************


---------------- WELCOME ----------------------------

Cleaning .................
0(copy).jpeg .. cleaned
0 (1)(copy).jpeg .. cleaned
0 (2)(copy).jpeg .. cleaned

--------------FINISHED CLEANING ------------
File cleaned :  3
Total Space saved :  0.38 MB
-----------------------------------------------
```