AnonyViet - English Version

Write a program to Delete Duplicate Files in Python

by AnonyViet
January 31, 2023
in Tips

Are you worried that duplicate files are taking up space on your drive, but find searching for them and deleting them by hand too tedious? Continuing the Python series, today I will show you how to delete duplicate files and free up disk space using Python.


Solution

Instead of searching the drive for duplicate files by hand, you can automate the process with a script: a program that recursively walks a directory and removes every duplicate file it finds.

Working principle

If we read every file in full and compared it byte by byte with every other file, the search would take a very long time. So what should we do?

The answer is hashing. A hash function turns the contents of a file into a fixed-length string of letters and digits that acts as the file's identity. If we find another file with the same identity, we treat it as a duplicate and delete it.
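As a small illustration of this idea, the sketch below hashes three byte strings. The helper name fingerprint() and the sample data are made up for this example; only hashlib.sha256 comes from the standard library.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a bytes object (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


original = b"quarterly report, final version"
copy = b"quarterly report, final version"
edited = b"quarterly report, final version!"

# Identical content gives an identical identity ...
same = fingerprint(original) == fingerprint(copy)
# ... while a one-character change gives a different one.
different = fingerprint(original) != fingerprint(edited)

print(same, different)
```

Comparing two short identities is much cheaper than comparing two whole files, which is why the rest of the program works with hashes.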


There are many different hashing algorithms such as:

  • md5
  • sha1
  • sha224, sha256, sha384 and sha512
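The algorithms listed above mainly differ in the length of the hash they produce. A minimal sketch that prints the digest size of each one, using hashlib.new() to select the algorithm by name (the variable names are chosen for this example):

```python
import hashlib

names = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")

# digest_size is given in bytes, so multiply by 8 to get bits.
bits = {name: hashlib.new(name).digest_size * 8 for name in names}

for name, size in bits.items():
    print(f"{name:7} {size} bits")
```

A longer digest makes an accidental collision between two different files less likely, which is one reason the finished program further down uses sha256 rather than md5.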

Code to remove duplicate files in Python

Hashing in Python is quite simple: we will use hashlib, which ships with Python’s standard library.

Here is an example of hashing content with hashlib. We hash a string in Python using the md5 algorithm.

Example:

>>> import hashlib
>>> example_text = "Duplython is amazing".encode('utf-8')
>>> hashlib.md5(example_text).hexdigest()
'73a14f46eadcc04f4e04bec8eb66f2ab'

A little explanation: you import hashlib, call md5() on the encoded bytes to create the hash object, and finally call hexdigest() to get the hash as a hexadecimal string.

The example above shows how to hash a string, but for this project we care about files rather than strings. That raises another question.

How do we Hash files?

Hashing a file is similar to hashing a string, with a slight difference: we first need to open the file in binary mode and then hash its bytes.

File hash

Let’s say you have a plain text document named learn.txt in your project directory. This is how we hash it.

>>> import hashlib
>>> file = open('learn.txt', 'rb').read()
>>> hashlib.md5(file).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'

This returns the same hash value for files with the same content, which makes it easy to find and remove duplicate files with Python. Note: files with different names but the same content still produce the same hash value.
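To see that the file name plays no role, here is a minimal sketch that writes the same bytes to two differently named files in a temporary directory and compares their hashes. The helper md5_of_file() and the file name notes_backup.txt are assumptions made for this example.

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Hash the full contents of a file with md5 (hypothetical helper)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "learn.txt")
    second = os.path.join(tmp, "notes_backup.txt")
    # Write identical content under two different names.
    for path in (first, second):
        with open(path, "wb") as f:
            f.write(b"same content, different name\n")
    hashes_match = md5_of_file(first) == md5_of_file(second)

print(hashes_match)
```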

The challenge arises with a large file: reading all of it at once takes a while and holds the whole file in memory. Instead of waiting for the entire file to load, we can compute the hash while reading it.

To do this, we read the file in blocks of a fixed size and update the hash with each block until the whole file has been processed. Put simply: split the file into parts, read one part at a time, and feed each part into the same hash object.

Doing it this way keeps memory use low, because only one block is in memory at a time, and the result is identical to hashing the whole file at once.

Example:

>>> import hashlib
>>> block_size = 1024
>>> hasher = hashlib.md5()
>>> with open('learn.txt', 'rb') as file:
...     block = file.read(block_size)
...     while len(block) > 0:
...         hasher.update(block)
...         block = file.read(block_size)
...
>>> print(hasher.hexdigest())
0534cf6d5816c4f1ace48fff75f616c9
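The same block-by-block loop can be wrapped in a reusable function. The sketch below is one possible version, not the code used later in the article; hash_file() and sample.bin are names invented for this example. It also checks that the chunked hash matches the hash of the whole content.

```python
import hashlib
import os
import tempfile


def hash_file(path, block_size=1024):
    """Hash a file block by block so only one block is in memory at a time."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    data = os.urandom(5000)  # deliberately not a multiple of block_size
    with open(path, "wb") as f:
        f.write(data)
    chunked = hash_file(path)
    whole = hashlib.md5(data).hexdigest()

print(chunked == whole)
```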

But hashing is only one step. To actually delete the duplicates, we will use the remove() function from the os module.

For example, to delete the learn.txt file with the os module:

>>> import os
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'learn.txt', 'app.py', 'README.md']
>>> os.remove('learn.txt')
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'app.py', 'README.md']
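os.remove() raises an exception when the file is missing or cannot be deleted, so a script that deletes many files may want to catch that. A minimal sketch, assuming a hypothetical wrapper named remove_file() and a throwaway file in a temporary directory:

```python
import os
import tempfile


def remove_file(path):
    """Delete path and return True, or return False if it could not be deleted."""
    try:
        os.remove(path)
        return True
    except OSError as error:  # includes FileNotFoundError and PermissionError
        print(f"could not remove {path}: {error}")
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "learn.txt")
    with open(target, "w") as f:
        f.write("temporary file\n")
    first_try = remove_file(target)   # the file exists, so it is deleted
    second_try = remove_file(target)  # the file is already gone

print(first_try, second_try)
```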

Now that we know how to remove a file with remove(), we can start building the application.

How to create an application to delete duplicate files

Required libraries:

import time
import os
from hashlib import sha256

I love object-oriented programming, so in this article I will build the tool as a single class. The code below is just the skeleton of the program.

import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()    # directory the script is run from
        self.File_hashes = set()       # hashes of the files seen so far
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536        # read files 64 KiB at a time
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def main(self) -> None:
        self.welcome()


if __name__ == '__main__':
    App = Duplython()
    App.main()

That is just the skeleton of the program: when you run it, it only prints the welcome message to the screen.

$ python3 app.py
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................

Now we will create a simple method that hashes the file at a given path, using what we learned about hashing above.

import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()    # directory the script is run from
        self.File_hashes = set()       # hashes of the files seen so far
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536        # read files 64 KiB at a time
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        # Hash the file block by block; return '' if it cannot be read.
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            return ''

    def main(self) -> None:
        self.welcome()


if __name__ == '__main__':
    App = Duplython()
    App.main()

Implement logic for the program

With the file hashes available, we have to implement the logic that compares them and deletes any duplicates found.

I will create a simple method called clean(), as shown below.

import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()    # directory the script is run from
        self.File_hashes = set()       # hashes of the files seen so far
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536        # read files 64 KiB at a time
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        # Hash the file block by block; return '' if it cannot be read.
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            return ''

    def clean(self) -> None:
        # Walk every directory below the starting directory.
        for dirpath, _dirnames, filenames in os.walk(self.home_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path):
                    continue
                filehash = self.generate_hash(path)
                if not filehash:
                    continue    # unreadable file: leave it alone
                if filehash not in self.File_hashes:
                    # First file with this content: remember it and keep it.
                    self.File_hashes.add(filehash)
                else:
                    # Same content seen before: count the size, then delete.
                    self.Total_bytes_saved += os.path.getsize(path)
                    self.count_cleaned += 1
                    os.remove(path)
                    print(name, '.. cleaned ')

    def main(self) -> None:
        self.welcome()
        self.clean()


if __name__ == '__main__':
    App = Duplython()
    App.main()
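Because clean() deletes a file as soon as its hash has been seen before, it can be useful to preview what would be removed first. The sketch below is a dry run under that assumption: find_duplicates() is a hypothetical helper, not part of the Duplython class, and it only reports duplicates without deleting anything.

```python
import os
import tempfile
from hashlib import sha256


def find_duplicates(root, block_size=65536):
    """Return (duplicate_path, kept_path) pairs found below root. Deletes nothing."""
    seen = {}        # hash -> first path found with that content
    duplicates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):  # sorted only to make the order predictable
            path = os.path.join(dirpath, name)
            hasher = sha256()
            try:
                with open(path, "rb") as f:
                    for block in iter(lambda: f.read(block_size), b""):
                        hasher.update(block)
            except OSError:
                continue  # skip files that cannot be read
            digest = hasher.hexdigest()
            if digest in seen:
                duplicates.append((path, seen[digest]))
            else:
                seen[digest] = path
    return duplicates


# Try it on three small files, two of which have the same content.
with tempfile.TemporaryDirectory() as tmp:
    for name, content in [("a.txt", b"hello"), ("b.txt", b"hello"), ("c.txt", b"world")]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(content)
    found = find_duplicates(tmp)
    report = [(os.path.basename(dup), os.path.basename(kept)) for dup, kept in found]

print(report)
```

Here b.txt is reported as a duplicate of a.txt, and c.txt is left out because its content is different.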

Now that our program is almost done, the last thing to do is to show the results of the cleaning process to the user.

I created the method cleaning_summary() for this. It prints the results of the cleaning process to the screen and completes the program.

import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()    # directory the script is run from
        self.File_hashes = set()       # hashes of the files seen so far
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536        # read files 64 KiB at a time
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        # Hash the file block by block; return '' if it cannot be read.
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            return ''

    def clean(self) -> None:
        # Walk every directory below the starting directory.
        for dirpath, _dirnames, filenames in os.walk(self.home_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path):
                    continue
                filehash = self.generate_hash(path)
                if not filehash:
                    continue    # unreadable file: leave it alone
                if filehash not in self.File_hashes:
                    # First file with this content: remember it and keep it.
                    self.File_hashes.add(filehash)
                else:
                    # Same content seen before: count the size, then delete.
                    self.Total_bytes_saved += os.path.getsize(path)
                    self.count_cleaned += 1
                    os.remove(path)
                    print(name, '.. cleaned ')

    def cleaning_summary(self) -> None:
        # 1048576 bytes = 1024 * 1024 = 1 MB
        mb_saved = round(self.Total_bytes_saved / 1048576, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('File cleaned  : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')

    def main(self) -> None:
        self.welcome()
        self.clean()
        self.cleaning_summary()


if __name__ == '__main__':
    App = Duplython()
    App.main()
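The megabyte figure printed by cleaning_summary() comes from dividing the byte count by 1048576 (1024 × 1024) and rounding to two decimals. A tiny worked sketch of that arithmetic, with bytes_to_mb() as a hypothetical stand-alone helper and made-up byte counts:

```python
def bytes_to_mb(total_bytes):
    """Convert a byte count to megabytes (1 MB = 1024 * 1024 bytes), rounded to 2 places."""
    return round(total_bytes / 1048576, 2)


print(bytes_to_mb(1048576))  # exactly 1 MB -> 1.0
print(bytes_to_mb(1572864))  # 1.5 * 1048576 -> 1.5
print(bytes_to_mb(398459))   # about 0.38 MB -> 0.38
```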

Our Python duplicate file remover is done. To use it, run the script from the directory you want to clean; it will recursively go through that directory and its subdirectories, find all files, and delete the duplicates.

Result

$ python3 app.py 
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................
0(copy).jpeg .. cleaned 
0 (1)(copy).jpeg .. cleaned 
0 (2)(copy).jpeg .. cleaned 


--------------FINISHED CLEANING ------------
File cleaned  :  3
Total Space saved :  0.38 MB
-----------------------------------------------
