AnonyViet - English Version

Write a program to Delete Duplicate Files in Python

by AnonyViet
January 31, 2023
in Tips

Are duplicate files eating up space on your drive, while searching for them and deleting them by hand is too tedious? Continuing the Python series, today I will show you how to delete duplicate files and free up disk space using Python.


Solution

Instead of searching the drive for duplicate files by hand, you can automate the process with a script: a program that recursively searches a directory and removes every duplicate it finds.

Working principle

If we read each file in full and compared it with every other file, the search would take a very long time. So what should we do?

The answer is hashing. A hash function turns the content of a file into a fixed-length string of letters and numbers that acts as the file's identity. If we find another file with the same identity, we treat it as a duplicate and delete it.
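To make the idea concrete, here is a minimal sketch that works on in-memory byte strings instead of files. The function name `find_repeats` and the sample data are illustrative assumptions and are not part of the final tool.

```python
import hashlib

def find_repeats(blobs):
    """Return the positions of items whose content was already seen earlier."""
    seen = set()     # digests of the content encountered so far
    repeats = []     # positions of the duplicate items
    for index, blob in enumerate(blobs):
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen:
            repeats.append(index)
        else:
            seen.add(digest)
    return repeats

samples = [b"report", b"photo", b"report", b"notes", b"photo"]
print(find_repeats(samples))  # [2, 4]
```

Each item is hashed once and looked up in a set, so no item is ever compared byte by byte with every other item.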

Delete Duplicate Files in Python

There are many different hashing algorithms such as:

  • md5
  • sha1
  • sha224, sha256, sha384 and sha512
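As a rough illustration of how these algorithms differ, the sketch below prints the digest length each one produces for the same input. The helper name `digest_bits` is an assumption made for this example; `hashlib.new()` is the standard way to pick an algorithm by name.

```python
import hashlib

def digest_bits(name, data=b"Duplython is amazing"):
    """Return the digest length in bits for the named algorithm."""
    # Each hexadecimal character encodes 4 bits.
    return len(hashlib.new(name, data).hexdigest()) * 4

for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512"):
    print(f"{name:>6}: {digest_bits(name)} bits")
```

Longer digests make accidental collisions less likely. md5 and sha1 are also no longer considered safe against deliberately crafted collisions, which is one reason to prefer sha256 when files come from untrusted sources.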

Code to remove duplicate files in Python

Hashing in Python is quite simple. We will use hashlib, which is part of Python's standard library.

Here is an example of hashing content with hashlib: we hash a string using the md5 algorithm.

Example:

>>> import hashlib
>>> example_text = "Duplython is amazing".encode('utf-8')
>>> hashlib.md5(example_text).hexdigest()
'73a14f46eadcc04f4e04bec8eb66f2ab'

A little explanation: you import hashlib, call md5() on the encoded bytes to create a hash object, and then call hexdigest() to get the hash as a hexadecimal string.

The example above hashes a string, but for this project we care about files rather than strings. That raises another question.

How do we Hash files?

Hashing a file is similar to hashing a string, with one difference: we first open the file in binary mode and then hash its bytes.

File hash

Let's say you have a plain text document named learn.txt in your project directory. This is how to hash it.

>>> import hashlib
>>> file = open('learn.txt', 'rb').read()
>>> hashlib.md5(file).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'

Hashing returns the same value for files with the same content, which makes it easy to find and remove duplicate files with Python. Note: files with different names but the same content still produce the same hash value.
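The note about file names can be checked with a short sketch. It writes three throwaway files into a temporary directory; the helper `md5_of` and the file names other than learn.txt are assumptions made for the example.

```python
import hashlib
import os
import tempfile

def md5_of(path):
    """Hash a whole file in one read (fine for small files)."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()

with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "learn.txt")
    second = os.path.join(folder, "copy-of-learn.txt")
    third = os.path.join(folder, "other.txt")
    contents = ((first, b"same content"), (second, b"same content"), (third, b"different"))
    for path, data in contents:
        with open(path, "wb") as handle:
            handle.write(data)
    same_content_matches = md5_of(first) == md5_of(second)
    different_content_matches = md5_of(first) == md5_of(third)

print(same_content_matches)       # True
print(different_content_matches)  # False
```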

A challenge arises with large files: reading the whole file at once loads all of it into memory. Instead, we can compute the hash while reading the file.

To do that, we read the file in blocks of a fixed size and update the hash object with each block until the entire file has been read. Put simply: split the file into parts, read one part at a time, and feed each part into the same hash object.

This way only one block is held in memory at a time, and the final hash is the same as hashing the whole file in one go.
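The block-by-block approach can be sketched as a reusable function. The name `hash_file`, the use of sha256 and the default block size are assumptions for this sketch; the check at the end compares the result with a one-shot hash of the same bytes.

```python
import hashlib
import os
import tempfile

def hash_file(path, block_size=65536):
    """Hash a file block by block so only one block is in memory at a time."""
    file_hash = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            file_hash.update(block)
    return file_hash.hexdigest()

# Write random bytes to a temporary file and hash them both ways.
payload = os.urandom(200_000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    matches = hash_file(tmp.name, block_size=1024) == hashlib.sha256(payload).hexdigest()
finally:
    os.remove(tmp.name)

print(matches)  # True
```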

Example:

>>> import hashlib
>>> block_size = 1024
>>> file_hash = hashlib.md5()
>>> with open('learn.txt', 'rb') as file:
...     block = file.read(block_size)
...     while len(block) > 0:
...         file_hash.update(block)
...         block = file.read(block_size)
...
>>> print(file_hash.hexdigest())
0534cf6d5816c4f1ace48fff75f616c9

But hashing is only one step. To actually delete the duplicates, we will use the os module.

The remove() function in the os module deletes a file.

Use the os module to delete the learn.txt file

Example:

>>> import os
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'learn.txt', 'app.py', 'README.md']
>>> os.remove('learn.txt')
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'app.py', 'README.md']
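Keep in mind that os.remove() deletes the file permanently and raises FileNotFoundError if the path does not exist. A small wrapper can make that explicit; the name `remove_if_present` is an assumption for this sketch, which only touches a temporary directory.

```python
import os
import tempfile

def remove_if_present(path):
    """Delete a file and report whether anything was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "learn.txt")
    with open(target, "w") as handle:
        handle.write("temporary")
    first_try = remove_if_present(target)    # the file exists and is deleted
    second_try = remove_if_present(target)   # the file is already gone

print(first_try, second_try)  # True False
```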

Now that we know how to delete a file with remove(), we can start building the application.

How to create an application to delete duplicate files

Required libraries:

import time
import os
from hashlib import sha256

I love object-oriented programming, so in this article I will build the tool as a single class. The code below is just the skeleton of the program.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        
    def main(self)->None:
        self.welcome()

if __name__ == '__main__':
    App = Duplython()
    App.main()

That is just the skeleton of the program: when you run it, it only prints the welcome message to the screen.

$ python3 app.py
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................

Now we will create a simple function that hashes the file at a given path, using what we learned about hashing above.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        
    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                # Read the file block by block and feed each block to the hash.
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # Unreadable file: an empty string means "no hash available".
            return ''

    def main(self)->None:
        self.welcome()

if __name__ == '__main__':
    App = Duplython()
    App.main()

Implement logic for the program

After generating the file hashes, we need the logic that compares them and deletes any duplicates it finds.

I will create a simple function called clean(), as shown below.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # Unreadable file: an empty string means "no hash available".
            return ''

    def clean(self)->None:
        # Collect every directory below the current one before entering any of them.
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if not filehash:
                    continue  # could not read the file, leave it alone
                if filehash not in self.File_hashes:
                    # First time we see this content: remember its hash.
                    self.File_hashes.append(filehash)
                else:
                    # Same content seen before: delete this copy.
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    print(file, '.. cleaned ')
            os.chdir(self.home_dir)

    def main(self)->None:
        self.welcome()
        self.clean()

if __name__ == '__main__':
    App = Duplython()
    App.main()
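Because clean() deletes files as soon as it finds them, it can help to preview the result first. The following is a hedged sketch of a dry run that only reports groups of identical files. The function name `find_duplicates` and the temporary test files are assumptions and are not part of the Duplython class.

```python
import os
import tempfile
from hashlib import sha256

def find_duplicates(root, block_size=65536):
    """Map each content hash to the paths that share it (nothing is deleted)."""
    by_hash = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            digest = sha256()
            try:
                with open(path, "rb") as handle:
                    for block in iter(lambda: handle.read(block_size), b""):
                        digest.update(block)
            except OSError:
                continue  # unreadable file: skip it
            by_hash.setdefault(digest.hexdigest(), []).append(path)
    # Keep only the hashes shared by more than one file.
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

# Try it on a throwaway directory tree with one pair of identical files.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    layout = (("a.txt", b"x"), (os.path.join("sub", "b.txt"), b"x"), ("c.txt", b"y"))
    for relative, data in layout:
        with open(os.path.join(root, relative), "wb") as handle:
            handle.write(data)
    groups = find_duplicates(root)
    names = sorted(os.path.basename(p) for paths in groups.values() for p in paths)

print(names)  # ['a.txt', 'b.txt']
```

Calling find_duplicates('.') in the directory you plan to clean shows which files share the same content before anything is removed.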

Our program is almost done. The last thing to do is show the user the results of the cleaning process.

I created the function cleaning_summary() to do just that: it prints the results of the cleaning process to the screen, which completes the program.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # Unreadable file: an empty string means "no hash available".
            return ''

    def clean(self)->None:
        # Collect every directory below the current one before entering any of them.
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if not filehash:
                    continue  # could not read the file, leave it alone
                if filehash not in self.File_hashes:
                    # First time we see this content: remember its hash.
                    self.File_hashes.append(filehash)
                else:
                    # Same content seen before: delete this copy.
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    print(file, '.. cleaned ')
            os.chdir(self.home_dir)

    def cleaning_summary(self)->None:
        # 1048576 bytes = 1024 * 1024 bytes = 1 MB
        mb_saved = self.Total_bytes_saved/1048576
        mb_saved = round(mb_saved, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('File cleaned  : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')

    def main(self)->None:
        self.welcome()
        self.clean()
        self.cleaning_summary()

if __name__ == '__main__':
    App = Duplython()
    App.main()
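The summary divides the byte count by 1048576, which is 1024 × 1024, and rounds to two decimals. A tiny sketch of that arithmetic follows; the helper name `bytes_to_mb` and the sample byte counts are made up for illustration.

```python
def bytes_to_mb(total_bytes):
    """Convert a byte count to megabytes (1 MB = 1024 * 1024 bytes here)."""
    return round(total_bytes / 1048576, 2)

print(bytes_to_mb(1572864))  # 1.5
print(bytes_to_mb(400000))   # 0.38
```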

Our Python duplicate file remover is done. To use it, run the script from the directory you want to clean: it will recursively go through that directory and all of its subdirectories, find every file, and delete the duplicates. Of each group of identical files, the first one encountered is kept.

Result

$ python3 app.py 
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................
0(copy).jpeg .. cleaned 
0 (1)(copy).jpeg .. cleaned 
0 (2)(copy).jpeg .. cleaned 


--------------FINISHED CLEANING ------------
File cleaned  :  3
Total Space saved :  0.38 MB
-----------------------------------------------
