AnonyViet - English Version

Write a program to delete duplicate files with Python

AnonyViet by AnonyViet
April 15, 2025
in Tips

Are duplicate files eating up space on your drive, but finding and deleting them by hand is too tedious? Continuing the Python series, today I will show you how to delete duplicate files and free up drive space with Python.


Solution

Instead of combing through the drive by hand, you can automate the process with a script: a program that searches the drive recursively and removes every duplicate file it finds.

Operating principle

If we read every file in full and compared it byte by byte with all the remaining files, it would take a lot of time. So how do we do it?

The answer is hashing. A hash function turns a file's content into a fixed-length string of letters and digits that acts as the file's identity. If we find another file with the same identity, we delete it.


There are many different hashing algorithms such as:

  • MD5
  • SHA-1
  • SHA-224, SHA-256, SHA-384 and SHA-512
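As a rough illustration of the list above, hashlib exposes each of these algorithms by name through hashlib.new(). The sample input below is made up for the demonstration; the sketch only shows that the length of the hex digest depends on the algorithm chosen.

```python
import hashlib

data = b"Duplython is amazing"  # sample input, chosen only for illustration

# The hex digest gets longer as the algorithm's output size grows.
digest_lengths = {}
for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512"):
    digest = hashlib.new(name, data).hexdigest()
    digest_lengths[name] = len(digest)
    print(f"{name:>6}: {len(digest)} hex characters")
```

A longer digest makes accidental collisions less likely, at the cost of a little more computation per file.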

Code to delete duplicate files with Python

Hashing in Python is quite simple. We will use hashlib, which ships with Python's standard library.

Here is an example of hashing content with hashlib: we hash a string in Python using the MD5 algorithm.

For example

>>> import hashlib
>>> example_text = "Duplython is amazing".encode('utf-8')
>>> hashlib.md5(example_text).hexdigest()
'73a14f46eadcc04f4e04bec8eb66f2ab'

To explain briefly: import hashlib, call md5() on the encoded bytes to create the hash object, and then call hexdigest() to get the hash as a hexadecimal string.

The example above shows how to hash a string, but for this project we care about files rather than strings. That raises another question.

How do we hash files?

Hashing a file is similar to hashing a string, with one small difference: we first open the file in binary mode and then hash its binary content.


Suppose you have a simple text file named learn.txt in your project directory. This is how we hash it:

>>> import hashlib
>>> file = open('learn.txt', 'rb').read()
>>> hashlib.md5(file).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'

This returns the same hash value for files with the same content, which makes it easy to find and delete duplicate files with Python. Note: files with different names but the same content still produce the same hash value.
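The note above can be checked with a small sketch. It writes three throwaway files into a temporary folder (the file names and the md5_of helper are hypothetical, chosen only for this illustration) and compares their digests.

```python
import hashlib
import os
import tempfile

def md5_of(path):
    """Return the MD5 hex digest of the file at path (read in one go)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; only the contents matter to the hash.
    original = os.path.join(tmp, "learn.txt")
    copy = os.path.join(tmp, "learn (copy).txt")
    other = os.path.join(tmp, "notes.txt")
    contents = (
        (original, b"same content"),
        (copy, b"same content"),
        (other, b"different content"),
    )
    for path, content in contents:
        with open(path, "wb") as f:
            f.write(content)

    same = md5_of(original) == md5_of(copy)        # same bytes, different names
    different = md5_of(original) != md5_of(other)  # different bytes

print(same, different)
```

The file name never enters the hash, so renaming a copy does not hide it from the comparison.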

The challenge arises with fairly large files, which take a while to load. Instead of reading the entire file into memory first, we can compute the hash while reading the file.

To do that, we read the file in blocks of a fixed size and keep updating the hash with each block until the whole file has been hashed. Put simply, we split the file into parts, read them one at a time, and feed each part into the same hash object.

This way we avoid holding the whole file in memory and waiting for it to load before hashing can begin.

For example

>>> import hashlib
>>> block_size = 1024
>>> file_hash = hashlib.md5()
>>> with open('learn.txt', 'rb') as file:
...     block = file.read(block_size)
...     while len(block) > 0:
...         file_hash.update(block)
...         block = file.read(block_size)
...     print(file_hash.hexdigest())
... 
0534cf6d5816c4f1ace48fff75f616c9
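To see that hashing block by block gives the same result as hashing the whole file at once, here is a minimal sketch. The helper name hash_in_blocks and the throwaway sample file are assumptions made for the illustration, not part of the original program.

```python
import hashlib
import os
import tempfile

def hash_in_blocks(path, block_size=1024):
    """Hash a file block by block so it never has to fit in memory at once."""
    file_hash = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            file_hash.update(block)
    return file_hash.hexdigest()

# Build a throwaway file several blocks long and compare both approaches.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"duplicate data\n" * 1000)
    sample_path = tmp.name

try:
    with open(sample_path, "rb") as f:
        whole = hashlib.md5(f.read()).hexdigest()
    chunked = hash_in_blocks(sample_path)
finally:
    os.remove(sample_path)  # clean up the sample file

print(chunked == whole)
```

The block size only affects how much is read per step, not the resulting digest.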

But hashing is only one step. To actually remove the copies, we will use the os module.

We will use the remove() function from the os module to delete duplicate files.


For example:

>>> import os
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'learn.txt', 'app.py', 'README.md']
>>> os.remove('learn.txt')
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'app.py', 'README.md']
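Note that os.remove() raises FileNotFoundError when the file is already gone. The sketch below wraps the call in a hypothetical helper, remove_if_present, and tries it on a scratch file in a temporary folder so nothing real is deleted.

```python
import os
import tempfile

def remove_if_present(path):
    """Delete path and return True, or return False if it does not exist."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "learn.txt")
    with open(target, "w") as f:
        f.write("scratch file")

    first = remove_if_present(target)   # file exists, so it is deleted
    second = remove_if_present(target)  # already gone, nothing to delete
    remaining = os.listdir(tmp)

print(first, second, remaining)
```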

Now that we know how to delete a file with the remove() function, we can start building the application.

How to build the application to delete duplicate files

The necessary libraries:

import time
import os
from hashlib import sha256

I really like object-oriented programming, so in this article I will build the tool as a single class. The code below is just the skeleton of the program.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd(); self.File_hashes = []
        self.Cleaned_dirs = []; self.Total_bytes_saved = 0
        self.block_size = 65536; self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        
    def main(self)->None:
        self.welcome()

if __name__ == '__main__':
    App = Duplython()
    App.main()

This is just the skeleton of the program; when you run it, it only prints the welcome banner to the screen.

$ python3 app.py
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************

----------------        WELCOME        ----------------------------

Cleaning .................

Now we will create a simple function that hashes the file at a given path, using what we learned about hashing above.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd(); self.File_hashes = []
        self.Cleaned_dirs = []; self.Total_bytes_saved = 0
        self.block_size = 65536; self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        
    def generate_hash(self, Filename:str)->str:
        # Hash the file block by block so large files need not fit in memory.
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except OSError:
            # Unreadable file: return False so the caller can skip it.
            return False
        
    def main(self)->None:
        self.welcome()

if __name__ == '__main__':
    App = Duplython()
    App.main()

Implementing the program logic

With the file hashing function in place, we now need to implement the part that compares those hashes and removes any copy it finds.

I will create a simple function called clean(), as shown below.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd(); self.File_hashes = []
        self.Cleaned_dirs = []; self.Total_bytes_saved = 0
        self.block_size = 65536; self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        
    def generate_hash(self, Filename:str)->str:
        # Hash the file block by block so large files need not fit in memory.
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except OSError:
            # Unreadable file: return False so the caller can skip it.
            return False

    def clean(self)->None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files =[file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if not filehash in self.File_hashes:
                    if filehash:                       
                        self.File_hashes.append(filehash)
                        #print(file)
                else:
                    byte_saved = os.path.getsize(file); self.count_cleaned+=1
                    self.Total_bytes_saved+=byte_saved
                    os.remove(file); filename = file.split('/')[-1]
                    print(filename, '.. cleaned ')
            os.chdir(self.home_dir)
        
    def main(self)->None:
        self.welcome();self.clean()

if __name__ == '__main__':
    App = Duplython()
    App.main()
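Because clean() deletes files as it goes, it can help to preview what would be removed first. The sketch below is not the article's class; it is a separate, hypothetical find_duplicates function that applies the same "first hash seen wins" rule but only reports the duplicates. It is tried on a throwaway folder with made-up file names.

```python
import hashlib
import os
import tempfile

def find_duplicates(root, block_size=65536):
    """Return paths under root whose content matches an earlier file."""
    seen = set()
    duplicates = []
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            file_hash = hashlib.sha256()
            try:
                with open(path, "rb") as f:
                    for block in iter(lambda: f.read(block_size), b""):
                        file_hash.update(block)
            except OSError:
                continue  # unreadable file: skip it rather than guess
            digest = file_hash.hexdigest()
            if digest in seen:
                duplicates.append(path)  # content already seen elsewhere
            else:
                seen.add(digest)
    return duplicates

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    samples = {
        "a.txt": b"hello",
        "b.txt": b"world",
        os.path.join("sub", "c.txt"): b"hello",  # same content as a.txt
    }
    for rel, content in samples.items():
        with open(os.path.join(tmp, rel), "wb") as f:
            f.write(content)
    found = [os.path.relpath(p, tmp) for p in find_duplicates(tmp)]

print(found)
```

A set is used here instead of a list because membership checks stay fast as the number of hashed files grows.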

Our program is now nearly complete. The last thing to do is display the results of the cleaning process to the user.

I have created the function cleaning_summary() to do just that: it prints the results of the cleaning process to the screen, which completes the program.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd(); self.File_hashes = []
        self.Cleaned_dirs = []; self.Total_bytes_saved = 0
        self.block_size = 65536; self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except:
            return False

    def clean(self)->None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files =[file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if not filehash in self.File_hashes:
                    if filehash:                       
                        self.File_hashes.append(filehash)
                        #print(file)
                else:
                    byte_saved = os.path.getsize(file); self.count_cleaned+=1
                    self.Total_bytes_saved+=byte_saved
                    os.remove(file); filename = file.split('/')[-1]
                    print(filename, '.. cleaned ')
            os.chdir(self.home_dir)
    
    def cleaning_summary(self)->None:
        mb_saved = self.Total_bytes_saved/1048576
        mb_saved = round(mb_saved, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('File cleaned  : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')
        
    def main(self)->None:
        self.welcome();self.clean();self.cleaning_summary()

if __name__ == '__main__':
    App = Duplython()
    App.main()

Our Python application for deleting duplicate files is now complete. To use it, run it inside the folder you want to clean; it will walk through that folder recursively, find all files, and delete the duplicates.

Result

$ python3 app.py 
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................
0(copy).jpeg .. cleaned 
0 (1)(copy).jpeg .. cleaned 
0 (2)(copy).jpeg .. cleaned 


--------------FINISHED CLEANING ------------
File cleaned  :  3
Total Space saved :  0.38 MB
-----------------------------------------------