A) Introduction
Data growth is one of the issues businesses worry about most: the amount of data they hold increases day by day, so solutions that expand and optimize the storage system are essential. Data Deduplication is a Windows Server feature that scans a volume for duplicate data. It can reclaim a significant amount of disk space without affecting the data itself or how it is retrieved.
- Capacity optimization: finds duplicate data on the volume, processes it and reclaims the wasted space, keeping the volume optimized. It is more space-efficient than older solutions such as Single Instance Storage or NTFS compression.
- Scale and performance: processes roughly 50 MB of data per second and can handle multiple volumes at the same time. Processing runs in the background and does not interfere with data access on the server; it consumes little CPU and RAM, and if the server is under heavy load the processing can stop. Scheduling and file-selection policies (file policies) are supported; a scheduling sketch follows this list.
- Reliability and data integrity: data is continually checked and verified to guarantee integrity, the structure of files is validated, and the most frequently referenced data blocks get extra redundancy.
- Management with familiar tools: configured through the Server Manager interface or Windows PowerShell.
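As a minimal sketch of the scheduling support mentioned above (the schedule name "WeekendOptimize" and the times are illustrative assumptions, not values from this article), an optimization window can be created with the built-in dedup cmdlets:

# Create a weekend optimization window (run in an elevated session)
New-DedupSchedule -Name "WeekendOptimize" -Type Optimization -Days Saturday,Sunday -Start 23:00 -DurationHours 6
# List the schedules now in effect
Get-DedupSchedule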
Data Deduplication is compatible with features such as Failover Clustering, BranchCache, DFS Replication and File Server Resource Manager (FSRM).
Typical space savings after Data Deduplication has scanned and processed a volume, by content type:
Scenario | Content | Space savings
User documents | Documents, photos, music, videos | 30-50%
Deployment shares | Software binaries, cab files, symbols files | 70-80%
Virtualization libraries | Virtual hard disk files | 80-95%
General file share | All of the above | 50-60%
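To estimate savings on your own data before enabling the feature, Windows ships a Deduplication Evaluation Tool (DDPEval.exe) alongside it. A minimal sketch, assuming your test data lives in E:\Data (an illustrative path):

# Estimate deduplication savings for a folder without changing anything
& "$env:SystemRoot\System32\DDPEval.exe" E:\Data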
B) Working principle
Data Deduplication does not simply delete one of two files that have the same size and content. Instead, it splits data into chunks (32-128 KB per chunk), compares those chunks with one another, and identifies the duplicates. For each duplicated chunk it keeps only one copy on the volume and replaces the other occurrences with references to it. The retained chunks are stored in the System Volume Information folder of that volume.
After Data Deduplication is enabled on a volume, it begins filtering and processing duplicate data. In the figure, file abc and file def share the same 3 MB of data blocks (A, B, C). These blocks are moved to a location called the Chunk Store (the store of deduplicated data), and in the structure of file abc and file def the system creates a field called reparse data containing pointers to the shared blocks. This deduplication process does not affect the user's access to the data.
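A minimal sketch of how to observe this from PowerShell: once files have been optimized they carry the ReparsePoint attribute, while their logical size (Length) stays unchanged. The E:\ path matches the lab volume used later in this article:

# List optimized (reparse-point) files on the deduplicated volume
Get-ChildItem E:\ -Recurse -File |
    Where-Object { $_.Attributes -band [IO.FileAttributes]::ReparsePoint } |
    Select-Object FullName, Length, Attributes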
When Data Deduplication is enabled on a drive, the drive contains components such as:
- Unoptimized files: files that are not processed, such as files smaller than 32 KB, encrypted files, files with extended attributes, system state files, SQL Server and Exchange databases, and files that are held open by applications and change frequently.
- Optimized files: files that are processed, such as document files, virtual machine files (VHD), software installation files and files whose content rarely changes. Their structure contains a reparse data field (holding the pointers).
- Chunk store: the repository of deduplicated chunks (see the sketch below).
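The chunk store itself can be inspected with Get-DedupMetadata; a minimal sketch, noting that the exact property names shown may vary slightly between Windows Server versions:

# Report chunk store statistics for volume E:
Get-DedupMetadata -Volume E: |
    Select-Object Volume, DataChunkCount, DataContainerCount, DataChunkAverageSize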
C) The main tasks of Data Deduplication
1. Optimization job
Scans the volume and moves duplicate data into the chunk store, compressing the chunks as it goes. Files that have been scanned are marked as in-policy, i.e. checked and reviewed.
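A minimal sketch of how to see this labeling from PowerShell: Get-DedupStatus reports how many files are in policy (eligible) versus actually optimized, along with the space saved:

# Compare eligible files against optimized files on E:
Get-DedupStatus -Volume E: |
    Select-Object Volume, InPolicyFilesCount, OptimizedFilesCount, SavedSpace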
2. Data Scrubbing Job
Scans and checks the metadata (structure) of the data to make sure the structure is intact and unaltered, and at the same time detects any corrupted data and records it in a log file. That log file is then used to analyze and repair the damaged data. Its features include (a sketch of running a scrubbing pass follows this list):
- Any chunk that is referenced more than 100 times is treated as a hotspot and a redundant copy of it is kept; if the primary copy becomes corrupted, the backup copy is used instead.
- If the volume uses the Storage Spaces mirror feature, Data Deduplication reads from the mirrored copy to retrieve and recover the data when a failure occurs.
- If a chunk becomes corrupted while being processed into the chunk store, that chunk is discarded and a new one is generated in its place.
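A minimal sketch of running a scrubbing pass and then reading its report; the event log channel name below is an assumption based on the Microsoft-Windows-Deduplication channels that ship with the feature:

Start-DedupJob -Volume E: -Type Scrubbing
# Channel name is an assumption; check Event Viewer under
# Applications and Services Logs -> Microsoft -> Windows -> Deduplication
Get-WinEvent -LogName "Microsoft-Windows-Deduplication/Scrubbing" -MaxEvents 20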
3. Garbage collection job
Over time users delete or edit data, so some chunks in the chunk store are no longer referenced by any file on the volume. This job scans the chunk store and deletes those unreferenced chunks, reclaiming their space.
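A minimal sketch: start a garbage collection pass and watch its progress with Get-DedupJob:

Start-DedupJob -Volume E: -Type GarbageCollection
Get-DedupJob   # shows the job type, progress and state while it runs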
D) Data Deduplication lab
Deployment model:
- AD machine: IP address 172.1.1.1/24, domain huypd.com
- FS1 machine: IP address 172.1.1.2/24, joined to the huypd.com domain; this machine will have the Data Deduplication feature installed
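Before starting, a quick prerequisite check on FS1 can confirm connectivity to the AD machine and that the feature is available (a minimal sketch):

Test-Connection 172.1.1.1 -Count 2               # reach the AD machine
Get-WindowsFeature -Name FS-Data-Deduplication   # feature present and installable?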
Steps:
- On FS1, install the Data Deduplication feature: Server Manager -> Add roles and features
- Select the role "File and Storage Services -> File and iSCSI Services -> Data Deduplication"
- In this lab I copied a set of data (music files) to drive E, copying it twice so that duplicates exist
- Open Windows PowerShell -> type "cd e:" and "Get-ChildItem -Recurse" to list all the files on drive E
- After installing the Data Deduplication feature, the next step is to enable it for drive E -> go to Server Manager -> select "File and Storage Services" -> select Volumes -> right-click drive E and select "Configure Data Deduplication"
- Deduplicate files older than: the age (in days) after which files are considered eligible and will be scanned and checked for duplicates
- Custom file extensions to exclude: the file types on drive E that you want deduplication to skip (e.g. here I exclude GHO ghost-image files). Then click Add
- Set deduplication schedule: choose when scanning and deduplication are performed
- Choose a time for scanning and deduplication
- Open Windows PowerShell -> type "cd e:" and "Get-DedupVolume" to check whether Data Deduplication is enabled on this drive
- Since this is the first time configuring it, start the classification and deduplication job manually -> type "Start-DedupJob -Volume E: -Type Optimization -Full"
- After the system finishes scanning, run the command "Get-DedupStatus | fl" to see the results
- At this time, in Server Manager, the parameters and scan results are also displayed.
- Run the scrubbing job with the command "Start-DedupJob -Volume E: -Type Scrubbing -Full"
- Run the garbage collection job with the command "Start-DedupJob -Volume E: -Type GarbageCollection -Full"
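For reference, the whole lab can also be done from PowerShell alone. A consolidated sketch, assuming drive E: and the GHO exclusion from the steps above (run in an elevated session on FS1):

# Install the feature
Install-WindowsFeature -Name FS-Data-Deduplication
# Enable deduplication on drive E:
Enable-DedupVolume -Volume E:
# Same settings as the wizard: 0-day minimum age so the test data is
# eligible immediately, and skip .gho files
Set-DedupVolume -Volume E: -MinimumFileAgeDays 0 -ExcludeFileType gho
# Run the three job types, waiting for each to finish
Start-DedupJob -Volume E: -Type Optimization -Wait
Start-DedupJob -Volume E: -Type Scrubbing -Full -Wait
Start-DedupJob -Volume E: -Type GarbageCollection -Full -Wait
# Review the results
Get-DedupStatus -Volume E: | Format-List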