A common problem when sorting data is the large amount of memory it can consume.
I had a 100 MB file with 200 000 000 lines that needed to be sorted using as little memory as possible. Each line of the file was a number picked at random between 1 and 100 000.
My idea was to split this big file into smaller pieces, sort the pieces separately, and merge them afterwards. This decreases memory usage at the expense of extra disk I/O, which was an acceptable trade-off for me.
Firstly, when you are trying to optimize something, you need tools that give you measurable statistics to compare. For this task I used the following tools:
Track disk I/O operations:
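The command itself did not survive the export; given the sysstat note below and the description of its output, it was presumably iostat run with a one-second interval, along these lines:

```shell
# Continuous device utilization report, refreshed every second.
# Press Ctrl+C to stop, or append a count (e.g. `iostat -d 1 10`) to
# print a fixed number of reports and exit.
iostat -d 1
```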
This will display a continuous device report at one second intervals.
NB! You might need to install sysstat: sudo apt-get install sysstat
Track memory usage:
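The original command here was also lost in the export. One hypothetical way to sample the memory share of the running sort process once a second is a ps loop:

```shell
# Hypothetical reconstruction -- the post's original command was not preserved.
# Print the memory share (%) and resident set size (KB) of every running
# `sort` process once per second, until interrupted with Ctrl+C.
while sleep 1; do
    ps -C sort -o %mem=,rss=
done
```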
With the statistics tools in place, I ran the sorting function and tracked its memory usage:
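The exact invocation is missing from the export; the baseline was presumably a single plain sort of the whole file, something like:

```shell
# Sort the entire 100 MB file in one pass (lexicographic order;
# add -n for numeric order). big.txt is the input file from above.
sort big.txt > big_sorted.txt
```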
I got the following peak statistics:
Memory usage: 15.1%
Kilobytes written per second: 44.00
Number of transfers per second: 5
As mentioned before, I decided to split the big file into smaller pieces and sort them separately, hoping the memory usage would decrease. Here is what I tried using Bash:
#!/bin/bash

big_file="big.txt"

# Create folders if they don't exist
mkdir -p parts/ sorted/

# Split the file into pieces, 5 MB each
split --bytes=5M "$big_file" parts/

# Sort each small file
FILES=$(cd parts/; ls -N)
for f in $FILES
do
    sort "parts/$f" > "sorted/$f.txt"
done

# Merge the sorted files
sorted_parts=$(cd sorted/; ls -N)
cd sorted/
sort -m $sorted_parts > ../big_sorted.txt
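Whichever way the pieces are merged, it is worth verifying that the final file really is in order. GNU sort's -c flag checks this without producing any output (use -nc instead if the file was sorted numerically with -n):

```shell
# Exit status 0 means big_sorted.txt is in sorted order; on the first
# out-of-order line, sort -c exits nonzero and reports it.
sort -c big_sorted.txt && echo "big_sorted.txt is sorted"
```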
After running the script I got the following peak statistics:
Memory usage: 1.4%
Kilobytes written per second: 38400.00
Number of transfers per second: 76
The memory usage decreased by 13.7 percentage points. That is enough of an improvement for me to consider the task complete.