Sorting big files with less memory usage

A common problem when sorting data is the big amount of memory used.

Back to the point, I had a 100 MB file with 200 000 000 lines that needed to be sorted using as less memory as possible. Each line of this file was a number randomly picked between 1 and 100 000.

My idea is to split this big file into smaller pieces, sort the pieces separately and merge them after that. This will decrease the memory usage at the expense of disk I/O operations which is an acceptable solution for me.

Firstly, when you are trying to optimize something you need tools that give you statistical information you can compare. For the purpose of this task I used the following tools:

Track disk I/O operations:

iostat -d 1

This will display a continuous device report at one second intervals.
NB! You might need to install sysstat: sudo apt-get install sysstat

Track memory usage:

top -b | grep sort

After all tools for statistics are set I ran the sorting function so I can track its memory usage:

sort big.txt > big_sorted.txt

I got the following peak statistics:

Memory usage: 15.1%
Kilobytes written per second: 44.00
Number of transfers per second: 5

As mentioned before I decided to split the big file into smaller pieces and sort them separately. I hoped the memory usage would decrease. Here’s what I’ve tried using Bash:


#!/bin/bash
big_file="big.txt"
# Create folders if they don't exist
mkdir -p parts/ sorted/
# Split the file into pieces, 5 MB each
split –bytes=5M $big_file parts/
# Sort each small file
FILES=$(cd parts/;ls -N)
for f in $FILES
do
sort "parts/$f" > "sorted/$f.txt"
done
# Merge sorted files
sorted_parts=$(cd sorted/;ls -N)
cd sorted/
sort -m $sorted_parts > ../big_sorted.txt

view raw

gistfile1.txt

hosted with ❤ by GitHub

After running the script I got the following peak statistics:

Memory usage: 1.4%
Kilobytes written per second: 38400.00
Number of transfers per second: 76

The memory usage was decreased with 13.7% points. This rate is sufficient for me to accept this task as completed.