So we noticed that a simple solution based on cat(1) ran faster than one based on dd(1M). We also know that since both work are done within a single process, the ptime(1) accurately shows the actual CPU consumption of each.
For this article, I use a 400MB file to cut down waiting time while remaining significant.
First, lets find out the syscall and read/write profiles of each with the following DTrace(7) script.
syscall:::entry
/progenyof($1) && pid!=$1/
{ @[probefunc, execname, pid] = count();
@["total", execname, pid ] = count();
}
syscall::read:return,
syscall::write:return
/progenyof($1) && pid!=$1/
{ @rw[probefunc, execname, pid] = quantize(arg1); }
END
{ printf("\n%-20s%-10s%10s%10s\n", "SysCall", "Exec", "Pid", "Count");
printa("%-20s%-10s%10d%10@d\n", @);
printa("%-20s%-10s%10d%@d\n", @rw);
}
We then run the commands
dd if=zerofile of=/zones/sadm/zerofile bs=8192
cat zerofile > /zones/sadm/zerofile
and we got:
SysCall Exec Pid Count
close bash 29837 1
creat64 dd 29838 1
exece bash 29837 1
exece bash 29838 1
...
resolvepath cat 29837 5
resolvepath dd 29838 5
sigaction bash 29837 7
sigaction bash 29838 7
mmap64 cat 29837 8
write cat 29837 8
mmap dd 29838 12
mmap cat 29837 12
total bash 29838 20
total bash 29837 23
total cat 29837 67
read dd 29838 8193
write dd 29838 8200
total dd 29838 16444
read cat 29837
value ------------- Distribution ------------- count
0 | 0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
2 | 0
read dd 29838
value ------------- Distribution ------------- count
-1 | 0
0 | 1
1 | 0
.......
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8192
16384 | 0
write cat 29837
value ------------- Distribution ------------- count
4194304 | 0
8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8
16777216 | 0
write dd 29838
value ------------- Distribution ------------- count
0 | 0
1 | 4
2 | 0
4 | 2
8 | 2
16 | 0
...........
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8192
16384 | 0
We got a few interesting observations:
- cat totalled 67 syscalls whereas dd totalled 16444
- cat used mmap(2) instead of read(2).
- cat uses a 8MB write vs. the 8k writes we specified for dd.
- we further confirmed that reads are done at 8MB by tracinging the mmap64(2) calls.
To equal the playing ground, we ptime(1) both at 8MB IO. We got runs as follows:
| Command | System Time: Average/Min/Max |
|---|---|
| dd | 0.5306/0.498/0.538 |
| cat | 0.3866/0.344/0.404 |
So cat(1) is still significantly faster. The main difference between the 2 is mainly the use of mmap(2) vs. read(2). We use the following dtrace script to measure the difference...
syscall::mmap64:entry,
syscall::read:entry,
syscall::write:entry
/progenyof($1) && pid != $1/
{ self->t = vtimestamp; }
syscall::mmap64:return,
syscall::read:return,
syscall::write:return
/ self->t /
{
@[probefunc,pid,execname] = quantize(vtimestamp - self->t);
@total["total", pid, execname] = sum(vtimestamp - self->t);
self->t = 0;
}
We got the following for dd and then cat
read 3164 cat
value ------------- Distribution ------------- count
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
16384 | 0
mmap64 3164 cat
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@ 1
32768 | 0
..........
1048576 | 0
2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7
4194304 | 0
read 3165 dd
value ------------- Distribution ------------- count
4096 | 0
8192 |@@@@ 1
16384 | 0
..........
8388608 | 0
16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7
33554432 |@@@@ 1
67108864 | 0
write 3164 cat
value ------------- Distribution ------------- count
16777216 | 0
33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8
67108864 | 0
write 3165 dd
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@ 3
32768 | 0
65536 |@@@ 1
131072 | 0
............
16777216 | 0
33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8
67108864 | 0
total 3164 cat 391799977
total 3165 dd 536011936
As expected, both write(2) are quite comparable at about 33ms. The main difference comes from mmap64(2) with took ~2ms against the read(2) which took ~16ms. In this example, there are 8 of them, so the difference of about (16-2) * 8 = 112ms sort of explains away the difference between the 2 average timings.

10 comments:
The count for cat's read() is missing.
Oh, unless I read wrongly, does that mean that it did 0 reads? Interesting. I got different results in Linux. Will be blogging about it tomorrow. Which version of cat are you using?
You did not read wrong. cat(1) uses mmap64(2) instead of read(2). That accounted for why it was faster.
Of course, the next logical question is why mmap(2) is faster than read(2). Warning: DTrace Ahead.
Ok, I will blog my findings tomorrow. Apparently, things on Linux are a little different. Btw, what's the version of cat are you using? Which package is it from?
Oh yes, I/O on memory mapping is definitely faster. The test is unnecessary, but it is still interesting to test it out.
One should expect that mmap is faster than read/write. So nothing surprising I think
R
It's not really surprising that mmap(2) is faster than read(2). In fact, the way solaris's VM is laid out, read(2) likely pages in the memory ala mmap mechanism before simulating a stream mechanism. It's more interesting to actually table the exact workings of read(2) vs. mmap(2) within the OS kernel :). Besides, its also fun.
btw, eugene blogged about his findings on Linux.
But? with "cat file1 > file2" isn't the shell also involved?
Anonymous,
Only in the initial setup.
Essentially, the shell opens the file and sets its handler as stdout before exec'ing 'cat'. No more shell involvement afterwards. 'cat' writes to stdout, hence the file handler.
Post a Comment