15

In this answer (How can I remove the first line of a file with sed?) there are two ways to delete the first record in a file:

sed '1d' $file >> headerless.txt

**---------------- OR ----------------**

tail -n +2 $file >> headerless.txt

Personally I think the tail option is cosmetically more pleasing and more readable, but that's probably because I'm sed-challenged.

Which method is fastest?

WinEunuuchs2Unix
  • Not an answer, but a possible consideration is that `sed` is more portable: "+2" for `tail` works fine on Ubuntu, which uses GNU `tail`, but won't work on BSD `tail`. – John N Dec 20 '16 at 21:06
  • @JohnN thanks for sharing `tail`'s lack of cross-platform compatibility. – WinEunuuchs2Unix Dec 20 '16 at 21:16
  • @John N "+2" for tail works fine on my Mac running Sierra, which claims to use the BSD tail command – Nick Sillito Dec 20 '16 at 22:53
  • Urgh, you're quite right - I've just re-run it and this time checked the input, which I should have done the first time. It's POSIX, too. /slinks off, embarrassed. – John N Dec 20 '16 at 23:05
  • @JohnN look at the bright side you got 4 good comment flags for an incorrect comment...LOL – WinEunuuchs2Unix Dec 20 '16 at 23:15
  • @JohnN You're not completely wrong. In the past, UNIX didn't provide the `-n` option, and used the syntax `tail +2 $file`. See https://www.freebsd.org/cgi/man.cgi?query=tail&apropos=0&sektion=0&manpath=Unix+Seventh+Edition&arch=default&format=html It's possible you were thinking of that rather than one of the modern BSDs. – hvd Dec 21 '16 at 13:56
  • Interesting that you think the `tail` one is more readable, I find the `sed` one more understandable, and a more direct translation of the desired behavior (`d`elete `1`) – Kevin Dec 21 '16 at 20:08
  • @Kevin It's just that I've used `tail` before but haven't used `sed` unless copying someone else's instructions. – WinEunuuchs2Unix Dec 21 '16 at 20:12
  • I think it's worth learning sed, it's really quite powerful. – Kevin Dec 21 '16 at 20:20
  • @Kevin Agreed... Indeed you could say I just learned a little more about it in the last 10 minutes thanks to your comment :) – WinEunuuchs2Unix Dec 21 '16 at 20:21
  • There's also `awk 'NR > 1'`. But unless you're working on gigabyte+ files and speed is demonstrably a problem, it's really moot. And if it is a real problem, you should profile options on your setup and data set anyway. – Kevin Dec 21 '16 at 20:22
  • @Kevin You can post `awk` as an answer. The fastest way would be to change the file's starting address to the byte after the first CR/LF and reduce the file's size by *Old Starting Address* - *New Starting Address*. Not sure how the OS would like that though... Especially bad if files have to start on specific 512 byte boundaries. – WinEunuuchs2Unix Dec 21 '16 at 20:31

6 Answers

31

Performance of sed vs. tail to remove the first line of a file

TL;DR

  • sed is very powerful and versatile, but this is what makes it slow, especially for large files with many lines.

  • tail does just one simple thing, but that one it does well and fast, even for bigger files with many lines.

For small and medium-sized files, sed and tail perform similarly fast (or slow, depending on your expectations). However, for larger input files (multiple MBs), the performance difference grows significantly (an order of magnitude for files in the range of hundreds of MBs), with tail clearly outperforming sed.

Experiment

General Preparations:

Our commands to analyze are:

sed '1d' testfile > /dev/null
tail -n +2 testfile > /dev/null

Note that I'm redirecting the output to /dev/null each time to eliminate terminal output or file writes as a performance bottleneck.

Let's set up a RAM disk to eliminate disk I/O as potential bottleneck. I personally have a tmpfs mounted at /tmp so I simply placed my testfile there for this experiment.
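If you don't already have a tmpfs mounted, one way to set one up looks roughly like this (the mount point and size here are just examples, not part of my original setup):

# create a mount point and mount a 1 GiB tmpfs on it (needs root)
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk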

Then I create a random test file containing a specified number of lines $numoflines, with random line lengths and random data, using this command (note that it's definitely not optimal, it becomes really slow for more than about 2M lines, but who cares, it's not the thing we're analyzing):

# every literal 'n' in the base64 stream becomes a newline, giving lines of random length
cat /dev/urandom | base64 -w0 | tr 'n' '\n' | head -n "$numoflines" > testfile

Oh, by the way: my test laptop runs Ubuntu 16.04 (64-bit) on an Intel i5-6200U CPU, just for comparison.

Timing big files:

Setting up a huge testfile:

Running the command above with numoflines=10000000 produced a random file containing 10M lines, occupying a bit over 600 MB - it's quite huge, but let's start with it, because we can:

$ wc -l testfile 
10000000 testfile

$ du -h testfile 
611M    testfile

$ head -n 3 testfile 
qOWrzWppWJxx0e59o2uuvkrfjQbzos8Z0RWcCQPMGFPueRKqoy1mpgjHcSgtsRXLrZ8S4CU8w6O6pxkKa3JbJD7QNyiHb4o95TSKkdTBYs8uUOCRKPu6BbvG
NklpTCRzUgZK
O/lcQwmJXl1CGr5vQAbpM7TRNkx6XusYrO

Perform the timed run with our huge testfile:

Now let's do just a single timed run with both commands first to estimate what order of magnitude we're dealing with.

$ time sed '1d' testfile > /dev/null
real    0m2.104s
user    0m1.944s
sys     0m0.156s

$ time tail -n +2 testfile > /dev/null
real    0m0.181s
user    0m0.044s
sys     0m0.132s

We already see a really clear result for big files: tail is an order of magnitude faster than sed. But just for fun, and to be sure there are no random side effects making a big difference, let's do it 100 times:

$ time for i in {1..100}; do sed '1d' testfile > /dev/null; done
real    3m36.756s
user    3m19.756s
sys     0m15.792s

$ time for i in {1..100}; do tail -n +2 testfile > /dev/null; done
real    0m14.573s
user    0m1.876s
sys     0m12.420s

The conclusion stays the same: sed is inefficient for removing the first line of a big file; tail should be used there.

And yes, I know Bash's loop constructs are slow, but we're only doing relatively few iterations here and the time a plain loop takes is not significant compared to the sed/tail runtimes anyway.
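If you want to convince yourself of that, you can time an empty loop (a trivial sanity check, not part of the benchmark itself):

$ time for i in {1..100}; do :; done

On any reasonable machine this finishes in roughly a millisecond or less, which is negligible next to the multi-second sed/tail runs above.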

Timing small files:

Setting up a small testfile:

Now for completeness, let's look at the more common case that you have a small input file in the kB range. Let's create a random input file with numoflines=100, looking like this:

$ wc -l testfile 
100 testfile

$ du -h testfile 
8,0K    testfile

$ head -n 3 testfile 
tYMWxhi7GqV0DjWd
pemd0y3NgfBK4G4ho/
aItY/8crld2tZvsU5ly

Perform the timed run with our small testfile:

As experience lets us expect the timings for such small files to be in the range of a few milliseconds, let's just do 1000 iterations right away:

$ time for i in {1..1000}; do sed '1d' testfile > /dev/null; done
real    0m7.811s
user    0m0.412s
sys     0m7.020s

$ time for i in {1..1000}; do tail -n +2 testfile > /dev/null; done
real    0m7.485s
user    0m0.292s
sys     0m6.020s

As you can see, the timings are quite similar; there's not much to interpret or wonder about. For small files, both tools are equally well suited.

Byte Commander
  • +1 for answering thank you. I edited the original question (sorry) based upon comment from Serg that `awk` can do this too. My original question was based on the link I found in the first place. After all your hard work please advise if I should remove `awk` as a solution candidate and return focus to original project scope of only `sed` and `tail`. – WinEunuuchs2Unix Dec 20 '16 at 22:58
  • What system is this? On my mac (so BSD tools), testing on /usr/share/dict/words gives me 0.09s for sed and 0.19s for tail (and `awk 'NR > 1'`, interestingly). – Kevin Dec 21 '16 at 20:16
5

Here's another alternative, using just bash builtins and cat:

{ read ; cat > headerless.txt; } < $file

$file is redirected into the { } command grouping. The read simply reads and discards the first line. The rest of the stream is then read by cat, which writes it to the destination file.
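A slightly more defensive variant of the same idea, in case you care about the discarded header line being read verbatim (just a sketch; the variable name is arbitrary):

{ IFS= read -r header; cat > headerless.txt; } < "$file"

IFS= and -r stop read from trimming whitespace and interpreting backslashes; otherwise it behaves like the command above.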

On my Ubuntu 16.04 the performance of this and the tail solution are very similar. I created a largish test file with seq:

$ seq 100000000 > 100M.txt
$ ls -l 100M.txt 
-rw-rw-r-- 1 ubuntu ubuntu 888888898 Dec 20 17:04 100M.txt
$

tail solution:

$ time tail -n +2 100M.txt > headerless.txt

real    0m1.469s
user    0m0.052s
sys 0m0.784s
$ 

cat/brace solution:

$ time { read ; cat > headerless.txt; } < 100M.txt 

real    0m1.877s
user    0m0.000s
sys 0m0.736s
$ 

I only have an Ubuntu VM handy right now though, and I saw significant variation in the timings of both, although they're all in the same ballpark.

Digital Trauma
  • +1 for answer thank you. That's a very interesting solution and I love the braces and right to left reading via bash's hierarchy order. (not sure if I worded that correctly) Is it possible to update your answer with size of input file and timing benchmark results if that's easy enough to do? – WinEunuuchs2Unix Dec 21 '16 at 00:50
  • @WinEunuuchs2Unix Timings added, though they're not very reliable as this is on a VM. I don't have a bare-metal Ubuntu installation handy right now. – Digital Trauma Dec 21 '16 at 01:11
  • I don't think VM vs Bare Metal matters when you are comparing VM to VM anyway. Thanks for the timing proof. I'd probably go with `tail` but still think the `read` option is very cool. – WinEunuuchs2Unix Dec 21 '16 at 01:34
4

Trying it on my system, and prefixing each command with time, I got the following results:

sed:

real    0m0.129s
user    0m0.012s
sys     0m0.000s

and tail:

real    0m0.003s
user    0m0.000s
sys     0m0.000s

which suggests that, on my system at least (an AMD FX 8250 running Ubuntu 16.04), tail is significantly faster. The test file had 10,000 lines with a size of 540 kB and was read from an HDD.

Nick Sillito
  • +1 for answering thank you. In a separate test in AU Chatroom one user showed tail is 10 times faster (2.31 seconds) than sed (21.86 seconds) using a RAMDisk with 61 MB file. I did edit your answer to apply code blocks but you might want to edit it too with the file size you used. – WinEunuuchs2Unix Dec 20 '16 at 21:11
  • @Serg Absolutely fair that this is only an anecdotal answer, and potentially you would get different results with different hardware configurations, different test files etc. – Nick Sillito Dec 20 '16 at 21:47
  • The file not being in the cache when `sed` ran might be a factor in this result, given that's the order you tested them in. – Minix Dec 21 '16 at 00:20
  • what sort of system? As I commented on another post here, on my mac `sed` was about twice as fast. – Kevin Dec 21 '16 at 20:17
1

The top answer didn't take the disk into account, since it writes to > /dev/null.

If you have a large file and don't want to create a temporary duplicate on your disk, try vim -c:

$ cat /dev/urandom | base64 -w0 | tr 'n' '\n'| head -n 10000000 > testfile
$ time sed -i '1d' testfile

real    0m59.053s
user    0m9.625s
sys     0m48.952s

$ cat /dev/urandom | base64 -w0 | tr 'n' '\n'| head -n 10000000 > testfile
$ time vim -e -s testfile -c ':1d' -c ':wq'

real    0m8.259s
user    0m3.640s
sys     0m3.093s

Edit: if the file is larger than the available memory, vim -c doesn't work; it looks like it's not smart enough to load the file incrementally.

1

There is no objective way to say which is better, because sed and tail aren't the only things that run on a system during program execution. A lot of factors, such as disk I/O, network I/O, and CPU interrupts for higher-priority processes, influence how fast your program will run.

Both of them are written in C, so this is not a language issue, but more of an environmental one. For example, I have an SSD, and on my system this takes microseconds, but the same file on a hard drive would take longer, because HDDs are significantly slower. So hardware plays a role in this, too.

There are a few things that you may want to keep in mind when considering which command to choose:

  • What is your purpose? sed is a stream editor for transforming text; tail is for outputting specific lines of text. If you want to deal with lines and only print them out, use tail. If you want to edit the text, use sed (a quick sketch of that distinction follows below this list).
  • tail has a far simpler syntax than sed, so use what you can read yourself and what others can read.
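A purely illustrative sketch of that distinction (the file names here are just placeholders):

# tail: only select lines - here, everything from line 2 onwards
tail -n +2 input.txt > headerless.txt

# sed: edit the stream as it passes through - e.g. drop line 1
# and do a substitution in the same pass
sed -e '1d' -e 's/foo/bar/g' input.txt > cleaned.txt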

Another important factor is the amount of data you're processing. Small files won't give you any performance difference. The picture gets interesting when you're dealing with big files. With a 2 GB BIGFILE.txt, we can see that sed has far more system calls than tail, and runs considerably slower.

bash-4.3$ du -sh BIGFILE.txt 
2.0G    BIGFILE.txt
bash-4.3$ strace -c  sed '1d' ./BIGFILE.txt  > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.38    0.079781           0    517051           read
 40.62    0.054570           0    517042           write
  0.00    0.000000           0        10         1 open
  0.00    0.000000           0        11           close
  0.00    0.000000           0        10           fstat
  0.00    0.000000           0        19           mmap
  0.00    0.000000           0        12           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         7         7 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         2         2 statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.134351               1034177        11 total
bash-4.3$ strace -c  tail  -n +2 ./BIGFILE.txt  > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.30    0.148821           0    517042           write
 37.70    0.090044           0    258525           read
  0.00    0.000000           0         9         3 open
  0.00    0.000000           0         8           close
  0.00    0.000000           0         7           fstat
  0.00    0.000000           0        10           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.238865                775615         7 total
Sergiy Kolodyazhnyy
  • +1 for answering thank you. But I'm not sure this comment is helping me decide which command I should use.... – WinEunuuchs2Unix Dec 20 '16 at 21:13
  • @WinEunuuchs2Unix Well, you asked which command is better, so i am answering exactly that question. Which command to choose, is up to you. If you can read `tail` better than `sed` - use that. I personally would use `python` or `awk` rather than `sed` because it can get complex. Besides, if you are concerned about performance, let's face the reality - you are seeing results in microseconds here. You won't feel difference unless it's a freakin huge file in range of gigabytes that you're trying to read – Sergiy Kolodyazhnyy Dec 20 '16 at 21:16
  • Oh I would appreciate an `awk` answer too :)... My question was based on another AU Q&A (in the link) and there they never mentioned `awk`. I agree the time difference is nominal on small files. I was just trying to develop some good habits. – WinEunuuchs2Unix Dec 20 '16 at 21:18
  • @WinEunuuchs2Unix Sure, here it is: `awk 'NR!=1' input_file.txt`. It gives me the same result, around 150 milliseconds, the same number for both `tail` and `sed`. But again, I am using an SSD, so I'd say it's the hard drive and CPU that matter, not the command. – Sergiy Kolodyazhnyy Dec 20 '16 at 21:22
  • The `awk` version looks the best (most readable). There is no output file and does it in place? Also if you run awk, sed and tail on the same file should consideration be given the kernel may have cached/buffered some of the file stuff in RAM? – WinEunuuchs2Unix Dec 20 '16 at 21:31
  • @WinEunuuchs2Unix You still need output file. Use `> new_file.txt` to redirect. Only `sed` has in-place editing and GNU awk, if I'm not mistaken. There's many `awk` variations, but just use redirection to be consistent and portable. As for kernel caching, I don't know - that's probably deserves a whole different question. – Sergiy Kolodyazhnyy Dec 20 '16 at 21:35
  • @Serg even with only a 60 MB file containing 1M lines, 1000 runs with `sed` take far over 3 minutes, whereas `tail` only needs around 20 seconds. That is not *that* big yet actually, definitely not in the GB range. – Byte Commander Dec 20 '16 at 21:41
  • @ByteCommander Well, aside from the fact that you're looping with `bash`, which itself is slow, in my test it gave 47 seconds for `tail` and 12 minutes for `sed`. Again, I don't think looping is a good test, because you're running a command 1000 times, you're not processing one big file. – Sergiy Kolodyazhnyy Dec 20 '16 at 22:17
  • But even with Bash loops being slow, they're not *that* slow. According to [your link from chat](http://unix.stackexchange.com/a/303167/103151), a Bash `for` loop iteration takes some time in the order of magnitude of 10 µs. For 1000 iterations, this adds up to 10 ms. The total times we're measuring are in the order of magnitude of tens of seconds to minutes - I consider the Bash loop performance loss insignificant here. – Byte Commander Dec 20 '16 at 22:30
  • @Serg it's not true that just because other things can happen at the same time, there's no objective way to tell which is better. If that were true, [timing attacks](https://en.wikipedia.org/wiki/Timing_attack), which must deal with the same sorts of noise you mention, would be unobjective, yet they aren't. There are ways to minimize the impact of noise and to increase the signal strength. One of the major uses of statistics is figuring out objectively what contributes to some effect in spite of noise. – Chai T. Rex Dec 21 '16 at 06:24
0

Other answers show well which option is better for creating a new file with the first line missing. If you want to edit a file in place rather than create a new file, though, I bet ed would be faster, because it shouldn't need to create a new file at all. You'll have to look up how to remove a line with ed, though, as I've only used it once.
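For the record, deleting the first line in place with ed could look roughly like this (a sketch only, not benchmarked here; the file name is a placeholder):

# ed edits its buffer and writes the same file back: delete line 1, write, quit
printf '1d\nw\nq\n' | ed -s file.txt

Note that ed typically holds the whole file in its buffer, so this may not help for files larger than available memory.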

akostadinov