9

When grep or sed are used with the option --extended-regexp and the pattern {1,9999} is a part of the regexp that is used, the performance of these commands become low. To be more clear, below are applied few tests.[1] [2]

  • The relative performance of grep -E, egrep and sed -E is almost equal, so only the test that were made with grep -E are provided.

Test 1

$ time grep -E '[0-9]{1,99}' < /dev/null

real    0m0.002s

Test 2

$ time grep -E '[0-9]{1,9999}' < /dev/null

> real    0m0.494s

Test 3

$ time grep -E '[0123456789]{1,9999}' < /dev/null

> real    21m43.947s

Test 4

$ time grep -E '[0123456789]+' < /dev/null
$ time grep -E '[0123456789]*' < /dev/null
$ time grep -E '[0123456789]{1,}' < /dev/null
$ time grep -P '[0123456789]{1,9999}' < /dev/null

real    0m0.002s       

What is the reason of this significant difference of the performance?

pa4080
  • 29,351
  • 10
  • 85
  • 161
  • 3
    That's an interesting observation - my guess is you'd need to dig deep into grep's internals to find exactly how it's building the parse tree (it would be interesting to compare `[0-9]+` as well) – steeldriver Sep 29 '17 at 15:49
  • I think it doesn't goes to `grep` itself, it's shell related parse for `{...}`. – αғsнιη Sep 29 '17 at 16:09
  • 3
    The input doesn't matter. As @steeldriver suggests, the slowdown *precedes* matching. A simpler test is `time grep -E '[0-9]{1,99}' – Eliah Kagan Sep 29 '17 at 16:13
  • 3
    @αғsнιη No, the `{` `}` are [`'` `'` quoted](https://www.gnu.org/software/bash/manual/bash.html#Single-Quotes); the shell passes them unchanged to `grep`. Anyway, `{1,9999}` would be a very fast and simple [brace expansion](https://www.gnu.org/software/bash/manual/bash.html#Brace-Expansion). The shell would just expand it to `1 9999`. – Eliah Kagan Sep 29 '17 at 16:13
  • 2
    I guess the one-line answer is that `grep` has to build a much deeper parse tree for an explicit repetition range such as `[0-9]{n,m}` than for a generic quantifier such as `[0-9]+` or `[0-9]*` or even `[0-9]{1,}` - have a look at the source code, in particular `lib/regcomp.c` – steeldriver Sep 29 '17 at 16:14
  • @steeldriver That makes sense, but why is `grep -E 'x{1,9999}' – Eliah Kagan Sep 29 '17 at 16:26
  • @EliahKagan But OP 3rd Update with `sed` confirm it doesn't `grep` only performance, so I doubt it's can be shell related performance. – αғsнιη Sep 29 '17 at 16:31
  • 4
    @αғsнιη I don't know quite what you mean, but this definitely has nothing to do with the shell. During a long-running command, I used `ps` and `top` to verify `grep` was passed the expected arguments and that it, not `bash`, consumes lots of RAM and CPU. I expect `grep` and `sed` both use the [POSIX regex functions](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/regex.h.html) implemented in [libc](https://www.gnu.org/software/libc/) for BRE/ERE matching; I shouldn't really have talked about `grep` design specifically, except insofar as the `grep` developers chose to use that library. – Eliah Kagan Sep 29 '17 at 16:39
  • 3
    I suggest that you replace the tests with `time grep ... < /dev/null`, so that people don't conflate the actual problem with the data fed to `grep` and other extraneous things. – muru Sep 29 '17 at 19:00
  • 1
    … and change the title as its not only `grep`, but also `sed` at least! – dessert Sep 29 '17 at 19:01
  • 1
    Let's continue the discussion in this chatroom: https://chat.stackexchange.com/rooms/66406/room-for-dessert-and-eliah-kagan – dessert Sep 29 '17 at 19:28
  • Hi, @muru, where I can read some explanation about `< /dev/null`? I see it is working, but for me is unclear what portion of data is redirected to the command, what actually happens? – pa4080 Oct 02 '17 at 10:28
  • 1
    @pa4080 that's the point: there's no data. grep reads nothing, so timing is almost entirely grep's startup – muru Oct 02 '17 at 10:43
  • @EliahKagan you can confirm what libraries are used with `ldd $(which grep) $(which awk)`, no need to guess. – pbhj Nov 09 '17 at 22:00
  • @pbhj I don't follow. Are you proposing that I run `ldd` on `grep` and `awk`, notice that both binaries link to `libc.so.6`, and consider that to "confirm" that POSIX regex functions from `` were used? `ldd` tells you what shared libraries a binary links to, not what it uses from each of them, nor even *that* it uses anything from them. – Eliah Kagan Nov 09 '17 at 22:10

1 Answers1

10

Note that it's not the matching that takes time, but the building of the RE. You'll find that it uses quite a lot of RAM as well:

$ valgrind grep -Eo '[0-9]{1,9999}' < /dev/null
==6518== HEAP SUMMARY:
==6518==     in use at exit: 1,603,530,656 bytes in 60,013 blocks
==6518==   total heap usage: 123,613 allocs, 63,600 frees, 1,612,381,621 bytes allocated
$ valgrind grep -Eo '[0-9]{1,99}' < /dev/null
==6578==     in use at exit: 242,028 bytes in 613 blocks
==6578==   total heap usage: 1,459 allocs, 846 frees, 362,387 bytes allocated
$ valgrind grep -Eo '[0-9]{1,999}' < /dev/null
==6594== HEAP SUMMARY:
==6594==     in use at exit: 16,429,496 bytes in 6,013 blocks
==6594==   total heap usage: 12,586 allocs, 6,573 frees, 17,378,572 bytes allocated

The number of allocs seems roughly proportional to the number of iterations, but the memory allocated seems to grow exponentially.

That's down to how GNU regexps are implemented. If you compile GNU grep with CPPFLAGS=-DDEBUG ./configure && make, and run those commands, you'll see the exponential effect in action. Going deeper than that would mean going through a lot of theory on DFA and dive into the gnulib regexp implementation.

Here, you can use PCREs instead that doesn't seem to have the same problem: grep -Po '[0-9]{1,65535}' (the maximum, though you can always do things like [0-9](?:[0-9]{0,10000}){100} for 1 to 1,000,001 repetitions) doesn't take more time nor memory than grep -Po '[0-9]{1,2}'.

Stéphane Chazelas
  • 1,356
  • 12
  • 18