
I have the following commands:

time grep -F -f 'in2.txt' test.fastq
time zgrep -F -f 'in2.txt' test.fastq.gz

There are about 30 search terms, and the files are roughly 5 GB each. However, I notice that on one computer (an Amazon EC2 instance I spun up) the search takes 3-5x longer to finish. What is limiting the speed? Should I spin up an instance with more memory or a faster CPU?

ahdee
  • An Amazon EC2 instance could be running on any physical hardware, right? You might not have any guarantee of what it's really using, regardless of what it reports... but anyway, zgrep searches compressed files and grep doesn't, so they're very different. – Xen2050 Mar 13 '18 at 04:20
  • Xen2050, you're right that grep and zgrep have distinct performance profiles. Most notably, if you are I/O constrained but not CPU constrained, operating on well-compressed files should help by reducing the time needed to pull data from the media (see the timing sketch after these comments). – Slartibartfast Mar 18 '18 at 16:12
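
A minimal way to see the tradeoff these comments describe (a sketch, assuming a Linux box with sudo; the file names are from the question, and dropping the page cache forces each run to actually read from disk):

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # start cold so timing reflects disk reads
time grep -F -f in2.txt test.fastq > /dev/null       # uncompressed: more bytes off disk, less CPU
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
time zgrep -F -f in2.txt test.fastq.gz > /dev/null   # compressed: fewer bytes off disk, more CPU to decompress

If zgrep wins on the cold runs but grep wins once the files are cached, the machine is I/O bound rather than CPU bound.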

1 Answer


CPU and I/O. If you are searching for a small set of terms (and 30 is quite small), you are most likely I/O bound, and conceivably CPU bound. You will not be memory bound.

IMHO, the right answer is to test it. You can do this a few ways, including keeping two terminals open and running `dstat` in one while you run the command in question in the other. Provided the command takes at least a couple of seconds to complete, you should get an idea of which resources are maxed out (at 100%, or at some steady-state value) and which are not.
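
For example (a sketch; `dstat` may need to be installed first, and the flags below sample CPU, disk, and memory once per second):

# terminal 1: watch CPU, disk, and memory while the search runs
dstat -cdm 1

# terminal 2: run the search, discarding matches so printing doesn't skew the timing
time grep -F -f in2.txt test.fastq > /dev/null

The `time` output itself is also a clue: if real is close to user + sys, the run was CPU bound; if real is much larger, the process spent most of its life waiting on I/O.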

Slartibartfast
  • I haven't reviewed the `grep` source code, but I see no reason why `grep` would benefit from more memory in this case. Unless the search strings are exceedingly long, `grep` likely works with small buffers (which I guess would be memory-mapped). – Edward Mar 14 '18 at 08:02
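
A quick way to verify Edward's point (a sketch, assuming GNU time is installed at /usr/bin/time; the shell builtin `time` does not report memory):

/usr/bin/time -v grep -F -f in2.txt test.fastq > /dev/null
# inspect "Maximum resident set size (kbytes)" in the report; for a fixed-string
# search over ~30 short patterns it should be tiny, far below the 5 GB file size

grep streams through the file rather than loading it, so peak memory tracks the pattern set and buffer sizes, not the input size.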