
I am archiving a number of websites in order to retain many of the files linked there, specifically a number of PDFs.

I haven't had any problems using the Heritrix crawler to collect the sites. However, I haven't found a good solution for extracting files from the resulting .warc files.

Does anyone have experience with this, or a preferred way to get these individual files out?

wxs

7 Answers

6

You could browse the WARC with WebArchivePlayer and save the files you want from your browser. Alternatively, upload the WARC to webrecorder.io and browse/download there.

4

ReplayWeb.page replaces Webrecorder Player, which replaced WebArchivePlayer.

There is no app to install: just go to the page and browse to your file. All processing is local.

Andrew Olney
4

I suggest trying warctools (https://github.com/internetarchive/warctools); it's a Python library that is very easy to use.
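
Since the question is about pulling PDFs out of a crawl, here is a rough sketch of how that might look with warctools. It is modeled on the library's own warcdump tool, and the exact API (WarcRecord.open_archive, read_records, and the content block being a (content-type, bytes) pair) is an assumption to verify against the current source:

# Sketch: save the PDFs from a WARC using warctools (hanzo.warctools).
import os
from hanzo.warctools import WarcRecord

fh = WarcRecord.open_archive("example.warc.gz", gzip="auto")
for offset, record, errors in fh.read_records(limit=None):
    # Skip parse gaps and anything that is not an HTTP response record.
    if record is None or record.type != WarcRecord.RESPONSE:
        continue
    # For response records the content block is the raw HTTP response,
    # so split the HTTP headers off before saving the payload.
    content_type, data = record.content
    headers, _, body = data.partition(b"\r\n\r\n")
    if b"application/pdf" in headers:
        name = os.path.basename(record.url.rstrip(b"/").decode()) or "unnamed.pdf"
        with open(name, "wb") as out:
            out.write(body)
fh.close()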

2

I've found that 7-Zip by itself often doesn't work, but there is a plugin for it called eDecoder that can be used to enable WARC support.

eDecoder can be downloaded for free from here.

Upon opening a WARC with this plugin installed, it behaves like any other archive in 7-Zip, with a few exceptions:

  • an extra column is added that shows the original URL of each file.
  • each file is prepended with a number to prevent filename collisions (e.g., index.html might become 000123 index.html).
  • folder structures are discarded: all of the files are visible in the main view regardless of which folder they were originally in; there are in fact no folders at all.

While it can be downloaded for free, it appears to be closed source, in terms of both the code and the license, and is therefore limited to Windows, since it is distributed as a compiled DLL.

rebane2001
1

I am using this project: https://github.com/chfoo/warcat

Example Run:

python3 -m warcat --help                    # show the available commands
python3 -m warcat list example/at.warc.gz   # list the records in an archive
python3 -m warcat verify megawarc.warc.gz --progress   # check the archive's integrity
python3 -m warcat extract megawarc.warc.gz --output-dir /tmp/megawarc/ --progress   # dump all files
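
As far as I can tell, extract recreates a directory tree based on the original URLs under the output directory, so picking out a particular file type afterwards (for example the PDFs the question asks about) is an ordinary shell step rather than a warcat feature:

find /tmp/megawarc/ -name '*.pdf'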
xLight
0

I was looking for a solution suitable for terminal use (Ubuntu). Unfortunately, in my case (WARC files created with browsertrix-crawler) the previous answers did not work out.

I found warc-extractor to work best in my case. It is a Python tool, and extracting all HTML pages is as easy as calling:

$ warc-extractor http:content-type:text/html -dump content -error

in the directory containing the WARC files. I needed the -error flag because my crawls contain quite a few problematic pages. For my use case it is sufficient to successfully extract most of the content, which this tool does well enough.
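
Since the original question was about PDFs, the same filter pattern should in principle work with a different MIME type. This is an assumption extrapolated from the HTML example above, so check it against the tool's help output:

$ warc-extractor http:content-type:application/pdf -dump content -error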

0

I've used 7-Zip before to extract individual files or whole archives from Web Archive format files.

It's available from their site here.

Martin
  • Interesting. I'm on a Linux machine so I used the **p7zip** build. It doesn't seem to recognize the ``.warc`` as any sort of archive it can decompress (``p7zip -d web-archive.warc``). You were able to pull individual files out with 7-Zip though? – wxs Aug 09 '13 at 21:00
  • @walker I was indeed. Although the archive was not recognised, it did open with 7-Zip and the contents were displayed and were extractable. – Martin Aug 09 '13 at 21:19
  • Hm. I've gotten onto a Windows machine and am using 7-Zip 9.20. I have three different ``.warc`` files but none are extractable by the program. Not sure what the problem is. – wxs Aug 22 '13 at 21:41
  • I find this works but only with specific file types like images and html. Some file types don't appear. – Andrew Olney Apr 11 '21 at 14:33