I need to monitor 1000+ specific links for about one month to see if their content has changed, and I wonder if I can automate this somehow. One idea was to simply download these websites now, and again in one month, and compare the source files. If I go down this route, do you guys know of a tool (browser extension?) that would make such a download easy? I've tried HTTrack, but it fails after the first 100 links or so. Alternatively, a (free?) web service which can monitor a set of websites might work as well. I've used https://visualping.io/ before, but it's not really intended for thousands of links.
-
Be more specific than just "it fails"; it might actually be one possible tool. Alternatively, you could use curl, a download manager, or whatever. Yes, you can automate this. – Seth Jul 24 '18 at 10:42
-
`javascript:alert(document.lastModified)`? – Akina Jul 24 '18 at 10:55
-
You can try and use `curl` and output to a file, then run a `diff` between files each day. To automate, put your links in a file and use a script to read in each line as a variable; then just loop through them all, getting the source files. Then you can compare the current day's files with the previous day's and alert in whichever way you feel appropriate, and delete the previous day's source files as a sort of cleanup. This is a minimal-external-tools approach. Just bear in mind that the default `curl` on Windows is a PowerShell alias for a different command, so you'd need a Linux curl. – Gytis Jul 24 '18 at 11:23
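A minimal sketch of the loop that comment describes, assuming a plain-text file `urls.txt` with one URL per line (the file name is illustrative) and GNU coreutils:

```bash
#!/usr/bin/env bash
# Snapshot every URL in urls.txt into a directory named after today,
# then diff against yesterday's snapshots and clean them up.
today=$(date +%F)                     # e.g. 2018-07-24
yesterday=$(date -d yesterday +%F)    # GNU date; on macOS use: date -v-1d +%F
mkdir -p "$today"

n=0
while IFS= read -r url; do
    n=$((n + 1))
    # -L follows redirects, -s silences progress, --max-time caps hangs
    curl -L -s --max-time 30 -o "$today/$n.html" "$url"
done < urls.txt

# Report which snapshots changed since yesterday, then clean up
if [ -d "$yesterday" ]; then
    diff -rq "$yesterday" "$today"
    rm -r "$yesterday"
fi
```

Numbering the output files by line position keeps snapshots from different days comparable even when the URLs contain characters that are awkward in file names.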
1 Answer
> I wonder if I can automate this somehow.
Hardly necessary, but yes, you could write some simple scripts.
> do you guys know of a tool ... that would make such a download easy?
`wget`, `curl`, etc.
You could put the 1000 specific URLs into a text file, create two directories, `cd` into the first directory, and use a tool such as `wget` with the `-i` option to read the list of URLs and fetch them. A month later, repeat this in the second directory, then use `diff`, e.g. `diff -r /directory1 /directory2`, to find any changes.
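For example, a minimal sketch of that workflow (the file name `urls.txt` and the directory names `before` and `after` are just placeholders):

```bash
# One snapshot now, one in a month, then a recursive diff.
mkdir before after

cd before
wget -i ../urls.txt      # fetch every URL listed in the file
cd ..

# ...about one month later:
cd after
wget -i ../urls.txt
cd ..

diff -r before after     # show what changed between the two runs
```

`diff -r` prints the full line-by-line differences; add `-q` if you only want a list of which files changed.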
Be careful about using recursive options; they can overwhelm the server and get you banned, or can overload your computer.
I'd try with a small set of URLs first (e.g. 2, then 10, then 1000).
A lower-cost option may be to use HTTP HEAD requests and trust that the server knows if a resource has been changed.
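As a rough sketch of that idea, you could record each URL's `Last-Modified` header now and again in a month, then `diff` the two recordings. Note that not every server sends this header, and some report it inaccurately; `urls.txt` is again a hypothetical file:

```bash
# Record the Last-Modified header for each URL into a dated file.
while IFS= read -r url; do
    # -s silences progress, -I sends a HEAD request instead of GET
    printf '%s\t%s\n' "$url" \
        "$(curl -sI "$url" | tr -d '\r' | grep -i '^Last-Modified:')"
done < urls.txt > headers-$(date +%F).txt
```

Running it twice, a month apart, and diffing the two output files shows which URLs report a new modification date.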