Friday, 16 August 2013

Nutch Re-crawling

Nutch allows you to crawl the web or a filesystem in order to build up an index store of all its content. If your objective is simply to crawl the content once, it is fairly easy. But if you want to continuously monitor a site and crawl updates, it is harder, largely because the Nutch documentation has few details about it.

Nutch will update any previously indexed urls/files, delete any inactive ones and add any new ones it encounters while re-crawling. There are a few very simple settings that need to be changed if you wish Nutch to do this.
Nutch stores a record of all the urls/files it has encountered during its crawls in the crawldb. Initially this is built from the list of urls/files provided by the user via the inject command, which will normally be taken from your seed.txt file (in our case, the nutch.txt file).
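For reference, the initial injection looks something like this (a sketch assuming the standard Nutch 1.x command-line tools, a crawl directory named Test to match the examples later in this post, and a urls directory containing nutch.txt):

```shell
# Seed the crawldb with the urls/files listed in the urls directory
# (Test/crawldb is created if it does not already exist)
./nutch inject Test/crawldb urls
```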
Nutch uses a generate/fetch/update process:
generate: This command looks in the crawldb for all the urls/files that are due for fetching and groups them into a segment. A url/file is due for fetching if it is new, or if its fetch interval has expired and it is now due for re-crawling (the default interval is 30 days).
fetch: This command will go and fetch all the urls/files specified in the segment.
update: This command will add the results of the crawl, which have been stored in the segment, into the crawldb, and each url/file will be updated to indicate the time it was fetched and when its next fetch is scheduled. If any new urls/files have been discovered, they will be added and marked as not fetched.
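The generate/fetch/update cycle above can be run as individual commands. A rough sketch, assuming the Nutch 1.x tools and the Test crawl directory used later in this post:

```shell
# Select the urls/files due for fetching and group them into a new segment
./nutch generate Test/crawldb Test/segments

# Pick up the segment that generate just created (the newest directory)
SEGMENT=$(ls -d Test/segments/2* | tail -1)

# Fetch and then parse the contents of the segment
./nutch fetch $SEGMENT
./nutch parse $SEGMENT

# Fold the results back into the crawldb, updating fetch times
# and adding any newly discovered urls/files marked as not fetched
./nutch updatedb Test/crawldb $SEGMENT
```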
How can Nutch detect whether a page has changed? Each time a page is fetched, Nutch computes a signature for it. At the next fetch, if the signature is the same (or if the web server returns a 304 because of the If-Modified-Since header), Nutch can tell whether the page was modified. It is not just the content: if the http headers or metatags have changed, the page will be marked as modified. If a document no longer exists, the server returns a 404 and the record is marked DB_GONE. During the update cycle Nutch has the ability to purge all those urls/files that have been marked DB_GONE.
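You can inspect the state of the crawldb between cycles with the readdb command (again a sketch assuming the Test crawl directory used below); the status breakdown shows how many records are unfetched, fetched or gone:

```shell
# Print summary statistics for the crawldb, including a breakdown
# of records by status (db_unfetched, db_fetched, db_gone, ...)
./nutch readdb Test/crawldb -stats
```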
The linkdb stores the inverted link structure (which urls/files link to which) that Nutch builds from the crawl, and together with the crawldb and segments this is the data that Nutch passes to the Solr server during the solrindex process.


The first thing you need to do is allow Nutch to re-crawl. You may have noticed that if you run a crawl on the same source using the same folder name as before, Nutch tells you there are no more urls/files to crawl. By default a url/file may not be re-crawled for 30 days. So edit your nutch-site.xml and set the value to your liking:
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
    <description>The default number of seconds between re-fetches of a page (30 days).
    </description>
  </property>
If you set the number too low you may get into an infinite loop: when the crawler finishes one of its cycles, it will notice that it is already time for those urls to be re-crawled, and at the end of that cycle it will notice the same thing again. So ensure that the number of seconds you select is longer than the time the Nutch crawler takes to complete its crawl of everything.
Adding the following property will delete all urls/files marked DB_GONE during updatedb, ensuring all indexed urls/files are active and up to date:
  <property>
    <name>db.update.purge.404</name>
    <value>true</value>
    <description>If true, updatedb will add purge records with status DB_GONE
    from the CrawlDB.
    </description>
  </property>
You are now all set to re-crawl! Just issue the same command as for a normal crawl, but make sure the folder containing the Nutch indexes is the one you used before. So if you performed the crawl previously and your folder was named "Test":
./nutch crawl urls -dir Test -solr http://localhost:8080/solr/ -depth 3 -topN 100
Then you will want to issue the exact same command again to re-crawl, although you are free to change the depth and topN to your liking:
./nutch crawl urls -dir Test -solr http://localhost:8080/solr/ -depth 3 -topN 100


There is an issue when it comes to indexing this to Solr. Nutch passes all the indexes over, and Solr updates the ones that require updating. But if a url/file has been deleted, it has been purged from Nutch's indexes, and when they are passed over to Solr, Solr does not know it has been deleted, so it still stores a record of the old url/file. The issue can be combated by deleting the Solr indexes first and then passing the Nutch indexes over to Solr, which comes with very little additional performance cost.
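A minimal sketch of that workaround, assuming a Solr instance at the same URL as the crawl commands above (the delete-all query wipes the Solr index, and the subsequent crawl repopulates it from Nutch's current, purged data):

```shell
# Delete everything currently in the Solr index and commit immediately
curl "http://localhost:8080/solr/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary "<delete><query>*:*</query></delete>"

# Re-run the crawl; Nutch will pass its up-to-date indexes over to Solr
./nutch crawl urls -dir Test -solr http://localhost:8080/solr/ -depth 3 -topN 100
```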
