
Tuesday 23 July 2013

Nutch crawling slowly - Multi-Threaded Solution

Were you expecting Nutch to fly through all the websites you want it to crawl, only to find it working through each page sequentially, one by one? Then you have probably not configured Nutch to use multiple threads. With multiple threads, Nutch can fetch several pages in parallel, which can speed up the job significantly.

Preparation

At this point you should have Solr and Tika set up, and Nutch ready to crawl. The guides can be found here:
The changes needed to make Nutch crawl a file system require that Nutch is already set up to crawl the web, so please ensure you have read the post.
Configuring your multi-threaded Nutch crawler is easy:
In nutch-site.xml, add the following between the <configuration> tags:
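<property>
   <name>fetcher.threads.per.queue</name>
   <value>10</value>
   <description>Maximum number of threads allowed to access a single fetch queue at one time.</description>
</property>

<property>
   <name>fetcher.threads.per.host</name>
   <value>10</value>
   <description>Maximum number of threads allowed to access a single host at one time.</description>
</property>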
Ten threads proved the most efficient on the data set I crawled, so I would advise starting with a thread count around that value. Using too many threads will slow the computer down, and you may end up crawling at a speed similar to that of the sequential crawler.
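
The two properties above limit how many threads may work on a single host or queue at once. As a hedged addition that is not part of my original set-up: Nutch 1.x also has a fetcher.threads.fetch property that sets the total number of fetcher threads, so a per-queue value higher than that total is unlikely to gain anything. As far as I know, fetcher.threads.per.host is the older name for the per-queue limit, so newer releases may only read fetcher.threads.per.queue. Below is a minimal sketch of a complete nutch-site.xml under those assumptions; please check the nutch-default.xml shipped with your version to confirm which property names it supports.

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Assumption (not in the original post): total number of fetcher
       threads. Verify the name and default in your nutch-default.xml. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>

  <!-- From the post: threads allowed to work on one queue at a time. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>10</value>
  </property>

</configuration>
```

If you change the thread count, alter one value at a time and time a crawl of the same seed list after each change, so you can see where extra threads stop helping.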
