You might expect Nutch to fly through all the websites you want it to crawl, but in reality it goes through each page sequentially, one by one. That means you have not configured Nutch to use multiple threads! Multiple threads let Nutch fetch many pages in "parallel", which can speed up the job significantly.
Preparation
At this point you should have Solr and Tika set up, and Nutch ready to crawl. The guides can be found here:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html
http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html
The changes needed to make Nutch crawl a file system require that Nutch is already set up to crawl the web, so please ensure you have read that post first.
Configuring your multi-threaded Nutch crawler is easy. In nutch-site.xml, add the following between the <configuration> tags:
<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
  <description></description>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
  <description></description>
</property>
10 threads was the most efficient setting on the data set I crawled, and I would advise you to start with a number around there. Having too many threads will slow the computer down, and you may end up crawling at a speed similar to that of the sequential crawler.
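After editing the file, it can be handy to read the settings back and confirm they are what you expect. The following is only a rough sketch, not part of Nutch itself: the function name read_thread_settings and the inline SAMPLE text are made up for illustration, and it assumes the file follows the usual <configuration>/<property>/<name>/<value> layout shown in this post.

```python
import xml.etree.ElementTree as ET

# The two property names used in this post.
THREAD_PROPERTIES = ("fetcher.threads.per.queue", "fetcher.threads.per.host")

# Illustrative stand-in for the contents of nutch-site.xml (an assumption,
# not a copy of any real installation's file).
SAMPLE = """
<configuration>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>10</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value>
  </property>
</configuration>
"""


def read_thread_settings(xml_text):
    """Return {property name: int value} for the thread properties found."""
    root = ET.fromstring(xml_text.strip())
    settings = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in THREAD_PROPERTIES and value:
            settings[name] = int(value)
    return settings


# Report each thread property, or note that it is missing.
found = read_thread_settings(SAMPLE)
for name in THREAD_PROPERTIES:
    print(name, "=", found.get(name, "not set"))
```

To check a real installation, pass the text of your own nutch-site.xml to read_thread_settings instead of SAMPLE. A property reported as "not set" means Nutch will fall back to whatever default it ships with.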