
Monday 22 July 2013

Installation Guide To Set Up Apache Nutch On Windows

Nutch is a crawler coded entirely in the Java programming language, with a wide variety of features. Some of these features are:

  • a highly scalable and feature-rich crawler
  • politeness - it obeys robots.txt rules
  • robustness and scalability - Nutch can run on a cluster of up to 100 machines
  • quality - crawling can be biased to fetch "important" pages first
For our purposes, it will allow us to crawl a source and automatically index the results into our Solr server.

Preparation

You should have set up Solr on Tomcat along with Tika's extracting request handler, as shown in the previous two guides:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html
Download the Nutch binary from the Apache website. Some releases of Nutch were designed specifically to work with certain versions of Solr, so be aware that the version of Nutch you try to integrate with Solr is important. For this example I am using Solr 4.3 and Nutch 1.4.

Downloads

Most of these will have been set up prior to this, so you do not need to download them again:
  • Download Java jre7 and jdk7
  • Download Tomcat 7
  • Download Solr 4.3
  • Download Cygwin - run setup.exe and install all packages (the default selection may be sufficient)
  • Download Nutch 1.4 bin

Set-up

  • Cygwin should be installed to C:/cygwin or similar. Copy your Nutch download to the cygwin/home folder. This Nutch installation will be referred to as $NUTCH_HOME.
  • Set up a system environment variable called JAVA_HOME and set it to the location of your JDK (e.g. C:/Java/jdk1.7 - avoid a path that contains spaces, such as Program Files, as discussed in the comments below).
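On Windows this can also be done from an elevated Command Prompt with setx; the /M switch makes the variable system-wide, and the JDK path below is just the example location from above, so adjust it to your machine:
setx JAVA_HOME "C:\Java\jdk1.7" /M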
  • If Cygwin does not recognise that you have set up the environment variable then you can issue the following instruction. Note that you will be required to type this instruction every time you wish to issue any commands that use the JDK.
export JAVA_HOME=[JDK Location]
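For example, if your JDK lives at C:/Java/jdk1.7 (adjust this to match your machine), the Cygwin form of the path would be:
export JAVA_HOME=/cygdrive/c/Java/jdk1.7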
  • In Cygwin, change directory into the $NUTCH_HOME/runtime/local/bin directory. If your environment variable has been set up correctly then when you run the command ./nutch the output should be Usage: nutch [-core] COMMAND
cd $NUTCH_HOME/runtime/local/bin
./nutch
 
#Output Should Be
Usage: nutch [-core] COMMAND
  • NOTE: "cygpath: can't convert empty path" is not an error and will be displayed each time you run any Nutch command.
  • In the $NUTCH_HOME/runtime/local/bin folder create a new folder and give it any name, say urls. This folder will contain a text file that determines which sites get crawled.
  • In the new folder create a text document, say nutch.txt, and add the URLs you wish to crawl, one per line (e.g. http://amac4.blogspot.co.uk).
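From Cygwin the folder and seed file can be created in one go; the names urls and nutch.txt are just the examples used above:
cd $NUTCH_HOME/runtime/local/bin
mkdir urls
echo "http://amac4.blogspot.co.uk/" > urls/nutch.txt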
  • In $NUTCH_HOME/runtime/local/conf open regex-urlfilter.txt and, where it says # accept anything else, add the line +^http://amac4.blogspot.co.uk/ if you wish to crawl anything that comes under the amac4.blogspot.co.uk domain.
  • Nutch can only extract data from certain types of file; it cannot extract data from images or various other binary file types. You may also not want extraction to occur for particular file types, so you have the option to ignore such files by adding their extensions to the skip list in the same file, in the same fashion as those already shown (|xml|XML|jpeg|asp| etc.).
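After both edits the relevant parts of regex-urlfilter.txt would look something like this (the exact extension list varies between Nutch releases, so treat this as a sketch):
# skip image and other suffixes we cannot parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe|jpeg|JPEG|bmp|BMP)$

# accept anything else
+^http://amac4.blogspot.co.uk/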
  • Now open nutch-site.xml and add the following property between the <configuration> tags (you can use any name as the value):
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
  • In nutch-default.xml there should be a <property> block containing <name>http.agent.name</name>. The <value> field below it should be empty, so add the name of the crawler that you specified before; in this case it would be:
<value>My Nutch Spider</value>
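In context, the full property block in nutch-default.xml would then look something like this (the description text is abbreviated here):
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
  <description>HTTP 'User-Agent' request header. ...</description>
</property>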
  • You can test the crawl is working by navigating to the $NUTCH_HOME/runtime/local/bin folder and executing:
cd $NUTCH_HOME/runtime/local/bin
./nutch crawl urls -dir [dir name] -depth [depth] -topN [files]
  • The directory you supply (dir name) will store the indexes from the crawl; running the crawl again will cause any files that are found again to be re-indexed. The depth is how far down the link hierarchy you wish to go, and topN is how many pages on each level you wish to index.
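For example, to store the indexes in a folder called newCrawl, follow links three levels deep and take at most 50 pages per level (all of these values are illustrative):
./nutch crawl urls -dir newCrawl -depth 3 -topN 50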
  • To get it linking to Solr: if you are using Solr 1/2/3, copy the schema.xml from $NUTCH_HOME/runtime/local/conf into the $SOLR_HOME/collection1/conf/ folder, overwriting the previous schema.xml file. If you are using Solr 4, as we are in this case, copy schema-solr4.xml from the $NUTCH_HOME/runtime/local/conf directory into $SOLR_HOME/collection1/conf/ and then rename it to schema.xml, overwriting the old one. Changes made to schema.xml during the Solr set-up may need to be re-done.
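For the Solr 4 case the copy and rename can be done in a single Cygwin command; $SOLR_HOME is a placeholder for wherever your Solr installation lives:
cp $NUTCH_HOME/runtime/local/conf/schema-solr4.xml $SOLR_HOME/collection1/conf/schema.xml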
  • Add this line to the schema.xml in the Solr installation and also to the schema-solr4.xml in the Nutch installation.
<field name="_version_" type="long" stored="true" indexed="true" multiValued="false"/>  
  • If you are NOT using the Solr4 schema file then edit the schema file you copied over and comment out the line like this:
<!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
  • To test the crawl is indexing to Solr type into cygwin:
./nutch crawl urls -dir newCrawl -solr http://localhost:8080/solr/ -depth 3 -topN 4
  • Note: Nutch is now configured to crawl over the web, but a few issues have turned up in practice; the changes needed to deal with them are described in the sections below.

Optimising Nutch Performance

You may notice if you try and run Nutch that it works its way through the crawl very slowly. That is because, by default, Nutch is set up to use one thread and does not take advantage of its multi-threaded implementation. Nutch will use multi-threading to crawl various hosts simultaneously, so the settings must be changed in order for crawl times to be kept to a reasonable level.
In nutch-site.xml add the following code between the <configuration> tags:
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>10</value>
  </property>

  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value>
  </property>

Parsing Errors

You may find errors popping up every so often that look similar to:
Error parsing: 192.168.0.42/test/AGENDA.doc: failed(2,0): Unable to read 512 bytes from 65536 in stream of length 65421
Fix
By default there is a limit on how much data Nutch will download and parse, so to remove that limit you need to set the content limit to -1. This has repercussions in terms of performance, as large files will take longer to parse, but no file should fail because of its length.
Add this code to nutch-site.xml:
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
      If this value is nonnegative (>=0), content longer than it will be
      truncated; otherwise, no truncation at all.
    </description>
  </property>

Failing to parse documents with a space in the title

URLs are not allowed to contain white-space, and Nutch was not replacing the space character with %20, which meant that the filename became invalid.
Fix
Add the following text to regex-normalize.xml:
  <regex> 
     <pattern>&#x20;</pattern> 
     <substitution>%20</substitution> 
  </regex> 

Next Steps

You may now want to set up Nutch to crawl a local filesystem, or build a web service to query Solr; the guides can be found at:
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html    
http://amac4.blogspot.co.uk/2013/07/web-service-to-query-solr-rest.html    

Otherwise you may wish to check out some tweaks that can be made to Solr, including highlighting and deleting dead URLs:
http://amac4.blogspot.co.uk/2013/08/setting-up-highlighting-for-solr-4.html 
http://amac4.blogspot.co.uk/2013/08/deleting-dead-urls-files-that-no-longer.html    

16 comments:

  1. Hi... I am getting the following error when I try to execute the command ./nutch crawl urls -dir [dir name] -depth [depth] -topN [files].
    I have created a directory by the name crawl under C:/. Please tell me the exact path where this directory has to be created. Thanks

    Replies
    1. Error message is
      C:\Program:not found

    2. Be sure to execute your command using Cygwin. If you execute using ./nutch crawl [blah]... then you will need to be in the folder containing the crawl script (should be "bin" I think). If you want to be elsewhere then execute bin/nutch crawl [blah] ....

    3. So you should move your JDK to a folder not containing a space. I suggest moving it into the /home folder of Cygwin.

    4. It is because your Java folder is stored under "Program Files (x86)" and environment variables shouldn't have a space in the absolute directory. That is why you must move it to a folder whose path contains no spaces.

    5. It's not c:program files/nutch in the environment variables; give it as c:\progra.. instead.
      Follow the video "running nutch and solr on windows" and you may get it solved.

  2. Hi, I followed your installation steps but I got this error.

    $ ./nutch crawl urls -dir myCrawl -depth 5 -topN 15
    cygpath: can't convert empty path
    solrUrl is not set, indexing will be skipped...
    crawl started in: myCrawl
    rootUrlDir = urls
    threads = 10
    depth = 5
    solrUrl=null
    topN = 15
    Injector: starting at 2013-12-01 19:45:14
    Injector: crawlDb: myCrawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

    Replies
    1. Found the problem. Check out this line

      In $NUTCH_HOME/runtime/local/conf open the regex-urlfilter.txt and where it says # accept anything else add the line +^http://amac4.blogspot.co.uk/ if you wish to search anything that comes under the amac4.blogspot.co.uk domain.

    2. This comment has been removed by the author.

  3. Hi Allan,

    Your blog is really informative... I am new to this area. I just went through your blog, and now I am going to integrate Tika with Solr 4.3.1; let's see what the result of the installation will be.

    Thank you,
    Bhoyer.

  4. This comment has been removed by the author.

  5. Hi Allan,

    Thanks for the tutorial. I'm a bit confused about how to configure Solr with Nutch. I've downloaded the latest version of Solr (4.10.2) to my Cygwin/home folder where my apache-nutch-2.2.1 resides. I can't find the collection1/conf folder you're referring to above, i.e. $SOLR_HOME/collection1/conf/

    Thanks,

    Oran.

    Replies
    1. Hi Allan,
      I have a solution for your Nutch and Solr configuration; for that, please specify more about your error.

  6. Rajat Srivastava, 7 April 2016 at 12:20

    Awesome blog! It is good to see a complex topic like this explained in a simple way. Keep them coming...

  7. Excellent and easily understandable. Now I need to run Nutch with Hadoop on Windows. I have configured Hadoop but am stuck on how to do everything for Nutch. Can you please share with me a nice article like this one, written by you or anyone? Otherwise, writing a fresh article is also a nice idea.

  8. Can you tell me how to integrate tika-1.14 with nutch-2.3.1?
