
Monday 22 July 2013

Installation Guide To Set Up Apache Nutch On Windows

Nutch is a crawler coded entirely in the Java programming language, with a wide variety of features. Some of these features are:

  • a highly scalable and feature-rich crawler
  • politeness - it obeys robots.txt rules
  • robustness and scalability - Nutch can run on a cluster of up to 100 machines
  • quality - crawling can be biased to fetch "important" pages first
For our purposes, it will allow us to crawl a source and automatically index the results into our Solr server.

Preparation

You should have set up Solr on Tomcat along with Tika's extracting request handler, as shown in the previous two guides:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html
Download the Nutch binary from the Apache website. Some releases of Nutch were designed specifically to work with certain versions of Solr, so be aware that the version of Nutch you try to integrate with Solr is important. For this example I am using Solr 4.3 and Nutch 1.4.

Downloads

Most of these will have been set up prior to this, so you do not need to download them again:
  • Download Java jre7 and jdk7
  • Download Tomcat 7
  • Download Solr 4.3
  • Download Cygwin - run setup.exe and install all packages (the default selection may be sufficient)
  • Download Nutch 1.4 bin

Set-up

  • Cygwin should be installed to C:/cygwin or similar. Copy your Nutch download to the cygwin/home folder. This Nutch installation will be referred to as $NUTCH_HOME.
  • Set up a system environment variable called JAVA_HOME and set it to the location of your JDK (e.g. C:/Java/jdk1.7 - avoid a path that contains spaces, such as Program Files, as discussed in the comments below).
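On Windows this can also be done from an elevated Command Prompt with setx; the /M switch makes the variable system-wide, and the JDK path below is just the example location from above, so adjust it to your machine:
setx JAVA_HOME "C:\Java\jdk1.7" /M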
  • If Cygwin does not recognise that you have set up the environment variable then you can issue the following instruction. Note that you will be required to type this instruction every time you wish to issue any commands that use the JDK.
export JAVA_HOME=[JDK Location]
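For example, if your JDK lives at C:/Java/jdk1.7 (adjust this to match your machine), the Cygwin form of the path would be:
export JAVA_HOME=/cygdrive/c/Java/jdk1.7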
  • In Cygwin, change directory into the $NUTCH_HOME/runtime/local/bin directory. If your environment variable has been set up correctly then when you run the command ./nutch the output should be Usage: nutch [-core] COMMAND
cd $NUTCH_HOME/runtime/local/bin
./nutch
 
#Output Should Be
Usage: nutch [-core] COMMAND
  • NOTE: "cygpath: can't convert empty path" is not an error and will be displayed each time you run any Nutch command.
  • In the $NUTCH_HOME/runtime/local/bin folder create a new folder and give it any name, say urls. This folder will contain a text file that determines which sites get crawled.
  • In the new folder create a text document, say nutch.txt, and add the URLs you wish to crawl, one per line (e.g. http://amac4.blogspot.co.uk).
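From Cygwin the folder and seed file can be created in one go; the names urls and nutch.txt are just the examples used above:
cd $NUTCH_HOME/runtime/local/bin
mkdir urls
echo "http://amac4.blogspot.co.uk/" > urls/nutch.txt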
  • In $NUTCH_HOME/runtime/local/conf open regex-urlfilter.txt and, where it says # accept anything else, add the line +^http://amac4.blogspot.co.uk/ if you wish to crawl anything that comes under the amac4.blogspot.co.uk domain.
  • Nutch can only extract data from certain types of file; it cannot extract data from images or various other binary file types. You may also not want extraction to occur for particular file types, so you have the option to ignore such files by adding their extensions to the skip list in the same file, in the same fashion as those already shown (|xml|XML|jpeg|asp| etc.).
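After both edits the relevant parts of regex-urlfilter.txt would look something like this (the exact extension list varies between Nutch releases, so treat this as a sketch):
# skip image and other suffixes we cannot parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe|jpeg|JPEG|bmp|BMP)$

# accept anything else
+^http://amac4.blogspot.co.uk/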
  • Now open nutch-site.xml and add the following property between the <configuration> tags (you can use any name as the value):
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
  • In nutch-default.xml there should be a <property> block containing <name>http.agent.name</name>. The <value> field below it should be empty, so add the name of the crawler that you specified before; in this case it would be:
<value>My Nutch Spider</value>
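In context, the full property block in nutch-default.xml would then look something like this (the description text is abbreviated here):
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
  <description>HTTP 'User-Agent' request header. ...</description>
</property>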
  • You can test the crawl is working by navigating to the $NUTCH_HOME/runtime/local/bin folder and executing:
cd $NUTCH_HOME/runtime/local/bin
./nutch crawl urls -dir [dir name] -depth [depth] -topN [files]
  • The directory you supply (dir name) will store the indexes from the crawl; running the crawl again will cause any files that are found again to be re-indexed. The depth is how far down the link hierarchy you wish to go, and topN is how many pages on each level you wish to index.
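For example, to store the indexes in a folder called newCrawl, follow links three levels deep and take at most 50 pages per level (all of these values are illustrative):
./nutch crawl urls -dir newCrawl -depth 3 -topN 50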
  • To get it linking to Solr: if you are using Solr 1/2/3, copy the schema.xml from $NUTCH_HOME/runtime/local/conf into the $SOLR_HOME/collection1/conf/ folder, overwriting the previous schema.xml file. If you are using Solr 4, as we are in this case, copy schema-solr4.xml from the $NUTCH_HOME/runtime/local/conf directory into $SOLR_HOME/collection1/conf/ and then rename it to schema.xml, overwriting the old one. Changes made to schema.xml during the Solr set-up may need to be re-done.
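For the Solr 4 case the copy and rename can be done in a single Cygwin command; $SOLR_HOME is a placeholder for wherever your Solr installation lives:
cp $NUTCH_HOME/runtime/local/conf/schema-solr4.xml $SOLR_HOME/collection1/conf/schema.xml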
  • Add this line to the schema.xml in the Solr installation and also to the schema-solr4.xml in the Nutch installation.
<field name="_version_" type="long" stored="true" indexed="true" multiValued="false"/>  
  • If you are NOT using the Solr4 schema file then edit the schema file you copied over and comment out the line like this:
<!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
  • To test the crawl is indexing to Solr type into cygwin:
./nutch crawl urls -dir newCrawl -solr http://localhost:8080/solr/ -depth 3 -topN 4
  • Note: Nutch is now configured to crawl over the web, but a few issues have turned up in practice; the changes needed to deal with them are described in the sections below.

Optimising Nutch Performance

You may notice if you try and run Nutch that it works its way through the crawl very slowly. That is because, by default, Nutch is set up to use one thread and does not take advantage of its multi-threaded implementation. Nutch will use multi-threading to crawl various hosts simultaneously, so the settings must be changed in order for crawl times to be kept to a reasonable level.
In nutch-site.xml add the following code between the <configuration> tags:
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>10</value>
  </property>

  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value>
  </property>

Parsing Errors

You may find errors popping up every so often that look similar to:
Error parsing: 192.168.0.42/test/AGENDA.doc: failed(2,0): Unable to read 512 bytes from 65536 in stream of length 65421
Fix
By default there is a limit on how much data Nutch will download and parse, so to remove that limit you need to set the content limit to -1. This has repercussions in terms of performance, as large files will take longer to parse, but no file should fail because of its length.
Add this code to nutch-site.xml:
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
      If this value is nonnegative (>=0), content longer than it will be
      truncated; otherwise, no truncation at all.
    </description>
  </property>

Failing to parse documents with a space in the title

URLs are not allowed to contain white-space, and Nutch was not replacing the space character with %20, which meant that the filename became invalid.
Fix
Add the following text to regex-normalize.xml:
  <regex> 
     <pattern>&#x20;</pattern> 
     <substitution>%20</substitution> 
  </regex> 

Next Steps

You may now want to set up Nutch to crawl a local filesystem, or build a web service to query Solr; the guides can be found at:
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html    
http://amac4.blogspot.co.uk/2013/07/web-service-to-query-solr-rest.html    

Otherwise you may wish to check out some tweaks that can be made to Solr, including highlighting and deleting dead URLs:
http://amac4.blogspot.co.uk/2013/08/setting-up-highlighting-for-solr-4.html 
http://amac4.blogspot.co.uk/2013/08/deleting-dead-urls-files-that-no-longer.html    

16 comments:

  1. Hi... I am getting the following error when I try to execute the command ./nutch crawl urls -dir [dir name] -depth [depth] -topN [files].
    I have created a directory by the name crawl under C:/. Please tell me the exact path where this directory has to be created. Thanks

    Replies
    1. Error message is
      C:\Program:not found

    2. Be sure to execute your command using Cygwin. If you execute using ./nutch crawl [blah]... then you will need to be in the folder containing the crawl script (should be "bin" I think). If you want to be elsewhere then execute bin/nutch crawl [blah] ....

    3. So you should move your JDK to a folder not containing a space. I suggest moving it into the /home folder of Cygwin.

    4. It is because your Java folder is stored under "Program Files (x86)" and environment variables shouldn't have a space in the absolute directory. That is why you must move it to a folder whose path contains no spaces.

    5. It's not c:program files/nutch in the environment variables; give it as c:\progra.. instead.
      Follow the video "running nutch and solr on windows" and you may get it solved.

  2. Hi, I followed your installation steps but I got this error.

    $ ./nutch crawl urls -dir myCrawl -depth 5 -topN 15
    cygpath: can't convert empty path
    solrUrl is not set, indexing will be skipped...
    crawl started in: myCrawl
    rootUrlDir = urls
    threads = 10
    depth = 5
    solrUrl=null
    topN = 15
    Injector: starting at 2013-12-01 19:45:14
    Injector: crawlDb: myCrawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

    Replies
    1. Found the problem. Check out this line

      In $NUTCH_HOME/runtime/local/conf open the regex-urlfilter.txt and where it says # accept anything else add the line +^http://amac4.blogspot.co.uk/ if you wish to search anything that comes under the amac4.blogspot.co.uk domain.

    2. This comment has been removed by the author.

  3. Hi Allan,

    Your blog is really informative... I am new to this area. I just went through your blog, and now I am going to integrate Tika with Solr 4.3.1; let's see what the result of the installation will be.

    Thank you,
    Bhoyer.

  4. This comment has been removed by the author.

  5. Hi Allan,

    Thanks for the tutorial. I'm a bit confused about how to configure Solr with Nutch. I've downloaded the latest version of Solr (4.10.2) to my Cygwin/home folder where my apache-nutch-2.2.1 resides. I can't find the collection1/conf folder you're referring to above, i.e. $SOLR_HOME/collection1/conf/

    Thanks,

    Oran.

    Replies
    1. Hi Allan,
      I have a solution for your Nutch and Solr configuration; for that, please specify more about your error.

  6. Rajat Srivastava, 7 April 2016 at 12:20

    Awesome blog! It is good to see a complex topic like this explained in a simple way. Keep them coming...

  7. Excellent and easily understandable. Now I need to run Nutch with Hadoop on Windows. I have configured Hadoop but am stuck on how to do everything for Nutch. Can you please share with me a nice article like this one, written by you or anyone? Otherwise, writing a fresh article is also a nice idea.

  8. Can you tell me how to integrate tika-1.14 with nutch-2.3.1?
