Nutch is a crawler written entirely in the Java programming language, with a wide variety of features. Some of these features are:
- a highly scalable, feature-rich crawl engine
- politeness: the crawler obeys robots.txt rules
- robustness and scalability - Nutch can run on a cluster of up to 100 machines
- quality - crawling can be biased to fetch "important" pages first
For our purposes, it will let us crawl a source and automatically index the results into our Solr server.
Preparation
You should have set up Solr on Tomcat along with Tika's extracting request handler as shown in the previous two guides: http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html
Download the Nutch binary from the Apache website. Some releases of Nutch were designed to work with specific versions of Solr, so be aware that the version of Nutch you integrate with Solr matters. For this example I am using Solr 4.3 and Nutch 1.4.
Downloads
Most of these will already be set up from the previous guides, so you will not need to download them again:
- Download Java jre7 and jdk7
- Download Tomcat 7
- Download Solr 4.3
- Download Cygwin - run setup.exe and install all packages (the defaults may be sufficient)
- Download Nutch 1.4 bin
Set-up
- Cygwin should be installed to C:/cygwin or similar. Copy your Nutch download to the cygwin/home folder. This Nutch installation will be referred to as $NUTCH_HOME.
- Set up a system environment variable called JAVA_HOME and set it to the location of your JDK (e.g. C:/Java/jdk1.7).
- If cygwin does not recognise that you have set up the environment variable, you can issue the following instruction instead. Note that you will need to retype it in every new cygwin session before issuing any commands that use the JDK.
export JAVA_HOME=[JDK Location]
- In cygwin, change directory into the $NUTCH_HOME/runtime/local/bin directory. If your environment variable has been set up correctly, then running the command ./nutch should produce the output Usage: nutch [-core] COMMAND
cd $NUTCH_HOME/runtime/local/bin
./nutch
# Output should be
Usage: nutch [-core] COMMAND
- NOTE: cygpath: can't convert empty path is not an error and will be displayed each time you run any Nutch command.
- In the $NUTCH_HOME/runtime/local/bin folder create a new folder with any name, say urls. This folder will contain a text file that determines which sites get crawled.
- In the new folder create a text file, say nutch.txt, and add the list of URLs you wish to crawl (e.g. http://amac4.blogspot.co.uk).
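Assuming the example names above (a urls folder holding a nutch.txt seed file - any names work), the two steps can be sketched in cygwin as:

```shell
# Create the seed folder and seed list file (names here are just examples)
mkdir -p urls
printf 'http://amac4.blogspot.co.uk\n' > urls/nutch.txt

# Each line of the file is one start URL for the crawl
cat urls/nutch.txt
```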
- In $NUTCH_HOME/runtime/local/conf open regex-urlfilter.txt and, where it says # accept anything else, add the line +^http://amac4.blogspot.co.uk/ if you wish to crawl anything that comes under the amac4.blogspot.co.uk domain.
- Nutch can only extract data from certain types of file and cannot extract data from images or various other binary file types. You may also not want data extraction to run on certain file types, so you have the option to ignore those files by adding their extensions to the list in the same file, in the same fashion as the entries already shown (|xml|XML|jpeg|asp| etc.).
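After both edits, the relevant part of regex-urlfilter.txt would look something like this sketch (the suffix list shown here is abbreviated; the default list in your copy may differ):

```
# skip URLs with these suffixes (add any extensions you do not want parsed)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe|jpeg|JPEG|xml|XML|asp)$

# accept anything else
+^http://amac4.blogspot.co.uk/
```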
- Now open nutch-site.xml and add the following code between the <configuration> tags (you can use any name in the value field):
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
- In nutch-default.xml there should be a <property> block containing <name>http.agent.name</name>. The <value> field below it should be empty, so add the name of your crawler that you specified before; in this case it would be:
<value>My Nutch Spider</value>
- You can test that the crawl works by navigating to the $NUTCH_HOME/runtime/local/bin folder and executing:
cd $NUTCH_HOME/runtime/local/bin
./nutch crawl urls -dir [dir name] -depth [depth] -topN [files]
- The directory you supply (dir name) will store the indexes from the crawl; running the crawl again will re-index any files that are found again. The depth is how far down the link hierarchy you wish to go, and topN is how many pages on each level you wish to index.
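A quick sanity check on these parameters: since at most topN pages are fetched per level, a crawl touches at most depth × topN pages, which is worth keeping in mind when sizing a test run. For the example values used later in this guide:

```shell
# Rough upper bound on pages fetched for a given crawl configuration
depth=3
topN=4
echo "at most $((depth * topN)) pages"   # prints "at most 12 pages"
```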
- To link Nutch to Solr, copy schema.xml from $NUTCH_HOME/runtime/local/conf into the $SOLR_HOME/collection1/conf/ folder, overwriting the previous schema.xml file - but only if you are using Solr 1, 2 or 3. If you are using Solr 4, as we are in this case, copy schema-solr4.xml from the $NUTCH_HOME/runtime/local/conf directory into $SOLR_HOME/collection1/conf/ and rename it there to schema.xml, which should overwrite the old one. Any changes made to schema.xml during the Solr set-up may need to be re-done.
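The Solr 4 copy-and-rename can be sketched as below. Temporary stand-in directories are used so the commands are self-contained; on your machine the real $NUTCH_HOME and $SOLR_HOME conf paths will differ:

```shell
# Stand-in conf directories (replace with your real Nutch and Solr conf paths)
NUTCH_CONF=$(mktemp -d)
SOLR_CONF=$(mktemp -d)
echo '<schema/>' > "$NUTCH_CONF/schema-solr4.xml"

# Solr 4: copy schema-solr4.xml and rename it to schema.xml in one step
cp "$NUTCH_CONF/schema-solr4.xml" "$SOLR_CONF/schema.xml"

ls "$SOLR_CONF"   # prints "schema.xml"
```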
- Add this line to the schema.xml in the Solr installation and also to the schema-solr4.xml in the Nutch installation:
<field name="_version_" type="long" stored="true" indexed="true" multiValued="false"/>
- If you are NOT using the Solr4 schema file then edit the schema file you copied over and comment out the line like this:
<!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
- To test that the crawl is indexing to Solr, type into cygwin:
./nutch crawl urls -dir newCrawl -solr http://localhost:8080/solr/ -depth 3 -topN 4
- Nutch is now configured to crawl the web, but some issues have turned up in practice, so a few further changes need to be made. These are described in the sections that follow.
Optimising Nutch Performance
You may notice that if you try to run Nutch it works its way through the crawl very slowly. That is because by default Nutch is set up to use a single thread and does not take advantage of its multi-threaded implementation. Nutch can use multiple threads to crawl various hosts simultaneously, so the settings must be changed in order for crawl times to be kept at a reasonable level.
In nutch-site.xml, add the following between the <configuration> tags:
<property>
<name>fetcher.threads.per.queue</name>
<value>10</value>
<description></description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>10</value>
<description></description>
</property>
Parsing Errors
You may find errors popping up every so often that look similar to
Error parsing: 192.168.0.42/test/AGENDA.doc: failed(2,0): Unable to read 512 bytes from 65536 in stream of length 65421
Fix: by default there is a limit on how much data Nutch will download for parsing, so to remove that limit set the content limit to -1. This has repercussions in terms of performance, as large files will take longer to parse, but no file should fail because of its length.
Add this property to nutch-site.xml:
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be
truncated; otherwise, no truncation at all.
</description>
</property>
Failing to parse documents with a space in the title
URLs are not allowed to contain whitespace, and Nutch was not replacing the space character with %20, which made the filename invalid.
Fix: add the following rule to regex-normalize.xml:
<regex>
<pattern> </pattern>
<substitution>%20</substitution>
</regex>
Next Steps
You may now want to set up Nutch to crawl a local filesystem, or build a web service to query Solr; the guides can be found at:
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html
http://amac4.blogspot.co.uk/2013/07/web-service-to-query-solr-rest.html
Otherwise you may wish to check out some tweaks that can be made to Solr including deduplication and highlighting:
http://amac4.blogspot.co.uk/2013/08/setting-up-highlighting-for-solr-4.html
http://amac4.blogspot.co.uk/2013/08/deleting-dead-urls-files-that-no-longer.html