
Monday 22 July 2013

Installation Guide To Set Up Nutch To Crawl A Filesystem Or Intranet (Windows)

Preparation

At this point you should have Solr and Tika set up, as well as Nutch ready to crawl the web. The guides can be found here:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html
http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
The changes described below assume that Nutch is already configured to crawl the web, so please make sure you have followed that post first.

URLs/nutch.txt Format

This is where you specify the locations you would like to crawl, so add each location on its own line. If you wish to crawl a fileshare, it must be in a format similar to:
file:////189.189.1.42/test

Or if you wish to crawl the local filesystem it must be in this format:
file:/C:/Users/alamil/test/
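
For example, a urls/nutch.txt seed file covering both kinds of location (the share and local paths here are purely illustrative, matching the examples above) would simply list them one per line:
file:////189.189.1.42/test
file:/C:/Users/alamil/test/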

Required Changes

Change 1 (regex-urlfilter.txt)

By default Nutch is set to skip anything that begins with "file:", so our first change is to get Nutch to accept files. Nutch is currently set up to parse anything it finds via http, so we change this line:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
to say
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

Change 2 (regex-urlfilter.txt)

This change is not strictly necessary, but it may make your life easier. Add any file types you do not want to index to this list; otherwise Nutch will often try to parse them and fail, as it does not know how to deal with many binary file types:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY|cs|CS|dll|DLL|refresh|REFRESH)$

Change 3 (regex-urlfilter.txt)

Our final change to Nutch is to filter out unwanted files from being parsed. Often you will only want the initial directory you supplied, together with its subdirectories and files, to be indexed. For example, if I supply Nutch with file:////189.189.1.42/test in the nutch.txt file, then I only want the files within that directory and its subdirectories to be indexed, so I would filter with:
# accept anything else
+^file:////189.189.1.42/test
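
If you were crawling the local-filesystem example from earlier instead, the equivalent accept rule would look something like this (the path is just the illustrative one used above):
+^file:/C:/Users/alamil/test/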

Change 4 (nutch-site.xml)

To allow Nutch to recognise that it is dealing with files, it needs the protocol-file plugin to be activated. Adding protocol-file to the list of plugins that are used lets Nutch know how to deal with files:
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
  <description>Regular expression naming plugin directory names to 
    include.  Any plugin not matching this expression is excluded. 
    In any case you need at least include the nutch-extensionpoints plugin.
  </description>
</property>

Change 5 (nutch-site.xml)

We encountered the problem of file size earlier with the web crawler: if you set a limit on the content size it can cause the read to fail, and the same applies here, so no matter how big the file is we want it to be parsed. Setting the limit to -1 removes it:
<property>
 <name>file.content.limit</name>
 <value>-1</value>
 <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>
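
For reference, both of the properties from Changes 4 and 5 sit inside the single top-level <configuration> element of nutch-site.xml; a minimal sketch of the file looks roughly like this:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>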

Change 6 (regex-normalize.xml)

We do not want duplicate slashes to be replaced by single slashes, otherwise the file locations we specify will not be valid. Therefore we need to comment out the rule that does the replacement:
<!-- removes duplicate slashes 
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
-->

Running the Crawl

Ensure you have added the locations you wish to crawl to the nutch.txt file.

To run the crawl, like before, you issue the command:
./nutch crawl urls -dir [dir_name] -solr http://localhost:8080/solr/ -depth [depth] -topN [topN]
Be sure to set a depth and a topN number of files to index, otherwise Nutch will use its default depth of 3, which may leave a lot of your files untouched.
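
For example, assuming a crawl directory called crawlfs (just an illustrative name), a depth of 5 and a topN of 1000, the command would be:
./nutch crawl urls -dir crawlfs -solr http://localhost:8080/solr/ -depth 5 -topN 1000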

Next Steps

An issue I have noticed occurs with files that contain an apostrophe in the file path. Nutch was designed to run on Unix, which is why this tutorial requires you to download Cygwin, but this also has its complications. Unix by default does not allow file paths to contain an apostrophe, but Windows file paths can, so when Nutch comes across an apostrophe it thinks it has reached the end of the path. I have written a Java program that solves the problem and it can be found here:

http://amac4.blogspot.co.uk/2013/07/nutch-apostrophesingle-quotes-issue_23.html 

You should now have everything set up, and the next step is to integrate your crawler and search server into your web application. I have written a C# web service that makes calls to the Solr server and returns the XML response; have a look, then copy and tweak it to your own liking to create your own search engine:

http://amac4.blogspot.co.uk/2013/07/web-service-to-query-solr-rest.html
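
If you just want a feel for what such a call looks like before reading that post, here is a minimal, hypothetical C# sketch that sends a query to Solr's select handler and returns the raw XML; the URL assumes the same Solr address used in the crawl command above, and the class and method names are made up for illustration:

using System;
using System.Net;

class SolrQueryExample
{
    // Assumed Solr address, matching the one passed to the crawl command above.
    private const string SolrSelectUrl = "http://localhost:8080/solr/select";

    // Sends a simple query to Solr and returns the raw XML response as a string.
    public static string Query(string searchTerm)
    {
        // wt=xml asks Solr to return its response as XML.
        string url = SolrSelectUrl + "?q=" + Uri.EscapeDataString(searchTerm) + "&wt=xml";
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    static void Main()
    {
        Console.WriteLine(Query("test"));
    }
}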


