Preparation
At this point you should have Solr and Tika set up, and Nutch ready to crawl the web. The guides can be found here:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html
http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
The changes needed to make Nutch crawl a file system assume that Nutch is already set up to crawl the web, so please ensure you have read that post.
URLs/nutch.txt Format
This is where you specify the locations you would like to crawl; add each location on its own line. If you wish to crawl a file share, it must be in a format similar to:
file:////189.189.1.42/test
Or if you wish to crawl the local filesystem it must be in this format:
file:/C:/Users/alamil/test/
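For example, a urls/nutch.txt that seeds both of the locations above (the paths are purely illustrative) would simply list one location per line:
file:////189.189.1.42/test
file:/C:/Users/alamil/test/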
Required Changes
Change 1 (regex-urlfilter.txt)
The default setting for Nutch is to skip anything that begins with "file", so our first change is to get Nutch to parse files. Nutch is currently set up to parse anything it finds via http, so we need to change this line:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
to say
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):
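With the rule reversed, the filtering works the opposite way round to the web-crawling set-up; for example (illustrative seeds):
http://example.com/page.html   (now skipped by this rule)
file:////189.189.1.42/test     (passed on to the later rules)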
Change 2 (regex-urlfilter.txt)
This change is not strictly necessary, but it may make your life easier. Any file types you do not want to index should be added to this list; otherwise Nutch will often try to parse them and fail, as it does not know how to deal with many binary file types:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY|cs|CS|dll|DLL|refresh|REFRESH)$
Change 3 (regex-urlfilter.txt)
Our final change to regex-urlfilter.txt is to stop unwanted files from being parsed. Often you will only want the initial directory you supplied, its subdirectories and their files to be indexed. For example, if I supply file:////189.189.1.42/test in the nutch.txt file, then I only want files within that directory and its subdirectories to be indexed, so I would filter with:
# accept anything else
+^file:////189.189.1.42/test
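If you also want to accept the local folder from the earlier example, list one accept rule per location; any URL that matches none of the rules should simply be dropped. A sketch of how the end of regex-urlfilter.txt could look, using the example paths from above:
# accept only our seed locations
+^file:////189.189.1.42/test
+^file:/C:/Users/alamil/test/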
Change 4 (nutch-site.xml)
To allow Nutch to recognise that it is dealing with files, a plugin needs to be activated. Adding the protocol-file plugin to the list of plugins in use lets Nutch know how to deal with files:
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin.
</description>
</property>
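If you want to confirm that the file protocol and Tika are being picked up before running a full crawl, Nutch 1.x ships a parsechecker tool that you can point at a single document. The available flags vary between versions, and the file name below is just a hypothetical example:
./nutch parsechecker file:/C:/Users/alamil/test/example.docx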
Change 5 (nutch-site.xml)
We encountered the problem of file size earlier with the web crawler: if you set a limit on the content size it can cause the read to fail, and the same applies here, so no matter how big the file is we want it to be parsed.
<property>
<name>file.content.limit</name>
<value>-1</value>
<description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>
Change 6 (regex-normalize.xml)
We do not want duplicate slashes to be replaced by single slashes, otherwise the file paths we specify will no longer be valid (a seed such as file:////189.189.1.42/test would have its run of slashes collapsed and would no longer point at the share). Therefore we need to comment out the rule that does the replacement:
<!-- removes duplicate slashes
<regex>
<pattern>(?<!:)/{2,}</pattern>
<substitution>/</substitution>
</regex>
-->
Running the Crawl
Ensure you have added the locations you wish to crawl to the urls/nutch.txt file.
To run the crawl, like before, you issue the command:
./nutch crawl urls -dir [dir_name] -solr http://localhost:8080/solr/ -depth [depth] -topN [topN]
Be sure to set a depth and a topN number of files to index; otherwise Nutch will use its default depth of 3, which may mean that a lot of files are left untouched.
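As a concrete sketch, with a crawl directory of crawl-fs, a depth of 5 and a topN of 1000 (all purely illustrative values), the command would be:
./nutch crawl urls -dir crawl-fs -solr http://localhost:8080/solr/ -depth 5 -topN 1000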
Next Steps
An issue I have noticed occurs with files that contain an apostrophe in the file path. Nutch was designed to run on Unix, which is why this tutorial requires you to download Cygwin, but this also has its complications. Unix shells treat the apostrophe (single quote) as a quoting character, whereas Windows file paths can contain it freely. This means that when Nutch comes across an apostrophe, it thinks it has reached the end of the path. I have written a Java program that solves the problem and it can be found here:
http://amac4.blogspot.co.uk/2013/07/nutch-apostrophesingle-quotes-issue_23.html
You should now have everything set up and working; the next step is to integrate your crawler and search server into your web application. I have written a C# web service that makes calls to the Solr server and returns the XML response. Have a look, then copy and tweak it to your liking to create your own search engine:
http://amac4.blogspot.co.uk/2013/07/web-service-to-query-solr-rest.html