
Tuesday 23 July 2013

Setting Up Tika's Extracting Request Handler

Some of this is covered in the Solr set-up guide.
Sometimes indexing prepared text files (such as XML, CSV, or JSON) is not enough. There are numerous situations where you need to extract data from binary files – for example, indexing the actual contents of PDF files. To do that we can use Apache Tika, which comes built in with Apache Solr, through its ExtractingRequestHandler.


Preparation

You should have worked through the Solr set-up guide before this point; it can be found at:

If you wish to have a fully functioning file or web crawler using Nutch that indexes to Solr, then follow the next steps of the guide at:

Set-Up Guide

  • In the $SOLR_HOME/collection1/conf/solrconfig.xml file there is a section with the heading "Solr Cell Update Request Handler". The code there should be updated or replaced to say:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
   <str name="lowernames">true</str>
   <str name="uprefix">attr_</str>
   <str name="captureAttr">true</str>
 </lst>
</requestHandler>
  • Create an "extract" folder anywhere in the system, one option would be putting it in the solr_home folder. Then place the solr-cell-4.3.0.jar file in it from the $SOLR/dist. Then copy the contents of the $SOLR/contrib/extraction/lib/ folder into your extract folder.
  • In the solrconfig.xml file, add the following, pointing the dir attribute at the directory you have chosen:
<lib dir="$SOLR_HOME/extract" regex=".*\.jar" />
  • In the schema.xml file, the <field name="text"…> line needs to be edited to say:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
  • To test that it works, open a command prompt, navigate to any directory containing a PDF file, and execute the following command, replacing FILENAME.pdf with the file to be used:
curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"
  • If all has worked correctly, then output like the following should be displayed (the QTime value will vary):
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">578</int>
  </lst>
</response>
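As a further check, you can query Solr for the document you just sent. This is a hypothetical example, assuming the same host and port as above and the literal.id value of 1 used earlier:
curl "http://localhost:8080/solr/select?q=id:1"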

Next Steps

You now have Solr configured properly and ready to use Tika to extract the data that you need. The next step is to configure Nutch, an open source web crawler that will crawl the web to find pages to index:

How It Works

Binary file parsing is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents – not only binary files but also HTML and XML files. To use it, we add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the example. In addition to the handler definition, we need to tell Solr where to look for the additional libraries we placed in the extract directory we created: the dir attribute of the lib tag should point to the path of that directory, and the regex attribute is the regular expression telling Solr which files to load.

Now let's discuss the default configuration parameters. The fmap.content parameter tells Solr which field the parsed content of the document should go to; in our case it goes to the field named text. The next parameter, lowernames, is set to true, which tells Solr to lowercase all field names that come from Tika. The next parameter, uprefix, is very important: it tells Solr how to handle fields that are not defined in the schema.xml file. The name of any such field returned from Tika is prefixed with the value of this parameter before being sent to Solr. For example, if Tika returned a field named creator and we didn't have such a field in our index, Solr would index it under a field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index the attributes of Tika's XHTML elements into separate fields named after those elements.

Finally, we have a command that sends a PDF file to Solr. We send the file to the /update/extract handler with two parameters. The first defines a unique identifier; it's useful to be able to set this while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier we use the literal.id parameter. The second parameter tells Solr to perform a commit right after document processing.
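For the uprefix mechanism to work, schema.xml needs a dynamic field matching the attr_ prefix. The stock Solr 4.3 example schema ships with one along these lines (check your own schema, as the exact field type may differ):
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>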

Source Code

If you are unsure of anything then pop me an email and I can send you a sample schema.xml and solrconfig.xml for you to use.

16 comments:

  1. How do I set up Tika to parse Word docs or several different types of files, mainly Word docs, PDFs and text files? They are all curricula vitae. I have set up Solr with only two errors left (deprecated update handlers). I have not indexed any documents yet as I am not sure how to do this. I have 12,000 CVs I would like to parse and search through. Can you help me?

    1. Look at my guide on running Nutch on a local filesystem; it should do it all for you

  2. Hi, I set up Solr and Nutch already. Now I want to integrate Tika. While following this guide, I am stuck where I am supposed to edit the <field name="text"…> line in the schema.xml file, but I already replaced the schema file with the Nutch schema file, so now there is no field with name="text"... I need help

    1. I would suggest possibly skipping that step then; if it is not working, then try adding it, and if none of those work then message me again :)

      text should point at the "content" of what you are trying to index, which is why that field is needed

    2. Thanks for your reply!
      I am doing all of this on Windows using Cygwin.
      Q1: Do I just have to configure this, or do I need to install Tika?
      Q2: Where do I have to run this command, the Windows cmd or the Cygwin terminal?
      curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf" ...

    3. The Tika jars come with your Solr installation, so no need to "install" anything

      You can run it in either; there is a possibility you may need to download the curl command, as it isn't a standard Cygwin command. It is available through the Windows command line, so I would suggest you do it there. Just make sure that you point it at a real file, so ensure FILENAME.pdf is changed to the name of a file you have; it can be .pdf, .doc, etc.

    4. Thanks a lot Allan Macmillan... Now I got it :)

  3. Can you guide me on how to integrate or configure Apache Tika with Apache Nutch?

  4. OMG... thanks man, I was searching for this last week... Thanks for the help...

  5. Hi Allan,

    How are you doing? This is really very informative. Today I am going to integrate Tika with Solr 4.3.1 and will let you know if anything goes wrong with the installation.

    Thank you,
    Boyer.

  6. Hi Allan,

    Please help me out on this issue: I have configured everything you mentioned above to set up Tika, then I ran the command to check it, but it didn't go well; it gave the error below as the result. I went to the directory where I put the PDF file and then ran it.



    java.lang.NoSuchMethodError: org.apache.tika.config.TikaConfig.<init>(Ljava/lang/ClassLoader;)Vjava.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.tika.config.TikaConfig.<init>(Ljava/lang/ClassLoader;)V
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:313)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
    Caused by: java.lang.NoSuchMethodError: org.apache.tika.config.TikaConfig.<init>(Ljava/lang/ClassLoader;)V
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.getDefaultConfig(ExtractingRequestHandler.java:107)
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:96)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:253)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    ... 16 more
    500


    I will be looking forward to hearing from you.

    Thank you...

  7. Hi,

    Please help me out on this issue. Indexing a single doc (test.doc) using either of the commands below works fine:
    C:\solr-home\exampledocs>java -Durl=http://localhost:8080/solr/update/extract -Dparams=literal.id=test -Dtype=application/doc -jar post.jar test.doc
    C:\solr-home\exampledocs>java -Durl=http://localhost:8080/solr/update/extract?literal.id=test -Dtype=application/doc -jar post.jar test.doc

    But now I want to index a folder containing more than one doc file, with a unique id for each document, and I am facing a problem:

    C:\solr-home\exampledocs>java -Durl=http://localhost:8080/solr/update/extract -Dparams=literal.id=test -Dtype=application/doc -jar post.jar *.doc

    I want to assign a unique id to each document. Please help me out...

    Thanks..

  8. Hi
    For version 5.* you need to update the line
    curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"
    to
    curl "http://localhost:8080/solr/collection1/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"

  9. This comment has been removed by the author.

  10. I'm using SOLR 5.4.1 to index pdf and other type of documents using the ExtractingRequestHandler. The indexing itself works fine but I don't know how to get the page number and paragraph of a search result.

    I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5."

    Can you please tell me how to do it?

  11. It is not easy setting up Tika's Extracting Request Handler. You mention useful steps here, but some errors came up; I resolved them in my Apache Tomcat training sessions. Thank you.
