
Tuesday 23 July 2013

Setting Up Tika's Extracting Request Handler

Some of this is covered in the Solr set-up guide.
Sometimes indexing prepared text files (such as XML, CSV, or JSON) is not enough. There are numerous situations where you need to extract data from binary files – for example, indexing the actual contents of PDF files. To do that we can use Apache Tika, which comes built in with Apache Solr, through its ExtractingRequestHandler.


Preparation

You should have worked through the Solr set-up guide before this point; it can be found at:

If you wish to have a fully functioning file or web crawler using Nutch that indexes to Solr, then follow the next steps of the guide at:

Set-Up Guide

  • In the $SOLR_HOME/collection1/conf/solrconfig.xml file there is a section with the heading "Solr Cell Update Request Handler". The code there should be updated or replaced to say:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
   <str name="lowernames">true</str>
   <str name="uprefix">attr_</str>
   <str name="captureAttr">true</str>
 </lst>
</requestHandler>
  • Create an "extract" folder anywhere in the system, one option would be putting it in the solr_home folder. Then place the solr-cell-4.3.0.jar file in it from the $SOLR/dist. Then copy the contents of the $SOLR/contrib/extraction/lib/ folder into your extract folder.
  • In the solrconfig.xml file, add the following, pointing the dir attribute at the directory you have chosen:
<lib dir="$SOLR_HOME/extract" regex=".*\.jar" />
  • In the schema.xml file, the <field name="text"…> line needs to be edited to say:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
  • To test that it works, open a command prompt, navigate to any directory containing a PDF file, and execute the following command, replacing FILENAME.pdf with the file to be used:
curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"
  • If all has worked correctly, then output like the following should be displayed (the QTime value will vary):
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">578</int>
  </lst>
</response>
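As a further check, you can query Solr for the document you just sent. This is a hypothetical example, assuming the same host and port as above and the literal.id value of 1 used earlier:
curl "http://localhost:8080/solr/select?q=id:1"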

Next Steps

You now have Solr configured properly and ready to use Tika to extract the data that you need. The next step is to configure Nutch, an open source web crawler that will crawl the web to find pages to index:

How It Works

Binary file parsing is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents – not only binary files but also HTML and XML files. To use it, we add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the example. In addition to the handler definition, we need to tell Solr where to look for the additional libraries we placed in the extract directory we created: the dir attribute of the lib tag should point to the path of that directory, and the regex attribute is the regular expression telling Solr which files to load.

Now let's discuss the default configuration parameters. The fmap.content parameter tells Solr which field the parsed content of the document should go to; in our case it goes to the field named text. The next parameter, lowernames, is set to true, which tells Solr to lowercase all field names that come from Tika. The next parameter, uprefix, is very important: it tells Solr how to handle fields that are not defined in the schema.xml file. The name of any such field returned from Tika is prefixed with the value of this parameter before being sent to Solr. For example, if Tika returned a field named creator and we didn't have such a field in our index, Solr would index it under a field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index the attributes of Tika's XHTML elements into separate fields named after those elements.

Finally, we have a command that sends a PDF file to Solr. We send the file to the /update/extract handler with two parameters. The first defines a unique identifier; it's useful to be able to set this while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier we use the literal.id parameter. The second parameter tells Solr to perform a commit right after document processing.
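For the uprefix mechanism to work, schema.xml needs a dynamic field matching the attr_ prefix. The stock Solr 4.3 example schema ships with one along these lines (check your own schema, as the exact field type may differ):
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>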

Source Code

If you are unsure of anything then pop me an email and I can send you a sample schema.xml and solrconfig.xml for you to use.

16 comments:

  1. How do I set up Tika to parse Word docs or several different types of files, mainly Word docs, PDFs and text files? They are all curricula vitae. I have set up Solr with only two errors left (deprecated update handlers). I have not indexed any documents yet as I am not sure how to do this. I have 12,000 CVs I would like to parse and search through. Can you help me?

    1. Look at my guide on running Nutch on a local filesystem; it should do it all for you

  2. Hi, I set up Solr and Nutch already. Now I want to integrate Tika. While following this guide, I am stuck where I am supposed to edit the <field name="text"…> line in the schema.xml file, but I already replaced the schema file with the Nutch schema file, so now there is no field with name="text"... I need help

    1. I would suggest possibly skipping that step then; if it is not working, then try adding it, and if none of those work then message me again :)

      text should point at the "content" of what you are trying to index, which is why that field is needed

    2. Thanks for your reply!
      I am doing all of this on Windows using Cygwin.
      Q1: Do I just have to configure this, or do I need to install Tika?
      Q2: Where do I have to run this command, the Windows cmd or the Cygwin terminal?
      curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf" ...

    3. The Tika jars come with your Solr installation, so no need to "install" anything

      You can run it in either; there is a possibility you may need to download the curl command, as it isn't a standard Cygwin command. It is available through the Windows command line, so I would suggest you do it there. Just make sure that you point it at a real file, so ensure FILENAME.pdf is changed to the name of a file you have; it can be .pdf, .doc, etc.

    4. Thanks a lot Allan Macmillan... Now I got it :)

  3. Can you guide me on how to integrate or configure Apache Tika with Apache Nutch?

  4. OMG... thanks man, I was searching for this last week... Thanks for the help...

  5. Hi Allan,

    How are you doing? This is really very informative. Today I am going to integrate Tika with Solr 4.3.1 and will let you know if anything goes wrong with the installation.

    Thank you,
    Boyer.

  6. Hi Allan,

    Please help me out on this issue: I have configured everything you mentioned above to set up Tika, then I ran the command to check it, but it didn't go well; it gave the error below as the result. I went to the directory where I put the PDF file and then ran it.



    java.lang.NoSuchMethodError: org.apache.tika.config.TikaConfig.<init>(Ljava/lang/ClassLoader;)Vjava.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.tika.config.TikaConfig.<init>(Ljava/lang/ClassLoader;)V
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:313)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
    Caused by: java.lang.NoSuchMethodError: org.apache.tika.config.TikaConfig.<init>(Ljava/lang/ClassLoader;)V
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.getDefaultConfig(ExtractingRequestHandler.java:107)
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:96)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:253)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    ... 16 more
    500


    I will be looking forward to hearing from you.

    Thank you...

  7. Hi,

    Please help me out on this issue. Indexing a single doc (test.doc) using either of the commands below works fine:
    C:\solr-home\exampledocs>java -Durl=http://localhost:8080/solr/update/extract -Dparams=literal.id=test -Dtype=application/doc -jar post.jar test.doc
    C:\solr-home\exampledocs>java -Durl=http://localhost:8080/solr/update/extract?literal.id=test -Dtype=application/doc -jar post.jar test.doc

    But now I want to index a folder containing more than one doc file, with a unique id for each document, and I am facing a problem:

    C:\solr-home\exampledocs>java -Durl=http://localhost:8080/solr/update/extract -Dparams=literal.id=test -Dtype=application/doc -jar post.jar *.doc

    I want to assign a unique id to each document. Please help me out...

    Thanks..

  8. Hi
    For version 5.* you need to update the line
    curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"
    to
    curl "http://localhost:8080/solr/collection1/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"

  9. This comment has been removed by the author.

  10. I'm using SOLR 5.4.1 to index pdf and other type of documents using the ExtractingRequestHandler. The indexing itself works fine but I don't know how to get the page number and paragraph of a search result.

    I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5."

    Can you please tell me how to do it?

  11. It is not easy setting up Tika's Extracting Request Handler. You mention useful steps here, but some errors came up; I resolved them in my Apache Tomcat training sessions. Thank you.
