lucene-solr-user mailing list archives

From Sebastián Ramírez <sebastian.rami...@senseta.com>
Subject Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1
Date Fri, 10 May 2013 18:34:47 GMT
Thanks for your reply Jack!

First: LOL

Second: I'm using the latest version of LibreOffice. With the "extractOnly"
param in the Solr request, the response shows the content of the file, so
Tika is able to read and extract the data, but Solr isn't indexing that
data.
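
To make that comparison repeatable, here is a minimal Python sketch that
builds the two request URLs used in this thread. The helper names
(extract_only_url, index_url) are hypothetical. The "extractFormat"
parameter is described in the ExtractingRequestHandler documentation, but
whether it behaves identically on 4.2.1 is an assumption. Nothing is sent
over the network; the functions only build strings.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl commands in this thread.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def extract_only_url(base=SOLR_EXTRACT_URL, fmt="text"):
    """URL asking Solr to return Tika's output without indexing it.

    fmt is passed as extractFormat ("xml" or "text" per the Solr docs).
    """
    params = {"extractOnly": "true", "extractFormat": fmt}
    return base + "?" + urlencode(params)


def index_url(doc_id, base=SOLR_EXTRACT_URL):
    """URL that indexes the posted file under doc_id and commits."""
    params = {"commit": "true", "literal.id": doc_id}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(extract_only_url())
    print(index_url("newods"))
```

Posting the same .ods file to both URLs and comparing the responses should
show whether the cell text is lost before or after Tika hands it to Solr.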

Third: I already tried that, with no luck. I tried
"application/vnd.oasis.opendocument.spreadsheet", "application/ods" and
"application/octet-stream", but always got the same result.

According to the "Concepts" section of the ExtractingRequestHandler
documentation <http://wiki.apache.org/solr/ExtractingRequestHandler#Concepts>,
Tika reads the file and feeds it to a "SAX ContentHandler", and "Solr then
reacts to Tika's SAX events and creates the fields to index". I think the
problem might be somewhere in that process: either in how the "SAX
ContentHandler" is fed or in how Solr reacts to those "SAX events".

Do you (or anyone else) know how one could configure or debug that "SAX
ContentHandler"?
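
One way to look at the SAX event flow outside Solr is to replay the XHTML
returned by "extractOnly" through a logging handler. The following is a
minimal Python sketch using the standard xml.sax module. It is not Solr's
own handler (that is Java code inside Solr), and the class and function
names (LoggingTextHandler, trace) are hypothetical. The sample input is a
shortened form of the XHTML quoted later in this message.

```python
import xml.sax


class LoggingTextHandler(xml.sax.ContentHandler):
    """Records every SAX event and collects non-blank character data."""

    def __init__(self):
        super().__init__()
        self.events = []      # (event_name, detail) tuples, in arrival order
        self.text_parts = []  # stripped, non-blank character chunks

    def startDocument(self):
        self.events.append(("startDocument", ""))

    def endDocument(self):
        self.events.append(("endDocument", ""))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        text = content.strip()
        if text:  # skip whitespace-only chunks between tags
            self.events.append(("characters", text))
            self.text_parts.append(text)


def trace(xhtml_bytes):
    """Parse the given XHTML bytes and return the populated handler."""
    handler = LoggingTextHandler()
    xml.sax.parseString(xhtml_bytes, handler)
    return handler


# Shortened sample based on the extractOnly output in this thread.
SAMPLE = b"""<body>
<table>
<tr><td><p>test</p></td></tr>
<tr><td><p>de</p></td></tr>
<tr><td><p>ods</p></td></tr>
</table>
<p>???</p>
<p>Page</p>
</body>"""

if __name__ == "__main__":
    result = trace(SAMPLE)
    for name, detail in result.events:
        print(name, detail)
    print(" ".join(result.text_parts))
```

This only shows which character events a generic handler receives and in
what order. Whether Solr's handler treats the same events differently is
something the sketch cannot tell.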


Thanks,

Sebastián Ramírez



On Fri, May 10, 2013 at 10:57 AM, Jack Krupansky <jack@basetechnology.com> wrote:

> Switching to Microsoft Office will probably solve your problem!
>
> Sorry, I couldn't resist.
>
> Are you using a really new or really old version of the ODT/ODS software?
> I mean, maybe Tika doesn't have support for that version.
>
> Check the mime type that Tika generates - maybe you just need to override
> it to force Tika to use the proper format.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Sebastián Ramírez
> Sent: Friday, May 10, 2013 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: Tika not extracting content from ODT / ODS (open document /
> libreoffice) in Solr 4.2.1
>
>
> Hello everyone,
>
> I'm having a problem indexing content from "opendocument format" files,
> i.e. the files created with OpenOffice and LibreOffice (odt, ods...).
>
> Tika is able to read the files, but Solr is not indexing the content.
>
> It's not a problem of committing or something like that: after I post a
> file it is indexed and all the metadata is indexed/stored, but the content
> isn't there.
>
>
>   - I modified the solrconfig.xml file to catch everything:
>
>
> <requestHandler name="/update/extract"...
>
>    <!-- here is the interesting part -->
>
>    <!-- <str name="uprefix">ignored_</str> -->
>    <str name="defaultField">all_txt</str>
>
>
>
>   - Then I submitted the file to Solr:
>
>
> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>   --data-binary @test_ods.ods
>
>
>
>   - Now when I do a search in Solr I get this result. There is something
>   in the "content", but that's not the actual content of the original file:
>
> <result name="response" numFound="1" start="0">
>  <doc>
>    <str name="id">newods</str>
>    <arr name="all_txt">
>      <str>1</str>
>      <str>2013-05-03T10:02:10.58</str>
>      <str>2013-05-03T10:02:50.54</str>
>      <str>2013-05-03T10:02:50.54</str>
>      <str>1</str>
>      <str>2013-05-03T10:02:10.58</str>
>      <str>1</str>
>      <str>2013-05-03T10:02:50.54</str>
>      <str>2013-05-03T10:02:50.54</str>
>      <str>0</str>
>      <str>P0D</str>
>      <str>2013-05-03T10:02:10.58</str>
>      <str>1</str>
>      <str>0</str>
>      <str>application/ods</str>
>      <str>0</str>
>      <str>7322</str>
>      <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
>      <str>2013-05-03T10:02:50.54</str>
>    </arr>
>    <date name="last_modified">2013-05-03T10:02:50Z</date>
>    <arr name="content_type">
>      <str>application/vnd.oasis.opendocument.spreadsheet</str>
>    </arr>
>    <arr name="content">
>      <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /    </str>
>    </arr>
>    <long name="_version_">1434658995848609792</long></doc></result></response>
>
>
>   - I ask Solr to show me the extracted content from Tika doing this:
>
>
> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>   --data-binary @test_ods.ods
>
>
>
>   - And I get the XHTML extracted by Tika, including the original file
>   contents and that final part that Solr is indeed indexing. So Tika is
>   able to read the file, but Solr is not indexing the real content; it
>   only indexes the rest:
>
> <body>
> <table>
> <tr>
>    <td>
>        <p>test</p>
>    </td>
> </tr>
> <tr>
>    <td>
>        <p>de</p>
>    </td>
> </tr>
> <tr>
>    <td>
>        <p>ods</p>
>    </td>
> </tr>
> </table>
>
> <p xmlns="http://www.w3.org/1999/xhtml">???</p>
> <p>Page</p>
> <p>??? (???)</p>
> <p>00/00/0000, 00:00:00</p>
> <p>Page / </p>
> </body>
>
> Do any of you know how to fix/workaround this problem?
>
> Thanks!
>
> Sebastián Ramírez
>
> --
> *----------------------------------------------------*
> *This e-mail transmission, including any attachments, is intended only for
> the named recipient(s) and may contain information that is privileged,
> confidential and/or exempt from disclosure under applicable law. If you
> have received this transmission in error, or are not the named
> recipient(s), please notify Senseta immediately by return e-mail and
> permanently delete this transmission, including any attachments.*
>

