lucene-solr-user mailing list archives

From Sebastián Ramírez <sebastian.rami...@senseta.com>
Subject Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1
Date Fri, 10 May 2013 23:52:25 GMT
OK Jack, I'll switch to MS Office ...hahaha

Many thanks for your interest and help... and the bug report in JIRA.

Best,

Sebastián Ramírez


On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky <jack@basetechnology.com> wrote:

> I filed SOLR-4809 - "OpenOffice document body is not indexed by
> SolrCell", including some test files.
>
> https://issues.apache.org/jira/browse/SOLR-4809
>
> Yeah, at this stage, switching to Microsoft Office seems like the best bet!
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Sebastián Ramírez
> Sent: Friday, May 10, 2013 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Tika not extracting content from ODT / ODS (open document /
> libreoffice) in Solr 4.2.1
>
>
> Many thanks Jack for your attention and effort on solving the problem.
>
> Best,
>
> Sebastián Ramírez
>
>
> On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky <jack@basetechnology.com>
> wrote:
>
>> I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
>> to index the proper content, both for .ODP and .ODT files.
>>
>> If I do extractOnly=true&extractFormat=text, I see the extracted text
>> clearly in addition to the metadata.
>>
>> I tested on 4.3, and then tested on Solr 3.6.1 and it also exhibited the
>> problem. I just see spaces in both cases.
>>
>> But whether the problem is due to Solr or Tika is not apparent.
>>
>> In any case, a Jira is warranted.
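
The extract-only check described above can also be scripted. What follows is a minimal Python sketch under stated assumptions, not something posted in the thread: `build_extract_url` is a hypothetical helper name, and the base URL is the localhost one from the curl commands quoted further down. The two query parameters are the ones named above. The helper only builds the request URL and sends nothing.

```python
from urllib.parse import urlencode

# Base URL taken from the curl commands in this thread; adjust for your setup.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(base_url=SOLR_EXTRACT_URL, extract_format="text"):
    """Build a URL that asks for the extracted output without indexing it.

    Hypothetical helper for illustration; only the two parameters
    mentioned in the message above are set.
    """
    params = {"extractOnly": "true", "extractFormat": extract_format}
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url())
```

The resulting URL can be used with the same `-H` and `--data-binary` options as the curl commands in the original question.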
>>
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Sebastián Ramírez
>> Sent: Friday, May 10, 2013 11:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: Tika not extracting content from ODT / ODS (open document /
>> libreoffice) in Solr 4.2.1
>>
>> Hello everyone,
>>
>> I'm having a problem indexing content from "opendocument format" files,
>> the files created with OpenOffice and LibreOffice (odt, ods...).
>>
>> Tika is able to read the files but Solr is not indexing the content.
>>
>> It's not a problem of committing or something like that: after I post a
>> file it is indexed and all the metadata is indexed/stored, but the content
>> isn't there.
>>
>>
>>   - I modified the solrconfig.xml file to catch everything:
>>
>>
>> <requestHandler name="/update/extract"...
>>
>>    <!-- here is the interesting part -->
>>
>>    <!-- <str name="uprefix">ignored_</str> -->
>>    <str name="defaultField">all_txt</str>
>>
>>
>>
>>
>>   - Then I submitted the file to Solr:
>>
>>
>> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>>
>>
>>
>>   - Now when I do a search in Solr I get this result. There is something
>>   in the "content", but that's not the actual content of the original
>>   file:
>>
>> <result name="response" numFound="1" start="0">
>>  <doc>
>>    <str name="id">newods</str>
>>    <arr name="all_txt">
>>      <str>1</str>
>>      <str>2013-05-03T10:02:10.58</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>      <str>1</str>
>>      <str>2013-05-03T10:02:10.58</str>
>>      <str>1</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>      <str>0</str>
>>      <str>P0D</str>
>>      <str>2013-05-03T10:02:10.58</str>
>>      <str>1</str>
>>      <str>0</str>
>>      <str>application/ods</str>
>>      <str>0</str>
>>      <str>7322</str>
>>      <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>    </arr>
>>    <date name="last_modified">2013-05-03T10:02:50Z</date>
>>    <arr name="content_type">
>>      <str>application/vnd.oasis.opendocument.spreadsheet</str>
>>    </arr>
>>    <arr name="content">
>>      <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /    </str>
>>    </arr>
>>    <long name="_version_">1434658995848609792</long>
>>  </doc>
>> </result>
>> </response>
>>
>>
>>   - I ask Solr to show me the extracted content from Tika doing this:
>>
>>
>> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>>
>>
>>
>>   - And I get the XHTML extracted from Tika, including the original file
>>   contents and the final part that Solr is indeed indexing. So Tika is
>>   able to read the file, but Solr is not indexing the real content; it
>>   only indexes the rest:
>>
>> <body>
>> <table>
>> <tr>
>>    <td>
>>        <p>test</p>
>>    </td>
>> </tr>
>> <tr>
>>    <td>
>>        <p>de</p>
>>    </td>
>> </tr>
>> <tr>
>>    <td>
>>        <p>ods</p>
>>    </td>
>> </tr>
>> </table>
>>
>> <p xmlns="http://www.w3.org/1999/xhtml">???</p>
>> <p>Page</p>
>> <p>??? (???)</p>
>> <p>00/00/0000, 00:00:00</p>
>> <p>Page / </p>
>> </body>
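
The comparison described above, between what Tika extracts and what ends up in the indexed "content" field, can be sketched in Python. This is an illustration under assumptions, not part of the original report: `TextCollector` and `missing_from_index` are hypothetical names, only the standard library is used, and the check is a crude substring match on whitespace-normalized text, so very short chunks could match by accident.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an (X)HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collapse runs of whitespace so chunks compare cleanly.
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def missing_from_index(extracted_xhtml, indexed_content):
    """Return extracted text chunks that do not occur in the indexed field.

    Hypothetical helper: uses a plain substring test, not real tokenization.
    """
    collector = TextCollector()
    collector.feed(extracted_xhtml)
    collector.close()
    indexed = " ".join(indexed_content.split())
    return [chunk for chunk in collector.chunks if chunk not in indexed]
```

Fed the `<body>` fragment above and the "content" value from the search result, this should list the three table cells as missing and treat the trailing page-footer paragraphs as present.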
>>
>> Do any of you know how to fix/workaround this problem?
>>
>> Thanks!
>>
>> Sebastián Ramírez
>>
>> --
>> *----------------------------------------------------*
>>
>> *This e-mail transmission, including any attachments, is intended only for
>> the named recipient(s) and may contain information that is privileged,
>> confidential and/or exempt from disclosure under applicable law. If you
>> have received this transmission in error, or are not the named
>> recipient(s), please notify Senseta immediately by return e-mail and
>> permanently delete this transmission, including any attachments.*
>>
>>
>

