lucene-solr-user mailing list archives

From Shyam R <shyam.reme...@gmail.com>
Subject Re: Indexing PDF and MS Office files
Date Wed, 15 Apr 2015 04:14:38 GMT
Vijay,

You could try Excel files in different formats to rule out a problem with
the Tika version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <trhodes314@gmail.com> wrote:

> Perhaps the PDF is protected and the content cannot be extracted?
>
> I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may
> not support some or all Office 2013 document formats.
>
> On 4/14/2015 8:18 PM, Jack Krupansky wrote:
>
>> Try doing a manual extraction request directly to Solr (not via SolrJ) and
>> use the extractOnly option to see if the content is actually extracted.
>>
>> See:
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>
>> Also, some PDF files actually have the content as a bitmap image, so no
>> text is extracted.
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
>> vijaya.bhoomireddy@whishworks.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>>> .pptx, .xls, and .xlsx) into Solr, and I am facing the following issues.
>>> Please let me know what is going wrong with the indexing process.
>>>
>>> I am using Solr 4.10.2 with the default example server configuration
>>> that comes with the Solr distribution.
>>>
>>> PDF files - Indexing itself works fine, and when I query using *:* in the
>>> Solr Query console, the metadata is displayed properly. However, the PDF
>>> content field is empty. This happens for every PDF file I have tried,
>>> including some proprietary files and PDF eBooks. Whatever the PDF file,
>>> the content is not displayed.
>>>
>>> MS Office files - For some Office files, everything works perfectly and
>>> the extracted content is visible in the query console. For others,
>>> however, I see the error message below during indexing.
>>>
>>> Exception in thread "main"
>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>>> org.apache.tika.parser.microsoft.OfficeParser
>>>
>>>
>>> I am using SolrJ to index the documents and below is the code snippet
>>> related to indexing. Please let me know where the issue is occurring.
>>>
>>> static String solrServerURL = "http://localhost:8983/solr";
>>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>>> static ContentStreamUpdateRequest indexingReq =
>>>         new ContentStreamUpdateRequest("/update/extract");
>>>
>>> indexingReq.addFile(file, fileType);
>>> indexingReq.setParam("literal.id", literalId);
>>> indexingReq.setParam("uprefix", "attr_");
>>> indexingReq.setParam("fmap.content", "content");
>>> indexingReq.setParam("literal.fileurl", fileURL);
>>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>> solrServer.request(indexingReq);
>>>
>>> Thanks & Regards
>>> Vijay
>>>
>>> --
>>> The contents of this e-mail are confidential and for the exclusive use of
>>> the intended recipient. If you receive this e-mail in error please delete
>>> it from your system immediately and notify us either by e-mail or
>>> telephone. You should not copy, forward or otherwise disclose the content
>>> of the e-mail. The views expressed in this communication may not
>>> necessarily be the view held by WHISHWORKS.
>>>
>>>
>


-- 
Ph: 9845704792
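
Jack's suggestion in the quoted thread, a manual extraction request sent
straight to Solr with the extractOnly option, might look roughly like the
sketch below. It uses plain java.net rather than SolrJ so that the client
library is out of the picture. Several things here are assumptions and not
taken from the thread: the class and method names (ExtractOnlyCheck,
buildUrl, extractOnly) are made up for illustration, the base URL is copied
from Vijay's snippet, and extractFormat=text and wt=json are chosen only to
make the response easier to read.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Solr or SolrJ.
class ExtractOnlyCheck {

    // Builds the /update/extract URL with extractOnly=true so that Solr
    // returns the Tika output instead of indexing the document.
    static String buildUrl(String solrBaseUrl) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + "/update/extract?extractOnly=true&extractFormat=text&wt=json";
    }

    // POSTs the raw file bytes and returns the HTTP status plus response body.
    static String extractOnly(String solrBaseUrl, Path file, String contentType)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create(buildUrl(solrBaseUrl)).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }

        int status = conn.getResponseCode();
        // On 4xx/5xx the body (for example a Tika stack trace) is on the error stream.
        InputStream body = status < 400 ? conn.getInputStream() : conn.getErrorStream();
        if (body == null) {
            return "HTTP " + status + " (empty body)";
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = body) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return "HTTP " + status + "\n"
                + new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java ExtractOnlyCheck <file> <content-type>");
            return;
        }
        // Base URL assumed from the snippet in the original question.
        System.out.println(
                extractOnly("http://localhost:8983/solr", Paths.get(args[0]), args[1]));
    }
}
```

Running it against one of the problem files, for example with
"application/pdf" as the content type, should show whether Tika returns any
body text at all. If the response carries metadata but no text, the cause is
more likely the file itself (a protected or image-only PDF) than the SolrJ
code.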
