lucene-solr-user mailing list archives

From Terry Rhodes <trhodes...@gmail.com>
Subject Re: Indexing PDF and MS Office files
Date Wed, 15 Apr 2015 04:05:07 GMT
Perhaps the PDF is protected and the content cannot be extracted?

I have an unverified suspicion that the Tika shipped with Solr 4.10.2
may not support some/all Office 2013 document formats.
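
A rough sketch of the direct (non-SolrJ) extractOnly check Jack describes below, using only the JDK's HTTP client (Java 11+). The class and method names are made up for illustration, and the base URL assumes the default example server from the original post. extractOnly, extractFormat, wt and indent are Solr Cell request parameters; the content type you pass is whatever matches the file you are testing.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper class, not part of Solr or SolrJ.
class ExtractOnlyCheck {

    // Builds the /update/extract URL with extractOnly=true so that Solr
    // returns what Tika extracted instead of indexing the document.
    static URI extractOnlyUri(String solrBaseUrl) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return URI.create(base
                + "/update/extract?extractOnly=true&extractFormat=text&wt=json&indent=true");
    }

    // Posts the raw file bytes to Solr and returns the HTTP status line
    // followed by the response body (extracted text plus metadata).
    static String extractOnly(String solrBaseUrl, Path file, String contentType)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(extractOnlyUri(solrBaseUrl))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return "HTTP " + response.statusCode() + "\n" + response.body();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ExtractOnlyCheck <file> <content-type>
        // The base URL below is an assumption taken from the original post.
        String result = extractOnly("http://localhost:8983/solr",
                Path.of(args[0]), args[1]);
        System.out.println(result);
    }
}
```

If the response shows metadata but an empty body for a PDF, the problem is probably in the extraction (protected or image-only PDF) rather than in the field mapping. If the body does contain text, the mapping of the content field is the more likely suspect.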




On 4/14/2015 8:18 PM, Jack Krupansky wrote:
> Try doing a manual extraction request directly to Solr (not via SolrJ) and
> use the extractOnly option to see if the content is actually extracted.
>
> See:
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
> Also, some PDF files actually have the content as a bitmap image, so no
> text is extracted.
>
>
> -- Jack Krupansky
>
> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
> vijaya.bhoomireddy@whishworks.com> wrote:
>
>> Hi,
>>
>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
>> Please let me know what is going wrong with the indexing process.
>>
>> I am using Solr 4.10.2 with the default example server configuration
>> that comes with the Solr distribution.
>>
>> PDF files - Indexing itself works fine, and when I query using *:* in the
>> Solr Query console, the metadata is displayed properly. However, the PDF
>> content field is empty. This happens for every PDF file I have tried,
>> including some proprietary files and PDF eBooks. Whatever the PDF file,
>> its content is not displayed.
>>
>> MS Office files - For some Office files, everything works perfectly and
>> the extracted content is visible in the query console. However, for
>> others, I see the error message below during the indexing process.
>>
>> *Exception in thread "main"
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>> org.apache.tika.parser.microsoft.OfficeParser*
>>
>>
>> I am using SolrJ to index the documents and below is the code snippet
>> related to indexing. Please let me know where the issue is occurring.
>>
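>> // Imports for SolrJ 4.x
>> import org.apache.solr.client.solrj.SolrServer;
>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>> import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
>> import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
>>
>> static String solrServerURL = "http://localhost:8983/solr";
>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>> static ContentStreamUpdateRequest indexingReq =
>>         new ContentStreamUpdateRequest("/update/extract");
>>
>> // Attach the file, then set literal values and field mappings.
>> indexingReq.addFile(file, fileType);
>> indexingReq.setParam("literal.id", literalId);
>> indexingReq.setParam("uprefix", "attr_");  // prefix for unknown fields
>> indexingReq.setParam("fmap.content", "content");
>> indexingReq.setParam("literal.fileurl", fileURL);
>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>> solrServer.request(indexingReq);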
>>
>> Thanks & Regards
>> Vijay
>>

