Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of jack.krupansky@gmail.com
 designates 74.125.82.46 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAAfTDLX8O6NEFXiqEMLUhtw3kjW_KpMkhy6W+3Rs5182U2ttbg@mail.gmail.com>
References: 
 <CAAfTDLX8O6NEFXiqEMLUhtw3kjW_KpMkhy6W+3Rs5182U2ttbg@mail.gmail.com>
Date: Tue, 14 Apr 2015 23:18:35 -0400
Message-ID: 
 <CAOxAL63HRGCo8kaoO7em6Q8Rk3GwNyq7s7O8SCVZnUwzVZfdFA@mail.gmail.com>
Subject: Re: Indexing PDF and MS Office files
From: Jack Krupansky <jack.krupansky@gmail.com>
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001a113612bc729a490513bacdb1

--001a113612bc729a490513bacdb1
Content-Type: text/plain; charset=UTF-8

Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
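
A minimal sketch of such a manual request, using only the JDK rather than
SolrJ, might look like the following. It assumes Solr is reachable at the
base URL from your snippet and that the raw file can be posted as the request
body with a matching Content-Type. The class and method names
(ExtractOnlyCheck, extractOnlyUrl, extract) are made up for illustration;
extractOnly, extractFormat and wt are Solr Cell request parameters.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractOnlyCheck {

    // Builds the extract-only URL; nothing is indexed when extractOnly=true.
    static String extractOnlyUrl(String solrBaseUrl) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + "/update/extract?extractOnly=true&extractFormat=text&wt=json";
    }

    // Posts the raw file to Solr and returns the response body as a string.
    static String extract(String solrBaseUrl, Path file, String contentType)
            throws IOException {
        URL url = new URL(extractOnlyUrl(solrBaseUrl));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        int status = conn.getResponseCode();
        InputStream body = status < 400 ? conn.getInputStream() : conn.getErrorStream();
        if (body == null) {
            return "HTTP " + status + " (no response body)";
        }
        try (InputStream in = body) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            // No network access without arguments; just show what would be called.
            System.out.println("usage: ExtractOnlyCheck <file> <content-type> [solr-base-url]");
            System.out.println("target: " + extractOnlyUrl("http://localhost:8983/solr"));
            return;
        }
        String base = args.length > 2 ? args[2] : "http://localhost:8983/solr";
        System.out.println(extract(base, Paths.get(args[0]), args[1]));
    }
}
```

Running it as, say, "java ExtractOnlyCheck sample.pdf application/pdf" prints
the extraction response. If that response contains only metadata and no body
text, the problem is on the extraction side rather than in the SolrJ code or
the field mapping.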

Also, some PDF files store their content as bitmap images (scanned pages,
for example), in which case there is no text to extract.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomireddy@whishworks.com> wrote:

> Hi,
>
> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> .pptx, .xls, and .xlsx) into Solr and am running into the issues described
> below. Please let me know what is going wrong with the indexing process.
>
> I am using Solr 4.10.2 with the default example server configuration that
> ships with the Solr distribution.
>
> PDF files - Indexing itself completes without errors, and when I query
> with *:* in the Solr Query console, the metadata is displayed properly.
> However, the PDF content field is empty. This happens for every PDF file I
> have tried, including some proprietary files and PDF eBooks. Whatever the
> PDF file, the content is not displayed.
>
> MS Office files - For some Office files, everything works perfectly and the
> extracted content is visible in the query console. For others, however, I
> see the following error message during indexing.
>
> *Exception in thread "main"
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser*
>
>
> I am using SolrJ to index the documents and below is the code snippet
> related to indexing. Please let me know where the issue is occurring.
>
> static String solrServerURL = "http://localhost:8983/solr";
> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> static ContentStreamUpdateRequest indexingReq =
>     new ContentStreamUpdateRequest("/update/extract");
>
> indexingReq.addFile(file, fileType);
> indexingReq.setParam("literal.id", literalId);
> indexingReq.setParam("uprefix", "attr_");
> indexingReq.setParam("fmap.content", "content");
> indexingReq.setParam("literal.fileurl", fileURL);
> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> solrServer.request(indexingReq);
>
> Thanks & Regards
> Vijay

--001a113612bc729a490513bacdb1--