lucene-solr-user mailing list archives

From Jack Krupansky <>
Subject Re: Indexing PDF and MS Office files
Date Wed, 15 Apr 2015 03:18:35 GMT
Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.
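
As a rough sketch of such a manual check, the class below posts a file straight to the extracting handler over plain HTTP, without SolrJ. It assumes Java 11+ (for java.net.http), the same base URL as in the quoted message, and a file type of application/pdf; the class name ExtractOnlyCheck and the example path are hypothetical. The extractOnly=true parameter asks Solr to return what Tika extracted instead of indexing it, and extractFormat=text requests plain text rather than XHTML.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractOnlyCheck {

    // Builds the extract-only URL for the given Solr base URL.
    // Assumption: the handler is reachable at <base>/update/extract, as in the
    // quoted message; add the core name to the base URL if your setup needs it.
    static URI extractOnlyUri(String solrBaseUrl) {
        return URI.create(solrBaseUrl
                + "/update/extract?extractOnly=true&extractFormat=text&wt=json&indent=true");
    }

    // Posts the raw file bytes to Solr and returns the response body unchanged.
    static String extract(String solrBaseUrl, Path file, String contentType)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(extractOnlyUri(solrBaseUrl))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String baseUrl = "http://localhost:8983/solr";
        if (args.length < 1) {
            // No file given: only show the URL that would be requested.
            System.out.println("Usage: java ExtractOnlyCheck <file.pdf>");
            System.out.println("Would POST to: " + extractOnlyUri(baseUrl));
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(extract(baseUrl, file, "application/pdf"));
    }
}
```

If the response carries metadata but no body text for a given PDF, the problem is probably in extraction (for example an image-only PDF) rather than in the SolrJ code or the field mapping.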


Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.

-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <> wrote:

> Hi,
> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
> Please let me know what is going wrong with the indexing process.
> I am using Solr 4.10.2 with the default example server configuration
> that comes with the Solr distribution.
> PDF files - Indexing itself works fine, and when I query using *:* in the
> Solr Query console, the metadata is displayed properly. However,
> the PDF content field is empty. This happens for every PDF file I have
> tried, including some proprietary files and PDF eBooks.
> MS Office files - For some Office files, everything works perfectly and the
> extracted content is visible in the query console. For others, however, I
> see the error message below during the indexing process.
> *Exception in thread "main"
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> I am using SolrJ to index the documents and below is the code snippet
> related to indexing. Please let me know where the issue is occurring.
> static String solrServerURL = "http://localhost:8983/solr";
> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
> indexingReq.addFile(file, fileType);
> indexingReq.setParam("", literalId);
> indexingReq.setParam("uprefix", "attr_");
> indexingReq.setParam("fmap.content", "content");
> indexingReq.setParam("literal.fileurl", fileURL);
> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> solrServer.request(indexingReq);
> Thanks & Regards
> Vijay
