Date: Tue, 14 Apr 2015 23:18:35 -0400
Subject: Re: Indexing PDF and MS Office files
From: Jack Krupansky
To: solr-user@lucene.apache.org

Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.

-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
<vijaya.bhoomireddy@whishworks.com> wrote:

> Hi,
>
> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
> Please let me know what is going wrong with the indexing process.
>
> I am using Solr 4.10.2 with the default example server configuration
> that comes with the Solr distribution.
>
> PDF files - Indexing as such works fine, and when I query using *.* in the
> Solr Query console, the metadata is displayed properly. However, the PDF
> content field is empty. This happens for all the PDF files I have tried,
> including some proprietary files, PDF eBooks, etc. Whatever the PDF file,
> the content is not displayed.
>
> MS Office files - For some Office files, everything works perfectly and the
> extracted content is visible in the query console. However, for others, I
> see the error message below during the indexing process.
>
> *Exception in thread "main"
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser*
>
> I am using SolrJ to index the documents, and below is the code snippet
> related to indexing. Please let me know where the issue is occurring.
>
> static String solrServerURL = "http://localhost:8983/solr";
> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> static ContentStreamUpdateRequest indexingReq =
>     new ContentStreamUpdateRequest("/update/extract");
>
> indexingReq.addFile(file, fileType);
> indexingReq.setParam("literal.id", literalId);
> indexingReq.setParam("uprefix", "attr_");
> indexingReq.setParam("fmap.content", "content");
> indexingReq.setParam("literal.fileurl", fileURL);
> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> solrServer.request(indexingReq);
>
> Thanks & Regards
> Vijay
>
> --
> The contents of this e-mail are confidential and for the exclusive use of
> the intended recipient. If you receive this e-mail in error please delete
> it from your system immediately and notify us either by e-mail or
> telephone. You should not copy, forward or otherwise disclose the content
> of the e-mail. The views expressed in this communication may not
> necessarily be the view held by WHISHWORKS.
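
The manual check suggested in the reply (an extraction request sent straight to Solr with extractOnly, bypassing SolrJ) might look roughly like the sketch below. It is not taken from the thread. The class name ExtractOnlyCheck, the helper names, and the command-line arguments are hypothetical; the base URL is the one from the quoted snippet and assumes the default example core. extractOnly and extractFormat are request parameters of Solr's extracting request handler, and with extractOnly=true Solr returns the Tika output in the response instead of indexing the file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractOnlyCheck {

    // Builds the extract-only URL for the given Solr base URL.
    // extractFormat=text asks for plain text rather than XHTML.
    static String extractOnlyUrl(String solrBaseUrl) {
        return solrBaseUrl + "/update/extract?extractOnly=true&extractFormat=text&wt=json";
    }

    // POSTs the raw file bytes to the URL and returns the response body.
    static String postFile(String url, Path file, String contentType) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        // On an HTTP error, read the error stream so a Tika exception is visible.
        InputStream in = conn.getResponseCode() < 400 ? conn.getInputStream() : conn.getErrorStream();
        if (in == null) {
            return "";
        }
        try (InputStream body = in) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = body.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java ExtractOnlyCheck <file> <content-type>");
            return;
        }
        // Base URL assumed from the quoted snippet; adjust for a named core.
        String url = extractOnlyUrl("http://localhost:8983/solr");
        System.out.println(postFile(url, Paths.get(args[0]), args[1]));
    }
}
```

Running it as, say, `java ExtractOnlyCheck sample.pdf application/pdf` (sample.pdf being a placeholder) prints whatever Solr returns. If the extracted text in that response is empty, the problem probably lies in extraction itself (for example an image-only PDF) rather than in the SolrJ field mapping.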