Subject: Re: TIKA Errors Importing MS Word Documents into SOLR Cloud
From: Erick Erickson
To: solr-user@lucene.apache.org
Date: Mon, 27 Feb 2012 21:53:35 -0500

You *probably* can update the Tika libraries in Solr, but it'll be
"interesting" to get all the right ones updated; there are a bunch of
them in Tika. And I make no guarantees.

If it proves difficult, it's not too hard to write a SolrJ program that
does the Tika extraction and run it on a client totally separated from
the Solr server.

Best
Erick

On Sun, Feb 26, 2012 at 7:33 PM, Matthew Parker wrote:
> I tried to import some documents into SOLR Cloud using Apache Manifold.
>
> TIKA started throwing exceptions for various documents
>
> The exception reads like the following:
>
> org.apache.solr.common.SolrException
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:213)
> ..........
>
> Caused by: org.apache.tika.exception.TikaException:
> UnexpectedRuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@d394424
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> ...........
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at java.lang.System.arraycopy(Native Method)
> at org.apache.poi.hwpf.usermodel.Picture.fillRawImageContent(Picture.java:363)
>
> It seems to be related to the following fix now in Tika 1.1
>
> https://issues.apache.org/bugzilla/show_bug.cgi?id=51902
>
> Can the Tika libraries in the SOLR trunk be updated?
>
> ------------------------------
> This e-mail and any files transmitted with it may be proprietary. Please
> note that any views or opinions presented in this e-mail are solely those
> of the author and do not necessarily represent those of Apogee Integration.
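
For illustration, the client-side approach suggested in the reply (extract on a separate client, then send plain text to Solr) might look roughly like the sketch below. It is written in Python with only the standard library rather than SolrJ, so it is a stand-in for the idea and not the program the reply describes. Everything named here is an assumption: `extract_text` is a hypothetical callable standing in for whatever extraction the client performs (for example, a newer Tika), the `id` and `text` field names are placeholders for your schema, and `update_url` is assumed to be a JSON update endpoint whose exact path and accepted JSON shape depend on the Solr version.

```python
import json
import urllib.request


def build_solr_docs(paths, extract_text):
    """Extract text on the client, one file at a time.

    A parser failure on one file (such as the ArrayIndexOutOfBoundsException
    in the trace above) is recorded and skipped instead of aborting the batch.
    `extract_text` is a hypothetical callable: path -> extracted text.
    """
    docs, failures = [], []
    for path in paths:
        try:
            text = extract_text(path)
        except Exception as exc:  # any extraction error is isolated per file
            failures.append((str(path), repr(exc)))
            continue
        # "id" and "text" are placeholder field names, not from the thread.
        docs.append({"id": str(path), "text": text})
    return docs, failures


def to_update_body(docs):
    """Serialize the documents as a UTF-8 JSON array."""
    return json.dumps(docs).encode("utf-8")


def post_to_solr(update_url, docs, timeout=30):
    """POST the documents to an assumed JSON update endpoint.

    Returns the HTTP status code. The endpoint path and the JSON shape it
    accepts vary between Solr versions, so check your server's documentation.
    """
    request = urllib.request.Request(
        update_url,
        data=to_update_body(docs),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With this split, the Solr server never runs the parser itself, so the client can use whichever extraction library version it needs, and the `failures` list shows which documents still need attention.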