lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Unexpected RuntimeException when indexing with Solr 4.0 Beta
Date Wed, 29 Aug 2012 14:39:21 GMT
Understood. Well, you could always manually convert the old docs to a newer 
document format, or use a tool such as:
http://download.cnet.com/Docx-to-Doc-Converter/3000-2079_4-75206386.html

-- Jack Krupansky

-----Original Message----- 
From: Alexander Cougarman
Sent: Wednesday, August 29, 2012 9:59 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Unexpected RuntimeException when indexing with Solr 4.0 Beta

I believe these are the older Word 97 (*.doc) files. The problem was that 
Solr 3.6.1 blew up on *.MSG files when doing extractOnly=true, so we 
upgraded to Solr 4.0 and now run into this. If we go back to Tika 1.0, I'm 
afraid the DOC files will be fixed but the MSG files will break!
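
A minimal sketch of probing files one at a time with extractOnly=true, so a failing DOC or MSG shows up without anything being added to the index. The helper names (extract_url, probe) are hypothetical, and it assumes the extract handler accepts the file as the raw request body with a Content-Type header; if that does not hold for a given setup, the multipart form used by curl -F should be used instead.

```python
import urllib.error
import urllib.parse
import urllib.request


def extract_url(base, doc_id=None, extract_only=False, commit=False):
    """Build a URL for the /update/extract handler (hypothetical helper)."""
    params = {}
    if doc_id is not None:
        params["literal.id"] = doc_id
    if extract_only:
        params["extractOnly"] = "true"
    if commit:
        params["commit"] = "true"
    return base.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def probe(base, path, content_type="application/octet-stream"):
    """POST one file with extractOnly=true and return the HTTP status code.

    Assumption: the handler reads the document from the raw request body.
    """
    with open(path, "rb") as fh:
        body = fh.read()
    req = urllib.request.Request(
        extract_url(base, extract_only=True),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": content_type},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A parser failure inside Tika/POI surfaces as an HTTP error status.
        return err.code
```

Running probe("http://localhost:8983/solr", "15.doc") against each suspect file under both Solr versions would give a per-file list of which formats break where.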

Sincerely,
Alex Cougarman

Bahá'í World Centre
Haifa, Israel
Office: +972-4-835-8683
Cell: +972-54-241-4742
acougarm@bwc.org


-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: 29 August 2012 4:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Unexpected RuntimeException when indexing with Solr 4.0 Beta

Sounds like this POI bug (SolrCell invokes Tika which invokes POI):
https://issues.apache.org/bugzilla/show_bug.cgi?id=53380

Are these in fact Office 97 documents that are failing?
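
One rough way to check is to look at the first bytes of each failing file. The sketch below only distinguishes the binary OLE2 container (used by Word 97-2003 .doc files, and also by .msg files) from the ZIP container used by .docx; it cannot tell which Word version wrote a given .doc. The function names are hypothetical.

```python
# Signature of an OLE2 compound file (legacy .doc, .xls, .ppt, .msg).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# Signature of a ZIP local file header (.docx and other OOXML files).
ZIP_MAGIC = b"PK\x03\x04"


def container_type(header):
    """Classify a file by its leading bytes: 'ole2', 'zip' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff(path):
    """Read the first 8 bytes of a file and classify its container."""
    with open(path, "rb") as fh:
        return container_type(fh.read(8))
```

A file named *.doc that comes back as "unknown" may be RTF or plain text saved with the wrong extension, which would point away from the POI bug.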

Solr 4.0 includes Tika 1.1, while Solr 3.6.1 includes Tika 1.0.

It may be possible for you to drop the old Tika 1.0 into Solr 4.0, but I 
wouldn't try to guarantee that.

In any case, this should be filed in Jira as a bug in Solr 4.0-BETA 
(SolrCell/Extraction component).

-- Jack Krupansky

-----Original Message-----
From: Alexander Cougarman
Sent: Wednesday, August 29, 2012 9:05 AM
To: solr-user@lucene.apache.org
Subject: Unexpected RuntimeException when indexing with Solr 4.0 Beta

Hi. I'm using Solr 4.0 Beta (no modifications to default installation) to 
index, and it's blowing up on some Word docs:

  curl "http://localhost:8983/solr/update/extract?literal.id=doc15&commit=true" -F "myfile=@15.doc"

Here's the exception. And the same files go through Solr 3.6.1 just fine.

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">500</int><int name="QTime">18</int></lst>
    <lst name="error"><str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce</str>
    <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
            at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
            at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
            at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
            at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
            at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
            at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:454)
            at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:275)
            at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
            at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
            at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
            at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
            at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
            at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
            at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
            at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
            at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
            at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
            at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
            at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
            at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
            at org.eclipse.jetty.server.Server.handle(Server.java:351)
            at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
            at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
            at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
            at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
            at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
            at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
            at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
            at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
            at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
            at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
            at java.lang.Thread.run(Unknown Source)
    Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
            at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
            at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
            at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
            at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:224)
            ... 31 more
    Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
            at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
            at org.apache.poi.hwpf.model.Colorref.&lt;init&gt;(Colorref.java:81)
            at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
            at org.apache.poi.hwpf.usermodel.ShadingDescriptor.&lt;init&gt;(ShadingDescriptor.java:38)
            at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
            at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
            at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
            at org.apache.poi.hwpf.model.StyleSheet.&lt;init&gt;(StyleSheet.java:121)
            at org.apache.poi.hwpf.HWPFDocument.&lt;init&gt;(HWPFDocument.java:346)
            at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
            at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
            at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
            at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
            ... 34 more
    </str><int name="code">500</int></lst>
    </response>
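
In a trace like this the useful part is the last "Caused by:" entry (here the ArrayIndexOutOfBoundsException raised inside POI), not the outer SolrException. A minimal sketch for pulling that out of the trace text when many files fail; root_cause is a hypothetical helper, and it assumes the trace has been unwrapped so that each frame sits on its own line.

```python
import html


def root_cause(trace):
    """Return (exception, top_frame) for the innermost 'Caused by:' entry.

    Returns None when the trace contains no 'Caused by:' line.
    Assumes one stack frame per line (console wrapping already undone).
    """
    # Solr returns the trace inside XML, so <init> arrives as &lt;init&gt;.
    lines = [html.unescape(line.strip()) for line in trace.splitlines()]

    last = None
    for i, line in enumerate(lines):
        if line.startswith("Caused by:"):
            last = i  # keep overwriting: the final match is the root cause
    if last is None:
        return None

    exception = lines[last][len("Caused by:"):].strip()
    top_frame = None
    for line in lines[last + 1:]:
        if line.startswith("at "):
            top_frame = line[len("at "):]
            break
    return exception, top_frame
```

Grouping failing files by the returned pair makes it easier to tell whether they all hit the same POI code path or several different ones.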

Sincerely,
Alex 

