lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: tikaparser docx file fails with exception
Date Fri, 06 Nov 2015 15:42:19 GMT
Agree with all below, and don't hesitate to open a ticket on Tika's Jira and/or POI's bugzilla...especially
if you can share the triggering document.

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafalov@gmail.com] 
Sent: Thursday, November 05, 2015 6:05 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: tikaparser docx file fails with exception

It is quite clear actually that the problem is this:
Caused by: java.io.CharConversionException: Characters larger than 4 bytes are not supported: byte 0xb7 implies a length of more than 4 bytes
      at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
      at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
      at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)

If you search for something like:
PiccoloLexer.yy_refill Characters larger than 4 bytes are not supported
you get lots of matches in different forums for different (Java-based? Tika-based?)
software. Most likely Tika found something obscure in the document that there is no
implementation for yet, e.g. an image inside a text field inside a footer section.
Just as an example....

I would basically try standalone Tika and look for the most expressive debug flag. It should
tell you which file inside the zip (which is what a docx actually is) caused the problem. That
should give you some hint.
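
One way to narrow this down without any Tika tooling, sketched under assumptions: since a
docx is a zip archive, each XML part can be checked for strict UTF-8 validity with the JDK
alone. The class name DocxUtf8Check and the method findMalformedParts below are made up for
illustration and are not part of Tika or POI. The check assumes the parts are UTF-8 encoded
(a UTF-16 part would show up as a false positive), and a clean result would not rule out a
parser-side problem.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists the XML parts of a docx (zip) that are not valid UTF-8.
public class DocxUtf8Check {

    public static List<String> findMalformedParts(InputStream docx) throws IOException {
        List<String> bad = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Skip images and other binary parts; only XML goes through the XML parser.
                if (entry.isDirectory() || !(name.endsWith(".xml") || name.endsWith(".rels"))) {
                    continue;
                }
                byte[] bytes = readAll(zip);
                try {
                    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
                    StandardCharsets.UTF_8.newDecoder()
                            .onMalformedInput(CodingErrorAction.REPORT)
                            .onUnmappableCharacter(CodingErrorAction.REPORT)
                            .decode(ByteBuffer.wrap(bytes));
                } catch (CharacterCodingException e) {
                    bad.add(name);
                }
            }
        }
        return bad;
    }

    // Reads the current zip entry to its end (read() returns -1 at the entry boundary).
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxUtf8Check <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (String name : findMalformedParts(in)) {
                System.out.println("Not valid UTF-8: " + name);
            }
        }
    }
}
```

If this reports, say, a footer or header part, that is the place to look in the document. If
it reports nothing, the bytes themselves are fine and the ticket for Tika/POI is the better
route.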

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 5 November 2015 at 17:36, Aswath Srinivasan (TMS) <aswath.srinivasan@toyota.com>
wrote:
> Thank you for attempting to answer. I will try it out with SolrJ and standalone Java with the
> Tika parser. I completely understand that a bad document could cause this; however, when I
> opened up the document I couldn't find anything suspicious except for some binary images/pictures
> embedded in the document.
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Wednesday, November 04, 2015 4:33 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: tikaparser docx file fails with exception
>
> Possibly a corrupt file? Tika does its best, but bad data is...bad data.
>
> You can experiment a bit with using Tika in Java; that might give you a better idea of
> what's really going on. Here's a SolrJ example:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Wed, Nov 4, 2015 at 3:49 PM, Aswath Srinivasan (TMS) <aswath.srinivasan@toyota.com> wrote:
>>
>> Trying to index a document, a docx file, and ending up with the exception below. Not
>> sure why it is erroring out. When I opened the docx I was able to see lots of binary data,
>> such as embedded pictures. Is there a possible solution to this, or is it a bug? Only one
>> such file fails; the rest of the files are indexed smoothly.
>>
>> 2015-11-04 23:16:11.549 INFO  (coreLoadExecutor-6-thread-1) [   x:tika] o.a.s.c.CoreContainer registering core: tika
>> 2015-11-04 23:16:11.549 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore QuerySenderListener sending requests to Searcher@1eb69b2[tika] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
>> 2015-11-04 23:16:11.585 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.S.Request [tika] webapp=null path=null params={q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&event=firstSearcher} hits=0 status=0 QTime=34
>> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore QuerySenderListener done.
>> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: default
>> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: wordbreak
>> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.h.c.SuggestComponent buildOnStartup: mySuggester
>> 2015-11-04 23:16:11.586 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.s.s.SolrSuggester SolrSuggester.build(mySuggester)
>> 2015-11-04 23:16:11.605 INFO  (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore [tika] Registered new searcher Searcher@1eb69b2[tika] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
>> 2015-11-04 23:16:25.923 INFO  (qtp7980742-16) [   x:tika] o.a.s.h.d.DataImporter Loading DIH Configuration: tika-data-config.xml
>> 2015-11-04 23:16:25.937 INFO  (qtp7980742-16) [   x:tika] o.a.s.h.d.DataImporter Data Configuration loaded successfully
>> 2015-11-04 23:16:25.947 INFO  (qtp7980742-16) [   x:tika] o.a.s.c.S.Request [tika] webapp=/solr path=/dataimport params={debug=false&optimize=false&indent=true&commit=true&clean=true&wt=json&command=full-import&verbose=false} status=0 QTime=28
>> 2015-11-04 23:16:25.948 INFO  (Thread-17) [   x:tika] o.a.s.h.d.DataImporter Starting Full Import
>> 2015-11-04 23:16:25.961 INFO  (Thread-17) [   x:tika] o.a.s.h.d.SimplePropertiesWriter Read dataimport.properties
>> 2015-11-04 23:16:25.966 INFO  (qtp7980742-14) [   x:tika] o.a.s.c.S.Request [tika] webapp=/solr path=/dataimport params={indent=true&wt=json&command=status&_=1446678985952} status=0 QTime=1
>> 2015-11-04 23:16:25.998 INFO  (Thread-17) [   x:tika] o.a.s.c.SolrCore [tika] REMOVING ALL DOCUMENTS FROM INDEX
>> 2015-11-04 23:16:26.728 ERROR (Thread-17) [   x:tika] o.a.s.h.d.EntityProcessorWrapper Exception in entity : documentImport:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
>>
>>       at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:70)
>>       at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:168)
>>       at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
>>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
>>       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
>>       at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
>>       at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
>>       at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
>>       at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
>>       at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
>> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1b3e0a6
>>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:262)
>>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>       at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:162)
>>       ... 9 more
>> Caused by: java.io.CharConversionException: Characters larger than 4 bytes are not supported: byte 0xb7 implies a length of more than 4 bytes
>>       at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
>>       at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
>>       at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
>>       at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)
>>       at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:3962)
>>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
>>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
>>       at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
>>       at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
>>       at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
>>       at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
>>       at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
>>       at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
>>       at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:136)
>>       at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:166)
>>       at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:118)
>>       at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
>>       at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:181)
>>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>>       ... 12 more
>>
>>
>> 2015-11-04 23:16:26.729 INFO  (Thread-17) [   x:tika] o.a.s.h.d.DocBuilder Import completed successfully
>>