lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gytis Mikuciunas <gyt...@gmail.com>
Subject Re: Solr 6.4. Can't index MS Visio vsdx files
Date Wed, 12 Apr 2017 04:59:45 GMT
when 1.15 will be released? maybe you have some beta version and I could
test it :)

SAX sounds interesting, and from info that I found in google it could solve
my issues.

On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. <tallison@mitre.org>
wrote:

> It depends.  We've been trying to make parsers more, erm, flexible, but
> there are some problems from which we cannot recover.
>
> Tl;dr there isn't a short answer.  :(
>
> My sense is that DIH/ExtractingDocumentHandler is intended to get people
> up and running with Solr easily but it is not really a great idea for
> production.  See Erick's gem: https://lucidworks.com/2012/
> 02/14/indexing-with-solrj/
>
> As for the Tika portion... at the very least, Tika _shouldn't_ cause the
> ingesting process to crash.  At most, it should fail at the file level and
> not cause greater havoc.  In practice, if you're processing millions of
> files from the wild, you'll run into bad behavior and need to defend
> against permanent hangs, oom, memory leaks.
>
> Also, at the least, if there's an exception with an embedded file, Tika
> should catch it and keep going with the rest of the file.  If this doesn't
> happen let us know!  We are aware that some types of embedded file stream
> problems were causing parse failures on the entire file, and we now catch
> those in Tika 1.15-SNAPSHOT and don't let them percolate up through the
> parent file (they're reported in the metadata though).
>
> Specifically for your stack traces:
>
> For your initial problem with the missing class exceptions -- I thought we
> used to catch those in docx and log them.  I haven't been able to track
> this down, though.  I can look more if you have a need.
>
> For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name
> 'PolylineTo' ", this problem might go away if we implemented a pure SAX
> parser for vsdx.  We just did this for docx and pptx (coming in 1.15) and
> these are more robust to variation because they aren't requiring a match
> with the ooxml schema.  I haven't looked much at vsdx, but that _might_
> help.
>
> For "TODO Support v5 Pointers", this isn't supported and would require
> contributions.  However, I agree that POI shouldn't throw a Runtime
> exception.  Perhaps open an issue in POI, or maybe we should catch this
> special example at the Tika level?
>
> For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team
> _might_ be able to modify the parser to ignore a stream if there's an
> exception, but that's often a sign that something needs to be fixed with
> the parser.  In short, the solution will come from POI.
>
> Best,
>
>              Tim
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gytmkc@gmail.com]
> Sent: Tuesday, April 11, 2017 1:56 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Thanks for your responses.
> Are there any posibilities to ignore parsing errors and continue indexing?
> because now solr/tika stops parsing whole document if it finds any
> exception
>
> On Apr 11, 2017 19:51, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
> > You might want to drop a note to the dev or user's list on Apache POI.
> >
> > I'm not extremely familiar with the vsd(x) portion of our code base.
> >
> > The first item ("PolylineTo") may be caused by a mismatch btwn your
> > doc and the ooxml spec.
> >
> > The second item appears to be an unsupported feature.
> >
> > The third item may be an area for improvement within our codebase...I
> > can't tell just from the stacktrace.
> >
> > You'll probably get more helpful answers over on POI.  Sorry, I can't
> > help with this...
> >
> > Best,
> >
> >            Tim
> >
> > P.S.
> > >  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
> >
> > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set
> > of poi-ooxml-schemas-3.15.jar
> >
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message