lucene-dev mailing list archives

From Brian Goetz <>
Subject Re: Format Stripping [ was: XLS parser ]
Date Sat, 19 Jan 2002 22:20:06 GMT
> > Unfortunately, the Lucene Document class is defined in such a way that 
> > would make it hard to collect both the body and the other fields in one 
> > pass on the input file.  While that irks the performance-weenie in me, in 
> > reality keeping one document in memory when scanning it probably isn't so 
> > horrible.  (I think Doug convinced me of that.)
> > 
> POI::HSSF (Excel) and POIFS (OLE2CDF) may have a solution to that.  I'll
> have to look this over a bit harder.  We've implemented an event based
> system for reading documents (so you register for what you care about
> and then kick it off and it throws events to listeners as it runs into
> them).  Not sure if there is a clean way to graft those ideas onto
> Lucene for a single pass read.  I'll take a more comprehensive look at
> it.

Lucene doesn't make this easy, except in the degenerate case where you
can extract all the short fields at once relatively easily (for
instance, when all the metadata is at the top of the file).  Otherwise,
Lucene wants you to load up a Document with Strings (for short fields)
and InputStreams (for long fields), and then hand the whole thing over
to Lucene for indexing.  But since you don't know what order it will
read the InputStreams in, you're basically going to have to make at
least one pass over the file and buffer everything besides the data for
the field being read.  Except for really simple file formats, like
those with only a body, it doesn't seem worth doing a lot of work to
avoid buffering the whole thing.  On the other hand, I think the
interface I proposed would let a smart filter for an easy file format
do it without buffering.
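
To make the buffering problem concrete, here's a rough sketch (the
class and field names are invented for illustration, not the real
Lucene API): one pass over the input buffers the short fields into
Strings, but the "body" still ends up buffered too, because we can't
know when Lucene will ask for it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one pass over the input collects the short
// metadata fields as Strings; only the long "body" stays a stream.
// Note the body still has to be buffered here -- exactly the problem.
public class OnePassDigest {
    public static Map<String, String> shortFields = new HashMap<String, String>();

    // Everything before the blank line is "Name: value" metadata;
    // everything after is the body, returned as a Reader.
    public static Reader digest(Reader in) throws IOException {
        BufferedReader r = new BufferedReader(in);
        String line;
        while ((line = r.readLine()) != null && line.length() > 0) {
            int colon = line.indexOf(':');
            shortFields.put(line.substring(0, colon).trim(),
                            line.substring(colon + 1).trim());
        }
        StringBuffer body = new StringBuffer();
        while ((line = r.readLine()) != null) {
            body.append(line).append('\n');
        }
        return new StringReader(body.toString());
    }

    public static void main(String[] args) throws IOException {
        Reader body = digest(new StringReader(
            "Title: Q3 Report\nAuthor: someone\n\nthe long body text\n"));
        System.out.println(shortFields.get("Title")); // Q3 Report
        int c;
        StringBuffer sb = new StringBuffer();
        while ((c = body.read()) != -1) sb.append((char) c);
        System.out.print(sb); // the long body text
    }
}
```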

> Great.  I'll start work on this for Excel.  I'll include the interfaces
> above as you suggest (unless someone else would rather), I'll leave the
> HTMLDigester to someone who understands javacc a bit better. I'll ask
> the poi-devel if anyone would kindly write an interpreter for Document
> Summary information (not to bore you, but it's a "separate" document
> within an OLE 2 Compound Document).

Bear in mind this is just a proposal, and I'd like to hear comments
from the rest of the list, but I think the basic concept is clear enough
that you can start on an Excel implementation.  
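
For the archives, the event-based read described above is roughly this
shape -- all names here are invented for the sketch, not the actual
POI API: you register for the record types you care about, kick it
off, and events are thrown at the listeners as records turn up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Invented sketch of an event-based document read (not the POI API):
// listeners register for record types; process() fires events as the
// "file" (faked here as an array of records) is scanned.
public class EventRead {
    interface RecordListener { void onRecord(String type, String value); }

    private final Map<String, List<RecordListener>> listeners =
        new HashMap<String, List<RecordListener>>();

    void register(String type, RecordListener l) {
        listeners.computeIfAbsent(type, k -> new ArrayList<>()).add(l);
    }

    // Stand-in for scanning the file: fire each record at its listeners.
    void process(String[][] records) {
        for (String[] rec : records) {
            List<RecordListener> ls = listeners.get(rec[0]);
            if (ls != null)
                for (RecordListener l : ls) l.onRecord(rec[0], rec[1]);
        }
    }

    public static void main(String[] args) {
        EventRead reader = new EventRead();
        StringBuffer cells = new StringBuffer();
        reader.register("CELL", (type, value) -> cells.append(value).append(' '));
        // Only CELL records have a listener; HEADER is ignored.
        reader.process(new String[][] {
            {"HEADER", "Q3"}, {"CELL", "42"}, {"CELL", "7"} });
        System.out.println(cells.toString().trim()); // 42 7
    }
}
```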

> Just a suggestion but you might at some point in the future consider
> inverting this a tad for a future release of Lucene.  

I like the idea of being able to add fields to a Document after the
Document is indexed.  Then, for documents with a long 'body' and short
metadata fields, you could process the body through an InputStream
adapter which would, as a side effect, store the other fields
somewhere, and then add them afterward.  Doug, how hard would it be to
support adding new fields to an already-indexed document?
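
The adapter idea, roughly (again, everything here is invented for
illustration): a FilterReader that the indexer consumes normally,
recording what passes through, so once the stream is drained the
metadata can be parsed out and added as extra fields.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Reader adapter that passes the body through
// to whoever is indexing it, while squirreling away everything seen.
// After the indexer drains the stream, metadata fields can be pulled
// out and added to the document.
public class CapturingReader extends FilterReader {
    private final StringBuffer seen = new StringBuffer();

    public CapturingReader(Reader in) { super(in); }

    public int read() throws IOException {
        int c = super.read();
        if (c != -1) seen.append((char) c);
        return c;
    }

    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) seen.append(buf, off, n);
        return n;
    }

    // After the stream is exhausted, pull "Name: value" lines as fields.
    public Map<String, String> capturedFields() {
        Map<String, String> fields = new HashMap<String, String>();
        for (String line : seen.toString().split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        CapturingReader r = new CapturingReader(
            new StringReader("Title: Memo\nbody text here"));
        while (r.read() != -1) { /* the indexer would consume this */ }
        System.out.println(r.capturedFields().get("Title")); // Memo
    }
}
```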

But even in the absence of such a feature, you could avoid the
memory-hogging problem like this: make two passes over the file, first
extracting the short fields as Strings and putting them into the
Document, and then making another pass with a
SuckTheTextOutOfAnExcelDocumentInputStream filter.  Again, this could
be hidden behind the DocumentDigester interface.  It requires two
passes, but for structured documents like Office documents, that's not
bad, since you can go right to where the header fields are stored and
then handle the text through an input stream without buffering.

So the trade-off is -- full buffering, or two passes.  For most file
formats, the two passes are probably not too bad.  
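
Something like this shape, hidden behind the digester interface --
names are hypothetical, and the fixed-size header stands in for the
real structure of an OLE2 file, where you'd seek to the right stream
instead:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical two-pass sketch: the format puts a fixed-size header
// first, so pass one decodes the short fields directly, and pass two
// just hands the rest to the indexer as a stream -- no buffering.
public class TwoPassDigester {
    static final int HEADER_SIZE = 32; // assumed fixed-size header region

    // Pass one: read only the header region for the short fields.
    static String readTitle(byte[] file) {
        return new String(file, 0, HEADER_SIZE).trim();
    }

    // Pass two: a stream over the text, skipping the header.
    static InputStream bodyStream(byte[] file) {
        return new ByteArrayInputStream(file, HEADER_SIZE,
                                        file.length - HEADER_SIZE);
    }

    public static void main(String[] args) throws IOException {
        byte[] file = new byte[HEADER_SIZE + 4];
        byte[] title = "Q3 Report".getBytes();
        System.arraycopy(title, 0, file, 0, title.length);
        System.arraycopy("body".getBytes(), 0, file, HEADER_SIZE, 4);

        System.out.println(readTitle(file)); // pass 1: Q3 Report
        InputStream body = bodyStream(file);
        byte[] buf = new byte[4];
        body.read(buf);
        System.out.println(new String(buf)); // pass 2: body
    }
}
```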
