lucene-dev mailing list archives

From "Andrew C. Oliver" <>
Subject Re: Format Stripping [ was: XLS parser ]
Date Fri, 18 Jan 2002 13:15:43 GMT
On Thu, 2002-01-17 at 23:50, Brian Goetz wrote:
> >Welcome!  And POI looks great.
> Hear, hear!


> >I would like to see folks come to some agreement on an API to use for such
> >extensions, perhaps via an interface or base class - although I'm partial to
> >the interface so as to not interfere with other folks object hierarchy and
> >such.
> The ability to easily handle various document file formats is an area where 
> a little up-front work will make life a _lot_ easier for future (and 
> current) users.
> At first glance, the missing abstraction appears to be something like 
> FormatStripper, which is basically like a FilterInputStream; it would take 
> in a document of a specific format (Word, XLS, PDF, XML, etc) and return a 
> stream of text corresponding to the "body" of the document.  This is an 
> 80-90% solution, and is probably worth doing because it is so easy.  Then 
> we could hook up the HTML parser to a format stripper and include an 
> HtmlFormatStripper in the core, which would be nice.
> The remaining 10-20% come from the ability to extract metadata from the 
> document as well.  Word documents have Author, Title, and CreationDate 
> metadata, which are pretty useful in a multi-field text indexing system 
> like Lucene; HTML documents have META tags which contain keywords, which 
> are also very useful.
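
A minimal sketch of that FormatStripper idea, in the spirit of FilterInputStream (the interface name and the toy tag-stripping implementation here are hypothetical, just to make the shape concrete):

```java
import java.io.*;

// Hypothetical FormatStripper: wraps a formatted document stream and
// yields only the "body" text, much like a FilterInputStream.
interface FormatStripper {
    /** Wrap a document stream; the returned stream yields body text only. */
    InputStream strip(InputStream in) throws IOException;
}

// Toy illustration only: "strips" an HTML-ish document by dropping
// everything between '<' and '>'.  A real HtmlFormatStripper would sit
// on top of the actual HTML parser.
class TagStripper implements FormatStripper {
    public InputStream strip(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) out.write(c);
        }
        return new ByteArrayInputStream(out.toByteArray());
    }
}
```
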

I'll go ahead and write, or ask someone on the poi list to write, a
summary info interpreter.  It looks trivial; we've just not needed it so
far.

> Unfortunately, the Lucene Document class is defined in such a way that 
> would make it hard to collect both the body and the other fields in one 
> pass on the input file.  While that irks the performance-weenie in me, in 
> reality keeping one document in memory when scanning it probably isn't so 
> horrible.  (I think Doug convinced me of that.)

POI::HSSF (Excel) and POIFS (OLE2CDF) may have a solution to that.  I'll
have to look this over a bit harder.  We've implemented an event-based
system for reading documents: you register for what you care about,
kick it off, and it throws events to listeners as it runs into them.
I'm not sure there is a clean way to graft those ideas onto Lucene for a
single-pass read; I'll take a more comprehensive look at this.

> I would say that there are two stages to defining how an arbitrary document 
> will map into Lucene.  Lets call them the "Document Digester" and the 
> "Document Mapper".  The document digester processes a document (say, a Word 
> document) and breaks it down into multiple fields, like Author, Title, 
> CreationDate, and Body.  The document mapper maps the fields created by the 
> DD into fields that will actually get put into a Lucene index.
> public interface DocumentDigester {
>    /** Retrieve the fields supported by this Digester */
>    public Set getFields();
>    /** Digest a document into memory */
>    public void digestDocument(InputStream in);
>    /** Digest a document into a Lucene Document.  Could optimize the
>      * loading process by not storing fields that are not desired by
>      * the mapper */
>    public Document digestDocument(InputStream in,
>                                   DocumentMapper mapper);
>    /** Retrieve a digested field */
>    public String getFieldAsString(String field);
>    public InputStream getFieldAsStream(String field);
> }
> public interface DocumentMapper {
>    /** Indicate that we want to map a given field from the document into
>     *  the given Lucene field.  These routines together basically define
>     *  Map-like functionality.  There could be a convenience constructor
>     *  that sets up the field mappings using a Properties */
>    public void mapField(String documentField, String luceneField);
>    public String getFieldMapping(String documentField);
> }
> Now, someone can wrap any given document format processor with a 
> DocumentDigester, and then in order to import documents into Lucene, the 
> user only need create the Mapper showing which fields he cares 
> about.  Again, we can create an HtmlDigester from the existing parser and 
> include it in the kit, along with adapters for various document parsing 
> packages.

Great.  I'll start work on this for Excel.  I'll include the interfaces
above as you suggest (unless someone else would rather); I'll leave the
HTMLDigester to someone who understands JavaCC a bit better.  I'll ask
the poi-devel list if anyone would kindly write an interpreter for
Document Summary information (not to bore you, but it's a "separate"
document within an OLE 2 Compound Document).
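
For illustration, a trivial mapper along those lines might look like this (HashMap-backed; the class name and the Properties convenience constructor are just one way to realize the interface sketched above):

```java
import java.util.*;

// A minimal DocumentMapper in the spirit of the interface above:
// it just records which document field feeds which Lucene field.
class SimpleDocumentMapper {
    private final Map fields = new HashMap();  // documentField -> luceneField

    public SimpleDocumentMapper() {}

    /** Convenience constructor idea from the post: seed mappings from Properties. */
    public SimpleDocumentMapper(Properties props) {
        for (Enumeration e = props.propertyNames(); e.hasMoreElements();) {
            String k = (String) e.nextElement();
            mapField(k, props.getProperty(k));
        }
    }

    /** Map a document field (e.g. "Author") to a Lucene field (e.g. "author"). */
    public void mapField(String documentField, String luceneField) {
        fields.put(documentField, luceneField);
    }

    /** Returns the Lucene field for a document field, or null if unmapped. */
    public String getFieldMapping(String documentField) {
        return (String) fields.get(documentField);
    }
}
```

A digester would then consult getFieldMapping() while reading, and simply skip any field that maps to null.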

Just a suggestion, but you might at some point consider inverting this
for a future release of Lucene.  Meaning: if your streams followed a
similar pattern and threw "events" instead of building fields, the
index writer could listen for those fields based on passed-in
parameters, and the (file-type filter) reader or stream could throw
them as it came across them.  When we did this for POI::HSSF, memory
usage for reading a really large spreadsheet went down tenfold, and
performance was roughly equal or perhaps a bit better (let's put it
this way: I could probably run POI::HSSF on my Palm Pilot :-D, but the
sample file wouldn't fit).  Anyhow, this is just a thought we
shamelessly stole from SAX. :-)
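
To make the inversion concrete, here is a stripped-down sketch of that SAX-style pattern (the listener and reader names are hypothetical, and the "document" is a toy array of type=value records rather than a real file):

```java
import java.util.*;

// Listeners register for the record types they care about; the reader
// fires events as it encounters records.  Unwanted records are skipped
// immediately, so nothing but the current record is ever in memory --
// that's where the tenfold memory win comes from.
interface RecordListener {
    void recordFound(String type, String value);
}

class EventDocumentReader {
    private final Map listeners = new HashMap();  // type -> List of listeners

    /** Register interest in one record type (e.g. "Author", "Body"). */
    public void register(String type, RecordListener l) {
        List ls = (List) listeners.get(type);
        if (ls == null) { ls = new ArrayList(); listeners.put(type, ls); }
        ls.add(l);
    }

    /** Kick off the read; here each record is a "type=value" string. */
    public void process(String[] records) {
        for (int i = 0; i < records.length; i++) {
            int eq = records[i].indexOf('=');
            String type = records[i].substring(0, eq);
            List ls = (List) listeners.get(type);
            if (ls == null) continue;  // nobody registered: skip, hold nothing
            for (int j = 0; j < ls.size(); j++)
                ((RecordListener) ls.get(j))
                    .recordFound(type, records[i].substring(eq + 1));
        }
    }
}
```

An index writer acting as the listener could add each field to the Lucene Document as the event arrives, giving the single-pass read discussed above.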

For an example of how we do this, check out:



> --
> Brian Goetz
> Quiotix Corporation
>           Tel: 650-843-1300            Fax: 650-324-8032
> --
> To unsubscribe, e-mail:   <>
> For additional commands, e-mail: <>
--
- port of Excel format to java
- fix java generics!

The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh

