lucene-dev mailing list archives

From Brian Goetz
Subject Format Stripping [ was: XLS parser ]
Date Fri, 18 Jan 2002 04:50:19 GMT

>Welcome!  And POI looks great.

Hear, hear!

>I would like to see folks come to some agreement on an API to use for such
>extensions, perhaps via an interface or base class - although I'm partial to
>the interface so as to not interfere with other folks object hierarchy and

The ability to easily handle various document file formats is an area where 
a little up-front work will make life a _lot_ easier for future (and 
current) users.

At first glance, the missing abstraction appears to be something like 
FormatStripper, which is basically like a FilterInputStream; it would take 
in a document of a specific format (Word, XLS, PDF, XML, etc) and return a 
stream of text corresponding to the "body" of the document.  This is an 
80-90% solution, and is probably worth doing because it is so easy.  Then 
we could hook up the HTML parser to a format stripper and include an 
HtmlFormatStripper in the core, which would be nice.
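As a rough sketch of what such a stripper could look like (the method name, the
use of a Reader for the returned text, and the toy tag-dropping implementation
are all assumptions for illustration, not an existing Lucene API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical abstraction: raw document bytes in, body text out. */
interface FormatStripper {
    Reader strip(InputStream in) throws IOException;
}

/**
 * Toy implementation that drops everything between '<' and '>'.
 * It is NOT a real HTML parser (no entities, comments, or scripts), it
 * assumes UTF-8 input, and it buffers the whole body in memory instead
 * of streaming it.
 */
final class NaiveTagStripper implements FormatStripper {
    @Override
    public Reader strip(InputStream in) throws IOException {
        Reader source = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder body = new StringBuilder();
        boolean inTag = false;
        int c;
        while ((c = source.read()) != -1) {
            if (c == '<') {
                inTag = true;            // start skipping markup
            } else if (c == '>') {
                inTag = false;           // markup ended, resume copying
            } else if (!inTag) {
                body.append((char) c);   // plain body text
            }
        }
        return new StringReader(body.toString());
    }
}
```

A real HtmlFormatStripper would delegate to the existing HTML parser instead of
this character loop; the point is only that callers see one small interface
regardless of the underlying format.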

The remaining 10-20% comes from the ability to extract metadata from the 
document as well.  Word documents have Author, Title, and CreationDate 
metadata, which are pretty useful in a multi-field text indexing system 
like Lucene; HTML documents have META tags which contain keywords, which 
are also very useful.

Unfortunately, the Lucene Document class is defined in a way that makes it 
hard to collect both the body and the other fields in one pass over the 
input file.  While that irks the performance-weenie in me, in reality 
keeping one document in memory while scanning it probably isn't so 
horrible.  (I think Doug convinced me of that.)

I would say that there are two stages to defining how an arbitrary document 
will map into Lucene.  Let's call them the "Document Digester" and the 
"Document Mapper".  The document digester processes a document (say, a Word 
document) and breaks it down into multiple fields, like Author, Title, 
CreationDate, and Body.  The document mapper maps the fields created by the 
DD into fields that will actually get put into a Lucene index.

import java.io.InputStream;
import java.util.Set;
import org.apache.lucene.document.Document;

public interface DocumentDigester {
   /** Retrieve the names of the fields supported by this Digester */
   public Set getFields();

   /** Digest a document into memory */
   public void digestDocument(InputStream in);

   /** Digest a document into a Lucene Document.  Could optimize the
     * loading process by not storing fields that are not desired by
     * the mapper */
   public Document digestDocument(InputStream in,
                                  DocumentMapper mapper);

   /** Retrieve a digested field as a String */
   public String getFieldAsString(String field);

   /** Retrieve a digested field as a stream */
   public InputStream getFieldAsStream(String field);
}

public interface DocumentMapper {
   /** Indicate that we want to map a given field from the document into
    *  the given Lucene field.  These routines together basically define
    *  Map-like functionality.  There could be a convenience constructor
    *  that sets up the field mappings using a Properties */
   public void mapField(String documentField, String luceneField);

   /** Retrieve the Lucene field that the given document field maps to */
   public String getFieldMapping(String documentField);
}
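
To make the Map-like behaviour concrete, here is a minimal sketch of one
possible mapper.  SimpleDocumentMapper, MappingDemo, and applyMapping are
hypothetical names invented for this example; digested fields are modelled as
a plain Map of strings rather than a Lucene Document so that the sketch stays
self-contained, and the DocumentMapper interface is repeated for the same
reason.  Returning null for an unmapped field is also an assumption.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

interface DocumentMapper {
    void mapField(String documentField, String luceneField);
    String getFieldMapping(String documentField);
}

/** Hypothetical Map-backed mapper; not part of Lucene. */
final class SimpleDocumentMapper implements DocumentMapper {
    private final Map<String, String> mappings = new HashMap<>();

    SimpleDocumentMapper() {}

    /** Convenience constructor: each property reads documentField=luceneField. */
    SimpleDocumentMapper(Properties props) {
        for (String name : props.stringPropertyNames()) {
            mapField(name, props.getProperty(name));
        }
    }

    @Override
    public void mapField(String documentField, String luceneField) {
        mappings.put(documentField, luceneField);
    }

    /** Returns null when the document field has no mapping (assumed convention). */
    @Override
    public String getFieldMapping(String documentField) {
        return mappings.get(documentField);
    }
}

final class MappingDemo {
    /** Renames digested fields per the mapper and drops unmapped ones. */
    static Map<String, String> applyMapping(Map<String, String> digested,
                                            DocumentMapper mapper) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : digested.entrySet()) {
            String target = mapper.getFieldMapping(e.getKey());
            if (target != null) {
                out.put(target, e.getValue());
            }
        }
        return out;
    }
}
```

With a mapper that maps only Author and Body, a digested Title field would
simply be left out of the result, which is the "fields he cares about"
behaviour.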

Now, someone can wrap any given document format processor with a 
DocumentDigester, and then in order to import documents into Lucene, the 
user only need create the Mapper showing which fields he cares 
about.  Again, we can create an HtmlDigester from the existing parser and 
include it in the kit, along with adapters for various document parsing 

Brian Goetz
Quiotix Corporation           Tel: 650-843-1300            Fax: 650-324-8032
