lucene-dev mailing list archives

From "Erik Hatcher" <>
Subject Re: [Help with upcomming submission/RFC] Excel Parser
Date Fri, 18 Jan 2002 04:24:37 GMT

Welcome!  And POI looks great.

I look forward to its Jakarta integration - as well as the Lucene documents
that you are writing.  I'm hoping that Lucene will set up a contrib area of
their CVS where such extensions can go, especially for things like MS Office
documents through POI and things like text, XML, and HTML.

I would like to see folks come to some agreement on an API to use for such
extensions, perhaps via an interface or base class - although I'm partial to
the interface so as to not interfere with other folks' object hierarchy and

I haven't submitted my latest creation yet, but it's about time I get ready
to - I've created an Ant task that is used to index a <fileset> (for those
of you that know Ant terminology).  It's extensible in that it uses a
"documentHandler" specified classname and allows a document handler to
return back Lucene Document objects when passed a object.  For
example, my current document handler is a FileExtensionDocumentHandler, and
retrieves a Document from a TextDocument class if the file ends with ".txt"
or from an HtmlDocument class if it ends with ".html".  I've submitted my
HtmlDocument class to this list a while back (it uses JTidy under the covers
to DOM-ify an HTML file) - this extension handler could be morphed into
something even more dynamic that looks up Document-creating objects in a
lookup table.  Or a completely different document handler could be
implemented with other logic to return a Document from a File.
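
To make the handler idea concrete, here is a rough sketch of the
lookup-table variant described above.  It is not the code from my Ant task:
the interface name, the method name getDocument, and the register method are
assumptions made up for illustration, and SketchDocument is a stand-in for
Lucene's Document class so the example compiles without Lucene on the
classpath.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Stand-in for Lucene's Document (assumption: the real handler would return that).
final class SketchDocument {
    final String kind;   // which builder produced it, e.g. "text" or "html"
    final String path;   // path of the source file
    SketchDocument(String kind, String path) { this.kind = kind; this.path = path; }
}

// Hypothetical contract: turn a File into a document, or null if unsupported.
interface DocumentHandler {
    SketchDocument getDocument(File file);
}

// Dispatches on file extension through a lookup table instead of hard-coded checks.
final class FileExtensionDocumentHandler implements DocumentHandler {
    private final Map<String, Function<File, SketchDocument>> builders = new HashMap<>();

    FileExtensionDocumentHandler() {
        // The two cases mentioned above; real builders would parse the file contents.
        register("txt", f -> new SketchDocument("text", f.getPath()));
        register("html", f -> new SketchDocument("html", f.getPath()));
    }

    // New formats plug in here without touching the dispatch logic.
    void register(String extension, Function<File, SketchDocument> builder) {
        builders.put(extension.toLowerCase(Locale.ROOT), builder);
    }

    @Override
    public SketchDocument getDocument(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return null; // no extension, nothing to dispatch on
        }
        // Case-insensitive matching is a choice of this sketch.
        String extension = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<File, SketchDocument> builder = builders.get(extension);
        return builder == null ? null : builder.apply(file);
    }
}
```

With something like this, an Excel handler built on POI would be one more
register("xls", ...) call rather than a change to the handler itself.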

If we could all come to some consensus on these interfaces or at least spell
out ways we can all work together to bring these various pieces into a
unified product we'll have something very powerful on our hands (not that
Lucene and POI aren't already). The synergy possible between Lucene, POI,
Ant, and of course Slide and others is staggering.  Ant fits into the
equation too because there may very well be documents that you want indexed
at build time that are static from there on out (e.g. application
documentation, or "static" websites, although those could be dynamically
generated at build time).

Thoughts?  How are the contrib plans going?  I definitely think Lucene should
be careful of such a contrib area or "document" area by not accepting things
that are tightly coupled with vendors or other products that are more of
niche products.  Ant has collected too much cruft by tossing in lots of
vendor-specific things when it really should just be the definer of the
interfaces and allow things to plug in more easily.  But I do think that
Lucene should have built-in capability to do standard stuff with text, HTML,
XML, and now MS Office documents through POI's API.

I'm just about ready to toss my stuff out there for public review and
hopefully adoption.  It's got some rough edges, but overall it's a step in the
right direction and, with some Lucene expert eyes looking at it, could really
improve quickly.

Has anyone else written an Ant task that builds a Lucene index?  If so, is
it publicly available?  Where?


----- Original Message -----
From: "Andrew C. Oliver" <>
To: <>
Sent: Thursday, January 17, 2002 8:57 PM
Subject: [Help with upcomming submission/RFC] Excel Parser

> Hi All,
> I know some of you from the general@jakarta list.  For those of you I
> don't know I'm Andy Oliver, a developer on the POI project.
> The POI project seeks to port all of the Microsoft OLE 2 Compound
> Document format based file formats to Java.  It's APL.
> (
> We've already completed our first iteration of the Excel File format.
> We'll begin our port of the Word file format shortly.
> I'd like to begin collaborating on adding this functionality to Lucene.
> Would this be acceptable (I guess for quite some time I'd gone under the
> assumption it would be due to the FAQ mentioning "wouldn't it be nice"
> #12)?
> I think I'm ready to write the ExcelDocument class for Lucene but it is
> not immediately apparent to me where one should package it..?  Should it
> just be a "demo" (like HTMLParser)?  Would a "documents" package be
> appropriate?
> I'd also like to fit it into the current (or upcoming) development
> structure.
> Lastly there is the matter of where to put the POI jars.  Any guidance
> you can provide will be greatly appreciated.
> -Andy
> --
> - port of Excel format to java
> - fix java generics!
> The avalanche has already started. It is too late for the pebbles to
> vote.
> -Ambassador Kosh