lucene-dev mailing list archives

From "Erik Hatcher" <>
Subject Re: Contributor Document class repository proposal
Date Sat, 01 Dec 2001 02:58:06 GMT
Ok, my HtmlDocument is attached. It lives in a different package in my
current code, but I renamed it for this attachment with a mocked-up package.
Feel free to attach the Apache license, rename the package, and use it and
abuse it.... it's nothing special, but does the trick.  JTidy has an annoying
habit of outputting to System.out (or System.err??) some informational
stuff - and I'd like to find a way around that, or to capture it internally
to make it available if desired.  If folks fix it up and improve it, let me
know so I can keep my current code improving too.
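
A minimal sketch of one way to capture that output internally, assuming the
chatter goes through System.err (swap in System.out/setOut if it turns out to
be the other stream). StderrCapture and captureErr are hypothetical names made
up for illustration; they are not part of JTidy or Lucene, and the lambda
stands in for the real parse call.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical helper for illustration only; not part of JTidy or Lucene.
final class StderrCapture {
    private StderrCapture() {}

    /**
     * Runs the task with System.err temporarily redirected to an in-memory
     * buffer, restores the original stream, and returns what was written.
     */
    static synchronized String captureErr(Runnable task) {
        PrintStream original = System.err;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream capture = new PrintStream(buffer, true);
        System.setErr(capture);
        try {
            task.run();
        } finally {
            // Always restore the real stream, even if the task throws.
            capture.flush();
            System.setErr(original);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a parser call that prints diagnostics to System.err.
        String noise = captureErr(() -> System.err.print("informational message"));
        System.out.println("captured: " + noise);
    }
}
```

Swapping System.err is process-wide, so anything another thread prints during
the call lands in the buffer too. That makes this a stopgap; handing the
parser its own writer would be cleaner if the library allows it.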

I'd like to see Apache Lucene sort of follow what Struts does - there is a
top-level contrib directory in CVS where sub-projects could go.  We could
separate them by "type" somehow (HTML, XML, Word, PDF), or make a
contrib/handlers directory where they all get dumped for now.  I'd rather
see these extensions under the Lucene umbrella, at least for standard types
of documents that the lucene-dev folks could manage relatively painlessly.
If someone has some non-common or one-shot kind of extension then lucene-dev
should do like Ant and post it to a Resources page, but let someone else
deal with hosting it for download.

The build could be enhanced to build a lucene-extensions.jar too, for easy
downloading!  :)


----- Original Message -----
From: "Otis Gospodnetic" <>
To: "Lucene Developers List" <>
Sent: Friday, November 30, 2001 5:57 PM
Subject: Re: Contributor Document class repository proposal

> I like this idea.  Ant's resource page looks like a good example and
> I've used it a number of times, so it does serve a purpose.
> I also know that I could use Erik's HTML parsing (JTidy stuff) code
> today for a little application that uses Lucene and needs an HTML
> parser.
> I also like that generic XML -> Lucene Document idea from Mr. Ogren.
> My Lucene folder is full of code/attachments that various people sent
> to the list, but that never got included into Lucene for one reason or
> the other.
> I think this Resources area would solve these types of issues.
> One thing I would suggest though is to keep pointers to external
> projects, and bring them under the Lucene roof only if the project
> looks like it is very closely tied to Lucene, does not have too many
> additional dependencies, and if the project owner wants to be a part of
> Lucene.
> I would not enforce/require that.
> I have this little application that uses Lucene that I thought may be
> an okay contribution to Lucene as a demo, but it requires some
> additional libs, so it would just be a pain for new Lucene users to use
> it.  So I didn't contribute it yet.
> I am already associated with Lucene/Jakarta, so it wouldn't be hard for
> me to move this project under Lucene if people asked for it, but for
> others that may be too big of a change.
> Anyhow, the point is that I don't think that this should be enforced:
> > The source of these contributions will be added into the CVS at the
> > Apache/jakarta-lucene/contributions  level (to be added).
> My 2 liras.
> Otis
> --- wrote:
> > Contributor Document class repository proposal
> >
> > Issue: One of the areas that many developers are duplicating efforts
> > is in the creation of Documents. Although creating a Document is
> > specific to the data, most people deal with common formats such as
> > XML, csv, text, pdf, HTML, databases...
> >
> > Potential solution:
> > Allow developers / users to contribute their own Document classes.
> > How:
> > Create a new area called "resources" (this seems to be consistent
> > with other Jakarta projects) under the About area on the main Lucene
> > web page.
> > This will link to a page which will include contributions by other
> > people that are not part of the main Lucene distribution.
> > I think a good example of this is
> >
> > or
> >
> >
> > The contribution will be organized with a name, author, contribution
> > date and description. The name will be a link to download the tar or
> > zip file.
> > One question is if the source becomes part of the Lucene project, or
> > if it is owned and maintained by the contributor. In many of the other
> > projects, there is a link to an external website and an email address
> > of who maintains it. I would suggest that it becomes part of the Lucene
> > project.
> > This web page will be maintained by me.
> >
> > The source of these contributions will be added into the CVS at the
> > Apache/jakarta-lucene/contributions  level (to be added).
> > This repository will be maintained by me. Being a Document class
> > contributor does not give you write access to the cvs tree.
> >
> > Please let me know if people think this is a valuable contribution
> > and are willing to support it.
> > Also, any part of the solution is open for revision based on
> > feedback.
> >
> > Thanks
> >
> > --Peter
