lucene-java-user mailing list archives

From Patrick Kimber
Subject Re: Text extraction from HTML
Date Fri, 29 Jul 2005 08:14:30 GMT
Hi Giovanni,

We are using the Neko HTML parser.  Some simple example code can be
found in the "Lucene in Action" book.
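
To illustrate the general idea of pulling plain text out of HTML before
indexing, here is a minimal sketch. Note that it does NOT use Neko: Neko is
a separate jar, so this sketch swaps in the HTML parser that ships with the
JDK (javax.swing.text.html) to stay runnable without extra dependencies.
The class name HtmlTextExtractor and the method extractText are
hypothetical names chosen for this example; they come from neither Neko nor
"Lucene in Action".

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text nodes of an HTML document.
// Uses the JDK's built-in Swing HTML parser as a stand-in for Neko.
final class HtmlTextExtractor {

    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of character data.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate adjacent text runs
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // Third argument true: ignore any charset declared in the page,
            // since the input is already decoded into a String.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example</title></head>"
                + "<body><h1>Heading</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The returned string could then be stored in a Lucene field for indexing.
With Neko the structure would be similar (parse, then walk the result and
collect text nodes), but check its own documentation for the exact API.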

For more information:


On 29/07/05, Giovanni Novelli wrote:
> Hello,
> I'm working on the development of multi-agent software that
> involves information indexing, information retrieval, and
> information categorization tasks. I want to build the training set for
> categorization from a set of HTML pages fetched from DMOZ RDF dumps.
> I have tried the HtmlParser that comes with Nutch, but I wasn't able to
> make it work without adjusting Nutch's global XML configuration;
> perhaps that is the only way to make the plugin work? Does Lucene provide
> a good HTML parser in the contrib section for parsing web pages found
> in the wild?
> Best regards,
> Giovanni Novelli
> P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.