lucene-java-user mailing list archives

From Patrick Kimber <mailing.patrick.kim...@gmail.com>
Subject Re: Text extraction from HTML
Date Fri, 29 Jul 2005 08:14:30 GMT
Hi Giovanni
We are using the Neko HTML parser.  Some simple example code can be
found in the "Lucene in Action" book.

For more information:
http://www.manning.com/books/hatcher2
http://www.apache.org/~andyc/neko/doc/html/
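A minimal sketch of the DOM-walking style of text extraction, not the code from the book. It rests on some assumptions: to stay runnable with the JDK alone, it parses well-formed markup with the built-in XML parser as a stand-in for NekoHTML. With NekoHTML on the classpath, the Document would presumably come from its DOM parser instead, which also copes with malformed pages. The class and method names (HtmlTextExtractor, extractText, collectText) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class HtmlTextExtractor {

    /**
     * Returns the visible text of a well-formed (X)HTML string.
     * Assumption: the JDK XML parser stands in for NekoHTML here, so
     * real-world "tag soup" would be rejected by this version.
     */
    static String extractText(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            // Collapse runs of whitespace so the result is easy to index.
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    /** Depth-first walk that gathers text nodes and skips script/style. */
    static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // The trailing space keeps text from adjacent elements apart;
            // it can also split a word that is broken up by inline tags.
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (type == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            // Case-insensitive match, since HTML parsers may upper-case names.
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello <b>Lucene</b></p></body></html>";
        System.out.println(extractText(page));
    }
}
```

The extracted string could then be added to a Lucene document as the body field; only collectText would need to be kept if the Document is supplied by a different parser.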

Patrick

On 29/07/05, Giovanni Novelli <giovanni.novelli@gmail.com> wrote:
> Hello,
> I'm working on the development of multi-agent software that
> involves some information indexing, information retrieval, and
> information categorization tasks. I want to build the training set for
> categorization using a set of HTML pages fetched from DMOZ RDF dumps.
> I have tried the HtmlParser that comes with Nutch, but I wasn't able to
> make it work without adjusting Nutch's global XML configuration;
> perhaps that's the only way to make such a plugin work? Does Lucene expose
> any good HTML parser in the contrib section to parse web pages found
> in the wild?
> 
> Best regards,
> Giovanni Novelli
> 
> P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
>
