lucene-java-user mailing list archives

From Giovanni Novelli
Subject Text extraction from HTML
Date Fri, 29 Jul 2005 07:17:45 GMT
I'm developing a multi-agent software system that involves
information indexing, information retrieval, and information
categorization tasks. I want to build the training set for
categorization from a set of HTML pages fetched from the DMOZ RDF dumps.
I have tried the HtmlParser that comes with Nutch, but I wasn't able to
make it work without adjusting Nutch's global XML configuration;
is that perhaps the only way to make such a plugin work? Does Lucene
provide a good HTML parser in its contrib section for parsing web pages
found in the wild?

Best regards,
Giovanni Novelli

P.S.: This is a cross-post, as I'm relying on both Lucene and Nutch.
