lucene-java-user mailing list archives

From Bill Janssen
Subject Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Date Fri, 11 Mar 2011 07:42:22 GMT
shrinath.m wrote:

> Consider we've offline HTML pages, no parsing while crawling, now what ?
> Any tokenizer someone has built for this ?

In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages
by selecting only text between certain tags, before indexing them.
These are offline Web pages, as in your application.  Take a look at 
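
As a rough illustration of the "keep only text between certain tags" idea,
here is a minimal sketch. It is not UpLib's code: it swaps in Python's
standard-library html.parser for BeautifulSoup so that it has no
dependencies, and the tag list (KEEP_TAGS) and the names TagTextExtractor
and extract_text are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags worth indexing; not taken from UpLib.
KEEP_TAGS = {"title", "h1", "h2", "h3", "p", "li"}


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the tags in KEEP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of kept tags currently open
        self.skip = 0     # number of <script>/<style> tags currently open
        self.chunks = []  # text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in KEEP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in KEEP_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        # Keep text only while inside a kept tag and outside script/style.
        if self.depth > 0 and self.skip == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def extract_text(html):
    """Return the kept text, one fragment per line, ready for indexing."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The string returned by extract_text is what you would hand to the Lucene
analyzer as the document body. One known limitation of this sketch: it
assumes kept tags are closed properly, so pages with unclosed <p> or <li>
tags will let more text through than intended. A tolerant parser such as
BeautifulSoup copes better with that kind of markup.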

