lucene-java-user mailing list archives

From Sreejith S <srssreej...@gmail.com>
Subject Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Date Sat, 12 Mar 2011 03:39:46 GMT
I suggest the Jsoup HTML parser, which is fast, easy, and simple to use.
I have used many HTML parsers, and Jsoup is the one I am most comfortable
with.

http://jsoup.org/

For tokenizing the extracted text, IBM ICU provides the best tokenizers.
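
To give an idea of the tokenizing step, here is a minimal sketch of word-boundary tokenization over text that has already been pulled out of the HTML. It is only an illustration: it uses the JDK's java.text.BreakIterator as a stand-in so it runs without extra jars, whereas ICU4J offers com.ibm.icu.text.BreakIterator with a similar word-instance API. The class name WordTokens and the helper tokenize are hypothetical names, not part of Lucene, Jsoup, or ICU.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the name is an assumption made for this sketch.
class WordTokens {

    // Splits plain text (for example, text extracted from an HTML page)
    // into word tokens using locale-neutral word-boundary rules.
    // Assumption: JDK BreakIterator stands in for the ICU4J equivalent.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);

        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String piece = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation, so keep only
            // segments that contain at least one letter or digit.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample input as it might look after the markup has been stripped.
        System.out.println(tokenize("Lucene indexes text, not markup."));
    }
}
```

In a real indexing setup the resulting text would normally be handed to a Lucene Analyzer rather than tokenized by hand; the sketch is only meant to show what boundary-based tokenization does to the parser output.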



On 3/11/11, Bill Janssen <janssen@parc.com> wrote:
> shrinath.m <shrinath.m@webyog.com> wrote:
>
>> Consider we've offline HTML pages, no parsing while crawling, now what ?
>> Any tokenizer someone has built for this ?
>
> In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages
> by selecting only text between certain tags, before indexing them.
> These are offline Web pages, as in your application.  Take a look at
> <http://uplib.parc.com/hg/uplib/file/2a204fc2dd1a/extensions/FilterWebPage.py>.
>
> Bill
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
*********************************
Sreejith.S

http://sreejiths.emurse.com/
http://srijiths.wordpress.com/
tweet2sree@twitter

*********************************
ILUGCBE
http://ilugcbe.techstud.org
