lucene-java-user mailing list archives

From "shrinath.m"
Subject Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Date Fri, 11 Mar 2011 11:50:28 GMT
On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] wrote:

> But I think the parser will mostly be used when crawling. So you can use
> these parsers when crawling and save only the parsed result.

Suppose we have offline HTML pages and no parsing was done while crawling. Now what?
Has anyone built a tokenizer for this?

How does Solr do it?

