lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yonik Seeley" <yo...@apache.org>
Subject Re: How to not tokenize HTML tag from input string
Date Thu, 08 Feb 2007 21:12:08 GMT
On 2/8/07, Peter W. <peter@marketingbrokers.com> wrote:
> Using a parser to get text out of HTML, XML (including RSS, ATOM) is
> only
> easy if you control the source documents.
>
> HTML pages in the wild are much different, generating exceptions you
> must
> catch and deal with.

Yes, that's why the Solr version isn't an HTML parser and assumes a mixed bag
of HTML, javascript, comments, processing instructions, and non-escaped text.
It doesn't throw exceptions because there are no "correctness" assumptions.

If you *know* you have well formed HTML documents, then you definitely
should parse (and extract important entities out to separate fields)
before sending to Lucene/Solr .

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message