lucene-java-user mailing list archives

From Robert Watkins <rwatk...@foo-bar.org>
Subject Re: html parsers and numbers of terms
Date Tue, 13 Dec 2005 16:36:38 GMT
Aha! I had, indeed, been fooled by Luke into thinking that the entities
had been converted upon analysis, but you have set me straight.

Thanks,
-- Robert

On Tue, 13 Dec 2005, J.J. Larrea wrote:

> Beware of HTML/XML entities in your input stream!  The Lucene analyzers (including StandardAnalyzer)
> do not interpret these representation-specific encodings, and assume the & and ; delimiters
> are punctuation.  How they deal with punctuation depends on the specific Analyzer logic.
>
> [ snipped ]
>
> PS: Also note that when using Luke to see what is indexed, it uses NCRs (e.g. &#233;)
> to display non-ASCII characters, so one can easily be confused as to whether the NCRs
> or the Unicode characters themselves were indexed.
>
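
To illustrate the point in the quoted message, here is a minimal sketch of decoding numeric character references before the text is handed to an analyzer, so that the Unicode characters rather than the NCRs end up being tokenized. The class name NcrDecoder and its decode method are hypothetical helpers, not part of Lucene, and the sketch handles only numeric references such as &#233; or &#xE9;. Named entities such as &eacute; would need a lookup table or a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): decodes numeric character
// references so that an analyzer sees Unicode characters instead of
// "&", "#", digits and ";".
final class NcrDecoder {
    // Matches hexadecimal (&#xE9;) and decimal (&#233;) references.
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave the reference untouched if it is outside the Unicode range.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling NcrDecoder.decode on each field value before indexing would be one way to apply it. Whether that suits a given pipeline depends on how the HTML is parsed upstream, since many HTML parsers already resolve entities when extracting text.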
