lucene-java-user mailing list archives

From Robert Watkins <rwatk...@foo-bar.org>
Subject HTML parsers and numbers of terms
Date Tue, 13 Dec 2005 12:33:18 GMT
I have been experimenting with a couple of HTML parsers, primarily to
compare performance, but have discovered a difference in the resulting
indexes whose cause I have not been able to pin down with any assurance.

The difference is in the number of terms reported by Luke. The indexes
created with the content parsed using JTidy generally have about 30%
fewer terms than those created with content parsed using HTMLParser
(htmlparser.org).

The only difference I can discern (using debug logs and diff) is in
the way the two parsers handle entities. JTidy converts any HTML entity
to the literal character; HTMLParser leaves it as an entity (named or
numeric). In the tokenized fields, entities not already converted by the
parser end up converted in the index anyway, so the difference can only
show up in the untokenized fields. It does not seem likely to me that
this alone could account for 30% of the terms indexed.
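
One way to take entity handling out of the comparison would be to decode
entities in the HTMLParser output before indexing, so that both indexes
see literal characters in the untokenized fields. The following is only a
minimal sketch: EntityNormalizer is a hypothetical helper, not part of
JTidy, HTMLParser, or Lucene, and its table of named entities is
deliberately tiny.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of JTidy, HTMLParser, or Lucene.
final class EntityNormalizer {
    // Assumption: only a handful of named entities are listed here.
    // A real table would need to cover the full HTML entity set.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"",
            "nbsp", "\u00A0",
            "eacute", "\u00E9");

    // Matches hex numeric, decimal numeric, and named entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    /** Replaces recognised entities with literal characters; others are left as-is. */
    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                // Unknown names fall back to the original entity text.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts the digits to a character, or returns the fallback if they are not a valid code point.
    private static String fromCodePoint(String digits, int radix, String fallback) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

If the term counts still differ by about 30% after normalizing this way,
entity handling is probably not the main cause.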

Is it possible to use Luke (or some other tool) to make a more detailed
comparison of the two indexes? I have tried to find a difference in the
top terms indexed, and while the order of the top terms does change,
their counts do not. Am I missing something obvious?
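
A more detailed comparison could be sketched as a set difference over the
complete term lists rather than the top terms. The code below assumes each
index's terms have already been dumped as one "field:text" entry per term;
that export step is not shown and depends on the Lucene version. TermDiff
and its methods are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: compares two term dumps. It does not read a Lucene index itself.
final class TermDiff {
    /** Returns the terms present in 'left' but absent from 'right', in sorted order. */
    static Set<String> onlyIn(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    /** Counts terms per field, assuming each entry has the form "field:text". */
    static Map<String, Integer> countByField(Set<String> terms) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            int colon = term.indexOf(':');
            // Entries without a colon are grouped under the empty field name.
            String field = colon < 0 ? "" : term.substring(0, colon);
            counts.merge(field, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling countByField(onlyIn(htmlParserTerms, jtidyTerms)) would show which
fields hold the extra terms. The two argument names are placeholders for
the two dumps. If the extra terms cluster in the untokenized fields, that
would point back at the entity handling; if they are spread across the
tokenized fields, the cause is more likely elsewhere.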

Thanks,
-- Robert

--------------------
Robert Watkins
rwatkins@foo-bar.org
--------------------
