lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: html parsers and numbers of terms
Date Tue, 13 Dec 2005 13:54:41 GMT
How about taking a single simple HTML file, running it through each  
parser, dumping the tokens into separate collections (or output to a  
single text file) and diff them?

	Erik
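
The comparison suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two input strings are hypothetical stand-ins for what JTidy and HTMLParser might return for the same HTML fragment, and the `tokens` helper is a crude lowercase-and-split substitute for whatever Lucene Analyzer the index really uses. The class and method names (`TokenDiff`, `tokens`, `onlyIn`) are invented for this example.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class TokenDiff {

    // Crude stand-in for an analyzer (an assumption, not Lucene's behaviour):
    // lowercase the text and split on whitespace, keeping unique tokens.
    static Set<String> tokens(String text) {
        Set<String> out = new TreeSet<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Tokens present in `a` but missing from `b`.
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new TreeSet<String>(a);
        diff.removeAll(b);
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical parser output for the same HTML fragment:
        // one parser decodes entities, the other leaves them in place.
        String decoded = "Caf\u00e9 menu & prices";
        String entitiesKept = "Caf&eacute; menu &amp; prices";

        Set<String> a = tokens(decoded);
        Set<String> b = tokens(entitiesKept);

        System.out.println("unique tokens, parser A: " + a.size());
        System.out.println("unique tokens, parser B: " + b.size());
        System.out.println("only in A: " + onlyIn(a, b));
        System.out.println("only in B: " + onlyIn(b, a));
    }
}
```

To apply this to the real case, replace the two strings with the text each parser extracts from the same file and replace `tokens` with the token stream from the analyzer used at index time. The two "only in" sets then show exactly which terms one index has and the other lacks.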

On Dec 13, 2005, at 7:33 AM, Robert Watkins wrote:

> I have been experimenting with a couple of HTML parsers, primarily to
> compare performance, but have discovered a difference in the index for
> which I haven't, with assurance, discovered the cause.
>
> The difference is in the number of terms reported by Luke. The indexes
> created with the content parsed using JTidy generally have about 30%
> fewer terms than those created with content parsed using HTMLParser
> (htmlparser.org).
>
> The only difference I can discern (using debug logs and diff) is with
> the way entities are handled by the two parsers. Using JTidy, any HTML
> entities are converted to the literal character; using HTMLParser they
> are left as an entity (named or numeric). In the fields that are
> tokenized, entities not already converted are done so in the index,
> which leaves only the fields not tokenized. It does not seem likely to
> me that this could account for 30% of the terms indexed.
>
> Is it possible to use Luke (or some other tool) to make a more detailed
> comparison of the two indexes? I have tried to find a difference in the
> top terms indexed, and while the order of the top terms does change, the
> numbers do not. Am I missing something obvious?
>
> Thanks,
> -- Robert
>
> --------------------
> Robert Watkins
> rwatkins@foo-bar.org
> --------------------
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

