lucene-java-user mailing list archives

From Manuel LeNormand <>
Subject Too many unique terms
Date Wed, 24 Apr 2013 22:29:12 GMT
Hi there,
Looking at my index (about 1M docs), I see a lot of unique terms, more
than 8M, which is a significant part of my total term count. These are very
likely useless terms: binary data or other meaningless numbers that come with
a few of my docs.
I am totally fine with deleting them, so that these terms become unsearchable.
Thinking about it, I realize that:
1. It is impossible to know a priori whether a term is unique or not, so I
cannot add them to my stop words.
2. I have a performance decrease because my cached "hot spot" chunks (4 KB)
contain useless data. This is a problem for me, as I am short on memory.
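
To make point 1 concrete, here is a minimal sketch in plain Java (no Lucene
API involved) of why uniqueness can only be decided after all docs have been
seen. It assumes "unique" means "occurs in exactly one document", which is an
interpretation and not something defined above; the class name, method name
and sample tokens are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class UniqueTermSketch {

    /**
     * Returns the terms that occur in exactly one document.
     * Assumption: "unique" means document frequency == 1.
     */
    public static Set<String> singleDocTerms(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Deduplicate per document so each doc counts a term only once.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        // The answer is only known once every document has been counted.
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() == 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical, already-tokenized documents.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "index", "a9f3x7"),
            Arrays.asList("lucene", "query", "index"),
            Arrays.asList("query", "00017734"));
        System.out.println(singleDocTerms(docs));
    }
}
```

A term seen once so far may still show up in a later doc, which is why a
stop-word list built at analysis time cannot capture these terms.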

Assuming a constant index, is there a way to delete all unique terms
from at least the term dictionary files (.tim and .tip)? Do I need to go into
the source code for this, and if so, which part of it?
Will I get a significant query-time performance increase, besides the better
RAM usage?
Are there any existing updateProcessor classes that identify non-human-readable
terms?
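
To illustrate the kind of check meant by "non-human-readable", here is a rough
sketch in plain Java. It is not an existing updateProcessor and uses no Lucene
or Solr API; the class name, the method name and both thresholds are
hypothetical and would need tuning against real data.

```java
class TokenHeuristicSketch {

    /**
     * Heuristic guess at whether a term is noise (binary residue, IDs, raw numbers).
     * Assumptions: very long terms and terms with a high share of non-letter
     * characters are unlikely to be words a user would search for.
     */
    public static boolean looksLikeNoise(String term, int maxLength, double maxNonLetterRatio) {
        if (term.isEmpty() || term.length() > maxLength) {
            return true;
        }
        int total = 0;
        int nonLetters = 0;
        // Iterate by code point so non-BMP characters are counted once.
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            if (!Character.isLetter(cp)) {
                nonLetters++;
            }
            total++;
            i += Character.charCount(cp);
        }
        return (double) nonLetters / total > maxNonLetterRatio;
    }

    public static void main(String[] args) {
        // Hypothetical sample tokens and thresholds.
        for (String t : new String[] {"lucene", "utf8", "a9f3x7", "00017734"}) {
            System.out.println(t + " -> " + looksLikeNoise(t, 30, 0.3));
        }
    }
}
```

A filter like this would also drop legitimate numeric terms such as years or
part numbers, so it only fits if those do not need to be searchable.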

Thanks in advance,
