lucene-java-user mailing list archives

From Manuel Le Normand <manuel.lenorm...@gmail.com>
Subject Re: Too many unique terms
Date Sat, 27 Apr 2013 18:41:39 GMT
Hi, many thanks for the previous reply.
For now I'm not able to separate out these useless terms, whether they
consist of words or of digits.
I liked the idea of iterating with a TermsEnum. Will it also delete the
occurrences of these terms in the other file formats (termVectors etc.)?

As I understand it, the strField implementation is a kind of TrieField ordered
by the leading char (as searches support wildcards), and every term in the
dictionary points into the inverted file (frq) to find the list (not a bitmap)
of the docs containing the term.

Let's say I query for the term "hello" many times within different queries;
the OS will load into memory the matching 4k chunk from the dictionary and
the frq. If most of my terms are garbage, much of the dictionary chunk will
be useless, whereas the frq chunk will be used more efficiently, as it
contains the whole <termFreq> list. Still, I'm not sure a typical
<termFreqs,skipData> chunk for a single term gets to 4k.

If my assumption is right, I should lower the memory chunk size (through
the OS) to about the 90th percentile of the <termFreq,skipData> chunk for
a single term in the frq (neglecting for now the use of prx and
termVectors). Any cons to the idea? Do you have any estimate of the
magnitude of a frq chunk for a term occurring N times, or how can I check
it on my own?

Thanks,
Manu


On Thu, Apr 25, 2013 at 2:04 AM, Adrien Grand <jpountz@gmail.com> wrote:

> Hi Manuel,
>
> On Thu, Apr 25, 2013 at 12:29 AM, Manuel LeNormand
> <manuel.lenormand@gmail.com> wrote:
> > Hi there,
> > Looking at my index (about 1M docs) I see a lot of unique terms, more
> > than 8M, which is a significant part of my total term count. These are
> > very likely useless terms, binaries or other meaningless numbers that
> > come with a few of my docs.
>
> If you are only interested in letters, one option is to change your
> analysis chain to use LetterTokenizer. This tokenizer will split on
> everything that is not a letter, filtering out numbers and binary
> data.
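
A minimal sketch of such an analysis chain (Lucene 4.x style; the Version
constant and the constructor signatures differ between releases, and whether
you also want lowercasing is up to you):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

public class LetterOnlyAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // LetterTokenizer splits on anything that is not a letter, so pure
    // numbers and binary junk never make it into the terms dictionary.
    Tokenizer source = new LetterTokenizer(Version.LUCENE_43, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_43, source);
    return new TokenStreamComponents(source, result);
  }
}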
>
> > I am totally fine with deleting them so these terms would be
> > unsearchable. Thinking about it I get that:
> > 1. It is impossible to know a priori if a term is unique or not, so I
> > cannot add them to my stop words.
> > 2. I get a performance decrease because my cached "hot spot" chunks (4kb)
> > do contain useless data. It's a problem for me as I'm short on memory.
> >
> > Q:
> > Assuming a constant index, is there a way of deleting all terms that are
> > unique from at least the dictionary tim and tip files? Do I need to enter
> > the source code for this, and if yes, what part of it?
>
> If frequencies are indexed, you can pull a TermsEnum, iterate through
> the terms dictionary and delete terms that are less frequent than a
> given threshold. As you said, this will however prevent your users
> from searching for these terms anymore.
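
A minimal sketch of that iteration (Lucene 4.x API; the index path and the
field name "content" are placeholders). Note that this only identifies the
rare terms; physically dropping them from the tim/tip/frq files still means
re-indexing, or rewriting the index without them:

import java.io.File;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class FindRareTerms {
  public static void main(String[] args) throws Exception {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
    Set<String> rare = new HashSet<String>();
    Terms terms = MultiFields.getTerms(reader, "content");   // placeholder field name
    if (terms != null) {
      TermsEnum te = terms.iterator(null);
      BytesRef term;
      while ((term = te.next()) != null) {
        if (te.docFreq() <= 1) {            // the frequency threshold mentioned above
          rare.add(term.utf8ToString());    // e.g. to reuse as stopwords when re-indexing
        }
      }
    }
    System.out.println(rare.size() + " terms occur in at most one doc");
    reader.close();
  }
}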
>
> > Will I get a significant query-time performance increase besides the better
> > RAM use benefit?
>
> This is hard to answer. Having fewer terms in the terms dictionary
> should make search a little faster but I can't tell you by how much.
> You should also try to disable features that you don't use. For
> example, if you don't need positional information or frequencies,
> IndexOptions.DOCS_ONLY will make your postings lists smaller.
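
A minimal sketch of indexing a field with DOCS_ONLY (Lucene 4.x API; the
field name and text are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

public class DocsOnlyField {
  public static Document makeDoc(String text) {
    FieldType docsOnly = new FieldType(TextField.TYPE_NOT_STORED);
    docsOnly.setIndexOptions(IndexOptions.DOCS_ONLY);   // no freqs, no positions -> smaller postings
    docsOnly.freeze();
    Document doc = new Document();
    doc.add(new Field("content", text, docsOnly));      // "content" is a placeholder field name
    return doc;
  }
}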
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
