lucene-java-user mailing list archives

From Kristofer Karlsson <k...@spotify.com>
Subject Re: Lucene handling of duplicate terms
Date Thu, 05 Sep 2013 07:57:09 GMT
On Thu, Sep 5, 2013 at 9:46 AM, Adrien Grand <jpountz@gmail.com> wrote:

> Hi,
>
> On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson <krka@spotify.com>
> wrote:
> > I have a use case where some of my documents have duplicate terms in
> > various fields or within the same field.
> >
> > For example, I may have a million documents with just the term "foo" in
> > field A, and one particular document with the term "foo" in both field A
> > and B, or with two occurrences of "foo" in the same field.
> >
> > If I search for "foo foo" I would like to filter out all the documents with
> > only one matching term - is this possible?
>
> I don't think we have existing queries that allow for doing it
> efficiently (if someone reads this and knows it is wrong, please
> correct!). However, it should be doable to implement such a query
> rather easily by iterating over the postings lists of the 'foo' term
> in all the fields you are interested in, summing up frequencies (the
> index must have been created with IndexOptions.DOCS_AND_FREQS or
> higher) and only keeping documents whose sum of frequencies is at
> least 2.
>
> --
> Adrien
>
Thanks for the quick reply!
So I'd have to tokenize the search query, count each term manually, and
keep a map from term to count. I will definitely try this.
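
The approach discussed above can be sketched in plain Java. This is only a minimal illustration under stated assumptions, not Lucene code: the postings are modeled as nested in-memory maps (field -> term -> doc id -> frequency), standing in for the per-document frequencies that an index built with IndexOptions.DOCS_AND_FREQS or higher would provide. The class name DuplicateTermFilter and the methods countQueryTerms and matchingDocs are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
final class DuplicateTermFilter {

    // Counts how often each term occurs in an already-tokenized query,
    // e.g. ["foo", "foo"] becomes {foo=2}.
    static Map<String, Integer> countQueryTerms(List<String> queryTokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : queryTokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // postings (assumed layout): field -> term -> (doc id -> term frequency in that field).
    // Returns, in ascending order, the doc ids in which every query term occurs,
    // summed over the given fields, at least as often as it occurs in the query.
    static List<Integer> matchingDocs(
            Map<String, Map<String, Map<Integer, Integer>>> postings,
            List<String> fields,
            Map<String, Integer> required) {
        Set<Integer> result = null;
        for (Map.Entry<String, Integer> wanted : required.entrySet()) {
            // Sum the frequencies of this term across all fields of interest.
            Map<Integer, Integer> summed = new HashMap<>();
            for (String field : fields) {
                Map<String, Map<Integer, Integer>> byTerm =
                        postings.getOrDefault(field, Collections.emptyMap());
                Map<Integer, Integer> byDoc =
                        byTerm.getOrDefault(wanted.getKey(), Collections.emptyMap());
                for (Map.Entry<Integer, Integer> posting : byDoc.entrySet()) {
                    summed.merge(posting.getKey(), posting.getValue(), Integer::sum);
                }
            }
            // Keep only documents whose summed frequency reaches the required count.
            Set<Integer> enough = new TreeSet<>();
            for (Map.Entry<Integer, Integer> total : summed.entrySet()) {
                if (total.getValue() >= wanted.getValue()) {
                    enough.add(total.getKey());
                }
            }
            // A document must satisfy every query term, so intersect per-term sets.
            if (result == null) {
                result = enough;
            } else {
                result.retainAll(enough);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        // Docs 1 and 2 have one "foo" in field A, doc 3 has two; doc 2 also has one in field B.
        Map<Integer, Integer> fooInA = new HashMap<>();
        fooInA.put(1, 1);
        fooInA.put(2, 1);
        fooInA.put(3, 2);
        Map<Integer, Integer> fooInB = new HashMap<>();
        fooInB.put(2, 1);

        Map<String, Map<String, Map<Integer, Integer>>> postings = new HashMap<>();
        postings.put("A", Collections.singletonMap("foo", fooInA));
        postings.put("B", Collections.singletonMap("foo", fooInB));

        Map<String, Integer> required = countQueryTerms(Arrays.asList("foo", "foo"));
        System.out.println(matchingDocs(postings, Arrays.asList("A", "B"), required)); // prints [2, 3]
    }
}
```

In this sketch, document 1 is filtered out because its summed frequency for "foo" is 1, while documents 2 and 3 each reach 2. A real implementation would presumably read the frequencies from the index's postings lists instead of from maps.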
