lucene-java-user mailing list archives

From Manuel Le Normand <manuel.lenorm...@gmail.com>
Subject Re: Indexing useful N-grams (phrases & entities) and adding payloads
Date Wed, 12 Mar 2014 14:13:20 GMT
SynonymFilter makes sense.

The planned payloads are indeed not needed. I think a better solution would
be to carry the boost as an attribute at query time and have the
queryParser consume it in order to boost these n-gram terms.
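
To make the idea concrete, here is a minimal sketch in plain Java (no Lucene
dependency). It assumes the query has already been analyzed into terms,
including any joined n-gram terms, and that the boosts live in a lookup table.
The class name, the BOOSTS table and its values are hypothetical; the output
uses the term^boost syntax of the classic query parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NgramBoostSketch {

    // Hypothetical table: indexed n-gram term -> query-time boost.
    static final Map<String, Float> BOOSTS = new HashMap<>();
    static {
        BOOSTS.put("term1_term2_term3", 1.5f);   // phrase taken from the query log
        BOOSTS.put("firstname_lastname", 2.0f);  // entity taken from a gazetteer
    }

    // Turns analyzed query terms into a query string, appending ^boost
    // to every term that has an entry in the boost table.
    static String toQueryString(List<String> analyzedTerms) {
        List<String> clauses = new ArrayList<>();
        for (String term : analyzedTerms) {
            Float boost = BOOSTS.get(term);
            clauses.add(boost == null ? term : term + "^" + boost);
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(Arrays.asList("biography", "firstname_lastname")));
    }
}
```

With this approach the boosts can be changed without reindexing, which is the
main advantage over storing them in payloads.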

Thanks for the hints.
Manuel


On Wed, Mar 12, 2014 at 12:17 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> You could also use SynonymFilter?
>
> Why does the boost need to be encoded in the index (in a payload) vs
> at query time when you create the TermQuery for that term?  Does the
> boost vary depending on the surrounding context / document?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Mar 12, 2014 at 5:27 AM, Manuel Le Normand
> <manuel.lenormand@gmail.com> wrote:
> > Hi,
> > I posted this question on the Solr mailing list but it has more to do
> > with Lucene.
> >
> > I have a performance problem and a scoring problem with phrase queries:
> >
> >    1. Performance - phrase queries involving frequent terms are very slow
> >    due to the reading of large positions posting lists.
> >    2. Scoring - I want to control the boost of phrase and entity (in
> >    gazetteers) matches.
> >
> > Indexing all terms as bi-grams and unigrams is not possible in my use
> > case, so I plan to index only the useful bi-grams. Part of this will be
> > achieved by the CommonGram filter, in which I put the frequent words.
> >
> > I am thinking of going a step further and indexing phrase queries
> > (extracted from my query log) and entities (from gazetteers). In order to
> > control the boost on these N-gram matches I plan to add payloads to these
> > terms.
> >
> > I'm thinking of two different implementations:
> >
> >    1. Using MappingCharFilter - the mapping.txt would be
> >
> > #phrase-query
> >
> > term1 term2 term3 => term1_term2_term3|1
> >
> > #entity
> >
> > firstName lastName => firstName_lastName|2
> >
> >
> > Very simple to implement but an issue might be that I have 100k-1M
> > (depending on frequency) phrases/entities as above. I saw that
> > MappingCharFilter is implemented as an FST, so I'm not concerned with
> > memory footprint, but I'm concerned that iterating on the charBuffer for
> > long documents might cause problems.
> >
> > 2. Using the shingleTokenFilter - customizing it to compare the output
> > against my gazetteers. It would demand an FST implementation in this
> > TokenFilter.
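
The filtering step in option 2 can be sketched in plain Java as follows. This
is only an illustration of the idea, not a Lucene TokenFilter: a HashSet
stands in for the FST, and the class name, method name and the "_" separator
are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GazetteerShingleSketch {

    // Builds every shingle of 2..maxSize consecutive tokens and keeps only
    // those present in the gazetteer (a HashSet standing in for an FST).
    static List<String> usefulShingles(List<String> tokens, Set<String> gazetteer, int maxSize) {
        List<String> kept = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = 2; n <= maxSize && start + n <= tokens.size(); n++) {
                String shingle = String.join("_", tokens.subList(start, start + n));
                if (gazetteer.contains(shingle)) {
                    kept.add(shingle);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> gazetteer = new HashSet<>(Arrays.asList("firstname_lastname"));
        List<String> tokens = Arrays.asList("the", "firstname", "lastname", "said");
        System.out.println(usefulShingles(tokens, gazetteer, 3));
    }
}
```

Unlike a character-level mapping, this works on tokens, so the cost per
document grows with the number of tokens times the maximum shingle size.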
> >
> >
> > Will I get a quick win with opt.1? How hard would it be to implement opt.2?
> >
> > General question: Is the above N-gram + payload resolution a common
> > practice?
> >
> > Thanks in advance,
> > Manuel
>
