lucene-java-user mailing list archives

From Michael McCandless
Subject Re: Indexing useful N-grams (phrases & entities) and adding payloads
Date Wed, 12 Mar 2014 10:17:51 GMT
You could also use SynonymFilter?

Why does the boost need to be encoded in the index (in a payload) vs
at query time when you create the TermQuery for that term?  Does the
boost vary depending on the surrounding context / document?

Mike McCandless
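
One way to read the suggestion above: if the boost depends only on the term's type (phrase vs. entity) and not on the document, it can live in a lookup table that is consulted when the query is built. The following is a minimal sketch in plain Java with no Lucene dependency; the class name, the type labels, and the boost values are all hypothetical and do not come from the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: resolves a boost per term when the query is built,
// so nothing about the boost has to be stored in the index.
final class QueryTimeBoosts {
    // Merged term -> type label (e.g. "phrase", "entity"); illustrative only.
    private final Map<String, String> typeByTerm = new HashMap<>();
    // Type label -> boost; can be retuned without reindexing.
    private final Map<String, Float> boostByType = new HashMap<>();

    void register(String term, String type) {
        typeByTerm.put(term, type);
    }

    void setBoost(String type, float boost) {
        boostByType.put(type, boost);
    }

    // Returns the boost to apply to the query for this term, 1.0 if unknown.
    float boostFor(String term) {
        String type = typeByTerm.get(term);
        if (type == null) {
            return 1.0f;
        }
        Float boost = boostByType.get(type);
        return boost == null ? 1.0f : boost;
    }
}
```

The returned value would then be set as the boost of the TermQuery for that term. This only works when the boost does not vary with the document or surrounding context, which is exactly the question the reply raises.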

On Wed, Mar 12, 2014 at 5:27 AM, Manuel Le Normand wrote:
> Hi,
> I posted this question on the Solr mailing list but it has more to do with
> Lucene.
> I have a performance problem and a scoring problem with phrase queries:
>    1. Performance - phrase queries involving frequent terms are very slow
>    because large position postings lists have to be read.
>    2. Scoring - I want to control the boost of phrase and entity (from
>    gazetteers) matches.
> Indexing all terms as both bi-grams and unigrams is not possible in my use
> case, so I plan to index only the useful bi-grams. Part of this will be
> achieved by the CommonGram filter, to which I supply the frequent words.
> I am thinking of going a step further and also indexing phrase queries
> (extracted from my query log) and entities (from gazetteers). In order to
> control the boost on these N-gram matches I plan to add payloads to these
> terms.
> I'm thinking of two different implementations:
>    1. Using MappingCharFilter - the mapping.txt would be
> #phrase-query
> term1 term2 term3 => term1_term2_term3|1
> #entity
> firstName lastName => firstName_lastName|2
> Very simple to implement but an issue might be that I have 100k-1M
> (depending on frequency) phrases/entities as above. I saw that
> MappingCharFilter is implemented as an FST, so I'm not concerned with
> memory footprint, but I'm concerned that iterating on the charBuffer for
> long documents might cause problems.
>    2. Using the shingleTokenFilter - customizing it to compare the output
>    against my gazetteers. It would require an FST implementation in this
>    TokenFilter.
> Will I get a quick win with opt. 1? How hard would it be to implement opt. 2?
> General question: Is the above N-gram + payload resolution a common
> practice?
> Thanks in advance,
> Manuel
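
The matching step that both options in the quoted message rely on (find a known phrase or entity in the token stream and replace it with one joined token carrying a payload label) can be sketched in plain Java. This is only an illustration under stated assumptions: it uses a HashMap instead of an FST, operates on a list of strings rather than a Lucene TokenStream, and the class and method names are hypothetical. The `term1_term2_term3|1` output format follows the mapping.txt example in the message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: greedy longest-match replacement of known phrases.
final class PhraseMerger {
    // Space-joined phrase -> payload label (1 = phrase query, 2 = entity).
    private final Map<String, Integer> phrases = new HashMap<>();
    // Length in tokens of the longest registered phrase.
    private int maxLen = 1;

    void add(String phrase, int payload) {
        String[] parts = phrase.trim().split("\\s+");
        maxLen = Math.max(maxLen, parts.length);
        phrases.put(String.join(" ", parts), payload);
    }

    // Replaces each matched phrase with "t1_t2_...|payload"; tokens that
    // belong to no phrase pass through unchanged. Only phrases of two or
    // more tokens are considered.
    List<String> merge(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int consumed = 0;
            int longest = Math.min(maxLen, tokens.size() - i);
            // Try the longest candidate window first, then shrink it.
            for (int n = longest; n >= 2; n--) {
                List<String> window = tokens.subList(i, i + n);
                Integer payload = phrases.get(String.join(" ", window));
                if (payload != null) {
                    out.add(String.join("_", window) + "|" + payload);
                    consumed = n;
                    break;
                }
            }
            if (consumed == 0) {
                out.add(tokens.get(i));
                consumed = 1;
            }
            i += consumed;
        }
        return out;
    }
}
```

With 100k-1M entries, the per-position window lookups are the cost to watch on long documents; an FST or trie would let the match advance token by token without rebuilding joined strings, which is presumably why the message mentions an FST for opt. 2.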
