lucene-java-user mailing list archives

From Manuel Le Normand <manuel.lenorm...@gmail.com>
Subject Indexing useful N-grams (phrases & entities) and adding payloads
Date Wed, 12 Mar 2014 09:27:10 GMT
Hi,
I posted this question on the Solr mailing list but it has more to do with
Lucene.

I have performance and scoring problems with phrase queries:

   1. Performance - phrase queries involving frequent terms are very slow
   because large positions postings lists have to be read.
   2. Scoring - I want to control the boost of phrase matches and of entity
   matches (entities taken from gazetteers).

Indexing all terms as both unigrams and bi-grams is not possible in my use
case, so I plan to index only the useful bi-grams. Part of this will be
achieved by the CommonGramsFilter, to which I will supply the frequent words.
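
To make the "useful bi-grams only" idea concrete, here is a minimal standalone sketch in plain Java. It does not use Lucene's API; the class name CommonBigramSketch and the "_" separator are assumptions chosen for illustration. Every unigram is kept, and a bi-gram is added only when one of the two adjacent words is in the frequent-word set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not Lucene code.
public class CommonBigramSketch {
    // Emits every unigram, plus a bi-gram for each adjacent pair in which
    // at least one of the two words is in the frequent-word set.
    static List<String> tokens(List<String> words, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            out.add(w);
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                if (common.contains(w) || common.contains(next)) {
                    out.add(w + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "the" is frequent, so only the pair (the, quick) yields a bi-gram.
        System.out.println(tokens(List.of("the", "quick", "fox"), Set.of("the")));
    }
}
```

A phrase query containing a frequent word can then be answered from the much shorter postings of the bi-gram term instead of the positions of the frequent word itself.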

I am thinking of going a step further and also indexing phrase queries
(extracted from my query log) and entities (from gazetteers). To control the
boost on these N-gram matches, I plan to add payloads to these terms.

I'm thinking of two different implementations:

   1. Using MappingCharFilter - the mapping.txt would be:

#phrase-query

term1 term2 term3 => term1_term2_term3|1

#entity

firstName lastName => firstName_lastName|2


This is very simple to implement, but I have 100k-1M (depending on
frequency) phrases/entities like the ones above. I saw that
MappingCharFilter is implemented as an FST, so I'm not concerned about the
memory footprint, but I am concerned that iterating over the char buffer of
long documents might cause problems.
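
The cost of option 1 can be pictured with a small standalone sketch of character-level longest-match rewriting. This is plain Java, not Lucene's MappingCharFilter; the class PhraseMappingSketch and its methods are hypothetical, and a HashMap-based trie stands in for the FST. The "|2" suffix assumes a later analysis stage (for example Lucene's DelimitedPayloadTokenFilter) turns it into a payload.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not Lucene code.
public class PhraseMappingSketch {
    // One trie node per character; 'replacement' is non-null where a rule ends.
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        String replacement;
    }

    private final Node root = new Node();

    // Registers one rule, e.g. "john smith" => "john_smith|2".
    void addRule(String from, String to) {
        Node n = root;
        for (char c : from.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.replacement = to;
    }

    // Single left-to-right pass; at each offset the longest matching rule wins.
    String apply(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            Node n = root;
            int matchEnd = -1;
            String matchValue = null;
            for (int j = i; j < text.length(); j++) {
                n = n.next.get(text.charAt(j));
                if (n == null) break;
                if (n.replacement != null) {
                    matchEnd = j + 1;
                    matchValue = n.replacement;
                }
            }
            if (matchValue != null) {
                out.append(matchValue);
                i = matchEnd;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

In this sketch the work per input character is bounded by the length of the longest rule, not by the number of rules. Because it matches characters rather than tokens, it also rewrites a phrase that occurs inside longer words.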

   2. Using the ShingleFilter - customizing it to compare its output
   against my gazetteers. It would require an FST implementation inside
   this TokenFilter.
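
Option 2 can be sketched the same way: build word shingles and keep only those found in the gazetteer. Again this is standalone Java rather than a Lucene TokenFilter; the class GazetteerShingleSketch, the "_" separator and the "|payload" suffix are assumptions, and a plain Map stands in for the FST lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Lucene code.
public class GazetteerShingleSketch {
    // Builds word shingles of 2..maxSize tokens and keeps only those found
    // in the dictionary, appending the dictionary's payload value.
    static List<String> matches(List<String> words, Map<String, Integer> dict, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder shingle = new StringBuilder(words.get(start));
            for (int len = 2; len <= maxSize && start + len <= words.size(); len++) {
                shingle.append('_').append(words.get(start + len - 1));
                Integer payload = dict.get(shingle.toString());
                if (payload != null) {
                    out.add(shingle + "|" + payload);
                }
            }
        }
        return out;
    }
}
```

Unlike the character-level mapping, this variant works on token boundaries, at the price of one dictionary lookup per candidate shingle (up to maxSize - 1 per token position).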


Will I get a quick win with option 1? How hard would it be to implement option 2?

General question: is the above N-gram + payload approach a common
practice?

Thanks in advance,
Manuel
