lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Query Expansion for Synonyms
Date Thu, 28 Apr 2016 16:30:47 GMT
Hi Daniel,

Since you are restricting the top-level query to inOrder=true and proximity=0, there is no
problem in your particular example.

If you weren't restricting it that way, injecting synonyms with a plain OR can sometimes cause
'query drift': the injection/addition of a single term changes the result list drastically.

When there is a big difference in term statistics (document frequency, collection frequency, etc.)
between the injected term and the original term, the results can be unexpected.

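In that case, the BlendedTermQuery or SynonymQuery implementations could be used instead of a
plain OR, since both are designed to blend index statistics across the alternative terms.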
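
To make the statistics problem concrete, here is a minimal, Lucene-free sketch. It uses the
BM25-style idf formula, and the document counts are made-up numbers chosen only for illustration.
The "shared" variant assumes the synonyms are scored with the maximum document frequency of the
group, which is my understanding of roughly what SynonymQuery does; treat that as an assumption,
not as a description of the actual implementation.

```java
public class SynonymIdfSketch {

    // BM25-style inverse document frequency:
    // idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    public static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Assumption for this sketch: all synonyms share one statistic,
    // namely the largest document frequency in the group.
    public static double sharedIdf(long docCount, long... docFreqs) {
        long maxDf = 0;
        for (long df : docFreqs) {
            maxDf = Math.max(maxDf, df);
        }
        return idf(maxDf, docCount);
    }

    public static void main(String[] args) {
        long docCount = 1_000_000;  // hypothetical index size
        long dfOriginal = 200_000;  // hypothetical: original term is common
        long dfSynonym = 50;        // hypothetical: injected synonym is rare

        // With a plain OR, each term is weighted by its own idf, so the rare
        // injected synonym can outweigh the term the user actually typed.
        System.out.printf("idf(original) = %.3f%n", idf(dfOriginal, docCount));
        System.out.printf("idf(synonym)  = %.3f%n", idf(dfSynonym, docCount));

        // With shared statistics, both terms get the same weight.
        System.out.printf("idf(shared)   = %.3f%n",
                sharedIdf(docCount, dfOriginal, dfSynonym));
    }
}
```

With a plain OR, documents that contain only the rare synonym can jump to the top of the result
list; with a shared statistic, the synonym and the original term contribute equally.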

Ahmet

On Thursday, April 28, 2016 6:26 PM, Daniel Bigham <danielb@wolfram.com> wrote:
I'm investigating various ways of supporting synonyms in Lucene.

One such approach that looks potentially interesting is to do a kind of 
"query expansion".

For example, if the user searches for "us 1888", one might expand the 
query as follows:

     SpanNearQuery query =
     new SpanNearQuery(
         new SpanQuery[]
         {
             new SpanOrQuery(
                 new SpanTermQuery(new Term("Plaintext", "us")),
                 new SpanNearQuery(
                     new SpanQuery[]
                     {
                         new SpanTermQuery(new Term("Plaintext", "united")),
                         new SpanTermQuery(new Term("Plaintext", "states"))
                     },
                     0,
                     true
                 )
             ),
             new SpanTermQuery(new Term("Plaintext", "1888"))
         },
         0,
         true
     );

A couple of questions:

- Is this approach in use within the community?
- Are there "gotchas" with this approach that make it undesirable?

I've done a few quick tests of query performance on a test index and 
found that a query can indeed take 10x longer if enough synonyms are 
used, but if the baseline search time is around 1 ms, then 10 ms is 
still plenty fast enough. (That said, my test was on a 70 MB index, so 
my 10 ms might turn into something nasty with a 7 GB index.)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
