lucene-dev mailing list archives

From "Preetam Rao (JIRA)" <>
Subject [jira] Updated: (LUCENE-1853) PhraseQuery Scorer for scoring sub phrase matches
Date Tue, 25 Aug 2009 10:26:59 GMT


Preetam Rao updated LUCENE-1853:

    Attachment: LUCENE-1853.patch

Attached a patch with test cases. The patch assumes that the position increment and offset
always advance by 1; it may not work with other increments.

> PhraseQuery Scorer for scoring sub phrase matches
> -------------------------------------------------
>                 Key: LUCENE-1853
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>         Environment: Lucene/Java
>            Reporter: Preetam Rao
>            Priority: Minor
>         Attachments: LUCENE-1853.patch
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> For a query like "homes in new york with swimming pool", a document whose field matches
only "new york" should still be scored, and it should be scored higher than two separate matches
on "new" and "york". Also, a 3-word sub-phrase match must be scored considerably higher than
a 2-word sub-phrase match (the boost factor should be configurable).
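
The scoring behaviour described above can be illustrated with a minimal plain-Java sketch. It is not the code in the attached patch and does not use any Lucene API. The class name, the method name, and the rule that a run of n consecutive query terms contributes boost^(n-1) are assumptions made only for illustration. The sketch also greedily takes the longest run starting at each document position, which a real scorer might not do.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not the LUCENE-1853 patch and not a Lucene API.
class SubPhraseScoreSketch {

    // Scans the document left to right. At each position it finds the longest
    // run of consecutive query terms that also appears at consecutive document
    // positions, and adds boost^(runLength - 1) to the score (assumed rule).
    static double score(List<String> query, List<String> doc, double boost) {
        double total = 0.0;
        int d = 0;
        while (d < doc.size()) {
            int best = 0;
            for (int q = 0; q < query.size(); q++) {
                int len = 0;
                while (q + len < query.size() && d + len < doc.size()
                        && query.get(q + len).equals(doc.get(d + len))) {
                    len++;
                }
                best = Math.max(best, len);
            }
            if (best == 0) {
                d++;                                   // no query term starts here
            } else {
                total += Math.pow(boost, best - 1);    // longer runs count far more
                d += best;                             // skip past the matched run
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("homes in new york with swimming pool".split(" "));
        System.out.println(score(query, Arrays.asList("new", "york", "apartments"), 10.0));
        System.out.println(score(query, Arrays.asList("new", "furniture", "from", "york"), 10.0));
        System.out.println(score(query, Arrays.asList("in", "new", "york"), 10.0));
    }
}
```

With a boost greater than 1, a document containing "new york" as a phrase outranks one containing "new" and "york" separately, and a 3-word run outranks a 2-word run by a further factor of the boost.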
> If a user query is taken as is without parsing and is searched against multiple fields,
where each sub-phrase can match against a different field, this kind of query is useful. 
> Using shingles for this use case means that each field of each document needs to be indexed
as shingles of all (1..N)-grams, and the query needs to be shingled the same way. (Please correct me if I am wrong.)
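
To make the cost of the shingle approach concrete, here is a small plain-Java sketch that enumerates all (1..N)-grams of a token list. It is an illustration under assumptions: the class and method names are hypothetical, and it does not use Lucene's analysis classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: plain Java, not Lucene's shingle analysis code.
class ShingleSketch {

    // Returns every n-gram of the tokens for n = 1..maxSize, shortest first.
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxSize; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("homes in new york with swimming pool".split(" "));
        // The number of shingles grows quickly with the maximum shingle size.
        System.out.println(shingles(tokens, 2).size());
        System.out.println(shingles(tokens, tokens.size()).size());
    }
}
```

A field of T tokens yields T - n + 1 shingles of size n, so indexing every size from 1 to N multiplies the number of indexed terms per field.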
> The scorer could also support:
> - ignoring idf and/or field norms (so that factors outside the document don't influence
> - considering only the longest match (for example, a match on "new york" is scored and considered
rather than "new" furniture and "york" city)
> - ignoring duplicates ("new york" appearing twice or thrice does not make any difference)
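
The options listed above can be sketched as flags on a toy scorer. This is a hedged illustration in plain Java, not the patch's implementation. The flag names `longestOnly` and `ignoreDuplicates`, the class name, and the boost^(n-1) rule are all assumptions. The sketch uses no idf or field norms, so only the matched text affects the score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only: not the LUCENE-1853 patch and not a Lucene API.
class SubPhraseOptionsSketch {

    // Collects the matched sub-phrases, greedily taking the longest run of
    // consecutive query terms that starts at each document position.
    static List<String> matchedRuns(List<String> query, List<String> doc) {
        List<String> runs = new ArrayList<>();
        int d = 0;
        while (d < doc.size()) {
            int best = 0;
            for (int q = 0; q < query.size(); q++) {
                int len = 0;
                while (q + len < query.size() && d + len < doc.size()
                        && query.get(q + len).equals(doc.get(d + len))) {
                    len++;
                }
                best = Math.max(best, len);
            }
            if (best == 0) {
                d++;
                continue;
            }
            runs.add(String.join(" ", doc.subList(d, d + best)));
            d += best;
        }
        return runs;
    }

    // longestOnly: return only the contribution of the longest matched run.
    // ignoreDuplicates: count each distinct matched sub-phrase once.
    static double score(List<String> query, List<String> doc, double boost,
                        boolean longestOnly, boolean ignoreDuplicates) {
        Collection<String> runs = matchedRuns(query, doc);
        if (ignoreDuplicates) {
            runs = new LinkedHashSet<>(runs);
        }
        double total = 0.0;
        double best = 0.0;
        for (String run : runs) {
            int len = run.split(" ").length;
            double s = Math.pow(boost, len - 1);   // assumed scoring rule
            total += s;
            best = Math.max(best, s);
        }
        return longestOnly ? best : total;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("homes in new york with swimming pool".split(" "));
        List<String> doc = Arrays.asList("new york hotels near new york airport with pool".split(" "));
        System.out.println(score(query, doc, 10.0, false, false));
        System.out.println(score(query, doc, 10.0, false, true));
        System.out.println(score(query, doc, 10.0, true, false));
    }
}
```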
> This kind of query (PhraseQuery with SubPhraseScorer) could be combined with a DisMax
query. For example, something like Solr's dismax request handler could be made to use this query,
where we run a user query as is against all fields and configure each field with above
> I have also attached a patch with comments and test cases, in case my description is
not clear enough. I would appreciate alternatives or feedback. The goal is to give more control
via configuration when searching with user-entered queries against multiple fields where
sub-phrases have special significance.

