lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "T. H. Lin" <easy....@gmail.com>
Subject Search on tag / category / label / keyword ...
Date Mon, 27 Oct 2008 09:21:14 GMT
I would like to search a collection of "keyword"s with lucene.

A Document has one or many keywords. The keywords appear only once in a
document. (tf = 1)
for example:
Document_1 : ( "aa"  "bb"  "cc"         )
Document_2 : (         "bb"  "cc"         )
Document_3 : (                 "cc"  "dd" )
Document_4 : ( "aa"          "cc"  "dd" )

I have a query from more terms with different boost. The coord(int overlap,
int maxOverlap) is turn off. i.e. always return 1.0.
query = "aa^0.1 bb^0.9 xx^0.1 yy^0.1 zz^0.1"
the query may contain many terms which do not appear in a Document. i.e.
"xx" "yy" and "zz" here.

Amd I got
3 hits
Document_2 : (         "bb"  "cc"         ) : score : 0.75391763
Document_1 : ( "aa"  "bb"  "cc"         ) : score : 0.67014897
Document_4 : ( "aa"          "cc"  "dd" ) : score : 0.0670149

[Question] is...why Document_2 better than Document_1 !?
Document_1 does match two terms; "aa" and "bb".
I want to emphasize the "match" and less care the "mismatch".
How should I modify Similarity to achieve that? (Document_1 should get
higher score!)

Is there any suggestion or example to implement such "keyword collection"
searching?

For the query above,
I actually use BooleanQuery with TermQuery. What else should I take care of?

/* ************************************************** */
BooleanQuery q = new BooleanQuery(true);  // disable  coord
TermQuery tq;
{
   tq = new TermQuery(new Term(field, "aa"));
   tq.setBoost(.1f);
   q.add(tq, BooleanClause.Occur.SHOULD);
}
{
   tq = new TermQuery(new Term(field, "bb"));
   tq.setBoost(.9f);
   q.add(tq, BooleanClause.Occur.SHOULD);
}
{
    tq = new TermQuery(new Term(field, "xx"));
    tq.setBoost(.1f);
    q.add(tq, BooleanClause.Occur.SHOULD);
}
....
Hits hits = isearcher.search(q);
/* ************************************************** */

Thanks

Lin

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message