lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM Haifa Team" by DoronCohen
Date Wed, 06 Feb 2008 19:45:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by DoronCohen:
http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team

The comment on the change is:
Add info on results and implementation. Yet incomplete.

------------------------------------------------------------------------------
  = Implementation Details =
  
 * The contrib benchmark quality package was used for the search quality measures and submissions.
+  * In order to experiment with various length normalizations, the most straightforward way
would have been to create a separate index for each option. But that was unreasonable given
the amount of data and the flexibility required to try new things. So we indexed and searched
like this:
+    * Omit norms for the (single) text field.
+    * Index the document length (number of terms) in a dedicated field.
+    * Index the document's unique length (number of unique terms) in another dedicated field.
+    * Index the number of in-links (see anchors above) in a third dedicated field. 
+    * Create an !IndexReader that implements norms() by reading (and caching) the document
length field and either:
+      * Imitate stock Lucene by normalizing the length with !DefaultSimilarity and then
compressing it to a single byte. 
+      * Similarly imitate !SweetSpotSimilarity.
+    * For pivoted length normalization tests (not listed in the results below because they
were outperformed by the simpler sweet-spot similarity) we used a regular index reader (so
all norms were 1) and a !CustomScoreQuery (search.function) with a !ValueSourceQuery part
that did the normalization: it read the unique length field and used its average to compute
the pivoted norm. 
+    * In-links static rank scoring was also implemented with !CustomScoreQuery - with a !FieldScoreQuery
that read the cached in-links count.
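The length-norm variants described above can be sketched in plain Java, independent of the Lucene API. The formulas below are assumptions based on the standard definitions - !DefaultSimilarity's 1/sqrt(numTerms), the contrib sweet-spot plateau function with the parameters reported later on this page (steepness = 0.5, min = 1000, max = 15000), and the usual pivoted form (1 - slope) + slope * (length / averageLength) - not the team's exact code:

```java
// Sketch of the three length-normalization variants discussed above.
// Plain Java, no Lucene dependency; formulas are illustrative, not the
// team's exact implementation.
public class LengthNorms {

    // Stock Lucene (DefaultSimilarity): 1 / sqrt(numTerms).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Contrib sweet-spot similarity: the norm is 1.0 for lengths inside
    // the [min, max] plateau and decays outside it, controlled by
    // steepness. The parameter values reported for these runs were
    // steepness = 0.5, min = 1000, max = 15000.
    static float sweetSpotLengthNorm(int l, float steepness, int min, int max) {
        return (float) (1.0 / Math.sqrt(
            steepness * (Math.abs(l - min) + Math.abs(l - max) - (max - min)) + 1.0));
    }

    // Pivoted length normalization: documents longer than average are
    // penalized, shorter ones boosted. Here "length" is the number of
    // unique terms read from the dedicated field, and avgLen is its
    // collection-wide average.
    static float pivotedNorm(int uniqueTerms, double avgLen, double slope) {
        return (float) (1.0 / ((1.0 - slope) + slope * (uniqueTerms / avgLen)));
    }

    public static void main(String[] args) {
        System.out.println(defaultLengthNorm(4));                         // 0.5
        System.out.println(sweetSpotLengthNorm(5000, 0.5f, 1000, 15000)); // 1.0 (inside the plateau)
        System.out.println(pivotedNorm(200, 200.0, 0.2));                 // 1.0 (exactly at the pivot)
    }
}
```

Note how the sweet-spot function reduces to 1/sqrt of a linear penalty outside the plateau, which is why it can be compressed into the same single norm byte as the stock normalization.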
  
- /!\ To be completed...
+   * At some point I tried to accelerate the search and improve its quality by creating a
new query - an OR with Phrase with Proximity query. This query should have been faster (I
hoped) because it reads each posting just once, in contrast to creating a !SpanNear query
for each pair of query words in addition to a phrase query and an OR query. But the results
were very disappointing: search time was not improved, and quality was hurt. It might just
be a bug... But I learned to appreciate even more the modularity of Lucene queries.
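The hoped-for speedup can be illustrated with a plain-Java sketch (the helper and its names are hypothetical, not the actual query class): given the sorted position lists of two terms within one document, a single linear merge finds proximity matches in one pass, instead of a separate !SpanNear traversal per pair of terms.

```java
// Illustration of the "read each posting just once" idea: count, in one
// linear pass over two sorted position lists, the positions of term A
// that have a position of term B within `slop` positions.
// Hypothetical helper, not the actual query implementation.
public class ProximityMerge {

    static int countWithinSlop(int[] aPositions, int[] bPositions, int slop) {
        int count = 0, j = 0;
        for (int a : aPositions) {
            // advance j past all B positions too far to the left of a;
            // since aPositions is sorted ascending, j never moves back
            while (j < bPositions.length && bPositions[j] < a - slop) j++;
            if (j < bPositions.length && Math.abs(bPositions[j] - a) <= slop) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] oil = {3, 40};
        int[] industri = {5, 100};
        // only the pair (3, 5) is within 8 positions
        System.out.println(countWithinSlop(oil, industri, 8)); // 1
    }
}
```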
  
+ Here is an example Lucene query.
+ For the input query (topic): 
+ {{{
+      U.S. oil industry history
+ }}}
+ the following Lucene query was created:
+ {{{
+   oil industri histori
+   (
+     spanNear([oil, industri], 8, false)
+     spanNear([oil, histori], 8, false)
+     spanNear([industri, histori], 8, false)
+   )^4.0
+   "oil industri histori"~1^0.75
+ }}}
+ 
+ This demonstrates that:
+   * U.S. is considered a stop word and was removed from the query text.
+   * Only stemmed forms of words are used.
+   * Default query operator is OR.
+   * Words found in a document up to 7 positions apart form a lexical affinity (8 in this
example because of the stopped word).
+   * Lexical affinity matches are boosted 4 times as much as single-word matches.
+   * Phrase matches are weighted slightly lower than single-word matches.
+   * Phrases allow some fuzziness when words were stopped.
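As an illustration of how such a query is assembled, here is a plain-Java sketch (a hypothetical helper, not the code used in the runs) that generates the pairwise spanNear clauses shown above from a list of already-stemmed terms, mimicking the toString() syntax of Lucene's span queries:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: generate the pairwise lexical-affinity clauses of the example
// query above from a list of already-stemmed terms. This only mimics
// the printed query syntax; the real runs built actual SpanNearQuery
// objects, one per pair of query words.
public class LexicalAffinityClauses {

    static List<String> pairClauses(List<String> terms, int slop) {
        List<String> clauses = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                // unordered (false) proximity match within `slop` positions
                clauses.add("spanNear([" + terms.get(i) + ", " + terms.get(j)
                        + "], " + slop + ", false)");
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // the stemmed terms of the example topic, after stopping "U.S."
        List<String> terms = List.of("oil", "industri", "histori");
        pairClauses(terms, 8).forEach(System.out::println);
    }
}
```

For n query terms this produces n*(n-1)/2 clauses, which is why the cost of the per-pair !SpanNear approach grows quickly with query length.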
+ 
+  
  = More Detailed Results =
  
- /!\ To be added...
+ Below are more detailed results, also listing MRR and covering the various run options.
+ 
+  ||<rowbgcolor="#80FF80">'''Run'''                               ||'''MAP'''||'''MRR'''||'''P@5'''||'''P@10'''||'''P@20'''||'''Time'''||
+  || 1.   Lucene out-of-the-box                                   || 0.154   || 0.424   || 0.313   || 0.303    || 0.289    || 1.349    ||
+  || 2.   LA only                                                 || 0.208   || 0.550   || 0.409   || 0.382    || 0.368    || 5.573    ||
+  || 3.   Phrase only                                             || 0.191   || 0.507   || 0.358   || 0.347    || 0.341    || 4.136    ||
+  || 4.   LA + Phrase                                             || 0.214   || 0.567   || 0.409   || 0.390    || 0.383    || 6.706    ||
+  || 1.A  Sweet Spot length norm                                  || 0.162   || 0.553   || 0.438   || 0.400    || 0.383    || 1.372    ||
+  || 1.B  tf normalization                                        || 0.116   || 0.436   || 0.298   || 0.294    || 0.286    || 1.527    ||
+  || 1.C  Sweet Spot length norm + tf normalization               || 0.269   || 0.705   || 0.562   || 0.538    || 0.495    || 1.555    ||
+  || 4.A  LA + Phrase + Sweet Spot length norm                    || 0.273   || 0.737   || 0.593   || 0.565    || 0.527    || 6.871    ||
+  || 4.B  LA + Phrase + tf normalization                          || 0.194   || 0.572   || 0.404   || 0.373    || 0.370    || 7.792    ||
+  || 4.C  LA + Phrase + Sweet Spot length norm + tf normalization || 0.306   || 0.771   || 0.627   || 0.589    || 0.543    || 7.984    ||
+ 
+ There are some peculiarities - for instance, tf normalization alone (without sweet-spot
length normalization) hurts MAP, both on top of stock Lucene (run 1.B vs. 1) and on top of
"LA + Phrase" (run 4.B vs. 4) - but otherwise the results seem quite consistent. For instance,
sweet-spot similarity improves quality at almost no runtime cost, as both 1.A vs. 1 and
4.A vs. 4 show.  
  
  = Possible Changes in Lucene =
  
   * Move Sweet-Spot-Similarity to core
-  * Make Sweer-Spot-Similarity the default similarity?
+  * Make Sweet-Spot-Similarity the default similarity? If so, with which parameters? 
+    In this run we used steepness = 0.5, min = 1000, and max = 15,000.
   * Easier and more efficient ways to add proximity scoring?
-  * Allow easier implementation/extension of tf-normalization
+  * Allow easier implementation/extension of tf-normalization.
  
  /!\ To be completed & refined...
  
