From: Apache Wiki
To: java-commits@lucene.apache.org
Date: Wed, 06 Feb 2008 20:43:11 -0000
Subject: [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM Haifa Team" by PaulElschot

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.
The following page has been changed by PaulElschot:
http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team

The comment on the change is:
Specializing SpanNearQuery

------------------------------------------------------------------------------
  The [http://ciir.cs.umass.edu/research/million/ Million Queries Track] ran for the first time in 2007.

- Quoting from the track home page:
+ Quoting from the track home page:
-  * "The goal of this track is to run a retrieval task similar to standard ad-hoc retrieval,
+  * "The goal of this track is to run a retrieval task similar to standard ad-hoc retrieval,
-    but to evaluate large numbers of queries incompletely, rather than a small number more completely.
+    but to evaluate large numbers of queries incompletely, rather than a small number more completely.
-    Participants will run 10,000 queries and a random 1,000 or so will be evaluated. The corpus is
+    Participants will run 10,000 queries and a random 1,000 or so will be evaluated. The corpus is
-    the terabyte track's GOV2 corpus of roughly 25,000,000 .gov web pages, amounting to just
+    the terabyte track's GOV2 corpus of roughly 25,000,000 .gov web pages, amounting to just
     under half a terabyte of data."

- We participated in this track with two search engines - Lucene and our home-brewed search engine
+ We participated in this track with two search engines - Lucene and our home-brewed search engine
  [http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf Juru].

- The official reports and papers of the track should be available sometime in February 2008,
+ The official reports and papers of the track should be available sometime in February 2008,
  but here is a summary of the results and our experience with our first ever Lucene submission to TREC.
- In summary, the out-of-the-box search quality was not so great, but by altering how
+ In summary, the out-of-the-box search quality was not so great, but by altering how
- we use Lucene (that is, our application) and with some modifications to Lucene, we were
+ we use Lucene (that is, our application) and with some modifications to Lucene, we were
- able to improve the search quality results and to score well in this competition.
+ able to improve the search quality results and to score well in this competition.
- The lessons we learned can be of interest to applications using Lucene, to Lucene
+ The lessons we learned can be of interest to applications using Lucene, to Lucene
  itself, and to researchers submitting to other TREC tracks (or elsewhere).

  = Training =
- As preparation for the track runs we "trained" Lucene on queries from previous years'
+ As preparation for the track runs we "trained" Lucene on queries from previous years'
- tracks - more exactly on the 150 short TREC queries for which there are existing
+ tracks - more exactly on the 150 short TREC queries for which there are existing
  judgments from previous years, for the same GOV2 data.

- We built an index - actually 27 indexes - for this data. For indexing we used the
+ We built an index - actually 27 indexes - for this data. For indexing we used the
  Trec-Doc-Maker that is now in Lucene's contrib benchmark (or a slight modification of it).

- We found that the best results are obtained when all data is in a single field, and so we did, keeping only stems
+ We found that the best results are obtained when all data is in a single field, and so we did, keeping only stems
  (English, Porter, from Lucene contrib). We used the Standard-Analyzer, with a modified stoplist that took domain-specific stopwords into account.
- Running with both Juru and Lucene, and having obtained good results with Juru in previous years, we
+ Running with both Juru and Lucene, and having obtained good results with Juru in previous years, we
- had something to compare to. For this, we made sure to parse the HTML documents in the same way in
+ had something to compare to. For this, we made sure to parse the HTML documents in the same way in
  both systems (we used Juru's HTML parser for this) and to use the same stoplist, etc.

- In addition, anchor text was collected in a pre-indexing global analysis pass, and so anchors
+ In addition, anchor text was collected in a pre-indexing global analysis pass, and so anchors
- of (pointing to) pages were indexed with the page they point to, up to a limited size. The
+ of (pointing to) pages were indexed with the page they point to, up to a limited size. The
  number of in-links to each page was saved in a stored field and we used it as a static score element (boosting documents that had more in-links). The way that anchor text was extracted and prepared for indexing will be described in the full report.

@@ -53, +53 @@
  || 2. Lucene out-of-the-box || 0.154 || 0.313 || 0.303 || 0.289 ||

  We made the following changes:
-  1. Add a proximity scoring element, based on our experience with "Lexical affinities" in Juru.
+  1. Add a proximity scoring element, based on our experience with "Lexical affinities" in Juru.
     Juru creates posting lists for lexical affinities. In Lucene we augmented the query with Span-Near-Queries.
   1. Phrase expansion - the query text was added to the query as a phrase.
-  1. Replace the default similarity with Sweet-Spot-Similarity for a better
+  1. Replace the default similarity with Sweet-Spot-Similarity for a better
-     choice of document length normalization. Juru is using
+     choice of document length normalization.
Juru is using
     [http://citeseer.ist.psu.edu/singhal96pivoted.html pivoted length normalization] and we experimented with it, but found out that the simpler and faster Sweet-Spot-Similarity performs better.
-  1. Normalized term-frequency, as in Juru.
+  1. Normalized term-frequency, as in Juru.
     Here, tf(freq) is normalized by the average term frequency of the document.

  So these are the updated results:

@@ -71, +71 @@
  || 1. Juru || 0.313 || 0.592 || 0.560 || 0.529 ||
  || 2. Lucene out-of-the-box || 0.154 || 0.313 || 0.303 || 0.289 ||
  || 3. Lucene + LA + Phrase + Sweet Spot + tf-norm || 0.306 || 0.627 || 0.589 || 0.543 ||
-
+
  The improvement is dramatic.

- Perhaps even more important, once the track results were published, we found out that these
+ Perhaps even more important, once the track results were published, we found out that these
  improvements are consistent and steady, and so Lucene with these changes was ranked high
- also by the two new measures introduced in this track - NEU-Map and E-Map (Epsilon-Map).
+ also by the two new measures introduced in this track - NEU-Map and E-Map (Epsilon-Map).
  With these new measures more queries are evaluated but fewer documents
- are judged for each query. The algorithms for document selection for judging (during the
+ are judged for each query. The algorithms for document selection for judging (during the
- evaluation stage of the track) were not our focus in this work - as there were actually two
+ evaluation stage of the track) were not our focus in this work - as there were actually two
- goals to this TREC:
+ goals to this TREC:
-  * the systems evaluation (our main goal) and
+  * the systems evaluation (our main goal) and
   * the evaluation itself.
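
The normalized term frequency in the list of changes above ("tf(freq) is normalized by the average term frequency of the document") can be illustrated with a little arithmetic. The page does not give the exact formula that was used, so the plain division below is an assumption, and the function names are hypothetical. The sketch is in Python for brevity and does not use Lucene's API. It takes the document length and the number of unique terms as inputs, which correspond to the two dedicated fields listed under Implementation Details.

```python
def average_term_frequency(doc_length: int, unique_terms: int) -> float:
    """Average number of occurrences per distinct term in one document."""
    if unique_terms <= 0:
        raise ValueError("a document needs at least one unique term")
    return doc_length / unique_terms


def normalized_tf(freq: int, doc_length: int, unique_terms: int) -> float:
    """Raw term frequency divided by the document's average term frequency.

    Assumption: plain division; the page leaves the exact formula as a todo.
    """
    return freq / average_term_frequency(doc_length, unique_terms)


if __name__ == "__main__":
    # A 1,000-term document with 250 distinct terms has an average tf of 4.0,
    # so a term occurring 8 times gets a normalized tf of 2.0.
    print(normalized_tf(8, 1000, 250))
```

Under this assumption a term that occurs exactly as often as the document's average gets a normalized tf of 1.0, regardless of how long or repetitive the document is.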
- The fact that modified Lucene scored well on both the traditional 150 queries and
+ The fact that modified Lucene scored well on both the traditional 150 queries and
  the new 1700 evaluated queries with the new measures was reassuring for the "usefulness"
- or perhaps "validity" of these modifications to Lucene.
+ or perhaps "validity" of these modifications to Lucene.

- Certainly these changes are not a 100% fit for every application and every data set,
+ Certainly these changes are not a 100% fit for every application and every data set,
  but these results are strong, and so I believe they can be valuable for many applications, and certainly for research aspects.

  = Search time penalty =
  These improvements did not come for free.
- Adding a phrase to the query and adding Span-Near-Queries for every pair of query words
+ Adding a phrase to the query and adding Span-Near-Queries for every pair of query words
- costs query time.
+ costs query time.

- The search time of stock Lucene in our setup was 1.4 seconds/query.
+ The search time of stock Lucene in our setup was 1.4 seconds/query.
- The modified search time was 8.0 seconds/query.
+ The modified search time was 8.0 seconds/query.
  This is a large slowdown!

  But it should be noted that in this work we did not focus on search time,
- only on quality. Now is the time to see how the search time penalty can
+ only on quality. Now is the time to see how the search time penalty can
  be reduced while keeping most of the search quality improvements.

  = Implementation Details =
@@ -115, +115 @@
   * Omit norms for the (single) text field.
   * Index document length (number of terms) in a dedicated field.
   * Index document unique length (number of unique terms) in another dedicated field.
-  * Index number of in-links (see anchors above) in a third dedicated field.
+  * Index number of in-links (see anchors above) in a third dedicated field.
   * Create an !IndexReader that implements norms() by reading (caching) the document length field and either:
-    * Imitate stock Lucene by normalizing the length with !DefaultSimilarity and then compressing it to a single byte.
+    * Imitate stock Lucene by normalizing the length with !DefaultSimilarity and then compressing it to a single byte.
     * Similarly imitate !SweetSpot similarity.
-  * For pivoted length normalization tests (which are not listed in the results below because they were outperformed by the simpler sweet-spot similarity) we used a regular index reader (so norms were 1) and used a !CustomScoreQuery (search.function) with a !ValueSourceQuery part that did the normalization - it read the unique length field and used its average to compute the pivoted norm.
+  * For pivoted length normalization tests (which are not listed in the results below because they were outperformed by the simpler sweet-spot similarity) we used a regular index reader (so norms were 1) and used a !CustomScoreQuery (search.function) with a !ValueSourceQuery part that did the normalization - it read the unique length field and used its average to compute the pivoted norm.
   * In-links static rank scoring was also implemented with !CustomScoreQuery - with a !FieldScoreQuery that read the cached in-links count.
   * At some point I tried to accelerate the search and improve its quality by creating a new query - an OR with Phrase with Proximity query. This query should have been faster (I hoped) because it reads each posting just once, as opposed to creating a !SpanNear query for each pair of query words, in addition to a phrase query and an OR query. But the results were very disappointing: search time did not improve, and quality was hurt. Might just be a bug... But I learned to appreciate the modularity of Lucene queries even more.

  Here is an example Lucene Query.
- For the input query (topic):
+ For the input query (topic):
  {{{ U.S.
oil industry history }}}
@@ -150, +150 @@
   * Phrases allow fuzziness when words were stopped.

  /!\ (1) todo: refer to payloads (2) todo: describe tf normalization implementation.
-
+
  = More Detailed Results =
  Following are more detailed results, also listing MRR, and listing various options.

@@ -167, +167 @@
  || 4.B LA + Phrase + tf normalization || 0.194 || 0.572 || 0.404 || 0.373 || 0.370 || 7.792 ||
  || 4.C LA + Phrase + Sweet Spot length norm + tf normalization || 0.306 || 0.771 || 0.627 || 0.589 || 0.543 || 7.984 ||

- There are some peculiarities, for instance the fact that tf-normalization alone (without sweet-spot length normalization) hurts MAP, both on top of stock Lucene (run 1.B vs. 1) and on top of "LA + Phrase" (run 4.B vs. 4), but other than that the results seem quite consistent. For instance, Sweet-Spot-Similarity improves quality with almost no runtime cost - this shows in both 1.A vs. 1 and 4.A vs. 4.
+ There are some peculiarities, for instance the fact that tf-normalization alone (without sweet-spot length normalization) hurts MAP, both on top of stock Lucene (run 1.B vs. 1) and on top of "LA + Phrase" (run 4.B vs. 4), but other than that the results seem quite consistent. For instance, Sweet-Spot-Similarity improves quality with almost no runtime cost - this shows in both 1.A vs. 1 and 4.A vs. 4.

  = Possible Changes in Lucene =
   * Move Sweet-Spot-Similarity to core
-  * Make Sweet-Spot-Similarity the default similarity? If so with which parameters?
+  * Make Sweet-Spot-Similarity the default similarity? If so with which parameters?
     In this run we used steepness = 0.5, min = 1000, and max = 15,000.
   * Easier and more efficient ways to add proximity scoring?
+    For example specialize Span-Near-Query for the case when all subqueries are terms.
   * Allow easier implementation/extension of tf-normalization.

  /!\ To be completed & refined...
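
To make the Sweet-Spot-Similarity parameters above (steepness = 0.5, min = 1000, max = 15,000) more concrete, here is a small sketch of a plateau-shaped length norm. The formula is written from memory of the contrib class's length normalization and should be treated as an assumption, not as the exact code used in these runs. The function name is hypothetical, and the sketch is in Python rather than Lucene's Java API.

```python
import math


def sweet_spot_length_norm(num_terms: int,
                           min_len: int = 1000,
                           max_len: int = 15000,
                           steepness: float = 0.5) -> float:
    """Plateau-shaped length norm (assumed formula, see the note above).

    Documents whose length lies in [min_len, max_len] get a norm of 1.0;
    the norm decays on both sides at a rate controlled by steepness.
    """
    # Zero inside the plateau; twice the distance to the nearer edge outside it.
    distance = (abs(num_terms - min_len) + abs(num_terms - max_len)
                - (max_len - min_len))
    return 1.0 / math.sqrt(steepness * distance + 1.0)


if __name__ == "__main__":
    for length in (500, 1000, 8000, 15000, 20000):
        print(length, sweet_spot_length_norm(length))
```

With these parameters, any document between 1,000 and 15,000 terms is neither rewarded nor penalized for its length, which is the main difference from a norm that always favors shorter documents.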