lucene-java-user mailing list archives

From: José Ramón Pérez Agüera <jose.agu...@gmail.com>
Subject: Re: Average Precision - TREC-3
Date: Wed, 27 Jan 2010 18:41:49 GMT
Hi Ivan,

you might want to try the Lucene BM25 implementation. Results should
improve by changing the ranking function. Another option is the language
model implementation for Lucene:

http://nlp.uned.es/~jperezi/Lucene-BM25/
http://ilps.science.uva.nl/resources/lm-lucene

The main limitation of these implementations is that they don't support
every kind of Lucene query, but if you don't need that, either of these
alternatives is a good choice.
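
As a rough illustration (not the API of either package linked above), swapping
the ranking function in a 2.x/3.x-era Lucene goes through the Similarity
extension point. The class below is a hypothetical placeholder with only a
length-norm tweak; classic Similarity cannot express full BM25, which is
exactly why such external implementations exist:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class CustomRankingExample {

  // Hypothetical placeholder, not taken from the packages linked above:
  // a hand-rolled Similarity tweak that only softens length normalization.
  static class MilderLengthNormSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(String fieldName, int numTerms) {
      // milder than the default 1/sqrt(numTerms)
      return (float) (1.0 / Math.sqrt(1.0 + Math.log(1.0 + numTerms)));
    }
  }

  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher(
        IndexReader.open(FSDirectory.open(new File("trec-index"))));
    // The same Similarity must also be set on the IndexWriter at index time,
    // since field norms are computed and stored during indexing.
    searcher.setSimilarity(new MilderLengthNormSimilarity());
    // ... run the TREC topics against this searcher ...
  }
}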

best jose

On Wed, Jan 27, 2010 at 1:36 PM, Ivan Provalov <iprovalo@yahoo.com> wrote:
> Robert, Grant:
>
> Thank you for your replies.
>
> Our goal is to fine-tune our existing system to perform better on relevance.
>
> I agree with Robert's comment that these collections are not completely compatible.
> Yes, it is possible that the results will vary somewhat depending on differences
> between the collections. The reason we picked the TREC-3 TIPSTER collection is that
> our production content overlaps with some TIPSTER documents.
>
> Any suggestions on how to obtain TREC-3-comparable results with Lucene, or on
> selecting a better approach, would be appreciated.
>
> We are doing this project in three stages:
>
> 1. Test Lucene's "vanilla" performance to establish the baseline.  We want to iron out
> issues such as topic or document formats.  For example, we had to add a different parser
> and clean up the topic titles.  This will give us confidence that we are using the data
> and the methodology correctly.
>
> 2. Fine-tune Lucene based on the latest research findings (TREC by E. Voorhees,
> conference proceedings, etc.).
>
> 3. Repeat these steps with our production system, which runs on Lucene.  The reason we
> are doing this step last is to ensure that our overall system doesn't introduce relevance
> issues of its own (content pre-processing steps, query parsing steps, etc.).
>
> Thank you,
>
> Ivan Provalov
>
> --- On Wed, 1/27/10, Robert Muir <rcmuir@gmail.com> wrote:
>
>> From: Robert Muir <rcmuir@gmail.com>
>> Subject: Re: Average Precision - TREC-3
>> To: java-user@lucene.apache.org
>> Date: Wednesday, January 27, 2010, 11:16 AM
>> Hello, forgive my ignorance here (I have not worked with these English TREC
>> collections), but is the TREC-3 test collection the same as the test
>> collection used in the 2007 paper you referenced?
>>
>> It looks like that is a different collection; it's not really possible to
>> compare relevance scores across different collections.
>>
>> On Wed, Jan 27, 2010 at 11:06 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>
>> >
>> > On Jan 26, 2010, at 8:28 AM, Ivan Provalov wrote:
>> >
>> > > We are looking into making some improvements to relevance ranking of our
>> > > search platform based on Lucene.  We started by running the Ad Hoc TREC task
>> > > on the TREC-3 data using "out-of-the-box" Lucene.  The reason for running this
>> > > old TREC-3 data (TIPSTER Disk 1 and Disk 2; topics 151-200) was that its
>> > > content matches the content of our production system.
>> > >
>> > > We are currently getting an average precision of 0.14.  We found some format
>> > > issues with the TREC-3 data which were causing an even lower score.  For
>> > > example, the initial average precision number was 0.09.  We discovered that
>> > > the topics included the word "Topic:" in the <title> tag.  For example,
>> > > "<title> Topic:  Coping with overcrowded prisons".  By removing this term
>> > > from the queries, we bumped the average precision to 0.14.
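
A minimal sketch of this kind of title cleanup, assuming the topic titles have
already been extracted as plain strings (the class name and regex here are
illustrative, not part of the setup described above):

public class TopicTitleCleaner {
  // Strip the boilerplate "Topic:" prefix that TREC-3 topics carry inside the
  // <title> tag, so it is not fed into the query as if it were a real term.
  static String cleanTopicTitle(String rawTitle) {
    return rawTitle.replaceFirst("^\\s*Topic:\\s*", "").trim();
  }

  public static void main(String[] args) {
    // prints "Coping with overcrowded prisons"
    System.out.println(cleanTopicTitle(" Topic:  Coping with overcrowded prisons"));
  }
}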
>> >
>> > There's usually a lot of this involved in running TREC.  I've also seen a
>> > good deal of improvement from things like using phrase queries and the
>> > DisMax query parser in Solr (which uses DisjunctionMaxQuery in Lucene,
>> > amongst other things) and by playing around with length normalization.
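
For illustration only (the field names and tie-breaker value are placeholders,
not from this thread), combining per-field queries with DisjunctionMaxQuery
looks roughly like this:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DisMaxSketch {
  // Roughly what Solr's DisMax parser builds per query term: the document
  // score is the maximum of the per-field scores, plus a small tie-break
  // contribution from the non-maximum fields.
  public static Query titleOrBodyQuery(String term) {
    DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f); // tie-breaker
    dmq.add(new TermQuery(new Term("title", term)));  // field names are
    dmq.add(new TermQuery(new Term("TEXT", term)));   // placeholders
    return dmq;
  }

  public static void main(String[] args) {
    System.out.println(titleOrBodyQuery("prisons"));
  }
}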
>> >
>> >
>> > >
>> > > Our query is based on the title tag of the topic and the index field is
>> > > based on the <TEXT> tag of the document.
>> > >
>> > > QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");
>> > >
>> > > Is there an average precision number which "out-of-the-box" Lucene should
>> > > be close to?  For example, IBM's 2007 TREC paper mentions 0.154:
>> > > http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
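
For context, a sketch of how that SimpleQQParser line is typically wired into
the contrib/benchmark quality package; the file names, index path, and
doc-name field below are placeholders for whatever the local TREC setup uses:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.PrintWriter;

import org.apache.lucene.benchmark.quality.Judge;
import org.apache.lucene.benchmark.quality.QualityBenchmark;
import org.apache.lucene.benchmark.quality.QualityQuery;
import org.apache.lucene.benchmark.quality.QualityQueryParser;
import org.apache.lucene.benchmark.quality.QualityStats;
import org.apache.lucene.benchmark.quality.trec.TrecJudge;
import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class TrecQualityRun {
  public static void main(String[] args) throws Exception {
    PrintWriter logger = new PrintWriter(System.out, true);

    // Placeholder index path and doc-name field.
    IndexSearcher searcher = new IndexSearcher(
        IndexReader.open(FSDirectory.open(new File("trec-index"))));

    // Read TREC topics into quality queries and judgments from the qrels file.
    QualityQuery[] qqs = new TrecTopicsReader()
        .readQueries(new BufferedReader(new FileReader("topics-151-200.txt")));
    Judge judge = new TrecJudge(new BufferedReader(new FileReader("qrels.txt")));
    judge.validateData(qqs, logger);

    // Topic <title> becomes the query, run against the indexed TEXT field.
    QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");

    QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, "docno");
    qrun.setMaxResults(1000);
    QualityStats[] stats = qrun.execute(judge, null, logger);

    // Average the per-topic stats and print a summary, which includes
    // the average precision figure discussed above.
    QualityStats avg = QualityStats.average(stats);
    avg.log("SUMMARY", 2, logger, "  ");
  }
}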
>> >
>> > Hard to say.  I can't say I've run TREC 3.  You might ask over on the Open
>> > Relevance list too (http://lucene.apache.org/openrelevance).  I know
>> > Robert Muir's done a lot of experiments with Lucene on standard collections
>> > like TREC.
>> >
>> > I guess the bigger question back to you is what is your goal?  Is it to get
>> > better at TREC or to actually tune your system?
>> >
>> > -Grant
>> >
>> >
>> > --------------------------
>> > Grant Ingersoll
>> > http://www.lucidimagination.com/
>> >
>> > Search the Lucene ecosystem using Solr/Lucene:
>> > http://www.lucidimagination.com/search
>> >
>> >
>> >
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>



-- 
Jose R. Pérez-Agüera

Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: jaguera@email.unc.edu
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

