lucene-java-user mailing list archives

From Joaquin Perez Iglesias <joaquin.pe...@lsi.uned.es>
Subject Re: Query Expansion Module for Lucene based on BM25 ranking function
Date Thu, 23 Oct 2008 08:56:12 GMT
Hi Grant and Jose,

just to give some more details: as Jose said, avg_length is precalculated 
at indexing time using a specific Similarity class. Basically, this can 
be done through the lengthNorm method: for each document and field, the 
field length is added to a running total for that field. When the 
indexing process is finished, these totals are divided by the total 
number of docs, and the results are saved in a file with the following 
format:

FIELD_NAME
FLOAT_VALUE
FIELD_NAME_2
FLOAT_VALUE_2

for example:
CONTENT
459.2903f
ANCHOR
84.55523f
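
A minimal sketch of the averaging and file-format step described above, in plain Java with no Lucene dependency. The class and method names (AvgLengthFile, computeAverages, format, parse) are hypothetical and are not taken from the actual implementation; the per-field totals are assumed to have been accumulated elsewhere (for example, inside lengthNorm as described), and the trailing "f" is written only to match the example values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of Lucene or of the BM25 code discussed here.
final class AvgLengthFile {

    /** Divides each field's summed length by the number of indexed documents. */
    static Map<String, Float> computeAverages(Map<String, Long> totalLengths, int numDocs) {
        if (numDocs <= 0) {
            throw new IllegalArgumentException("numDocs must be positive");
        }
        Map<String, Float> averages = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : totalLengths.entrySet()) {
            averages.put(e.getKey(), (float) (e.getValue().doubleValue() / numDocs));
        }
        return averages;
    }

    /** Serializes as alternating lines: field name, then value with an 'f' suffix. */
    static String format(Map<String, Float> averages) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : averages.entrySet()) {
            sb.append(e.getKey()).append('\n').append(e.getValue()).append("f\n");
        }
        return sb.toString();
    }

    /** Parses the alternating-line format back into a field -> average map. */
    static Map<String, Float> parse(String text) {
        Map<String, Float> averages = new LinkedHashMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return averages;
        }
        String[] lines = trimmed.split("\\R");
        if (lines.length % 2 != 0) {
            throw new IllegalArgumentException("expected name/value line pairs");
        }
        for (int i = 0; i < lines.length; i += 2) {
            String value = lines[i + 1].trim();
            // Drop the optional float suffix before parsing.
            if (value.endsWith("f") || value.endsWith("F")) {
                value = value.substring(0, value.length() - 1);
            }
            averages.put(lines[i].trim(), Float.parseFloat(value));
        }
        return averages;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = new LinkedHashMap<>();
        totals.put("CONTENT", 4592L); // made-up totals for illustration only
        totals.put("ANCHOR", 845L);
        String fileBody = format(computeAverages(totals, 10));
        System.out.print(fileBody);
        System.out.println(parse(fileBody));
    }
}
```

At query time, only parse would be needed: the stored averages are read once and then treated as constants.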

A more detailed explanation can be found at 
http://nlp.uned.es/~jperezi/Lucene-BM25/
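
For context, here is a rough sketch of how a precomputed average length enters the standard BM25 term weight as a constant. This is the textbook formula, not necessarily the exact code in the linked implementation. The class name Bm25Sketch is hypothetical, k1 = 1.2 and b = 0.75 are commonly used defaults chosen here as assumptions, and the idf uses the non-negative log(1 + ...) variant rather than the classic Robertson/Sparck Jones form.

```java
// Hypothetical illustration of BM25 with a constant average field length.
final class Bm25Sketch {

    /** Non-negative idf variant: log(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * BM25 weight of one term in one document field.
     * avgLength is the value precomputed at indexing time and read from the file.
     */
    static double termScore(double tf, double docLength, double avgLength,
                            double idf, double k1, double b) {
        // Length normalization: longer-than-average fields are penalized.
        double norm = k1 * (1.0 - b + b * docLength / avgLength);
        return idf * tf * (k1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        double avgContentLength = 459.2903; // example CONTENT value from the file above
        double idf = idf(50, 10_000);       // made-up collection statistics
        System.out.println(termScore(3, 200, avgContentLength, idf, 1.2, 0.75));
        System.out.println(termScore(3, 900, avgContentLength, idf, 1.2, 0.75));
    }
}
```

Because avgLength is fixed once indexing finishes, it has to be recomputed whenever documents are added or removed.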

Best Regards.


José Ramón Pérez Agüera wrote:
> Hi Grant,
>
> Our query expansion approach is quite simple: we apply 
> pseudo-relevance feedback techniques, where a number of top retrieved 
> documents are used to extract candidate terms to expand the 
> original query. We have used TermPositions at query time to extract 
> the term statistics necessary for query expansion. On the other 
> hand, to implement BM25, we have used the implementation proposed by 
> Joaquin Perez, where the avg. length is computed at indexing time and 
> is used as a constant at query time.
>
> We know that this is not the best way to do it, but we don't know 
> Lucene well enough to make a better implementation. Our code 
> is for research projects only, and it is not robust enough for 
> real-world projects.
>
> I have not followed any of the flexible indexing discussions, but the 
> subject seems very interesting to me. Do you have any pointers?
>
> Finally, the code has been released under the Apache license. But we are 
> not experts in this kind of thing. Our idea is that everyone can take 
> the code, change it and use it without constraints.
>
> Best regards
>
> Jose
>
> José Ramón Pérez Agüera
>
> On 22/10/2008, at 20:16, Grant Ingersoll <gsingers@apache.org> wrote:
>
>> Hi José,
>>
>> Can you explain your approach to the implementation? I'm curious how you 
>> incorporated the avg. doc length. Also, have you followed any of 
>> the flexible indexing discussions?
>>
>> Finally, what's the license on this code?
>>
>> Thanks,
>> Grant
>>
>> On Oct 21, 2008, at 10:14 AM, José Ramón Pérez Agüera wrote:
>>
>>> Hello,
>>>
>>> We have implemented a research module for Lucene using BM25 and our
>>> structured version of BM25 as ranking functions and a couple of
>>> state-of-the-art query expansion algorithms.
>>>
>>> This implementation is quite different from other query expansion
>>> modules for Lucene that are available on the web.
>>>
>>> We have used this implementation in CLEF 2008 Robust WSD track
>>> http://ixa2.si.ehu.es/clirwsd/ with very good results. The code and
>>> other technical and scientific details can be accessed here:
>>>
>>> http://grasia.fdi.ucm.es/jose/query-expansion
>>>
>>> This is an alpha0.0.never.used.before.by.anyone version; we will work
>>> on it in the coming months to make it easier to use.
>>>
>>> Please do not hesitate to send comments, questions and/or complaints
>>> to the authors.
>>>
>>> jose
>>>
>>> -- 
>>> José Ramón Pérez Agüera
>>>
>>> Dept. de Ingeniería del Software e Inteligencia Artificial
>>> Despacho 411 tlf. 913947599
>>> Facultad de Informática
>>> Universidad Complutense de Madrid
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
>> http://www.lucenebootcamp.com
>>
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>

-- 
-----------------------------------------------------------
Joaquín Pérez Iglesias
Dpto. Lenguajes y Sistemas Informáticos
E.T.S.I. Informática (UNED)
Ciudad Universitaria
C/ Juan del Rosal nº 16
28040 Madrid - Spain
Phone. +34 91 398 89 19
Fax    +34 91 398 65 35
Office  2.11
Email: joaquin.perez@lsi.uned.es
web:   http://nlp.uned.es/~jperezi/
-----------------------------------------------------------

