lucene-dev mailing list archives

From Christoph Goller <gol...@detego-software.de>
Subject Re: Performance of TermVectors and skipTo
Date Wed, 07 Jul 2004 14:31:46 GMT
Lars Martin wrote:
> Hi Christoph.
> 
> -----Original Message-----
> From: Christoph Goller <goller@detego-software.de>
> Sent: 02 Jul 2004, 14:10:44
> 
> Thanks for your information.
> 
> 
>>*) For testing skipTo, I used my implementation for getting highly
>>correlated terms. For computing the correlation measure I have to
>>compare a lot of TermDocs lists with each other or other lists of
>>document ids. According to my measurements on an optimized index
>>skipTo speeds up my term correlation implementation by a factor of
>>2. And the benefit of skipTo probably increases with index size.
> 
> 
> Can you say something about this computation? I do something very
> similar but struggle with performance. Maybe we can share some of
> our experiences. Thanks, Lars

Lars,

I don't want to say too much about the details since this is part of one
of our products and not something I can share with the whole community, at
least not currently.

We are using the mutual information measure (sometimes the conditional
mutual information) for measuring correlations between terms or for
finding terms that are highly correlated with a query. Other correlation
measures such as chi-square are probably equally good, but I prefer
information theory over traditional statistics :-)

If you compare two terms, you have to compare their TermDocs lists. Every
document is regarded as a random experiment in which a term either occurs
or does not occur. Thus you can compute the mutual information between two
terms or the entropy/information of a single term (everything based on
relative frequencies as estimates of probabilities). Mutual information
tells you how much information one term gives you about the other. Roughly
speaking, a term gives you much information about another term if they occur
together very often and an occurrence of one without the other is rare.
skipTo helps you to compare two TermDocs lists more efficiently, and
TermVectors help you to get all terms of a specific document without
reindexing the whole document.
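
A minimal sketch of the comparison described above, under stated
assumptions. It is not the implementation referred to in this message, which
is not public. Plain sorted int arrays stand in for TermDocs lists, and the
class and method names (TermCorrelation, countCommon, skipTo,
mutualInformation) are hypothetical. The skipTo helper only mimics the idea
of jumping to the first document id >= a target; it is not the Lucene call.
The mutual information is computed in bits from relative frequencies, so its
values need not be on the same scale as the scores listed further down.

```java
import java.util.Arrays;

/** Sketch: mutual information of two terms from their sorted doc-id lists. */
class TermCorrelation {

    /**
     * Counts documents that contain both terms by leapfrogging over two
     * sorted doc-id arrays instead of stepping through them one by one.
     */
    static int countCommon(int[] a, int[] b) {
        int i = 0, j = 0, common = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                common++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = skipTo(a, i + 1, b[j]);   // jump ahead in list a
            } else {
                j = skipTo(b, j + 1, a[i]);   // jump ahead in list b
            }
        }
        return common;
    }

    /** Index of the first element >= target in sorted[from..], or sorted.length. */
    static int skipTo(int[] sorted, int from, int target) {
        int pos = Arrays.binarySearch(sorted, from, sorted.length, target);
        return pos >= 0 ? pos : -pos - 1;
    }

    /**
     * Mutual information (bits) of the two binary events "term A occurs in a
     * document" and "term B occurs in a document".
     * n = number of documents, na / nb = documents containing A / B,
     * nab = documents containing both.
     */
    static double mutualInformation(int n, int na, int nb, int nab) {
        double[][] joint = {
            { nab,      na - nab },          // A present: B present, B absent
            { nb - nab, n - na - nb + nab }  // A absent:  B present, B absent
        };
        double[] countA = { na, n - na };
        double[] countB = { nb, n - nb };
        double mi = 0.0;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                if (joint[x][y] > 0) {
                    // p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
                    mi += joint[x][y] / n
                        * Math.log(joint[x][y] * n / (countA[x] * countB[y]))
                        / Math.log(2);
                }
            }
        }
        return mi;
    }

    /** Convenience wrapper: mutual information straight from two doc-id lists. */
    static double mutualInformation(int n, int[] docsA, int[] docsB) {
        return mutualInformation(n, docsA.length, docsB.length,
                                 countCommon(docsA, docsB));
    }
}
```

Two terms that always occur together in half of the documents share one full
bit of information, while two terms whose co-occurrence count matches what
independence would predict share none.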

Our implementation based on TermVectors and skipTo is quite efficient. We
do not compute any term correlation matrix in advance. Everything is computed
on the fly. We are currently working on phrase detection so that things like
"space agency", "Bayern München" etc. are detected automatically, and in the
future we will find correlated phrases in addition to correlated terms.

For an online demonstration (currently only on German data) see
www.intrafind.org
www.intrafind.de

For the big Reuters corpus we get, e.g., the following for the term "space"
in the body field:

body:nasa 0.2631193
title:space 0.22249728
body:shuttle 0.2184087
body:mir 0.21577257
body:astronaut 0.19784299
body:orbit 0.16689582
body:station 0.16437002
body:astronauts 0.16407493
body:earth 0.1626108
body:crew 0.154212
body:mission 0.15323764
title:shuttle 0.13632075
body:aboard 0.1198784
title:mir 0.10377639
body:orbiting 0.102657825
body:cosmonauts 0.09957995
body:module 0.0969386
body:flight 0.09572442
body:launch 0.08581744
body:kennedy 0.08313377
body:spacecraft 0.08269161
body:rocket 0.08227383
body:craft 0.07688706
body:russian 0.07650414
body:satellites 0.07064384
body:satellite 0.06944384
body:soyuz 0.054038323
body:centre 0.052776292
body:russia 0.05264155
body:florida 0.049462605
title:nasa 0.049066782
body:columbia 0.041542426
body:agency 0.040644858
title:russia 0.040153336
body:landing 0.039837215
body:programme 0.038746372
body:control 0.037223794
body:est 0.037053764
body:aeronautics 0.035379715
body:hatch 0.0352119
body:cargo 0.035061583
body:booster 0.033894222
body:scheduled 0.032277618
body:russians 0.032115374

