lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM Haifa Team" by DoronCohen
Date Thu, 31 Jan 2008 05:21:29 GMT
Doron - this looks super useful!
Can you give an example for the lexical affinities you mention here? ("Juru creates posting
lists for lexical affinities")
Also:

"Normalized term-frequency, as in Juru.
Here, tf(freq) is normalized by the average term frequency of the document."

I've never seen this mentioned anywhere except here and once here on the ML (was it you who
mentioned this?), but this sounds intuitive.  What do others think?
Otis 


--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Apache Wiki <wikidiffs@apache.org>
To: java-commits@lucene.apache.org
Sent: Wednesday, January 30, 2008 5:15:02 PM
Subject: [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM Haifa Team" by DoronCohen

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by DoronCohen:
http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team

The comment on the change is:
Initial version, some data still missing...

New page:
= TREC 2007 Million Queries Track - IBM Haifa Team =

The [http://ciir.cs.umass.edu/research/million/ Million Queries Track] ran for the first time in 2007.

Quoting from the track home page:
 * "The goal of this track is to run a retrieval task similar to standard ad-hoc retrieval, but to evaluate large numbers of queries incompletely, rather than a small number more completely. Participants will run 10,000 queries and a random 1,000 or so will be evaluated. The corpus is the terabyte track's GOV2 corpus of roughly 25,000,000 .gov web pages, amounting to just under half a terabyte of data."

We participated in this track with two search engines: Lucene and our home-brewed search engine [http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf Juru].

The official reports and papers of the track should be available sometime in February 2008, but here is a summary of the results and of our experience with our first-ever Lucene submission to TREC.

In summary, the out-of-the-box search quality was not so great, but by altering how we use Lucene (that is, our application) and with some modifications to Lucene, we were able to improve the search quality results and to score well in this competition.

The lessons we learned can be of interest to applications using Lucene, to Lucene itself, and to researchers submitting to other TREC tracks (or elsewhere).

= Training =
As preparation for the track runs we "trained" Lucene on queries from previous years' tracks - more exactly, on the 150 short TREC queries for which there are existing judgments from previous years, for the same GOV2 data.

We built an index - actually 27 indexes - for this data. For indexing we used the Trec-Doc-Maker that is now in Lucene's contrib benchmark (or a slight modification of it).

We found that the best results are obtained when all data is in a single field, and so that is what we did, keeping only stems (English, Porter, from Lucene contrib). We used the Standard-Analyzer, with a modified stoplist that took into account domain-specific stopwords.

Running with both Juru and Lucene, and having obtained good results with Juru in previous years, we had something to compare to. For this, we made sure to HTML-parse the documents in the same way in both systems (we used Juru's HTML parser for this) and to use the same stoplist, etc.

In addition, anchor text was collected in a pre-indexing global analysis pass, and so the anchors of (pointing to) a page were indexed with the page they point to, up to a limited size. The number of in-links to each page was saved in a stored field and we used it as a static score element (boosting documents that had more in-links). The way that anchor text was extracted and prepared for indexing will be described in the full report.
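
The anchor pass described above can be pictured roughly as follows. This is only a minimal sketch, not the team's implementation, which is deferred to the full report: the function names, the 200-character cap, and the log-damped boost formula are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def collect_anchors(links, max_chars=200):
    """Pre-indexing pass: gather anchor text and count in-links per target page.

    `links` is an iterable of (source_url, target_url, anchor_text) triples.
    `max_chars` is a hypothetical cap; the page only says "up to a limited size".
    """
    pages = defaultdict(lambda: {"anchor_text": "", "inlinks": 0})
    for _source, target, text in links:
        page = pages[target]
        page["inlinks"] += 1  # every link counts, even if its text no longer fits
        candidate = (page["anchor_text"] + " " + text).strip()
        if len(candidate) <= max_chars:
            page["anchor_text"] = candidate
    return dict(pages)


def inlink_boost(inlinks):
    """Hypothetical static score factor that grows slowly with the in-link count."""
    return 1.0 + math.log1p(inlinks)


links = [
    ("a.gov/x", "irs.gov/forms", "tax forms"),
    ("b.gov/y", "irs.gov/forms", "IRS forms"),
    ("c.gov/z", "nasa.gov/", "NASA home"),
]
pages = collect_anchors(links)
for url, info in pages.items():
    print(url, info, round(inlink_boost(info["inlinks"]), 3))
```

At indexing time, the collected anchor text would be appended to the target page's content, and the in-link count kept in a stored field so that a factor like `inlink_boost` could be applied at search time.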

= Results =

The initial results were:

 
 ||<rowbgcolor="#80FF80">'''Run'''||'''MAP'''||'''P@5'''||'''P@10'''||'''P@20'''||
 || 1. Juru || 0.313 || 0.592 || 0.560 || 0.529 ||
 || 2. Lucene out-of-the-box || 0.154 || 0.313 || 0.303 || 0.289 ||
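
For readers unfamiliar with the column headings: P@k is the fraction of the top k results that are relevant, and MAP is the mean over queries of the average precision. A minimal sketch of these standard definitions follows; the document ids and judgments are made up for illustration, and the actual numbers in the table were produced with the contrib benchmark quality package, not with this code.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Sum of precision at each rank holding a relevant document, over all relevant docs."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: the average of per-query AP over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical single query: five results, three relevant documents (one never retrieved).
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9"}
print(precision_at_k(ranked, relevant, 5))
print(average_precision(ranked, relevant))
```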

We made the following changes:
 1. Add a proximity scoring element, based on our experience with "Lexical affinities" in Juru. Juru creates posting lists for lexical affinities. In Lucene we augmented the query with Span-Near-Queries.
 1. Phrase expansion - the query text was added to the query as a phrase.
 1. Replace the default similarity by Sweet-Spot-Similarity for a better choice of document length normalization. Juru is using [http://citeseer.ist.psu.edu/singhal96pivoted.html pivoted length normalization] and we experimented with it, but found out that the simpler and faster Sweet-Spot-Similarity performs better.
 1. Normalized term-frequency, as in Juru. Here, tf(freq) is normalized by the average term frequency of the document.
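
The last two items are easier to picture with formulas. The sketch below is illustrative only. The length norm follows the shape documented for the contrib Sweet-Spot-Similarity as I understand it (flat inside a "sweet spot" of lengths, decaying outside); the sweet-spot bounds used here are hypothetical, since the page does not give the values that were tuned. The tf normalization shown is one plausible reading of "normalized by the average term frequency of the document" (a log-damped ratio); the exact formula used in the runs is not stated on this page, so treat it as an assumption.

```python
import math


def sweet_spot_length_norm(num_terms, min_len, max_len, steepness=0.5):
    """Length norm that is 1.0 for lengths in [min_len, max_len] and decays outside."""
    penalty = abs(num_terms - min_len) + abs(num_terms - max_len) - (max_len - min_len)
    return 1.0 / math.sqrt(steepness * penalty + 1.0)


def normalized_tf(freq, avg_freq):
    """Assumed form: log-damped term frequency divided by the log-damped document average."""
    return math.log(1.0 + freq) / math.log(1.0 + avg_freq)


# Hypothetical sweet spot of 100..1000 terms.
for length in (50, 100, 500, 1000, 1008, 5000):
    print(length, round(sweet_spot_length_norm(length, 100, 1000), 4))

# A term occurring as often as the document's average term gets tf = 1.0;
# the same raw count in a more repetitive document counts for less.
print(normalized_tf(3, 3), normalized_tf(3, 7))
```

The intent of both pieces is the same: documents are not rewarded or punished merely for being short, long, or repetitive.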

So these are the updated results:

 
 ||<rowbgcolor="#80FF80">'''Run'''||'''MAP'''||'''P@5'''||'''P@10'''||'''P@20'''||
 || 1. Juru || 0.313 || 0.592 || 0.560 || 0.529 ||
 || 2. Lucene out-of-the-box || 0.154 || 0.313 || 0.303 || 0.289 ||
 || 3. Lucene + LA + Phrase + Sweet Spot + tf-norm || 0.306 || 0.627 || 0.589 || 0.543 ||
  
  
The improvement is dramatic.
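
To put a number on "dramatic", the relative gains can be computed directly from the table. The figures below are copied from the three rows above; only the helper name `relative_gain` is introduced here.

```python
juru = {"MAP": 0.313, "P@5": 0.592, "P@10": 0.560, "P@20": 0.529}
stock = {"MAP": 0.154, "P@5": 0.313, "P@10": 0.303, "P@20": 0.289}
modified = {"MAP": 0.306, "P@5": 0.627, "P@10": 0.589, "P@20": 0.543}


def relative_gain(new, old):
    """Relative change of `new` over `old`, e.g. 0.5 means +50%."""
    return (new - old) / old


for measure in stock:
    vs_stock = relative_gain(modified[measure], stock[measure])
    vs_juru = relative_gain(modified[measure], juru[measure])
    print(f"{measure}: {vs_stock:+.0%} vs stock Lucene, {vs_juru:+.1%} vs Juru")
```

On every measure the modified run roughly doubles stock Lucene; it lands slightly below Juru on MAP and slightly above it on the precision-at-k measures.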

Perhaps even more important, once the track results were published, we found out that these improvements are consistent and steady, and so Lucene with these changes was also ranked high by the two new measures introduced in this track - NEU-Map and E-Map (Epsilon-Map).

With these new measures more queries are evaluated but fewer documents are judged for each query. The algorithms for selecting documents for judging (during the evaluation stage of the track) were not our focus in this work - as there were actually two goals to this TREC:

  * the systems evaluation (our main goal) and
  * the evaluation itself.

The fact that modified Lucene scored well on both the traditional 150 queries and the new 1700 evaluated queries with the new measures was reassuring for the "usefulness" or perhaps "validity" of these modifications to Lucene.

For certain, these changes are not a 100% fit for every application and every data set, but these results are strong, and so I believe they can be valuable for many applications, and certainly for research aspects.

= Search time penalty =

These improvements did not come for free.
Adding a phrase to the query and adding Span-Near-Queries for every pair of query words costs query time.
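
A rough way to see where the cost comes from: with one proximity clause per pair of query words, the number of clauses grows quadratically with query length. The sketch below only counts and lists the clauses. It does not build real Lucene queries, and details such as the slop, the boosts, and how single-word queries were handled are not given on this page, so the structure returned by the hypothetical `expand_query` helper is an assumption.

```python
from itertools import combinations


def expand_query(terms):
    """Describe the augmented query: original terms, one near-clause per pair, one phrase."""
    return {
        "terms": list(terms),
        "near_pairs": list(combinations(terms, 2)),  # unordered pairs of query words
        "phrase": " ".join(terms),
    }


def clause_count(n_terms):
    """Terms + pairwise proximity clauses + one phrase clause."""
    return n_terms + n_terms * (n_terms - 1) // 2 + 1


expanded = expand_query(["mining", "safety", "regulations"])
print(expanded["near_pairs"])
for n in (2, 3, 5, 8):
    print(n, "terms ->", clause_count(n), "clauses")
```

Each pairwise clause also has to read term positions, which is more expensive than plain term scoring, so the slowdown is not only a matter of clause count.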

The search time of stock Lucene in our setup was 1.4 seconds/query. The modified search took 8.0 seconds/query. This is a large slowdown!

But it should be noted that in this work we did not focus on search time, only on quality. Now is the time to see how the search time penalty can be reduced while keeping most of the search quality improvements.

= Implementation Details =

 * Contrib benchmark quality package was used for the search quality measures and submissions.

/!\ To be completed...

= More Detailed Results =

/!\ To be added...

= Possible Changes in Lucene =

 * Move Sweet-Spot-Similarity to core
 * Make Sweet-Spot-Similarity the default similarity?
 * Easier and more efficient ways to add proximity scoring?
 * Allow easier implementation/extension of tf-normalization

/!\ To be completed & refined...




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

