lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nadav Har'El" <>
Subject Re: Small field indexing and ranking
Date Tue, 11 Apr 2006 09:12:10 GMT
"Maxym Mykhalchuk" <> wrote on 11/04/2006 11:52:07 AM:
> As for improving multi-word queries, Doug Cutting recently posted a link
> his presentation,

> just scroll down to Nutch N-Grams there, and you'll see the answer.
> Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the,
> the-vampire+0, vampire; but that's only for common terms like "the",
> etc. I wander if it's possible to use the same technique to store

The method you propose has been known for a long time, and sometimes called
"Lexical Affinities".
For example,

   "Lexical affinities (LAs) were first introduced by Saussure in 1947
    to represent the correlation between words co-occurring in a given
    language and then restricted to a given document for IR purposes.
    LAs are identified by looking at pairs of words found in close
    proximity to each other. It has been described elsewhere how LAs,
    when used as indexing units, improve precision of search by
    disambiguating terms."

A paper from SIGIR from as back as 1989
makes use of this technique.

However, I'm not sure that this technique is still considered the
best one. It can have a large impact on the index size, and it may
be possible to get similar results with no impact on index size
and just a small run-time slowdown by using something like
SpanNearQuery, or a variation on this idea. Again, I didn't yet
try to do this myself, so I'm not sure how successful that would

Nadav Har'El

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message