From Tatu Saloranta <>
Subject Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Sat, 15 Nov 2003 01:30:16 GMT
On Friday 14 November 2003 13:39, Chong, Herb wrote:
> you're describing ad-hoc solutions to a problem that have an effect, but
> not one that is easily predictable. one can concoct all sorts of
> combinations of the query operators that would have something of the effect
> that i am describing. crossing sentence boundaries, however, can't be done

Hmmh? You implied that there are some useful distance heuristics (words
five or more words apart correlate much less), and others have pointed out
that Lucene has many useful components.
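
To make the distance heuristic concrete, here is a minimal sketch in plain
Python. It is not Lucene code: the function names, the regex tokenization,
and the default window of 5 tokens are all assumptions for illustration.

```python
import re


def min_token_distance(text, term_a, term_b):
    """Smallest gap, in token positions, between any occurrences of two terms.

    Returns None if either term does not occur. Tokenization is a naive
    lowercase split on word characters (an assumption, not an analyzer).
    """
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    if not positions_a or not positions_b:
        return None
    return min(abs(a - b) for a in positions_a for b in positions_b)


def within_window(text, term_a, term_b, window=5):
    """True if the two terms occur fewer than `window` tokens apart."""
    distance = min_token_distance(text, term_a, term_b)
    return distance is not None and distance < window
```

Something like within_window(doc, "vector", "model") then acts as a crude
"these words probably belong together" test, with no sentence boundaries
involved.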

Building a more complex system from small components is usually considered a
Good Thing (tm), not an "ad hoc solution". In fact, I would guess most
experienced people around here start with the Lucene defaults and build their
own systems by gradually customizing more and more of the pieces.
It may be that there are actual fundamental problems with Lucene regarding
the approach you'd prefer, but I don't think it makes sense to brush off
suggestions regarding distance and fuzzy/sloppy queries by claiming they are
"just hacks".

> without having some sentence boundaries as a reference. on top of this,
> there is a relatively simple concept which, if implemented, takes away all
> the ad-hocness of the solutions and replaces it with a something that is
> both linguistically and mathematically sound and on top of which won't

As most people have pointed out, linguistics is nothing like an exact
science, and comparing it to maths sounds like apples vs. oranges to me.
I'm not even convinced one can use general terms like "linguistically sound",
especially as the content being indexed and searched is often a mixture of
natural and programming languages (at least with the knowledge bases I work
with).

Now, if you (or anyone else) could build a more advanced query mechanism,
either on top of Lucene fundamentals or as a modified version, THAT would be
useful. But it's more efficient to first consider the suggestions, and
especially WHAT WORKS, as opposed to arguing for what appears to be the most
elegant solution.

> materially make the engine core more complicated. that concept is that
> multiword queries are mostly multiword terms and they can't cross sentence
> boundaries according to the rules of English.

Which brings us back to the problem of detecting boundaries. Punctuation can
help; classification of words can help; all are inexact "science". Which
makes me wonder whether simply considering token distances might be plenty
good enough.
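
As a rough illustration of why punctuation alone is inexact, here is a
deliberately naive sentence splitter in plain Python. The function name and
the splitting rule are assumptions for illustration only, not how any real
boundary detector works.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    Deliberately simplistic: abbreviations such as "Dr." are wrongly
    treated as sentence ends.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# The abbreviation "Dr." produces a spurious boundary, so two sentences
# come back as three pieces.
pieces = naive_sentences("Dr. Smith indexed the files. It worked.")
```

A splitter like this needs abbreviation lists and other special cases before
it is usable, whereas a plain token-distance check needs none of that.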

-+ Tatu +-

