lucene-java-user mailing list archives

From "Karsten Konrad" <Karsten.Kon...@xtramind.com>
Subject Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Sat, 15 Nov 2003 12:15:57 GMT

>>
Rules of linguistics? Is there such a thing? :)
>>

Yes, there are. How can you expect communication (the goal of
the game that natural language is about) to work if the game
has no rules?

Anyway, Herb is right, sentence boundaries do carry a meaning and the 
linguistic rule could be phrased as: "Constituents (Concepts) mentioned 
in one sentence together have a closer relation than those that are not."

I was wondering whether we could, while indexing, make use of this by
increasing the position counter by a large number, let's say 1000,
whenever we encounter a sentence separator (note that this is not trivial;
not every '.' ends a sentence, etc.). Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts
where the phrases involved appear within sentence boundaries - I 
assume that a sentence with 100 words would be pretty unlikely,
but still within the 1000 word separation done by increasing the
position. No linguistics necessary, actually, but it is an application
of a linguistic rule!
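
A minimal sketch of the position-gap idea in plain Java, without the Lucene API. Everything here is an assumption for illustration: the class name SentenceGapPositions, the naive splitter that treats every '.', '!' or '?' as a sentence end (which, as noted above, is wrong in general), and the within() check, which is a simple unordered distance test rather than Lucene's actual sloppy-phrase matching.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SentenceGapPositions {
    // Gap added to the position counter at every sentence boundary.
    static final int SENTENCE_GAP = 1000;

    // Maps each lower-cased word to the list of positions where it occurs.
    // Sentence detection is deliberately naive (hypothetical helper logic).
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String sentence : text.split("[.!?]+")) {
            boolean sawWord = false;
            for (String word : sentence.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // split() can yield a leading empty string
                }
                positions.computeIfAbsent(word, k -> new ArrayList<>()).add(pos);
                pos++;
                sawWord = true;
            }
            if (sawWord) {
                pos += SENTENCE_GAP; // push the next sentence far away
            }
        }
        return positions;
    }

    // True if some occurrence of a and some occurrence of b lie at most
    // maxDistance positions apart (order is ignored).
    static boolean within(Map<String, List<Integer>> idx, String a, String b, int maxDistance) {
        List<Integer> as = idx.getOrDefault(a, Collections.<Integer>emptyList());
        List<Integer> bs = idx.getOrDefault(b, Collections.<Integer>emptyList());
        for (int pa : as) {
            for (int pb : bs) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> split = index("Income tax is due. The gain was small.");
        Map<String, List<Integer>> same = index("The income tax gain was small.");
        System.out.println(within(split, "tax", "gain", 100)); // false: different sentences
        System.out.println(within(same, "tax", "gain", 100));  // true: same sentence
    }
}
```

With a gap of 1000 and a distance limit of 100, two terms can only match when they sit in the same sentence, provided no sentence is longer than about 100 words.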

>>
Sure. But my take on this, is that pigs will fly before NLP turns into 
a predictable "science" :)
>>

You mean like physics (new models every 10 years), biology (same),
medicine (er.. cancer research anyone?), chemistry ("the result could be
verified in 8 of 10 experiments...")? What does predictability mean
to you? What sciences besides mathematics give you 100% certainty?

But I guess you are in flame mode anyway now :)

Regards,

Karsten 


-----Original Message-----
From: petite_abeille [mailto:petite_abeille@mac.com]
Sent: Friday, 14 November 2003 20:04
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]



On Nov 14, 2003, at 19:50, Chong, Herb wrote:

> if you are handling inter correlation properly, then terms can't cross
> sentence boundaries.

Could you not break down your document along sentence boundaries? If you
manage to figure out what a sentence is, that is.

> if you are not paying attention to sentence boundaries, then you are
> not following rules of linguistics.

Rules of linguistics? Is there such a thing? :)

PA.




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

