lucene-general mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: Frequency Term of Composite words
Date Thu, 17 Dec 2009 11:04:45 GMT
Antonio Calò wrote:
> Hi Ted.
> 
> Thank you very much for your feedback.
> 
> I can see the term frequency for each term, but not for pairs or longer
> sequences of terms together.
> 
> An example: "the quick brown fox jumps over the lazy dog. But the big dog
> was sleeping.So The lazy dog didn't see the fox"
> 
> So, with your suggestion I'm able to find that tf("dog") = 3,
> tf("fox") = 2, ... (each term consists of just one word).
> 
> But it seems that TermFreqVector cannot answer this: tf("lazy
> dog") = 2, tf("quick brown") = 1.
> 
> Unfortunately, I've been asked to retrieve the occurrences of a set of
> concepts in a document, and I was trying to use Lucene because my simple
> mapping algorithm is too slow :(.
> 
> I'll try to see if I can do something with TermFreqVector, or with the
> Analyzer. Or I'll go look for another way :)
> 
> Antonio
> 
> 
> 
> 2009/12/16 Ted Dunning <ted.dunning@gmail.com>
> 
>> You need the term frequency vector.
>>
>> See here
>>
>> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>>
>> This is compatible in 3.0 as well:
>>
>> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>>
>> Note the package change.
>>
>>
>> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò <anton.calo@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I hope that you can help me with this.
>>>
>>> I'm looking for a fast way to obtain, for a given word, its term
>>> frequency (I mean how many times it occurs in a single doc). I've been
>>> looking into the mail archive and the LIA (Lucene in Action) book, and I
>>> found something like this:
>>>
>>> IndexSearcher index = new IndexSearcher(invertedIndexinRam);
>>> Term term = new Term("doc", "quick");
>>> int occurrence = index.docFreq(term);
>>>
>>> OK, occurrence contains the number of occurrences of the word "quick" in
>>> the index (in my case the index will contain only one document, for
>>> example "the quick brown fox jumps over the lazy dog"). In this case the
>>> occurrence will be 1. :)
>>>
>>> But now I need to retrieve the occurrences of a composite word, for
>>> example "quick brown fox", but I'm having trouble working out how to do
>>> this.
>>>
I haven't even really started to use Lucene yet, but I follow this list.
So just an unqualified idea:
- assuming each word is indexed, along with its position in each item
- assuming that you kept all the words, and did not strip out "stop words"
- assuming that you have the list of items which contain all of the 
words composing your multi-word term
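- then you should be able to determine which items contain
   word 1 of your term in position n
   word 2 of your term in position n+1
   etc.
  Each such run of consecutive positions is one occurrence of the
  multi-word term, so counting the runs in an item gives its frequency there.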
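
To make the idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API at all; the class and method names (PhraseCounter, tokenize, indexPositions, countPhrase) are made up for illustration, and the tokenizer is a simplistic assumption (lowercase, split on anything that is not a letter or digit, no stop-word removal). It records the positions of every word, then counts how often the words of a phrase appear at consecutive positions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class PhraseCounter {

    // Simplistic tokenizer (an assumption): lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t); // split() can yield an empty leading string
            }
        }
        return tokens;
    }

    // Map each word to the set of positions at which it occurs in the text.
    static Map<String, Set<Integer>> indexPositions(String text) {
        Map<String, Set<Integer>> positions = new HashMap<>();
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            positions.computeIfAbsent(tokens.get(pos), k -> new HashSet<>()).add(pos);
        }
        return positions;
    }

    // Count the positions n where word 1 is at n, word 2 at n+1, and so on.
    static int countPhrase(Map<String, Set<Integer>> positions, String phrase) {
        List<String> words = tokenize(phrase);
        if (words.isEmpty()) {
            return 0;
        }
        Set<Integer> starts = positions.get(words.get(0));
        if (starts == null) {
            return 0;
        }
        int count = 0;
        for (int start : starts) {
            boolean match = true;
            for (int i = 1; i < words.size() && match; i++) {
                Set<Integer> p = positions.get(words.get(i));
                match = p != null && p.contains(start + i);
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumps over the lazy dog. But the big dog "
                + "was sleeping.So The lazy dog didn't see the fox";
        Map<String, Set<Integer>> idx = indexPositions(doc);
        System.out.println(countPhrase(idx, "lazy dog"));    // 2
        System.out.println(countPhrase(idx, "quick brown")); // 1
        System.out.println(countPhrase(idx, "dog"));         // 3
    }
}
```

Note that this sketch ignores sentence boundaries, so "dog but" would also count as a phrase in the example text. Whether that matters depends on what you consider a composite word.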

