lucene-general mailing list archives

From Antonio Calò <anton.c...@gmail.com>
Subject Re: Frequency Term of Composite words
Date Thu, 17 Dec 2009 11:42:28 GMT
Hi Ted. Yes, your assumptions are correct.

If Lucene stores positions and offsets, I should be able to find a way to get
the occurrences of a multiword term. I'll write some code to see whether this
is the best approach, and I'll let you know.
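
A minimal sketch of the position-based idea quoted below, written in plain Java
rather than against the Lucene API. Everything in it is an assumption made for
illustration: the class name PhraseCounter, the naive tokenizer (lowercase,
split on anything that is not an ASCII letter, no stop-word removal), and the
in-memory position map standing in for the positions an index would store. It
counts a multiword term by checking that word 1 is at position n, word 2 at
n+1, and so on.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class PhraseCounter {

    // Stand-in for an analyzer: lowercase, split on non-letters, keep every word.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Map each word to the set of token positions at which it occurs.
    static Map<String, Set<Integer>> indexPositions(List<String> tokens) {
        Map<String, Set<Integer>> positions = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            positions.computeIfAbsent(tokens.get(i), k -> new HashSet<>()).add(i);
        }
        return positions;
    }

    // Count the start positions n where word k of the phrase sits at n + k.
    static int countPhrase(String text, String phrase) {
        Map<String, Set<Integer>> positions = indexPositions(tokenize(text));
        List<String> words = tokenize(phrase);
        if (words.isEmpty()) {
            return 0;
        }
        Set<Integer> starts = positions.get(words.get(0));
        if (starts == null) {
            return 0;
        }
        int count = 0;
        for (int start : starts) {
            boolean match = true;
            for (int k = 1; k < words.size() && match; k++) {
                Set<Integer> p = positions.get(words.get(k));
                match = p != null && p.contains(start + k);
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog. But the big dog "
                + "was sleeping. So the lazy dog didn't see the fox";
        System.out.println(countPhrase(text, "lazy dog"));    // prints 2
        System.out.println(countPhrase(text, "quick brown")); // prints 1
    }
}
```

Note that this sketch ignores sentence boundaries, so a phrase can match across
a full stop. Whether that is acceptable depends on how the concepts are defined.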

Many thanks & regards

Antonio

2009/12/17 André Warnier <aw@ice-sa.com>

> Antonio Calò wrote:
>
>> Hi Ted.
>>
>> Thank you very much for your feedback.
>>
>> I can see the term frequency for each single term, but not for pairs or
>> longer sequences of terms together.
>>
>> An example: "the quick brown fox jumps over the lazy dog. But the big dog
>> was sleeping. So the lazy dog didn't see the fox"
>>
>> So, with your suggestion I'm able to find that tf("dog") = 3,
>> tf("fox") = 2, ... (each term consists of just one word).
>>
>> But it seems that TermFreqVector cannot answer this: tf("lazy
>> dog") = 2, tf("quick brown") = 1.
>>
>> Unfortunately, I've been asked to retrieve the occurrences of a set of
>> concepts in a document, and I was trying to use Lucene because my simple
>> mapping algorithm is too slow :(.
>>
>> I'll try to see if I can do something with TermFreqVector, or with the
>> Analyzer. Or I'll go and look for another way :)
>>
>> Antonio
>>
>>
>>
>> 2009/12/16 Ted Dunning <ted.dunning@gmail.com>
>>
>>> You need the term frequency vector.
>>>
>>> See here
>>>
>>>
>>> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>>>
>>> This is compatible in 3.0 as well:
>>>
>>>
>>> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>>>
>>> Note the package change.
>>>
>>>
>>> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò <anton.calo@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I hope you can help me with this.
>>>>
>>>> I'm looking for a fast way to obtain, for a given word, its term frequency
>>>> (I mean how many times it occurs in a single doc). I've been looking
>>>> through the mail archive and the LIA (Lucene in Action) book, and I found
>>>> something like this:
>>>>
>>>> IndexSearcher index = new IndexSearcher(invertedIndexinRam);
>>>> Term term = new Term("doc", "quick");
>>>> int occurrence = index.docFreq(term);
>>>>
>>>> OK, occurrence contains the occurrences of the word "quick" in the index
>>>> (in my case the index will contain only one document, for example "the
>>>> quick brown fox jumps over the lazy dog"). In this case the occurrence
>>>> will be 1. :)
>>>>
>>>> But now I need to retrieve the occurrences of a composite word, for
>>>> example "quick brown fox", and I'm having trouble working out how to do
>>>> this.
>>>>
> I haven't even really started to use Lucene yet, but I follow this list.
> So just an unqualified idea:
> - assuming each word is indexed, along with its position in each item
> - assuming that you kept all the words, and did not strip out "stop words"
> - assuming that you have the list of items which contain all of the words
> composing your multi-word term
> - then you should be able to determine which items contain
>  word 1 of your term in position n
>  word 2 of your term in position n+1
>  etc..
>
>


-- 
Antonio Calò
------------------------------------------
Software Developer Engineer
@ Intellisemantic
Mail anton.calo@gmail.com
Tel. 011-56.90.429
------------------------------------------
