lucene-general mailing list archives

From "Rao, Vaijanath" <vaijanath....@corp.aol.com>
Subject RE: Frequency Term of Composite words
Date Thu, 17 Dec 2009 11:18:44 GMT
Hi Antonio,

One simple way would be to generate the n-grams of the text and store them as is.

For example: "the quick brown fox jumps over the lazy dog. But the big dog was sleeping. So
The lazy dog didn't see the fox"
Say you decide your system can support concepts up to a length of 3 words. You then generate
the n-grams for the text, so the output would be something like this:
the, the quick, the quick brown, and so on.

Then index these n-grams in a dedicated field using a keyword analyzer, so that each n-gram
is kept as a single term. You can then read the term frequency vector (TermFreqVector) for
that field.
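
To make the n-gram step concrete, here is a minimal sketch in plain Java. It does not use any Lucene API; the class and method names (NGramCounts, tokenize, ngramCounts) are made up for illustration, and the tokenization rule (lowercase, split on anything that is not a letter, digit or apostrophe) is an assumption rather than what a particular Lucene analyzer would do. It simply counts in memory what the term frequency vector would report once the n-grams are indexed as terms.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NGramCounts {

    /** Lowercases the text and splits it on anything that is not a letter, digit or apostrophe. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Counts every word n-gram of length 1..maxLen.
     * Note: n-grams are allowed to cross sentence boundaries here
     * (for example "dog but the"); filter them out if that matters.
     */
    static Map<String, Integer> ngramCounts(String text, int maxLen) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            // Grow the n-gram one word at a time: "the", "the quick", "the quick brown".
            for (int len = 1; len <= maxLen && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    gram.append(' ');
                }
                gram.append(tokens.get(start + len - 1));
                counts.merge(gram.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog. "
                + "But the big dog was sleeping. So The lazy dog didn't see the fox";
        Map<String, Integer> counts = ngramCounts(text, 3);
        System.out.println("lazy dog -> " + counts.get("lazy dog"));       // prints "lazy dog -> 2"
        System.out.println("quick brown -> " + counts.get("quick brown")); // prints "quick brown -> 1"
        System.out.println("dog -> " + counts.get("dog"));                 // prints "dog -> 3"
    }
}
```

Each key of the map is what you would store as one term in the keyword-analyzed field. The upper bound of 3 is only the example length from above; a larger bound grows the number of terms roughly linearly with it.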

Hope this helps 

--Thanks and Regards
Vaijanath N. Rao




 

-----Original Message-----
From: Antonio Calò [mailto:anton.calo@gmail.com] 
Sent: Thursday, December 17, 2009 4:25 PM
To: general@lucene.apache.org
Subject: Re: Frequency Term of Composite words

Hi Ted.

Thank you very much for your feedback.

I can see the term frequency for each single term, but not for pairs or longer sequences of terms together.

An example: "the quick brown fox jumps over the lazy dog. But the big dog was sleeping. So
The lazy dog didn't see the fox"

So, with your suggestion I'm able to find that tf("dog") = 3, tf("fox") = 2, ... (the terms
consist of just one word).

But it seems that TermFreqVector cannot answer this: tf("lazy dog") = 2, tf("quick brown") = 1.

Unfortunately, I've been asked to retrieve the occurrences of a set of concepts in a document,
and I was trying to use Lucene because my simple mapping algorithm is too slow :(.

I'll try to see if I can do something with TermFreqVector or with the Analyzer. Otherwise I'll
look for another way :)

Antonio



2009/12/16 Ted Dunning <ted.dunning@gmail.com>

> You need the term frequency vector.
>
> See here
>
> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>
> This is compatible in 3.0 as well:
>
> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>
> Note the package change.
>
>
> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò <anton.calo@gmail.com>
> wrote:
>
> > Hi All,
> >
> > I hope that you can help me with this.
> >
> > I'm looking for a fast way to obtain, for a given word, its term frequency
> > (I mean how many times it occurs in a single doc). I've been looking into
> > the mail archive and the LIA (Lucene in Action) book, and I found something
> > like this:
> >
> > IndexSearcher index = new IndexSearcher(invertedIndexinRam);
> > Term term = new Term("doc", "quick");
> > int occurrence = index.docFreq(term);
> >
> > OK, occurrence contains the occurrences of the word "quick" in the index
> > (in my case the index will contain only one document, for example "the
> > quick brown fox jumps over the lazy dog"). In this case the occurrence
> > will be 1. :)
> >
> > But now I need to retrieve the occurrences of a composite word, for
> > example "quick brown fox", and I'm having trouble working out how to do this.
> >
> > Thanks in advance for your help.
> >
> > Best Regards.
> >
> > Antonio
> >
> >
> >
> > --
> > Antonio Calò
> > ------------------------------------------
> > Software Developer Engineer
> > @ Intellisemantic
> > Mail anton.calo@gmail.com
> > Tel. 011-56.90.429
> > ------------------------------------------
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



--
Antonio Calò
------------------------------------------
Software Developer Engineer
@ Intellisemantic
Mail anton.calo@gmail.com
Tel. 011-56.90.429
------------------------------------------
