lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Frequency Term of Composite words
Date Thu, 17 Dec 2009 22:50:43 GMT
It does.

Look at TermPositionVector.  It is usually much more efficient to count word
sequences at index time, however.
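
As a rough illustration of what counting word sequences up front could look like, here is a minimal plain-Java sketch. It does not use Lucene at all; the class name NGramCounter, the lowercase letters-and-apostrophes tokenizer, and the maxN parameter are assumptions made for this example, not anything from Lucene's analyzers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NGramCounter {

    // Hypothetical tokenizer: lowercase the text, then split on anything
    // that is not a letter or an apostrophe.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count every word sequence of length 1..maxN in one pass over the tokens.
    // Sequences are allowed to cross sentence boundaries in this sketch.
    static Map<String, Integer> countNGrams(String text, int maxN) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder gram = new StringBuilder();
            for (int n = 0; n < maxN && i + n < tokens.size(); n++) {
                if (n > 0) {
                    gram.append(' ');
                }
                gram.append(tokens.get(i + n));
                counts.merge(gram.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumps over the lazy dog. But the big dog "
                + "was sleeping. So the lazy dog didn't see the fox";
        Map<String, Integer> counts = countNGrams(doc, 3);
        System.out.println("lazy dog = " + counts.getOrDefault("lazy dog", 0));
        System.out.println("quick brown = " + counts.getOrDefault("quick brown", 0));
    }
}
```

The trade-off is the usual one: the table grows with maxN, but a lookup for a multiword term then costs the same as a lookup for a single word.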

On Thu, Dec 17, 2009 at 3:42 AM, Antonio Calò <anton.calo@gmail.com> wrote:

> Hi Ted. Yes, your assumptions are correct.
>
> If Lucene saves position and offset, I should be able to find a way to get
> the occurrences of a multiword term. I'll let you know. I'll write some code
> to see whether this is the best way.
>
> Many thanks & regards
>
> Antonio
>
> 2009/12/17 André Warnier <aw@ice-sa.com>
>
> > Antonio Calò wrote:
> >
> >> Hi Ted.
> >>
> >> Thank you very much for your feedback.
> >>
> >> I can see the term frequency for each term, but not for pairs or longer
> >> sequences of terms together.
> >>
> >> An example: "the quick brown fox jumps over the lazy dog. But the big dog
> >> was sleeping. So the lazy dog didn't see the fox"
> >>
> >> So, with your suggestion I'm able to find that tf("dog") = 3,
> >> tf("fox") = 2, ... (each term is just a single word).
> >>
> >> But it seems that TermFreqVector cannot answer this: tf("lazy
> >> dog") = 2, tf("quick brown") = 1.
> >>
> >> Unfortunately, I've been asked to retrieve the occurrences of a set of
> >> concepts in a document, and I was trying to use Lucene because my simple
> >> mapping algorithm is too slow :(.
> >>
> >> I'll try to see if I can do something with TermFreqVector, or with the
> >> Analyzer. Or I'll look for another way :)
> >>
> >> Antonio
> >>
> >>
> >>
> >> 2009/12/16 Ted Dunning <ted.dunning@gmail.com>
> >>
> >>  You need the term frequency vector.
> >>>
> >>> See here
> >>>
> >>>
> >>>
> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
> >>>
> >>> This is compatible in 3.0 as well:
> >>>
> >>>
> >>>
> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
> >>>
> >>> Note the package change.
> >>>
> >>>
> >>> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò <anton.calo@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I hope that you can help me with this.
> >>>>
> >>>> I'm looking for a fast way to obtain, for a given word, its term
> >>>> frequency (I mean how many times it occurs in a single doc). I've been
> >>>> looking through the mail archive and the LIA (Lucene in Action) book,
> >>>> and I found something like this:
> >>>>
> >>>> IndexSearcher index = new IndexSearcher(invertedIndexinRam);
> >>>> Term term = new Term("doc", "quick");
> >>>> int occurrence = index.docFreq(term);
> >>>>
> >>>> OK, occurrence contains the number of occurrences of the word "quick"
> >>>> in the index (in my case the index will contain only one document, for
> >>>> example "the quick brown fox jumps over the lazy dog"). In this case
> >>>> the occurrence will be 1. :)
> >>>>
> >>>> But now I need to retrieve the occurrences of a composite word, for
> >>>> example "quick brown fox", but I'm having trouble working out how to
> >>>> do this.
> >>>>
> > I haven't even really started to use Lucene yet, but I follow this list.
> > So just an unqualified idea:
> > - assuming each word is indexed, along with its position in each item
> > - assuming that you kept all the words, and did not strip out "stop words"
> > - assuming that you have the list of items which contain all of the words
> >   composing your multi-word term
> > - then you should be able to determine which items contain
> >  word 1 of your term in position n
> >  word 2 of your term in position n+1
> >  etc..
> >
> >
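
The position-matching idea above (word 1 at position n, word 2 at position n+1, and so on) can be sketched in plain Java for a single document. This is only an illustration under stated assumptions: it does not call Lucene, the class and method names (PhrasePositions, indexPositions, phraseFrequency) are hypothetical, and the tokenizer simply lowercases and splits on non-letters. The per-term position lists stand in for what a stored position vector would provide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PhrasePositions {

    // Build a term -> ascending list of token positions for one document.
    // Hypothetical tokenizer: lowercase, split on non-letter/apostrophe runs.
    static Map<String, List<Integer>> indexPositions(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (t.isEmpty()) {
                continue;
            }
            positions.computeIfAbsent(t, k -> new ArrayList<>()).add(pos);
            pos++;
        }
        return positions;
    }

    // Count the positions n where words[0] is at n, words[1] at n+1, etc.
    // Query words are expected to be lowercase, matching the tokenizer.
    static int phraseFrequency(Map<String, List<Integer>> positions, String... words) {
        if (words.length == 0) {
            return 0;
        }
        List<Integer> starts = positions.get(words[0]);
        if (starts == null) {
            return 0;
        }
        int count = 0;
        for (int n : starts) {
            boolean match = true;
            for (int k = 1; k < words.length && match; k++) {
                List<Integer> p = positions.get(words[k]);
                // Position lists are sorted, so binary search is valid here.
                match = p != null && Collections.binarySearch(p, n + k) >= 0;
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumps over the lazy dog. But the big dog "
                + "was sleeping. So the lazy dog didn't see the fox";
        Map<String, List<Integer>> idx = indexPositions(doc);
        System.out.println("lazy dog = " + phraseFrequency(idx, "lazy", "dog"));
        System.out.println("quick brown fox = " + phraseFrequency(idx, "quick", "brown", "fox"));
    }
}
```

Note that this only works if stop words are kept, as the second assumption in the list says; dropping them would shift or remove the positions the match relies on.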
>
>
> --
> Antonio Calò
> ------------------------------------------
> Software Developer Engineer
> @ Intellisemantic
> Mail anton.calo@gmail.com
> Tel. 011-56.90.429
> ------------------------------------------
>



-- 
Ted Dunning, CTO
DeepDyve
