lucene-java-user mailing list archives

From Bianca Pereira <aivykar...@gmail.com>
Subject Re: Calculate Term Frequency
Date Fri, 22 Aug 2014 13:06:39 GMT
Hi,

  Thank you for the answers. In the end I calculated the Term Frequency
in plain Java: I get the text, break it into tokens, and count from
there. This turned out to be around 6 times faster in my case (using a
cache). I still use Lucene only to calculate the document frequency.
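
For anyone finding this thread later, here is a minimal sketch of that idea. It is not my exact code: the class name TermFrequencyCache, the methods countTerms and termFrequency, and the simple regex tokenizer are all made up for illustration. The tokenizer will not necessarily produce the same tokens as the Analyzer used at index time, so the counts can differ from what Lucene stores.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts term occurrences per document in plain Java
// and caches the per-document counts so each text is tokenized only once.
class TermFrequencyCache {
    // docId -> (term -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> cache = new HashMap<>();

    // Lowercases the text and splits on anything that is not a letter or digit.
    // This is an assumed tokenizer, not a Lucene Analyzer.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns how often 'term' occurs in the document, tokenizing on first access only.
    int termFrequency(String docId, String text, String term) {
        Map<String, Integer> counts = cache.computeIfAbsent(docId, id -> countTerms(text));
        return counts.getOrDefault(term.toLowerCase(Locale.ROOT), 0);
    }
}
```

The cache is what makes repeated lookups on the same document cheap: after the first call for a document, every further term lookup is a single map access.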

 Regards,
 Bianca


2014-08-19 17:56 GMT+01:00 Tri Cao <tmcao@me.com>:

> Erick, Solr's termfreq implementation also uses DocsEnum, on the
> assumption that freq is requested for ascending doc IDs, which is valid
> when scoring from the hit list. If freq is requested for an out-of-order
> doc, a new DocsEnum has to be created.
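
To make the point above concrete, here is a small self-contained sketch of a forward-only postings iterator. It does not use the Lucene API: ForwardOnlyPostings and TermFreqLookup are hypothetical classes that only imitate the contract described above (the iterator can move forward but not back), so a lookup for a lower doc ID forces a fresh iterator.

```java
// Hypothetical stand-in for a forward-only postings iterator (not Lucene's DocsEnum).
class ForwardOnlyPostings {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docIds; // ascending doc IDs that contain the term
    private final int[] freqs;  // term frequency for each of those docs
    private int pos = -1;       // -1 means "not positioned yet"

    ForwardOnlyPostings(int[] docIds, int[] freqs) {
        this.docIds = docIds;
        this.freqs = freqs;
    }

    // Current doc ID, -1 before the first advance, NO_MORE_DOCS when exhausted.
    int docId() {
        if (pos < 0) return -1;
        return pos >= docIds.length ? NO_MORE_DOCS : docIds[pos];
    }

    // Moves forward to the first doc ID >= target; there is no way to go back.
    int advance(int target) {
        do {
            pos++;
        } while (pos < docIds.length && docIds[pos] < target);
        return docId();
    }

    int freq() {
        return freqs[pos];
    }
}

// Reuses one iterator while requests ascend; recreates it when a request goes backwards.
class TermFreqLookup {
    private final int[] docIds;
    private final int[] freqs;
    private ForwardOnlyPostings postings;
    int iteratorsCreated = 0;

    TermFreqLookup(int[] docIds, int[] freqs) {
        this.docIds = docIds;
        this.freqs = freqs;
    }

    int freq(int docId) {
        if (postings == null || postings.docId() > docId) {
            // Out-of-order request: the old iterator is already past this doc.
            postings = new ForwardOnlyPostings(docIds, freqs);
            iteratorsCreated++;
        }
        int current = postings.docId();
        if (current < docId) {
            current = postings.advance(docId);
        }
        return current == docId ? postings.freq() : 0;
    }
}
```

With ascending requests the same iterator serves every lookup, so iteratorsCreated stays at 1. Each backwards request adds one more iterator plus a scan from the start of the postings, which is where the time goes.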
>
> Bianca, can you explain your use case in more detail? What did you mean
> by having a new document? Is a new document added to the index? If so,
> you already have to reopen the searcher/reader anyway to get a new
> DocsEnum.
>
> On Aug 19, 2014, at 08:26 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> Hmmm, I'm not at all an expert here, but Solr has a function
> query "termfreq" that does what you're doing I think? I wonder
> if the code for that function query would be a good place to
> copy (or even make use of)? See TermFreqValueSource...
>
> Maybe not helpful at all, but...
> Erick
>
> On Tue, Aug 19, 2014 at 7:04 AM, Bianca Pereira <aivykarter@gmail.com> wrote:
> > Hi everybody,
> >
> > I would like to hear your suggestions for calculating Term Frequency in
> > a Lucene document. Currently I am using MultiFields.getTermDocsEnum,
> > iterating through the returned DocsEnum 'de' and getting the frequency
> > with de.freq() for the desired document.
> >
> > My solution gives me the result I want, but I am having time issues.
> > For instance, I want to calculate the term frequency of a given term
> > for N documents in a sequence. Every time I have a new document, I have
> > to retrieve exactly the same DocsEnum again and iterate until I find
> > the document I want. Of course I cannot cache the DocsEnum (yes, I made
> > this huge mistake) because it is an iterator.
> >
> > Do you have any suggestions on how I can get Term Frequency in a fast
> > way? The only suggestion I have had up to now was "Do it
> > programmatically, don't use Lucene". Should this be the solution?
> >
> > Thank you.
> >
> > Regards,
> > Bianca Pereira
