Thanks for your help, Adrien. Unfortunately, my term frequencies will be
partial counts, so they won't be integers. Finding a common denominator and
scaling the rest of the frequencies accordingly would change the relative
lengths of the documents, and that would affect the Lucene scoring because
document length is taken into account in the scoring.
Are there any other ideas?

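To make the concern concrete, here is a minimal sketch in plain Java of the
scale-and-repeat workaround, assuming every fractional count can be multiplied
by one common factor to reach a positive integer. The class and method names
(ScaledTermFreqs, scale, expand) are hypothetical and none of this calls the
Lucene API; expand() only produces the token sequence that a handcrafted
TokenStream would have to emit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene.
final class ScaledTermFreqs {

    // Multiply each fractional frequency by a common factor and round
    // to the nearest integer. Rejects anything that is not > 0 afterwards,
    // since indexed frequencies must be strictly positive integers.
    static Map<String, Integer> scale(Map<String, Double> freqs, int factor) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : freqs.entrySet()) {
            long scaled = Math.round(e.getValue() * factor);
            if (scaled <= 0) {
                throw new IllegalArgumentException(
                        "frequency does not scale to a positive integer: " + e.getKey());
            }
            out.put(e.getKey(), (int) scaled);
        }
        return out;
    }

    // Repeat each term as many times as its integer frequency.
    // The size of the returned list is the token count of the document.
    static List<String> expand(Map<String, Integer> intFreqs) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, Integer> e : intFreqs.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                tokens.add(e.getKey());
            }
        }
        return tokens;
    }
}
```

With a factor of 2, a document whose counts are 0.5 and 1.5 (total weight 2)
expands to 4 tokens, so the number of indexed tokens grows by the scaling
factor. That change in document length is what I am worried about.
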
On Thu, Mar 28, 2013 at 9:06 PM, Adrien Grand <jpountz@gmail.com> wrote:
> Hi,
>
> On Thu, Mar 28, 2013 at 8:25 PM, Sharon Tam <sharontam@gmail.com> wrote:
> > I believe that when Lucene indexes documents, it generates counts for a
> > term by counting how many times the term appears in a particular document.
> > Instead of having Lucene do the counting, I want to do my own counting and
> > feed a term-frequency vector representation of a document directly into the
> > indexer which will take my counts and proceed to do the other processing
> > such as generating inverse document frequency. My term frequencies may not
> > all be integers. Is there a way to do this?
>
> You could provide the indexer with arbitrary frequencies by creating a
> handcrafted TokenStream that repeats terms ${termFreq} times, but
> unfortunately, frequencies need to be strictly positive (> 0)
> integers.
>
> 
> Adrien
