lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: TermFrequencies vector limits?
Date Mon, 21 Nov 2005 08:06:35 GMT
By default, documents get truncated at 10,000 terms (maybe there is  
an off-by-one where it is going to 10,001 though?).

To increase this, and I always do, set the max field length on your  
IndexWriter, and re-index.  In 1.4.3, you set the maxFieldLength  
variable of IndexWriter directly.  We've changed this to be  
setMaxFieldLength in the TRUNK codebase.

	Erik



On 20 Nov 2005, at 20:16, <marigoldcc@yahoo.com>  
<marigoldcc@yahoo.com> wrote:

> Hi.  I was wondering if anyone else has seen this
> before.  I'm using  lucene 1.4.3 and have indexed
> about 3000 text documents using the statement:
>
> doc.add(Field.Text("contents", new FileReader(f),
> true));
>
> When I go and retrieve the term frequency vectors, for
> any document under about 90k, everything looks as
> expected.  However for larger documents (I haven't
> found the exact point, but I know that those over 128k
> qualify) the sum of the term frequencies in the vector
> seems to max out at 10001.  Here's the code snippet
> that I'm using when I see this:
>
>         int vecSize = vector.size();
>         for (int j = 0; j < vecSize; j++) {
>             currentTermFreq =
> vector.getTermFrequencies()[j];
>             sumTermFreq = currentTermFreq +
> sumTermFreq;
>             if ( currentTermFreq > maxTermFreq) {
>                 maxTermFreq = currentTermFreq;
>             }
>         }
>
> The results in sumTermFreq winds up being 10001 for
> large documents.  The vector.size() varies from
> document to document, the term with the highest
> freqency (and that frequency) varies from document to
> document, but not the sum.
>
> Any thougths/suggestions would be appreciated.
>
> Thanks
> --MG
>
>
>
>
> 		
> __________________________________
> Yahoo! FareChase: Search multiple travel sites in one click.
> http://farechase.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message