lucene-java-user mailing list archives

From <marigol...@yahoo.com>
Subject TermFrequencies vector limits?
Date Mon, 21 Nov 2005 01:16:58 GMT
Hi.  I was wondering if anyone else has seen this
before.  I'm using Lucene 1.4.3 and have indexed
about 3000 text documents using the statement:

doc.add(Field.Text("contents", new FileReader(f), true));

When I go and retrieve the term frequency vectors, for
any document under about 90k, everything looks as
expected.  However for larger documents (I haven't
found the exact point, but I know that those over 128k
qualify) the sum of the term frequencies in the vector
seems to max out at 10001.  Here's the code snippet
that I'm using when I see this:

        int vecSize = vector.size();
        for (int j = 0; j < vecSize; j++) {
            currentTermFreq = vector.getTermFrequencies()[j];
            sumTermFreq = currentTermFreq + sumTermFreq;
            if (currentTermFreq > maxTermFreq) {
                maxTermFreq = currentTermFreq;
            }
        }

The result in sumTermFreq winds up being 10001 for
large documents.  The vector.size() varies from
document to document, and the term with the highest
frequency (and that frequency) varies from document to
document, but the sum does not.
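
In case it helps anyone reproduce the check without my index, here is a
minimal, self-contained sketch of the same summation.  A plain int[]
stands in for what vector.getTermFrequencies() returns; the class name,
the summarize() helper, and the sample numbers are made up for
illustration and are not part of Lucene.

```java
class TermFreqSummary {

    /**
     * Returns a two-element array: {sum of all frequencies, largest frequency}.
     * The input stands in for the int[] returned by getTermFrequencies().
     */
    static int[] summarize(int[] termFreqs) {
        int sumTermFreq = 0;
        int maxTermFreq = 0;
        for (int j = 0; j < termFreqs.length; j++) {
            int currentTermFreq = termFreqs[j];
            sumTermFreq += currentTermFreq;
            if (currentTermFreq > maxTermFreq) {
                maxTermFreq = currentTermFreq;
            }
        }
        return new int[] { sumTermFreq, maxTermFreq };
    }

    public static void main(String[] args) {
        // Hypothetical frequencies for a tiny document with five distinct terms.
        int[] sample = { 3, 1, 4, 1, 5 };
        int[] result = summarize(sample);
        System.out.println("sum=" + result[0] + " max=" + result[1]);
    }
}
```

The sum is the total number of term occurrences counted for the field,
so a sum that stays the same across large documents of different sizes
suggests the count is being capped somewhere before the vector is
written, rather than in this loop.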

Any thoughts/suggestions would be appreciated.

Thanks
--MG




		