lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From wal...@Cyveillance.com
Subject getting most common terms for a smaller set of documents
Date Tue, 07 Sep 2004 17:56:31 GMT
Dear Lucene Users:

What is the best way to get the most common terms for a subset of the total
documents in your index?

I know how to get the most common terms for a field for the entire index,
but what is the most efficient way to do this for a subset of documents?

Here is the code I am using to get the top "numberOfTerms" common terms for
the field "fieldName":

	public TermInfo[] mostCommonTerms(String fieldName, int
numberOfTerms)
	{
		//make sure min will get a positive number
		if (numberOfTerms < 1)
		{
			numberOfTerms = Integer.MAX_VALUE;
		}
		numberOfTerms = Math.min(numberOfTerms, 50);
		//String[] commonTerms = new String[numberOfTerms];
		try
		{
			IndexReader reader = IndexReader.open(indexPath);
			TermInfoQueue tiq = new
TermInfoQueue(numberOfTerms);
			TermEnum terms = reader.terms();

			int minFreq = 0;
			while (terms.next())
			{
	
if(fieldName.equalsIgnoreCase(terms.term().field()))
				{
					if (terms.docFreq() > minFreq)
					{
						tiq.put(new
TermInfo(terms.term(), terms.docFreq()));
						if (tiq.size() >=
numberOfTerms) // if tiq overfull
						{
							tiq.pop(); // remove
lowest in tiq
							minFreq =
((TermInfo) tiq.top()).docFreq; // reset
	
// minFreq
						}
					}

				}
			}
			TermInfo[] res = new TermInfo[tiq.size()];
			for (int i = 0; i < res.length; i++)
			{
				res[res.length - i - 1] = (TermInfo)
tiq.pop();
			}
			reader.close();
			return res;

		}
		catch (IOException ioe)
		{
			logger.error("IOException: " + ioe.getMessage());
		}
		return null;
	}

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message