lucene-java-user mailing list archives

From Greg Gershman <>
Subject Re: word frequency counting
Date Fri, 13 Aug 2010 19:13:20 GMT

Index your documents, then open an IndexReader and take a look at the terms() 
method.  You can grab each term and pass it to the IndexReader's 
docFreq(Term t) method to get back the number of documents that term appears in.
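
Note that a per-term document count is not the same as a total occurrence count: a word used three times in one paper still counts as one document. The sketch below is a plain-Java stand-in, not Lucene code, meant only to illustrate the difference between the two numbers. The class name WordCountSketch, its method names, and the naive tokenizer are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: no Lucene classes are used here.
final class WordCountSketch {

    // Naive tokenizer (an assumption): lowercase, then split on anything
    // that is not an ASCII letter. A real analyzer does much more.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Number of documents each word appears in: the kind of count
    // that a per-term document frequency gives you.
    static Map<String, Integer> documentFrequency(List<String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            // De-duplicate so each document contributes at most 1 per word.
            for (String word : new HashSet<>(tokenize(doc))) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Total number of occurrences of each word across all documents.
    static Map<String, Integer> totalOccurrences(List<String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            for (String word : tokenize(doc)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene indexes text. Lucene is fast.",
                "Solr builds on Lucene.");
        System.out.println("documents containing each word: " + documentFrequency(docs));
        System.out.println("total occurrences of each word: " + totalOccurrences(docs));
    }
}
```

If what you need is the second kind of count, the per-document counts alone will not be enough, so it is worth deciding which number you are after before indexing.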


From: Shuai Weng <>
Sent: Wed, August 11, 2010 1:21:54 PM
Subject: word frequency counting


I'm new to Lucene...  I was wondering if we can use Lucene/Solr for word 
frequency counting (e.g., in a subset of full-text papers).

Thanks for any info you may provide.

On Aug 11, 2010, at 10:16 AM, Julien Nioche wrote:

> BTW I don't remember anyone on the Nutch list suggesting you to use Carrot
> for this (see : or classifying at
> querying time
> What I suggested in was about
> classifying during the parsing or indexing and generating a field for Lucene
> or SOLR. As Otis pointed out you can of course use SOLR for faceting. Since
> you will be using Nutch anyway, you might as well avoid an external DB just
> for storing the results of the classification and just keep the labels e.g.
> in the parse metadata
> Julien
> -- 
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> On 9 August 2010 00:16, Luan Cestari <> wrote:
>> Lucene developers,
>> We’ve been working on an undergraduate college project that changes
>> Apache Nutch (which uses Lucene to index its web pages) to include a
>> category filter, and we are having problems with the query part. We want
>> to develop an application with good performance, so we thought that here
>> would be the best place to ask this kind of question. The idea is that the
>> user can search the stored pages within a single category, so the number
>> of results found should be the number of pages that are actually
>> classified in that category.
>> The problem is how to add the category information to the Lucene
>> indexes, and how to filter the search on it. We looked on the
>> Nutch mailing list (Nabble) and asked for help, but people
>> there think that we should use a plug-in like Carrot, which takes about
>> 100 pages and classifies them at query time. We are not very confident
>> that it’s the best solution. We thought of two other ideas: #1 Classify
>> the pages, store that information in a DB, and at query time use that
>> DB to filter the results. #2 Use different index
>> servers, one for each category and one to search without filtering by
>> category.
>> We have seen on this project that there are
>> pre-defined categories. We think that this should be classified at indexing
>> time, as we wanted.
>> Do you have any other idea about how to do that?
>> Sincerely,
>> Daniel Costa Gimenes & Luan Cestari
>> Undergraduate students of University Center of FEI
>> Brazil
