lucene-java-user mailing list archives

From Shuai Weng <>
Subject word frequency counting
Date Wed, 11 Aug 2010 17:21:54 GMT


I'm new to Lucene. I was wondering whether Lucene/Solr can be used for word frequency counting
(e.g., in a subset of full-text papers).

Thanks for any info you may provide.
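
To make the question concrete, here is a minimal sketch in plain Java of the kind of count meant: the total number of occurrences of each word across a chosen subset of texts. It does not use the Lucene or Solr API. The class name `WordFrequency`, the method `countWords`, and the regex tokenizer are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordFrequency {

    // Counts how often each lowercased word occurs across the given texts.
    public static Map<String, Integer> countWords(List<String> texts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            // Split on runs of non-letter characters; a crude stand-in for a real analyzer.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical two-document subset.
        List<String> subset = Arrays.asList(
                "Lucene indexes text.",
                "Solr builds on Lucene.");
        System.out.println(countWords(subset));
    }
}
```

A search library would replace the regex split with its own tokenization and would read the counts from its index instead of rescanning the text.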

On Aug 11, 2010, at 10:16 AM, Julien Nioche wrote:

> BTW I don't remember anyone on the Nutch list suggesting that you use Carrot
> for this (see : or classifying at
> querying time.
> What I suggested in was about
> classifying during parsing or indexing and generating a field for Lucene
> or SOLR. As Otis pointed out, you can of course use SOLR for faceting. Since
> you will be using Nutch anyway, you might as well avoid an external DB just
> for storing the results of the classification and just keep the labels, e.g.
> in the parse metadata.
> Julien
> -- 
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> On 9 August 2010 00:16, Luan Cestari <> wrote:
>> Lucene developers,
>> We’ve been working on an undergraduate college project to change
>> Apache Nutch (which uses Lucene to index its web pages) to include a
>> category filter, and we are having problems with the query part. We want
>> to develop an application with good performance, so we thought that here
>> would be the best place to ask this kind of question. The idea is that the
>> user can restrict a search to the pages stored under a single category. The
>> number of results found should then be the number of pages actually
>> classified in that category.
>> The problem is how to add the category information to the Lucene indexes
>> and how to filter the search on it. We looked on the
>> Nutch mailing list (Nabble) and asked for help, but people
>> there think that we should use a plug-in like Carrot, which takes about 100
>> pages and classifies them at query time. We are not very confident that
>> it’s the best solution. We thought of two other ideas: #1 Classify
>> the pages, store that information in a DB, and at query
>> time use that DB to filter the results. #2 Use different index
>> servers, one for each category and one to search without filtering by
>> category.
>> We have seen on this project that there are
>> pre-defined categories. We think that this should be classified at indexing
>> time, as we wanted.
>> Do you have any other idea about how to do that?
>> Sincerely,
>> Daniel Costa Gimenes & Luan Cestari
>> Undergraduate students of University Center of FEI
>> Brazil
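
The indexing-time approach described in the quoted reply (classify each page once, keep the label as a field, then filter and count on that field at query time) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it is not the Nutch, Lucene, or Solr API, the class `CategoryIndex` and its methods are hypothetical names, and a linear scan with substring matching stands in for a real inverted index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class CategoryIndex {

    // One indexed page; the category label is assigned once, at indexing time.
    private static final class Page {
        final String url;
        final String text;
        final String category;

        Page(String url, String text, String category) {
            this.url = url;
            this.text = text.toLowerCase(Locale.ROOT);
            this.category = category;
        }
    }

    private final List<Page> pages = new ArrayList<>();

    // "Indexing": store the page together with its precomputed category.
    public void add(String url, String text, String category) {
        pages.add(new Page(url, text, category));
    }

    // Returns URLs of pages whose text contains the term.
    // A null category means no category filter is applied.
    public List<String> search(String term, String category) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Page p : pages) {
            boolean inCategory = category == null || category.equals(p.category);
            if (inCategory && p.text.contains(needle)) {
                hits.add(p.url);
            }
        }
        return hits;
    }

    // Facet-style counts: number of matching pages per category for one term.
    public Map<String, Integer> facetCounts(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new TreeMap<>();
        for (Page p : pages) {
            if (p.text.contains(needle)) {
                counts.merge(p.category, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical pages and labels.
        CategoryIndex index = new CategoryIndex();
        index.add("u1", "Football scores", "sports");
        index.add("u2", "Football transfer market", "business");
        System.out.println(index.search("football", "sports"));
        System.out.println(index.facetCounts("football"));
    }
}
```

Because the label travels with each page, the filtered hit count is exactly the number of matching pages in that category, and no external DB or per-category index server is needed.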
