lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Use of Lucene to store data from RSS feeds
Date Thu, 14 Oct 2010 15:21:36 GMT

On Oct 14, 2010, at 10:17 AM, appy74@dsl.pipex.com wrote:

> Hello
> 
> I would like to store data retrieved hourly from RSS feeds in a database or in Lucene
so that the text can be easily
> indexed for word frequencies.
> 
> I need to get the text from the title and description elements of RSS items.
> 
> Ideally, for each hourly retrieval from a given feed, I would add a row to a table in
a dataset made up of the
> following columns:
> 
> feed_url, title_element_text, description_element_text, polling_date_time
> 
> From this, I can look up any element in a feed and calculate keyword frequencies based
upon the length of time required.
> 
> This can be done as a database table and hashmaps used to calculate word frequencies.
But can I do this in Lucene to
> this degree of granularity at all? If so, would each feed form a Lucene document or would
each 'row' from the
> database table form one?


Yes, you should be able to do all of this.  Note, Solr's DataImportHandler easily handles
RSS into Solr/Lucene and then you can either use the TermsComponent (which uses Lucene's TermEnum,
etc.) to get at the frequencies.  You might also need to do some stuff with Spans and SpanQueries
to properly incorporate your length of time requirement.

-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message