lucene-dev mailing list archives

From Simon Willnauer
Subject Re: Continuous stream indexing and time-based segment management
Date Tue, 19 Jun 2012 19:46:12 GMT
On Tue, Jun 19, 2012 at 9:44 PM, Simon Willnauer wrote:
> On Tue, Jun 19, 2012 at 6:42 PM, mark harwood wrote:
>> There are a number of scenarios where Lucene might be used to index a fixed time
>> range on a continuous stream of data, e.g. a news feed.
>> In these scenarios I imagine the following facilities would be useful:
>> a) A MergePolicy that organized content into segments on the basis of increasing
>> time units, e.g. 5 min -> 10 min -> 1 hour -> 1 day
>> b) The ability to drop entire segments, e.g. the day-level segment from exactly a
>> week ago
> You can do that by subclassing IW (IndexWriter) and calling some
> package-private APIs / members. We can certainly make that easier, but I
> personally don't want to open this up as a public API. I can certainly
> imagine having a protected API that allows dropping an entire segment.
> simon
>> c) Various new analysis functions comparing term frequencies across time, e.g. discovery
>> of "trending" topics.
>> I can see that a) could be implemented using a custom MergePolicy and c) can be done
>> via existing APIs, but I'm not sure if there is a way to simply drop entire segments currently?
>> Anyone else had thoughts in this area?
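
To make a) and b) a bit more concrete, here is a minimal sketch of just the time-bucket arithmetic. It is plain Java, not Lucene code: `TimeTiers`, `bucketStart`, `sameBucket` and `droppable` are hypothetical names, buckets are assumed to be aligned to the Unix epoch in UTC, and wiring any of this into a custom MergePolicy or an IndexWriter subclass is deliberately left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical helper, not a Lucene API: time-bucket arithmetic only. */
final class TimeTiers {
    // Tier widths from the example in the message: 5 min -> 10 min -> 1 hour -> 1 day.
    static final List<Duration> TIERS = List.of(
            Duration.ofMinutes(5),
            Duration.ofMinutes(10),
            Duration.ofHours(1),
            Duration.ofDays(1));

    /** Start of the bucket of the given width that contains t (assumes epoch-aligned buckets). */
    static Instant bucketStart(Instant t, Duration width) {
        long w = width.getSeconds();
        return Instant.ofEpochSecond(Math.floorDiv(t.getEpochSecond(), w) * w);
    }

    /** Two timestamps belong together at a tier when they fall into the same bucket of that width. */
    static boolean sameBucket(Instant a, Instant b, Duration width) {
        return bucketStart(a, width).equals(bucketStart(b, width));
    }

    /** A segment could be dropped whole once its newest document is older than the retention window. */
    static boolean droppable(Instant newestDoc, Instant now, Duration retention) {
        return newestDoc.plus(retention).isBefore(now);
    }
}
```

For a document stamped 2012-06-19T19:46:12Z, `bucketStart` gives 19:45:00Z for the 5-minute tier and 19:00:00Z for the 1-hour tier. A real policy would presumably track the minimum and maximum timestamp per segment and only merge segments whose ranges fit into one bucket of the next tier, which is what keeps whole-segment drops clean.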

I had some ideas about adding statistics to DocValues that get created
at index time. You can already do that and expose them via
Attributes; maybe we can add some API to DocValues that you can hook into so
that you don't need to write your own DV impl.
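
For c), a rough sketch of what comparing term frequencies across time could look like once per-window term counts are available. This is an assumption-laden illustration in plain Java, not something from Lucene: `Trending` and `scores` are hypothetical names, the counts are passed in as plain maps rather than read from an index, and the ratio with a +1 smoothing term is just one simple choice of score.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not a Lucene API: scores terms by how much their share of a window grew. */
final class Trending {
    /**
     * For each term in the recent window, returns the ratio of its relative frequency
     * in the recent window to its relative frequency in the baseline window.
     * The +1 terms keep unseen baseline terms and empty windows from dividing by zero.
     */
    static Map<String, Double> scores(Map<String, Integer> recent, Map<String, Integer> baseline) {
        long recentTotal = recent.values().stream().mapToLong(Integer::longValue).sum();
        long baseTotal = baseline.values().stream().mapToLong(Integer::longValue).sum();
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Integer> e : recent.entrySet()) {
            double r = (e.getValue() + 1.0) / (recentTotal + 1.0);
            double b = (baseline.getOrDefault(e.getKey(), 0) + 1.0) / (baseTotal + 1.0);
            out.put(e.getKey(), r / b);
        }
        return out;
    }
}
```

A term that is rare in the baseline but frequent in the recent window scores well above 1, while a term that is common in both stays near or below 1. With time-organized segments, the two maps would presumably correspond to term statistics from a recent segment and from an older, larger one.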
>> Cheers
>> Mark