lucene-dev mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Continuous stream indexing and time-based segment management
Date Tue, 19 Jun 2012 21:39:46 GMT
> you can do that by subclassing IW and call some package private APIs /


To date I have used separate physical indexes, combined them with a MultiReader, and dropped
the outdated indexes as they expire.
At least this has the benefit that a custom MergePolicy is not required to keep content from
the different dates segregated.
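
To make the pattern concrete, here is a minimal sketch of the bookkeeping only. It does not use
Lucene at all: the class name TimePartitionedStore and its methods are hypothetical, each
"partition" is just an in-memory list standing in for one physical index, and searchAll plays
the role that a MultiReader over the live indexes would play.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for "one physical index per day". In a real setup each
// partition would be its own Lucene index and the combined view a MultiReader.
class TimePartitionedStore {
    // day number -> documents added on that day (toy replacement for an index)
    private final TreeMap<Long, List<String>> partitions = new TreeMap<>();

    void add(long day, String doc) {
        partitions.computeIfAbsent(day, d -> new ArrayList<>()).add(doc);
    }

    // Combined view over every live partition, oldest day first.
    List<String> searchAll(String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> docs : partitions.values()) {
            for (String doc : docs) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    // Drop whole partitions older than the given day; no per-document deletes
    // and no merge policy involved. Returns how many partitions were dropped.
    int dropOlderThan(long oldestDayToKeep) {
        Map<Long, List<String>> outdated = partitions.headMap(oldestDayToKeep, false);
        int dropped = outdated.size();
        outdated.clear(); // clearing the view removes the entries from the backing map
        return dropped;
    }

    int partitionCount() {
        return partitions.size();
    }
}
```

Because content from different days never shares a partition, expiring a day is a single
map removal here, which mirrors why no custom MergePolicy is needed in the multi-index setup.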

Where I saw the potential was when looking at how stream processing technologies such as S4
or Esper try to count things in time windows.
It struck me that careful organisation of Lucene segments along time units could provide an
efficient means of accessing and comparing counts of many things over time.
It looked like the "Hello World" example in S4 for counting top Twitter topics instantiated
a Java object per unique topic String, which was then responsible for maintaining counts on
things - this seems a fairly inefficient way of modelling things.
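
As a rough illustration of the alternative, the sketch below keeps one term-to-count map per
time unit rather than one object per topic. It is plain Java, not Lucene: the class
TimeBucketedCounts and its methods are hypothetical names, and each per-unit map is only an
analogy for the term statistics a time-aligned segment could provide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: counts are grouped by time unit, so comparing a term
// across time is two lookups rather than a walk over per-topic objects.
class TimeBucketedCounts {
    // time unit (e.g. hour number) -> term -> number of occurrences in that unit
    private final TreeMap<Long, Map<String, Integer>> buckets = new TreeMap<>();

    void record(long timeUnit, String term) {
        buckets.computeIfAbsent(timeUnit, t -> new HashMap<>())
               .merge(term, 1, Integer::sum);
    }

    // Count of a term within one time unit; 0 if the unit or term is unknown.
    int count(long timeUnit, String term) {
        Map<String, Integer> bucket = buckets.get(timeUnit);
        return bucket == null ? 0 : bucket.getOrDefault(term, 0);
    }

    // Change in count between two time units; a large positive value is one
    // crude signal that a term may be "trending".
    int delta(String term, long earlier, long later) {
        return count(later, term) - count(earlier, term);
    }
}
```

For example, after recording "lucene" once in unit 1 and twice in unit 2,
delta("lucene", 1, 2) returns 1.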

>> If you are willing/able to close the IndexWriter, it's easy to drop segments by reading
>> the SegmentInfos, editing, and writing back.

My assumption was that ultimately that's what it comes down to - I just wonder if this is
likely to be a common requirement, deserving of a supported API.



> members. We can certainly make that easier but I personally don't want
> to open this as a public API. I can certainly imagine having a
> protected API that allows dropping an entire segment.
>
> simon
>
>> c) Various new analysis functions comparing term frequencies across time, e.g. discovery
>> of "trending" topics.
>>
>> I can see that a) could be implemented using a custom MergePolicy and c) can be done
>> via existing APIs but I'm not sure if there is a way to simply drop entire segments currently?
>>
>> Anyone else had thoughts in this area?

I had some ideas to add statistics to DocValues that get created
during index time. You can already do that and expose it via
Attributes; maybe we can add some API to DocValues you can hook into so
that you don't need to write your own DV impl.
>>
>> Cheers
>> Mark
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>


