kafka-dev mailing list archives

From Chris Burroughs <chris.burrou...@gmail.com>
Subject Re: On time/offset indexs
Date Wed, 27 Jul 2011 17:22:17 GMT
- Per partition or segment.

I think per segment is more useful and easier.  If it's per segment we
can just delete the index at the same time as the segment.  For per
partition I think we would have to do something other than append to a file.
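To illustrate why per-segment is the easier choice, here is a minimal sketch (the file names and `.timeindex` suffix are assumptions for illustration, not Kafka's actual on-disk format): each segment owns its index file, so deleting the segment deletes the index in the same step.

```java
import java.io.File;

public class SegmentWithIndex {
    // Hypothetical layout: 00000000000000000042.log + 00000000000000000042.timeindex
    final File log;
    final File index;

    SegmentWithIndex(File dir, long baseOffset) {
        String name = String.format("%020d", baseOffset);
        this.log = new File(dir, name + ".log");
        this.index = new File(dir, name + ".timeindex");
    }

    // Retiring a segment and its index is one operation; a per-partition
    // index would instead need entries rewritten or compacted when old
    // segments are removed.
    void delete() {
        log.delete();
        index.delete();
    }
}
```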

- Use cases.

One minute is the finest resolution I could ever see using for web access
logs, verbose GC, syslog-type data. (I'd be interested in hearing use
cases for finer resolution.) I agree it should be configurable.

- Low volume.

If the data volume is that low, I suspect the extra seek to read the
index wouldn't be worth it.  Maybe there is a not-too-clever way to only
update the index if new data is flowing in?
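One such not-too-clever guard might look like this (a sketch, assuming the indexer runs on a fixed interval; the class and method names are made up): record an entry only when the log end offset has advanced since the last entry, so an idle partition produces no duplicate rows.

```java
// Hypothetical interval-driven writer: skips an index entry when no new
// data has arrived, avoiding runs of entries pointing at the same offset.
public class TimeIndexWriter {
    private long lastIndexedOffset = -1L;

    // Called once per interval (e.g. every minute) by some scheduler.
    // Appends "timestamp:offset" and returns true only if data advanced.
    boolean maybeAppend(long logEndOffset, StringBuilder index, long nowMs) {
        if (logEndOffset == lastIndexedOffset) {
            return false; // no new data since the last entry; skip
        }
        index.append(nowMs).append(':').append(logEndOffset).append('\n');
        lastIndexedOffset = logEndOffset;
        return true;
    }
}
```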

- Should there be an option to turn it off?

Sure. If you never "rewind the queue" or start from a time other than
now, the index would be useless to you.

On 07/27/2011 11:51 AM, Jun Rao wrote:
> Adding a separate index file is possible. Will there be 1 index file per
> partition or per segment? Is a 1 minute interval good enough for typical use
> cases? Should we make the interval configurable? One downside of this is
> that for low volume data, there will be lots of entries pointing to the same
> offset. Maybe we should make the index optional?
>
> Jun
>
> On Tue, Jul 26, 2011 at 6:19 PM, Chris Burroughs
> <chris.burroughs@gmail.com> wrote:
>> So for good reason [1] Kafka doesn't keep a complicated time --> offset
>> index.  Whatever the start and end of the log file happen to be is what
>> you get.  We can approximate finer-grained time indexes with smaller log
>> files [2] and getOffsetsBefore, but we would really prefer not to have
>> lots of small files everywhere.
>>
>> To solve the case of wanting time-based indexes without lots of files,
>> could we have another append-only companion file for each Log that
>> periodically (I'm thinking on the order of 1 minute) gets
>> timestamp:offset appended to it?  That should have low overhead, and if
>> the companion file is missing/deleted/etc. we can still use the current
>> logic.
>>
>> [1] "Furthermore the complexity of maintaining the mapping from a random
>> id to an offset requires a heavy weight index structure which must be
>> synchronized with disk, essentially requiring a full persistent
>> random-access data structure." http://sna-projects.com/kafka/design.php
>>
>> [2] And KAFKA-40 would make this easier to do.
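The timestamp:offset companion file proposed in the quoted message could be read back as follows (a sketch under assumed names; entries are shown in memory rather than on disk, and the fallback behavior when the index is missing is exactly the current getOffsetsBefore logic):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory view of the proposed companion file:
// one "timestamp:offset" pair appended roughly once per minute.
public class TimeOffsetIndex {
    private final List<long[]> entries = new ArrayList<>(); // {timestampMs, offset}

    void append(long timestampMs, long offset) {
        entries.add(new long[] {timestampMs, offset});
    }

    // Last offset recorded at or before targetMs, or -1 if the index has
    // no entry that old (caller falls back to the existing segment logic).
    long offsetBefore(long targetMs) {
        long result = -1L;
        for (long[] e : entries) {      // entries are append-ordered by time
            if (e[0] > targetMs) break;
            result = e[1];
        }
        return result;
    }
}
```

Because the file is append-only and entries arrive in time order, a lookup needs only a scan (or binary search) over a tiny file, and losing the file degrades gracefully to today's behavior.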
