incubator-cassandra-user mailing list archives

From "Philip O'Toole" <>
Subject Re: Guaranteeing globally unique TimeUUID's in a high throughput distributed system
Date Sat, 16 Mar 2013 21:40:46 GMT
On Sat, Mar 16, 2013 at 2:24 PM, Josh Dzielak <> wrote:
> I have a system where a client sends me arbitrary JSON events containing a
> timestamp at millisecond resolution. The timestamp is used to generate
> column names of type TimeUUIDType.
> The problem I run into is this - if a client sends me 2 events with the same
> timestamp, the TimeUUID that gets generated for each is the same, and we get
> 1 insert and 1 update instead of 2 inserts. I might be running many
> processes (in my case Storm supervisors) on the same node, so the
> machine-specific part of the UUID doesn't help.
> I have noticed how the Cassandra UUIDGen class lets you work around this. It
> has a 'createTimeSafe' method that adds extra precision to the timestamp
> such that you can actually get up to 10k unique UUID's for the same
> millisecond. That works pretty well for a single process (although it's
> still possible to go over 10k, it's unlikely in our actual production
> scenario). It does make searches at boundary conditions a little
> unpredictable ('equal' may or may not work depending on whether extra ns
> intervals were added), but I can live with that.
> However, this still leaves vulnerability across a distributed system. If 2
> events arrive in 2 processes at the exact same millisecond, one will
> overwrite the other. If events keep flowing to each process evenly over the
> course of the millisecond, we'll be left with roughly half the events we
> should have. To work around this, I add a distinct 'component id' to my row
> keys that roughly equates to a Storm worker or a JVM process I can cheaply
> synchronize.
> The real problem is that this trick of adding ns intervals only works when
> you are generating timestamps from the current time (or any time that's
> always increasing). As I mentioned before, my client might be providing a
> past or future timestamp, and I have to find a way to make sure each one is
> unique.
> For example, a client might send me 10k events with the same millisecond
> timestamp today, and 10k again tomorrow. Using the standard Java library
> stuff to generate UUID's, I'd end up with only 1 event stored, not 20,000.
> The warning in UUIDGen.getTimeUUIDBytes is clear about this.

It is a mistake, IMHO, to use the timestamp contained within the event
to generate the time-based UUID. While it will work, it suffers from
exactly the problem you describe. Instead, use the clock of the host
system to generate the timestamp. In other words, the event timestamp
may be different from the timestamp in the UUID. In fact, it *will* be
different if the rate gets fast enough, since the 100ns clock used to
generate time-based UUIDs may not be fine-grained enough, and the UUID
timestamp will then be increased to keep it unique, as explained by
RFC 4122.
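
To make that concrete, here is a minimal sketch of the idea, not
Cassandra's UUIDGen and not a complete RFC 4122 implementation. The
class name HostClockTimeUuid is made up for illustration. It assumes
each process picks a random node and clock sequence at startup (so
processes sharing a host do not collide on the MAC address), and that
bumping the timestamp by one 100ns tick whenever the host clock has not
advanced is acceptable. Under sustained load the UUID timestamp can
therefore run slightly ahead of the wall clock.

```java
import java.security.SecureRandom;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: version-1 style UUIDs driven by the host clock.
final class HostClockTimeUuid {
    // 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    // Last timestamp handed out, in 100 ns ticks since the UUID epoch.
    private final AtomicLong lastTicks = new AtomicLong(0);
    // Variant bits, clock sequence and node; fixed for this process.
    private final long clockSeqAndNode;

    HostClockTimeUuid() {
        SecureRandom rnd = new SecureRandom();
        // Random 48-bit node with the multicast bit set, which marks it
        // as not being a real MAC address.
        long node = (rnd.nextLong() & 0x0000FFFFFFFFFFFFL) | 0x0000010000000000L;
        long clockSeq = rnd.nextInt(1 << 14); // 14-bit clock sequence
        this.clockSeqAndNode = 0x8000000000000000L | (clockSeq << 48) | node;
    }

    UUID next() {
        // The host clock only has millisecond resolution here.
        long now = System.currentTimeMillis() * 10_000L + UUID_EPOCH_OFFSET;
        // If the clock has not moved past the last value, step forward by
        // one tick so every UUID from this generator is distinct.
        long ticks = lastTicks.updateAndGet(prev -> Math.max(prev + 1, now));
        long msb = (ticks << 32)                           // time_low
                 | ((ticks >>> 16) & 0x00000000FFFF0000L)  // time_mid
                 | ((ticks >>> 48) & 0x0FFFL)              // time_hi
                 | 0x1000L;                                // version 1
        return new UUID(msb, clockSeqAndNode);
    }
}
```

With something like this, the event's own timestamp would be stored
alongside the event (in the column value, say) rather than encoded in
the column name. Uniqueness across processes rests on the random node
and clock sequence differing, which is very likely but not guaranteed.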

> Adapting the ns-adding 'trick' to this problem requires synchronized
> external state (i.e. storing that the current ns interval for millisecond
> 12330982383 is 1234, etc) - definitely a non-starter.
> So, my dear, and far more seasoned Cassandra users, do you have any
> suggestions for me?
> Should I drop TimeUUID altogether and just make column names a combination
> of millisecond and a big enough random part to be safe? e.g.
> '1363467790212-a6c334fefda'. Would I be able to run proper slice queries if
> I did this? What other problems might crop up? (It seems too easy :)
> Or should I just create a normal random UUID for every event as the column
> key and create the non-unique index by time in some other way?
> Would appreciate any thoughts, suggestions, and off-the-wall ideas!
> PS- I assume this could be a problem in any system (not just Cassandra)
> where you want to use 'time' as a unique index yet might have multiple
> records for the same time. So any solutions from other realms could be
> useful too.
> --
> Josh Dzielak
> VP Engineering • Keen IO
> Twitter • @dzello
> Mobile • 773-540-5264
