Hi,
I'm evaluating Storm for use in a fairly high-throughput environment: ~100
billion messages/day on internal servers. I've seen throughput figures for
Storm quoted at around a million tuples/second/server, but I wonder if
anyone could help clarify the precise definition of that metric.
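For context, a quick back-of-envelope conversion of our target load, assuming messages arrive evenly over the day (real traffic would of course be burstier):

```python
# Convert a daily message volume to an average per-second rate.
messages_per_day = 100_000_000_000   # ~100 billion/day
seconds_per_day = 24 * 60 * 60       # 86,400

messages_per_second = messages_per_day / seconds_per_day
print(f"{messages_per_second:,.0f} messages/second")  # 1,157,407 messages/second
```

So even under the most generous reading of "a million tuples/second/server", we'd be looking at more than one server's worth of input.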
For discussion's sake, assume a trivial topology: a single spout S1 feeding
a single bolt B1, which sends each input tuple to two downstream bolts, B2
and B3. In that case, if the cluster were running at "1,000,000
tuples/second", what would the input rate be?
1. Does it mean a million tuples per second coming into S1?
2. Or is each input tuple counted as three tuples: one from S1 to B1, and
one each from B1 to B2 and B1 to B3, giving an input rate of ~333,000
tuples/second?
3. Are acks counted as tuples as well, giving perhaps 6 tuples for every
input tuple and an effective input rate of ~166,000 tuples/second?
4. Are there perhaps even more tuples in the mix (metrics, etc.), lowering
the effective input rate further?
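To make the arithmetic behind those interpretations concrete, here is a small sketch; the per-input-tuple counts of 3 and 6 are the hypothetical factors from the list above, not figures from the Storm documentation:

```python
quoted_rate = 1_000_000  # tuples/second/server, as commonly quoted

# Interpretation 1: every counted tuple is an input tuple at S1.
input_rate_1 = quoted_rate

# Interpretation 2: each input yields 3 emitted tuples
# (S1->B1, B1->B2, B1->B3), so the input rate is a third.
input_rate_2 = quoted_rate / 3

# Interpretation 3: each of those 3 tuples also carries an ack,
# so ~6 internal tuples per input tuple.
input_rate_3 = quoted_rate / 6

print(f"{input_rate_1:,.0f} / {input_rate_2:,.0f} / {input_rate_3:,.0f}")
# 1,000,000 / 333,333 / 166,667
```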
Having established the definition of tuples/second: is a million per server
still the current benchmark, or might it be a bit higher these days thanks
to hardware improvements? Is it safe to assume that the million-tuple figure
was measured with bolts doing some sort of modest computation, or is it just
the bare-bones case of moving tuples in and out as fast as possible without
doing anything with them? What kind of numbers are people seeing per server
in practical deployments?
Also, has anyone experimented with running Storm with hyperthreading on vs.
off? Our experience with Hadoop has been that hyperthreading either helps
performance very little or hurts it slightly; would the same be expected
for Storm?
Thanks!
