incubator-cassandra-user mailing list archives

From Jeremy Hanna <>
Subject Re: Differences in row iteration behavior
Date Sat, 15 Sep 2012 03:21:29 GMT
Are there any deletions in your data?  The Hadoop support doesn't filter out tombstones, though
you may not be filtering them out in your code either.  I've used the Hadoop support for
a lot of data validation in the past, and as long as you're sure that your code is sound, I'm
pretty confident in it.
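
To make the tombstone point concrete, here is a minimal sketch in plain Python. It is not Hadoop or Cassandra code. It assumes each row reaches the counter as a (key, columns) pair and that a deleted row still shows up with an empty column map until it is purged. The names mapper, reducer, count_rows, and skip_empty are made up for illustration.

```python
from collections import defaultdict

def mapper(key, columns, skip_empty):
    """Emit ("rows", 1) for a row; optionally skip rows with no live columns."""
    if skip_empty and not columns:
        return  # assumption: an empty column map means the row was deleted
    yield ("rows", 1)

def reducer(pairs):
    """Sum the values emitted for each key."""
    totals = defaultdict(int)
    for k, v in pairs:
        totals[k] += v
    return dict(totals)

def count_rows(rows, skip_empty):
    """Run the map and reduce steps in-process over (key, columns) pairs."""
    pairs = (pair
             for key, columns in rows
             for pair in mapper(key, columns, skip_empty))
    return reducer(pairs).get("rows", 0)

# Hypothetical data: two live rows and two deleted rows with no columns left.
rows = [("a", {"c1": "v"}), ("b", {}), ("c", {"c1": "v"}), ("d", {})]

raw_count = count_rows(rows, skip_empty=False)   # counts every key seen
live_count = count_rows(rows, skip_empty=True)   # counts only rows with columns
print(raw_count, live_count)
```

If one job counts every key it is handed and the other counts only rows that still have columns, the two totals diverge by the number of deleted rows, even though neither job has a bug in the usual sense.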

On Sep 14, 2012, at 10:07 PM, Todd Fast <> wrote:

> Hi--
> We are iterating rows in a column family two different ways and are seeing radically
> different row counts. We are using 1.0.8 and RandomPartitioner on a 3-node cluster.
> In the first case, we have a trivial Hadoop job that counts 29M rows using the standard
> MR pattern for counting (mapper outputs a single key with a value of 1, reducer adds up all
> the values).
> In the second case, we have a simple Quartz batch job which counts only 10M rows. We
> are iterating using chained calls to get_range_slices, as described on the wiki:
> We've also implemented the batch job using Pelops, with and without chaining. In all cases,
> the job counts just 10M rows, and it is not encountering any errors.
> We are confident that we are doing everything right in both cases (no bugs), yet the
> results are baffling. Tests in smaller, single-node environments result in consistent counts
> between the two methods, but we don't have the same amount of data nor the same topology.
> Is the right answer 29M or 10M? Any clues to what we're seeing?
> Todd
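
For reference, the chained range iteration described in the quoted message can be sketched as follows. This is plain Python over an in-memory dict, not the Thrift or Pelops API. fetch_page is a hypothetical stand-in for a single range-slice call, and the sketch assumes the start key of each call is inclusive, so every page after the first repeats the last row of the previous page.

```python
import hashlib

def token(key):
    """Order keys by MD5 hash, roughly as RandomPartitioner orders rows."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def fetch_page(store, start_key, count):
    """Hypothetical stand-in for one range-slice call.

    Returns up to `count` (key, columns) pairs in token order, starting at
    `start_key` (inclusive). A start_key of None means "from the beginning".
    """
    ordered = sorted(store.items(), key=lambda kv: token(kv[0]))
    if start_key is not None:
        start = token(start_key)
        ordered = [kv for kv in ordered if token(kv[0]) >= start]
    return ordered[:count]

def iterate_rows(store, page_size):
    """Yield every row exactly once by chaining pages."""
    if page_size < 2:
        # With an inclusive start key, a page of 1 would never advance.
        raise ValueError("page_size must be at least 2")
    start_key = None
    first_page = True
    while True:
        page = fetch_page(store, start_key, page_size)
        # Every page after the first begins with a row already yielded.
        rows = page if first_page else page[1:]
        for row in rows:
            yield row
        if len(page) < page_size:
            break  # a short page means the range is exhausted
        start_key = page[-1][0]
        first_page = False

# Hypothetical data: 25 rows, paged 10 at a time.
store = {"key%d" % i: {"c": i} for i in range(25)}
keys = [key for key, _ in iterate_rows(store, page_size=10)]
print(len(keys), len(set(keys)))
```

Two things in a loop like this can change a count: whether the repeated first row of each page is skipped, and whether the loop stops only on a short page rather than on some other condition.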
