incubator-cassandra-user mailing list archives

From Todd Fast <t...@conga.com>
Subject Differences in row iteration behavior
Date Sat, 15 Sep 2012 03:07:11 GMT
Hi--

We are iterating over the rows in a column family in two different 
ways and are seeing radically different row counts. We are using 
Cassandra 1.0.8 and RandomPartitioner on a 3-node cluster.

In the first case, we have a trivial Hadoop job that counts 29M rows 
using the standard MR pattern for counting (mapper outputs a single key 
with a value of 1, reducer adds up all the values).
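
For reference, the counting pattern amounts to the following. This is 
only a plain-Python sketch of the map and reduce steps, not our actual 
Hadoop job; the names mapper, reducer, and run_count and the constant 
key "rows" are made up for illustration.

```python
from collections import defaultdict


def mapper(row_key, columns):
    # Emit the same constant key for every input row, with a value of 1.
    # "rows" is an arbitrary, hypothetical key name.
    yield ("rows", 1)


def reducer(key, values):
    # Add up all the 1s emitted for the key.
    return key, sum(values)


def run_count(rows):
    # Tiny local stand-in for the shuffle phase: group values by key,
    # then hand each group to the reducer.
    grouped = defaultdict(list)
    for row_key, columns in rows:
        for key, value in mapper(row_key, columns):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())
```

With this pattern the reported total is simply the number of times the 
mapper was invoked, so it reflects however many rows the input format 
handed to the job.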

In the second case, we have a simple Quartz batch job which counts only 
10M rows. We are iterating using chained calls to get_range_slices, as 
described on the wiki: http://wiki.apache.org/cassandra/FAQ#iter_world 
We've also implemented the batch job using Pelops, with and without 
chaining. In all cases, the job counts just 10M rows, and it is not 
encountering any errors.
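
To make the chaining concrete, here is a minimal sketch of the paging 
logic, assuming the usual approach of starting each page at the last 
key of the previous one. The Thrift call the FAQ entry uses is 
get_range_slices; everything else here is an assumption for 
illustration: FakeColumnFamily is an in-memory stand-in rather than a 
real client, its get_range_slices takes a simplified (start_key, count) 
signature, and the MD5-based ordering only approximates how 
RandomPartitioner orders rows.

```python
import hashlib
from bisect import bisect_left


class FakeColumnFamily:
    """Hypothetical in-memory stand-in for a column family."""

    def __init__(self, keys):
        # Rows come back in token order, not key order. The token here
        # is a simplified MD5-derived integer.
        self._rows = sorted((self._token(k), k) for k in set(keys))

    @staticmethod
    def _token(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_range_slices(self, start_key, count):
        # Return up to `count` row keys in token order, beginning at
        # start_key (inclusive). An empty start_key means "from the start".
        if start_key == "":
            start = 0
        else:
            start = bisect_left(self._rows, (self._token(start_key), start_key))
        return [key for _, key in self._rows[start:start + count]]


def count_rows(cf, page_size=100):
    # The start key is inclusive, so each page after the first repeats one
    # row. A page size of 1 could therefore never make progress.
    if page_size < 2:
        raise ValueError("page_size must be at least 2")
    total = 0
    start_key = ""
    first_page = True
    while True:
        raw = cf.get_range_slices(start_key, page_size)
        # Skip the first row of every page but the first: it is the last
        # row of the previous page and has already been counted.
        fresh = raw if first_page else raw[1:]
        total += len(fresh)
        if len(raw) < page_size:
            break  # a short page means the range is exhausted
        start_key = raw[-1]
        first_page = False
    return total
```

The two details that matter are skipping the repeated first row on 
every page after the first, and stopping only when a page comes back 
short rather than when some expected total is reached.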

We are confident that we are doing everything right in both cases (no 
bugs), yet the results are baffling. Tests in smaller, single-node 
environments result in consistent counts between the two methods, but 
those environments have neither the same amount of data nor the same 
topology as the 3-node cluster.

Is the right answer 29M or 10M? Any clues to what we're seeing?

Todd
