So I can have one PagedIndex CF that holds a row for each data file I am processing.

The row (in my example) would have X columns, and I can make those columns' values be 100 strings that represent keys into another PagedData CF.

In this other PagedData CF, each row would have 10,000 columns whose values hold my data. I would loop through those, parallelize, and scale so I can do this 100 times simultaneously.
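A minimal sketch of the pattern (simulating the two CFs with plain Python dicts rather than a real Cassandra client; the names `paged_index`, `paged_data`, and `page_worker`, and the tiny page size, are mine for illustration, not an API):

```python
from concurrent.futures import ThreadPoolExecutor

# Simulate the two column families with dicts: PagedIndex maps a file
# to the keys of its PagedData rows; PagedData holds the actual values.
paged_data = {f"file1:page{p}": list(range(p * 10, p * 10 + 10))
              for p in range(100)}                  # 100 tiny pages of 10
paged_index = {"file1": sorted(paged_data)}         # one index row per file

def page_worker(page_key):
    # Each worker pulls one PagedData row and processes its columns.
    return sum(paged_data[page_key])

# Fan the pages out across workers, one PagedData row per task.
with ThreadPoolExecutor(max_workers=8) as pool:
    totals = list(pool.map(page_worker, paged_index["file1"]))

print(sum(totals))  # grand total across all pages
```

The point of the index row is that each worker fetches a *different* PagedData row, so the read load spreads across the cluster instead of hammering one row's replicas.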

This is really awesome because if I have 10 files, each with a billion rows, and I push them into this pattern, I can scale quite nicely, provided 10,000 is my magic number of columns per page. For 10,000,000,000 rows I would have in my PagedIndex CF 100 rows of 10,000 columns each, each column pointing at a PagedData row that has data. For each column I can then pull that PagedData row, pulling out 10,000 pieces of data to process, 100 at a time on different servers.
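A quick back-of-the-envelope check of that fan-out (just the numbers from this thread; the variable names are mine):

```python
# Sanity-check: 100 PagedIndex rows x 10,000 index columns, each pointing
# at a PagedData row of 10,000 columns, should cover 10 billion items.
FILES = 10
ROWS_PER_FILE = 1_000_000_000
TOTAL = FILES * ROWS_PER_FILE              # 10,000,000,000 data items

DATA_COLS_PER_ROW = 10_000                 # columns per PagedData row
INDEX_COLS_PER_ROW = 10_000                # columns per PagedIndex row
INDEX_ROWS = 100                           # rows in the PagedIndex CF

paged_data_rows = TOTAL // DATA_COLS_PER_ROW        # PagedData rows needed
index_capacity = INDEX_ROWS * INDEX_COLS_PER_ROW    # pointers available

print(paged_data_rows, index_capacity)  # 1000000 1000000
```

So one million PagedData rows, exactly matched by one million index pointers.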

got it, thanks! awesome!

On Sun, Jun 5, 2011 at 4:36 PM, Jonathan Ellis <> wrote:
If you need to parallelize (and scale) you need to distribute across
multiple rows. One Big Row means all your 100 workers are hammering
the same 3 (for instance) replicas at the same time.

On Sun, Jun 5, 2011 at 1:43 PM, Joseph Stein <> wrote:
> What are the best practices here to page and slice columns from a row?
> So let's say I have 1,000,000 columns in a row.
> I read the row but want to have one thread read columns 0 - 9999, a second
> thread (an actor in my case) read 10000 - 19999, and so on, so I can have 100
> workers processing 10,000 columns for each of my rows.
> If there is no API for this, then is it something I should put a composite key
> on, populating the column names from a counter:
> 0000000:myoriginalcolumnnameX
> 0000001:myoriginalcolumnnameY
> 0000002:myoriginalcolumnnameZ
> Going the composite key route and doing a start/end predicate would work, but
> then it kind of makes the insertion/load have to go through a single
> synchronized point to generate the column names... I am not opposed
> to this but would prefer that both the load of my data and the processing of it
> not be bound by any single lock (even if distributed).
> Thanks!!!!
> /*
> Joe Stein
> Twitter: @allthingshadoop
> */

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support


Joe Stein
Twitter: @allthingshadoop