From "Marcelo Valle (BLOOMBERG/ LONDON)" <>
Subject Re:Fastest way to map/parallel read all values in a table?
Date Mon, 09 Feb 2015 10:24:14 GMT
Just for the record, I did exactly the same thing in an internal application at the startup
where I used to work. We needed to process all rows of a column family in parallel. Normally
we would use Spark for the job, but in our case the logic was a little more complicated, so
we wrote custom code.

What we did was run N processes on each of M machines (N cores in each), each one processing
tasks. The tasks were created by splitting the range -2^63 to 2^63 - 1 into N*M*10 tasks. Even
if data was not evenly distributed across the tasks, no machine sat idle, because whenever a
task was completed another one was taken from the task pool.
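
To make the task-pool idea concrete, here is a minimal sketch in Python. It is not the code we actually ran: the names split_token_range, process_range and run_all are made up for illustration, the worker only counts tokens instead of querying Cassandra, and a single-process thread pool stands in for the N processes on M machines.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_range(num_tasks):
    """Split [MIN_TOKEN, MAX_TOKEN] into num_tasks contiguous, inclusive (start, end) ranges."""
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in total
    bounds = [MIN_TOKEN + (total * i) // num_tasks for i in range(num_tasks + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_tasks)]


def process_range(token_range):
    """Hypothetical placeholder worker: a real one would read the rows in this range."""
    start, end = token_range
    return end - start + 1  # number of tokens covered, so the result can be checked


def run_all(machines, cores_per_machine, workers):
    """Create machines * cores_per_machine * 10 tasks and drain them from a shared pool."""
    tasks = split_token_range(machines * cores_per_machine * 10)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each idle worker picks up the next pending task, so a slow range
        # does not leave the other workers waiting.
        futures = [pool.submit(process_range, task) for task in tasks]
        return sum(f.result() for f in as_completed(futures))
```

Creating about ten times more tasks than workers is what keeps everyone busy: a worker that lands on a sparse range finishes early and simply takes another task.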

It was fast enough for us, but I am interested in knowing if there is a better way of doing it.

For your specific case, here is a tool we released as open source that can be useful for
simpler tests:

Also, you probably know this already, but I would consider using Spark for this.

Best regards,

What’s the fastest way to map/parallel read all values in a table?

Kind of like a mini map only job.

I’m doing this to compute stats across our entire corpus.

What I did to begin with was use token() and then split it into the number of splits I needed.

So I just took the total key range space which is -2^63 to 2^63 - 1 and broke it into N parts.

Then the queries come back as:

select * from mytable where token(primaryKey) >= x and token(primaryKey) < y
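
As a rough sketch of how those per-split queries could be generated (Python, with a hypothetical helper name token_range_queries; the table and key names are the ones from the query above), assuming the last split is given an inclusive upper bound so that the token 2^63 - 1 is not skipped:

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_range_queries(table, key, num_splits):
    """Build one CQL query string per split of the full token range."""
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in total
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    queries = []
    for i in range(num_splits):
        lo, hi = bounds[i], bounds[i + 1]
        if i == num_splits - 1:
            # Last split: hi would be 2**63, which is out of range, so use
            # an inclusive bound on MAX_TOKEN instead.
            cond = f"token({key}) >= {lo} and token({key}) <= {MAX_TOKEN}"
        else:
            cond = f"token({key}) >= {lo} and token({key}) < {hi}"
        queries.append(f"select * from {table} where {cond}")
    return queries
```

The function only builds strings; running them is left to whatever driver is in use.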

From reading on this list I thought this was the correct way to handle this problem.

However, I’m seeing horrible performance doing this.  After about 1% it just flat out locks up.

Could it be that I need to randomize the token order so that it’s not contiguous?  Maybe
it’s all mapping on the first box to begin with.
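
One cheap way to test that hunch is to shuffle the list of splits before issuing the queries, so that consecutive queries are less likely to target neighbouring token ranges. This is only a sketch with a hypothetical helper name, shuffled_ranges; whether it actually fixes the stall is not established here.

```python
import random


def shuffled_ranges(ranges, seed=None):
    """Return a shuffled copy of the (start, end) token ranges; the input list is left untouched."""
    rng = random.Random(seed)  # a fixed seed makes a test run reproducible
    shuffled = list(ranges)
    rng.shuffle(shuffled)
    return shuffled
```

Every range is still visited exactly once; only the order changes.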


