cassandra-user mailing list archives

From Michael Kjellman
Subject Re: hadoop consistency level
Date Thu, 18 Oct 2012 21:03:06 GMT
1. Yes, you can absolutely benefit from data locality, and the InputSplits
will theoretically schedule the map task on Cassandra+Hadoop nodes that
have the data locally. If your application doesn't require you to worry
about that one pesky row that should be local to that node (and that node
is responsible for it but for some reason the data isn't there) then go
ahead and run it with CL ONE. In a perfect world all of the rows should be
there, but any seasoned Cassandra user knows that exceptions happen.

If what Bryan says is right, then in your first MR job the mapper would
miss that row, but a subsequent run would see it, since read repair would
have been triggered in the background. Once again, how important is it
that you get all of your data 100% of the time?

2. I would consider thinking a little more about your project if you are
planning on using Hadoop only for data locality. I would say it depends if
your workload would benefit from Hadoop and distributed processing. Hadoop
provides many benefits, but if you require QUORUM consistency and you
don't have a workload that lends itself to an input > output distributed
model, then Hadoop might not be the right tool for the job.
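
To put rough numbers on the CL ONE vs. QUORUM trade-off, here is a minimal
sketch of how many replicas have to answer a read at each level. It assumes
the usual definition of QUORUM as floor(RF / 2) + 1, where RF is the
keyspace's replication factor. The function names and the RF values are
made up for illustration; none of this is taken from ConfigHelper or any
other Cassandra code.

```python
def replicas_required(consistency_level: str, replication_factor: int) -> int:
    """Return how many replicas must answer a read at the given level."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # Majority of replicas: floor(RF / 2) + 1
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def can_be_node_local(consistency_level: str, replication_factor: int) -> bool:
    """True when a single replica's answer is enough to satisfy the read."""
    return replicas_required(consistency_level, replication_factor) == 1


if __name__ == "__main__":
    for rf in (1, 3, 5):
        for cl in ("ONE", "QUORUM", "ALL"):
            needed = replicas_required(cl, rf)
            local = can_be_node_local(cl, rf)
            print(f"RF={rf} CL={cl}: {needed} replica(s), node-local possible: {local}")
```

Under that assumption, a QUORUM read at RF=3 needs two replicas rather than
all three, so at least one answer has to come from another node even when
the map task is scheduled on a node that owns the row.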


On 10/18/12 1:52 PM, "Andrey Ilinykh" wrote:

>On Thu, Oct 18, 2012 at 1:34 PM, Michael Kjellman wrote:
>> Not sure I understand your question (if there is one..)
>> You are more than welcome to do CL ONE and assuming you have hadoop
>> in the right places on your ring things could work out very nicely. If
>> you need to guarantee that you have all the data in your job then you'll
>> need to use QUORUM.
>> If you don't specify a CL in your job config it will default to ONE (at
>> least that's what my read of the ConfigHelper source for 1.1.6 shows)
>I have two questions.
>1. I can benefit from data locality (and Hadoop) only with CL ONE. Is
>it correct?
>2. With CL QUORUM cassandra reads data from all replicas. In this case
>Hadoop doesn't give me any benefits. Application running outside the
>cluster has the same performance. Is it correct?
>Thank you,
>  Andrey
