Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of marcelo@s1mbi0se.com.br
 designates 209.85.223.175 as permitted sender)
MIME-Version: 1.0
Date: Mon, 21 Jul 2014 12:24:54 -0300
Message-ID: 
 <CAAX2xq6UhsGfq_gtfjogOV7=Mi8q=5SmRfNM1+KFEXXVk+p8iw@mail.gmail.com>
Subject: map reduce for Cassandra
From: Marcelo Elias Del Valle <marcelo@s1mbi0se.com.br>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=089e0158b0347e2bbf04feb5b528

--089e0158b0347e2bbf04feb5b528
Content-Type: text/plain; charset=UTF-8

Hi,

I need to run a map/reduce job to identify data stored in Cassandra
before indexing this data into Elasticsearch.

I have already used ColumnFamilyInputFormat (before I started using CQL)
to write Hadoop jobs for this, but I used to have a lot of trouble tuning
them, because Hadoop depends on how the map tasks are split in order to
run things in parallel successfully for IO-bound processes.

First question: am I the only one having problems with that? Is anyone
else running Hadoop jobs that read from Cassandra in production?

Second question is about the alternatives. I saw that the new version of
Spark will have Cassandra support, but through Hadoop's
CqlPagingInputFormat. I tried to use Hive with the community edition of
Cassandra, but it seems to work only with Cassandra Enterprise, and it
doesn't do more than Facebook's Presto (http://prestodb.io/), which we
have been using to read from Cassandra. So far Presto has been great for
SQL-like queries. For custom map/reduce jobs, however, it is not enough.

Does anyone know of another tool that performs M/R on Cassandra? My
impression is that most tools were created to work on top of HDFS, and
that reading from a NoSQL database is some kind of "workaround".

Third question is about how these tools work. Most of them write the
mapped data to intermediate storage, then the data is shuffled and
sorted, and then it is reduced. Even when using CqlPagingInputFormat, if
you are using Hadoop it will write files to HDFS after the mapping phase,
shuffle and sort this data, and then reduce it.
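
To make the three phases concrete, here is a minimal in-memory sketch in
Python. It is only an illustration, not how Hadoop implements them: the
function names (map_phase, shuffle_and_sort, reduce_phase) and the sample
rows are made up, and a real job would persist the intermediate data
between phases instead of keeping it in memory.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(rows):
    """Emit one (key, 1) pair per input row.

    Rows are hypothetical (user, event) tuples standing in for
    records read from a Cassandra column family.
    """
    for user, _event in rows:
        yield (user, 1)


def shuffle_and_sort(pairs):
    """Sort the intermediate pairs by key so equal keys become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Sum the values of each run of equal keys."""
    return {
        key: sum(value for _key, value in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


# Made-up sample data, for illustration only.
rows = [("alice", "click"), ("bob", "view"), ("alice", "view")]
counts = reduce_phase(shuffle_and_sort(map_phase(rows)))
print(counts)
```

The sort in the middle step is the part my question below is about: it is
the work that a store which keeps rows ordered on write would already
have done.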

I wonder if a tool supporting Cassandra out of the box wouldn't be
smarter. Is it faster to write all your data to a file and then sort it,
or to batch-insert the data so that it is indexed as it is written, as
happens when you store data in a Cassandra CF? I haven't done the
calculations to check the complexity of each approach. One should
consider that no index in Cassandra would be really large, as the maximum
index size will always depend on the maximum capacity of a single host.
Still, my guess is that a map/reduce tool written specifically for
Cassandra from the beginning could perform much better than a tool
written for HDFS and adapted. I hear people saying that map/reduce on
Cassandra/HBase is usually 30% slower than M/R on HDFS. Does that really
make sense? Should we expect a result like this?

Final question: do you think writing a new M/R tool like the one
described would be reinventing the wheel, or does it make sense?

Thanks in advance. Any opinions on this subject will be much appreciated.

Best regards,
Marcelo Valle.

--089e0158b0347e2bbf04feb5b528--