From: Marcelo Elias Del Valle
To: user@cassandra.apache.org
Date: Mon, 21 Jul 2014 12:24:54 -0300
Subject: map reduce for Cassandra

Hi,

I need to run a map/reduce job to identify data stored in Cassandra before indexing this data in Elasticsearch.

I have already used ColumnFamilyInputFormat (before we started using CQL) to write Hadoop jobs for this, but I used to have a lot of trouble tuning them, because Hadoop depends on how map tasks are split in order to successfully run things in parallel for I/O-bound processes.

First question: Am I the only one having problems with that? Is anyone else running Hadoop jobs that read from Cassandra in production?

Second question is about the alternatives. I saw that the new version of Spark will have Cassandra support, but through CqlPagingInputFormat, from Hadoop. I tried to use Hive with Cassandra community, but it seems it only works with Cassandra Enterprise and doesn't do more than Facebook's Presto (http://prestodb.io/), which we have been using to read from Cassandra and which so far has been great for SQL-like queries. For custom map/reduce jobs, however, it is not enough.

Does anyone know of another tool that performs M/R on Cassandra? My impression is that most tools were created to work on top of HDFS, and reading from a NoSQL database is some kind of "workaround".

Third question is about how these tools work. Most of them write the mapped data to intermediate storage, then the data is shuffled and sorted, and then it is reduced. Even when using CqlPagingInputFormat, if you are using Hadoop it will write files to HDFS after the mapping phase, shuffle and sort this data, and then reduce it.

I wonder if a tool supporting Cassandra out of the box wouldn't be smarter. Is it faster to write all your data to a file and then sort it, or to batch-insert the data and have it indexed on the way in, as happens when you store data in a Cassandra CF? I didn't do the calculations to check the complexity of each approach. One should consider that no index in Cassandra would be really large, as the maximum index size will always depend on the maximum capacity of a single host. Still, my guess is that a map/reduce tool written specifically for Cassandra from the beginning could perform much better than a tool written for HDFS and adapted. I hear people saying Map/Reduce on Cassandra/HBase is usually 30% slower than M/R on HDFS. Does that really make sense? Should we expect a result like this?

Final question: Do you think writing a new M/R tool like the one described would be reinventing the wheel? Or does it make sense?

Thanks in advance. Any opinions on this subject would be much appreciated.

Best regards,
Marcelo Valle.
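For readers less familiar with the pipeline described in the third question, here is a minimal in-memory sketch of the three phases (map, shuffle and sort, reduce). It is only an illustration: it uses neither Hadoop nor Cassandra, and the row layout and all function names (map_phase, shuffle_and_sort, reduce_phase, count_mapper, sum_reducer) are hypothetical. A real framework would persist the mapped pairs to intermediate storage between phases, which is the cost the question is about.

```python
from collections import defaultdict


def map_phase(rows, mapper):
    """Apply the mapper to every input row and collect (key, value) pairs."""
    pairs = []
    for row in rows:
        pairs.extend(mapper(row))
    return pairs


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in key order.

    A real framework would spill these pairs to intermediate storage
    and merge-sort them; here everything stays in memory.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all values for that key."""
    return {key: reducer(key, values) for key, values in groups}


# Hypothetical rows, standing in for data read from a Cassandra table.
rows = [
    {"user": "alice", "event": "click"},
    {"user": "bob", "event": "view"},
    {"user": "alice", "event": "view"},
]


def count_mapper(row):
    # Emit one (user, 1) pair per row.
    yield (row["user"], 1)


def sum_reducer(key, values):
    # Sum the counts emitted for one user.
    return sum(values)


result = reduce_phase(shuffle_and_sort(map_phase(rows, count_mapper)), sum_reducer)
```

In this sketch the grouping step plays the role that an index plays on insert in a Cassandra CF: the data ends up ordered by key either way, and the open question is which path is cheaper at scale.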