Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Reply-To: user@cassandra.apache.org
Date: Thu, 17 Nov 2016 13:58:00 +0000 (UTC)
From: Joe Olson
To: user@cassandra.apache.org
Message-ID: <1616103142.1405.1479391080488.JavaMail.zimbra@nododos.com>
Subject: Any Bulk Load on Large Data Set Advice?
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_1404_1294362120.1479391080479"

------=_Part_1404_1294362120.1479391080479
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

I received a grant to do some analysis on netflow data (Local IP address, Local Port, Remote IP address, Remote Port, time, # of packets, etc.) using Cassandra and Spark. The de-normalized data set is about 13TB out the door. I plan on using 9 Cassandra nodes (replication factor=3) to store the data, with Spark doing the aggregation.
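
For a sense of scale, here is a rough back-of-the-envelope estimate of what each node would end up holding. This is only a sketch: it assumes the 13TB figure is close to the on-disk size, that data spreads evenly across the 9 nodes, and it ignores compression and compaction headroom. The function name is made up for illustration.

```python
def per_node_tb(raw_tb: float, replication_factor: int, nodes: int) -> float:
    """Average data per node in TB.

    Assumptions (not measured): perfectly even token distribution,
    no compression, no extra space reserved for compaction.
    """
    return raw_tb * replication_factor / nodes


# 13 TB of data, replication factor 3, 9 nodes
print(round(per_node_tb(13, 3, 9), 2))  # 4.33
```

So roughly 39TB in total across the cluster, or a bit over 4TB per node before any overhead.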

The data set will be immutable once loaded, and I am using replication factor = 3 to somewhat simulate the real world. Most of the analysis will be of the sort "Give me all the remote IP addresses for source IP 'X' between time t1 and t2".

I built and tested a bulk loader following this example on GitHub: https://github.com/yukim/cassandra-bulkload-example to generate the SSTables, but I have not run it on the entire data set yet.

Any advice on how to execute the bulk load under this configuration? Can I generate the SSTables in parallel? Once generated, can I write the SSTables to all nodes simultaneously? Should I be doing any kind of sorting by the partition key?

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks in advance!

------=_Part_1404_1294362120.1479391080479
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
------=_Part_1404_1294362120.1479391080479--