From: Jeff Jirsa
To: "user@cassandra.apache.org"
Reply-To: user@cassandra.apache.org
Subject: Re: Any Bulk Load on Large Data Set Advice?
Date: Thu, 17 Nov 2016 17:40:20 +0000
In-Reply-To: <1616103142.1405.1479391080488.JavaMail.zimbra@nododos.com>

Other people are commenting on the appropriateness of Cassandra – they may have a point you should consider, but I’m going to answer the question.

1) Yes, you can generate the sstables in parallel
2) If you use sstable bulk loader interface (sstableloader), it’ll stream to all appropriate nodes. You can run sstableloader from multiple nodes at the same time as well.
3) Sorting by partition key probably won’t hurt. If you run jobs in parallel, dividing them up by partition key seems like a good way to parallelize your task.

We do something like this in certain parts of our workflow, and it works well.

From: Joe Olson <technology@nododos.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, November 17, 2016 at 5:58 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Any Bulk Load on Large Data Set Advice?

I received a grant to do some analysis on netflow data (Local IP address, Local Port, Remote IP address, Remote Port, time, # of packets, etc) using Cassandra and Spark. The de-normalized data set is about 13TB out the door. I plan on using 9 Cassandra nodes (replication factor=3) to store the data, with Spark doing the aggregation.

Data set will be immutable once loaded, and am using the replication factor = 3 to somewhat simulate the real world. Most of the analysis will be of the sort "Give me all the remote ip addresses for source IP 'X' between time t1 and t2"

I built and tested a bulk loader following this example in GitHub: https://github.com/yukim/cassandra-bulkload-example to generate the SSTables, but I have not executed it on the entire data set yet.

Any advice on how to execute the bulk load under this configuration? Can I generate the SSTables in parallel? Once generated, can I write the SSTables to all nodes simultaneously? Should I be doing any kind of sorting by the partition key?

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks in advance!
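As a rough illustration of the three points in the reply (parallel SSTable generation, loading with sstableloader, dividing work by partition key), here is a minimal Python sketch. Everything in it is an assumption made for illustration, not something stated in the thread: the partition key is taken to be the local IP, the record field names (`local_ip`, `time`), the host names, and the directory layout are hypothetical, and the CRC32 bucketing is only a way to split work between jobs, not Cassandra's own partitioner. The only real tool referenced is `sstableloader` with its `-d` (initial hosts) option. A real 13TB run would stream each bucket to disk rather than hold lists in memory.

```python
import zlib
from collections import defaultdict


def bucket_for(partition_key: str, num_jobs: int) -> int:
    """Map a partition key to a job index using a stable hash (CRC32).

    This only divides work between generator jobs; it is not the hash
    Cassandra uses to place data on nodes.
    """
    return zlib.crc32(partition_key.encode("utf-8")) % num_jobs


def split_by_partition_key(records, num_jobs):
    """Group records so every row of one partition goes to the same job.

    Assumes (hypothetically) that 'local_ip' is the partition key and
    'time' is a clustering column.
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_for(rec["local_ip"], num_jobs)].append(rec)
    # Sort each job's rows by partition key, then time.
    for rows in buckets.values():
        rows.sort(key=lambda r: (r["local_ip"], r["time"]))
    return dict(buckets)


def sstableloader_command(contact_points, sstable_dir):
    """Build the sstableloader invocation for one job's output directory.

    sstable_dir is expected to end in <keyspace>/<table>.
    """
    return ["sstableloader", "-d", ",".join(contact_points), sstable_dir]


if __name__ == "__main__":
    sample = [
        {"local_ip": "10.0.0.1", "time": 2},
        {"local_ip": "10.0.0.2", "time": 1},
        {"local_ip": "10.0.0.1", "time": 1},
    ]
    jobs = split_by_partition_key(sample, num_jobs=4)
    for job_id, rows in sorted(jobs.items()):
        # Hypothetical output layout: one directory per generator job.
        out_dir = f"/data/job{job_id}/netflow/flows"
        print(job_id, len(rows), " ".join(
            sstableloader_command(["node1", "node2"], out_dir)))
```

Each bucket could then be handed to a separate SSTable generator process (for example, one built on the linked bulk-load example), and the resulting directories loaded with one sstableloader run each, possibly from several machines at once as the reply describes.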