From: pat@xvalheru.org
To: user@cassandra.apache.org
Cc: Dimo Velev
Date: Sat, 03 Aug 2019 12:06:25 +0200
Subject: Re: loading big amount of data to Cassandra
In-Reply-To: <29277B2E-F37E-416A-9EA5-6C8537A2A568@gmail.com>
Message-ID: <80cdef60317a43504d7d8470eb13c378@xvalheru.org>

Thanks to all, I'll try the SSTables.

Thanks
Pat

On 2019-08-03 09:54, Dimo Velev wrote:
> Check out the CQLSSTableWriter Java class:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
> You use it to generate sstables; you need to write a small program
> for that. You can then stream them over the network using
> sstableloader (either use the utility or use the underlying classes to
> embed it in your program).
>
> On 3. Aug 2019, at 07:17, Ayub M wrote:
>
>> Dimo, how do you generate sstables? Do you mean load data locally on
>> a Cassandra node and use sstableloader?
>>
>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev wrote:
>>
>>> Hi,
>>>
>>> Batches will actually slow down the process because they mean a
>>> different thing in C*. As you read, they just group changes
>>> that you want executed atomically.
>>>
>>> Cassandra does not really have indices, so that is different from a
>>> relational DB. However, after writing stuff to Cassandra it
>>> generates many smallish sstables. These are then merged in the
>>> background (compaction) to improve read performance.
>>>
>>> You have two options from my experience:
>>>
>>> Option 1: use the normal CQL API in async mode. This will create a
>>> high CPU load on your cluster. Depending on whether that is fine
>>> for you, that might be the easiest solution.
>>>
>>> Option 2: generate sstables locally and use sstableloader to
>>> upload them into the cluster. The streaming does not generate high
>>> CPU load, so it is a viable option for clusters with other
>>> operational load.
>>>
>>> Option 2 scales with the number of cores of the machine generating
>>> the sstables. If you can split your data, you can generate sstables
>>> on multiple machines. In contrast, option 1 scales with your
>>> cluster. If you have a large cluster that is idling, it would be
>>> better to use option 1.
>>>
>>> With both options I was able to write at about 50-100K rows/sec
>>> on my laptop and a local Cassandra. The speed heavily depends on the
>>> size of your rows.
>>>
>>> Back to your question: I guess option 2 is similar to what you
>>> are used to from tools like sqlloader for relational DBMSes.
>>>
>>> I had a requirement of loading a few hundred million rows per day
>>> into an operational cluster, so I went with option 2 to offload the
>>> CPU load and reduce the impact on the reading side during the loads.
>>>
>>> Cheers,
>>> Dimo
>>>
>>> Sent from my iPad
>>>
>>>> On 2. Aug 2019, at 18:59, pat@xvalheru.org wrote:
>>>>
>>>> Hi,
>>>>
>>>> I need to upload about 7 billion records to Cassandra. What
>>>> is the best setup of Cassandra for this task? Will using batches
>>>> speed up the upload (I've read somewhere that batch in Cassandra
>>>> is dedicated to atomicity, not to speeding up communication)? How
>>>> does Cassandra work internally with regard to indexing? In SQL
>>>> databases, when uploading such an amount of data it is suggested
>>>> to turn off indexing and then turn it back on. Is something
>>>> similar possible in Cassandra?
>>>>
>>>> Thanks for all suggestions.
>>>>
>>>> Pat
>>>>
>>>> ----------------------------------------
>>>> Freehosting PIPNI - http://www.pipni.cz/

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org
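
As a rough sanity check on the numbers in this thread: the original question mentions about 7 billion records, and Dimo reports roughly 50-100K rows/sec on a laptop. The sketch below only does that arithmetic. The helper name load_time_hours is made up for illustration, and the linear speed-up with several sstable-generating machines is an assumption, not something measured in the thread; real throughput depends heavily on row size.

```python
def load_time_hours(total_rows: int, rows_per_sec: float, generators: int = 1) -> float:
    """Estimate hours needed to write total_rows.

    Assumption (not from the thread): throughput scales linearly with the
    number of machines generating sstables in parallel.
    """
    if total_rows < 0 or rows_per_sec <= 0 or generators < 1:
        raise ValueError("total_rows must be >= 0, rows_per_sec > 0, generators >= 1")
    seconds = total_rows / (rows_per_sec * generators)
    return seconds / 3600.0


TOTAL_ROWS = 7_000_000_000  # "about 7 billion records" from the original question

# Throughput range quoted in the thread: about 50-100K rows/sec.
for rate in (50_000, 100_000):
    for machines in (1, 4):  # 4 machines is an arbitrary illustrative value
        hours = load_time_hours(TOTAL_ROWS, rate, machines)
        print(f"{rate:>7} rows/s x {machines} machine(s): {hours:6.1f} h")
```

At the quoted single-machine rates this works out to somewhere between about 19 and 39 hours, which is why splitting the data across several generating machines (option 2) or using an otherwise idle cluster (option 1) matters at this volume.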