From: Yi Yang <yyang@me.com>
Subject: Re: Cassandra for numerical data set
Date: Tue, 16 Aug 2011 16:52:15 -0700
To: user@cassandra.apache.org

BTW, if I'm going to insert a super column family (SCF) row with ~400 columns and ~50 subcolumns under each column, how often should I send a mutation: once per column, or once per row?
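For instance, would one batch per row, along these lines, be the right granularity? (A rough Hector sketch; the cluster address, keyspace, column family, and serializers are placeholders for my real schema.)

    import java.util.ArrayList;
    import java.util.List;

    import me.prettyprint.cassandra.serializers.FloatSerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class RowBatch {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
            Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster);  // placeholder keyspace
            StringSerializer ss = StringSerializer.get();
            FloatSerializer fs = FloatSerializer.get();

            // Queue every super column of the row on one Mutator, then send a
            // single batch_mutate for the whole row (400 x 50 = 20,000 subcolumns).
            Mutator<String> mutator = HFactory.createMutator(ks, ss);
            for (int c = 0; c < 400; c++) {
                List<HColumn<String, Float>> subs = new ArrayList<HColumn<String, Float>>(50);
                for (int s = 0; s < 50; s++) {
                    subs.add(HFactory.createColumn("sub" + s, (float) s, ss, fs));
                }
                mutator.addInsertion("row-key", "MySuperCF",
                        HFactory.createSuperColumn("col" + c, subs, ss, ss, fs));
            }
            mutator.execute();  // one round trip instead of 400 (or 20,000)
        }
    }

Or is a 20,000-subcolumn batch too large for a single batch_mutate, so I should split each row into a few chunks?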

On Aug 16, 2011, at 3:24 PM, Yi Yang wrote:

> Thanks Aaron.

>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting it together). I'd like to know whether there are better ways to improve write efficiency, since it's currently about the same speed as MySQL when writing sequentially. It seems the commitlog requires a huge amount of disk I/O compared with what my test machine can afford.
>> Have a look at http://www.datastax.com/dev/blog/bulk-loading
> This is a great tool for me. I'll try it, since it should cost much less bandwidth and disk I/O.
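Concretely, I'm planning to follow that post: write SSTables locally with SSTableSimpleUnsortedWriter, then stream them in with sstableloader. A rough sketch of my writer, with placeholder keyspace/CF names and a single float column standing in for my real schema:

    import java.io.File;
    import java.nio.ByteBuffer;

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

    public class BulkWriter {
        public static void main(String[] args) throws Exception {
            // Buffers ~64 MB of mutations in memory, sorts them, and flushes
            // complete SSTables into a directory named after the keyspace.
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    new File("/tmp/MyKeyspace"),   // output dir (placeholder)
                    "MyKeyspace", "MyCF",          // keyspace / column family (placeholders)
                    AsciiType.instance,            // column comparator
                    null,                          // no subcomparator (standard CF)
                    64);                           // buffer size in MB

            long timestamp = System.currentTimeMillis() * 1000;  // microseconds
            for (int i = 0; i < 500000; i++) {
                writer.newRow(bytes("row" + i));
                writer.addColumn(bytes("v"), ByteBuffer.allocate(4).putFloat(0, 1.0f), timestamp);
            }
            writer.close();
            // Then, from a machine that can see the cluster:  bin/sstableloader /tmp/MyKeyspace
        }
    }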


>>> 3)
>>> In my case, each row is read randomly with equal probability. I have around 0.5M rows in total. Can you give some practical advice on optimizing the row cache and key cache? I can use up to 8 GB of memory on the test machines.
>> Is your data set small enough to fit in memory? You may also be interested in the row_cache_provider setting for column families; see the CLI help for "create column family" and the IRowCacheProvider interface. You can replace the caching strategy if you want to.
> The dataset is about 150 GB stored as CSV, and an estimated 1.3 TB stored as SSTables, so I don't think it can fit into memory. I'll experiment with the caching strategy, but I expect it to improve my case only a little.
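In the meantime I'll start with the per-column-family cache settings; something like this from cassandra-cli (the CF name and sizes are placeholders; since my ~0.5M rows are read uniformly at random, caching all 0.5M keys is cheap, while the row cache hit rate is roughly just the fraction of rows that fits in it):

    update column family MyCF with keys_cached = 500000 and rows_cached = 50000;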

> I'm now looking into native compression for SSTables. I just applied the CASSANDRA-47 patch and found a huge performance penalty in my use case, which I haven't figured out yet. I suppose CASSANDRA-674 will solve it better, though I see a number of tickets working on similar issues, including CASSANDRA-1608 and others. Is that because Cassandra really does cost this much disk space?

> Well, my target is simply to get the 1.3 TB compressed down to 700 GB (roughly a 2:1 ratio) so that I can fit it on a single server, while keeping the same level of performance.

> Best,
> Steve


> On Aug 16, 2011, at 2:27 PM, aaron morton wrote:



>> Hope that helps.
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com

>> On 16/08/2011, at 12:44 PM, Yi Yang wrote:

>>> Dear all,

>>> I want to report my use case and have a discussion with you all.

>>> I'm currently working on my second Cassandra project, and I've got a somewhat unusual use case: storing a traditional, relational data set in Cassandra. It's a dataset of int and float numbers, with no strings or other data, and the column names are much longer than the values themselves. Besides, the row key is a version-3 (MD5-hashed) UUID of some other data.
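To be concrete about the keys: a version-3 UUID is name-based and MD5-hashed, so in Java it comes straight from the standard library (the input string below is a placeholder):

    import java.nio.charset.Charset;
    import java.util.UUID;

    public class RowKeys {
        public static void main(String[] args) {
            // nameUUIDFromBytes builds a name-based, MD5-hashed (version 3) UUID,
            // so the same source record always maps to the same row key.
            byte[] source = "source-record-identity".getBytes(Charset.forName("UTF-8"));
            UUID rowKey = UUID.nameUUIDFromBytes(source);
            System.out.println(rowKey + "  (version " + rowKey.version() + ")");
        }
    }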

>>> 1)
>>> I did some workarounds to save disk space, but it still took approximately 12-15x more disk space than MySQL. I looked into the SSTable internals, did some optimization by selecting a better data serializer, and also hashed each column name down to one byte. That brought the current database to ~6x the disk overhead of MySQL, which I think might be acceptable.
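The one-byte column names are really a fixed lookup table shared by writers and readers rather than a true hash (a real one-byte hash would collide); a minimal sketch of the idea, with made-up names:

    import java.util.HashMap;
    import java.util.Map;

    public class ColumnNameCodec {
        // Fixed table mapping long, human-readable column names to one-byte codes.
        // One byte caps the schema at 256 distinct column names.
        private static final String[] NAMES = {
            "some_very_long_descriptive_metric_name_a",
            "some_very_long_descriptive_metric_name_b"
            // ... up to 256 entries, in a fixed order
        };
        private static final Map<String, Byte> CODES = new HashMap<String, Byte>();
        static {
            for (int i = 0; i < NAMES.length; i++) {
                CODES.put(NAMES[i], (byte) i);
            }
        }

        public static byte[] encode(String name) {
            return new byte[] { CODES.get(name) };  // the one-byte on-disk column name
        }

        public static String decode(byte[] column) {
            return NAMES[column[0] & 0xff];
        }
    }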

>>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming days. I'll keep you updated on my testing, but I'd be glad to hear your ideas on saving disk space.

>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting it together). I'd like to know whether there are better ways to improve write efficiency, since it's currently about the same speed as MySQL when writing sequentially. It seems the commitlog requires a huge amount of disk I/O compared with what my test machine can afford.

>>> 3)
>>> In my case, each row is read randomly with equal probability. I have around 0.5M rows in total. Can you give some practical advice on optimizing the row cache and key cache? I can use up to 8 GB of memory on the test machines.

>>> Thanks for your help.


>>> Best,

>>> Steve



