From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Cassandra for numerical data set
Date: Wed, 17 Aug 2011 11:52:53 +1200

> Is that because Cassandra really costs a huge amount of disk space?
The general design approach is / has been that storage space is cheap and plentiful.

> Well my target is to simply get the 1.3T compressed to 700 Gig so that I can fit it into a single server, while keeping the same level of performance.
Not sure it's going to be possible to get the same performance from one machine as you would from several.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
On 17/08/2011, at 10:24 AM, Yi Yang wrote:

> Thanks Aaron.
> 
>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting it together). I wish to know if there are better methods to improve write efficiency, since it's just about the same speed as MySQL when writing sequentially. It seems the commitlog requires a huge amount of disk I/O, more than my test machine can afford.
>> Have a look at http://www.datastax.com/dev/blog/bulk-loading
> This is a great tool for me. I'll try this tool, since it will require much less bandwidth and disk I/O.
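
The post boils down to writing SSTables locally with SSTableSimpleUnsortedWriter and then streaming them into the cluster with bin/sstableloader, which bypasses the commitlog and the normal write path. A minimal sketch along the lines of its example, against the 0.8 io.sstable API (keyspace, column family, key and column names below are placeholders, and the class and constructor are worth double-checking against the build you run):

    import java.io.File;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class BulkWrite
    {
        public static void main(String[] args) throws IOException
        {
            // Output directory for the generated SSTables; sstableloader
            // expects it to be named after the keyspace.
            File dir = new File("/tmp/Ticks");
            dir.mkdirs();

            // Buffers rows in memory (~64MB here) and flushes a new SSTable
            // whenever the buffer fills, so input need not be pre-sorted.
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    dir, "Ticks", "Series", AsciiType.instance, null, 64);

            long timestamp = System.currentTimeMillis() * 1000; // microseconds

            ByteBuffer value = ByteBuffer.allocate(4);
            value.putFloat(0, 42.5f); // a float stored as its 4 raw bytes

            writer.newRow(ByteBufferUtil.bytes("6fa459ea-ee8a-3ca4-894e-db77e160355e"));
            writer.addColumn(ByteBufferUtil.bytes("p"), value, timestamp);
            writer.close();
        }
    }

Afterwards, bin/sstableloader /tmp/Ticks streams the generated files to the replicas that own them, so the writes never touch the commitlog on the receiving nodes.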


>>> 3)
>>> In my case, each row is read randomly with the same chance. I have around 0.5M rows in total. Can you provide some practical advice on optimizing the row cache and key cache? I can use up to 8 gig of memory on the test machines.
>> Is your data set small enough to fit in memory? You may also be interested in the row_cache_provider setting for column families, see the CLI help for create column family and the IRowCacheProvider interface. You can replace the caching strategy if you want to.
> The dataset is about 150 Gig stored as CSV and an estimated 1.3T stored as SSTables, hence I don't think it can fit into memory. I'll try the caching strategy, but I think it will only improve my case a little.
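
For the cache settings themselves: with roughly 0.5M rows read uniformly at random, the key cache is cheap enough to cover every key, and the row cache is the knob to experiment with inside the 8 gig. A rough starting point in the CLI (the column family name is a placeholder, the numbers are guesses to tune, and SerializingCacheProvider is the off-heap provider that ships with 0.8):

    update column family Series
      with keys_cached = 500000
      and rows_cached = 100000
      and row_cache_provider = 'SerializingCacheProvider';

Watch the key and row cache hit rates in nodetool cfstats and grow rows_cached as memory allows; with uniform random reads the row cache hit rate is roughly the fraction of rows that fit in it, so it only pays off once that fraction is meaningful.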

> I'm now looking into native compression for SSTables. I just patched in CASSANDRA-47 and found there is a huge performance penalty in my use case, and I haven't figured out the reason yet. I suppose CASSANDRA-647 will solve it better, but I see there are a number of tickets working on a similar issue, including CASSANDRA-1608 etc. Is that because Cassandra really costs a huge amount of disk space?
> 
> Well my target is to simply get the 1.3T compressed to 700 Gig so that I can fit it into a single server, while keeping the same level of performance.
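
On the compression side, CASSANDRA-47 makes compression a per column family option, so it is enabled through the schema rather than globally. The CLI form it ends up with looks roughly like this (syntax as it later ships in 1.0; a hand-patched 0.8 build may expose it differently, and the chunk size is only a starting value):

    update column family Series
      with compression_options = {sstable_compression:SnappyCompressor, chunk_length_kb:64};

Only newly written SSTables pick up the setting, so existing data keeps its size until it is compacted or rewritten. Smaller chunks favour random-read latency, larger ones favour compression ratio.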

> Best,
> Steve
> 
> 
> On Aug 16, 2011, at 2:27 PM, aaron morton wrote:
> 
>> Hope that helps.
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 16/08/2011, at 12:44 PM, Yi Yang wrote:

>>> Dear all,
>>> 
>>> I want to report my use case and have a discussion with you guys.
>>> 
>>> I'm currently working on my second Cassandra project. I got into a somewhat unique use case: storing a traditional, relational data set in the Cassandra datastore. It's a dataset of int and float numbers, no strings and no other data types, and the column names are much longer than the values themselves. Besides, the row key is an MD5-based version 3 UUID of some other data.
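
As an aside, that style of row key needs nothing beyond the JDK: UUID.nameUUIDFromBytes produces exactly a version 3, MD5-based UUID (the input string below is an invented example of the "other data"):

    import java.nio.charset.Charset;
    import java.util.UUID;

    public class RowKeys
    {
        public static void main(String[] args)
        {
            // Version 3 (name-based, MD5) UUID derived from some other data.
            String source = "series-42|2011-08-16"; // invented example input
            UUID rowKey = UUID.nameUUIDFromBytes(source.getBytes(Charset.forName("UTF-8")));
            System.out.println(rowKey + " version=" + rowKey.version()); // prints version=3
        }
    }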

>>> 1)
>>> I did some workarounds to save disk space, however it still takes approximately 12-15x more disk space than MySQL. I looked into the Cassandra SSTable internals, did some optimizing by selecting a better data serializer, and also hashed the column names down to one byte. That leaves the current database with ~6x the disk space overhead compared with MySQL, which I think might be acceptable.
>>> 
>>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming days. I'll keep you updated on my testing. But I'm willing to hear your ideas on saving disk space.
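
The one-byte column name trick is the piece that matters most when values are only 4-8 bytes, since every column stores its full name on disk. A minimal sketch of one way to do it, not the project's actual code, with invented field names:

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    // Maps long, human-readable field names to one-byte column names before
    // writing, and back again after reading. The dictionary is small and
    // static, so it can live in application code or its own tiny column family.
    public class ColumnNameCodec
    {
        private final Map<String, Byte> nameToId = new HashMap<String, Byte>();
        private final Map<Byte, String> idToName = new HashMap<Byte, String>();

        public void register(String name, int id)
        {
            if (id < 0 || id > 255)
                throw new IllegalArgumentException("id must fit in one byte: " + id);
            nameToId.put(name, (byte) id);
            idToName.put((byte) id, name);
        }

        // One-byte column name, e.g. "adjusted_closing_price" -> 0x03
        public ByteBuffer encode(String name)
        {
            return ByteBuffer.wrap(new byte[]{ nameToId.get(name) });
        }

        public String decode(ByteBuffer columnName)
        {
            return idToName.get(columnName.get(columnName.position()));
        }
    }

With names shrunk like this, what is left per column is the one byte of name plus Cassandra's fixed per-column metadata (timestamp, flags and length fields), which is presumably a large part of the remaining ~6x over MySQL.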

>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple sources and putting it together). I wish to know if there are better methods to improve write efficiency, since it's just about the same speed as MySQL when writing sequentially. It seems the commitlog requires a huge amount of disk I/O, more than my test machine can afford.
>>> 
>>> 3)
>>> In my case, each row is read randomly with the same chance. I have around 0.5M rows in total. Can you provide some practical advice on optimizing the row cache and key cache? I can use up to 8 gig of memory on the test machines.
>>> 
>>> Thanks for your help.
>>> 
>>> 
>>> Best,
>>> 
>>> Steve