From: Ben Coverston <ben.coverston@datastax.com>
Date: Mon, 2 Apr 2012 18:18:32 +0000
Subject: Re: Largest 'sensible' value
To: user@cassandra.apache.org

This is a difficult question to answer for a variety of reasons, but I'll give it a try; maybe it will be helpful, maybe not.

The most obvious problem is that Thrift is buffer-based, not streaming. That means that whatever the size of your chunk, it needs to be received, deserialized, and processed by Cassandra within a timeframe we call the rpc_timeout (by default, 10 seconds).

Bigger buffers mean larger allocations; larger allocations mean the JVM works harder and is more prone to heap fragmentation.

With mixed workloads (a few large, high-latency requests alongside many small, low-latency requests), larger buffers can also, over time, clog up the thread pool: your shorter queries end up waiting for the longer-running queries to complete and free up worker threads, making everything slow. This isn't a problem unique to Cassandra; everything that uses worker queues runs into some variant of it.

As with everything else, you'll probably need to test your specific use case to see what 'too big' is for you.
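To make the advice above concrete, here is a minimal sketch of the usual workaround: split a large value into fixed-size chunks stored under separate column names, so that no single request has to fit the whole blob into one Thrift buffer within the rpc_timeout. The `chunk_value`/`reassemble` helpers and the 1 MB chunk size are my own illustrative choices, not part of any Cassandra API; you would pass the resulting pairs to whatever client library you use.

```python
def chunk_value(key, blob, chunk_size=1024 * 1024):
    """Split a large blob into (column_name, bytes) pairs so each
    individual insert stays well under the Thrift buffer size.

    The column names embed a zero-padded chunk index, so a plain
    lexicographic column slice returns the chunks in order."""
    chunks = []
    for i in range(0, len(blob), chunk_size):
        column_name = "%s/chunk-%06d" % (key, i // chunk_size)
        chunks.append((column_name, blob[i:i + chunk_size]))
    return chunks


def reassemble(chunks):
    """Concatenate chunk payloads back into the original blob,
    sorting by column name in case they arrived out of order."""
    return b"".join(payload for _, payload in sorted(chunks))
```

The chunk size is the knob to test against your own workload: small enough that each read/write stays comfortably inside rpc_timeout, large enough that per-request overhead doesn't dominate.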
On Mon, Apr 2, 2012 at 9:23 AM, Franc Carter <franc.carter@sirca.org.au> wrote:
>
> Hi,
>
> We are in the early stages of thinking about a project that needs to store
> data that will be accessed by Hadoop. One of the concerns we have is around
> the latency of HDFS, as our use case is not to read all the data, and
> hence we will need custom RecordReaders etc.
>
> I've seen a couple of comments that you shouldn't put large chunks into a
> value - however 'large' is not well defined for the range of people using
> these solutions ;-)
>
> Does anyone have a rough rule of thumb for how big a single value can be
> before we are outside sanity?
>
> thanks
>
> --
>
> *Franc Carter* | Systems architect | Sirca Ltd
>
> franc.carter@sirca.org.au | www.sirca.org.au
>
> Tel: +61 2 9236 9118
>
> Level 9, 80 Clarence St, Sydney NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215

--
Ben Coverston
DataStax -- The Apache Cassandra Company