On Tue, Apr 3, 2012 at 4:18 AM, Ben Coverston <ben.coverston@datastax.com> wrote:
This is a difficult question to answer for a variety of reasons, but I'll give it a try; maybe it will be helpful, maybe not.

The most obvious problem is that Thrift is buffer-based, not streaming. That means that whatever the size of your chunk, it needs to be received, deserialized, and processed by Cassandra within a timeframe that we call the rpc_timeout (by default this is 10 seconds).

Thanks.

I suspect that 'not streaming' is the key, and not just on the Cassandra side: our use case has a subtle assumption of streaming on the client side. We could chop it up into buckets and put each one in a time-ordered column, but that defeats the purpose of why I was considering Cassandra, which was to avoid the latency of seeks in HDFS.
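For illustration, chopping a value into buckets could look roughly like the sketch below. It is plain Python with no Cassandra client involved. The names `split_into_columns` and `reassemble`, the 1 MiB bucket size, and the zero-padded sequence numbers used as column names (standing in for a time-ordered name) are all assumptions made for the example, not anything from an existing API.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical bucket size; the right value has to come from testing.
CHUNK_SIZE = 1024 * 1024


def split_into_columns(payload: bytes,
                       chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[str, bytes]]:
    """Yield (column_name, chunk) pairs whose names sort in write order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, offset in enumerate(range(0, len(payload), chunk_size)):
        # Zero-padded names keep lexical order equal to numeric order.
        yield f"chunk-{index:08d}", payload[offset:offset + chunk_size]


def reassemble(columns: Dict[str, bytes]) -> bytes:
    """Rebuild the payload by joining the chunks in sorted column-name order."""
    return b"".join(columns[name] for name in sorted(columns))
```

Each pair would be written as one column of a single row, so every request carries at most `chunk_size` bytes. The cost is that the client has to read the columns back in order and stitch them together itself.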

cheers
 

Bigger buffers mean larger allocations, and larger allocations mean that the JVM works harder and is more prone to fragmentation on the heap.

With mixed workloads (lots of high-latency, large requests and many very small, low-latency requests), larger buffers can also, over time, clog up the thread pool in a way that causes your shorter queries to wait for your longer-running queries to complete (to free up worker threads), making everything slow. This isn't a problem unique to Cassandra; everything that uses worker queues runs into some variant of this problem.
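The queueing effect can be shown with a toy model. This is only a sketch of a generic FIFO worker pool, not of Cassandra's actual thread pools; the function name `queue_waits` and the assumption that all requests arrive at time zero are made up for the example.

```python
import heapq
from typing import List


def queue_waits(durations: List[float], workers: int) -> List[float]:
    """Return how long each job waits for a free worker.

    Jobs are served in list order (FIFO) and all arrive at t=0.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    # Min-heap of the times at which each worker next becomes idle.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    waits = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        waits.append(start)             # arrival is t=0, so wait == start
        heapq.heappush(free_at, start + duration)
    return waits
```

With two workers, `queue_waits([5.0, 5.0, 0.01, 0.01], 2)` gives `[0.0, 0.0, 5.0, 5.0]`: the two small requests sit behind the two large ones for the full five seconds, even though each needs only 10 ms of work.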

As with everything else, you'll probably need to test your specific use case to see what 'too big' is for you.

On Mon, Apr 2, 2012 at 9:23 AM, Franc Carter <franc.carter@sirca.org.au> wrote:

Hi,

We are in the early stages of thinking about a project that needs to store data that will be accessed by Hadoop. One of the concerns we have is around the latency of HDFS, as our use case is not for reading all the data and hence we will need custom RecordReaders etc.

I've seen a couple of comments that you shouldn't put large chunks into a value - however 'large' is not well defined for the range of people using these solutions ;-)

Does anyone have a rough rule of thumb for how big a single value can be before we are outside sanity?

thanks

--

Franc Carter | Systems architect | Sirca Ltd

franc.carter@sirca.org.au | www.sirca.org.au

Tel: +61 2 9236 9118

Level 9, 80 Clarence St, Sydney NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215





--
Ben Coverston
DataStax -- The Apache Cassandra Company



