This is a difficult question to answer for a variety of reasons, but I'll give it a try; maybe it will be helpful, maybe not.
The most obvious problem is that Thrift is buffer based, not streaming. That means that, whatever the size of your chunk, it needs to be received, deserialized, and processed by Cassandra within a timeframe that we call the rpc_timeout (10 seconds by default).
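For reference, the relevant knobs live in cassandra.yaml. A sketch of what that looks like (option names and defaults are from the 1.0/1.1 era and are an assumption here; double-check them against the yaml shipped with your version):

```yaml
# cassandra.yaml (names/defaults approximate -- verify for your version)
rpc_timeout_in_ms: 10000                 # a request must be received, deserialized,
                                         # and processed within this window
thrift_framed_transport_size_in_mb: 15   # caps the size of a single Thrift frame
thrift_max_message_length_in_mb: 16      # larger messages are rejected outright
```

So independent of any rule of thumb, a value has a hard ceiling set by the Thrift message limits, and a practical ceiling set by what the node can process inside the rpc_timeout.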
Bigger buffers mean larger allocations; larger allocations make the JVM work harder and leave the heap more prone to fragmentation.
With mixed workloads (lots of large, high-latency requests alongside many small, low-latency ones), larger buffers can also clog up the thread pool over time: your short queries end up waiting for the long-running queries to complete and free up worker threads, making everything slow. This isn't a problem unique to Cassandra; anything that uses worker queues runs into some variant of it.
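The worker-queue effect is easy to see in miniature. A toy sketch (the pool size and request timings are made up, and this stands in for any fixed worker pool, not Cassandra's actual internals): two slow "large" requests occupy both workers, so a 10 ms request ends up with half a second of latency.

```python
import concurrent.futures
import time

def handle(request_ms):
    # Stand-in for serving one request: just sleep for its "cost".
    time.sleep(request_ms / 1000.0)
    return request_ms

# A small fixed pool, like a server's worker threads.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

start = time.monotonic()
slow = [pool.submit(handle, 500) for _ in range(2)]  # occupy both workers
fast = pool.submit(handle, 10)                       # queued behind them
fast.result()                                        # ~510 ms, not ~10 ms
elapsed = time.monotonic() - start

print(f"fast request latency: {elapsed * 1000:.0f} ms")
pool.shutdown()
```

The fast request's own work is 10 ms, but its observed latency is dominated by waiting for a free worker, which is exactly what large buffered requests do to small ones.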
As with everything else, you'll probably need to test your specific use case to see what 'too big' is for you.
On Mon, Apr 2, 2012 at 9:23 AM, Franc Carter <email@example.com> wrote:
We are in the early stages of thinking about a project that needs to store data that will be accessed by Hadoop. One of the concerns we have is around the latency of HDFS, as our use case is not reading all the data, and hence we will need custom RecordReaders etc.
I've seen a couple of comments that you shouldn't put large chunks into a value - however 'large' is not well defined for the range of people using these solutions ;-)
Does anyone have a rough rule of thumb for how big a single value can be before we are outside sanity?
DataStax -- The Apache Cassandra Company