cassandra-user mailing list archives

From Goutham reddy <goutham.chiru...@gmail.com>
Subject Re: Partition key with 300K rows can it be queried and distributed using Spark
Date Fri, 18 Jan 2019 04:24:44 GMT
Thanks Jeff. Yes, we have 18 columns in total. But my question was: can
Spark retrieve the data by splitting the 300k rows across the Spark nodes?

On Thu, Jan 17, 2019 at 1:30 PM Jeff Jirsa <jjirsa@gmail.com> wrote:

> The reason big rows are painful in Cassandra is that by default, we index
> them every 64 KB. With 300k objects, it may or may not have a lot of those
> little index blocks/objects. How big is each row?
>
> If you try to read it and it's very wide, you may see heap pressure / GC.
> If so, you could try changing the column index size from 64 KB to something
> larger (128 KB, 256 KB, etc.). Small point reads will cost more disk IO, but
> you will see less heap pressure.
>
>
>
> On Thu, Jan 17, 2019 at 12:15 PM Goutham reddy <goutham.chirutha@gmail.com>
> wrote:
>
>> Hi,
>> Although each partition key can hold up to 2 billion rows, having such a
>> huge data set under one partition key is an anti-pattern. In our case it
>> is only 300k rows, but when we query for one particular key we get a
>> timeout exception. If I use Spark to fetch the 300k rows for that key,
>> does it solve the timeout problem and distribute the data across the
>> Spark nodes, or will it still throw timeout exceptions? Can you please
>> help me with the best practice for retrieving the data for a key with
>> 300k rows? Any help is highly appreciated.
>>
>> Regards
>> Goutham.
>>
> --
Regards
Goutham Reddy
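
Jeff's point about the 64 KB column index can be made concrete with a rough
calculation. The sketch below estimates how many index blocks a single
partition would need at different column index sizes. The 500-byte average
row size is purely an assumption for illustration (the thread never states
the real row size), and the function name is hypothetical. It also ignores
compression and per-row overhead, so treat the numbers as an order of
magnitude, not a measurement.

```python
def index_block_count(rows: int, avg_row_bytes: int, column_index_size_kb: int = 64) -> int:
    """Rough number of index blocks for one partition.

    Assumes one index entry per `column_index_size_kb` of partition data,
    which is the behaviour described in the thread. Overhead is ignored.
    """
    partition_bytes = rows * avg_row_bytes
    block_bytes = column_index_size_kb * 1024
    # Integer ceiling division: a partially filled block still counts.
    return -(-partition_bytes // block_bytes)


ROWS = 300_000        # from the thread
AVG_ROW_BYTES = 500   # assumed value, not from the thread

for size_kb in (64, 128, 256):
    blocks = index_block_count(ROWS, AVG_ROW_BYTES, size_kb)
    print(f"{size_kb} KB index size: about {blocks} index blocks")
```

Under these assumptions, doubling the index size roughly halves the number
of index blocks kept for the partition, which is where the reduced heap
pressure would come from. The cost is that each small read has to scan a
larger block from disk.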
