incubator-cassandra-user mailing list archives

From Colin <colpcl...@gmail.com>
Subject Re: Data model for streaming a large table in real time.
Date Sun, 08 Jun 2014 02:51:36 GMT
To have any redundancy in the system, start with at least 3 nodes and a replication factor of 3.

Try to have at least 8 cores, 32 GB of RAM, and separate disks for the commit log and data.
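A minimal sketch of the idea behind "3 nodes, RF=3": each partition's replicas are the owning node plus the next RF-1 nodes walking the token ring (SimpleStrategy-style placement; the node names and ring layout here are hypothetical stand-ins, not the poster's cluster):

```python
def replicas_for(token: int, ring: list, rf: int) -> list:
    # The node owning the token range holds the first copy; the next
    # rf-1 nodes clockwise around the ring hold the other copies.
    primary = token % len(ring)
    return [ring[(primary + i) % len(ring)] for i in range(rf)]

ring = ["node0", "node1", "node2"]   # hypothetical 3-node cluster
print(replicas_for(42, ring, rf=3))  # with RF=3 on 3 nodes, every node holds a copy
```

With RF=3 on exactly 3 nodes every node stores every partition, so losing any single node loses no data.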

Will you be replicating data across data centers?

--
Colin
320-221-9531


> On Jun 7, 2014, at 9:40 PM, Kevin Burton <burton@spinn3r.com> wrote:
> 
> Oh.. To start with we're going to use 2-10 nodes..
> 
> I think we're going to take the original strategy and just use 100 buckets, 0-99, then the timestamp under that. I think it should be fine and won't require an ordered partitioner. :)
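The 100-bucket scheme being described can be sketched as follows. The thread doesn't say how a row's bucket is chosen, so the event id and CRC32 hash here are illustrative assumptions; the key point is a composite partition key of (bucket 0-99, timestamp):

```python
import time
import zlib

NUM_BUCKETS = 100  # buckets 0-99, as described above

def partition_key(event_id: str, now_ms: int) -> tuple:
    # Hash the event id into one of 100 buckets so concurrent writers
    # spread across 100 partitions instead of piling onto one; the
    # timestamp then orders rows within each bucket.
    bucket = zlib.crc32(event_id.encode()) % NUM_BUCKETS
    return (bucket, now_ms)

print(partition_key("event-123", int(time.time() * 1000)))
```

Readers that need the stream back in order would then scan all 100 buckets for a time range and merge, which works with the default Murmur3 partitioner and avoids the ordered partitioner entirely.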
> 
> Thanks!
> 
> 
>> On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark <colin@clark.ws> wrote:
>> With 100 nodes, that ingestion rate is actually quite low and I don't think you'd need another column in the partition key.
>> 
>> You seem to be set in your current direction.  Let us know how it works out.
>> 
>> --
>> Colin
>> 320-221-9531
>> 
>> 
>>> On Jun 7, 2014, at 9:18 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>> 
>>> What's 'source'? You mean like the URL?
>>> 
>>> If source is too random it's going to yield too many buckets.
>>> 
>>> Ingestion rates are fairly high but not insane. About 4M inserts per hour, from 5-10GB…
>>> 
>>> 
>>>> On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <colin@clark.ws> wrote:
>>>> Not if you add another column to the partition key; source, for example.
>>>> 
>>>> I would really try to stay away from the ordered partitioner if at all possible.
>>>> 
>>>> What ingestion rates are you expecting, in size and speed?
>>>> 
>>>> --
>>>> Colin
>>>> 320-221-9531
>>>> 
>>>> 
>>>>> On Jun 7, 2014, at 9:05 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>>>> 
>>>>> 
>>>>> Thanks for the feedback on this btw.. .it's helpful.  My notes below.
>>>>> 
>>>>>> On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <colin@clark.ws> wrote:
>>>>>> No, you're not; the partition key will get distributed across the cluster if you're using random or murmur.
>>>>> 
>>>>> Yes… I'm aware.  But in practice this is how it will work…
>>>>> 
>>>>> If we create bucket b0, that will get hashed to h0…
>>>>> 
>>>>> So say I have 50 machines performing writes; their clocks are all synchronized thanks to ntpd, so they all compute b0 for the current bucket based on the time.
>>>>> 
>>>>> That gets hashed to h0…
>>>>> 
>>>>> If h0 is hosted on node0… then all writes go to node zero for that 1 second interval.
>>>>> 
>>>>> So all my writes are bottlenecking on one node. That node is *changing* over time… but they're not being dispatched in parallel over N nodes. Writes will only ever reach 1 node at a time.
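The hot-spot described here, and the fix suggested below (adding a source column to the partition key), can be simulated without a cluster. The node count and CRC32 hash are hypothetical stand-ins for the real token ring:

```python
import zlib

NODES = 10

def owner(partition_key: str) -> int:
    # Stand-in for the token ring: hash the partition key to a node.
    return zlib.crc32(partition_key.encode()) % NODES

second = 1402190400  # all 50 writers see the same clock via ntpd

# Partition key = time bucket only: every writer computes the same key,
# so every write lands on the same node for that interval.
hot = {owner(str(second)) for _ in range(50)}

# Partition key = (source, time bucket): distinct keys fan the same
# second's writes out across the cluster.
spread = {owner(f"src{w}:{second}") for w in range(50)}

print(len(hot), len(spread))  # one hot node vs. writes spread over several
```

The hot node moves each second, but at any instant the whole cluster's write throughput is capped at one node's capacity unless something writer-specific is added to the key.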
>>>>> 
>>>>>  
>>>>>> You could also ensure that by adding another column, like source, to ensure distribution. (Add the seconds to the partition key, not the clustering columns.)
>>>>>> 
>>>>>> I can almost guarantee that if you put too much thought into working against what Cassandra offers out of the box, it will bite you later.
>>>>> 
>>>>> Sure.. I'm trying to avoid the 'bite you later' issues. More so because I'm sure there are Cassandra gotchas to worry about. Everything has them. Just trying to avoid the land mines :-P
>>>>>  
>>>>>> In fact, the use case that you're describing may best be served by a queuing mechanism, and using Cassandra only for the underlying store.
>>>>> 
>>>>> Yes… that's what I'm doing. We're using Apollo to fan out the queue, but the writes go back into Cassandra and need to be read out sequentially.
>>>>>  
>>>>>> 
>>>>>> I used this exact same approach in a use case that involved writing over a million events/second to a cluster with no problems. Initially, I thought the ordered partitioner was the way to go too. And I used separate processes to aggregate, conflate, and handle distribution to clients.
>>>>> 
>>>>> 
>>>>> Yes. I think using 100 buckets will work for now. Plus I don't have to change the partitioner on our existing cluster and I'm lazy :)
>>>>>  
>>>>>> 
>>>>>> Just my two cents, but I also spend the majority of my days helping people utilize Cassandra correctly, and rescuing those that haven't.
>>>>> 
>>>>> Definitely appreciate the feedback!  Thanks!
>>>>>  
>>>>> -- 
>>>>> Founder/CEO Spinn3r.com
>>>>> Location: San Francisco, CA
>>>>> Skype: burtonator
>>>>> blog: http://burtonator.wordpress.com
>>>>> … or check out my Google+ profile
>>>>> 
>>>>> War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
>>> 
>>> 
>>> 
> 
> 
> 
