From: Adrian Cockcroft
To: Patrick Julien
CC: "user@cassandra.apache.org"
Date: Thu, 14 Apr 2011 13:47:26 -0700
Subject: Re: Pyramid Organization of Data

What you are asking for breaks the eventual consistency model, so you need
to create a separate cluster in NYC that collects the same updates but has
a much longer setting to timeout the data for deletion, or doesn't get the
deletes.

One way is to have a trigger on writes on your pyramid nodes in NY that
copies data over to the long-term analysis cluster. The two clusters won't
be eventually consistent in the presence of failures, but with RF=3 you
will get up to three triggers for each write, so you get three chances to
get the copy done.

Adrian
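A minimal client-side sketch of the copy-on-write idea above (the mirroring
is done by the writer rather than by a server-side trigger). It assumes a
pycassa-style Python client; the host names, the 'Events' keyspace and the
'History' column family are placeholders, not anything from this thread:

    # Sketch only: every write goes to the local satellite cluster with a TTL
    # and is mirrored, without a TTL, to the long-term NYC analysis cluster.
    import pycassa

    satellite_pool = pycassa.ConnectionPool('Events', server_list=['tokyo-cass1:9160'])
    archive_pool = pycassa.ConnectionPool('Events', server_list=['nyc-cass1:9160'])

    satellite_cf = pycassa.ColumnFamily(satellite_pool, 'History')
    archive_cf = pycassa.ColumnFamily(archive_pool, 'History')

    SEVEN_DAYS = 7 * 24 * 3600

    def mirrored_write(row_key, columns):
        # The satellite copy expires on its own after seven days, so no explicit
        # delete (and no tombstone) ever needs to be issued there. The archive
        # copy carries no TTL and is kept indefinitely.
        satellite_cf.insert(row_key, columns, ttl=SEVEN_DAYS)
        archive_cf.insert(row_key, columns)

    mirrored_write('user:42', {'event': 'login', 'ts': '2011-04-14T13:47:00Z'})

If the archive write fails, the two clusters drift apart, which is the
consistency caveat above; a retry queue or a periodic reconciliation job
would be needed to close that gap.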
On Apr 14, 2011, at 10:18 AM, "Patrick Julien" wrote:

> Thanks for your input Adrian, we've pretty much settled on this too.
> What I'm trying to figure out is how we do deletes.
>
> We want to do deletes in the satellites because:
>
> a) we'll run out of disk space very quickly with the amount of data we have
> b) we don't need more than 3 days worth of history in the satellites,
> we're currently planning for 7 days of capacity
>
> However, the deletes will get replicated back to NY. In NY, we don't
> want that; we want to run hadoop/pig over all that data dating back
> several months/years. Even if we set the replication factor of the
> satellites to 1 and NY to 3, we'll run out of space very quickly in
> the satellites.
>
>
> On Thu, Apr 14, 2011 at 11:23 AM, Adrian Cockcroft wrote:
>> We have similar requirements for wide-area backup/archive at Netflix.
>> I think what you want is a replica with RF of at least 3 in NY for all the
>> satellites; then each satellite could have a lower RF, but if you want safe
>> local quorum I would use 3 everywhere.
>> Then NY is the sum of all the satellites, so that makes most use of the disk
>> space.
>> For archival storage I suggest you use snapshots in NY and save compressed
>> tar files of each keyspace in NY. We've been working on this to allow full
>> and incremental backup and restore from our EC2-hosted Cassandra clusters
>> to/from S3. Full backup/restore works fine; incremental and per-keyspace
>> restore is being worked on.
>> Adrian
>>
>> From: Patrick Julien
>> Reply-To: "user@cassandra.apache.org"
>> Date: Thu, 14 Apr 2011 05:38:54 -0700
>> To: "user@cassandra.apache.org"
>> Subject: Re: Pyramid Organization of Data
>>
>> Thanks, I'm still working the problem, so anything I find out I will post
>> here.
>>
>> Yes, you're right, that is the question I am asking.
>>
>> No, adding more storage is not a solution since New York would have several
>> hundred times more storage.
>>
>> On Apr 14, 2011 6:38 AM, "aaron morton" wrote:
>>> I think your question is "NY is the archive; after a certain amount of
>>> time we want to delete the row from the original DC but keep it in the
>>> archive in NY."
>>>
>>> Once you delete a row, it's deleted as far as the client is concerned.
>>> GCGraceSeconds is only concerned with when the tombstone marker can be
>>> removed. If NY has a replica of a row from Tokyo and the row is deleted in
>>> either DC, it will be deleted in the other DC as well.
>>>
>>> Some thoughts...
>>> 1) Add more storage in the satellite DCs, then tilt your chair to
>>> celebrate a job well done :)
>>> 2) Run two clusters as you say.
>>> 3) Just thinking out loud, and I know this does not work now. Would it be
>>> possible to support per-CF strategy options, so an archive CF only
>>> replicates to NY? Can think of possible problems with repair and
>>> LOCAL_QUORUM; out of interest, what else would it break?
>>>
>>> Hope that helps.
>>> Aaron
>>>
>>>
>>> On 14 Apr 2011, at 10:17, Patrick Julien wrote:
>>>
>>>> We have been successful in implementing, at scale, the comments you
>>>> posted here. I'm wondering what we can do about deleting data,
>>>> however.
>>>>
>>>> The way I see it, we have considerably more storage capacity in NY,
>>>> but not in the other sites. Using this technique here, it occurs to
>>>> me that we would replicate non-NY deleted rows back to NY. Is there a
>>>> way to tell NY not to tombstone rows?
>>>>
>>>> The ideas I have so far:
>>>>
>>>> - Set GCGracePeriod to be much higher in NY than in the other sites.
>>>> This way we can get to tombstoned rows well beyond their disk life in
>>>> the other sites.
>>>> - A variant on this solution is to set a TTL on rows in non-NY sites
>>>> and, again, set the GCGracePeriod to be considerably higher in NY.
>>>> - Break this up into multiple clusters and do one write from the client
>>>> to its 'local' cluster and one write to the NY cluster.
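A rough sketch of the first two ideas in that list, assuming pycassa's
SystemManager (keyword arguments vary by version); hosts, keyspace and
column family names are placeholders. Two caveats: gc_grace_seconds is a
per-column-family schema setting, so giving NY a different value than the
satellites implies separate clusters (the third idea) rather than one
multi-DC cluster, and, as Aaron notes above, a deleted row stays invisible
to clients no matter how long its tombstone is retained, so this really
pairs with TTL-based expiry rather than explicit deletes:

    # Sketch only: shorter tombstone grace period at the satellites, much
    # longer in NY, with satellite writes carrying a TTL (as in the
    # mirrored_write() sketch earlier in this thread).
    from pycassa.system_manager import SystemManager

    THIRTY_DAYS = 30 * 24 * 3600
    ONE_YEAR = 365 * 24 * 3600

    tokyo = SystemManager('tokyo-cass1:9160')
    tokyo.alter_column_family('Events', 'History', gc_grace_seconds=THIRTY_DAYS)
    tokyo.close()

    nyc = SystemManager('nyc-cass1:9160')
    nyc.alter_column_family('Events', 'History', gc_grace_seconds=ONE_YEAR)
    nyc.close()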
>>>>
>>>> On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis wrote:
>>>>> No, I'm suggesting you have a Tokyo keyspace that gets replicated as
>>>>> {Tokyo: 2, NYC: 1}, a London keyspace that gets replicated as
>>>>> {London: 2, NYC: 1}, for example.
>>>>>
>>>>> On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien wrote:
>>>>>> I'm familiar with this material. I hadn't thought of it from this
>>>>>> angle, but I believe what you're suggesting is that the different data
>>>>>> centers would each hold a different properties file for node discovery
>>>>>> instead of using auto-discovery.
>>>>>>
>>>>>> So Tokyo, and the others, would have a configuration that makes them
>>>>>> oblivious to the non-New York data centers.
>>>>>> New York would have a configuration that gives it knowledge of no
>>>>>> other data center.
>>>>>>
>>>>>> Would that work? Wouldn't the NY data center wonder where these other
>>>>>> writes are coming from?
>>>>>>
>>>>>> On Fri, Apr 8, 2011 at 6:38 PM, Jonathan Ellis wrote:
>>>>>>> On Fri, Apr 8, 2011 at 12:17 PM, Patrick Julien wrote:
>>>>>>>> The problem is this: we would like the historical data from Tokyo to
>>>>>>>> stay in Tokyo and only be replicated to New York, the one in London
>>>>>>>> to stay in London and only be replicated to New York, and so on for
>>>>>>>> all data centers.
>>>>>>>>
>>>>>>>> Is this currently possible with Cassandra? I believe we would need to
>>>>>>>> run multiple clusters and migrate data manually from the data centers
>>>>>>>> to North America to achieve this. Any suggestions would also be
>>>>>>>> welcome.
>>>>>>>
>>>>>>> NetworkTopologyStrategy allows configuring replicas per-keyspace,
>>>>>>> per-datacenter:
>>>>>>>
>>>>>>> http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers
>>>>>>>
>>>>>>> --
>>>>>>> Jonathan Ellis
>>>>>>> Project Chair, Apache Cassandra
>>>>>>> co-founder of DataStax, the source for professional Cassandra support
>>>>>>> http://www.datastax.com
>>>>>
>>>>> --
>>>>> Jonathan Ellis
>>>>> Project Chair, Apache Cassandra
>>>>> co-founder of DataStax, the source for professional Cassandra support
>>>>> http://www.datastax.com
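For reference, the per-keyspace layout Jonathan describes could be declared
along these lines, again assuming pycassa's SystemManager (the exact
create_keyspace() signature differs between pycassa versions) and
data-center names that match whatever snitch the cluster is configured
with; the keyspace names are placeholders:

    # Sketch only: one keyspace per satellite, each kept mostly local with a
    # single replica in NYC, so NYC ends up holding the union of all of them.
    from pycassa.system_manager import SystemManager, NETWORK_TOPOLOGY_STRATEGY

    sys_mgr = SystemManager('nyc-cass1:9160')  # any node in the cluster will do

    for dc in ('Tokyo', 'London'):
        sys_mgr.create_keyspace(
            '%s_events' % dc,
            replication_strategy=NETWORK_TOPOLOGY_STRATEGY,
            strategy_options={dc: '2', 'NYC': '1'})

    sys_mgr.close()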