Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of john@pagerduty.com designates
 209.85.215.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAMZt1DcEDbguengV99oPpGgpPWccAk0wXKT_NSUDUdT-CvsAZQ@mail.gmail.com>
References: 
 <CAGEfnJOepMgP0OrB-CD-eh5z_FmXEW0zd0DPvf3LAvWh6H5p4w@mail.gmail.com>
 <CAPM=6wWwLEoZaEEStovtK2GQLoospRBuxT3cu862HyBWr+geNw@mail.gmail.com>
 <CAGEfnJPCnY1CqDhWaKiimHLp40ZZ4iMfOcBcmJ32N1M+2HdwDg@mail.gmail.com>
 <4EE40CFE.4050103@gmail.com>
 <CAMZt1DcEDbguengV99oPpGgpPWccAk0wXKT_NSUDUdT-CvsAZQ@mail.gmail.com>
From: John Laban <john@pagerduty.com>
Date: Mon, 12 Dec 2011 11:35:35 -0800
Message-ID: 
 <CAGEfnJNNkBV=u7xnaBFThxbY13jG98LC-qU=F38oWk8E4mh5_A@mail.gmail.com>
Subject: Re: best practices for simulating transactions in Cassandra
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=e0cb4e43d17b570b1504b3ea3ed5

--e0cb4e43d17b570b1504b3ea3ed5
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi Dominic,

So I responded to your algorithm in another part of this thread (very
interesting) but this part of the paper caught my attention:

> When client application code releases a lock, that lock must not actually
> be released for a period equal to one millisecond plus twice the maximum
> possible drift of the clocks in the client computers accessing the
> Cassandra databases
I've been worried about this, and added some arbitrary delay in the
releasing of my locks.  But I don't like it as it's (A) an arbitrary value
and (B) it will - perhaps greatly - reduce the throughput of the more
high-contention areas of my system.
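
To make the cost concrete, the rule quoted from the paper can be sketched
roughly as below. This is only a sketch under assumptions: the 50 ms drift
bound is an illustrative number (the real bound is exactly what I don't
know), and release_fn is a hypothetical callback standing in for whatever
actually removes the lock.

```python
import time

# ASSUMPTION: an illustrative bound on client clock drift, in seconds.
MAX_CLOCK_DRIFT_S = 0.050


def release_delay_s(max_drift_s):
    """Hold time before a lock is really released: 1 ms + 2 * max drift."""
    return 0.001 + 2.0 * max_drift_s


def delayed_release(release_fn, max_drift_s=MAX_CLOCK_DRIFT_S, sleep=time.sleep):
    """Wait out the drift window, then call the (hypothetical) release_fn."""
    delay = release_delay_s(max_drift_s)
    sleep(delay)      # the lock stays held for the whole drift window
    release_fn()      # only now is the lock actually given up
    return delay
```

With a 50 ms bound every release costs about 101 ms, so a single contended
lock could change hands fewer than 10 times per second, which is the
throughput problem in (B).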

To fix (B) I'll probably just have to try to get rid of locks altogether
in these high-contention areas.

To fix (A), I'd need to know what the maximum possible drift of my clocks
will be.  How did you determine this?  What value do you use, out of
curiosity?  What does the network layout of your client machines look like?
(Are any of your hosts geographically separated, or are they all running in
the same DC?  What's the maximum latency between hosts?  Etc.)  Do you
monitor the clock skew on an ongoing basis?  Am I worrying too much?

Sorry for all the questions but I'm very concerned about this particular
problem :)

Thanks,
John


On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <
dwilliams@fightmymonster.com> wrote:

> Hi guys, just thought I'd chip in...
>
> Fight My Monster is still using Cages, which is working fine, but...
>
> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are 2
> main reasons:-
>
> 1. Although a fast ZooKeeper cluster can handle a lot of load (we aren't
> getting anywhere near to capacity and we do a *lot* of serialisation) at
> some point it will be necessary to start hashing lock paths onto separate
> ZooKeeper clusters, and I tend to believe that these days you should
> choose platforms that handle sharding themselves (e.g. choose Cassandra
> rather than MySQL)
>
> 2. Why have more components in your system when you can have less!!! KISS
>
> Recently I therefore tried to devise an algorithm which can be used to
> add a distributed locking layer to clients such as Pelops, Hector,
> Pycassa etc.
>
> There is a doc describing the algorithm, to which may be added an
> appendix describing a protocol so that locking can be interoperable
> between the clients. That could be extended to describe a protocol for
> transactions. A word of warning: this is a *beta* algorithm that has only
> been seen by a select group so far, so I am not even 100% sure it works,
> but there is a useful general discussion regarding serialization of
> reads/writes so I include it anyway (and since this algorithm is going to
> be out there now, if there's anyone out there who fancies doing a Z proof
> or disproof, that would be fantastic).
> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>
> Final word on this re transactions: if/when transactions are added to the
> locking system in Pelops/Hector/Pycassa, Cassandra will provide better
> performance than ZooKeeper for storing snapshots, especially as
> transaction size increases
>
> Best, Dominic
>
> On 11 December 2011 01:53, Guy Incognito <dnd1066@gmail.com> wrote:
>
>>  you could try writing with the clock of the initial replay entry?
>>
>> On 06/12/2011 20:26, John Laban wrote:
>>
>> Ah, neat.  It is similar to what was proposed in (4) above with adding
>> transactions to Cages, but instead of snapshotting the data to be rolled
>> back (the "before" data), you snapshot the data to be replayed (the
>> "after" data).  And then later, if you find that the transaction didn't
>> complete, you just keep replaying the transaction until it takes.
>>
>>  The part I don't understand with this approach though:  how do you
>> ensure that someone else didn't change the data between your initial
>> failed transaction and the later replaying of the transaction?  You could
>> get lost writes in that situation.
>>
>>  Dominic (in the Cages blog post) explained a workaround with that for
>> his rollback proposal:  all subsequent readers or writers of that data
>> would have to check for abandoned transactions and roll them back
>> themselves before they could read the data.  I don't think this is
>> possible with the XACT_LOG "replay" approach in these slides though, based
>> on how the data is indexed (cassandra node token + timeUUID).
>>
>>
>>  PS:  How are you liking Cages?
>>
>>
>>
>>
>> 2011/12/6 Jérémy SEVELLEC <jsevellec@gmail.com>
>>
>>> Hi John,
>>>
>>>  I had exactly the same thoughts.
>>>
>>>  I'm using ZooKeeper and Cages to lock and isolate.
>>>
>>>  But how do you roll back?
>>> It's impossible, so try replay!
>>>
>>>  The idea is explained in this presentation
>>> http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
>>> from slide 24)
>>>
>>>  - insert your whole data into one column
>>> - do the job
>>> - remove (or expire) your column.
>>>
>>>  If there is a problem while "doing the job", you can still
>>> replay, again and again (synchronously or in a batch).
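
The three steps above might look roughly like the following. It is a
minimal sketch, not the scheme from the slides: plain dicts stand in for
Cassandra column families, and run_transaction, apply_writes and
replay_pending are hypothetical names made up for illustration.

```python
import json

# ASSUMPTION: plain dicts stand in for Cassandra column families.
xact_log = {}   # transaction id -> serialized list of pending writes
data = {}       # (row key, column name) -> value


def apply_writes(writes):
    """Apply each (row, column, value) write as an absolute overwrite."""
    for row, column, value in writes:
        data[(row, column)] = value


def run_transaction(txn_id, writes, apply_fn=apply_writes):
    """Log the full 'after' data, do the job, then clear the log entry."""
    xact_log[txn_id] = json.dumps(writes)   # 1. insert whole data into one column
    apply_fn(writes)                        # 2. do the job (may fail midway)
    del xact_log[txn_id]                    # 3. remove the column


def replay_pending():
    """Re-apply every transaction whose log entry was never removed."""
    for txn_id in list(xact_log):
        apply_writes(json.loads(xact_log[txn_id]))
        del xact_log[txn_id]
```

Replaying is only harmless here because every write sets an absolute
value, so applying it twice gives the same result. The sketch does nothing
about writes made by someone else between the failure and the replay.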
>>>
>>>  Regards
>>>
>>>  Jérémy
>>>
>>>
>>> 2011/12/5 John Laban <john@pagerduty.com>
>>>
>>>> Hello,
>>>>
>>>>  I'm building a system using Cassandra as a datastore and I have a few
>>>> places where I am in need of transactions.
>>>>
>>>>  I'm using ZooKeeper to provide locking when I'm in need of some
>>>> concurrency control or isolation, so that solves that half of the
>>>> puzzle.
>>>>
>>>>  What I need now is to sometimes be able to get atomicity across
>>>> multiple writes by simulating the "begin/rollback/commit" abilities of
>>>> a relational DB.  In other words, there are places where I need to
>>>> perform multiple updates/inserts, and if I fail partway through, I would
>>>> ideally be able to roll back the partially-applied updates.
>>>>
>>>>  Now, I *know* this isn't possible with Cassandra.  What I'm looking
>>>> for are all the best practices, or at least tips and tricks, so that I
>>>> can get around this limitation in Cassandra and still maintain a
>>>> consistent datastore.  (I am using quorum reads/writes so that eventual
>>>> consistency doesn't kick my ass here as well.)
>>>>
>>>>  Below are some ideas I've been able to dig up.  Please let me know if
>>>> any of them don't make sense, or if there are better approaches:
>>>>
>>>>
>>>>  1) Updates to a row in a column family are atomic.  So try to model
>>>> your data so that you would only ever need to update a single row in a
>>>> single CF at once.  Essentially, you model your data around
>>>> transactions.  This is tricky but can certainly be done in some
>>>> situations.
>>>>
>>>>  2) If you are only dealing with multiple row *inserts* (and not
>>>> updates), have one of the rows act as a 'commit' by essentially
>>>> validating the presence of the other rows.  For example, say you were
>>>> performing an operation where you wanted to create an Account row and 5
>>>> User rows all at once (this is an unlikely example, but bear with me).
>>>> You could insert 5 rows into the Users CF, and then the 1 row into the
>>>> Accounts CF, which acts as the commit.  If something went wrong before
>>>> the Account could be created, any Users that had been created so far
>>>> would be orphaned and unusable, as your business logic can ensure that
>>>> they can't exist without an Account.  You could also have an offline
>>>> cleanup process that swept away orphans.
>>>>
>>>>  3) Try to model your updates as idempotent column inserts instead.
>>>> How do you model updates as inserts?  Instead of munging the value
>>>> directly, you could insert a column containing the operation you want
>>>> to perform (like "+5").  It would work kind of like the Consistent Vote
>>>> Counting implementation: ( https://gist.github.com/416666 ).  How do
>>>> you make the inserts idempotent?  Make sure the column names correspond
>>>> to a request ID or some other identifier that would be identical across
>>>> re-drives of a given (perhaps originally failed) request.  This could
>>>> leave your datastore in a temporarily inconsistent state, but it would
>>>> eventually become consistent after a successful re-drive of the original
>>>> request.
>>>>
>>>>  4) You could take an approach like Dominic Williams proposed with
>>>> Cages:
>>>> http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
>>>> The gist is that you snapshot all the original values that you're about
>>>> to munge somewhere else (in his case, ZooKeeper), make your updates, and
>>>> then delete the snapshot (and that delete needs to be atomic).  If the
>>>> snapshot data was never deleted, then subsequent accessors (even
>>>> readers) of the data rows need to do the rollback of the previous
>>>> transaction themselves before they can read/write this data.  They do
>>>> the rollback by just overwriting the current values with what is in the
>>>> snapshot.  It offloads the work of the rollback to the next worker that
>>>> accesses the data.  This approach probably needs a generic/high-level
>>>> programming layer to handle all of the details and complexity, and it
>>>> doesn't seem like it was ever added to Cages.
>>>>
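
Idea (3) above could be sketched along these lines. It is a toy model
under assumptions: a dict of dicts stands in for a column family, and
record_operation and current_value are hypothetical helper names, not part
of any Cassandra client.

```python
# ASSUMPTION: a dict of dicts stands in for a column family:
# row key -> {column name: value}.
counters = {}


def record_operation(row_key, request_id, delta):
    """Insert the operation as a column named after the request ID.

    Re-driving the same request overwrites the same column, so the
    operation is counted once no matter how often it is retried.
    """
    counters.setdefault(row_key, {})[request_id] = delta


def current_value(row_key):
    """Derive the value by summing every recorded operation in the row."""
    return sum(counters.get(row_key, {}).values())
```

The value is never updated in place; readers pay for that by summing the
operation columns each time they need the current total.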
>>>>
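
Idea (4) above, rollback done by the next accessor, might be sketched as
follows. The assumptions are large: dicts stand in for the data rows and
for the snapshot store (ZooKeeper in the Cages post), all function names
are hypothetical, and the caller is assumed to already hold the relevant
lock, so any snapshot still present belongs to a transaction that died.

```python
# ASSUMPTION: dicts stand in for the data rows and the snapshot store;
# locking is assumed to be handled elsewhere.
rows = {}
snapshots = {}   # transaction id -> {row key: value before the transaction}

_MISSING = object()   # marks rows that did not exist before the transaction


def begin(txn_id, keys):
    """Snapshot the 'before' values of every row about to be changed."""
    snapshots[txn_id] = {k: rows.get(k, _MISSING) for k in keys}


def commit(txn_id):
    """Deleting the snapshot in one step is the commit point."""
    del snapshots[txn_id]


def roll_back_abandoned():
    """Restore 'before' values for every snapshot that was never deleted."""
    for txn_id in list(snapshots):
        for key, before in snapshots[txn_id].items():
            if before is _MISSING:
                rows.pop(key, None)
            else:
                rows[key] = before
        del snapshots[txn_id]


def read(key):
    """Readers clean up abandoned transactions before looking at the data."""
    roll_back_abandoned()
    return rows.get(key)
```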
>>>>  Are there other approaches or best practices that I missed?  I would
>>>> be very interested in hearing any opinions from those who have tackled
>>>> these problems before.
>>>>
>>>>  Thanks!
>>>>  John
>>>>
>>>>
>>>>
>>>
>>>
>>>   --
>>> Jérémy
>>>
>>
>>
>>
>

--e0cb4e43d17b570b1504b3ea3ed5--