incubator-cassandra-user mailing list archives

From AJ <...@dude.podzone.net>
Subject Re: Strong Consistency with ONE read/writes
Date Wed, 13 Jul 2011 05:19:46 GMT
On 7/12/2011 10:48 AM, Yang wrote:
> for example,
> the coordinator writes records 1, 2, 3, 4, 5 in sequence.
> if you have replicas A, B, C,
> currently A can have 1, 3
> B can have 1, 3, 4
> C can have 2, 3, 4, 5
>
> by "prefix", I mean I want them to have only 1..n, where n is some
> number between 1 and 5,
> for example A having 1, 2, 3
> B having 1, 2, 3, 4
> C having 1, 2, 3, 4, 5
>
> the way we enforce this prefix pattern is that
> 1) the leader is ensured to have everything that's sent out, otherwise
> it's removed from the leader position
> 2) non-leader replicas are guaranteed to receive a prefix, because of the
> FIFO property of the connection between replica and coordinator; if this
> connection breaks, the replica must catch up from the authoritative source,
> the leader
>
> there is one point I hand-waved a bit: there are many coordinators, and
> the "prefix" from each of them is different. I still need to think about
> this; the worst case is that we need to force all traffic to come through
> the leader, which is less interesting because it's almost hbase then...
>

Are you saying that all replicas will receive the value whether or not they 
actually own the key range for the value?  If a node is not a replica 
for a value, it will not store it, but it will still write it to its 
transaction log as a backup in case the leader dies.  Is that right?
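
For concreteness, here is a minimal sketch of the prefix invariant Yang
describes (hypothetical code, not Cassandra internals): a replica's history
is consistent with the leader only if it holds records 1..k with no holes.

// Minimal sketch (hypothetical, not Cassandra code) of the "prefix" invariant:
// a replica is consistent with the leader's log iff it holds records 1..k for
// some k, never a sequence with holes like {1, 3, 4}.
import java.util.List;

public class PrefixInvariant {

    // Returns true if 'received' is exactly 1, 2, ..., received.size().
    static boolean isPrefix(List<Long> received) {
        long expected = 1;
        for (long seq : received) {
            if (seq != expected++) {
                return false;   // a hole: the replica missed an earlier record
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefix(List.of(1L, 2L, 3L)));      // true  -> valid prefix
        System.out.println(isPrefix(List.of(1L, 3L, 4L)));      // false -> hole at 2
        System.out.println(isPrefix(List.of(2L, 3L, 4L, 5L)));  // false -> missing 1
    }
}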

>
>
> On Tue, Jul 12, 2011 at 7:37 AM, AJ<aj@dude.podzone.net>  wrote:
>> Yang, I'm not sure I understand what you mean by "prefix of the HLog".
>>   Also, can you explain what failure scenario you are talking about?  The
>> major failure that I see is when the leader node confirms to the client a
>> successful local write, but then fails before the write can be replicated to
>> any other replica node.  But, then again, you also say that the leader does
>> not forward replicas in your idea; so it's not really clear.
>>
>> I'm still trying to figure out how to make this work with normal Cass
>> operation.
>>
>> aj
>>
>> On 7/11/2011 3:48 PM, Yang wrote:
>>> I'm not proposing any changes to be done, but this looks like a very
>>> interesting topic for thought/hack/learning, so the following are only
>>> for thought exercises ....
>>>
>>>
>>> HBase enforces a single write/read entry point, so you can achieve
>>> strong consistency by writing/reading only one node.  but just writing
>>> to one node exposes you to loss of data if that node fails. so the
>>> region server HLog is replicated to 3 HDFS data nodes.  the
>>> interesting thing here is that each replica sees a complete *prefix*
>>> of the HLog: it won't miss a record; if a record sync() to a data node
>>> fails, all the existing bytes in the block are replicated to a new
>>> data node.
>>>
>>> if we employ a similar "leader" node among the N replicas of
>>> cassandra (the coordinator always waits for the reply from the leader, but
>>> the leader does not do further replication like in HBase or counters), the
>>> leader sees all writes onto the key range, but the other replicas
>>> could miss some writes. as a result, each of the non-leader replicas'
>>> write history has some "holes", so when the leader dies and we
>>> elect a new one, no one is going to have a complete history, and you'd
>>> have to do a repair amongst all the replicas to reconstruct the full
>>> history, which is slow.
>>>
>>> it seems possible that we could utilize the FIFO property of the
>>> IncomingTcpConnection to simplify history reconstruction, just like
>>> Zookeeper. if the IncomingTcpConnection of a replica fails, that means
>>> it may have missed some edits; then, when it reconnects, we force
>>> it to talk to the active leader first to catch up. when the
>>> leader dies, the next leader is elected to be the replica with the
>>> most recent history.  by maintaining the property that each node has a
>>> complete prefix of history, we only need to catch up on the tail of
>>> history, and avoid doing a complete repair on the entire
>>> memtable+SSTable.  but one issue is that the history at the leader has
>>> to be kept really long ----- if a non-leader replica goes offline for 2
>>> days, the leader has to keep all the history for 2 days to feed
>>> to the replica when it comes back online. but possibly this could be
>>> limited to some max length, so that beyond that length the woken replica
>>> simply does a complete bootstrap.
>>>
>>>
>>> thanks
>>> yang
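
A minimal sketch of the catch-up scheme described above, using hypothetical
classes (none of this is Cassandra code): the leader retains a bounded
history, a reconnecting replica pulls only the tail past its last applied
sequence number, and on leader failure the replica with the longest prefix
is elected.

// Hypothetical sketch of prefix-based catch-up and leader election.
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class LeaderLog {
    // seq -> mutation payload; in practice this would be bounded, and a
    // replica that falls behind the retention window re-bootstraps instead.
    private final NavigableMap<Long, byte[]> history = new TreeMap<>();

    void append(long seq, byte[] mutation) {
        history.put(seq, mutation);
    }

    // Tail of history that a reconnecting replica is missing.
    NavigableMap<Long, byte[]> tailAfter(long lastAppliedSeq) {
        return history.tailMap(lastAppliedSeq, false);
    }
}

class Replica {
    final String name;
    long lastAppliedSeq;          // length of the contiguous prefix it holds

    Replica(String name, long lastAppliedSeq) {
        this.name = name;
        this.lastAppliedSeq = lastAppliedSeq;
    }

    // On reconnect, stream only the missing tail from the leader.
    void catchUp(LeaderLog leader) {
        for (var e : leader.tailAfter(lastAppliedSeq).entrySet()) {
            // applying e.getValue() to the local store would go here
            lastAppliedSeq = e.getKey();
        }
    }
}

class Election {
    // On leader failure, pick the replica with the most recent history.
    static Replica electNewLeader(List<Replica> survivors) {
        return survivors.stream()
                .max(Comparator.comparingLong(r -> r.lastAppliedSeq))
                .orElseThrow();
    }
}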
>>> On Sun, Jul 3, 2011 at 8:25 PM, AJ<aj@dude.podzone.net>    wrote:
>>>> We seem to be having a fundamental misunderstanding.  Thanks for your
>>>> comments. aj
>>>>
>>>> On 7/3/2011 8:28 PM, William Oberman wrote:
>>>>
>>>> I'm using cassandra as a tool, like a black box with a certain contract
>>>> to
>>>> the world.  Without modifying the "core", C* will send the updates to all
>>>> replicas, so your plan would cause the extra write (for the placeholder).
>>>>   I
>>>> wasn't assuming a modification to how C* fundamentally works.
>>>> Sounds like you are hacking on (or at least looking at) the source, so
>>>> all the power to you if/when you try these kinds of changes.
>>>> will
>>>> On Sun, Jul 3, 2011 at 8:45 PM, AJ<aj@dude.podzone.net>    wrote:
>>>>> On 7/3/2011 6:32 PM, William Oberman wrote:
>>>>>
>>>>> Was just going off of: " Send the value to the primary replica and send
>>>>> placeholder values to the other replicas".  Sounded like you wanted to
>>>>> write
>>>>> the value to one, and write the placeholder to N-1 to me.
>>>>>
>>>>> Yes, that is what I was suggesting.  The point of the placeholders is
>>>>> to handle the crash case that I talked about... "like" a WAL does.
>>>>>
>>>>> But, C* will propagate the value to N-1 eventually anyways, 'cause
>>>>> that's
>>>>> just what it does anyways :-)
>>>>> will
>>>>>
>>>>> On Sun, Jul 3, 2011 at 7:47 PM, AJ<aj@dude.podzone.net>    wrote:
>>>>>> On 7/3/2011 3:49 PM, Will Oberman wrote:
>>>>>>
>>>>>> Why not send the value itself instead of a placeholder?  Now it takes
>>>>>> 2x writes on a random node to do a single update (write placeholder,
>>>>>> write update) and N*x writes from the client (write value, write
>>>>>> placeholder to N-1), where N is the replication factor.  Seems like
>>>>>> extra network and IO instead of less...
>>>>>>
>>>>>> To send the value to each node is 1.) unnecessary, and 2.) will only
>>>>>> cause a large burst of network traffic.  Think about if it's a large
>>>>>> data value, such as a document.  Just let C* do its thing.  The extra
>>>>>> messages are tiny and don't significantly increase latency since they
>>>>>> are all sent asynchronously.
>>>>>>
>>>>>>
>>>>>> Of course, I still think this sounds like reimplementing Cassandra
>>>>>> internals in a Cassandra client (just guessing, I'm not a cassandra
>>>>>> dev)
>>>>>>
>>>>>> I don't see how.  Maybe you should take a peek at the source.
>>>>>>
>>>>>>
>>>>>> On Jul 3, 2011, at 5:20 PM, AJ<aj@dude.podzone.net>    wrote:
>>>>>>
>>>>>> Yang,
>>>>>>
>>>>>> How would you deal with the problem where the 1st node responds with
>>>>>> success but then crashes before forwarding the write to any replicas?
>>>>>> Then, after switching to the next primary, a read would return stale
>>>>>> data.
>>>>>>
>>>>>> Here's a quick-n-dirty way:  Send the value to the primary replica
>>>>>> and send placeholder values to the other replicas.  The placeholder
>>>>>> value is something like "PENDING_UPDATE".  The placeholder values are
>>>>>> sent with timestamps 1 less than the timestamp for the actual value
>>>>>> that went to the primary.  Later, when the changes propagate, the
>>>>>> actual values will overwrite the placeholders.  In the event of a
>>>>>> crash before the placeholder gets overwritten, the next read value
>>>>>> will tell the client so.  The client will report to the user that the
>>>>>> key/column is unavailable.  The downside is you've overwritten your
>>>>>> data and maybe would like to know what the old data was!  But, maybe
>>>>>> there's another way using other columns or with MVCC.  The client
>>>>>> would want a success from the primary and the secondary replicas to
>>>>>> be certain of future read consistency in case the primary goes down
>>>>>> immediately, as I said above.  The ability to set an "update_pending"
>>>>>> flag on any column value would probably make this work.  But, I'll
>>>>>> think more on this later.
>>>>>>
>>>>>> aj
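
A rough sketch of the placeholder write described above, against a
hypothetical per-replica client interface (not the Thrift/Hector API): the
real value goes to the primary with timestamp ts, and PENDING_UPDATE markers
go to the other replicas with ts - 1 so the replicated real value overwrites
them once it arrives.

// Hypothetical sketch of the placeholder-write scheme.
import java.util.List;

interface ReplicaClient {                       // hypothetical per-node client
    void insert(String key, String column, String value, long timestampMicros);
}

class PlaceholderWriter {
    static final String PENDING = "PENDING_UPDATE";

    static void write(ReplicaClient primary, List<ReplicaClient> others,
                      String key, String column, String value) {
        long ts = System.currentTimeMillis() * 1000;   // microseconds, per C* convention
        // Real value to the primary with timestamp ts.
        primary.insert(key, column, value, ts);
        // Placeholders to the other replicas with ts - 1, so the replicated
        // real value (ts) overwrites them when it propagates.
        for (ReplicaClient r : others) {
            r.insert(key, column, PENDING, ts - 1);
        }
    }
}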
>>>>>>
>>>>>> On 7/2/2011 10:55 AM, Yang wrote:
>>>>>>
>>>>>> there is a JIRA completed in 0.7.x that "prefers" a certain node in the
>>>>>> snitch, so this does roughly what you want MOST of the time
>>>>>>
>>>>>> but the problem is that it does not GUARANTEE that the same node will
>>>>>> always be read.  I recently read the HBase vs Cassandra comparison
>>>>>> thread that started after Facebook dropped Cassandra for their messaging
>>>>>> system, and understood some of the differences. what you want is
>>>>>> essentially what HBase does. the fundamental difference there is really
>>>>>> due to the gossip protocol: it's a probabilistic, or eventually
>>>>>> consistent, failure detector, while HBase/Google Bigtable use
>>>>>> Zookeeper/Chubby to provide a strong failure detector (a distributed
>>>>>> lock).  so in HBase, if a tablet server goes down, it really goes down;
>>>>>> it cannot re-grab the tablet from the new tablet server without going
>>>>>> through a start-up protocol (notifying the master, which would notify
>>>>>> the clients, etc).  in other words it is guaranteed that one tablet is
>>>>>> served by only one tablet server at any given time.  in comparison, the
>>>>>> above JIRA only TRIES to serve that key from one particular replica.
>>>>>> HBase can have that guarantee because the group membership is maintained
>>>>>> by the strong failure detector.
>>>>>> just for hacking curiosity, a strong failure detector + Cassandra
>>>>>> replicas is not impossible (actually it seems not difficult), although
>>>>>> the performance is not clear. what would such a strong failure detector
>>>>>> bring to Cassandra besides this ONE-ONE strong consistency? that is an
>>>>>> interesting question, I think.
>>>>>> considering that HBase has been deployed on big clusters, it is probably
>>>>>> OK with the performance of the strong Zookeeper failure detector. then a
>>>>>> further question is: why did Dynamo originally choose to use the
>>>>>> probabilistic failure detector? yes, Dynamo's main theme is "eventually
>>>>>> consistent", so the Phi detector is **enough**, but if a strong detector
>>>>>> buys us more with little cost, wouldn't that be great?
>>>>>>
>>>>>>
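
For reference, a rough sketch of the phi accrual ("Phi detector") idea Yang
refers to, using the common exponential approximation of heartbeat
inter-arrival times; this is only an illustration, not Cassandra's actual
FailureDetector implementation.

// Illustrative phi accrual failure detector (exponential approximation).
// Instead of a binary up/down verdict, each node gets a suspicion level phi
// that grows the longer a heartbeat is overdue; callers pick a threshold.
import java.util.ArrayDeque;
import java.util.Deque;

class PhiAccrualDetector {
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastHeartbeat = -1;
    private static final int WINDOW = 1000;      // sliding window of intervals

    void heartbeat(long nowMillis) {
        if (lastHeartbeat >= 0) {
            intervalsMillis.addLast(nowMillis - lastHeartbeat);
            if (intervalsMillis.size() > WINDOW) {
                intervalsMillis.removeFirst();
            }
        }
        lastHeartbeat = nowMillis;
    }

    // phi = -log10( P(heartbeat arrives later than now) ), with the
    // inter-arrival time modelled as exponential with the observed mean.
    double phi(long nowMillis) {
        if (lastHeartbeat < 0 || intervalsMillis.isEmpty()) {
            return 0.0;
        }
        double mean = intervalsMillis.stream().mapToLong(Long::longValue)
                                     .average().orElse(1.0);
        double t = nowMillis - lastHeartbeat;
        return (t / mean) * Math.log10(Math.E);   // -log10(exp(-t/mean))
    }
}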
>>>>>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net> wrote:
>>>>>>> Is this possible?
>>>>>>>
>>>>>>> All reads and writes for a given key will always go to the same node
>>>>>>> from a client.  It seems the only thing needed is to allow the clients
>>>>>>> to compute which node is the closest replica for the given key using
>>>>>>> the same algorithm C* uses.  When the first replica receives the write
>>>>>>> request, it will write to itself, which should complete before any of
>>>>>>> the other replicas, and then return.  The loads should still stay
>>>>>>> balanced if using the random partitioner.  If the first replica becomes
>>>>>>> unavailable (however that is defined), then the clients can send to the
>>>>>>> next replica in the ring and switch from ONE writes/reads to QUORUM
>>>>>>> writes/reads temporarily until the first replica becomes available
>>>>>>> again.  QUORUM is required since there could be some replicas that were
>>>>>>> not updated after the first replica went down.
>>>>>>>
>>>>>>> Will this work?  The goal is to have strong consistency with a
>>>>>>> read/write consistency level as low as possible, with a network
>>>>>>> performance boost as a secondary goal.
>>>>>>
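
To make the proposal above concrete, here is a client-side sketch of
computing the first replica for a key, assuming RandomPartitioner-style MD5
tokens and SimpleStrategy-style placement (walk the ring clockwise and take
the next N nodes). It ignores snitches and rack awareness, so it illustrates
the idea rather than reproducing C*'s own placement code.

// Client-side sketch: locate the "primary" replica for a key on the ring.
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class RingLocator {
    // token -> node address, as assigned to the nodes in the cluster
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();
    private final int replicationFactor;

    RingLocator(TreeMap<BigInteger, String> ring, int replicationFactor) {
        this.ring.putAll(ring);
        this.replicationFactor = replicationFactor;
    }

    // MD5 of the key, taken as a non-negative BigInteger (RandomPartitioner-like).
    static BigInteger token(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(digest).abs();
    }

    // First 'replicationFactor' distinct nodes clockwise from the key's token;
    // the first entry is the replica the client would pin its ONE ops to.
    List<String> replicasFor(String key) throws Exception {
        BigInteger t = token(key);
        List<String> replicas = new ArrayList<>();
        // Walk clockwise: tokens >= t first, then wrap to the start of the ring.
        List<String> clockwise = new ArrayList<>(ring.tailMap(t).values());
        clockwise.addAll(ring.headMap(t).values());
        for (String node : clockwise) {
            if (!replicas.contains(node)) {
                replicas.add(node);
            }
            if (replicas.size() == replicationFactor) {
                break;
            }
        }
        return replicas;
    }
}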
>>>>>
>>>>> --
>>>>> Will Oberman
>>>>> Civic Science, Inc.
>>>>> 3030 Penn Avenue., First Floor
>>>>> Pittsburgh, PA 15201
>>>>> (M) 412-480-7835
>>>>> (E) oberman@civicscience.com
>>>>>
>>>>
>>>> --
>>>> Will Oberman
>>>> Civic Science, Inc.
>>>> 3030 Penn Avenue., First Floor
>>>> Pittsburgh, PA 15201
>>>> (M) 412-480-7835
>>>> (E) oberman@civicscience.com
>>>>
>>>>
>>

