Subject: Re: Strong Consistency with ONE read/writes
From: Yang
To: user@cassandra.apache.org
Date: Tue, 12 Jul 2011 09:48:19 -0700

For example, the coordinator writes records 1, 2, 3, 4, 5 in sequence, and you
have replicas A, B, C.

Currently A can end up with 1, 3; B with 1, 3, 4; and C with 2, 3, 4, 5.

By "prefix", I mean I want each of them to hold only 1..n, where n is some
number between 1 and 5, for example:

A having 1, 2, 3
B having 1, 2, 3, 4
C having 1, 2, 3, 4, 5

The way we enforce this prefix pattern is:

1) the leader is guaranteed to have everything that has been sent out;
otherwise it is removed from the leader position
2) non-leader replicas are guaranteed to receive a prefix, because the
connection between replica and coordinator is FIFO; if this connection breaks,
the replica must catch up from the authoritative source, the leader

There is one point I hand-waved a bit: there are many coordinators, and the
"prefix" from each of them is different. I still need to think about this; the
worst case is that we need to force all traffic to come through the leader,
which is less interesting because then it's almost HBase...
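To make the prefix invariant concrete, here is a minimal, hypothetical sketch
(not Cassandra code; the PrefixReplica class and its fields are made up for
illustration): a replica applies a write only if it carries the next expected
sequence number, so its local log always stays a contiguous prefix 1..n of the
leader's write sequence, and any gap forces a catch-up from the leader.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the "prefix" invariant: a replica applies a write
    // only if it is the next expected sequence number, so its local log is
    // always a contiguous prefix 1..n of the leader's write sequence.
    class PrefixReplica {
        private final List<String> log = new ArrayList<>(); // holds records 1..n
        private long nextExpected = 1;                       // next seq we accept

        // Returns true if the write was applied; false means a gap was detected
        // (the FIFO stream broke) and the replica must catch up from the leader.
        synchronized boolean apply(long seq, String record) {
            if (seq != nextExpected) {
                return false;        // would create a hole; refuse and catch up
            }
            log.add(record);
            nextExpected++;
            return true;
        }

        synchronized long prefixLength() { return nextExpected - 1; }

        public static void main(String[] args) {
            PrefixReplica a = new PrefixReplica();
            a.apply(1, "r1");
            a.apply(2, "r2");
            System.out.println(a.apply(4, "r4"));  // false: prefix stays 1..2
            System.out.println(a.prefixLength());  // 2
        }
    }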
On Tue, Jul 12, 2011 at 7:37 AM, AJ wrote:
> Yang, I'm not sure I understand what you mean by "prefix of the HLog".
> Also, can you explain what failure scenario you are talking about? The
> major failure that I see is when the leader node confirms to the client a
> successful local write, but then fails before the write can be replicated
> to any other replica node. But, then again, you also say that the leader
> does not forward replicas in your idea; so it's not real clear.
>
> I'm still trying to figure out how to make this work with normal Cass
> operation.
>
> aj
>
> On 7/11/2011 3:48 PM, Yang wrote:
>>
>> I'm not proposing any changes to be made; this just looks like a very
>> interesting topic for thought/hacking/learning, so the following is only
>> a thought exercise...
>>
>> HBase enforces a single write/read entry point, so you can achieve
>> strong consistency by writing/reading only one node, but just writing
>> to one node exposes you to loss of data if that node fails. So the
>> region server HLog is replicated to 3 HDFS data nodes. The interesting
>> thing here is that each replica sees a complete *prefix* of the HLog:
>> it won't miss a record, and if a record's sync() to a data node fails,
>> all the existing bytes in the block are replicated to a new data node.
>>
>> If we employ a similar "leader" node among the N replicas of Cassandra
>> (the coordinator always waits for the reply from the leader, but the
>> leader does not do further replication like in HBase or counters), the
>> leader sees all writes onto the key range, but the other replicas could
>> miss some writes. As a result, each of the non-leader replicas' write
>> history has some "holes", so when the leader dies and we elect a new
>> one, no one is going to have a complete history. So you'd have to do a
>> repair amongst all the replicas to reconstruct the full history, which
>> is slow.
>>
>> It seems possible that we could utilize the FIFO property of the
>> IncomingTcpConnection to simplify history reconstruction, just like
>> Zookeeper. If the IncomingTcpConnection of a replica fails, that means
>> it may have missed some edits, so when it reconnects, we force it to
>> talk to the active leader first, to catch up to date. When the leader
>> dies, the next leader elected is the replica with the most recent
>> history. By maintaining the property that each node has a complete
>> prefix of history, we only need to catch up on the tail of history, and
>> avoid doing a complete repair on the entire memtable+SSTable. One issue
>> is that the history at the leader has to be kept really long: if a
>> non-leader replica goes offline for 2 days, the leader has to keep all
>> the history for 2 days to feed it to the replica when it comes back
>> online. But possibly this could be limited to some max length, so that
>> beyond that length the returning replica simply does a complete
>> bootstrap.
>>
>> thanks
>> yang
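A minimal sketch of the catch-up step described in the quoted message above
(hypothetical; the LeaderHistory class and its methods are invented for
illustration, nothing here is real Cassandra code): the leader keeps a bounded
history of recent writes keyed by sequence number, a reconnecting replica asks
only for the suffix it missed, and a replica that has fallen behind the trimmed
history falls back to a complete bootstrap.

    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Hypothetical catch-up protocol: the leader keeps a bounded in-memory
    // history of recent writes; a reconnecting replica replays only the suffix
    // it missed, or does a full bootstrap if the history has been trimmed.
    class LeaderHistory {
        private final NavigableMap<Long, String> history = new TreeMap<>();
        private final int maxEntries;

        LeaderHistory(int maxEntries) { this.maxEntries = maxEntries; }

        synchronized void record(long seq, String write) {
            history.put(seq, write);
            while (history.size() > maxEntries) {
                history.pollFirstEntry();   // trim oldest; stragglers must bootstrap
            }
        }

        // Returns the missing suffix fromSeq..latest, or null if already trimmed.
        synchronized List<String> suffixFrom(long fromSeq) {
            if (!history.isEmpty() && fromSeq < history.firstKey()) {
                return null;                // too far behind: do a full bootstrap
            }
            return List.copyOf(history.tailMap(fromSeq, true).values());
        }
    }

A reconnecting replica would call suffixFrom(prefixLength + 1) and, on a null
result, fall back to a complete bootstrap, which matches the "max length"
compromise mentioned in the message above.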
>> On Sun, Jul 3, 2011 at 8:25 PM, AJ wrote:
>>>
>>> We seem to be having a fundamental misunderstanding. Thanks for your
>>> comments. aj
>>>
>>> On 7/3/2011 8:28 PM, William Oberman wrote:
>>>
>>> I'm using Cassandra as a tool, like a black box with a certain contract
>>> to the world. Without modifying the "core", C* will send the updates to
>>> all replicas, so your plan would cause the extra write (for the
>>> placeholder). I wasn't assuming a modification to how C* fundamentally
>>> works. Sounds like you are hacking (or at least looking) at the source,
>>> so all the power to you if/when you try these kinds of changes.
>>> will
>>>
>>> On Sun, Jul 3, 2011 at 8:45 PM, AJ wrote:
>>>>
>>>> On 7/3/2011 6:32 PM, William Oberman wrote:
>>>>
>>>> Was just going off of: "Send the value to the primary replica and send
>>>> placeholder values to the other replicas". Sounded like you wanted to
>>>> write the value to one, and write the placeholder to N-1, to me.
>>>>
>>>> Yes, that is what I was suggesting. The point of the placeholders is to
>>>> handle the crash case that I talked about... "like" a WAL does.
>>>>
>>>> But, C* will propagate the value to N-1 eventually anyways, 'cause
>>>> that's just what it does anyways :-)
>>>> will
>>>>
>>>> On Sun, Jul 3, 2011 at 7:47 PM, AJ wrote:
>>>>>
>>>>> On 7/3/2011 3:49 PM, Will Oberman wrote:
>>>>>
>>>>> Why not send the value itself instead of a placeholder? Now it takes
>>>>> 2x writes on a random node to do a single update (write placeholder,
>>>>> write update) and Nx writes from the client (write value, write
>>>>> placeholder to N-1), where N is the replication factor. Seems like
>>>>> extra network and IO instead of less...
>>>>>
>>>>> To send the value to each node is 1) unnecessary, and 2) will only
>>>>> cause a large burst of network traffic. Think about if it's a large
>>>>> data value, such as a document. Just let C* do its thing. The extra
>>>>> messages are tiny and don't significantly increase latency since they
>>>>> are all sent asynchronously.
>>>>>
>>>>> Of course, I still think this sounds like reimplementing Cassandra
>>>>> internals in a Cassandra client (just guessing, I'm not a Cassandra
>>>>> dev)
>>>>>
>>>>> I don't see how. Maybe you should take a peek at the source.
>>>>>
>>>>> On Jul 3, 2011, at 5:20 PM, AJ wrote:
>>>>>
>>>>> Yang,
>>>>>
>>>>> How would you deal with the problem where the 1st node responds with
>>>>> success but then crashes before completely forwarding any replicas?
>>>>> Then, after switching to the next primary, a read would return stale
>>>>> data.
>>>>>
>>>>> Here's a quick-n-dirty way: send the value to the primary replica and
>>>>> send placeholder values to the other replicas. The placeholder value
>>>>> is something like "PENDING_UPDATE". The placeholder values are sent
>>>>> with timestamps 1 less than the timestamp of the actual value that
>>>>> went to the primary. Later, when the changes propagate, the actual
>>>>> values will overwrite the placeholders. In the event of a crash before
>>>>> the placeholder gets overwritten, the next read will tell the client
>>>>> so, and the client will report to the user that the key/column is
>>>>> unavailable. The downside is you've overwritten your data and maybe
>>>>> would like to know what the old data was! But maybe there's another
>>>>> way using other columns or with MVCC. The client would want a success
>>>>> from the primary and the secondary replicas to be certain of future
>>>>> read consistency in case the primary goes down immediately, as I said
>>>>> above. The ability to set an "update_pending" flag on any column value
>>>>> would probably make this work. But I'll think more on this later.
>>>>>
>>>>> aj
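Purely as an illustration of the quick-n-dirty placeholder scheme quoted above,
a self-contained, hypothetical Java sketch (it only simulates Cassandra-style
last-write-wins timestamps with plain maps; it is not a real client, and the
class and method names are made up): the real value goes to the primary with
timestamp T, a PENDING_UPDATE marker goes to the other replicas with T - 1, so
normal replication eventually overwrites the marker, and a read that still sees
the marker tells the client the column is in an unknown state.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical simulation of the placeholder idea with last-write-wins
    // timestamp resolution, as described in the quoted message.
    class PlaceholderWriteSketch {
        static final String PENDING = "PENDING_UPDATE";

        // One column per replica; higher timestamp wins, like Cassandra.
        static class Column {
            long timestamp; String value;
            Column(long t, String v) { timestamp = t; value = v; }
        }

        static void writeIfNewer(Map<String, Column> replica, String key,
                                 long ts, String value) {
            Column cur = replica.get(key);
            if (cur == null || ts > cur.timestamp) {
                replica.put(key, new Column(ts, value));
            }
        }

        public static void main(String[] args) {
            Map<String, Column> primary = new HashMap<>();
            Map<String, Column> secondary = new HashMap<>();

            long t = System.currentTimeMillis();
            writeIfNewer(primary, "k", t, "real value");   // value to the primary
            writeIfNewer(secondary, "k", t - 1, PENDING);  // placeholder, ts - 1

            // If the primary crashes before replicating, a read from the
            // secondary returns PENDING_UPDATE, so the client knows the column
            // is unavailable rather than silently reading stale data.
            System.out.println(secondary.get("k").value);  // PENDING_UPDATE

            // Once normal replication delivers the real value, it wins on ts.
            writeIfNewer(secondary, "k", t, "real value");
            System.out.println(secondary.get("k").value);  // real value
        }
    }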
>>>>>
>>>>> On 7/2/2011 10:55 AM, Yang wrote:
>>>>>
>>>>> There is a JIRA completed in 0.7.x that "prefers" a certain node in
>>>>> the snitch, so this does roughly what you want MOST of the time.
>>>>>
>>>>> But the problem is that it does not GUARANTEE that the same node will
>>>>> always be read. I recently read the HBase vs Cassandra comparison
>>>>> thread that started after Facebook dropped Cassandra for their
>>>>> messaging system, and understood some of the differences. What you
>>>>> want is essentially what HBase does. The fundamental difference there
>>>>> is really due to the gossip protocol: it's a probabilistic, or
>>>>> eventually consistent, failure detector, while HBase/Google Bigtable
>>>>> use Zookeeper/Chubby to provide a strong failure detector (a
>>>>> distributed lock). So in HBase, if a tablet server goes down, it
>>>>> really goes down; it cannot re-grab the tablet from the new tablet
>>>>> server without going through a start-up protocol (notifying the
>>>>> master, which would notify the clients, etc.). In other words, it is
>>>>> guaranteed that one tablet is served by only one tablet server at any
>>>>> given time. In comparison, the above JIRA only TRIES to serve that key
>>>>> from one particular replica. HBase can have that guarantee because the
>>>>> group membership is maintained by the strong failure detector.
>>>>>
>>>>> Just out of hacking curiosity, a strong failure detector + Cassandra
>>>>> replicas is not impossible (actually it seems not difficult), although
>>>>> the performance is not clear. What would such a strong failure
>>>>> detector bring to Cassandra besides this ONE-ONE strong consistency?
>>>>> That is an interesting question, I think.
>>>>>
>>>>> Considering that HBase has been deployed on big clusters, the
>>>>> performance of the strong Zookeeper failure detector is probably OK.
>>>>> Then a further question is: why did Dynamo originally choose to use
>>>>> the probabilistic failure detector? Yes, Dynamo's main theme is
>>>>> "eventually consistent", so the Phi detector is **enough**, but if a
>>>>> strong detector buys us more at little cost, wouldn't that be great?
>>>>>
>>>>> On Fri, Jul 1, 2011 at 6:53 PM, AJ wrote:
>>>>>>
>>>>>> Is this possible?
>>>>>>
>>>>>> All reads and writes for a given key will always go to the same node
>>>>>> from a client. It seems the only thing needed is to allow the clients
>>>>>> to compute which node is the closest replica for the given key, using
>>>>>> the same algorithm C* uses. When the first replica receives the write
>>>>>> request, it will write to itself, which should complete before any of
>>>>>> the other replicas, and then return. The load should still stay
>>>>>> balanced if using the random partitioner. If the first replica
>>>>>> becomes unavailable (however that is defined), then the clients can
>>>>>> send to the next replica in the ring and switch from ONE reads/writes
>>>>>> to QUORUM reads/writes temporarily until the first replica becomes
>>>>>> available again. QUORUM is required since there could be some
>>>>>> replicas that were not updated after the first replica went down.
>>>>>>
>>>>>> Will this work? The goal is to have strong consistency with a
>>>>>> read/write consistency level as low as possible, with a network
>>>>>> performance boost as a secondary goal.
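A rough, hypothetical sketch of the client-side routing idea in AJ's original
message, assuming a RandomPartitioner-style ring (MD5 hash taken as a
non-negative BigInteger, as Cassandra's RandomPartitioner does); the ring
contents, node addresses, and the isAvailable() check are invented for
illustration only.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Sketch: compute the key's token the way RandomPartitioner does and pick
    // the first node clockwise on the ring as the "primary" replica; if that
    // node looks down, fall back from ONE to QUORUM until it is back.
    class PrimaryReplicaSketch {
        enum ConsistencyLevel { ONE, QUORUM }

        // token -> node address; a real client would learn this from the cluster
        static final SortedMap<BigInteger, String> ring = new TreeMap<>();

        static BigInteger token(String key) throws Exception {
            byte[] hash = MessageDigest.getInstance("MD5")
                                       .digest(key.getBytes("UTF-8"));
            return new BigInteger(hash).abs();   // RandomPartitioner-style token
        }

        static String primaryReplica(String key) throws Exception {
            BigInteger t = token(key);
            SortedMap<BigInteger, String> tail = ring.tailMap(t);
            BigInteger owner = tail.isEmpty() ? ring.firstKey()   // wrap around
                                              : tail.firstKey();
            return ring.get(owner);
        }

        // Placeholder for a real failure detector / connection check.
        static boolean isAvailable(String node) { return true; }

        static ConsistencyLevel levelFor(String key) throws Exception {
            // ONE against the primary when it is up; QUORUM otherwise, since
            // some replicas may have missed writes while the primary was down.
            return isAvailable(primaryReplica(key)) ? ConsistencyLevel.ONE
                                                    : ConsistencyLevel.QUORUM;
        }

        public static void main(String[] args) throws Exception {
            ring.put(new BigInteger("0"), "10.0.0.1");
            ring.put(new BigInteger("2").pow(125), "10.0.0.2");
            ring.put(new BigInteger("2").pow(126), "10.0.0.3");
            System.out.println(primaryReplica("some-key") + " @ "
                               + levelFor("some-key"));
        }
    }

A real client would also need the replication strategy to find the remaining
replicas for the QUORUM fallback, and would refresh the ring from the cluster;
this sketch ignores both.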
>>>> --
>>>> Will Oberman
>>>> Civic Science, Inc.
>>>> 3030 Penn Avenue, First Floor
>>>> Pittsburgh, PA 15201
>>>> (M) 412-480-7835
>>>> (E) oberman@civicscience.com