bookkeeper-distributedlog-dev mailing list archives

From Xi Liu <xi.liu....@gmail.com>
Subject Re: [Discuss] Transaction Support
Date Mon, 19 Dec 2016 16:29:56 GMT
Thanks liang.

- Xi

On Sun, Dec 18, 2016 at 5:52 PM, liang xie <xieliang007@gmail.com> wrote:

> The developers upload the design doc onto JIRA at least for
> HADOOP/HBase/Cassandra/... projects
>
> On Mon, Dec 19, 2016 at 12:48 AM, Sijie Guo <sijie@apache.org> wrote:
> > Hi Xi,
> >
> > sorry for late response. I will review it soon.
> >
> > regarding this, a separate question: "are we going to use google doc
> > instead of email thread for any discussion"? I am a bit worried that the
> > discussion will become lost after moving to google doc. No idea on how
> > other apache projects are doing.
> >
> > - Sijie
> >
> > On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> I finalized the first version of the design. This time I used a Google
> >> doc so that it is easier to comment on, and I added a link to it on the
> >> wiki page. I will update the wiki page once we settle on the final design.
> >>
> >> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeKbSIGgSzXuTI5BA/edit
> >>
> >> Let me know if you have any questions. Appreciate your reviews!
> >>
> >> - Xi
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart <lstewart@twitter.com.invalid> wrote:
> >>
> >> > Interesting proposal. A couple quick notes while you continue to flesh
> >> > this out.
> >> >
> >> > a. just to be sure - does this eliminate the need to save seqno with
> >> > checkpoint?
> >> >
> >> > b. i.e. another way to describe this kind of improvement is "support
> >> > records (atomic writes) larger than 1MB", iiuc. the advantage being it
> >> > avoids the baggage of transactions. disadvantages include inability to
> >> > do cross stream transactions, and flexibility (interleaving, etc) (are
> >> > there others?).
> >> >
> >> > c. proxy use case is for supporting multiple writers - have you thought
> >> > about how this would work with multiple writers?
> >> >
> >> > Thanks!
> >> >
> >> >
> >> > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <sijieg@twitter.com.invalid> wrote:
> >> >
> >> > > Sounds good to me. Looking forward to the detailed proposal.
> >> > >
> >> > > (I don't mind the format if it makes things easier for you)
> >> > >
> >> > > Sijie
> >> > >
> >> > > On Friday, October 14, 2016, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >> > >
> >> > > > Thank you, Sijie
> >> > > >
> >> > > > We have had some internal discussions to sort out some details. We are
> >> > > > ready to collaborate with the community on adding transaction support
> >> > > > to DL. We'd like to share more.
> >> > > >
> >> > > > I created a proposal wiki page here -
> >> > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+DistributedLog+Transaction+Support
> >> > > >
> >> > > > (I followed the KIP format and named it DP (DistributedLog Proposal - DP
> >> > > > is also short for Dynamic Programming). I don't know if you guys like
> >> > > > this name or not. Feel free to change it :D)
> >> > > >
> >> > > > I basically put my initial email as the content there so far. Once we
> >> > > > finish our final discussion, I will update it with more details. At the
> >> > > > same time, any comments are welcome.
> >> > > >
> >> > > > - Xi
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org> wrote:
> >> > > >
> >> > > > > Xi,
> >> > > > >
> >> > > > > I just granted you the edit permission.
> >> > > > >
> >> > > > > - Sijie
> >> > > > >
> >> > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >> > > > >
> >> > > > > > I still cannot edit the wiki. Can any of the PMC members grant me the
> >> > > > > > permissions?
> >> > > > > >
> >> > > > > > - Xi
> >> > > > > >
> >> > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > Sijie,
> >> > > > > > >
> >> > > > > > > I attempted to create a wiki page under that space. I found that I am
> >> > > > > > > not authorized with edit permission.
> >> > > > > > >
> >> > > > > > > Can any of the committers grant me the wiki edit permission? My
> >> > > > > > > account is "xi.liu.ant".
> >> > > > > > >
> >> > > > > > > - Xi
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org> wrote:
> >> > > > > > >
> >> > > > > > >> This sounds interesting ... I will take a closer look and give my
> >> > > > > > >> comments later.
> >> > > > > > >>
> >> > > > > > >> At the same time, do you mind creating a wiki page to put your idea
> >> > > > > > >> there? You can add your wiki page under
> >> > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
> >> > > > > > >>
> >> > > > > > >> You might need to ask on the dev list to be granted the wiki edit
> >> > > > > > >> permissions once you have a wiki account.
> >> > > > > > >>
> >> > > > > > >> - Sijie
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >> > > > > > >>
> >> > > > > > >> > Hello,
> >> > > > > > >> >
> >> > > > > > >> > I asked about transaction support in the distributedlog user group two
> >> > > > > > >> > months ago. I want to raise this again, as we are looking at using
> >> > > > > > >> > distributedlog to build a transactional data service. It is a major
> >> > > > > > >> > feature that is missing in distributedlog. We have some ideas for adding
> >> > > > > > >> > it to distributedlog and want to know whether they make sense. If they
> >> > > > > > >> > are good, we'd like to contribute and develop with the community.
> >> > > > > > >> >
> >> > > > > > >> > Here are the thoughts:
> >> > > > > > >> >
> >> > > > > > >> > -------------------------------------------------
> >> > > > > > >> >
> >> > > > > > >> > From our understanding, DL can provide "at-least-once" delivery semantics
> >> > > > > > >> > (if not, please correct me) but not "exactly-once" delivery semantics.
> >> > > > > > >> > That means that a message can be delivered one or more times if the
> >> > > > > > >> > reader doesn't handle duplicates.
> >> > > > > > >> >
> >> > > > > > >> > The duplicates come from two places: one is at the writer side (this
> >> > > > > > >> > assumes using the write proxy, not the core library), while the other is
> >> > > > > > >> > at the reader side.
> >> > > > > > >> >
> >> > > > > > >> > - writer side: if the client attempts to write a record to the write
> >> > > > > > >> > proxies, gets a network error (e.g. a timeout) and then retries, the
> >> > > > > > >> > retry can potentially result in duplicates.
> >> > > > > > >> > - reader side: if the reader reads a message from a stream and then
> >> > > > > > >> > crashes, when it restarts it will resume from the last known position
> >> > > > > > >> > (DLSN). If the reader fails after processing a record and before
> >> > > > > > >> > recording the position, the processed record will be delivered again.
> >> > > > > > >> >
> >> > > > > > >> > The reader problem can be properly addressed by making use of the sequence
> >> > > > > > >> > numbers of records and doing proper checkpointing. For example, a
> >> > > > > > >> > database can checkpoint the indexed data together with the sequence
> >> > > > > > >> > numbers of records; Flink can checkpoint the state together with the
> >> > > > > > >> > sequence numbers.
> >> > > > > > >> >
> >> > > > > > >> > The writer problem can be addressed by implementing an idempotent writer.
> >> > > > > > >> > However, an alternative and more powerful approach is to support
> >> > > > > > >> > transactions.
> >> > > > > > >> >
> >> > > > > > >> > *What does transaction mean?*
> >> > > > > > >> >
> >> > > > > > >> > A transaction means that a collection of records can be written
> >> > > > > > >> > transactionally within a stream or across multiple streams. They will be
> >> > > > > > >> > consumed by the reader together when the transaction is committed, or
> >> > > > > > >> > will never be consumed by the reader when the transaction is aborted.
> >> > > > > > >> >
> >> > > > > > >> > The transaction will expose the following guarantees:
> >> > > > > > >> >
> >> > > > > > >> > - The reader should not be exposed to records written from uncommitted
> >> > > > > > >> > transactions (mandatory)
> >> > > > > > >> > - The reader should consume the records in the transaction commit order
> >> > > > > > >> > rather than the record written order (mandatory)
> >> > > > > > >> > - No duplicated records within a transaction (mandatory)
> >> > > > > > >> > - Allow interleaving transactional writes and non-transactional writes
> >> > > > > > >> > (optional)
> >> > > > > > >> >
> >> > > > > > >> > *Stream Transaction & Namespace Transaction*
> >> > > > > > >> >
> >> > > > > > >> > There will be two types of transaction: one is the stream-level
> >> > > > > > >> > transaction (local transaction), while the other is the namespace-level
> >> > > > > > >> > transaction (global transaction).
> >> > > > > > >> >
> >> > > > > > >> > The stream-level transaction is a transactional operation on writing
> >> > > > > > >> > records to one stream; the namespace-level transaction is a transactional
> >> > > > > > >> > operation on writing records to multiple streams.
> >> > > > > > >> >
> >> > > > > > >> > *Implementation Thoughts*
> >> > > > > > >> >
> >> > > > > > >> > - A transaction consists of a begin control record, a series of data
> >> > > > > > >> > records and a commit/abort control record.
> >> > > > > > >> > - The begin/commit/abort control records are written to a `commit` log
> >> > > > > > >> > stream, while the data records are written to normal data log streams.
> >> > > > > > >> > - The `commit` log stream will be the same log stream for a stream-level
> >> > > > > > >> > transaction, while it will be a *system* stream (or multiple system
> >> > > > > > >> > streams) for namespace-level transactions.
> >> > > > > > >> > - The transaction code looks like the following:
> >> > > > > > >> >
> >> > > > > > >> > <code>
> >> > > > > > >> >
> >> > > > > > >> > Transaction txn = client.transaction();
> >> > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> >> > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> >> > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> >> > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> >> > > > > > >> >
> >> > > > > > >> > </code>
> >> > > > > > >> >
> >> > > > > > >> > If the txn is committed, all the write futures will be satisfied with
> >> > > > > > >> > their written DLSNs. If the txn is aborted, all the write futures will
> >> > > > > > >> > fail together. There is no partial failure state.
> >> > > > > > >> >
> >> > > > > > >> > - The actual data flow will be:
> >> > > > > > >> >
> >> > > > > > >> > 1. the writer gets a transaction id from the owner of the `commit` log
> >> > > > > > >> > stream
> >> > > > > > >> > 2. it writes the begin control record (synchronously) with the
> >> > > > > > >> > transaction id
> >> > > > > > >> > 3. each write within the same txn is assigned a local sequence number
> >> > > > > > >> > starting from 0. the combination of transaction id and local sequence
> >> > > > > > >> > number will be used later on by the readers to de-duplicate records.
> >> > > > > > >> > 4. the commit/abort control record is written based on the results
> >> > > > > > >> > from step 3.
> >> > > > > > >> >
> >> > > > > > >> > - The application can supply a timeout for the transaction when calling
> >> > > > > > >> > #begin() to start a transaction. The owner of the `commit` log stream
> >> > > > > > >> > can abort transactions that are not committed or aborted within their
> >> > > > > > >> > timeout.
> >> > > > > > >> >
> >> > > > > > >> > - Failures:
> >> > > > > > >> >
> >> > > > > > >> > * all the log records can simply be retried, as they will probably be
> >> > > > > > >> > de-duplicated at the reader side.
> >> > > > > > >> >
> >> > > > > > >> > - Reader:
> >> > > > > > >> >
> >> > > > > > >> > * The reader can be configured to read uncommitted records or committed
> >> > > > > > >> > records only (by default it reads uncommitted records)
> >> > > > > > >> > * If the reader is configured to read committed records only, the read
> >> > > > > > >> > ahead cache will be changed to maintain one additional map of pending
> >> > > > > > >> > committed records. the pending committed records map is bounded and
> >> > > > > > >> > records will be dropped as read ahead moves forward.
> >> > > > > > >> > * when the reader hits a commit record, it will rewind to the begin
> >> > > > > > >> > record and start reading from there. by leveraging the read ahead cache
> >> > > > > > >> > and the pending commit records cache, this should work well for both
> >> > > > > > >> > short transactions and long transactions.
> >> > > > > > >> >
> >> > > > > > >> > - DLSN, SequenceId:
> >> > > > > > >> >
> >> > > > > > >> > * We will add a fourth field to DLSN. It is the `local sequence number`
> >> > > > > > >> > within a transaction session. So the new DLSN of a record in a
> >> > > > > > >> > transaction will be the DLSN of the commit control record plus the
> >> > > > > > >> > record's local sequence number.
> >> > > > > > >> > * The sequence id will still be the position of the commit record plus
> >> > > > > > >> > the record's local sequence number. The position will be advanced by the
> >> > > > > > >> > total number of written records when the commit control record is
> >> > > > > > >> > written.
> >> > > > > > >> >
> >> > > > > > >> > - Transaction Group & Namespace Transaction
> >> > > > > > >> >
> >> > > > > > >> > using one single log stream for namespace transactions can cause a
> >> > > > > > >> > bottleneck, since all the begin/commit/abort control records will have
> >> > > > > > >> > to go through one log stream.
> >> > > > > > >> >
> >> > > > > > >> > the idea of a 'transaction group' is to allow partitioning the writers
> >> > > > > > >> > into different transaction groups.
> >> > > > > > >> >
> >> > > > > > >> > clients can specify the `group-name` when starting the transaction. if
> >> > > > > > >> > no `group-name` is specified, the default `commit` log in the namespace
> >> > > > > > >> > will be used for creating transactions.
> >> > > > > > >> >
> >> > > > > > >> > -------------------------------------------------
> >> > > > > > >> >
> >> > > > > > >> > I'd like to collect feedback on this idea. Any comments are appreciated,
> >> > > > > > >> > and if anyone is also interested in this idea, we'd like to collaborate
> >> > > > > > >> > with the community.
> >> > > > > > >> >
> >> > > > > > >> >
> >> > > > > > >> > - Xi
> >> > > > > > >> >
> >> > > > > > >>
> >> > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
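
As a rough illustration of the committed-only reader behaviour described in the quoted proposal (buffer data records per transaction, de-duplicate by transaction id plus local sequence number, release records on commit in commit order, drop them on abort), here is a minimal sketch. The class `CommittedReadBuffer` and all of its method names are hypothetical and are not part of the DistributedLog API; payloads are plain strings for brevity, and the bounded-map and rewind behaviour mentioned in the proposal are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical reader-side buffer; an illustration only, not DistributedLog code. */
class CommittedReadBuffer {
    // txnId -> (localSeq -> payload). The TreeMap keeps records in local sequence order.
    private final Map<Long, TreeMap<Integer, String>> pending = new HashMap<>();
    // Records released to the application, in transaction commit order.
    private final List<String> delivered = new ArrayList<>();

    /** Begin control record: start tracking the transaction. */
    void onBegin(long txnId) {
        pending.putIfAbsent(txnId, new TreeMap<>());
    }

    /** Data record: buffer it; a retried write with the same (txnId, localSeq) is ignored. */
    void onData(long txnId, int localSeq, String payload) {
        TreeMap<Integer, String> records = pending.get(txnId);
        if (records != null) {
            records.putIfAbsent(localSeq, payload);
        }
    }

    /** Commit control record: release the buffered records; a repeated commit is a no-op. */
    void onCommit(long txnId) {
        TreeMap<Integer, String> records = pending.remove(txnId);
        if (records != null) {
            delivered.addAll(records.values());
        }
    }

    /** Abort control record: drop everything buffered for the transaction. */
    void onAbort(long txnId) {
        pending.remove(txnId);
    }

    List<String> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        CommittedReadBuffer buf = new CommittedReadBuffer();
        buf.onBegin(1L);
        buf.onBegin(2L);
        buf.onData(1L, 0, "a0");
        buf.onData(2L, 0, "b0");
        buf.onData(1L, 1, "a1");
        buf.onData(1L, 0, "a0"); // duplicate caused by a writer retry
        buf.onCommit(2L);        // txn 2 commits first, so its records come first
        buf.onCommit(1L);
        System.out.println(buf.delivered()); // prints [b0, a0, a1]
    }
}
```

In this sketch, records from transaction 2 are delivered before those of transaction 1 even though transaction 1 wrote first, which matches the "commit order rather than written order" guarantee listed in the proposal.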
