distributedlog-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sijie Guo <si...@apache.org>
Subject Re: [Discuss] Transaction Support
Date Sun, 18 Dec 2016 16:48:42 GMT
Hi Xi,

sorry for late response. I will review it soon.

regarding this, a separate question "are we going to use google doc instead
of email thread for any discussion"? I am a bit worried that the discussion
will become lost after moving to google doc. No idea on how other apache
projects are doing.

- Sijie

On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu.ant@gmail.com> wrote:

> Hi all,
>
> I finalized the first version of the design. This time I used a google doc
> so that it is easier for commenting and add a link the wiki page. I will
> update this to the wiki page once we come to the finalized design.
>
> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> bSIGgSzXuTI5BA/edit
>
> Let me know if you have any questions. Appreciate your reviews!
>
> - Xi
>
>
>
>
>
> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> <lstewart@twitter.com.invalid
> > wrote:
>
> > Interesting proposal. A couple quick notes while you continue to flesh
> this
> > out.
> >
> > a. just to be sure - does this eliminate the need to save seqno with
> > checkpoint?
> >
> > b. i.e. another way to describe this kind of improvement is "support
> > records (atomic writes) larger than 1MB", iiuc. the advantage being it
> > avoids the baggage of transactions. disadvantages include inability to do
> > cross stream transactions, and flexibility (interleaving, etc) (are there
> > others?).
> >
> > c. proxy use case is for supporting multiple writers - have you thought
> > about how this would work with multiple writers?
> >
> > Thanks!
> >
> >
> > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <sijieg@twitter.com.invalid>
> > wrote:
> >
> > > Sound good to me. look forward to the detailed proposal.
> > >
> > > (I don't mind the format if it makes things easier to you)
> > >
> > > Sijie
> > >
> > > On Friday, October 14, 2016, Xi Liu <xi.liu.ant@gmail.com> wrote:
> > >
> > > > Thank you, Sijie
> > > >
> > > > We have some internal discussions to sort out some details. We are
> > ready
> > > to
> > > > collaborate with the community for adding the transaction support in
> > DL.
> > > > We'd like to share more.
> > > >
> > > > I created a proposal wiki here -
> > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > > > DistributedLog+Transaction+Support
> > > >
> > > > (I followed KIP format and named it as DP (DistributedLog Proposal -
> DP
> > > is
> > > > also short for Dynamic Programming). I don't know if you guys like
> this
> > > > name or not. Feel free to change it :D)
> > > >
> > > > I basically put my initial email as the content there so far. Once we
> > > > finished our final discussion, I will update with more details. At
> the
> > > same
> > > > time, any comments are welcome.
> > > >
> > > > - Xi
> > > >
> > > >
> > > >
> > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> > > <javascript:;>>
> > > > wrote:
> > > >
> > > > > Xi,
> > > > >
> > > > > I just granted you the edit permission.
> > > > >
> > > > > - Sijie
> > > > >
> > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > > > <javascript:;>> wrote:
> > > > >
> > > > > > I still can not edit the wiki. Can any of the pmc members grant
> me
> > > the
> > > > > > permissions?
> > > > > >
> > > > > > - Xi
> > > > > >
> > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com
> > > > <javascript:;>> wrote:
> > > > > >
> > > > > > > Sijie,
> > > > > > >
> > > > > > > I attempted to create a wiki page under that space. I found
> that
> > I
> > > am
> > > > > not
> > > > > > > authorized with edit permission.
> > > > > > >
> > > > > > > Can any of the committers grant me the wiki edit permission?
My
> > > > account
> > > > > > is
> > > > > > > "xi.liu.ant".
> > > > > > >
> > > > > > > - Xi
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org
> > > > <javascript:;>> wrote:
> > > > > > >
> > > > > > >> This sounds interesting ... I will take a closer look
and give
> > my
> > > > > > comments
> > > > > > >> later.
> > > > > > >>
> > > > > > >> At the same time, do you mind creating a wiki page
to put your
> > > idea
> > > > > > there?
> > > > > > >> You can add your wiki page under
> > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
> > Proposals
> > > > > > >>
> > > > > > >> You might need to ask in the dev list to grant the
wiki edit
> > > > > permissions
> > > > > > >> to
> > > > > > >> you once you have a wiki account.
> > > > > > >>
> > > > > > >> - Sijie
> > > > > > >>
> > > > > > >>
> > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu.ant@gmail.com
> > > > <javascript:;>> wrote:
> > > > > > >>
> > > > > > >> > Hello,
> > > > > > >> >
> > > > > > >> > I asked the transaction support in distributedlog
user group
> > two
> > > > > > months
> > > > > > >> > ago. I want to raise this up again, as we are
looking for
> > using
> > > > > > >> > distributedlog for building a transactional data
service. It
> > is
> > > a
> > > > > > major
> > > > > > >> > feature that is missing in distributedlog. We
have some
> ideas
> > to
> > > > add
> > > > > > >> this
> > > > > > >> > to distributedlog and want to know if they make
sense or
> not.
> > If
> > > > > they
> > > > > > >> are
> > > > > > >> > good, we'd like to contribute and develop with
the
> community.
> > > > > > >> >
> > > > > > >> > Here are the thoughts:
> > > > > > >> >
> > > > > > >> > -------------------------------------------------
> > > > > > >> >
> > > > > > >> > From our understanding, DL can provide "at-least-once"
> > delivery
> > > > > > semantic
> > > > > > >> > (if not, please correct me) but not "exactly-once"
delivery
> > > > > semantic.
> > > > > > >> That
> > > > > > >> > means that a message can be delivered one or more
times if
> the
> > > > > reader
> > > > > > >> > doesn't handle duplicates.
> > > > > > >> >
> > > > > > >> > The duplicates come from two places, one is at
writer side
> > (this
> > > > > > assumes
> > > > > > >> > using write proxy not the core library), while
the other one
> > is
> > > at
> > > > > > >> reader
> > > > > > >> > side.
> > > > > > >> >
> > > > > > >> > - writer side: if the client attempts to write
a record to
> the
> > > > write
> > > > > > >> > proxies and gets a network error (e.g timeouts)
then
> retries,
> > > the
> > > > > > >> retrying
> > > > > > >> > will potentially result in duplicates.
> > > > > > >> > - reader side:if the reader reads a message from
a stream
> and
> > > then
> > > > > > >> crashes,
> > > > > > >> > when the reader restarts it would restart from
last known
> > > position
> > > > > > >> (DLSN).
> > > > > > >> > If the reader fails after processing a record
and before
> > > recording
> > > > > the
> > > > > > >> > position, the processed record will be delivered
again.
> > > > > > >> >
> > > > > > >> > The reader problem can be properly addressed by
making use
> of
> > > the
> > > > > > >> sequence
> > > > > > >> > numbers of records and doing proper checkpointing.
For
> > example,
> > > in
> > > > > > >> > database, it can checkpoint the indexed data with
the
> sequence
> > > > > number
> > > > > > of
> > > > > > >> > records; in flink, it can checkpoint the state
with the
> > sequence
> > > > > > >> numbers.
> > > > > > >> >
> > > > > > >> > The writer problem can be addressed by implementing
an
> > > idempotent
> > > > > > >> writer.
> > > > > > >> > However, an alternative and more powerful approach
is to
> > support
> > > > > > >> > transactions.
> > > > > > >> >
> > > > > > >> > *What does transaction mean?*
> > > > > > >> >
> > > > > > >> > A transaction means a collection of records can
be written
> > > > > > >> transactionally
> > > > > > >> > within a stream or across multiple streams. They
will be
> > > consumed
> > > > by
> > > > > > the
> > > > > > >> > reader together when a transaction is committed,
or will
> never
> > > be
> > > > > > >> consumed
> > > > > > >> > by the reader when the transaction is aborted.
> > > > > > >> >
> > > > > > >> > The transaction will expose following guarantees:
> > > > > > >> >
> > > > > > >> > - The reader should not be exposed to records
written from
> > > > > uncommitted
> > > > > > >> > transactions (mandatory)
> > > > > > >> > - The reader should consume the records in the
transaction
> > > commit
> > > > > > order
> > > > > > >> > rather than the record written order (mandatory)
> > > > > > >> > - No duplicated records within a transaction (mandatory)
> > > > > > >> > - Allow interleaving transactional writes and
> > non-transactional
> > > > > writes
> > > > > > >> > (optional)
> > > > > > >> >
> > > > > > >> > *Stream Transaction & Namespace Transaction*
> > > > > > >> >
> > > > > > >> > There will be two types of transaction, one is
Stream level
> > > > > > transaction
> > > > > > >> > (local transaction), while the other one is Namespace
level
> > > > > > transaction
> > > > > > >> > (global transaction).
> > > > > > >> >
> > > > > > >> > The stream level transaction is a transactional
operation on
> > > > writing
> > > > > > >> > records to one stream; the namespace level transaction
is a
> > > > > > >> transactional
> > > > > > >> > operation on writing records to multiple streams.
> > > > > > >> >
> > > > > > >> > *Implementation Thoughts*
> > > > > > >> >
> > > > > > >> > - A transaction is consist of begin control record,
a series
> > of
> > > > data
> > > > > > >> > records and commit/abort control record.
> > > > > > >> > - The begin/commit/abort control record is written
to a
> > `commit`
> > > > log
> > > > > > >> > stream, while the data records will be written
to normal
> data
> > > log
> > > > > > >> streams.
> > > > > > >> > - The `commit` log stream will be the same log
stream for
> > > > > stream-level
> > > > > > >> > transaction,  while it will be a *system* stream
(or
> multiple
> > > > system
> > > > > > >> > streams) for namespace-level transactions.
> > > > > > >> > - The transaction code looks like as below:
> > > > > > >> >
> > > > > > >> > <code>
> > > > > > >> >
> > > > > > >> > Transaction txn = client.transaction();
> > > > > > >> > Future<DLSN> result1 = txn.write(stream-0,
record);
> > > > > > >> > Future<DLSN> result2 = txn.write(stream-1,
record);
> > > > > > >> > Future<DLSN> result3 = txn.write(stream-2,
record);
> > > > > > >> > Future<Pair<DLSN, DLSN>> result =
txn.commit();
> > > > > > >> >
> > > > > > >> > </code>
> > > > > > >> >
> > > > > > >> > if the txn is committed, all the write futures
will be
> > satisfied
> > > > > with
> > > > > > >> their
> > > > > > >> > written DLSNs. if the txn is aborted, all the
write futures
> > will
> > > > be
> > > > > > >> failed
> > > > > > >> > together. there is no partial failure state.
> > > > > > >> >
> > > > > > >> > - The actually data flow will be:
> > > > > > >> >
> > > > > > >> > 1. writer get a transaction id from the owner
of the
> `commit'
> > > log
> > > > > > stream
> > > > > > >> > 1. write the begin control record (synchronously)
with the
> > > > > transaction
> > > > > > >> id
> > > > > > >> > 2. for each write within the same txn, it will
be assigned a
> > > local
> > > > > > >> sequence
> > > > > > >> > number starting from 0. the combination of transaction
id
> and
> > > > local
> > > > > > >> > sequence number will be used later on by the readers
to
> > > > de-duplicate
> > > > > > >> > records.
> > > > > > >> > 3. the commit/abort control record will be written
based on
> > the
> > > > > > results
> > > > > > >> > from 2.
> > > > > > >> >
> > > > > > >> > - Application can supply a timeout for the transaction
when
> > > > > #begin() a
> > > > > > >> > transaction. The owner of the `commit` log stream
can abort
> > > > > > transactions
> > > > > > >> > that never be committed/aborted within their timeout.
> > > > > > >> >
> > > > > > >> > - Failures:
> > > > > > >> >
> > > > > > >> > * all the log records can be simply retried as
they will be
> > > > > > >> de-duplicated
> > > > > > >> > probably at the reader side.
> > > > > > >> >
> > > > > > >> > - Reader:
> > > > > > >> >
> > > > > > >> > * Reader can be configured to read uncommitted
records or
> > > > committed
> > > > > > >> records
> > > > > > >> > only (by default read uncommitted records)
> > > > > > >> > * If reader is configured to read committed records
only,
> the
> > > read
> > > > > > ahead
> > > > > > >> > cache will be changed to maintain one additional
pending
> > > committed
> > > > > > >> records.
> > > > > > >> > the pending committed records map is bounded and
records
> will
> > be
> > > > > > dropped
> > > > > > >> > when read ahead is moving.
> > > > > > >> > * when the reader hits a commit record, it will
rewind to
> the
> > > > begin
> > > > > > >> record
> > > > > > >> > and start reading from there. leveraging the proper
read
> ahead
> > > > cache
> > > > > > and
> > > > > > >> > pending commit records cache, it would be good
for both
> short
> > > > > > >> transactions
> > > > > > >> > and long transactions.
> > > > > > >> >
> > > > > > >> > - DLSN, SequenceId:
> > > > > > >> >
> > > > > > >> > * We will add a fourth field to DLSN. It is `local
sequence
> > > > number`
> > > > > > >> within
> > > > > > >> > a transaction session. So the new DLSN of records
in a
> > > transaction
> > > > > > will
> > > > > > >> be
> > > > > > >> > the DLSN of commit control record plus its local
sequence
> > > number.
> > > > > > >> > * The sequence id will be still the position of
the commit
> > > record
> > > > > plus
> > > > > > >> its
> > > > > > >> > local sequence number. The position will be advanced
with
> > total
> > > > > number
> > > > > > >> of
> > > > > > >> > written records on writing the commit control
record.
> > > > > > >> >
> > > > > > >> > - Transaction Group & Namespace Transaction
> > > > > > >> >
> > > > > > >> > using one single log stream for namespace transaction
can
> > cause
> > > > the
> > > > > > >> > bottleneck problem since all the begin/commit/end
control
> > > records
> > > > > will
> > > > > > >> have
> > > > > > >> > to go through one log stream.
> > > > > > >> >
> > > > > > >> > the idea of 'transaction group' is to allow partitioning
the
> > > > writers
> > > > > > >> into
> > > > > > >> > different transaction groups.
> > > > > > >> >
> > > > > > >> > clients can specify the `group-name` when starting
the
> > > > transaction.
> > > > > if
> > > > > > >> > there is no `group-name` specified, it will use
the default
> > > > `commit`
> > > > > > >> log in
> > > > > > >> > the namespace for creating transactions.
> > > > > > >> >
> > > > > > >> > -------------------------------------------------
> > > > > > >> >
> > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate
any
> > > > comments
> > > > > > and
> > > > > > >> if
> > > > > > >> > anyone is also interested in this idea, we'd like
to
> > collaborate
> > > > > with
> > > > > > >> the
> > > > > > >> > community.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > - Xi
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message