distributedlog-dev mailing list archives

From Sijie Guo <si...@apache.org>
Subject Re: [Discuss] Transaction Support
Date Wed, 04 Jan 2017 07:18:19 GMT
Sorry for the late response. I think Leigh and you have already had some very
valuable discussions in the doc. I will try to add some of my questions to
the discussion.

Besides that, I had a discussion with Leigh today about this. First of all,
I think it is very good to add transaction support to distributedlog. It is
one of the primitives that would help in building distributed services. But we
are concerned about making the system more complicated and introducing
operational overhead when it runs at large scale in production.
There are two major suggestions that I have for this feature -

Build the 'minimum' logic in core - I think the minimum logic that needs to
be added to the core is the special control records (begin, commit and
abort), plus making the reader able to detect those special control records,
know what they mean and know how to handle them. Since they are
special control records, there is no overhead for other readers that
don't require this feature.
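
To make the 'minimum' core logic concrete, here is a rough sketch of a reader
that understands begin/commit/abort control records. It is only an
illustration under assumptions: `TxnAwareReader`, `LogEntry`, `Kind` and
`NO_TXN` are hypothetical names, not existing distributedlog classes, and
payloads are plain strings to keep it short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in distributedlog today.
class TxnAwareReader {
    enum Kind { BEGIN, DATA, COMMIT, ABORT }

    // Assumed marker for a data record that is not part of any transaction.
    static final long NO_TXN = -1L;

    // Assumed record shape: control records carry only a transaction id.
    static final class LogEntry {
        final Kind kind;
        final long txnId;
        final String payload;

        LogEntry(Kind kind, long txnId, String payload) {
            this.kind = kind;
            this.txnId = txnId;
            this.payload = payload;
        }
    }

    // Records of open transactions, held back until commit or abort.
    private final Map<Long, List<String>> pending = new HashMap<>();

    // Feeds one record to the reader and returns the payloads that become
    // visible to the application as a result.
    List<String> onRecord(LogEntry e) {
        switch (e.kind) {
            case BEGIN:
                pending.putIfAbsent(e.txnId, new ArrayList<>());
                return Collections.emptyList();
            case DATA:
                if (e.txnId == NO_TXN) {
                    // Non-transactional writes are delivered immediately.
                    return Collections.singletonList(e.payload);
                }
                pending.computeIfAbsent(e.txnId, k -> new ArrayList<>())
                       .add(e.payload);
                return Collections.emptyList();
            case COMMIT:
                // Release the whole transaction at its commit point.
                List<String> done = pending.remove(e.txnId);
                return done == null ? Collections.<String>emptyList() : done;
            case ABORT:
                // Aborted records are never delivered.
                pending.remove(e.txnId);
                return Collections.emptyList();
            default:
                throw new IllegalStateException("unknown kind: " + e.kind);
        }
    }
}
```

A reader that does not need the feature would never allocate the pending map
and could simply skip the three control record kinds, which is why I expect
no extra cost for it.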

Build the transaction coordinator as a separate proxy service - I think
the major concern that we have is putting more complexity into the 'write
proxy' service. We architected distributedlog in a microservice-like
way - we have the core as the stream store, and the proxies for serving write
and read traffic. It would be good if the transaction feature could be done in
a similar way. So the architecture would look like this -

[ write service ] [ read service ] [ transaction coordinator ]
[                        stream store                        ]

If people don't need the transaction feature, they can turn it off
completely without any operational overhead.

Besides that, I have one general question - what is the major goal for this
feature? Are you targeting building a general XA transaction coordinator,
or just supporting things like a `copy-modify-write` style workflow?


Thanks,
Sijie





On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi.liu.ant@gmail.com> wrote:

> Ping?
>
> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi.liu.ant@gmail.com> wrote:
>
> > Sijie,
> >
> > No. I thought it might be easier for people to comment on a google doc to
> > gather the initial feedback. I will put the content back on the wiki page
> > once the comments are addressed. Does that sound good to you?
> >
> > And thank you in advance.
> >
> > - Xi
> >
> >
> >
> > On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <sijie@apache.org> wrote:
> >
> >> Hi Xi,
> >>
> >> Sorry for the late response. I will review it soon.
> >>
> >> Regarding this, a separate question: are we going to use google docs
> >> instead of email threads for any discussion? I am a bit worried that the
> >> discussion will become lost after moving to google docs. I have no idea
> >> how other apache projects are doing this.
> >>
> >> - Sijie
> >>
> >> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I finalized the first version of the design. This time I used a google
> >> > doc so that it is easier to comment on, and added a link to the wiki
> >> > page. I will update the wiki page once we come to the finalized design.
> >> >
> >> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeKbSIGgSzXuTI5BA/edit
> >> >
> >> > Let me know if you have any questions. Appreciate your reviews!
> >> >
> >> > - Xi
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart <lstewart@twitter.com.invalid> wrote:
> >> >
> >> > > Interesting proposal. A couple of quick notes while you continue to
> >> > > flesh this out.
> >> > >
> >> > > a. Just to be sure - does this eliminate the need to save the seqno
> >> > > with the checkpoint?
> >> > >
> >> > > b. I.e. another way to describe this kind of improvement is "support
> >> > > records (atomic writes) larger than 1MB", iiuc. The advantage is that
> >> > > it avoids the baggage of transactions. The disadvantages include the
> >> > > inability to do cross-stream transactions, and reduced flexibility
> >> > > (interleaving, etc.). Are there others?
> >> > >
> >> > > c. The proxy use case is for supporting multiple writers - have you
> >> > > thought about how this would work with multiple writers?
> >> > >
> >> > > Thanks!
> >> > >
> >> > >
> >> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <sijieg@twitter.com.invalid> wrote:
> >> > >
> >> > > > Sounds good to me. Looking forward to the detailed proposal.
> >> > > >
> >> > > > (I don't mind the format if it makes things easier for you)
> >> > > >
> >> > > > Sijie
> >> > > >
> >> > > > On Friday, October 14, 2016, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >> > > >
> >> > > > > Thank you, Sijie
> >> > > > >
> >> > > > > We have had some internal discussions to sort out some details. We
> >> > > > > are ready to collaborate with the community on adding transaction
> >> > > > > support to DL. We'd like to share more.
> >> > > > >
> >> > > > > I created a proposal wiki here -
> >> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+DistributedLog+Transaction+Support
> >> > > > >
> >> > > > > (I followed the KIP format and named it DP (DistributedLog Proposal -
> >> > > > > DP is also short for Dynamic Programming). I don't know if you guys
> >> > > > > like this name or not. Feel free to change it :D)
> >> > > > >
> >> > > > > So far I have basically put my initial email there as the content.
> >> > > > > Once we finish our final discussion, I will update it with more
> >> > > > > details. In the meantime, any comments are welcome.
> >> > > > >
> >> > > > > - Xi
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org> wrote:
> >> > > > >
> >> > > > > > Xi,
> >> > > > > >
> >> > > > > > I just granted you the edit permission.
> >> > > > > >
> >> > > > > > - Sijie
> >> > > > > >
> >> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > I still cannot edit the wiki. Can any of the pmc members grant
> >> > > > > > > me the permissions?
> >> > > > > > >
> >> > > > > > > - Xi
> >> > > > > > >
> >> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Sijie,
> >> > > > > > > >
> >> > > > > > > > I attempted to create a wiki page under that space. I found that
> >> > > > > > > > I am not authorized with edit permission.
> >> > > > > > > >
> >> > > > > > > > Can any of the committers grant me the wiki edit permission? My
> >> > > > > > > > account is "xi.liu.ant".
> >> > > > > > > >
> >> > > > > > > > - Xi
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org> wrote:
> >> > > > > > > >
> >> > > > > > > >> This sounds interesting ... I will take a closer look and give
> >> > > > > > > >> my comments later.
> >> > > > > > > >>
> >> > > > > > > >> At the same time, do you mind creating a wiki page to put your
> >> > > > > > > >> idea there? You can add your wiki page under
> >> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
> >> > > > > > > >>
> >> > > > > > > >> You might need to ask on the dev list to be granted the wiki edit
> >> > > > > > > >> permissions once you have a wiki account.
> >> > > > > > > >>
> >> > > > > > > >> - Sijie
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu.ant@gmail.com> wrote:
> >> > > > > > > >>
> >> > > > > > > >> > Hello,
> >> > > > > > > >> >
> >> > > > > > > >> > I asked about transaction support in the distributedlog user group
> >> > > > > > > >> > two months ago. I want to raise this again, as we are looking at
> >> > > > > > > >> > using distributedlog for building a transactional data service. It
> >> > > > > > > >> > is a major feature that is missing in distributedlog. We have some
> >> > > > > > > >> > ideas for adding this to distributedlog and want to know if they
> >> > > > > > > >> > make sense or not. If they are good, we'd like to contribute and
> >> > > > > > > >> > develop with the community.
> >> > > > > > > >> > Here are the thoughts:
> >> > > > > > > >> >
> >> > > > > > > >> > -------------------------------------------------
> >> > > > > > > >> >
> >> > > > > > > >> > From our understanding, DL can provide "at-least-once" delivery
> >> > > > > > > >> > semantics (if not, please correct me) but not "exactly-once"
> >> > > > > > > >> > delivery semantics. That means that a message can be delivered one
> >> > > > > > > >> > or more times if the reader doesn't handle duplicates.
> >> > > > > > > >> >
> >> > > > > > > >> > The duplicates come from two places: one is at the writer side
> >> > > > > > > >> > (this assumes using the write proxy, not the core library), while
> >> > > > > > > >> > the other one is at the reader side.
> >> > > > > > > >> >
> >> > > > > > > >> > - writer side: if the client attempts to write a record to the
> >> > > > > > > >> > write proxies, gets a network error (e.g. a timeout) and then
> >> > > > > > > >> > retries, the retry will potentially result in duplicates.
> >> > > > > > > >> > - reader side: if the reader reads a message from a stream and then
> >> > > > > > > >> > crashes, when the reader restarts it will restart from the last
> >> > > > > > > >> > known position (DLSN). If the reader fails after processing a
> >> > > > > > > >> > record and before recording the position, the processed record
> >> > > > > > > >> > will be delivered again.
> >> > > > > > > >> >
> >> > > > > > > >> > The reader problem can be properly addressed by making use of the
> >> > > > > > > >> > sequence numbers of records and doing proper checkpointing. For
> >> > > > > > > >> > example, a database can checkpoint the indexed data with the
> >> > > > > > > >> > sequence numbers of records; flink can checkpoint the state with
> >> > > > > > > >> > the sequence numbers.
> >> > > > > > > >> >
> >> > > > > > > >> > The writer problem can be addressed by implementing an idempotent
> >> > > > > > > >> > writer. However, an alternative and more powerful approach is to
> >> > > > > > > >> > support transactions.
> >> > > > > > > >> >
> >> > > > > > > >> > *What does transaction mean?*
> >> > > > > > > >> >
> >> > > > > > > >> > A transaction means a collection of records can be written
> >> > > > > > > >> > transactionally within a stream or across multiple streams. They
> >> > > > > > > >> > will be consumed by the reader together when the transaction is
> >> > > > > > > >> > committed, or will never be consumed by the reader when the
> >> > > > > > > >> > transaction is aborted.
> >> > > > > > > >> >
> >> > > > > > > >> > Transactions will provide the following guarantees:
> >> > > > > > > >> >
> >> > > > > > > >> > - The reader should not be exposed to records written from
> >> > > > > > > >> > uncommitted transactions (mandatory)
> >> > > > > > > >> > - The reader should consume the records in transaction commit order
> >> > > > > > > >> > rather than record write order (mandatory)
> >> > > > > > > >> > - No duplicated records within a transaction (mandatory)
> >> > > > > > > >> > - Allow interleaving transactional writes and non-transactional
> >> > > > > > > >> > writes (optional)
> >> > > > > > > >> >
> >> > > > > > > >> > *Stream Transaction & Namespace Transaction*
> >> > > > > > > >> >
> >> > > > > > > >> > There will be two types of transaction: one is the stream-level
> >> > > > > > > >> > transaction (local transaction), while the other is the
> >> > > > > > > >> > namespace-level transaction (global transaction).
> >> > > > > > > >> >
> >> > > > > > > >> > A stream-level transaction is a transactional operation that writes
> >> > > > > > > >> > records to one stream; a namespace-level transaction is a
> >> > > > > > > >> > transactional operation that writes records to multiple streams.
> >> > > > > > > >> >
> >> > > > > > > >> > *Implementation Thoughts*
> >> > > > > > > >> >
> >> > > > > > > >> > - A transaction consists of a begin control record, a series of
> >> > > > > > > >> > data records, and a commit/abort control record.
> >> > > > > > > >> > - The begin/commit/abort control records are written to a `commit`
> >> > > > > > > >> > log stream, while the data records are written to normal data log
> >> > > > > > > >> > streams.
> >> > > > > > > >> > - The `commit` log stream will be the same log stream (as the data
> >> > > > > > > >> > stream) for stream-level transactions, while it will be a *system*
> >> > > > > > > >> > stream (or multiple system streams) for namespace-level
> >> > > > > > > >> > transactions.
> >> > > > > > > >> > - The transaction code looks like the following:
> >> > > > > > > >> >
> >> > > > > > > >> > <code>
> >> > > > > > > >> >
> >> > > > > > > >> > Transaction txn = client.transaction();
> >> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> >> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> >> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> >> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> >> > > > > > > >> >
> >> > > > > > > >> > </code>
> >> > > > > > > >> >
> >> > > > > > > >> > If the txn is committed, all the write futures will be satisfied
> >> > > > > > > >> > with their written DLSNs. If the txn is aborted, all the write
> >> > > > > > > >> > futures will fail together. There is no partial failure state.
> >> > > > > > > >> >
> >> > > > > > > >> > - The actual data flow will be:
> >> > > > > > > >> >
> >> > > > > > > >> > 1. the writer gets a transaction id from the owner of the `commit`
> >> > > > > > > >> > log stream
> >> > > > > > > >> > 2. the writer writes the begin control record (synchronously) with
> >> > > > > > > >> > the transaction id
> >> > > > > > > >> > 3. each write within the same txn is assigned a local sequence
> >> > > > > > > >> > number starting from 0. The combination of transaction id and local
> >> > > > > > > >> > sequence number will be used later on by the readers to
> >> > > > > > > >> > de-duplicate records.
> >> > > > > > > >> > 4. the commit/abort control record is written based on the results
> >> > > > > > > >> > from step 3.
> >> > > > > > > >> >
> >> > > > > > > >> > - The application can supply a timeout for the transaction when
> >> > > > > > > >> > calling #begin() on a transaction. The owner of the `commit` log
> >> > > > > > > >> > stream can abort transactions that are not committed/aborted
> >> > > > > > > >> > within their timeout.
> >> > > > > > > >> >
> >> > > > > > > >> > - Failures:
> >> > > > > > > >> >
> >> > > > > > > >> > * all the log records can simply be retried, as they will be
> >> > > > > > > >> > de-duplicated, probably at the reader side.
> >> > > > > > > >> >
> >> > > > > > > >> > - Reader:
> >> > > > > > > >> >
> >> > > > > > > >> > * The reader can be configured to read uncommitted records or
> >> > > > > > > >> > committed records only (by default it reads uncommitted records)
> >> > > > > > > >> > * If the reader is configured to read committed records only, the
> >> > > > > > > >> > read ahead cache will be changed to maintain one additional map of
> >> > > > > > > >> > pending committed records. The pending committed records map is
> >> > > > > > > >> > bounded, and records will be dropped as read ahead moves forward.
> >> > > > > > > >> > * When the reader hits a commit record, it will rewind to the begin
> >> > > > > > > >> > record and start reading from there. By leveraging the read ahead
> >> > > > > > > >> > cache and the pending committed records cache, this should work
> >> > > > > > > >> > well for both short transactions and long transactions.
> >> > > > > > > >> >
> >> > > > > > > >> > - DLSN, SequenceId:
> >> > > > > > > >> >
> >> > > > > > > >> > * We will add a fourth field to DLSN: the `local sequence number`
> >> > > > > > > >> > within a transaction session. So the new DLSN of a record in a
> >> > > > > > > >> > transaction will be the DLSN of the commit control record plus the
> >> > > > > > > >> > record's local sequence number.
> >> > > > > > > >> > * The sequence id will still be the position of the commit record
> >> > > > > > > >> > plus the record's local sequence number. On writing the commit
> >> > > > > > > >> > control record, the position will be advanced by the total number
> >> > > > > > > >> > of written records.
> >> > > > > > > >> >
> >> > > > > > > >> > - Transaction Group & Namespace Transaction
> >> > > > > > > >> >
> >> > > > > > > >> > Using one single log stream for namespace transactions can cause a
> >> > > > > > > >> > bottleneck, since all the begin/commit/abort control records will
> >> > > > > > > >> > have to go through one log stream.
> >> > > > > > > >> >
> >> > > > > > > >> > The idea of a 'transaction group' is to allow partitioning the
> >> > > > > > > >> > writers into different transaction groups.
> >> > > > > > > >> >
> >> > > > > > > >> > Clients can specify the `group-name` when starting a transaction.
> >> > > > > > > >> > If no `group-name` is specified, the transaction will use the
> >> > > > > > > >> > default `commit` log in the namespace.
> >> > > > > > > >> >
> >> > > > > > > >> > -------------------------------------------------
> >> > > > > > > >> >
> >> > > > > > > >> > I'd like to collect feedback on this idea. I appreciate any
> >> > > > > > > >> > comments, and if anyone is also interested in this idea, we'd like
> >> > > > > > > >> > to collaborate with the community.
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > - Xi
> >> > > > > > > >> >
> >> > > > > > > >>
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
