asterixdb-dev mailing list archives

From abdullah alamoudi <bamou...@gmail.com>
Subject Re: MultiTransactionJobletEventListenerFactory
Date Fri, 17 Nov 2017 22:53:31 GMT
Keep in mind that one option looks up the map once per job while the other looks it up once
per record.

Cheers,
Abdullah.
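Abdullah's cost note can be made concrete with a small sketch. This is hypothetical illustration code, not AsterixDB source: the class names and the counting harness are invented, and it only contrasts resolving a transaction id once per job (cache it when the runtime opens) with resolving it once per record.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the AsterixDB API): count map lookups for the
// two options under discussion - once per job vs. once per record.
public class LookupCost {
    static final Map<Integer, Long> TXN_BY_DATASET = new HashMap<>();
    static int lookups = 0;

    static long resolve(int datasetId) {
        lookups++; // count every map access
        return TXN_BY_DATASET.get(datasetId);
    }

    // Option A: resolve once when the job-scoped runtime opens,
    // then reuse the cached id for every record.
    static void perJob(int datasetId, int records) {
        long txnId = resolve(datasetId);
        for (int i = 0; i < records; i++) {
            // commit each record under the cached txnId
        }
    }

    // Option B: resolve again for every record.
    static void perRecord(int datasetId, int records) {
        for (int i = 0; i < records; i++) {
            long txnId = resolve(datasetId);
        }
    }

    public static void main(String[] args) {
        TXN_BY_DATASET.put(42, 1001L);
        perJob(42, 1000);
        int perJobLookups = lookups;
        lookups = 0;
        perRecord(42, 1000);
        System.out.println(perJobLookups + " vs " + lookups); // prints "1 vs 1000"
    }
}
```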

> On Nov 17, 2017, at 2:23 PM, Xikui Wang <xikuiw@uci.edu> wrote:
> 
> If I understand Abdullah's proposal correctly, for option 1, you can create
> a dataset id to transaction id map in the
> MultiTransactionJobletEventListener. When committing, the commit runtime
> can use the dataset-id to look up the transaction-id and commit the
> sub-transaction. Here we are making the assumption that a feed will not be
> connected to the same dataset twice in a feed job. This is a fair
> assumption in most cases.
> 
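This reading of option 1 can be sketched as follows. The names are illustrative, not the real MultiTransactionJobletEventListenerFactory API, and the map is only well-defined under the stated assumption that a feed job never connects to the same dataset twice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the listener keeps a datasetId -> txnId map, and the
// commit runtime asks for the transaction id by dataset id.
public class DatasetTxnRegistry {
    private final Map<Integer, Long> txnByDataset = new HashMap<>();

    public void register(int datasetId, long txnId) {
        Long prev = txnByDataset.putIfAbsent(datasetId, txnId);
        if (prev != null) {
            // the scheme breaks if the same dataset appears twice in one job
            throw new IllegalStateException(
                    "dataset " + datasetId + " already bound to txn " + prev);
        }
    }

    public long txnIdFor(int datasetId) {
        Long txnId = txnByDataset.get(datasetId);
        if (txnId == null) {
            throw new IllegalStateException("no transaction for dataset " + datasetId);
        }
        return txnId;
    }

    public static void main(String[] args) {
        DatasetTxnRegistry registry = new DatasetTxnRegistry();
        registry.register(1, 100L); // one connection per dataset
        registry.register(2, 101L);
        System.out.println(registry.txnIdFor(2)); // prints "101"
    }
}
```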
> But, this doesn't really solve the problem, right? IMHO, if we all agree
> that there is a one-to-one mapping from transaction to Hyracks job, we
> probably should use a single transaction id for the combined job, i.e.,
> option 2... As Murtadha suggested, we can now register multiple resources
> with the transaction context with the patch he merged yesterday (it took me
> some time to catch up on the transaction codebase so sorry for joining
> late.). I think this can offer us a nice and clean solution.
> 
> Best,
> Xikui
> 
> On Fri, Nov 17, 2017 at 11:58 AM, Steven Jacobs <sjaco002@ucr.edu> wrote:
> 
>> If that's true then that solution seems best to me, but we had discussed
>> this earlier and Xikui mentioned that it might not be true.
>> @Xikui?
>> Steven
>> 
>> On Fri, Nov 17, 2017 at 11:55 AM, abdullah alamoudi <bamousaa@gmail.com>
>> wrote:
>> 
>>> Right now, they can't, so datasetId can be safely used.
>>>> On Nov 17, 2017, at 11:51 AM, Steven Jacobs <sjaco002@ucr.edu> wrote:
>>>> 
>>>> For option 1, I think the dataset id is not a unique identifier.
>>>> Couldn't multiple transactions in one job work on the same dataset?
>>>> 
>>>> Steven
>>>> 
>>>> On Fri, Nov 17, 2017 at 11:38 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>> wrote:
>>>> 
>>>>> So, there are three options to do this:
>>>>> 1. Each of these operators works on a specific dataset. So we can pass
>>>>> the datasetId to the JobEventListenerFactory when requesting the
>>>>> transaction id.
>>>>> 2. We make one transaction work for multiple datasets by using a map
>>>>> from datasetId to primary opTracker and use it when reporting commits
>>>>> by the log flusher thread.
>>>>> 3. Prevent a job from having multiple transactions. (For the record, I
>>>>> dislike this option since the price we pay is very high, IMO.)
>>>>> 
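Option 1 above can be sketched with a toy factory. The interface and names are illustrative only (the actual JobEventListenerFactory in the codebase is not shown here): instead of exposing one job-wide transaction id, the factory hands out a transaction id keyed by datasetId, so each commit operator asks with the dataset it writes to.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of option 1: per-dataset transaction ids handed out
// by the factory, one per connection in the combined feed job.
public class PerDatasetTxnFactory {
    private final Map<Integer, Long> txnIds = new HashMap<>();
    private long nextTxnId = 1000L;

    // Called while wiring the combined feed job, once per connection.
    public synchronized long getOrCreateTxnId(int datasetId) {
        return txnIds.computeIfAbsent(datasetId, d -> nextTxnId++);
    }

    public static void main(String[] args) {
        PerDatasetTxnFactory f = new PerDatasetTxnFactory();
        long a = f.getOrCreateTxnId(7);
        long b = f.getOrCreateTxnId(8);
        long c = f.getOrCreateTxnId(7); // same dataset -> same txn id
        System.out.println((a == c) + " " + (a != b)); // prints "true true"
    }
}
```

Note that this, like the map in option 1 generally, silently assumes no dataset is fed twice in one job.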
>>>>> Cheers,
>>>>> Abdullah.
>>>>> 
>>>>>> On Nov 17, 2017, at 11:32 AM, Steven Jacobs <sjaco002@ucr.edu> wrote:
>>>>>> 
>>>>>> Well, we've solved the problem when there is only one transaction id
>>>>>> per job. The operators can fetch the transaction ids from the
>>>>>> JobEventListenerFactory (you can find this in master now). The issue
>>>>>> is, when we are trying to combine multiple job specs into one feed
>>>>>> job, the operators at runtime don't have a memory of which "job spec"
>>>>>> they originally belonged to, which could tell them which of the
>>>>>> transaction ids they should use.
>>>>>> 
>>>>>> Steven
>>>>>> 
>>>>>> On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> I think that this works, and it seems like the question is how the
>>>>>>> different operators in the job can get their transaction ids.
>>>>>>> 
>>>>>>> ~Abdullah.
>>>>>>> 
>>>>>>>> On Nov 17, 2017, at 11:21 AM, Steven Jacobs <sjaco002@ucr.edu> wrote:
>>>>>>>> 
>>>>>>>> From the conversation, it seems like nobody has the full picture to
>>>>>>>> propose the design?
>>>>>>>> For deployed jobs, the idea is to use the same job specification but
>>>>>>>> create a new Hyracks job and Asterix transaction for each execution.
>>>>>>>> 
>>>>>>>> Steven
>>>>>>>> 
>>>>>>>> On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I can e-meet anytime (moved to Sunnyvale). We can also look at a
>>>>>>>>> proposed design and see if it can work.
>>>>>>>>> Back to my question: how were you planning to change the
>>>>>>>>> transaction id if we forget about the case with multiple datasets
>>>>>>>>> (feed job)?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Nov 17, 2017, at 10:38 AM, Steven Jacobs <sjaco002@ucr.edu>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Maybe it would be good to have a meeting about this with all
>>>>>>>>>> interested parties?
>>>>>>>>>> 
>>>>>>>>>> I can be on-campus at UCI on Tuesday if that would be a good day
>>>>>>>>>> to meet.
>>>>>>>>>> 
>>>>>>>>>> Steven
>>>>>>>>>> 
>>>>>>>>>> On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Also, I was wondering how you would do the same for a single
>>>>>>>>>>> dataset (non-feed). How would you get the transaction id and
>>>>>>>>>>> change it when you re-run?
>>>>>>>>>>> 
>>>>>>>>>>> On Nov 17, 2017 7:12 AM, "Murtadha Hubail" <hubailmor@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> For atomic transactions, the change was merged yesterday. For
>>>>>>>>>>>> entity-level transactions, it should be a very small change.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Murtadha
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Nov 17, 2017, at 6:07 PM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I understand that is not the case right now, but is that what
>>>>>>>>>>>>> you're working on?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Nov 17, 2017, at 7:04 AM, Murtadha Hubail <hubailmor@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> A transaction context can register multiple primary indexes.
>>>>>>>>>>>>>> Since each entity commit log contains the dataset id, you can
>>>>>>>>>>>>>> decrement the active operations on the operation tracker
>>>>>>>>>>>>>> associated with that dataset id.
>>>>>>>>>>>>>> 
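Murtadha's point can be sketched like this. The types are invented for illustration (they are not the real log or operation-tracker classes): even with a single shared transaction id, each entity commit log record carries its dataset id, so the log flusher can decrement the active-operation count on the right tracker.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: route an entity commit log record to the operation
// tracker of the dataset id stored in the record.
public class OpTrackerDemo {
    static final Map<Integer, AtomicInteger> ACTIVE_OPS = new HashMap<>();

    static void onEntityCommitLog(int datasetIdInLog) {
        // pick the tracker by the dataset id carried in the log record
        ACTIVE_OPS.get(datasetIdInLog).decrementAndGet();
    }

    public static void main(String[] args) {
        ACTIVE_OPS.put(1, new AtomicInteger(2)); // two pending ops on dataset 1
        ACTIVE_OPS.put(2, new AtomicInteger(1)); // one pending op on dataset 2
        onEntityCommitLog(1);
        onEntityCommitLog(2);
        System.out.println(ACTIVE_OPS.get(1) + " " + ACTIVE_OPS.get(2)); // prints "1 0"
    }
}
```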
>>>>>>>>>>>>>> On 17/11/2017, 5:52 PM, "abdullah alamoudi" <bamousaa@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Can you illustrate how a deadlock can happen? I am anxious
>>>>>>>>>>>>>> to know.
>>>>>>>>>>>>>> Moreover, the reason for the multiple transaction ids in
>>>>>>>>>>>>>> feeds is not simply because we compile them differently.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> How would a commit operator know which dataset's active
>>>>>>>>>>>>>> operation counter to decrement if they shared the same id,
>>>>>>>>>>>>>> for example?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Nov 16, 2017, at 9:46 PM, Xikui Wang <xikuiw@uci.edu> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yes. That deadlock could happen. Currently, we have
>>>>>>>>>>>>>>> one-to-one mappings for the jobs and transactions, except
>>>>>>>>>>>>>>> for the feeds.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> @Abdullah, after some digging into the code, I think
>>>>>>>>>>>>>>> probably we can use a single transaction id for the job
>>>>>>>>>>>>>>> which feeds multiple datasets? See if I can convince you. :)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The reason we have multiple transaction ids in feeds is
>>>>>>>>>>>>>>> that we compile each connection job separately and combine
>>>>>>>>>>>>>>> them into a single feed job. A new transaction id is created
>>>>>>>>>>>>>>> and assigned to each connection job, thus for the combined
>>>>>>>>>>>>>>> job we have to handle the different transactions as they are
>>>>>>>>>>>>>>> embedded in the connection job specifications. But what if
>>>>>>>>>>>>>>> we create a single transaction id for the combined job? That
>>>>>>>>>>>>>>> transaction id will be embedded into each connection so they
>>>>>>>>>>>>>>> can write logs freely, but the transaction will be started
>>>>>>>>>>>>>>> and committed only once as there is only one feed job. In
>>>>>>>>>>>>>>> this way, we won't need MultiTransactionJobletEventListener,
>>>>>>>>>>>>>>> and the transaction id can be removed from the job
>>>>>>>>>>>>>>> specification easily as well (for Steven's change).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Xikui
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey <dtabass@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I worry about deadlocks. The waits-for graph may not
>>>>>>>>>>>>>>>> understand that making t1 wait will also make t2 wait,
>>>>>>>>>>>>>>>> since they may share a thread - right? Or do we have jobs
>>>>>>>>>>>>>>>> and transactions separately represented there now?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Nov 16, 2017 3:10 PM, "abdullah alamoudi" <bamousaa@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We are using multiple transactions in a single job in the
>>>>>>>>>>>>>>>>> case of feeds, and I think that this is the correct way.
>>>>>>>>>>>>>>>>> Having a single job for a feed that feeds into multiple
>>>>>>>>>>>>>>>>> datasets is a good thing since job resources/feed
>>>>>>>>>>>>>>>>> resources are consolidated.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Here are some points:
>>>>>>>>>>>>>>>>> - We can't use the same transaction id to feed multiple
>>>>>>>>>>>>>>>>> datasets. The only other option is to have multiple jobs,
>>>>>>>>>>>>>>>>> each feeding a different dataset.
>>>>>>>>>>>>>>>>> - Having multiple jobs (in addition to the extra resources
>>>>>>>>>>>>>>>>> used, memory and CPU) would then force us either to read
>>>>>>>>>>>>>>>>> data from external sources multiple times, parse records
>>>>>>>>>>>>>>>>> multiple times, etc., or to have a synchronization between
>>>>>>>>>>>>>>>>> the different jobs and the feed source within asterixdb.
>>>>>>>>>>>>>>>>> IMO, this is far more complicated than having multiple
>>>>>>>>>>>>>>>>> transactions within a single job, and the cost far
>>>>>>>>>>>>>>>>> outweighs the benefits.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> P.S.
>>>>>>>>>>>>>>>>> We are also using this for bucket connections in Couchbase
>>>>>>>>>>>>>>>>> Analytics.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Nov 16, 2017, at 2:57 PM, Till Westmann <tillw@apache.org>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> If there are a number of issues with supporting multiple
>>>>>>>>>>>>>>>>>> transaction ids and no clear benefits/use-cases, I’d vote
>>>>>>>>>>>>>>>>>> for simplification :)
>>>>>>>>>>>>>>>>>> Also, code that’s not being used has a tendency to "rot",
>>>>>>>>>>>>>>>>>> and so I think that its usefulness might be limited by
>>>>>>>>>>>>>>>>>> the time we’d find a use for this functionality.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> My 2c,
>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 16 Nov 2017, at 13:57, Xikui Wang wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I'm separating the connections into different jobs in
>>>>>>>>>>>>>>>>>>> some of my experiments... but that was intended to be
>>>>>>>>>>>>>>>>>>> used for the experimental settings (i.e., not for master
>>>>>>>>>>>>>>>>>>> now)...
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I think the interesting question here is whether we want
>>>>>>>>>>>>>>>>>>> to allow one Hyracks job to carry multiple transactions.
>>>>>>>>>>>>>>>>>>> I personally think that should be allowed, as the
>>>>>>>>>>>>>>>>>>> transaction and the job are two separate concepts, but I
>>>>>>>>>>>>>>>>>>> couldn't find such use cases other than the feeds. Does
>>>>>>>>>>>>>>>>>>> anyone have a good example of this?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Another question is, if we do allow multiple
>>>>>>>>>>>>>>>>>>> transactions in a single Hyracks job, how do we enable
>>>>>>>>>>>>>>>>>>> the commit runtime to obtain the correct TXN id without
>>>>>>>>>>>>>>>>>>> having it embedded as part of the job specification?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Xikui
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I am curious as to how feeds will work without this?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> ~Abdullah.
>>>>>>>>>>>>>>>>>>>>> On Nov 16, 2017, at 12:43 PM, Steven Jacobs <sjaco002@ucr.edu>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>> We currently have
>>>>>>>>>>>>>>>>>>>>> MultiTransactionJobletEventListenerFactory, which
>>>>>>>>>>>>>>>>>>>>> allows for one Hyracks job to run multiple Asterix
>>>>>>>>>>>>>>>>>>>>> transactions together.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> This class is only used by feeds, and feeds are in the
>>>>>>>>>>>>>>>>>>>>> process of changing to no longer need this feature. As
>>>>>>>>>>>>>>>>>>>>> part of the work in pre-deploying job specifications
>>>>>>>>>>>>>>>>>>>>> to be used by multiple Hyracks jobs, I've been working
>>>>>>>>>>>>>>>>>>>>> on removing the transaction id from the job
>>>>>>>>>>>>>>>>>>>>> specifications, as we use a new transaction for each
>>>>>>>>>>>>>>>>>>>>> invocation of a deployed job.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> There is currently no clear way to remove the
>>>>>>>>>>>>>>>>>>>>> transaction id from the job spec and keep the option
>>>>>>>>>>>>>>>>>>>>> for MultiTransactionJobletEventListenerFactory.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> The question for the group is: do we see a need to
>>>>>>>>>>>>>>>>>>>>> maintain this class, which will no longer be used by
>>>>>>>>>>>>>>>>>>>>> any current code? Or, in other words, is there a
>>>>>>>>>>>>>>>>>>>>> strong possibility that in the future we will want
>>>>>>>>>>>>>>>>>>>>> multiple transactions to share a single Hyracks job,
>>>>>>>>>>>>>>>>>>>>> meaning that it is worth figuring out how to maintain
>>>>>>>>>>>>>>>>>>>>> this class?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 

