asterixdb-dev mailing list archives

From abdullah alamoudi <bamou...@gmail.com>
Subject Re: MultiTransactionJobletEventListenerFactory
Date Fri, 17 Nov 2017 19:38:30 GMT
So, there are three options to do this:
1. Each of these operators works on a specific dataset, so we can pass the datasetId to the
JobEventListenerFactory when requesting the transaction id.
2. We make one transaction work for multiple datasets by using a map from datasetId to the
primary opTracker and use it when the log flusher thread reports commits.
3. Prevent a job from having multiple transactions. (For the record, I dislike this option,
since the price we pay is very high, IMO.)
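Roughly, option 2 could be sketched like this (a minimal illustration only; the class and method names below are hypothetical stand-ins, not the actual AsterixDB classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of option 2: one transaction spanning multiple datasets, with
// commit reporting keyed by dataset id. All names here are illustrative.
class MultiDatasetTransactionContext {

    // Hypothetical stand-in for a dataset's primary operation tracker.
    interface OperationTracker {
        void decrementActiveOperations();
    }

    // datasetId -> primary opTracker for that dataset
    private final Map<Integer, OperationTracker> trackers = new ConcurrentHashMap<>();

    void register(int datasetId, OperationTracker tracker) {
        trackers.put(datasetId, tracker);
    }

    // Called by the log flusher thread once an entity commit log record is
    // flushed; the record carries its dataset id, so the commit report can
    // be routed to the right tracker.
    void notifyEntityCommitted(int datasetId) {
        OperationTracker tracker = trackers.get(datasetId);
        if (tracker != null) {
            tracker.decrementActiveOperations();
        }
    }
}
```

The point is just that a single transaction id can serve multiple datasets as long as commit reporting is keyed by the dataset id carried in each log record.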

Cheers,
Abdullah.

> On Nov 17, 2017, at 11:32 AM, Steven Jacobs <sjaco002@ucr.edu> wrote:
> 
> Well, we've solved the problem when there is only one transaction id per
> job. The operators can fetch the transaction ids from the
> JobEventListenerFactory (you can find this in master now). The issue is,
> when we are trying to combine multiple job specs into one feed job, the
> operators at runtime have no memory of which "job spec" they
> originally belonged to, which would tell them which of the transaction
> ids they should use.
> 
> Steven
> 
> On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi <bamousaa@gmail.com>
> wrote:
> 
>> 
>> I think that this works, and it seems the question is how the different
>> operators in the job can get their transaction ids.
>> 
>> ~Abdullah.
>> 
>>> On Nov 17, 2017, at 11:21 AM, Steven Jacobs <sjaco002@ucr.edu> wrote:
>>> 
>>> From the conversation, it seems like nobody has the full picture to propose
>>> the design?
>>> For deployed jobs, the idea is to use the same job specification but create
>>> a new Hyracks job and Asterix transaction for each execution.
>>> 
>>> Steven
>>> 
>>> On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi <bamousaa@gmail.com>
>>> wrote:
>>> 
>>>> I can e-meet anytime (I moved to Sunnyvale). We can also look at a proposed
>>>> design and see if it can work.
>>>> Back to my question: how were you planning to change the transaction id if
>>>> we forget about the case with multiple datasets (feed job)?
>>>> 
>>>> 
>>>>> On Nov 17, 2017, at 10:38 AM, Steven Jacobs <sjaco002@ucr.edu> wrote:
>>>>> 
>>>>> Maybe it would be good to have a meeting about this with all interested
>>>>> parties?
>>>>> 
>>>>> I can be on campus at UCI on Tuesday if that would be a good day to meet.
>>>>> 
>>>>> Steven
>>>>> 
>>>>> On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi <bamousaa@gmail.com> wrote:
>>>>> 
>>>>>> Also, I was wondering how you would do the same for a single dataset
>>>>>> (non-feed). How would you get the transaction id and change it when you
>>>>>> re-run?
>>>>>> 
>>>>>> On Nov 17, 2017 7:12 AM, "Murtadha Hubail" <hubailmor@gmail.com> wrote:
>>>>>> 
>>>>>>> For atomic transactions, the change was merged yesterday. For entity
>>>>>>> level transactions, it should be a very small change.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Murtadha
>>>>>>> 
>>>>>>>> On Nov 17, 2017, at 6:07 PM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> I understand that is not the case right now, but is that what you're
>>>>>>>> working on?
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Abdullah.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Nov 17, 2017, at 7:04 AM, Murtadha Hubail <hubailmor@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> A transaction context can register multiple primary indexes.
>>>>>>>>> Since each entity commit log contains the dataset id, you can decrement
>>>>>>>>> the active operations on the operation tracker associated with that
>>>>>>>>> dataset id.
>>>>>>>>> 
>>>>>>>>> On 17/11/2017, 5:52 PM, "abdullah alamoudi" <bamousaa@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Can you illustrate how a deadlock can happen? I am anxious to know.
>>>>>>>>> Moreover, the reason for the multiple transaction ids in feeds is not
>>>>>>>>> simply because we compile them differently.
>>>>>>>>> 
>>>>>>>>> How would a commit operator know which dataset's active operation
>>>>>>>>> counter to decrement if they shared the same id, for example?
>>>>>>>>> 
>>>>>>>>>> On Nov 16, 2017, at 9:46 PM, Xikui Wang <xikuiw@uci.edu> wrote:
>>>>>>>>>> 
>>>>>>>>>> Yes, that deadlock could happen. Currently, we have one-to-one
>>>>>>>>>> mappings between jobs and transactions, except for the feeds.
>>>>>>>>>> 
>>>>>>>>>> @Abdullah, after some digging into the code, I think we can probably
>>>>>>>>>> use a single transaction id for the job that feeds multiple datasets.
>>>>>>>>>> See if I can convince you. :)
>>>>>>>>>> 
>>>>>>>>>> The reason we have multiple transaction ids in feeds is that we compile
>>>>>>>>>> each connection job separately and combine them into a single feed job.
>>>>>>>>>> A new transaction id is created and assigned to each connection job, so
>>>>>>>>>> for the combined job we have to handle the different transactions as
>>>>>>>>>> they are embedded in the connection job specifications. But what if we
>>>>>>>>>> created a single transaction id for the combined job? That transaction
>>>>>>>>>> id would be embedded into each connection so they could write logs
>>>>>>>>>> freely, but the transaction would be started and committed only once, as
>>>>>>>>>> there is only one feed job. In this way, we won't need
>>>>>>>>>> MultiTransactionJobletEventListener, and the transaction id can be
>>>>>>>>>> removed from the job specification easily as well (for Steven's change).
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Xikui
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey <dtabass@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I worry about deadlocks. The waits-for graph may not understand that
>>>>>>>>>>> making t1 wait will also make t2 wait, since they may share a thread,
>>>>>>>>>>> right? Or do we have jobs and transactions separately represented
>>>>>>>>>>> there now?
>>>>>>>>>>> 
>>>>>>>>>>>> On Nov 16, 2017 3:10 PM, "abdullah alamoudi" <bamousaa@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> We are using multiple transactions in a single job in the case of
>>>>>>>>>>>> feeds, and I think that this is the correct way.
>>>>>>>>>>>> Having a single job for a feed that feeds into multiple datasets is a
>>>>>>>>>>>> good thing, since job resources/feed resources are consolidated.
>>>>>>>>>>>> 
>>>>>>>>>>>> Here are some points:
>>>>>>>>>>>> - We can't use the same transaction id to feed multiple datasets. The
>>>>>>>>>>>> only other option is to have multiple jobs, each feeding a different
>>>>>>>>>>>> dataset.
>>>>>>>>>>>> - Having multiple jobs (in addition to the extra resources used,
>>>>>>>>>>>> memory and CPU) would then force us either to read data from external
>>>>>>>>>>>> sources multiple times, parse records multiple times, etc., or to have
>>>>>>>>>>>> synchronization between the different jobs and the feed source within
>>>>>>>>>>>> AsterixDB. IMO, this is far more complicated than having multiple
>>>>>>>>>>>> transactions within a single job, and the cost far outweighs the
>>>>>>>>>>>> benefits.
>>>>>>>>>>>> 
>>>>>>>>>>>> P.S. We are also using this for bucket connections in Couchbase
>>>>>>>>>>>> Analytics.
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Nov 16, 2017, at 2:57 PM, Till Westmann <tillw@apache.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If there are a number of issues with supporting multiple transaction
>>>>>>>>>>>>> ids and no clear benefits/use-cases, I’d vote for simplification :)
>>>>>>>>>>>>> Also, code that’s not being used has a tendency to "rot", so I think
>>>>>>>>>>>>> that its usefulness might be limited by the time we’d find a use for
>>>>>>>>>>>>> this functionality.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My 2c,
>>>>>>>>>>>>> Till
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 16 Nov 2017, at 13:57, Xikui Wang wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm separating the connections into different jobs in some of my
>>>>>>>>>>>>>> experiments... but that was intended to be used for the experimental
>>>>>>>>>>>>>> settings (i.e., not for master now)...
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think the interesting question here is whether we want to allow
>>>>>>>>>>>>>> one Hyracks job to carry multiple transactions. I personally think
>>>>>>>>>>>>>> that should be allowed, as the transaction and the job are two
>>>>>>>>>>>>>> separate concepts, but I couldn't find such use cases other than the
>>>>>>>>>>>>>> feeds. Does anyone have a good example of this?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Another question is, if we do allow multiple transactions in a
>>>>>>>>>>>>>> single Hyracks job, how do we enable the commit runtime to obtain
>>>>>>>>>>>>>> the correct TXN id without having it embedded as part of the job
>>>>>>>>>>>>>> specification?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Xikui
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi <bamousaa@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I am curious as to how feeds will work without this.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ~Abdullah.
>>>>>>>>>>>>>>>> On Nov 16, 2017, at 12:43 PM, Steven Jacobs <sjaco002@ucr.edu> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>> We currently have MultiTransactionJobletEventListenerFactory,
>>>>>>>>>>>>>>>> which allows one Hyracks job to run multiple Asterix transactions
>>>>>>>>>>>>>>>> together.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This class is only used by feeds, and feeds are in the process of
>>>>>>>>>>>>>>>> changing to no longer need this feature. As part of the work on
>>>>>>>>>>>>>>>> pre-deploying job specifications to be used by multiple Hyracks
>>>>>>>>>>>>>>>> jobs, I've been working on removing the transaction id from the
>>>>>>>>>>>>>>>> job specifications, as we use a new transaction for each
>>>>>>>>>>>>>>>> invocation of a deployed job.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> There is currently no clear way to remove the transaction id from
>>>>>>>>>>>>>>>> the job spec and keep the option for
>>>>>>>>>>>>>>>> MultiTransactionJobletEventListenerFactory.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The question for the group is: do we see a need to maintain this
>>>>>>>>>>>>>>>> class, which will no longer be used by any current code? Or, in
>>>>>>>>>>>>>>>> other words, is there a strong possibility that in the future we
>>>>>>>>>>>>>>>> will want multiple transactions to share a single Hyracks job,
>>>>>>>>>>>>>>>> meaning that it is worth figuring out how to maintain this class?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 

