asterixdb-dev mailing list archives

From Raman Grover <ramangrove...@gmail.com>
Subject Re: [jira] [Updated] (ASTERIXDB-1085) Sporadic failures in Feed related tests
Date Thu, 01 Oct 2015 08:17:27 GMT
Hi

The sequence of steps in constructing the flow of records along a feed, as
interpreted by Abdullah, is exactly right.
To answer Abdullah's question first: there are good reasons for not having a
single job but instead separate intake and collect jobs. These are described
next.

a) *Isolation of Failure*:
A single job is vulnerable to a failure at any of its operators.
The goal is to build a cascade network of feeds in which
multiple feeds receive records from a shared channel. The impact of an
operator failure (e.g., the hosting machine dies) needs to be restricted to
disrupting the flow along the ingestion pipeline that involved that operator;
it must not affect in any way the flow of records along the other pipelines
joined in the common cascade network.
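To make the isolation concrete, here is a minimal sketch (the class and
method names are mine for illustration, not AsterixDB's actual code) of an
intake that fans records out to independent per-pipeline queues, so that
removing one failed pipeline leaves the others untouched:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch: a shared intake delivers each record to every
// subscribed pipeline's own queue. When one pipeline's operator dies, only
// its queue is unregistered; the other pipelines keep receiving records.
class SharedIntake {
    private final Map<String, BlockingQueue<String>> subscribers = new ConcurrentHashMap<>();

    void subscribe(String pipelineId) {
        subscribers.put(pipelineId, new LinkedBlockingQueue<>());
    }

    // called when a pipeline fails; others are unaffected
    void unsubscribe(String pipelineId) {
        subscribers.remove(pipelineId);
    }

    void deliver(String record) {
        for (BlockingQueue<String> q : subscribers.values()) {
            q.offer(record);
        }
    }

    BlockingQueue<String> queueOf(String pipelineId) {
        return subscribers.get(pipelineId);
    }
}
```

The point of the per-subscriber queue is that no two pipelines share fate:
a dead consumer never blocks or terminates the intake or its siblings.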

b) *Fault-tolerant retrieval from an external source:* Let us assume there
exists a single pipeline and not a cascade network, so reason (a) no longer
applies. Even then, a downstream failure would cause the lone
Hyracks job to be terminated, abruptly shutting down the flow of
records into AsterixDB. Such a disruption may be acceptable for pull-based
ingestion, where on resurrection the adaptor instance(s) may reconnect and
travel back in time to retrieve the missed records. However, it is common to
have push-based sources (e.g., Twitter) that push data independently
of any failures occurring in the recipient system. For these, any disruption
in receiving records means loss of data, as one may never be able to catch up
given the sheer rate of arrival. A two-part ingestion pipeline
provides increased robustness to failures in the tail (the compute and store
stages): the intake stage continues to receive records and keeps them around
until the tail is reconstructed. The records accumulated in the failure
window (2-4 seconds) are then sent downstream.
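A toy sketch of the intake-side buffering described above (illustrative
names only, not the real intake operator): records that arrive while the
tail is down are held in a failure-window buffer and replayed once the tail
is rebuilt, so nothing pushed by the source is lost:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch: while the tail (compute/store stages) is down, the
// intake spills arriving records into a buffer; on reconnection the buffered
// failure-window records are flushed downstream before new arrivals.
class BufferingIntake {
    private final Queue<String> failureWindow = new ArrayDeque<>();
    private final List<String> downstream = new ArrayList<>(); // stands in for the collect job
    private boolean tailAlive = true;

    void onTailFailure() { tailAlive = false; }

    void onTailReconnected() {
        tailAlive = true;
        // replay the failure window in arrival order
        while (!failureWindow.isEmpty()) {
            downstream.add(failureWindow.poll());
        }
    }

    // called for every record pushed by the external source
    void receive(String record) {
        if (tailAlive) {
            downstream.add(record);
        } else {
            failureWindow.add(record); // no data loss during the outage
        }
    }

    List<String> delivered() { return downstream; }
}
```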

*Synchronization between the intake job and the collect job*
This part is absolutely critical. Synchronization works as follows.
After the feed intake job is launched, the
FeedLifecycleListener (which implements the Hyracks job event listener
interface) is notified when the job is scheduled and started. The locations
of the involved operators are also tracked.

The launch of the collect job is delayed until this notification and the
associated information have been received. It is critical to know where the
intake operators have been scheduled, as the tail must know where the head
is so that it can be attached. This synchronization is described in more
detail in my thesis. I suspect there is a bug in this synchronization
that is causing the sporadic errors; it needs to be revisited.
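The hand-off can be sketched with a simple latch (all names here are
invented for illustration; the real code path goes through the
FeedLifecycleListener): the collect job's launch blocks until the listener
has published the intake operators' locations:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the intake/collect synchronization: the lifecycle
// listener latches the intake operators' locations when the intake job
// starts; launching the collect job waits on that latch, so the "tail"
// always knows where the "head" was placed before it tries to attach.
class IntakeStartBarrier {
    private final CountDownLatch started = new CountDownLatch(1);
    private volatile List<String> intakeLocations;

    // invoked by the (hypothetical) job event listener on intake-job start
    void notifyIntakeStarted(List<String> locations) {
        intakeLocations = locations;
        started.countDown();
    }

    // called before scheduling the collect job
    List<String> awaitIntakeLocations(long timeoutMs) throws InterruptedException {
        if (!started.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("intake job did not start in time");
        }
        return intakeLocations;
    }
}
```

If the collect job were launched without waiting on such a barrier, it could
race ahead of the intake job and fail to find the intake runtime, which is
exactly the sporadic failure mode discussed in this thread.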

Regards,
Raman




On Thu, Oct 1, 2015 at 4:02 AM, Mike Carey <dtabass@gmail.com> wrote:

> Approach number two seems right, ie, synchronize the steps so that the
> input is ready first....!
> On Sep 30, 2015 2:39 PM, "Steven Jacobs" <sjaco002@ucr.edu> wrote:
>
> > I think the problem with doing a single job (as mentioned) is that the
> > intake job will exist for many connection jobs, meaning that there is a
> > single intake job per feed, and a connection job for each connection to a
> > dataset.
> > Steven
> >
> > On Wed, Sep 30, 2015 at 12:05 PM, abdullah alamoudi <bamousaa@gmail.com>
> > wrote:
> >
> > > So I might have an idea about what could cause this.
> > > Following is some information about how feeds work. Please correct me if
> > > I am wrong, as I am just starting to dive deep into this.
> > >
> > > -- Creating and dropping feeds are just metadata operations.
> > > -- When you connect a primary feed to a dataset, this is what happens:
> > > 1. A feed event subscriber is created for the feed and registered with
> > > the feed lifecycle listener (a singleton running on the master).
> > > 2. A feed intake job is constructed that consists of just the feed
> > > intake operator and a sink operator. When this job starts, it sits in
> > > memory doing nothing because it has no subscribers yet.
> > > 3. Once the job [2] is submitted, the listener in [1] gets notified and
> > > constructs an ADM command that creates a Hyracks job with a feed
> > > collect operator that gets records from the running intake job [2] and
> > > feeds them into the dataset.
> > > 4. There is no synchronization between [2] and [3], so there is a
> > > chance that [3] starts before [2] is ready, doesn't find the intake
> > > runtime, and throws an exception. I know the chance is slim, but it is
> > > there (it has happened to me).
> > > 5. At that point, the intake job will never return since it is just
> > > sitting in memory.
> > >
> > > I am not sure about this, but I am guessing that the larger the
> > > cluster, the higher the chance that one runs into this.
> > >
> > > The question I have is: since at the connect statement we already know
> > > everything about the dataset that will be fed by the feed, why don't
> > > we construct a single job that has two roots (the sink and the commit)?
> > > Another option would be to make sure that the intake is ready on all
> > > nodes before the subscription is submitted.
> > >
> > > Does any of this make sense?
> > >
> > >
> > > Amoudi, Abdullah.
> > >
> > > On Mon, Sep 14, 2015 at 8:23 PM, Till Westmann (JIRA) <jira@apache.org>
> > > wrote:
> > >
> > > >      [
> > > > https://issues.apache.org/jira/browse/ASTERIXDB-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> > > > ]
> > > >
> > > > Till Westmann updated ASTERIXDB-1085:
> > > > -------------------------------------
> > > >     Assignee: Abdullah Alamoudi
> > > >
> > > > > Sporadic failures in Feed related tests
> > > > > ---------------------------------------
> > > > >
> > > > >                 Key: ASTERIXDB-1085
> > > > >                 URL:
> > > > https://issues.apache.org/jira/browse/ASTERIXDB-1085
> > > > >             Project: Apache AsterixDB
> > > > >          Issue Type: Bug
> > > > >          Components: AsterixDB, Feeds
> > > > >            Reporter: Abdullah Alamoudi
> > > > >            Assignee: Abdullah Alamoudi
> > > > >
> > > > > Sporadically, test cases which use feeds (not necessarily in the
> > > > > feed test group) fail. There is no exception thrown, but records
> > > > > which are supposed to be in the dataset are not, and subsequent
> > > > > queries return empty results.
> > > >
> > > >
> > > >
> > > > --
> > > > This message was sent by Atlassian JIRA
> > > > (v6.3.4#6332)
> > > >
> > >
> >
>



-- 
Raman
