asterixdb-dev mailing list archives

From abdullah alamoudi <bamou...@gmail.com>
Subject Re: The solution to the sporadic connection refused exceptions
Date Fri, 28 Aug 2015 06:15:36 GMT
Hi Till,
I am glad that you're interested. Let me first say that a change has been
submitted; Murtadha has reviewed it and Chris is reviewing it now. I first
implemented it completely using RMI and then re-implemented it completely
using Zookeeper.

Everything you stated is correct. The solution that was implemented only
deals with knowing when the cluster is up during the startup process. This
seemed urgent to me since I hit the problem with almost every change that I
try to verify before I push to Gerrit, and others have seen it too. Knowing
the state of the cluster (i.e. through the Managix describe command) still
relies on checking if the processes are running (someone correct me if this
is wrong).

So here is what I did:
When Managix starts the CC, it simply listens on Zookeeper until the CC
reports its state. This is currently only done during the startup process.
As Ian has said, he was/is using a polling mechanism to determine if the
server is up. I still think what we implemented is a more elegant solution
that doesn't involve polling at all.
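
To illustrate the difference between waiting for a notification and polling,
here is a minimal sketch in plain Java. It is not the submitted change: a
CountDownLatch stands in for the Zookeeper notification, and the class and
method names (StartupSignal, reportReady, awaitReady) are made up for this
example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the "CC reports its state" step; not AsterixDB code. */
final class StartupSignal {
    // Released exactly once, when the starting component declares itself ready.
    private final CountDownLatch ready = new CountDownLatch(1);

    /** Called by the component being started (the CC in this thread) once it can serve requests. */
    void reportReady() {
        ready.countDown();
    }

    /**
     * Called by the launcher (Managix in this thread). Blocks until reportReady()
     * is called or the timeout elapses; there is no polling loop.
     * Returns true if the ready signal arrived in time, false otherwise.
     */
    boolean awaitReady(long timeout, TimeUnit unit) {
        try {
            return ready.await(timeout, unit);
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and report "not ready".
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The launcher would call awaitReady with a generous timeout right after
starting the process, so a CC that never comes up still produces a failure
rather than a hang.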

Anyone is welcome to look at the change and suggest changes to it before we
merge it :-)
~Abdullah.

On Fri, Aug 28, 2015 at 8:58 AM, Till Westmann <tillw@apache.org> wrote:

> I’m not really deep into this topic, but I’d like to understand a little
> better.
>
> As I understand it, we currently have 2 ways to deploy/manage AsterixDB:
> a) using Managix and b) using YARN.
> And Managix uses Zookeeper to manage its information, but YARN doesn’t.
> Also, neither the Asterix CC nor the NC depends on the existence of
> Zookeeper.
>
> Is this correct so far?
>
> Now we are trying to find a way to ensure that an AsterixDB client can
> reliably know if the cluster is up or down.
>
> My first assumption for the properties that the solution to this problem
> would have is:
> 1) The knowledge if the cluster is up or down is available in the CC (as
> it controls the cluster).
> 2) The mechanism used to expose that information works for both ways to
> deploy/manage a cluster.
>
> A simple way to do that seems to be to send a request “waitUntilStarted”
> to the CC that returns to the client once the CC has determined that
> everything has started. The response to that request would either be “yes”
> (cluster is up), “no” (an error occurred and it won’t be up without
> intervention), or “not sure” (timeout - please ask again later). This would
> imply that the client is polling, but it wouldn’t be very busy if the
> timeout is reasonable.
>
> Now this doesn’t seem to be where the discussion is going and I’d like to
> find out where it is going and why.
>
> Could you help me?
>
> Thanks,
> Till
>
>
> > On Aug 25, 2015, at 7:23 AM, Raman Grover <ramangrover29@gmail.com> wrote:
> >
> > As I mentioned before...
> > "The information for an AsterixDB instance is "lazily" refreshed when a
> > management operation is invoked (using managix set of commands) or an
> > explicit describe command is invoked. "
> >
> > Above, the commands are the Managix set of commands (create, start,
> > describe etc.) that trigger a refresh, and so it's "lazy". Currently the
> > CC does not notify Managix. What we are discussing is an elegant way to
> > have the CC relay information to Managix.
> >
> > On Tue, Aug 25, 2015 at 4:10 AM, abdullah alamoudi <bamousaa@gmail.com>
> > wrote:
> >
> >> I don't think that is there yet but the intention is to have it at some
> >> point in the future.
> >>
> >> Cheers,
> >> Abdullah.
> >>
> >> On Tue, Aug 25, 2015 at 12:38 PM, Chris Hillery <chillery@hillery.land>
> >> wrote:
> >>
> >>> Very interesting, thank you. Can you point out a couple places in the
> >> code
> >>> where some of this logic is kept? Specifically where "CC can update
> this
> >>> information and notify Managix" sounds interesting...
> >>>
> >>> Ceej
> >>> aka Chris Hillery
> >>>
> >>> On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover <
> ramangrover29@gmail.com>
> >>> wrote:
> >>>
> >>>>> , and what code is
> >>>>> responsible for keeping it up-to-date?
> >>>>>
> >>>> Apparently, no one is :-)
> >>>>
> >>>> The information for an AsterixDB instance is "lazily" refreshed when a
> >>>> management operation is invoked (using managix set of commands) or an
> >>>> explicit describe command is invoked.
> >>>> Between the time t1 (when state of an AsterixDB instance changes, say
> >>>> due to NC failure) and t2 (when a management operation is invoked), the
> >>>> information about the AsterixDB instance inside Zookeeper remains
> >>>> stale. CC can update this information and notify Managix; this way
> >>>> Managix realizes the changed state as soon as it has occurred. This can
> >>>> be particularly useful when showing on a management console the
> >>>> up-to-date state of an instance in real time or having Managix respond
> >>>> to an event.
> >>>>
> >>>> Regards,
> >>>> Raman
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: abdullah alamoudi <bamousaa@gmail.com>
> >>>> Date: Tue, Aug 25, 2015 at 12:27 AM
> >>>> Subject: Re: The solution to the sporadic connection refused
> exceptions
> >>>> To: dev@asterixdb.incubator.apache.org
> >>>>
> >>>>
> >>>> On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery <chillery@hillery.land
> >
> >>>> wrote:
> >>>>
> >>>>> Perhaps an aside, but: exactly what is kept in Zookeeper
> >>>>
> >>>>
> >>>> A serialized instance of
> >> edu.uci.ics.asterix.event.model.AsterixInstance
> >>>>
> >>>>
> >>>>> , and what code is
> >>>>> responsible for keeping it up-to-date?
> >>>>>
> >>>> Apparently, no one is :-)
> >>>>
> >>>>
> >>>>>
> >>>>> Ceej
> >>>>>
> >>>>> On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover <
> >> ramangrover29@gmail.com
> >>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Well, the state of an instance (and metadata including
> >>>>>> configuration) is kept in a Zookeeper instance that is accessible
> >>>>>> to Managix and the CC. The CC should be able to set the state of
> >>>>>> the cluster in Zookeeper under the right znode, which can be
> >>>>>> viewed by Managix.
> >>>>>>
> >>>>>> There exists a communication channel for CC and Managix to share
> >>>>>> information on state etc. I am not sure if we need another channel
> >>>>>> such as RMI between Managix and CC.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Raman
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi <
> >>>> bamousaa@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Well, it depends on your definition of the boundaries of Managix.
> >>>>>>> What I did is that I added an RMI object in the InstallerDriver
> >>>>>>> which basically listens for state changes from the cluster
> >>>>>>> controller. This means some additional logic in the
> >>>>>>> CCApplicationEntryPoint where, after the CC is ready, it contacts
> >>>>>>> the InstallerDriver using RMI, and only at that point can the
> >>>>>>> InstallerDriver return to Managix and tell it that the startup is
> >>>>>>> complete.
> >>>>>>>
> >>>>>>> Not sure if this is the right way to do it but it definitely is
> >>>>>>> better than what we currently have.
> >>>>>>> Abdullah.
> >>>>>>>
> >>>>>>> On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery
> >>>> <chillery@hillery.land
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hopefully the solution won't involve additional important
logic
> >>>>> inside
> >>>>>>>> Managix itself?
> >>>>>>>>
> >>>>>>>> Ceej
> >>>>>>>> aka Chris Hillery
> >>>>>>>>
> >>>>>>>> On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi <
> >>>>> bamousaa@gmail.com
> >>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> That works but it doesn't feel right doing it this
way. I am
> >>>> going
> >>>>> to
> >>>>>>> fix
> >>>>>>>>> this one for good.
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Abdullah.
> >>>>>>>>>
> >>>>>>>>> On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon <imaxon@uci.edu>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The way I assured liveness for the YARN installer was to try
> >>>>>>>>>> running "for $x in dataset Metadata.Dataset return $x" via the
> >>>>>>>>>> API. I just polled for a reasonable amount of time (though
> >>>>>>>>>> honestly, thinking about it now, the correct parameter to use
> >>>>>>>>>> for the polling interval is the startup wait time in the
> >>>>>>>>>> parameters file :) ). It's not perfect, but it gives fewer
> >>>>>>>>>> false positives than just checking ps for processes that look
> >>>>>>>>>> like CCs/NCs.
> >>>>>>>>>>
> >>>>>>>>>> - Ian.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi
<
> >>>>>>> bamousaa@gmail.com
> >>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Now that I think about it, maybe we should provide multiple
> >>>>>>>>>>> ways to do this: a polling mechanism to be used at arbitrary
> >>>>>>>>>>> times and a pushing mechanism on startup.
> >>>>>>>>>>> I am going to start implementation of this and will probably
> >>>>>>>>>>> use RMI for this task both ways (CC to InstallerDriver and
> >>>>>>>>>>> InstallerDriver to CC).
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Abdullah.
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Aug 24, 2015 at 2:19 PM, abdullah
alamoudi <
> >>>>>>>> bamousaa@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> So after further investigation, it turned out our startup
> >>>>>>>>>>>> process just starts the CC and NC processes and then makes
> >>>>>>>>>>>> sure the processes are running. If the processes are found
> >>>>>>>>>>>> to be running, it returns the state of the cluster as active
> >>>>>>>>>>>> and the subsequent test commands can start immediately.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This means that the CC could've started but is not yet ready
> >>>>>>>>>>>> when we try to process the next command. To address this, we
> >>>>>>>>>>>> need a better way to tell when the startup procedure has
> >>>>>>>>>>>> completed. We can do this by pushing (the CC informs the
> >>>>>>>>>>>> installer driver when the startup is complete) or polling
> >>>>>>>>>>>> (the installer driver needs to actually query the CC for the
> >>>>>>>>>>>> state of the cluster).
> >>>>>>>>>>>>
> >>>>>>>>>>>> I can do either way so let's vote. My vote goes to the
> >>>>>>>>>>>> pushing mechanism.
> >>>>>>>>>>>> Thoughts?
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Aug 24, 2015 at 10:15 AM, abdullah
alamoudi <
> >>>>>>>>>> bamousaa@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> This solution turned out to be incorrect. Actually, the
> >>>>>>>>>>>>> test cases when I build after using the join method never
> >>>>>>>>>>>>> fail, but running an actual asterix instance never
> >>>>>>>>>>>>> succeeds, which is quite confusing.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I also think that the startup script has a major bug where
> >>>>>>>>>>>>> it might return before the startup is complete. More on
> >>>>>>>>>>>>> this later......
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Aug 24, 2015 at 7:48 AM,
abdullah alamoudi <
> >>>>>>>>>> bamousaa@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> It is highly unlikely that it is related.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Abdullah.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:45
AM, Chen Li <
> >>>> chenli@gmail.com
> >>>>>>
> >>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Abdullah: Is this issue related to
> >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1074?
> >>>>>>>>>>>>>>> Ian and I plan to look into the details on Monday.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Sun, Aug 23, 2015 at
10:08 AM, abdullah alamoudi
> >> <
> >>>>>>>>>>> bamousaa@gmail.com
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> About 3-4 days ago, I was working on the addition of the
> >>>>>>>>>>>>>>>> filesystem-based feed adapter and it didn't take any
> >>>>>>>>>>>>>>>> time to complete. However, when I wanted to build and
> >>>>>>>>>>>>>>>> make sure all tests pass, I kept getting
> >>>>>>>>>>>>>>>> ConnectionRefused errors which caused the installer
> >>>>>>>>>>>>>>>> tests to fail every now and then.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I knew the new change had nothing to do with this
> >>>>>>>>>>>>>>>> failure, yet I couldn't direct my attention away from
> >>>>>>>>>>>>>>>> this bug (it just bothered me so much and I knew it
> >>>>>>>>>>>>>>>> needed to be resolved ASAP). After wasting countless
> >>>>>>>>>>>>>>>> hours, I was finally able to figure out what was
> >>>>>>>>>>>>>>>> happening :-)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In the startup routine, we start three Jetty web servers
> >>>>>>>>>>>>>>>> (Web interface server, JSON API server, and Feed
> >>>>>>>>>>>>>>>> server). Some time ago, we used to end the startup call
> >>>>>>>>>>>>>>>> before making sure the server.isStarted() method returns
> >>>>>>>>>>>>>>>> true on all servers. At that time, I introduced the
> >>>>>>>>>>>>>>>> waitUntilServerStarts method to make sure we don't
> >>>>>>>>>>>>>>>> return before the servers are ready. It turned out that
> >>>>>>>>>>>>>>>> was an incorrect way to handle this (we can blame
> >>>>>>>>>>>>>>>> stackoverflow for this one!) and it is not enough that
> >>>>>>>>>>>>>>>> the server's isStarted() returns true. The correct way
> >>>>>>>>>>>>>>>> to do this is to call the server.join() method after the
> >>>>>>>>>>>>>>>> server.start().
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> See:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This was equally satisfying as it was frustrating and
> >>>>>>>>>>>>>>>> you are welcome for the future time I saved each of
> >>>>>>>>>>>>>>>> you :)
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> Amoudi, Abdullah.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> Amoudi, Abdullah.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Amoudi, Abdullah.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Amoudi, Abdullah.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Amoudi, Abdullah.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Amoudi, Abdullah.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Amoudi, Abdullah.
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Raman
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Amoudi, Abdullah.
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Raman
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Amoudi, Abdullah.
> >>
> >
> >
> >
> > --
> > Raman
>
>


-- 
Amoudi, Abdullah.
