asterixdb-dev mailing list archives

From Chris Hillery <chill...@hillery.land>
Subject Re: The solution to the sporadic connection refused exceptions
Date Tue, 25 Aug 2015 09:38:48 GMT
Very interesting, thank you. Can you point out a couple places in the code
where some of this logic is kept? Specifically where "CC can update this
information and notify Managix" sounds interesting...

Ceej
aka Chris Hillery

On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover <ramangrover29@gmail.com>
wrote:

> > , and what code is
> > responsible for keeping it up-to-date?
> >
> Apparently, no one is :-)
>
> The information for an AsterixDB instance is "lazily" refreshed when a
> management operation is invoked (using the managix set of commands) or an
> explicit describe command is invoked.
> Between the time t1 (when the state of an AsterixDB instance changes, say due
> to an NC failure) and t2 (when a management operation is invoked), the
> information about the AsterixDB instance inside Zookeeper remains stale. CC
> can update this information and notify Managix; this way Managix realizes
> the changed state as soon as it has occurred. This can be particularly
> useful when showing on a management console the up-to-date state of an
> instance in real time or having Managix respond to an event.
>
> Regards,
> Raman
>
> ---------- Forwarded message ----------
> From: abdullah alamoudi <bamousaa@gmail.com>
> Date: Tue, Aug 25, 2015 at 12:27 AM
> Subject: Re: The solution to the sporadic connection refused exceptions
> To: dev@asterixdb.incubator.apache.org
>
>
> On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery <chillery@hillery.land>
> wrote:
>
> > Perhaps an aside, but: exactly what is kept in Zookeeper
>
>
> A serialized instance of edu.uci.ics.asterix.event.model.AsterixInstance
>
>
> > , and what code is
> > responsible for keeping it up-to-date?
> >
> Apparently, no one is :-)
>
>
> >
> > Ceej
> >
> > On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover <ramangrover29@gmail.com>
> > wrote:
> >
> > > Well, the state of an instance (and metadata including configuration) is
> > > kept in a Zookeeper instance that is accessible to Managix and CC. CC
> > > should be able to set the state of the cluster in Zookeeper under the
> > > right znode, which can be viewed by Managix.
> > >
> > > There exists a communication channel for CC and Managix to share
> > > information on state etc. I am not sure if we need another channel such
> > > as RMI between Managix and CC.
> > >
> > > Regards,
> > > Raman
> > >
> > >
> > >
> > > On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi <bamousaa@gmail.com>
> > > wrote:
> > >
> > > > Well, it depends on your definition of the boundaries of managix. What
> > > > I did is that I added an RMI object in the InstallerDriver which
> > > > basically listens for state changes from the cluster controller. This
> > > > means some additional logic in the CCApplicationEntryPoint where, after
> > > > the CC is ready, it contacts the InstallerDriver using RMI, and only at
> > > > that point can the InstallerDriver return to managix and tell it that
> > > > the startup is complete.
> > > >
> > > > Not sure if this is the right way to do it but it definitely is better
> > > > than what we currently have.
> > > > Abdullah.
> > > >
> > > > > On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery <chillery@hillery.land>
> > > > > wrote:
> > > > >
> > > > > > Hopefully the solution won't involve additional important logic
> > > > > > inside Managix itself?
> > > > >
> > > > > Ceej
> > > > > aka Chris Hillery
> > > > >
> > > > > > On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi <bamousaa@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > That works but it doesn't feel right doing it this way. I am
> > > > > > > going to fix this one for good.
> > > > > >
> > > > > > Cheers,
> > > > > > Abdullah.
> > > > > >
> > > > > > > On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon <imaxon@uci.edu> wrote:
> > > > > > >
> > > > > > > > The way I assured liveness for the YARN installer was to try
> > > > > > > > running "for $x in dataset Metadata.Dataset return $x" via the
> > > > > > > > API. I just polled for a reasonable amount of time (though
> > > > > > > > honestly, thinking about it now, the correct parameter to use
> > > > > > > > for the polling interval is the startup wait time in the
> > > > > > > > parameters file :) ). It's not perfect, but it gives fewer
> > > > > > > > false positives than just checking ps for processes that look
> > > > > > > > like CCs/NCs.
> > > > > > >
> > > > > > > - Ian.
> > > > > > >
> > > > > > > > On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi <bamousaa@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Now that I think about it, maybe we should provide multiple
> > > > > > > > > ways to do this: a polling mechanism to be used at arbitrary
> > > > > > > > > times and a pushing mechanism on startup.
> > > > > > > > > I am going to start implementation of this and will probably
> > > > > > > > > use RMI for this task both ways (CC to InstallerDriver and
> > > > > > > > > InstallerDriver to CC).
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Abdullah.
> > > > > > > >
> > > > > > > > > On Mon, Aug 24, 2015 at 2:19 PM, abdullah alamoudi <bamousaa@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > So after further investigation, it turned out our startup
> > > > > > > > > > process just starts the CC and NC processes and then makes
> > > > > > > > > > sure the processes are running, and if the processes are
> > > > > > > > > > found to be running, it returns the state of the cluster
> > > > > > > > > > as active and the subsequent test commands can start
> > > > > > > > > > immediately.
> > > > > > > > > >
> > > > > > > > > > This means that the CC could've started but is not yet
> > > > > > > > > > ready when we try to process the next command. To address
> > > > > > > > > > this, we need a better way to tell when the startup
> > > > > > > > > > procedure has completed. We can do this by pushing (CC
> > > > > > > > > > informs installer driver when the startup is complete) or
> > > > > > > > > > polling (the installer driver needs to actually query the
> > > > > > > > > > CC for the state of the cluster).
> > > > > > > > > >
> > > > > > > > > > I can do either way so let's vote. My vote goes to the
> > > > > > > > > > pushing mechanism.
> > > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > > On Mon, Aug 24, 2015 at 10:15 AM, abdullah alamoudi <bamousaa@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> This solution turned out to be incorrect. Actually, the
> > > > > > > > > >> test cases when I build after using the join method never
> > > > > > > > > >> fail, but running an actual asterix instance never
> > > > > > > > > >> succeeds, which is quite confusing.
> > > > > > > > > >>
> > > > > > > > > >> I also think that the startup script has a major bug where
> > > > > > > > > >> it might return before the startup is complete. More on
> > > > > > > > > >> this later......
> > > > > > > > >>
> > > > > > > > > >> On Mon, Aug 24, 2015 at 7:48 AM, abdullah alamoudi <bamousaa@gmail.com>
> > > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >>> It is highly unlikely that it is related.
> > > > > > > > >>>
> > > > > > > > >>> Cheers,
> > > > > > > > >>> Abdullah.
> > > > > > > > >>>
> > > > > > > > > >>> On Mon, Aug 24, 2015 at 5:45 AM, Chen Li <chenli@gmail.com> wrote:
> > > > > > > > > >>>
> > > > > > > > > >>>> @Abdullah: Is this issue related to
> > > > > > > > > >>>> https://issues.apache.org/jira/browse/ASTERIXDB-1074?
> > > > > > > > > >>>> Ian and I plan to look into the details on Monday.
> > > > > > > > >>>>
> > > > > > > > > >>>> On Sun, Aug 23, 2015 at 10:08 AM, abdullah alamoudi <bamousaa@gmail.com>
> > > > > > > > > >>>> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> > About 3-4 days ago, I was working on the addition of the
> > > > > > > > > >>>> > filesystem based feed adapter and it didn't take any time
> > > > > > > > > >>>> > to complete. However, when I wanted to build and make sure
> > > > > > > > > >>>> > all tests pass, I kept getting ConnectionRefused errors
> > > > > > > > > >>>> > which caused the installer tests to fail every now and
> > > > > > > > > >>>> > then.
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > I knew the new change had nothing to do with this failure,
> > > > > > > > > >>>> > yet I couldn't direct my attention away from this bug (It
> > > > > > > > > >>>> > just bothered me so much and I knew it needs to be
> > > > > > > > > >>>> > resolved ASAP). After wasting countless hours, I was
> > > > > > > > > >>>> > finally able to figure out what was happening :-)
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > In the startup routine, we start three Jetty web servers
> > > > > > > > > >>>> > (Web interface server, JSON API server, and Feed server).
> > > > > > > > > >>>> > Some time ago, we used to end the startup call before
> > > > > > > > > >>>> > making sure the server.isStarted() method returns true on
> > > > > > > > > >>>> > all servers. At that time, I introduced the
> > > > > > > > > >>>> > waitUntilServerStarts method to make sure we don't return
> > > > > > > > > >>>> > before the servers are ready. Turned out, that was an
> > > > > > > > > >>>> > incorrect way to handle this (We can blame stackoverflow
> > > > > > > > > >>>> > for this one!) and it is not enough that the server
> > > > > > > > > >>>> > isStarted() returns true. The correct way to do this is to
> > > > > > > > > >>>> > call the server.join() method after the server.start().
> > > > > > > > >>>> >
> > > > > > > > > >>>> > See:
> > > > > > > > > >>>> > http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
> > > > > > > > >>>> >
> > > > > > > > > >>>> > This was equally satisfying as it was frustrating and you
> > > > > > > > > >>>> > are welcome for the future time I saved each of you :)
> > > > > > > > >>>> > --
> > > > > > > > >>>> > Amoudi, Abdullah.
> > > > > > > > >>>> >
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> --
> > > > > > > > >>> Amoudi, Abdullah.
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Amoudi, Abdullah.
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Amoudi, Abdullah.
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Amoudi, Abdullah.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Amoudi, Abdullah.
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Amoudi, Abdullah.
> > > >
> > >
> > >
> > >
> > > --
> > > Raman
> > >
> >
>
>
>
> --
> Amoudi, Abdullah.
>
>
>
> --
> Raman
>
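
The polling option raised in the thread (the installer actively checks the CC rather than trusting that a process exists) could be sketched roughly as follows. This is not AsterixDB or Managix code: the class name ReadinessPoller, the method waitForPort and its parameters are made up for illustration. It also substitutes a plain TCP connect for the Metadata query Ian describes, so it only shows that something is listening on the port, which is a weaker check than running a real query.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: polls a TCP port until it accepts connections. */
public class ReadinessPoller {

    /**
     * Returns true as soon as host:port accepts a TCP connection, or false
     * if timeoutMillis elapses first. intervalMillis is the pause between
     * attempts and is also used as the per-attempt connect timeout.
     */
    public static boolean waitForPort(String host, int port,
                                      long timeoutMillis, long intervalMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        // A connect timeout of 0 would mean "wait forever", so keep it >= 1 ms.
        int connectTimeout = (int) Math.max(1L, intervalMillis);
        while (true) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), connectTimeout);
                return true; // something is listening and accepted the connection
            } catch (IOException e) {
                // Connection refused or timed out: the server is not ready yet.
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            Thread.sleep(intervalMillis);
        }
    }
}
```

An installer could call waitForPort with the host and API port of the instance it just started, using the configured startup wait time as the timeout, and report the cluster as active only if the call returns true.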

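
The pushing option from the thread (the CC informs the installer driver when startup is complete) boils down to a one-shot readiness signal. The sketch below shows only that idea inside a single JVM, using a CountDownLatch as a stand-in for the RMI call mentioned in the thread. StartupSignal, markReady and awaitReady are hypothetical names, not part of AsterixDB.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical one-shot signal: the started component pushes "ready" to a waiter. */
public class StartupSignal {

    private final CountDownLatch ready = new CountDownLatch(1);

    /** Called by the component being started, only once it can serve requests. */
    public void markReady() {
        ready.countDown();
    }

    /**
     * Called by the installer side. Blocks until markReady() has been called
     * or the timeout elapses; returns true only in the first case.
     */
    public boolean awaitReady(long timeout, TimeUnit unit) throws InterruptedException {
        return ready.await(timeout, unit);
    }
}
```

The point of the design is that the waiter never infers readiness from the existence of a process; it returns only after the started component has said it is ready, or gives up after a bounded wait.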