asterixdb-dev mailing list archives

From Till Westmann <ti...@apache.org>
Subject Re: The solution to the sporadic connection refused exceptions
Date Sat, 29 Aug 2015 06:00:21 GMT
Continuing this discussion on https://asterix-gerrit.ics.uci.edu/#/c/365
(which gets mirrored on this list anyway).

Cheers,
Till

> On Aug 27, 2015, at 11:52 PM, Ian Maxon <imaxon@uci.edu> wrote:
> 
>> And Managix uses Zookeeper to manage its information, but YARN doesn’t.
> 
> To put some background into this, I only chose to eschew use of ZK
> because it isn't a requirement in a YARN 2.2.0 cluster, and I could do
> what I needed via HDFS and some polling on the CC. I'm not opposed to
> integrating it further though (and making the YARN client make use of
> that).
> 
> - Ian
> 
> On Thu, Aug 27, 2015 at 7:58 PM, Till Westmann <tillw@apache.org> wrote:
>> I’m not really deep into this topic, but I’d like to understand a little better.
>> 
>> As I understand it, we currently have 2 ways to deploy/manage AsterixDB: a) using
>> Managix and b) using YARN.
>> And Managix uses Zookeeper to manage its information, but YARN doesn’t.
>> Also, neither the Asterix CC nor the NC depends on the existence of Zookeeper.
>> 
>> Is this correct so far?
>> 
>> Now we are trying to find a way to ensure that an AsterixDB client can reliably know
>> if the cluster is up or down.
>> 
>> My first assumption for the properties that the solution to this problem would have is:
>> 1) The knowledge of whether the cluster is up or down is available in the CC (as it
>> controls the cluster).
>> 2) The mechanism used to expose that information works for both ways to deploy/manage
>> a cluster.
>> 
>> A simple way to do that seems to be to send a request “waitUntilStarted” to
>> the CC that returns to the client once the CC has determined that everything has started.
>> The response to that request would either be “yes” (cluster is up), “no” (an error occurred
>> and it won’t be up without intervention), or “not sure” (timeout - please ask again
>> later). This would imply that the client is polling, but it wouldn’t be very busy if the
>> timeout is reasonable.
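>>
>> For illustration only, here is a rough sketch of what such a polling client could look
>> like. The endpoint path, port, and response codes below are hypothetical, not an
>> existing AsterixDB API:
>>
>>     // Sketch: poll a hypothetical "waitUntilStarted" endpoint on the CC.
>>     import java.io.IOException;
>>     import java.net.HttpURLConnection;
>>     import java.net.URL;
>>
>>     public class ClusterStartupPoller {
>>         public static boolean waitUntilStarted(String ccHost, int port, long timeoutMillis)
>>                 throws InterruptedException {
>>             long deadline = System.currentTimeMillis() + timeoutMillis;
>>             while (System.currentTimeMillis() < deadline) {
>>                 try {
>>                     HttpURLConnection conn = (HttpURLConnection) new URL(
>>                             "http://" + ccHost + ":" + port + "/admin/waitUntilStarted").openConnection();
>>                     conn.setConnectTimeout(5000);
>>                     conn.setReadTimeout(30000); // the server may hold the request until it knows
>>                     int rc = conn.getResponseCode();
>>                     if (rc == 200) {
>>                         return true;   // "yes": the cluster is up
>>                     } else if (rc == 500) {
>>                         return false;  // "no": startup failed, needs intervention
>>                     }
>>                     // anything else is "not sure": ask again
>>                 } catch (IOException e) {
>>                     // CC not reachable yet: ask again
>>                 }
>>                 Thread.sleep(1000);
>>             }
>>             return false;
>>         }
>>     }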
>> 
>> Now this doesn’t seem to be where the discussion is going and I’d like to find
>> out where it is going and why.
>> 
>> Could you help me?
>> 
>> Thanks,
>> Till
>> 
>> 
>>> On Aug 25, 2015, at 7:23 AM, Raman Grover <ramangrover29@gmail.com> wrote:
>>> 
>>> As I mentioned before...
>>> "The information for an AsterixDB instance is "lazily" refreshed when a
>>> management operation is invoked (using managix set of commands) or an
>>> explicit describe command is invoked. "
>>> 
>>> Above, the commands are the Managix set of commands (create, start,
>>> describe etc.) that trigger a refresh, and so it's "lazy". Currently the CC does
>>> not notify Managix; what we are discussing is an elegant way to have the CC
>>> relay information to Managix.
>>> 
>>> On Tue, Aug 25, 2015 at 4:10 AM, abdullah alamoudi <bamousaa@gmail.com>
>>> wrote:
>>> 
>>>> I don't think that is there yet but the intention is to have it at some
>>>> point in the future.
>>>> 
>>>> Cheers,
>>>> Abdullah.
>>>> 
>>>> On Tue, Aug 25, 2015 at 12:38 PM, Chris Hillery <chillery@hillery.land>
>>>> wrote:
>>>> 
>>>>> Very interesting, thank you. Can you point out a couple places in the code
>>>>> where some of this logic is kept? Specifically where "CC can update this
>>>>> information and notify Managix" sounds interesting...
>>>>> 
>>>>> Ceej
>>>>> aka Chris Hillery
>>>>> 
>>>>> On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover <ramangrover29@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>>> , and what code is
>>>>>>> responsible for keeping it up-to-date?
>>>>>>> 
>>>>>> Apparently, no one is :-)
>>>>>> 
>>>>>> The information for an AsterixDB instance is "lazily" refreshed when a
>>>>>> management operation is invoked (using managix set of commands) or an
>>>>>> explicit describe command is invoked.
>>>>>> Between the time t1 (when state of an AsterixDB instance changes, say due
>>>>>> to NC failure) and t2 (when a management operation is invoked), the
>>>>>> information about the AsterixDB instance inside Zookeeper remains stale. CC
>>>>>> can update this information and notify Managix; this way Managix realizes
>>>>>> the changed state as soon as it has occurred. This can be particularly
>>>>>> useful when showing on a management console the up-to-date state of an
>>>>>> instance in real time or having Managix respond to an event.
>>>>>> 
>>>>>> Regards,
>>>>>> Raman
>>>>>> 
>>>>>> ---------- Forwarded message ----------
>>>>>> From: abdullah alamoudi <bamousaa@gmail.com>
>>>>>> Date: Tue, Aug 25, 2015 at 12:27 AM
>>>>>> Subject: Re: The solution to the sporadic connection refused exceptions
>>>>>> To: dev@asterixdb.incubator.apache.org
>>>>>> 
>>>>>> 
>>>>>> On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery <chillery@hillery.land>
>>>>>> wrote:
>>>>>> 
>>>>>>> Perhaps an aside, but: exactly what is kept in Zookeeper
>>>>>> 
>>>>>> 
>>>>>> A serialized instance of edu.uci.ics.asterix.event.model.AsterixInstance
>>>>>> 
>>>>>> 
>>>>>>> , and what code is
>>>>>>> responsible for keeping it up-to-date?
>>>>>>> 
>>>>>> Apparently, no one is :-)
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Ceej
>>>>>>> 
>>>>>>> On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover <ramangrover29@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Well, the state of an instance (and metadata including configuration) is
>>>>>>>> kept in a Zookeeper instance that is accessible to Managix and the CC. The
>>>>>>>> CC should be able to set the state of the cluster in Zookeeper under the
>>>>>>>> right znode, which can be viewed by Managix.
>>>>>>>>
>>>>>>>> There exists a communication channel for the CC and Managix to share
>>>>>>>> information on state etc. I am not sure if we need another channel such as
>>>>>>>> RMI between Managix and the CC.
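>>>>>>>>
>>>>>>>> For illustration, a minimal sketch of how the CC could publish cluster state under a
>>>>>>>> znode that Managix watches. The znode path and the state strings are made up here,
>>>>>>>> not what Managix currently stores:
>>>>>>>>
>>>>>>>>     // Sketch: CC-side publisher of cluster state into Zookeeper.
>>>>>>>>     import java.nio.charset.StandardCharsets;
>>>>>>>>     import org.apache.zookeeper.CreateMode;
>>>>>>>>     import org.apache.zookeeper.KeeperException;
>>>>>>>>     import org.apache.zookeeper.ZooDefs;
>>>>>>>>     import org.apache.zookeeper.ZooKeeper;
>>>>>>>>
>>>>>>>>     public class ClusterStatePublisher {
>>>>>>>>         private final ZooKeeper zk;
>>>>>>>>         private final String statePath; // hypothetical, e.g. /asterix/<instance>/state
>>>>>>>>
>>>>>>>>         public ClusterStatePublisher(ZooKeeper zk, String instanceName) {
>>>>>>>>             this.zk = zk;
>>>>>>>>             this.statePath = "/asterix/" + instanceName + "/state";
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         // e.g. publish("ACTIVE") once all NCs have joined, publish("UNUSABLE") on failure
>>>>>>>>         public void publish(String state) throws KeeperException, InterruptedException {
>>>>>>>>             byte[] data = state.getBytes(StandardCharsets.UTF_8);
>>>>>>>>             if (zk.exists(statePath, false) == null) {
>>>>>>>>                 zk.create(statePath, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>>>>>>>>             } else {
>>>>>>>>                 zk.setData(statePath, data, -1); // -1: ignore the znode version
>>>>>>>>             }
>>>>>>>>             // Managix (or a console) can set a watch on this znode to react immediately.
>>>>>>>>         }
>>>>>>>>     }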
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Raman
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Well, it depends on your definition of the boundaries of Managix. What I
>>>>>>>>> did is that I added an RMI object in the InstallerDriver which basically
>>>>>>>>> listens for state changes from the cluster controller. This means some
>>>>>>>>> additional logic in the CCApplicationEntryPoint where, after the CC is
>>>>>>>>> ready, it contacts the InstallerDriver using RMI, and only at that point
>>>>>>>>> can the InstallerDriver return to Managix and tell it that the startup is
>>>>>>>>> complete.
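>>>>>>>>>
>>>>>>>>> A minimal sketch of that idea (the interface and names are hypothetical, not the
>>>>>>>>> actual InstallerDriver code): the installer exports a remote object and blocks, and
>>>>>>>>> the CC calls it back once startup has finished.
>>>>>>>>>
>>>>>>>>>     import java.rmi.Remote;
>>>>>>>>>     import java.rmi.RemoteException;
>>>>>>>>>     import java.rmi.registry.LocateRegistry;
>>>>>>>>>     import java.rmi.registry.Registry;
>>>>>>>>>     import java.rmi.server.UnicastRemoteObject;
>>>>>>>>>     import java.util.concurrent.CountDownLatch;
>>>>>>>>>
>>>>>>>>>     // Remote interface the CC calls when the cluster is ready.
>>>>>>>>>     interface StartupListener extends Remote {
>>>>>>>>>         void reportStartupComplete(boolean success) throws RemoteException;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     class InstallerStartupListener implements StartupListener {
>>>>>>>>>         private final CountDownLatch latch = new CountDownLatch(1);
>>>>>>>>>         private volatile boolean success;
>>>>>>>>>
>>>>>>>>>         @Override
>>>>>>>>>         public void reportStartupComplete(boolean success) {
>>>>>>>>>             this.success = success;
>>>>>>>>>             latch.countDown();
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>         // Installer side: export the object, register it, block until the CC calls back.
>>>>>>>>>         static boolean exportAndWait(int registryPort) throws Exception {
>>>>>>>>>             InstallerStartupListener listener = new InstallerStartupListener();
>>>>>>>>>             StartupListener stub = (StartupListener) UnicastRemoteObject.exportObject(listener, 0);
>>>>>>>>>             Registry registry = LocateRegistry.createRegistry(registryPort);
>>>>>>>>>             registry.rebind("AsterixStartupListener", stub);
>>>>>>>>>             listener.latch.await(); // a timed await(...) would avoid hanging forever
>>>>>>>>>             return listener.success;
>>>>>>>>>         }
>>>>>>>>>     }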
>>>>>>>>> 
>>>>>>>>> Not sure if this is the right way to do it, but it definitely is better
>>>>>>>>> than what we currently have.
>>>>>>>>> Abdullah.
>>>>>>>>> 
>>>>>>>>> On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery <chillery@hillery.land>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hopefully the solution won't involve additional important logic inside
>>>>>>>>>> Managix itself?
>>>>>>>>>> 
>>>>>>>>>> Ceej
>>>>>>>>>> aka Chris Hillery
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> That works but it doesn't feel right doing it this way. I am going to
>>>>>>>>>>> fix this one for good.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Abdullah.
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon <imaxon@uci.edu> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> The way I assured liveness for the YARN installer was to try running
>>>>>>>>>>>> "for $x in dataset Metadata.Dataset return $x" via the API. I just
>>>>>>>>>>>> polled for a reasonable amount of time (though honestly, thinking about
>>>>>>>>>>>> it now, the correct parameter to use for the polling interval is the
>>>>>>>>>>>> startup wait time in the parameters file :) ). It's not perfect, but it
>>>>>>>>>>>> gives fewer false positives than just checking ps for processes that
>>>>>>>>>>>> look like CCs/NCs.
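>>>>>>>>>>>>
>>>>>>>>>>>> Roughly, the polling loop looks like the sketch below. The query API port and path
>>>>>>>>>>>> are assumptions here and should match the instance's configuration:
>>>>>>>>>>>>
>>>>>>>>>>>>     import java.io.IOException;
>>>>>>>>>>>>     import java.net.HttpURLConnection;
>>>>>>>>>>>>     import java.net.URL;
>>>>>>>>>>>>     import java.net.URLEncoder;
>>>>>>>>>>>>
>>>>>>>>>>>>     public class LivenessPoller {
>>>>>>>>>>>>         // Keep issuing the metadata scan until it succeeds or the startup wait time elapses.
>>>>>>>>>>>>         public static boolean pollUntilLive(String ccHost, long startupWaitMillis)
>>>>>>>>>>>>                 throws IOException, InterruptedException {
>>>>>>>>>>>>             String q = URLEncoder.encode("for $x in dataset Metadata.Dataset return $x", "UTF-8");
>>>>>>>>>>>>             URL url = new URL("http://" + ccHost + ":19002/query?query=" + q);
>>>>>>>>>>>>             long deadline = System.currentTimeMillis() + startupWaitMillis;
>>>>>>>>>>>>             while (System.currentTimeMillis() < deadline) {
>>>>>>>>>>>>                 try {
>>>>>>>>>>>>                     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>>>>>>>>>>>                     conn.setConnectTimeout(2000);
>>>>>>>>>>>>                     conn.setReadTimeout(10000);
>>>>>>>>>>>>                     if (conn.getResponseCode() == 200) {
>>>>>>>>>>>>                         return true; // the CC answered a real query, so the cluster is usable
>>>>>>>>>>>>                     }
>>>>>>>>>>>>                 } catch (IOException e) {
>>>>>>>>>>>>                     // connection refused etc.: CC/NCs not ready yet, keep polling
>>>>>>>>>>>>                 }
>>>>>>>>>>>>                 Thread.sleep(2000);
>>>>>>>>>>>>             }
>>>>>>>>>>>>             return false;
>>>>>>>>>>>>         }
>>>>>>>>>>>>     }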
>>>>>>>>>>>> 
>>>>>>>>>>>> - Ian.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Now that I think about it, maybe we should provide multiple ways to
>>>>>>>>>>>>> do this: a polling mechanism to be used at an arbitrary time and a
>>>>>>>>>>>>> pushing mechanism on startup.
>>>>>>>>>>>>> I am going to start implementation of this and will probably use RMI
>>>>>>>>>>>>> for this task both ways (CC to InstallerDriver and InstallerDriver to
>>>>>>>>>>>>> CC).
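>>>>>>>>>>>>>
>>>>>>>>>>>>> For the pushing direction, a sketch of what the CC side could do at the end of
>>>>>>>>>>>>> startup (reusing the hypothetical StartupListener interface sketched earlier in
>>>>>>>>>>>>> this thread; the registry name is also made up):
>>>>>>>>>>>>>
>>>>>>>>>>>>>     import java.rmi.registry.LocateRegistry;
>>>>>>>>>>>>>     import java.rmi.registry.Registry;
>>>>>>>>>>>>>
>>>>>>>>>>>>>     public class StartupNotifier {
>>>>>>>>>>>>>         // Called once the CC considers the cluster ready (or failed).
>>>>>>>>>>>>>         public static void notifyInstaller(String installerHost, int registryPort, boolean success) {
>>>>>>>>>>>>>             try {
>>>>>>>>>>>>>                 Registry registry = LocateRegistry.getRegistry(installerHost, registryPort);
>>>>>>>>>>>>>                 StartupListener listener = (StartupListener) registry.lookup("AsterixStartupListener");
>>>>>>>>>>>>>                 listener.reportStartupComplete(success);
>>>>>>>>>>>>>             } catch (Exception e) {
>>>>>>>>>>>>>                 // No installer listening (e.g. a YARN deployment): nothing to notify.
>>>>>>>>>>>>>             }
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>     }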
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 2:19 PM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So after further investigation, it turned out our startup process
>>>>>>>>>>>>>> just starts the CC and NC processes and then makes sure the processes
>>>>>>>>>>>>>> are running; if the processes were found to be running, it returns the
>>>>>>>>>>>>>> state of the cluster as active and the subsequent test commands can
>>>>>>>>>>>>>> start immediately.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This means that the CC could've started but is not yet ready when we
>>>>>>>>>>>>>> try to process the next command. To address this, we need a better way
>>>>>>>>>>>>>> to tell when the startup procedure has completed. We can do this by
>>>>>>>>>>>>>> pushing (the CC informs the installer driver when the startup is
>>>>>>>>>>>>>> complete) or polling (the installer driver needs to actually query the
>>>>>>>>>>>>>> CC for the state of the cluster).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I can do either way so let's vote. My vote goes to the pushing
>>>>>>>>>>>>>> mechanism. Thoughts?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 10:15 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This solution turned out to be incorrect. Actually, the test cases
>>>>>>>>>>>>>>> never fail when I build after using the join method, but running an
>>>>>>>>>>>>>>> actual Asterix instance never succeeds, which is quite confusing.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I also think that the startup script has a major bug where it might
>>>>>>>>>>>>>>> return before the startup is complete. More on this later......
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 7:48 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It is highly unlikely that it is related.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:45 AM, Chen Li <chenli@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> @Abdullah: Is this issue related to
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1074? Ian and I
>>>>>>>>>>>>>>>>> plan to look into the details on Monday.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sun, Aug 23, 2015 at 10:08 AM, abdullah alamoudi <bamousaa@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> About 3-4 days ago, I was working on the addition of the
>>>>>>>>>>>>>>>>>> filesystem based feed adapter and it didn't take any time to
>>>>>>>>>>>>>>>>>> complete. However, when I wanted to build and make sure all tests
>>>>>>>>>>>>>>>>>> pass, I kept getting ConnectionRefused errors which caused the
>>>>>>>>>>>>>>>>>> installer tests to fail every now and then.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I knew the new change had nothing to do with this failure, yet I
>>>>>>>>>>>>>>>>>> couldn't direct my attention away from this bug (it just bothered
>>>>>>>>>>>>>>>>>> me so much and I knew it needed to be resolved ASAP). After wasting
>>>>>>>>>>>>>>>>>> countless hours, I was finally able to figure out what was
>>>>>>>>>>>>>>>>>> happening :-)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> In the startup routine, we start three Jetty web servers (Web
>>>>>>>>>>>>>>>>>> interface server, JSON API server, and Feed server). Some time
>>>>>>>>>>>>>>>>>> ago, we used to end the startup call before making sure the
>>>>>>>>>>>>>>>>>> server.isStarted() method returns true on all servers. At that
>>>>>>>>>>>>>>>>>> time, I introduced the waitUntilServerStarts method to make sure
>>>>>>>>>>>>>>>>>> we don't return before the servers are ready. It turned out that
>>>>>>>>>>>>>>>>>> was an incorrect way to handle this (we can blame stackoverflow
>>>>>>>>>>>>>>>>>> for this one!) and it is not enough that the server's isStarted()
>>>>>>>>>>>>>>>>>> returns true. The correct way to do this is to call the
>>>>>>>>>>>>>>>>>> server.join() method after the server.start().
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> See:
>>>>>>>>>>>>>>>>>> http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
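>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For reference, a minimal standalone sketch of the start()/join() pattern from
>>>>>>>>>>>>>>>>>> that thread (not the actual AsterixDB startup code):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     import org.eclipse.jetty.server.Server;
>>>>>>>>>>>>>>>>>>     import org.eclipse.jetty.servlet.ServletContextHandler;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>     public class EmbeddedJettyExample {
>>>>>>>>>>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>>>>>>>>>>             Server server = new Server(8080);
>>>>>>>>>>>>>>>>>>             server.setHandler(new ServletContextHandler());
>>>>>>>>>>>>>>>>>>             server.start(); // throws if the server fails to come up
>>>>>>>>>>>>>>>>>>             // join() blocks the calling thread until the server's thread pool
>>>>>>>>>>>>>>>>>>             // exits, i.e. until the server is stopped.
>>>>>>>>>>>>>>>>>>             server.join();
>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>     }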
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This was as satisfying as it was frustrating, and you are welcome
>>>>>>>>>>>>>>>>>> for the future time I saved each of you :)
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Raman
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Amoudi, Abdullah.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Raman
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Amoudi, Abdullah.
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Raman
>> 

