Date: Tue, 25 Aug 2015 14:10:08 +0300
Subject: Re: The solution to the sporadic connection refused exceptions
From: abdullah alamoudi
To: dev@asterixdb.incubator.apache.org

I don't think that is there yet but the intention is to have it at some
point in the future.

Cheers,
Abdullah.

On Tue, Aug 25, 2015 at 12:38 PM, Chris Hillery wrote:

> Very interesting, thank you. Can you point out a couple places in the
> code where some of this logic is kept? Specifically where "CC can update
> this information and notify Managix" sounds interesting...
>
> Ceej
> aka Chris Hillery
>
> On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover wrote:
>
> > > , and what code is
> > > responsible for keeping it up-to-date?
> >
> > Apparently, no one is :-)
> >
> > The information for an AsterixDB instance is "lazily" refreshed when a
> > management operation is invoked (using managix set of commands) or an
> > explicit describe command is invoked.
> > Between the time t1 (when state of an AsterixDB instance changes, say
> > due to NC failure) and t2 (when a management operation is invoked), the
> > information about the AsterixDB instance inside Zookeeper remains stale.
> > CC can update this information and notify Managix; this way Managix
> > realizes the changed state as soon as it has occurred. This can be
> > particularly useful when showing on a management console the up-to-date
> > state of an instance in real time or having Managix respond to an event.
> >
> > Regards,
> > Raman
> >
> > ---------- Forwarded message ----------
> > From: abdullah alamoudi
> > Date: Tue, Aug 25, 2015 at 12:27 AM
> > Subject: Re: The solution to the sporadic connection refused exceptions
> > To: dev@asterixdb.incubator.apache.org
> >
> > On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery wrote:
> >
> > > Perhaps an aside, but: exactly what is kept in Zookeeper
> >
> > A serialized instance of edu.uci.ics.asterix.event.model.AsterixInstance
> >
> > > , and what code is
> > > responsible for keeping it up-to-date?
> >
> > Apparently, no one is :-)
> >
> > > Ceej
> > >
> > > On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover wrote:
> > >
> > > > Well, the state of an instance (and metadata including
> > > > configuration) is kept in a Zookeeper instance that is accessible
> > > > to Managix and CC. CC should be able to set the state of the
> > > > cluster in Zookeeper under the right znode which can be viewed by
> > > > Managix.
> > > >
> > > > There exists a communication channel for CC and Managix to share
> > > > information on state etc. I am not sure if we need another channel
> > > > such as RMI between Managix and CC.
> > > >
> > > > Regards,
> > > > Raman
> > > >
> > > > On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi <bamousaa@gmail.com> wrote:
> > > >
> > > > > Well, it depends on your definition of the boundaries of managix.
> > > > > What I did is that I added an RMI object in the InstallerDriver
> > > > > which basically listens for state changes from the cluster
> > > > > controller. This means some additional logic in the
> > > > > CCApplicationEntryPoint where after the CC is ready, it contacts
> > > > > the InstallerDriver using RMI and at that point only, the
> > > > > InstallerDriver can return to managix and tell it that the
> > > > > startup is complete.
> > > > >
> > > > > Not sure if this is the right way to do it but it definitely is
> > > > > better than what we currently have.
> > > > > Abdullah.
> > > > >
> > > > > On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery wrote:
> > > > >
> > > > > > Hopefully the solution won't involve additional important logic
> > > > > > inside Managix itself?
> > > > > >
> > > > > > Ceej
> > > > > > aka Chris Hillery
> > > > > >
> > > > > > On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi <bamousaa@gmail.com> wrote:
> > > > > >
> > > > > > > That works but it doesn't feel right doing it this way. I am
> > > > > > > going to fix this one for good.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Abdullah.
> > > > > > >
> > > > > > > On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon wrote:
> > > > > > >
> > > > > > > > The way I assured liveness for the YARN installer was to try
> > > > > > > > running "for $x in dataset Metadata.Dataset return $x" via
> > > > > > > > the API.
> > > > > > > > I just polled for a reasonable amount of time (though
> > > > > > > > honestly, thinking about it now, the correct parameter to
> > > > > > > > use for the polling interval is the startup wait time in the
> > > > > > > > parameters file :) ). It's not perfect, but it gives fewer
> > > > > > > > false positives than just checking ps for processes that
> > > > > > > > look like CCs/NCs.
> > > > > > > >
> > > > > > > > - Ian.
> > > > > > > >
> > > > > > > > On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi <bamousaa@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Now that I think about it. Maybe we should provide
> > > > > > > > > multiple ways to do this. A polling mechanism to be used
> > > > > > > > > for arbitrary time and a pushing mechanism on startup.
> > > > > > > > > I am going to start implementation of this and will
> > > > > > > > > probably use RMI for this task both ways (CC to
> > > > > > > > > InstallerDriver and InstallerDriver to CC).
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Abdullah.
> > > > > > > > >
> > > > > > > > > On Mon, Aug 24, 2015 at 2:19 PM, abdullah alamoudi <bamousaa@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > So after further investigation, turned out our startup
> > > > > > > > > > process just starts the CC and NC processes and then
> > > > > > > > > > makes sure the processes are running and if the
> > > > > > > > > > processes were found to be running, it returns the state
> > > > > > > > > > of the cluster to be active and the subsequent test
> > > > > > > > > > commands can start immediately.
> > > > > > > > > >
> > > > > > > > > > This means that the CC could've started but is not yet
> > > > > > > > > > ready when we try to process the next command. To
> > > > > > > > > > address this, we need a better way to tell when the
> > > > > > > > > > startup procedure has completed. We can do this by
> > > > > > > > > > pushing (CC informs installer driver when the startup is
> > > > > > > > > > complete) or polling (The installer driver needs to
> > > > > > > > > > actually query the CC for the state of the cluster).
> > > > > > > > > >
> > > > > > > > > > I can do either way so let's vote. My vote goes to the
> > > > > > > > > > pushing mechanism.
> > > > > > > > > > Thoughts?
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 24, 2015 at 10:15 AM, abdullah alamoudi <bamousaa@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > >> This solution turned out to be incorrect. Actually, the
> > > > > > > > > >> test cases when I build after using the join method
> > > > > > > > > >> never fail but running an actual asterix instance never
> > > > > > > > > >> succeeds which is quite confusing.
> > > > > > > > > >>
> > > > > > > > > >> I also think that the startup script has a major bug
> > > > > > > > > >> where it might return before the startup is complete.
> > > > > > > > > >> More on this later......
> > > > > > > > > >>
> > > > > > > > > >> On Mon, Aug 24, 2015 at 7:48 AM, abdullah alamoudi <bamousaa@gmail.com> wrote:
> > > > > > > > > >>
> > > > > > > > > >>> It is highly unlikely that it is related.
> > > > > > > > > >>>
> > > > > > > > > >>> Cheers,
> > > > > > > > > >>> Abdullah.
> > > > > > > > > >>>
> > > > > > > > > >>> On Mon, Aug 24, 2015 at 5:45 AM, Chen Li <chenli@gmail.com> wrote:
> > > > > > > > > >>>
> > > > > > > > > >>>> @Abdullah: Is this issue related to
> > > > > > > > > >>>> https://issues.apache.org/jira/browse/ASTERIXDB-1074?
> > > > > > > > > >>>> Ian and I plan to look into the details on Monday.
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Sun, Aug 23, 2015 at 10:08 AM, abdullah alamoudi <bamousaa@gmail.com> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> > About 3-4 days ago, I was working on the addition of
> > > > > > > > > >>>> > the filesystem based feed adapter and it didn't take
> > > > > > > > > >>>> > any time to complete. However, when I wanted to build
> > > > > > > > > >>>> > and make sure all tests pass, I kept getting
> > > > > > > > > >>>> > ConnectionRefused errors which caused the installer
> > > > > > > > > >>>> > tests to fail every now and then.
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > I knew the new change had nothing to do with this
> > > > > > > > > >>>> > failure, yet, I couldn't direct my attention away from
> > > > > > > > > >>>> > this bug (It just bothered me so much and I knew it
> > > > > > > > > >>>> > needs to be resolved ASAP). After wasting countless
> > > > > > > > > >>>> > hours, I was finally able to figure out what was
> > > > > > > > > >>>> > happening :-)
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > In the startup routine, we start three Jetty web
> > > > > > > > > >>>> > servers (Web interface server, JSON API server, and
> > > > > > > > > >>>> > Feed server).
> > > > > > > > > >>>> > Some time ago, we used to end the startup call before
> > > > > > > > > >>>> > making sure the server.isStarted() method returns true
> > > > > > > > > >>>> > on all servers. At that time, I introduced the
> > > > > > > > > >>>> > waitUntilServerStarts method to make sure we don't
> > > > > > > > > >>>> > return before the servers are ready. Turned out, that
> > > > > > > > > >>>> > was an incorrect way to handle this (We can blame
> > > > > > > > > >>>> > stackoverflow for this one!) and it is not enough that
> > > > > > > > > >>>> > server.isStarted() returns true. The correct way to do
> > > > > > > > > >>>> > this is to call the server.join() method after the
> > > > > > > > > >>>> > server.start().
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > See:
> > > > > > > > > >>>> > http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > This was equally satisfying as it was frustrating and
> > > > > > > > > >>>> > you are welcome for the future time I saved each of
> > > > > > > > > >>>> > you :)
> > > > > > > > > >>>> > --
> > > > > > > > > >>>> > Amoudi, Abdullah.
> > > > > > > > > >>>
> > > > > > > > > >>> --
> > > > > > > > > >>> Amoudi, Abdullah.
> > > > > > > > > >>
> > > > > > > > > >> --
> > > > > > > > > >> Amoudi, Abdullah.
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Amoudi, Abdullah.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Amoudi, Abdullah.
> > > > > > >
> > > > > > > --
> > > > > > > Amoudi, Abdullah.
> > > > >
> > > > > --
> > > > > Amoudi, Abdullah.
> > > >
> > > > --
> > > > Raman
> >
> > --
> > Amoudi, Abdullah.
> >
> > --
> > Raman

--
Amoudi, Abdullah.
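
The polling option discussed in this thread (the installer driver repeatedly asking the CC whether it is ready, as Ian describes doing for the YARN installer) could look roughly like the sketch below. It is only an illustration and not AsterixDB code: the class StartupWait, the method pollUntilReady, and the stand-in probe are hypothetical names. A real probe might run the "for $x in dataset Metadata.Dataset return $x" query through the API and report success, and the timeout might come from the startup wait time in the parameters file.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of AsterixDB or Managix.
final class StartupWait {
    private StartupWait() {}

    /**
     * Calls the probe until it returns true or the timeout elapses.
     * A probe that throws (for example on "connection refused") is treated
     * as "not ready yet". Returns true only if the probe succeeded in time.
     */
    static boolean pollUntilReady(BooleanSupplier probe, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            boolean ready;
            try {
                ready = probe.getAsBoolean();
            } catch (RuntimeException e) {
                ready = false; // the service is probably still starting
            }
            if (ready) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: startup did not complete in time
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in probe: pretends the cluster becomes ready 200 ms after launch.
        boolean ready = pollUntilReady(
                () -> System.currentTimeMillis() - start >= 200, 2_000, 50);
        System.out.println("cluster ready: " + ready);
    }
}
```

The pushing option would invert this: the CC would notify the installer driver once startup is complete, so no polling interval has to be chosen, at the cost of an extra channel between the two.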