Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Reply-To: user@accumulo.apache.org
MIME-Version: 1.0
Date: Fri, 13 Nov 2015 09:56:41 -0500
Subject: Re: Quick question re UnknownHostException
From: Adam Fuchs
To: user@accumulo.apache.org
Content-Type: multipart/alternative; boundary=001a114226146a3b6d05246d4455

--001a114226146a3b6d05246d4455
Content-Type: text/plain; charset=UTF-8

Josef,

If these are intermittent failures, you might consider turning on the
watcher [1] to automatically restart your processes. This should keep your
cluster from atrophying over time. You'll still have to take administrative
action to fix the DNS problem, but your availability should be better.

Cheers,
Adam

[1] http://accumulo.apache.org/1.7/accumulo_user_manual.html#watcher

On Fri, Nov 13, 2015 at 6:57 AM, Josef Roehrl - PHEMI wrote:

> Hi Everyone,
>
> Turns out that it was a DNS server issue exactly. Had to get this
> confirmed by the Data Centre, though.
>
> Thanks!
>
> On Fri, Nov 13, 2015 at 12:25 PM, Josef Roehrl - PHEMI
> wrote:
>
>> Hi All,
>>
>> 3 times in the past few weeks (twice on 1 system, once on another), the
>> master gets UnknownHostException(s), one by one, for each of the tablet
>> servers. Then, it wants to stop them. Eventually, all the tablet servers
>> quit.
>>
>> It goes like this for all the tablet servers:
>>
>> 12 08:14:01,0498 tserver:620 ERROR
>> error sending update to tserver3:9997: org.apache.thrift.transport.TTransportException: java.net.UnknownHostException
>>
>> 12 09:01:53,0352 master:12 ERROR
>> org.apache.thrift.transport.TTransportException: java.net.UnknownHostException
>>
>> 12 16:35:50,0672 master:110 ERROR
>> unable to get tablet server status tserver3:9997[250e6cd2c500012] org.apache.thrift.transport.TTransportException: java.net.UnknownHostException
>>
>> I've redacted the real host names, of course.
>>
>> This could be a DNS problem, though the system was running fine for days
>> before this happened (same scenario on the 2 systems with really quite
>> different DNS servers).
>>
>> If anyone has a hint or has seen something like this, I would appreciate
>> any pointers.
>>
>> I have looked at the JIRA issues regarding DNS outages, but nothing seems
>> to fit this pattern.
>>
>> Thanks
>>
>> --
>>
>> Josef Roehrl
>> Senior Software Developer
>> *PHEMI Systems*
>> 180-887 Great Northern Way
>> Vancouver, BC V5T 4T5
>> 604-336-1119
>> Website Twitter Linkedin

--001a114226146a3b6d05246d4455
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a114226146a3b6d05246d4455--
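
One way to confirm that name resolution is failing intermittently, as the thread concludes, is to repeat lookups for each tablet server host and count the failures. What follows is only a minimal sketch and not something posted in the thread. The function names `resolves` and `probe` are hypothetical, and the host names are placeholders (`tserver3` is the redacted name from the logs above; `tserver1` is invented for illustration). It uses only the Python standard library.

```python
import socket
import time
from typing import Callable, Dict, Iterable


def resolves(host: str,
             resolver: Callable[[str], object] = socket.gethostbyname) -> bool:
    """Return True if `host` resolves, False if the resolver raises an error."""
    try:
        resolver(host)
        return True
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return False


def probe(hosts: Iterable[str],
          attempts: int = 3,
          delay_seconds: float = 0.0,
          resolver: Callable[[str], object] = socket.gethostbyname) -> Dict[str, int]:
    """Look up every host `attempts` times and count failed lookups per host."""
    failures = {host: 0 for host in hosts}
    for attempt in range(attempts):
        for host in failures:
            if not resolves(host, resolver):
                failures[host] += 1
        # Pause between rounds so that intermittent outages have a chance to show.
        if delay_seconds and attempt < attempts - 1:
            time.sleep(delay_seconds)
    return failures


if __name__ == "__main__":
    # Placeholder host names; substitute the real tablet server names.
    for host, failed in probe(["tserver1", "tserver3"], attempts=5,
                              delay_seconds=1.0).items():
        print(f"{host}: {failed} failed lookups out of 5")
```

A non-zero count for a host that normally resolves would point at the DNS servers rather than at Accumulo. The `resolver` parameter exists only so the check can be exercised without a network.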