Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Received-SPF: pass (athena.apache.org: domain of eric.newton@gmail.com
 designates 209.85.216.174 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAPjGfyGycBa3kgU3nMowuNWu3DDueUy39UKh9D3N6E9_Ej1F1g@mail.gmail.com>
References: 
 <CAPjGfyGvmh_xZQrJUCsxYx-eTiBZP1GcvwLeuMv6WbSsvV-vJA@mail.gmail.com>
	<CADxc9BnTmC88N+Beop10Am+FBP_p8cSBzqxXO4p0eHmGQqpD7w@mail.gmail.com>
	<CAPjGfyGQSCMFmh5chbKeuHeJOSyTtF4EzyZW8CuiZejfL3j2FA@mail.gmail.com>
	<CAGHyZ6KkZo7rfaPE=jSRukw=PLg_LsgL=6z56Ms3KteBOdZvdw@mail.gmail.com>
	<CAPjGfyGycBa3kgU3nMowuNWu3DDueUy39UKh9D3N6E9_Ej1F1g@mail.gmail.com>
Date: Wed, 15 Jan 2014 08:49:13 -0500
Message-ID: 
 <CADxc9BnCdds5R7=es1-ddLKDUz9r031EAru6rV=8kZpu4VC3FQ@mail.gmail.com>
Subject: Re: Bulk ingest losing tablet server
From: Eric Newton <eric.newton@gmail.com>
To: "user@accumulo.apache.org" <user@accumulo.apache.org>
Content-Type: multipart/alternative; boundary=047d7bea44520037aa04f002934d

--047d7bea44520037aa04f002934d
Content-Type: text/plain; charset=ISO-8859-1

When a tablet server (let's call it A) bulk imports a file, it makes a few
bookkeeping entries in the !METADATA table. The tablet server that is
serving the !METADATA table (let's call it B) checks a constraint: tablet
server A must still hold its ZooKeeper lock.  This constraint is being
violated because A has lost its lock.
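
As a rough illustration of that check, here is a simplified model in Python.
It is not the actual MetadataConstraints code: a plain dict stands in for
ZooKeeper, lock_is_held() and parse_lock_id() are hypothetical helpers, and
it assumes the part of the srv:lock value after the "$" identifies the
ZooKeeper session that created the lock.

```python
# Simplified model only: a plain dict stands in for ZooKeeper, and these
# helpers are hypothetical, not Accumulo APIs.

def parse_lock_id(value):
    """Split 'path$session' into (path, session id as an int)."""
    path, _, session_hex = value.rpartition("$")
    return path, int(session_hex, 16)


def lock_is_held(live_locks, srv_lock_value):
    """True if the lock node exists and belongs to the same session."""
    path, session = parse_lock_id(srv_lock_value)
    return live_locks.get(path) == session


# srv:lock value taken from the log lines quoted in this thread.
lock = "tservers/192.168.2.231:9997/zlock-0000000002$2438da698db13b4"
path, session = parse_lock_id(lock)

live_locks = {path: session}
print(lock_is_held(live_locks, lock))  # A still holds its lock: accept

del live_locks[path]
print(lock_is_held(live_locks, lock))  # A lost its lock: reject the mutation
```

In this model, every metadata mutation that A sends after losing its lock
fails the check, which would match the repeated "violating metadata
mutation" messages.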

Tablet server A should have died.

The native map is used for live data ingest and exists outside of the Java
heap.  The index and data caches live in the heap.
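
Because the native map sits outside the heap, the tserver's memory footprint
is roughly the JVM heap plus the native map. A minimal sketch of the
resulting budget check follows. Only the 1GB native map figure comes from
this thread; the heap size, the other processes' sizes, and the RAM totals
are made-up placeholders, and both function names are hypothetical.

```python
# All sizes in GB. Only the 1GB native map comes from this thread;
# every other number below is a made-up placeholder.

def tserver_footprint(jvm_heap_gb, native_map_gb):
    """Heap (including the caches) plus the off-heap native map."""
    return jvm_heap_gb + native_map_gb


def fits_in_ram(ram_gb, footprints_gb):
    """Budget as if every process is at its limit at the same time."""
    return sum(footprints_gb) <= ram_gb


tserver = tserver_footprint(jvm_heap_gb=11, native_map_gb=1)
others = [2, 2, 8]  # hypothetical: datanode, nodemanager, map-reduce tasks

print(fits_in_ram(32, [tserver] + others))  # enough RAM for everything
print(fits_in_ram(16, [tserver] + others))  # overallocated: expect swapping
```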

-Eric


On Wed, Jan 15, 2014 at 8:19 AM, Anthony F <afccri@gmail.com> wrote:

> Just checked on the native mem maps . . . looks like it is set to 1GB.  Do
> the index and data caches reside in native mem maps if available or is
> native mem used for something else?
>
> I just repeated an ingest . . . this time I did not lose any tablet
> servers but my logs are filling up with the following messages:
>
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG: violating metadata mutation : b;74~thf
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG:  update: file:/b-00012bq/I00012cj.rf value 20272720,0,1389757684543
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG:  update: loaded:/b-00012bq/I00012cj.rf value 2675766456963732003
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG:  update: srv:time value M1389757684543
> 2014-01-15 08:16:41,643 [constraints.MetadataConstraints] DEBUG:  update: srv:lock value tservers/192.168.2.231:9997/zlock-0000000002$2438da698db13b4
>
>
>
> On Mon, Jan 13, 2014 at 2:44 PM, Sean Busbey <busbey+lists@cloudera.com> wrote:
>
>>
>> On Mon, Jan 13, 2014 at 12:02 PM, Anthony F <afccri@gmail.com> wrote:
>>
>>> Yes, system swappiness is set to 0.  I'll run again and gather more logs.
>>>
>>> Is there a zookeeper timeout setting that I can adjust to avoid this
>>> issue and is that advisable?  Basically, the tservers are colocated with
>>> HDFS datanodes and Hadoop nodemanagers.  The machines are overallocated in
>>> terms of RAM.  So, I have a feeling that when a map-reduce job is kicked
>>> off, it causes the tserver to page out to swap space.  Once the map-reduce
>>> job finishes and the bulk ingest is kicked off, the tserver is paged back
>>> in and the ZK timeout causes a shutdown.
>>>
>>>
>>>
>> You should not overallocate the memory on the machines. Generally, you
>> should set memory limits under the assumption that everything will be
>> running at once.
>>
>> Many parts of Hadoop (not just Accumulo) will degrade or malfunction in
>> the presence of memory swapping.
>>
>> How much of the 12GB for Accumulo is for native memory maps?
>>
>
>

--047d7bea44520037aa04f002934d--