From user-return-1729-archive-asf-public=cust-asf.ponee.io@kudu.apache.org  Mon Oct  7 09:31:47 2019
Mailing-List: contact user-help@kudu.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:user-help@kudu.apache.org>
List-Unsubscribe: <mailto:user-unsubscribe@kudu.apache.org>
List-Post: <mailto:user@kudu.apache.org>
List-Id: <user.kudu.apache.org>
Reply-To: user@kudu.apache.org
Delivered-To: mailing list user@kudu.apache.org
MIME-Version: 1.0
References: <CAPPEwvSRO7wKCjVyD6_X3u9Wgghu8W9cJySp4R4kB8-HuOf2jQ@mail.gmail.com>
 <CANbMB4wzY3sVGfqTVNGDC+=ne-q7rTutr1ANN_C-JVONP0od+w@mail.gmail.com>
In-Reply-To: <CANbMB4wzY3sVGfqTVNGDC+=ne-q7rTutr1ANN_C-JVONP0od+w@mail.gmail.com>
From: Faraz Mateen <fmateen@an10.io>
Date: Mon, 7 Oct 2019 14:31:00 +0500
Message-ID: <CAPPEwvTJ9pBghjNUGzdtrDZS_e3dQDLECzDmFE_3E8NNXX31nQ@mail.gmail.com>
Subject: Re: "Too many open files" error
To: user@kudu.apache.org
Content-Type: multipart/alternative; boundary="000000000000d8340805944eb71a"

--000000000000d8340805944eb71a
Content-Type: text/plain; charset="UTF-8"

Alexey,

Thank you for the response. Too many partitions is exactly the problem:
when I restart the tserver, it tries to open files for each tablet and
eventually crashes.

Is there a way to get around this and recover my data? Is there a
configuration setting I can change to get the tserver running? Or can I add
a new tablet server and migrate the existing tablets to it?

On Sat, Oct 5, 2019 at 10:05 PM Alexey Serbin <aserbin@cloudera.com> wrote:

> Hi,
>
> Most likely the issue happened because of the high number of tablet
> replicas at the tablet server.  In case of a spike in the input data rate,
> higher compaction activity might require more file descriptors than usual,
> since more files are opened.
>
> How many tablet replicas does that tablet server have?  It's not
> recommended to have too many:
> https://kudu.apache.org/docs/known_issues.html#_scale
>
> To understand what has happened, you need to take a look into the logs of
> the tablet server.  This might be useful:
> https://kudu.apache.org/docs/troubleshooting.html
>
> Overall, if there is only one (?) tablet server in the whole Kudu cluster,
> why have 39 partitions per table?  I guess that's some sort of
> proof-of-concept/toy setup, but anyway.  Since all the tablet replicas end
> up at the same single tablet server, I don't see any benefit from
> partitioning in that setup.  For the tablet server, it simply means an
> x-times increase in the number of open file descriptors and in memory usage.
>
>
> Kind regards,
>
> Alexey
>
> On Fri, Oct 4, 2019 at 4:21 AM Faraz Mateen <fmateen@an10.io> wrote:
>
>> Hi all,
>>
>> I am facing a problem with my Kudu setup where the tablet server crashes
>> with a "too many open files" error.
>> The setup consists of a single master and a single tablet server. Each
>> table is created with 39 partitions, although not all partitions have
>> data that corresponds to them.
>> Yesterday my tserver crashed, and when I try to restart it, it fails with
>> the error:
>>
>> I1004 03:50:39.896301  5669 ts_tablet_manager.cc:1173] T cab85f15f06748d0b59161d9f3da55f7 P ee14d248ac994d0eb60dbb0db4ab3b09: Registered tablet (data state: TABLET_DATA_READY)
>> W1004 03:50:39.923184  5687 os-util.cc:165] could not read /proc/self/status: IO error: /proc/self/status: Too many open files (error 24)
>> I1004 03:50:39.939460  5669 ts_tablet_manager.cc:1173] T d8d68ce6f6ea49479c00d29709869f1f P ee14d248ac994d0eb60dbb0db4ab3b09: Registered tablet (data state: TABLET_DATA_READY)
>>
>> I have already raised the open-files ulimit on the machine:
>>
>> root@vm-3:~# ulimit -a
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 63923
>> max locked memory       (kbytes, -l) 16384
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 65535
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) 8192
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) 65535
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>>
>> *Setup details:*
>> Single master and tserver setup on a single VM.
>> 4 cores, 550GB hard disk, 16GB RAM
>> Kudu version 1.8 on Ubuntu, installed from Debian packages.
>> Before the crash, data was being inserted into Kudu at a very high rate.
>> RAM usage was around 87% and disk usage was around 84%.
>>
>> Here is what I have tried so far:
>> 1- Set ulimit -n to 65535.
>> 2- Rebooted the VM to get rid of stale processes.
>> 3- Set block_manager_max_open_files to 32000 in the tserver flag file.
>>
>> What I want to know now is:
>> 1- Why am I hitting this problem? Is it due to low resources on the VM
>> or to the high number of tablets on a single tserver?
>> 2- How can I get around this problem and recover my data and Kudu
>> services?
>>
>> I would really appreciate some help on this.
>> --
>> Faraz Mateen
>>
>

-- 
Faraz Mateen

--000000000000d8340805944eb71a--