From: Faraz Mateen <fmateen@an10.io>
Date: Mon, 7 Oct 2019 14:31:00 +0500
Subject: Re: "Too many open files" error
To: user@kudu.apache.org

Alexey,

Thank you for the response. Having too many partitions is exactly the
problem. When I restart the tserver, it tries to open files for each
tablet and eventually crashes.

Is there a way to get around this and recover my data? Is there any
config I can change to get the tserver running? Or can I add a new
tablet server and migrate the existing tablets to it?

On Sat, Oct 5, 2019 at 10:05 PM Alexey Serbin <aserbin@cloudera.com> wrote:

> Hi,
>
> Most likely the issue happened because of the high number of tablet
> replicas on the tablet server. During a spike in the input data rate,
> increased compaction activity might require more file descriptors than
> usual, since more files are opened.
>
> How many tablet replicas does that tablet server have? It's not
> recommended to have too many:
> https://kudu.apache.org/docs/known_issues.html#_scale
>
> To understand what has happened, you need to look at the logs of the
> tablet server. This might be useful:
> https://kudu.apache.org/docs/troubleshooting.html
>
> Overall, if there is only one (?) tablet server in the whole Kudu
> cluster, why have 39 partitions per table? I guess that's some sort of
> proof-of-concept/toy setup, but anyway. Since all the tablet replicas
> end up on the same single tablet server, I don't see any benefit from
> partitioning in that setup. For the tablet server, it simply means
> x times more open file descriptors and higher memory usage.
>
>
> Kind regards,
>
> Alexey
>
> On Fri, Oct 4, 2019 at 4:21 AM Faraz Mateen <fmateen@an10.io> wrote:
>
>> Hi all,
>>
>> I am facing a problem with my Kudu setup where the tablet server
>> crashes with a "too many open files" error.
>> The setup consists of a single master and a single tablet server.
>> Tables are created with 39 partitions each. However, not all
>> partitions have data that corresponds to them.
>> Yesterday my tserver crashed, and when I try to restart it, it fails
>> with the error:
>>
>> I1004 03:50:39.896301  5669 ts_tablet_manager.cc:1173] T cab85f15f06748d0b59161d9f3da55f7 P ee14d248ac994d0eb60dbb0db4ab3b09: Registered tablet (data state: TABLET_DATA_READY)
>> W1004 03:50:39.923184  5687 os-util.cc:165] could not read /proc/self/status: IO error: /proc/self/status: Too many open files (error 24)
>> I1004 03:50:39.939460  5669 ts_tablet_manager.cc:1173] T d8d68ce6f6ea49479c00d29709869f1f P ee14d248ac994d0eb60dbb0db4ab3b09: Registered tablet (data state: TABLET_DATA_READY)
>>
>> I have already modified the ulimit of the machine:
>>
>> root@vm-3:~# ulimit -a
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 63923
>> max locked memory       (kbytes, -l) 16384
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 65535
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) 8192
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) 65535
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>>
>> *Setup details:*
>> Single master and tserver on a single VM.
>> 4 cores, 550GB hard disk, 16GB RAM.
>> Kudu version 1.8 on Ubuntu, installed from Debian packages.
>> Before the crash, data was being inserted into Kudu at a very high
>> rate. RAM usage was around 87% and disk usage was around 84%.
>>
>> Here is what I have tried so far:
>> 1- Set ulimit -n to 65535.
>> 2- Rebooted the VM to get rid of stale processes.
>> 3- Set block_manager_max_open_files to 32000 in the tserver flag file.
>>
>> What I want to know now is:
>> 1- Why am I hitting this problem? Is it due to low resources on the
>> VM or the high number of tablets on a single tserver?
>> 2- How can I get around this problem and recover my data and Kudu
>> services?
>>
>> I would really appreciate some help on this.
>> --
>> Faraz Mateen
>>
>

--
Faraz Mateen
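One thing worth checking about the `ulimit -a` output quoted above: it reports the limits of the root shell it was run in. Limits are per process on Linux, so a tserver started by an init script or service manager may be running with a different "open files" limit. Below is a minimal sketch, assuming Linux with `/proc` mounted, that parses the `Max open files` row of `/proc/<pid>/limits` and counts the entries in `/proc/<pid>/fd`. The function names are made up for illustration, and the demo at the bottom inspects the script's own PID as a placeholder; the tserver's PID would have to be substituted (reading another user's `/proc/<pid>/fd` usually requires root).

```python
import os
from typing import Optional, Tuple


def parse_max_open_files(limits_text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return the (soft, hard) "Max open files" limits.

    limits_text is the content of /proc/<pid>/limits.
    None stands for "unlimited".
    """
    def to_int(value: str) -> Optional[int]:
        return None if value == "unlimited" else int(value)

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max" "open" "files" <soft> <hard> <units>
            fields = line.split()
            return to_int(fields[3]), to_int(fields[4])
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid: int) -> int:
    """Count the file descriptors currently open in process `pid`."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    # Placeholder: replace with the PID of the tserver process.
    pid = os.getpid()
    limits_path = f"/proc/{pid}/limits"
    if os.path.exists(limits_path):
        with open(limits_path) as f:
            soft, hard = parse_max_open_files(f.read())
        print(f"pid {pid}: soft={soft} hard={hard} open={count_open_fds(pid)}")
    else:
        print("/proc is not available on this system")
```

If the soft limit reported for the tserver process is lower than the 65535 shown by the shell, the limit has to be raised where the service is started rather than in an interactive shell.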