Mailing-List: contact user-help@kudu.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@kudu.apache.org
Delivered-To: mailing list user@kudu.apache.org
MIME-Version: 1.0
From: Andrew Wong
Date: Tue, 5 Dec 2017 16:18:32 -0800
Subject: Re: Data inconsistency after restart
To: user@kudu.apache.org
Content-Type: multipart/alternative; boundary="001a114ab1bc0f6cdd055fa0e68e"
archived-at: Wed, 06 Dec 2017 00:19:09 -0000

--001a114ab1bc0f6cdd055fa0e68e
Content-Type: text/plain; charset="UTF-8"

Hi Petter,

> When we verified that all data was inserted we found that some data was
> missing. We added this missing data and on some chunks we got the
> information that all rows were already present, i.e. Impala says something
> like Modified: 0 rows, nnnnnnn errors. Doing the verification again now
> shows that the Kudu table is complete. So, even though we did not insert
> any data on some chunks, a count(*) operation over these chunks now returns
> a different value.

How did you verify that all the data was inserted, and how did you find that
some data was missing? I'm wondering if it's possible that the initial
"missing" data was data that Kudu was still in the process of inserting
(albeit slowly, due to memory backpressure or somesuch).

> Now to my question. Will data be inconsistent if we recycle Kudu after
> seeing soft memory limit warnings?

Your data should be consistently written, even with those warnings. AFAIK
they would cause a bit of slowness, not incorrect results.

> Is there a way to tell when it is safe to restart Kudu to avoid these
> issues? Should we use any special procedure when restarting (e.g. only
> restart the tablet servers, only restart one tablet server at a time or
> something like that)?

In general, you can use the `ksck` tool to check the health of your cluster.
See
https://kudu.apache.org/docs/command_line_tools_reference.html#cluster-ksck
for more details. For restarting a cluster, I would recommend taking down
all tablet servers at once; otherwise tablet replicas may try to replicate
data from the server that was taken down.
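As a rough sketch of the health check mentioned above (not taken from the
Kudu docs), a small Python wrapper could run `kudu cluster ksck` before and
after a restart. The master addresses are placeholders, the `kudu` binary is
assumed to be on the PATH, and treating a non-zero exit status as "unhealthy"
is an assumption you should confirm against your Kudu version.

```python
import subprocess
from typing import Callable, List, Sequence


def build_ksck_command(master_addresses: Sequence[str]) -> List[str]:
    """Build the argv for `kudu cluster ksck <comma-separated masters>`."""
    if not master_addresses:
        raise ValueError("at least one master address is required")
    return ["kudu", "cluster", "ksck", ",".join(master_addresses)]


def cluster_is_healthy(
    master_addresses: Sequence[str],
    run: Callable = subprocess.run,
) -> bool:
    """Run ksck and report whether it exited cleanly.

    Assumption: ksck exits with a non-zero status when it finds problems.
    The `run` parameter exists only so the call can be replaced in tests.
    """
    result = run(
        build_ksck_command(master_addresses),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

For example, `cluster_is_healthy(["master-1:7051", "master-2:7051"])` (the
hostnames are hypothetical) could be called before stopping the tablet
servers and again once they are all back up, before resuming ingestion.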
Hope this helped,
Andrew

On Tue, Dec 5, 2017 at 10:42 AM, Petter von Dolwitz (Hem) <
petter.von.dolwitz@gmail.com> wrote:

> Hi Kudu users,
>
> We just started to use Kudu (1.4.0+cdh5.12.1). To make a baseline for
> evaluation we ingested 3 months' worth of data. During ingestion we were
> facing messages from the maintenance threads that a soft memory limit was
> reached. It seems like the background maintenance threads stopped
> performing their tasks at this point in time. It also seems like the
> memory was never recovered even after stopping ingestion, so I guess there
> was a large backlog being built up. I guess the root cause here is that we
> were a bit too conservative when giving Kudu memory. After a restart a
> lot of maintenance tasks were started (i.e. compaction).
>
> When we verified that all data was inserted we found that some data was
> missing. We added this missing data and on some chunks we got the
> information that all rows were already present, i.e. Impala says something
> like Modified: 0 rows, nnnnnnn errors. Doing the verification again now
> shows that the Kudu table is complete. So, even though we did not insert
> any data on some chunks, a count(*) operation over these chunks now returns
> a different value.
>
> Now to my question. Will data be inconsistent if we recycle Kudu after
> seeing soft memory limit warnings?
>
> Is there a way to tell when it is safe to restart Kudu to avoid these
> issues? Should we use any special procedure when restarting (e.g. only
> restart the tablet servers, only restart one tablet server at a time or
> something like that)?
>
> The table design uses 50 tablets per day (times 90 days). It is 8 TB of
> data after 3x replication over 5 tablet servers.
>
> Thanks,
> Petter
>

--
Andrew Wong

--001a114ab1bc0f6cdd055fa0e68e
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--001a114ab1bc0f6cdd055fa0e68e--