Mailing-List: contact user-help@kudu.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@kudu.apache.org
MIME-Version: 1.0
In-Reply-To: <CADFYdErzKcfxycWjvkTVCRzwh_huHsN3wfWEHW6_XMruNp=d8A@mail.gmail.com>
References: <CADFYdErzKcfxycWjvkTVCRzwh_huHsN3wfWEHW6_XMruNp=d8A@mail.gmail.com>
From: Andrew Wong <awong@cloudera.com>
Date: Tue, 5 Dec 2017 16:18:32 -0800
Message-ID: <CAPV-+AzQbwdtcbkTU2hR6Pi6DOn7uwHAydYYf9tfURq2DZdfwA@mail.gmail.com>
Subject: Re: Data inconsistency after restart
To: user@kudu.apache.org
Content-Type: multipart/alternative; boundary="001a114ab1bc0f6cdd055fa0e68e"
archived-at: Wed, 06 Dec 2017 00:19:09 -0000

--001a114ab1bc0f6cdd055fa0e68e
Content-Type: text/plain; charset="UTF-8"

Hi Petter,

> When we verified that all data was inserted we found that some data was
> missing. We added this missing data and on some chunks we got the
> information that all rows were already present, i.e. Impala says something
> like Modified: 0 rows, nnnnnnn errors. Doing the verification again now
> shows that the Kudu table is complete. So, even though we did not insert
> any data on some chunks, a count(*) operation over these chunks now returns
> a different value.


How did you verify that all the data was inserted and how did you find some
data missing? I'm wondering if it's possible that the initial "missing"
data was data that Kudu was still in the process of inserting (albeit
slowly, due to memory backpressure or somesuch).

> Now to my question. Will data be inconsistent if we recycle Kudu after
> seeing soft memory limit warnings?


Your data should be consistently written, even with those warnings. AFAIK
they would cause a bit of slowness, not incorrect results.

> Is there a way to tell when it is safe to restart Kudu to avoid these
> issues? Should we use any special procedure when restarting (e.g. only
> restart the tablet servers, only restart one tablet server at a time or
> something like that)?


In general, you can use the `ksck` tool to check the health of your
cluster. See
https://kudu.apache.org/docs/command_line_tools_reference.html#cluster-ksck
for more details. For restarting a cluster, I would recommend taking down
all tablet servers at once; otherwise, the remaining servers may try to
re-replicate the tablets hosted on the server that was taken down.
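
As a rough sketch of how such a health check could be scripted around a
restart: the snippet below shells out to `kudu cluster ksck` and treats a
non-zero exit status as "not healthy yet". The master address is a
placeholder, the helper names are made up for illustration, and the
exit-status behaviour is my assumption, so please check it against your
version before relying on it.

```python
import subprocess

# Placeholder address; substitute your own master(s).
MASTERS = ["master-1.example.com:7051"]


def ksck_command(masters):
    """Build the argv for `kudu cluster ksck <master_addresses>`."""
    return ["kudu", "cluster", "ksck", ",".join(masters)]


def cluster_is_healthy(masters):
    """Run ksck once and report whether it exited cleanly.

    Assumption: ksck exits with a non-zero status when it finds problems.
    """
    result = subprocess.run(
        ksck_command(masters),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    print(result.stdout)  # keep the full report for later inspection
    return result.returncode == 0
```

Calling cluster_is_healthy(MASTERS) before taking the tablet servers down,
and again after they are back up, gives a simple signal for when the
cluster has settled.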

Hope this helped,
Andrew

On Tue, Dec 5, 2017 at 10:42 AM, Petter von Dolwitz (Hem) <
petter.von.dolwitz@gmail.com> wrote:

> Hi Kudu users,
>
> We just started to use Kudu (1.4.0+cdh5.12.1). To make a baseline for
> evaluation we ingested 3 months' worth of data. During ingestion we were
> facing messages from the maintenance threads that a soft memory limit was
> reached. It seems like the background maintenance threads stopped
> performing their tasks at this point in time. It also seems like the
> memory was never recovered even after stopping ingestion so I guess there
> was a large backlog being built up. I guess the root cause here is that we
> were a bit too conservative when giving Kudu memory. After a restart a
> lot of maintenance tasks were started (e.g. compaction).
>
> When we verified that all data was inserted we found that some data was
> missing. We added this missing data and on some chunks we got the
> information that all rows were already present, i.e. Impala says something
> like Modified: 0 rows, nnnnnnn errors. Doing the verification again now
> shows that the Kudu table is complete. So, even though we did not insert
> any data on some chunks, a count(*) operation over these chunks now returns
> a different value.
>
> Now to my question. Will data be inconsistent if we recycle Kudu after
> seeing soft memory limit warnings?
>
> Is there a way to tell when it is safe to restart Kudu to avoid these
> issues? Should we use any special procedure when restarting (e.g. only
> restart the tablet servers, only restart one tablet server at a time or
> something like that)?
>
> The table design uses 50 tablets per day (times 90 days). It is 8 TB of
> data after 3x replication over 5 tablet servers.
>
> Thanks,
> Petter
>
>
>


-- 
Andrew Wong

--001a114ab1bc0f6cdd055fa0e68e--