Mailing-List: contact user-help@kudu.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@kudu.apache.org
MIME-Version: 1.0
In-Reply-To: <CAEeV1nuLED0R3nQwSJ63kTzCPTtNrLbH6dMYukAS2f_QJiBm5g@mail.gmail.com>
References: <CAEeV1nvEQsWwdsroYEhTq3HysWbktC9d=5i6NVE9xoiUUcVxUg@mail.gmail.com>
 <CALo2W-U5SRf29yMDm8OfE7_mJs4Kj+Xkqwg6hZ0CU+SS3YEyQQ@mail.gmail.com>
 <CAEeV1ns2Ocx+CjZ25jHb7jOjrESMk7FiY8r=iMaCQwrxL0D1Yg@mail.gmail.com>
 <CALo2W-XYT94L1FvqOZsf-9YaEC34jnBonvRMJ_h3vepfwwt2GA@mail.gmail.com> <CAEeV1nuLED0R3nQwSJ63kTzCPTtNrLbH6dMYukAS2f_QJiBm5g@mail.gmail.com>
From: Paul Brannan <paul.brannan@thesystech.com>
Date: Sun, 26 Feb 2017 18:53:52 -0500
Message-ID: <CAEeV1nssze6WRU_P2aDnOVoGWB8Lz5+QFyz5TqyUjs+S-YUpWQ@mail.gmail.com>
Subject: Re: mixing range and hash partitioning
To: user@kudu.apache.org
Content-Type: multipart/alternative; boundary=001a113a80e8d3f7ae054977acc2
archived-at: Sun, 26 Feb 2017 23:53:57 -0000

--001a113a80e8d3f7ae054977acc2
Content-Type: text/plain; charset=UTF-8

Is that 4TB per tablet server, regardless of how many tablets it has?

If I have 128GB of data per day, then each tablet server hits the
recommended limit after about a month.  To store 10 years of data, I would
need 120 tablet servers to avoid going over the limit.  Is that the best
solution or is there another alternative?
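
A rough sketch of the arithmetic behind that estimate. The constants are
assumptions taken from the figures above (4 TB per tablet server, 128 GB per
day, 10 years retained), with 1 TB counted as 1024 GB and replication and
compression ignored; the function names are illustrative only.

```cpp
#include <cstdint>

// Assumed inputs (from the question, not from Kudu documentation).
constexpr std::int64_t kLimitGbPerServer = 4 * 1024;  // 4 TB per tablet server
constexpr std::int64_t kIngestGbPerDay = 128;         // daily ingest
constexpr std::int64_t kRetentionDays = 10 * 365;     // 10 years, no leap days

// Days until a single server holding all the data reaches the limit.
constexpr std::int64_t days_to_fill_one_server() {
  return kLimitGbPerServer / kIngestGbPerDay;
}

// Servers needed to hold the whole retention window, rounded up.
constexpr std::int64_t servers_needed() {
  return (kIngestGbPerDay * kRetentionDays + kLimitGbPerServer - 1) /
         kLimitGbPerServer;
}
```

Under these assumptions one server fills in 32 days and the full window needs
115 servers; rounding a server's capacity down to one calendar month gives the
figure of 120. Replicating the data would multiply either number.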

How many cores are recommended per tablet server?  If I typically only scan
one day of data at a time, could a single core service multiple tablet
servers?


On Fri, Feb 24, 2017 at 11:22 PM, Paul Brannan <paul.brannan@thesystech.com>
wrote:

> The test doesn't exactly reproduce what I did in my sample program.
>
> I'm able to successfully drop the unbounded partition in both cases
> (calling set_range_partition_columns only vs calling
> set_range_partition_columns+add_hash_partitions).  However, if I omit the
> call to DropRangePartition, then AddRangePartition succeeds in the first
> case and fails in the second case.  I expect it to succeed in both cases or
> fail in both cases.
>
> I've attached a simple program which demonstrates.
>
>
> On Fri, Feb 24, 2017 at 7:09 PM, Dan Burkert <danburkert@apache.org>
> wrote:
>
>> Hi Paul,
>>
>> I can't reproduce the behavior you are describing; I always get a single
>> unbounded range partition when creating the table without specifying range
>> bounds or splits (regardless of hash partitioning). I searched and couldn't
>> find a unit test for this behavior, so I wrote one - you might compare your
>> code against my test. https://gerrit.cloudera.org/#/c/6153/
>>
>> Thanks,
>> Dan
>>
>> On Fri, Feb 24, 2017 at 2:41 PM, Paul Brannan <
>> paul.brannan@thesystech.com> wrote:
>>
>>> I can verify that dropping the unbounded range partition allows me to
>>> later add bounded partitions.
>>>
>>> If I only have range partitioning (by commenting out the call to
>>> add_hash_partitions), adding a bounded partition succeeds, regardless of
>>> whether I first drop the unbounded partition.  This seems surprising; why
>>> the difference?
>>>
>>> On Fri, Feb 24, 2017 at 4:20 PM, Dan Burkert <danburkert@apache.org>
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> I think the issue you are running into is that if you don't add a range
>>>> partition explicitly during table creation (by calling add_range_partition
>>>> or inserting a split with add_range_partition_split), Kudu will default to
>>>> creating 1 unbounded range partition.  So your two options are to add the
>>>> range partition at table creation time or, if you only know which
>>>> partition you want at a later time, you can drop the existing partition
>>>> (alterer->DropRangePartition with two empty rows), then add the range
>>>> partition.  Note that dropping the range partition will effectively
>>>> truncate the table.  This can be done with the same alterer in a single
>>>> transaction.  If you want to see a bunch of examples, you can check out
>>>> this unit test:
>>>> https://github.com/apache/kudu/blob/master/src/kudu/integration-tests/alter_table-test.cc#L1106.
>>>>
>>>> - Dan
>>>>
>>>> On Fri, Feb 24, 2017 at 10:53 AM, Paul Brannan <
>>>> paul.brannan@thesystech.com> wrote:
>>>>
>>>>> I'm trying to create a table with one column range-partitioned and
>>>>> another column hash-partitioned.  Documentation for add_hash_partitions and
>>>>> set_range_partition_columns suggests this should be possible ("Tables must
>>>>> be created with either range, hash, or range and hash partitioning").
>>>>>
>>>>> I have a schema with three INT64 columns ("time", "key", and
>>>>> "value").  When I create the table, I set up the partitioning:
>>>>>
>>>>> (*table_creator)
>>>>>   .table_name("test_table")
>>>>>   .schema(&schema)
>>>>>   .add_hash_partitions({"key"}, 2)
>>>>>   .set_range_partition_columns({"time"})
>>>>>   .num_replicas(1)
>>>>>   .Create();
>>>>>
>>>>> I later try to add a partition:
>>>>>
>>>>> auto timesplit(KuduSchema & schema, std::int64_t t) {
>>>>>   auto split = schema.NewRow();
>>>>>   check_ok(split->SetInt64("time", t));
>>>>>   return split;
>>>>> }
>>>>>
>>>>> alterer->AddRangePartition(
>>>>>   timesplit(schema, date_start),
>>>>>   timesplit(schema, next_date_start));
>>>>>
>>>>> check_ok(alterer->Alter());
>>>>>
>>>>> But I get an error "Invalid argument: New range partition conflicts
>>>>> with existing range partition".
>>>>>
>>>>> How are hash and range partitioning intended to be mixed?
>>>>>
>>>>>
>>>>
>>>
>>
>
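
For reference, the "conflicts with existing range partition" error in the
quoted thread can be pictured with a toy overlap check. This is only a sketch
of the idea described there (a default unbounded range overlaps every bounded
one), not Kudu's implementation; the Range type and the overlaps and add_range
helpers are hypothetical names, and the code assumes C++17 for std::optional.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Toy model: a half-open range [lower, upper). An empty optional means the
// range is unbounded on that side.
struct Range {
  std::optional<std::int64_t> lower;
  std::optional<std::int64_t> upper;
};

// Two half-open ranges overlap when each one starts before the other ends.
bool overlaps(const Range& a, const Range& b) {
  const bool a_starts_before_b_ends =
      !a.lower || !b.upper || *a.lower < *b.upper;
  const bool b_starts_before_a_ends =
      !b.lower || !a.upper || *b.lower < *a.upper;
  return a_starts_before_b_ends && b_starts_before_a_ends;
}

// Records the new range and returns true only if it conflicts with no
// existing range; otherwise leaves the list untouched and returns false.
bool add_range(std::vector<Range>& existing, const Range& r) {
  for (const Range& e : existing) {
    if (overlaps(e, r)) return false;
  }
  existing.push_back(r);
  return true;
}
```

In this model a list holding only the unbounded range rejects every bounded
addition, while an empty list (the unbounded range having been dropped first)
accepts adjacent one-day ranges such as [0, 86400) and [86400, 172800).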

--001a113a80e8d3f7ae054977acc2--