kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Reply: A few questions for using Kudu
Date Tue, 20 Mar 2018 02:50:24 GMT
On Thu, Mar 15, 2018 at 8:32 PM, 张晓宁 <zhangxiaoning@jd.com> wrote:

> Thank you, Dan! My follow-up comments are marked "XiaoNing".
> *From:* Dan Burkert [mailto:danburkert@apache.org]
> *Sent:* March 16, 2018 1:06
> *To:* user@kudu.apache.org
> *Subject:* Re: A few questions for using Kudu
> Hi, answers inline:
> On Thu, Mar 15, 2018 at 3:12 AM, 张晓宁 <zhangxiaoning@jd.com> wrote:
> I have a few questions about using Kudu:
> 1.       As more and more data is inserted into Kudu, performance
> decreases. After about 30 minutes of continuous insertion, TPS had
> decreased by 20%, and after 1 hour of insertion it had
> decreased by 40%. Is this a known issue?
> This is expected if you are inserting data in random order.  If you try
> another benchmark where you insert data in primary key sorted order, you'll
> see that the performance will be much higher, and more consistent.  If you
> have a heavy insert workload, this kind of optimization is critical.  The
> table's partitioning and primary key can often be designed to make this
> happen naturally, but it's a dataset dependent thing, so without more
> specifics about your data it's difficult to give more precise advice.
>  XiaoNing: Our table has 2 levels of partitioning. The first level is a
> range partition by date (on the timestamp column), one partition per day,
> and the second level is a hash partition on 2 columns (key + host). These 3
> columns (timestamp, key, host) are the primary key of the table. Regarding
> your comment "insert data in primary key sorted order", do you mean we need to
> sort the data on the 3 primary-key columns before insertion?

If timestamp is the first column then it should probably be somewhat
naturally-sorted by the primary key, right? It doesn't need to be perfectly
sorted, but if the inserts are in roughly PK order, we will avoid
unnecessary compaction.
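
For illustration, here is a minimal Python sketch of what "roughly PK order" could look like on the client side. It does not use any Kudu client API. The column names (timestamp, key, host) come from the thread, while the sample rows and the helper names `sort_batch_by_pk` and `fraction_in_order` are hypothetical. It sorts a batch by the primary-key columns before it would be handed to an insert call, and reports how much of a batch is already in order.

```python
from operator import itemgetter

# Primary-key columns as described in the thread; order matters.
PK_COLUMNS = ("timestamp", "key", "host")

# Hypothetical sample batch, deliberately out of order.
rows = [
    {"timestamp": 1521168003, "key": "cpu", "host": "h2", "value": 0.7},
    {"timestamp": 1521168001, "key": "mem", "host": "h1", "value": 0.4},
    {"timestamp": 1521168002, "key": "cpu", "host": "h1", "value": 0.9},
    {"timestamp": 1521168001, "key": "cpu", "host": "h1", "value": 0.5},
]


def sort_batch_by_pk(batch, pk_columns=PK_COLUMNS):
    """Return a new list with the rows ordered by the primary-key columns."""
    return sorted(batch, key=itemgetter(*pk_columns))


def fraction_in_order(batch, pk_columns=PK_COLUMNS):
    """Share of adjacent row pairs that are already in non-decreasing PK order."""
    pk = itemgetter(*pk_columns)
    keys = [pk(row) for row in batch]
    if len(keys) < 2:
        return 1.0
    in_order = sum(a <= b for a, b in zip(keys, keys[1:]))
    return in_order / (len(keys) - 1)


sorted_rows = sort_batch_by_pk(rows)
print("in order before sorting:", fraction_in_order(rows))
print("in order after sorting: ", fraction_in_order(sorted_rows))
```

A batch that is sorted as a whole stays sorted within any subset of its rows, so the ordering should carry over to whichever rows end up in the same hash bucket. If `fraction_in_order` is already close to 1.0 for incoming data, an explicit sort is probably not worth the cost.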

> 2.       When setting the replica number to 1, I will have 2 copies
> of the data in total (1 master copy + 1 replica copy). Is this true?
> That's incorrect.  The master node does not hold any table data.  If you
> set the number of replicas to be 1, you will lose data if you lose the
> tablet server which holds the replica.  We always recommend production
> workloads set number of replicas to 3 in order to have fault tolerance.
>  XiaoNing: So if we want fault tolerance, we should set the replica
> number to at least 3, right?

That's right.
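
As a rough illustration of why 3 is the minimum, the arithmetic can be sketched in Python. This assumes majority-based replication, where a tablet stays available as long as more than half of its replicas are up. The function name `tolerated_failures` is hypothetical and not part of any Kudu API.

```python
def tolerated_failures(num_replicas):
    """Replicas that can be lost while a majority is still available."""
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    majority = num_replicas // 2 + 1  # smallest group that is more than half
    return num_replicas - majority


for n in (1, 3, 5):
    print(n, "replicas ->", tolerated_failures(n), "tolerated failures")
```

Under that assumption, 1 replica tolerates no failures, 3 replicas tolerate the loss of one tablet server, and 5 tolerate two.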

Todd Lipcon
Software Engineer, Cloudera
