kudu-user mailing list archives

From Andrew Wong <aw...@cloudera.com>
Subject Re: co-locating kudu table servers with HDFS data nodes
Date Wed, 29 Nov 2017 18:33:47 GMT
Hi Sunil,

Sorry for the delayed response. Let me preface this by saying I'm not an
Impala or HDFS expert.

Sharing resources:
The "con" is that Kudu, HDFS, and Impala all compete for the same
resources. For example, HDFS could fill up a disk that Kudu is also using,
in which case Kudu would have to use a different disk (if it is configured
to use multiple disks). The same goes for memory, cores, etc., although
Kudu has its own ways of dealing with memory pressure, full disks, etc. The
"pro" is that you could have fewer machines.

SSD vs spinning disks:
In terms of provisioning for Kudu, I would say that, given the option, your
WAL directory should be on an SSD. The WAL writes to disk on each insert,
upsert, etc., so it is important that this disk is fast.
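
To make that concrete, here is a rough sketch (not an official recipe) of
how the WAL and data directories could be split between an SSD and spinning
disks. The --fs_wal_dir and --fs_data_dirs flags are the tablet server's
directory flags; the mount points and the tserver_flags helper are made up
for illustration, so substitute whatever your hosts actually use.

```python
# Hypothetical mount points (assumptions, not taken from any real cluster).
SSD_WAL_DIR = "/mnt/ssd0/kudu/wal"
HDD_DATA_DIRS = ["/mnt/disk1/kudu/data", "/mnt/disk2/kudu/data"]


def tserver_flags(wal_dir, data_dirs):
    """Return tablet server flag lines: one WAL directory, one or more data directories."""
    if not data_dirs:
        raise ValueError("at least one data directory is required")
    return [
        f"--fs_wal_dir={wal_dir}",
        # Data directories are passed as a single comma-separated list.
        "--fs_data_dirs=" + ",".join(data_dirs),
    ]


if __name__ == "__main__":
    # Print the lines so they can be pasted into a flag file.
    print("\n".join(tserver_flags(SSD_WAL_DIR, HDD_DATA_DIRS)))
```

The idea is simply that the single WAL directory gets the fast device while
the data directories can be spread over the remaining disks.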

Distributing data:
Disk partitioning isn't particularly relevant to how Kudu distributes data
to tservers. Kudu distributes tablets (i.e. chunks of a table, defined by
hash or range partitions) based on your partitioning schema
<https://kudu.apache.org/docs/schema_design.html> and replication factor.
If your table only has a single tablet and a replication factor of 1, there
will be a single chunk of data for that table in a single location. If your
schema specifies multiple tablets for your table, then there will be
multiple chunks of data for that table, each chunk in a single location
(although the chunks may be in different locations from one another). If
you have a replication factor >1, there will be multiple copies of these
chunks.
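
As a back-of-the-envelope illustration of the counting above, here is a toy
sketch. The round-robin placement is purely an assumption made to keep the
example short; it is NOT how Kudu actually picks tservers, and the function
and server names are hypothetical.

```python
def replica_count(num_tablets, replication_factor):
    """Total chunks of data for a table: one per tablet per replica."""
    return num_tablets * replication_factor


def place_replicas(num_tablets, replication_factor, tservers):
    """Toy round-robin placement (an assumption, not Kudu's real policy).

    Returns a dict mapping tablet index -> list of tservers holding a replica.
    """
    if replication_factor > len(tservers):
        raise ValueError("need at least as many tservers as replicas")
    placement = {}
    for tablet in range(num_tablets):
        # Consecutive servers, so replicas of one tablet never share a server.
        placement[tablet] = [
            tservers[(tablet + r) % len(tservers)]
            for r in range(replication_factor)
        ]
    return placement


if __name__ == "__main__":
    servers = ["ts1", "ts2", "ts3"]  # hypothetical tserver names
    print(place_replicas(1, 1, servers))  # single tablet, single location
    print(place_replicas(4, 3, servers))  # 4 tablets, 3 copies of each
    print(replica_count(4, 3))
```

With one tablet and a replication factor of 1, the table lives on exactly
one tserver no matter how many tservers or disks you have, so a small table
only spreads out if its schema gives it more tablets.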

Hope this helped,

On Tue, Nov 21, 2017 at 4:17 PM, Sunil Parmar <sunilosunil@gmail.com> wrote:

> We are using CDH 5.12, with HDFS for our primary data storage and
> Impala for querying it. Our worker nodes host both the HDFS datanode and
> Impalad services. We're starting to move some of our data into KUDU and
> would like to understand the community's experience and recommendations
> on disk/machine allocation and the pros/cons of each:
> Install KUDU tablet server on each worker node vs separate machines
> Separate physical disks for KUDU tablet server on the same machine vs
> sharing the disks with data nodes
> SSD vs spinning disks
> Some more questions on a separate note, but kinda related to the POC:
> We have a small table as a first candidate for KUDU (a couple of GB
> before replication). Does KUDU try to distribute data across tablet
> servers for each table, i.e. would we see slow performance from data
> that is spread too sparsely? In other words, for a small table, which is
> better: fewer disk partitions (host-partition) or data evenly
> distributed across worker nodes?
> Thanks,
> Sunil Parmar

Andrew Wong
