kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Burkert <...@cloudera.com>
Subject Re: Performance Question
Date Wed, 15 Jun 2016 17:13:09 GMT
Adding partition splits when range partitioning is done via the
You can find more about the different partitioning options in the schema
design guide <http://getkudu.io/docs/schema_design.html#data-distribution>.
We generally recommend sticking to hash partitioning if possible, since you
don't have to determine your own split rows.

- Dan

On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuild11@gmail.com> wrote:

> Todd,
> I think the locality is not within our setup. We have the compute cluster
> with Spark, YARN, etc. on its own, and we have the storage cluster with
> HBase, Kudu, etc. on another. We beefed up the hardware specs on the
> compute cluster and beefed up storage capacity on the storage cluster. We
> got this setup idea from the Databricks folks. I do have a question. I
> created the table to use range partition on columns. I see that if I use
> hash partition I can set the number of splits, but how do I do that using
> range (50 nodes * 10 = 500 splits)?
> Thanks,
> Ben
> On Jun 15, 2016, at 9:11 AM, Todd Lipcon <todd@cloudera.com> wrote:
> Awesome use case. One thing to keep in mind is that spark parallelism will
> be limited by the number of tablets. So, you might want to split into 10 or
> so buckets per node to get the best query throughput.
> Usually if you run top on some machines while running the query you can
> see if it is fully utilizing the cores.
> Another known issue right now is that spark locality isn't working
> properly on replicated tables so you will use a lot of network traffic. For
> a perf test you might want to try a table with replication count 1
> On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
> Hi Todd,
> I did a simple test of our ad events. We stream using Spark Streaming
> directly into HBase, and the Data Analysts/Scientists do some
> insight/discovery work plus some reports generation. For the reports, we
> use SQL, and the more deeper stuff, we use Spark. In Spark, our main data
> currency store of choice is DataFrames.
> The schema is around 83 columns wide where most are of the string data
> type.
> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip",
> "user_id", "mappable_id",
> "cookie_status", "profile_status", "user_status", "previous_timestamp",
> "user_agent", "referer",
> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id",
> "creative_id",
> "location_id", “pcamp_id",
> "pdomain_id", "continent_code", "country", "region", "dma", "city", "zip",
> "isp", "line_speed",
> "gender", "year_of_birth", "behaviors_read", "behaviors_written",
> "key_value_pairs", "acamp_candidates",
> "tag_format", "optimizer_name", "optimizer_version", "optimizer_ip",
> "pixel_id", “video_id",
> "video_network_id", "video_time_watched", "video_percentage_watched",
> "video_media_type",
> "video_player_iframed", "video_player_in_view", "video_player_width",
> "video_player_height",
> "conversion_valid_sale", "conversion_sale_amount",
> "conversion_commission_amount", "conversion_step",
> "conversion_currency", "conversion_attribution", "conversion_offer_id",
> "custom_info", "frequency",
> "recency_seconds", "cost", "revenue", “optimizer_acamp_id",
> "optimizer_creative_id", "optimizer_ecpm", "impression_id",
> "diagnostic_data",
> "user_profile_mapping_source", "latitude", "longitude", "area_code",
> "gmt_offset", "in_dst",
> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires",
> "timestamp_iso", "reference_id",
> "identity_organization", "identity_method"
> Most queries are like counts of how many users use what browser, how many
> are unique users, etc. The part that scares most users is when it comes to
> joining this data with other dimension/3rd party events tables because of
> shear size of it.
> We do what most companies do, similar to what I saw in earlier
> presentations of Kudu. We dump data out of HBase into partitioned Parquet
> tables to make query performance manageable.
> I will coordinate with a data scientist today to do some tests. He is
> working on identity matching/record linking of users from 2 domains: US and
> Singapore, using probabilistic deduping algorithms. I will load the data
> from ad events from both countries, and let him run his process against
> this data in Kudu. I hope this will “wow” the team.
> Thanks,
> Ben
> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <todd@cloudera.com> wrote:
> Hi Benjamin,
> What workload are you using for benchmarks? Using spark or something more
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and
> some queries
> Todd
> Todd
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>> Hi Todd,
>> Now that Kudu 0.9.0 is out. I have done some tests. Already, I am
>> impressed. Compared to HBase, read and write performance are better. Write
>> performance has the greatest improvement (> 4x), while read is > 1.5x.
>> Albeit, these are only preliminary tests. Do you know of a way to really do
>> some conclusive tests? I want to see if I can match your results on my 50
>> node cluster.
>> Thanks,
>> Ben
>> On May 30, 2016, at 10:33 AM, Todd Lipcon <todd@cloudera.com> wrote:
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuild11@gmail.com> wrote:
>>> Todd,
>>> It sounds like Kudu can possibly top or match those numbers put out by
>>> Aerospike. Do you have any performance statistics published or any
>>> instructions as to measure them myself as good way to test? In addition,
>>> this will be a test using Spark, so should I wait for Kudu version 0.9.0
>>> where support will be built in?
>> We don't have a lot of benchmarks published yet, especially on the write
>> side. I've found that thorough cross-system benchmarks are very difficult
>> to do fairly and accurately, and often times users end up misguided if they
>> pay too much attention to them :) So, given a finite number of developers
>> working on Kudu, I think we've tended to spend more time on the project
>> itself and less time focusing on "competition". I'm sure there are use
>> cases where Kudu will beat out Aerospike, and probably use cases where
>> Aerospike will beat Kudu as well.
>> From my perspective, it would be great if you can share some details of
>> your workload, especially if there are some areas you're finding Kudu
>> lacking. Maybe we can spot some easy code changes we could make to improve
>> performance, or suggest a tuning variable you could change.
>> -Todd
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuild11@gmail.com>
>>> wrote:
>>>> Hi Mike,
>>>> First of all, thanks for the link. It looks like an interesting read. I
>>>> checked that Aerospike is currently at version, and in the article,
>>>> they are evaluating version 3.5.4. The main thing that impressed me was
>>>> their claim that they can beat Cassandra and HBase by 8x for writing and
>>>> 25x for reading. Their big claim to fame is that Aerospike can write 1M
>>>> records per second with only 50 nodes. I wanted to see if this is real.
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well,
>>> depending on the size of your records and the insertion order. I've been
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do
>>> better.
>>>> To answer your questions, we have a DMP with user profiles with many
>>>> attributes. We create segmentation information off of these attributes to
>>>> classify them. Then, we can target advertising appropriately for our sales
>>>> department. Much of the data processing is for applying models on all or
>>>> not most of every profile’s attributes to find similarities (nearest
>>>> neighbor/clustering) over a large number of rows when batch processing or
>>>> small subset of rows for quick online scoring. So, our use case is a
>>>> typical advanced analytics scenario. We have tried HBase, but it doesn’t
>>>> work well for these types of analytics.
>>>> I read, that Aerospike in the release notes, they did do many
>>>> improvements for batch and scan operations.
>>>> I wonder what your thoughts are for using Kudu for this.
>>> Sounds like a good Kudu use case to me. I've heard great things about
>>> Aerospike for the low latency random access portion, but I've also heard
>>> that it's _very_ expensive, and not particularly suited to the columnar
>>> scan workload. Lastly, I think the Apache license of Kudu is much more
>>> appealing than the AGPL3 used by Aerospike. But, that's not really a direct
>>> answer to the performance question :)
>>>> Thanks,
>>>> Ben
>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpercy@cloudera.com> wrote:
>>>> Have you considered whether you have a scan heavy or a random access
>>>> heavy workload? Have you considered whether you always access / update a
>>>> whole row vs only a partial row? Kudu is a column store so has some
>>>> awesome performance characteristics when you are doing a lot of scanning
>>>> just a couple of columns.
>>>> I don't know the answer to your question but if your concern is
>>>> performance then I would be interested in seeing comparisons from a perf
>>>> perspective on certain workloads.
>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test:
>>>> https://aphyr.com/posts/324-jepsen-aerospike
>>>> I wonder if they have addressed any of those issues.
>>>> Mike
>>>> On Friday, May 27, 2016, Benjamin Kim <bbuild11@gmail.com> wrote:
>>>>> I am just curious. How will Kudu compare with Aerospike (
>>>>> http://www.aerospike.com)? I went to a Spark Roadshow and found out
>>>>> about this piece of software. It appears to fit our use case perfectly
>>>>> since we are an ad-tech company trying to leverage our user profiles
>>>>> Plus, it already has a Spark connector and has a SQL-like client. The
>>>>> tables can be accessed using Spark SQL DataFrames and, also, made into
>>>>> tables for direct use with Spark SQL ODBC/JDBC Thriftserver. I see from
>>>>> work done here http://gerrit.cloudera.org:8080/#/c/2992/ that the
>>>>> Spark integration is well underway and, from the looks of it lately,
>>>>> complete. I would prefer to use Kudu since we are already a Cloudera
>>>>> and Kudu is easy to deploy and configure using Cloudera Manager. I also
>>>>> hope that some of Aerospike’s speed optimization techniques can make
>>>>> into Kudu in the future, if they have not been already thought of or
>>>>> included.
>>>>> Just some thoughts…
>>>>> Cheers,
>>>>> Ben
>>>> --
>>>> --
>>>> Mike Percy
>>>> Software Engineer, Cloudera
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

View raw message