Thanks very much Todd, perfectly clear on both counts. 

Yeah, as a convention we will only be exposing views to analysts, report writers, and BI tools (for several reasons), so storing the values as long in the underlying tables will only be a concern for pipeline developers.

-m

On Fri, Jan 5, 2018 at 3:23 PM, Todd Lipcon <todd@cloudera.com> wrote:
Hey Mauricio,

Answers inline below

On Fri, Jan 5, 2018 at 2:50 PM, Mauricio Aristizabal <mauricio@impactradius.com> wrote:
Todd, since you bring it up in this thread... what CDH version do you expect DECIMAL support to make it into? I recently asked Icaro Vazquez about it but still no news. We're hoping it makes it into 5.14; otherwise, according to the roadmap, there might not be another minor release and we'd be waiting until summer for CDH 6.

As this is an open source project mailing list, it would be inappropriate for me to comment on a vendor's release schedule. Please note that Kudu is a product of the Apache Software Foundation and the ASF doesn't have any influence on or knowledge of Cloudera's release plans.

Of course it happens that I and many other contributors are also employees of Cloudera, but we participate in the ASF as individuals and not representatives of our employer, and so generally won't comment on questions like this in this forum. Please refer to Cloudera's forums for questions about CDH release plans, etc.
 

And just in case we're forced to make do without DECIMAL initially, is the recommendation really to store as string and convert? I was thinking of storing as int/long and dividing by 10 or 1000 as needed in an Impala view over the Kudu table. Wouldn't a division be way more performant than a conversion from string, especially when aggregating over thousands of records in a report query?

You're right -- using an integer type and division by a power of 10 is going to be much faster than casting from a string.  Division by a constant would be JITted by Impala into a pretty minimal sequence of assembly instructions (two bitshifts, an integer multiplication, and a subtraction) which likely take about 6 cycles total. In contrast, a cast from string to decimal probably takes many thousands of cycles.
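To make the instruction sequence mentioned above concrete, here is a minimal Python sketch of the general compiler technique for dividing a signed 32-bit integer by the constant 10 without a divide instruction. It illustrates the idea only; it is not Impala's actual generated code, and the names `MAGIC_DIV10` and `div10_trunc` are made up for this example.

```python
# Sketch of division by a constant via multiply + shifts + subtraction.
# Assumption: 32-bit signed input, divisor fixed at 10.
MAGIC_DIV10 = 0x66666667  # ceil(2**34 / 10), the usual "magic number" for /10


def div10_trunc(x: int) -> int:
    """Divide a signed 32-bit integer by 10, truncating toward zero."""
    if not -2**31 <= x < 2**31:
        raise ValueError("x must fit in a signed 32-bit integer")
    product = x * MAGIC_DIV10   # one integer multiplication
    quotient = product >> 34    # first shift: arithmetic, rounds toward -infinity
    sign = x >> 31              # second shift: -1 for negative x, 0 otherwise
    return quotient - sign      # subtraction corrects negatives toward zero


if __name__ == "__main__":
    for value in (12345, -12345, 9, -1):
        print(value, "->", div10_trunc(value))
```

A string-to-decimal cast, by comparison, has to parse each character, handle sign and decimal point, and validate the input, which is why it costs so much more per row.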

The only downside is that if you have end users using the data, they might be confused by the integer representation, whereas a string representation would be a little clearer.
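The scaled-integer approach discussed here can be sketched in Python as follows. This is an illustration under stated assumptions, not code from the thread: the scale of 2 fractional digits and the helper names `to_stored` and `from_stored` are hypothetical. The point is that aggregation runs on plain integers and the conversion to a readable decimal happens once, at the presentation layer (the role a view would play).

```python
from decimal import Decimal

SCALE = 2  # hypothetical: values carry two fractional digits


def to_stored(value: str, scale: int = SCALE) -> int:
    """Convert a decimal string such as '19.99' to its scaled integer form."""
    scaled = Decimal(value) * (10 ** scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value!r} has more than {scale} fractional digits")
    return int(scaled)


def from_stored(stored: int, scale: int = SCALE) -> Decimal:
    """Convert a scaled integer back to a Decimal for display."""
    return Decimal(stored).scaleb(-scale)


if __name__ == "__main__":
    stored = [to_stored(v) for v in ("19.99", "5.01", "2.50")]
    total = sum(stored)               # aggregate on integers only
    print(stored, total, from_stored(total))
```

Summing the integers first and converting once also avoids any per-row conversion cost in the aggregate.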

Thanks
-Todd
 

On Fri, Jan 5, 2018 at 11:13 AM, Todd Lipcon <todd@cloudera.com> wrote:
Oh, one other piece of feedback: maybe worth editing the title to say "vs Apache Parquet" instead of "vs Apache Impala" since in all cases you are using Impala as the query engine?

-Todd

On Fri, Jan 5, 2018 at 11:06 AM, Todd Lipcon <todd@cloudera.com> wrote:
Hey Boris,

Thanks for publishing this. It's a great look at how an end user evaluates Kudu. I appreciate that you cover both the pros and cons of the technology, and glad to see that your conclusion leaves you excited about Kudu :)

One quick note is that I think you'll be even more pleased when you upgrade to a later version (e.g., Kudu 1.5). We've improved performance in several areas and also improved scalability compared to the version you're testing. TIMESTAMP is also supported now, with DECIMAL soon to follow. It might be worth noting this as an addendum to the blog post if you feel like it.

-Todd

On Fri, Jan 5, 2018 at 10:51 AM, Boris Tyukin <boris@boristyukin.com> wrote:
Hi guys,

We just finished testing Kudu, mostly comparing Kudu to Impala on HDFS/Parquet. I wanted to share my blog post and results. We used typical (and real) healthcare data for the test, not synthetic data, which I think makes it a bit more interesting.

I welcome any feedback!

http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/

We are really impressed with Kudu and I wanted to take an opportunity to thank Kudu developers for such an amazing and much-needed product.

Boris





--
Todd Lipcon
Software Engineer, Cloudera






--
MAURICIO ARISTIZABAL
Architect - Business Intelligence + Data Science 
mauricio@impactradius.com(m)+1 323 309 4260 
223 E. De La Guerra St. | Santa Barbara, CA 93101



