kudu-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: Spark on Kudu
Date Sun, 10 Apr 2016 19:33:42 GMT
>
> Do they care about being able to insert into Kudu with SparkSQL


I care about insert into Kudu with Spark SQL.  I'm currently delaying a
refactoring of some Spark SQL-oriented insert functionality while trying to
evaluate what to expect from Kudu.  Whether Kudu does a good job supporting
inserts with Spark SQL will be a key consideration as to whether we adopt
Kudu.

On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcryans@apache.org>
wrote:

> Yup, starting to get a good idea.
>
> What are your DS folks looking for in terms of functionality related to
> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
> care about being able to insert into Kudu with SparkSQL or just being able to
> query real fast? Anything more specific to Spark that I'm missing?
>
> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all
> our resources are committed to making things happen in time, and a more
> fully featured Spark integration isn't in our plans during that period. I'm
> really hoping someone in the community will help with Spark, the same way
> we got a big contribution for the Flume sink.
>
> J-D
>
> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuild11@gmail.com> wrote:
>
>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, since
>> it’s not “production-ready”, upper management doesn’t want to fully deploy
>> it yet. They just want to keep an eye on it though. Kudu was so much
>> simpler and easier to use in every aspect compared to HBase. Impala was
>> great for the report writers and analysts to experiment with for the short
>> time it was up. But, once again, the only blocker was the lack of Spark
>> support for our Data Developers/Scientists. So, production-level data
>> population won’t happen until then.
>>
>> I hope this helps you get an idea where I am coming from…
>>
>> Cheers,
>> Ben
>>
>>
>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcryans@apache.org>
>> wrote:
>>
>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuild11@gmail.com>
>> wrote:
>>
>>> J-D,
>>>
>>> The main thing I hear is that Cassandra is being used as an updatable hot
>>> data store to ensure that duplicates are taken care of and idempotency is
>>> maintained. It was not clear whether data was retrieved directly from
>>> Cassandra for analytics, reports, or searches, or what its main use was.
>>> Some also just used it as a staging area to populate downstream tables in
>>> Parquet format. The last thing I heard was that CQL was terrible, so that
>>> rules out much use of direct queries against it.
>>>
>>
>> I'm no C* expert, but I don't think CQL is meant for real analytics, just
>> ease of use instead of plainly using the APIs. Even then, Kudu should beat
>> it easily on big scans. Same for HBase. We've done benchmarks against the
>> latter, not the former.
>>
>>
>>>
>>> As for our company, we have been looking for an updatable data store for
>>> a long time that can be quickly queried directly either using Spark SQL or
>>> Impala or some other SQL engine and still handle TB or PB of data without
>>> performance degradation and many configuration headaches. For now, we are
>>> using HBase to take on this role with Phoenix as a fast way to directly
>>> query the data. I can see Kudu as the best way to fill this gap easily,
>>> especially being the closest thing to other relational databases out there
>>> in familiarity for the many SQL analytics people in our company. The other
>>> alternative would be to go with AWS Redshift for the same reasons, but it
>>> would come at a cost, of course. If we went with either solution, Kudu or
>>> Redshift, it would get rid of the need to extract from HBase to parquet
>>> tables or export to PostgreSQL to support more of the SQL language used by
>>> analysts or the reporting software we use.
>>>
>>
>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu.
>> Have you folks tried Kudu with Impala yet with those use cases?
>>
>>
>>>
>>> I hope this helps.
>>>
>>
>> It does, thanks for the nice reply.
>>
>>
>>>
>>> Cheers,
>>> Ben
>>>
>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>> wrote:
>>>
>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to refer
>>> to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My colleagues who
>>> were also there did say that the hype around Spark isn't dying down.
>>>
>>> There's definitely an overlap in the use cases that Cassandra, HBase,
>>> and Kudu cater to. I wouldn't go as far as saying that C* is just an
>>> interim solution for the use case you describe.
>>>
>>> Nothing significant happened in Kudu over the past month, it's a storage
>>> engine so things move slowly *smile*. I'd love to see more contributions on
>>> the Spark front. I know there's code out there that could be integrated in
>>> kudu-spark, it just needs to land in gerrit. I'm sure folks will happily
>>> review it.
>>>
>>> Do you have relevant experiences you can share? I'd love to learn more
>>> about the use cases for which you envision using Kudu as a C* replacement.
>>>
>>> Thanks,
>>>
>>> J-D
>>>
>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuild11@gmail.com>
>>> wrote:
>>>
>>>> Hi J-D,
>>>>
>>>> My colleagues recently came back from Strata in San Jose. They told me
>>>> that everything was about Spark and there is a big buzz about the SMACK
>>>> stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra
>>>> is just an interim solution as a low-latency, easily queried data store.
>>>> I was wondering if anything significant happened in regards to Kudu,
>>>> especially on the Spark front. Plus, can you come up with your own proposed
>>>> stack acronym to promote?
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>> wrote:
>>>>
>>>> Hi Ben,
>>>>
>>>> AFAIK no one in the dev community committed to any timeline. I know of
>>>> one person on the Kudu Slack who's working on a better RDD, but that's
>>>> about it.
>>>>
>>>> Regards,
>>>>
>>>> J-D
>>>>
>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <bkim@amobee.com> wrote:
>>>>
>>>>> Hi J-D,
>>>>>
>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a
>>>>> version of Kudu to begin real testing of Spark against it for our devs.
>>>>> At least, I can tell them what timeframe to anticipate.
>>>>>
>>>>> Just curious,
>>>>> *Benjamin Kim*
>>>>> *Data Solutions Architect*
>>>>>
>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>
>>>>> *Mobile: +1 818 635 2900*
>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |
>>>>> www.amobee.com
>>>>>
>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>> wrote:
>>>>>
>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed
>>>>> either.
>>>>>
>>>>> The kuduRDD is just leveraging the MR input format, ideally we'd use
>>>>> scans directly.
>>>>>
>>>>> The SparkSQL stuff is there but it doesn't do any sort of pushdown.
>>>>> It's really basic.
>>>>>
>>>>> The goal was to provide something for others to contribute to. We have
>>>>> some basic unit tests that others can easily extend. None of us on the
>>>>> team are Spark experts, but we'd be really happy to help anyone improve the
>>>>> kudu-spark code.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> J-D,
>>>>>>
>>>>>> It looks like it fulfills most of the basic requirements (kudu RDD,
>>>>>> kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark
>>>>>> SQL functionality (Dataframes) and doing the documentation, what more
>>>>>> needs to be done? Optimizations?
>>>>>>
>>>>>> I believe that it’s a good place to start using Spark with Kudu
>>>>>> and compare it to HBase with Spark (not clean).
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> AFAIK no one is working on it, but we did manage to get this in for
>>>>>> 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>
>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on Kudu,
>>>>>> but it will require a lot more work to make it fast/useful.
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I see this KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214>
>>>>>>> targeted for 0.8.0, but I see no progress on it. When this is complete,
>>>>>>> will this mean that Spark will be able to work with Kudu both
>>>>>>> programmatically and as a client via Spark SQL? Or is there more work
>>>>>>> that needs to be done on the Spark side for it to work?
>>>>>>>
>>>>>>> Just curious.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
