kudu-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Spark on Kudu
Date Mon, 11 Apr 2016 18:22:35 GMT
You guys make a convincing point, although on the upsert side we'll need
more support from the servers. Right now all you can do is an INSERT and
then, if you get a duplicate key error, do an UPDATE. I guess we could at
least add an API on the client side that would manage it, but it wouldn't
be atomic.
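
For illustration, here's a rough sketch of what that client-side fallback
could look like from Scala with the Java client (the table, columns, and
connection string are made up, and I'm going from memory on the exact
package and method names):

// Try an INSERT first; if it fails, assume a duplicate key and fall back to
// an UPDATE. Not atomic: another writer can slip in between the two calls.
import org.kududb.client.KuduClient

val client  = new KuduClient.KuduClientBuilder("kudu-master-host:7051").build()
val table   = client.openTable("events")   // hypothetical table
val session = client.newSession()          // default AUTO_FLUSH_SYNC

val insert = table.newInsert()
insert.getRow.addString("id", "row-1")
insert.getRow.addLong("value", 42L)
val response = session.apply(insert)

if (response.hasRowError) {
  // Real code should check that the row error really is "already present".
  val update = table.newUpdate()
  update.getRow.addString("id", "row-1")
  update.getRow.addLong("value", 42L)
  session.apply(update)
}

session.close()
client.shutdown()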

J-D

On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <mark@clearstorydata.com>
wrote:

> It's pretty simple, actually.  I need to support versioned datasets in a
> Spark SQL environment.  Instead of a hack on top of a Parquet data store,
> I'm hoping (among other reasons) to be able to use Kudu's write and
> timestamp-based read operations to support not only appending data, but
> also updating existing data, and even some schema migration.  The most
> typical use case is a dataset that is updated periodically (e.g., weekly or
> monthly) in which the preliminary data in the previous window (week or
> month) is updated with values that are expected to remain unchanged from
> then on, and a new set of preliminary values for the current window needs to
> be added/appended.
>
> Using Kudu's Java API and developing additional functionality on top of
> what Kudu has to offer isn't too much to ask, but the ease of integration
> with Spark SQL will gate how quickly we would move to using Kudu and how
> seriously we'd look at alternatives before making that decision.
>
> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcryans@apache.org>
> wrote:
>
>> Mark,
>>
>> Thanks for taking some time to reply in this thread, glad it caught the
>> attention of other folks!
>>
>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <mark@clearstorydata.com>
>> wrote:
>>
>>> Do they care about being able to insert into Kudu with SparkSQL
>>>
>>>
>>> I care about inserting into Kudu with Spark SQL.  I'm currently delaying a
>>> refactoring of some Spark SQL-oriented insert functionality while trying to
>>> evaluate what to expect from Kudu.  Whether Kudu does a good job supporting
>>> inserts with Spark SQL will be a key consideration as to whether we adopt
>>> Kudu.
>>>
>>
>> I'd like to know more about why SparkSQL inserts are necessary for you. Is
>> it just that you currently do it that way into some database or parquet, so
>> with minimal refactoring you'd be able to use Kudu? Would re-writing those
>> SQL lines into Scala and directly using the Java API's KuduSession be too
>> much work?
>>
>> Additionally, what do you expect to gain from using Kudu vs. your current
>> solution? If it's not completely clear, I'd love to help you think through
>> it.
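>>
>> Just to make that concrete, I'm picturing something along these lines
>> instead of a SQL INSERT (only a sketch, with a made-up table and columns,
>> and I'm going from memory on the class names):
>>
>> // Write the rows of an RDD[(String, Long)] straight to Kudu with
>> // KuduSession, one client per partition, instead of going through SQL.
>> import org.kududb.client.KuduClient
>>
>> rdd.foreachPartition { rows =>
>>   val client  = new KuduClient.KuduClientBuilder("kudu-master-host:7051").build()
>>   val table   = client.openTable("events")
>>   val session = client.newSession()
>>   rows.foreach { case (id, value) =>
>>     val insert = table.newInsert()
>>     insert.getRow.addString("id", id)
>>     insert.getRow.addLong("value", value)
>>     session.apply(insert)
>>   }
>>   session.close()
>>   client.shutdown()
>> }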
>>
>>
>>>
>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <
>>> jdcryans@apache.org> wrote:
>>>
>>>> Yup, starting to get a good idea.
>>>>
>>>> What are your DS folks looking for in terms of functionality related to
>>>> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
>>>> care about being able to insert into Kudu with SparkSQL or just being able
>>>> to query real fast? Anything more specific to Spark that I'm missing?
>>>>
>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera
>>>> all our resources are committed to making things happen in time, and a more
>>>> fully featured Spark integration isn't in our plans during that period. I'm
>>>> really hoping someone in the community will help with Spark, the same way
>>>> we got a big contribution for the Flume sink.
>>>>
>>>> J-D
>>>>
>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuild11@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, since
>>>>> it’s not “production-ready”, upper management doesn’t want to fully
>>>>> deploy it yet. They just want to keep an eye on it though. Kudu was so
>>>>> much simpler and easier to use in every aspect compared to HBase. Impala
>>>>> was great for the report writers and analysts to experiment with for the
>>>>> short time it was up. But, once again, the only blocker was the lack of
>>>>> Spark support for our Data Developers/Scientists. So, production-level
>>>>> data population won’t happen until then.
>>>>>
>>>>> I hope this helps you get an idea where I am coming from…
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>> wrote:
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuild11@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> J-D,
>>>>>>
>>>>>> The main thing I hear is that Cassandra is being used as an updatable
>>>>>> hot data store to ensure that duplicates are taken care of and
>>>>>> idempotency is maintained. Whether data was directly retrieved from
>>>>>> Cassandra for analytics, reports, or searches, it was not clear what its
>>>>>> main use was. Some also just used it for a staging area to populate
>>>>>> downstream tables in parquet format. The last thing I heard was that CQL
>>>>>> was terrible, so that rules out much use of direct queries against it.
>>>>>>
>>>>>
>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics,
>>>>> just ease of use instead of plainly using the APIs. Even then, Kudu should
>>>>> beat it easily on big scans. Same for HBase. We've done benchmarks against
>>>>> the latter, not the former.
>>>>>
>>>>>
>>>>>>
>>>>>> As for our company, we have been looking for an updatable data store
>>>>>> for a long time that can be quickly queried directly either using Spark
>>>>>> SQL or Impala or some other SQL engine and still handle TB or PB of data
>>>>>> without performance degradation or many configuration headaches. For
>>>>>> now, we are using HBase to take on this role with Phoenix as a fast way
>>>>>> to directly query the data. I can see Kudu as the best way to fill this
>>>>>> gap easily, especially being the closest thing to other relational
>>>>>> databases out there in familiarity for the many SQL analytics people in
>>>>>> our company. The other alternative would be to go with AWS Redshift for
>>>>>> the same reasons, but it would come at a cost, of course. If we went
>>>>>> with either solution, Kudu or Redshift, it would get rid of the need to
>>>>>> extract from HBase to parquet tables or export to PostgreSQL to support
>>>>>> more of the SQL language used by analysts or the reporting software we
>>>>>> use.
>>>>>>
>>>>>
>>>>> Ok, the usual then *smile*. Looks like we're not too far off with
>>>>> Kudu. Have you folks tried Kudu with Impala for those use cases yet?
>>>>>
>>>>>
>>>>>>
>>>>>> I hope this helps.
>>>>>>
>>>>>
>>>>> It does, thanks for the nice reply.
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to refer
>>>>>> to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My colleagues
>>>>>> who were also there did say that the hype around Spark isn't dying down.
>>>>>>
>>>>>> There's definitely an overlap in the use cases that Cassandra, HBase,
>>>>>> and Kudu cater to. I wouldn't go as far as saying that C* is just an
>>>>>> interim solution for the use case you describe.
>>>>>>
>>>>>> Nothing significant happened in Kudu over the past month, it's a storage
>>>>>> engine so things move slowly *smile*. I'd love to see more contributions
>>>>>> on the Spark front. I know there's code out there that could be
>>>>>> integrated in kudu-spark, it just needs to land in gerrit. I'm sure
>>>>>> folks will happily review it.
>>>>>>
>>>>>> Do you have relevant experiences you can share? I'd love to learn
>>>>>> more about the use cases for which you envision using Kudu as a C*
>>>>>> replacement.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi J-D,
>>>>>>>
>>>>>>> My colleagues recently came back from Strata in San Jose. They told me
>>>>>>> that everything was about Spark and there is a big buzz about the SMACK
>>>>>>> stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that
>>>>>>> Cassandra is just an interim solution as a low-latency, easily queried
>>>>>>> data store. I was wondering if anything significant happened in regards
>>>>>>> to Kudu, especially on the Spark front. Plus, can you come up with your
>>>>>>> own proposed stack acronym to promote?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>>
>>>>>>>
>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Ben,
>>>>>>>
>>>>>>> AFAIK no one in the dev community committed to any timeline. I know of
>>>>>>> one person on the Kudu Slack who's working on a better RDD, but that's
>>>>>>> about it.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <bkim@amobee.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi J-D,
>>>>>>>>
>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a
>>>>>>>> version of Kudu to begin real testing of Spark against it for our
>>>>>>>> devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>
>>>>>>>> Just curious,
>>>>>>>> *Benjamin Kim*
>>>>>>>> *Data Solutions Architect*
>>>>>>>>
>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>
>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  www.amobee.com
>>>>>>>>
>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <
>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>
>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed
>>>>>>>> either.
>>>>>>>>
>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use
>>>>>>>> scans directly.
>>>>>>>>
>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of pushdown.
>>>>>>>> It's really basic.
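>>>>>>>>
>>>>>>>> To give an idea, using it looks roughly like this from the Spark
>>>>>>>> shell (just a sketch; the data source name, option keys, and table
>>>>>>>> are my shorthand here, not necessarily what actually shipped):
>>>>>>>>
>>>>>>>> // Load a Kudu table as a DataFrame and query it with Spark SQL.
>>>>>>>> val df = sqlContext.read
>>>>>>>>   .format("org.kududb.spark.kudu")                  // assumed data source name
>>>>>>>>   .option("kudu.master", "kudu-master-host:7051")   // assumed option key
>>>>>>>>   .option("kudu.table", "events")                   // hypothetical table
>>>>>>>>   .load()
>>>>>>>> df.registerTempTable("events")
>>>>>>>> // No predicate pushdown yet, so the filter below runs on the Spark side.
>>>>>>>> sqlContext.sql("SELECT id, value FROM events WHERE value > 10").show()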
>>>>>>>>
>>>>>>>> The goal was to provide something for others to contribute to. We
>>>>>>>> have some basic unit tests that others can easily extend. None of us
>>>>>>>> on the team are Spark experts, but we'd be really happy to assist
>>>>>>>> anyone improving the kudu-spark code.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> J-D,
>>>>>>>>>
>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu RDD,
>>>>>>>>> kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark
>>>>>>>>> SQL functionality (Dataframes) and doing the documentation, what more
>>>>>>>>> needs to be done? Optimizations?
>>>>>>>>>
>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu and
>>>>>>>>> compare it to HBase with Spark (not clean).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in for
>>>>>>>>> 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>
>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on Kudu,
>>>>>>>>> but it will require a lot more work to make it fast/useful.
>>>>>>>>>
>>>>>>>>> Hope this helps,
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for 0.8.0,
>>>>>>>>>> but I see no progress on it. When this is complete, will this mean
>>>>>>>>>> that Spark will be able to work with Kudu both programmatically and
>>>>>>>>>> as a client via Spark SQL? Or is there more work that needs to be
>>>>>>>>>> done on the Spark side for it to work?
>>>>>>>>>>
>>>>>>>>>> Just curious.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
