kudu-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Spark on Kudu
Date Tue, 14 Jun 2016 22:00:56 GMT
It's only in Cloudera's maven repo:
https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/

J-D

On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuild11@gmail.com> wrote:

> Hi J-D,
>
> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for
> spark-shell to use. Can you show me where to find it?
>
> Thanks,
> Ben
>
>
> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcryans@apache.org>
> wrote:
>
> What's in this doc is what's gonna get released:
> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>
> J-D
>
> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>
>> Will this be documented with examples once 0.9.0 comes out?
>>
>> Thanks,
>> Ben
>>
>>
>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>> wrote:
>>
>> It will be in 0.9.0.
>>
>> J-D
>>
>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuild11@gmail.com> wrote:
>>
>>> Hi Chris,
>>>
>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On May 18, 2016, at 9:01 AM, Chris George <Christopher.George@rms.com>
>>> wrote:
>>>
>>> There is some code in review that needs some more refinement.
>>> It will allow upsert/insert from a dataframe using the datasource api.
>>> It will also allow the creation and deletion of tables from a dataframe
>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>
>>> Example usages will look something like:
>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>
>>> -Chris George
>>>
>>>
>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>
>>> Can someone tell me what the state is of this Spark work?
>>>
>>> Also, does anyone have any sample code on how to update/insert data in
>>> Kudu using DataFrames?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Apr 13, 2016, at 8:22 AM, Chris George <Christopher.George@rms.com>
>>> wrote:
>>>
>>> SparkSQL cannot support this type of statement, but we may be able to
>>> implement similar functionality through the api.
>>> -Chris
>>>
>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>
>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if
>>> it were to be implemented.
>>>
>>> MERGE INTO table_name USING table_reference ON (condition)
>>>  WHEN MATCHED THEN
>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>  WHEN NOT MATCHED THEN
>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>>>
>>> Cheers,
>>> Ben
>>>
>>> On Apr 11, 2016, at 12:21 PM, Chris George <Christopher.George@rms.com>
>>> wrote:
>>>
>>> I have a wip kuduRDD that I made a few months ago. I pushed it into
>>> gerrit if you want to take a look.
>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>> It does pushdown predicates which the existing input formatter based rdd
>>> does not.
>>>
>>> Within the next two weeks I’m planning to implement a datasource for
>>> spark that will have pushdown predicates and insertion/update functionality
>>> (need to look more at cassandra and the hbase datasource for the best way to
>>> do this). I agree that server side upsert would be helpful.
>>> Having a datasource would give us useful data frames and also make spark
>>> sql usable for kudu.
>>>
>>> My reasoning for having a spark datasource and not using Impala is:
>>> 1. We have had trouble getting impala to run fast with high concurrency
>>>    when compared to spark.
>>> 2. We interact with datasources which do not integrate with impala.
>>> 3. We have custom sql query planners for extended sql functionality.
>>>
>>> -Chris George
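
To make the pushdown point above concrete, here is a minimal sketch in Python. It is only an illustration: `ToyTable`, `scan`, and `rows_returned` are hypothetical names invented here, the table lives in memory, and nothing below is the kudu-spark or Kudu client API. It compares how many rows cross the storage boundary when the filter is applied after the scan versus when the predicate travels with the scan.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]


class ToyTable:
    """In-memory stand-in for a storage engine; NOT the Kudu client API."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows: List[Row] = list(rows)
        # Counts rows handed from "storage" to the query engine.
        self.rows_returned = 0

    def scan(self, predicate: Callable[[Row], bool] = lambda row: True) -> Iterator[Row]:
        """Yield rows that match the predicate, evaluated on the storage side."""
        for row in self._rows:
            if predicate(row):
                self.rows_returned += 1
                yield row


def wanted(row: Row) -> bool:
    """Hypothetical query filter: keep only the ten highest keys."""
    return row["key"] >= 990


# Without pushdown: the engine pulls every row, then filters on its own side.
no_pushdown = ToyTable({"key": k, "value": k * 2} for k in range(1000))
filtered_late = [row for row in no_pushdown.scan() if wanted(row)]

# With pushdown: the predicate is part of the scan, so only matches come back.
pushdown = ToyTable({"key": k, "value": k * 2} for k in range(1000))
filtered_early = list(pushdown.scan(wanted))

print(no_pushdown.rows_returned, pushdown.rows_returned)
```

Both variants produce the same result set; the difference is only in how many rows have to be shipped to the engine before the filter is applied.
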
>>>
>>>
>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcryans@apache.org> wrote:
>>>
>>> You guys make a convincing point, although on the upsert side we'll need
>>> more support from the servers. Right now all you can do is an INSERT then,
>>> if you get a dup key, do an UPDATE. I guess we could at least add an API on
>>> the client side that would manage it, but it wouldn't be atomic.
>>>
>>> J-D
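
The INSERT-then-UPDATE fallback described above can be sketched as follows. This is a toy under stated assumptions: `ToyStore`, `DuplicateKeyError`, and `client_side_upsert` are hypothetical names made up for the illustration, and the store is a plain dictionary rather than a Kudu table or session.

```python
from typing import Dict


class DuplicateKeyError(Exception):
    """Raised by the toy store when an INSERT hits an existing primary key."""


class ToyStore:
    """In-memory stand-in for a table with a primary key; NOT the Kudu API."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}

    def insert(self, key: int, value: str) -> None:
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = value

    def update(self, key: int, value: str) -> None:
        if key not in self.rows:
            raise KeyError(key)
        self.rows[key] = value


def client_side_upsert(store: ToyStore, key: int, value: str) -> str:
    """Try INSERT first; on a duplicate key, fall back to UPDATE.

    The two calls are separate operations, so another writer could delete or
    change the row between them. That gap is why this is not atomic.
    """
    try:
        store.insert(key, value)
        return "inserted"
    except DuplicateKeyError:
        store.update(key, value)
        return "updated"


store = ToyStore()
first = client_side_upsert(store, 1, "preliminary")
second = client_side_upsert(store, 1, "final")
print(first, second, store.rows)
```

A server-side upsert would remove the window between the failed insert and the follow-up update, which a client-side wrapper like this one cannot do.
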
>>>
>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <mark@clearstorydata.com>
>>> wrote:
>>>
>>>> It's pretty simple, actually.  I need to support versioned datasets in
>>>> a Spark SQL environment.  Instead of a hack on top of a Parquet data store,
>>>> I'm hoping (among other reasons) to be able to use Kudu's write and
>>>> timestamp-based read operations to support not only appending data, but
>>>> also updating existing data, and even some schema migration.  The most
>>>> typical use case is a dataset that is updated periodically (e.g., weekly or
>>>> monthly) in which the preliminary data in the previous window (week or
>>>> month) is updated with values that are expected to remain unchanged from
>>>> then on, and a new set of preliminary values for the current window needs to
>>>> be added/appended.
>>>>
>>>> Using Kudu's Java API and developing additional functionality on top of
>>>> what Kudu has to offer isn't too much to ask, but the ease of integration
>>>> with Spark SQL will gate how quickly we would move to using Kudu and how
>>>> seriously we'd look at alternatives before making that decision.
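
The versioned-dataset use case above (preliminary values later replaced by final ones, plus timestamp-based reads) can be sketched with a toy multi-version table. Everything here is an assumption made for illustration: `VersionedTable`, `write`, and `read_as_of` are hypothetical names, timestamps are plain integers, and this is not how Kudu stores or exposes versions.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


class VersionedTable:
    """Toy multi-version table; a stand-in, NOT Kudu's storage format or API."""

    def __init__(self) -> None:
        # key -> list of (write_timestamp, value), kept in timestamp order
        self._versions: Dict[str, List[Tuple[int, float]]] = {}

    def write(self, key: str, timestamp: int, value: float) -> None:
        """Append a new version; used for both first inserts and later updates."""
        versions = self._versions.setdefault(key, [])
        if versions and timestamp <= versions[-1][0]:
            raise ValueError("timestamps must increase per key")
        versions.append((timestamp, value))

    def read_as_of(self, key: str, timestamp: int) -> Optional[float]:
        """Return the newest value written at or before the given timestamp."""
        versions = self._versions.get(key, [])
        idx = bisect_right([ts for ts, _ in versions], timestamp)
        return versions[idx - 1][1] if idx else None


table = VersionedTable()
# Load at timestamp 1: the first window arrives with preliminary data.
table.write("2016-W14", 1, 10.0)
# Load at timestamp 2: the previous window is finalized, a new one is appended.
table.write("2016-W14", 2, 12.5)
table.write("2016-W15", 2, 7.0)

before = table.read_as_of("2016-W14", 1)       # preliminary value
after = table.read_as_of("2016-W14", 2)        # finalized value
not_yet = table.read_as_of("2016-W15", 1)      # window did not exist yet
print(before, after, not_yet)
```

Reading as of an older timestamp reproduces what a report would have seen before the window was finalized, which is the property the versioned-dataset use case relies on.
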
>>>>
>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <
>>>> jdcryans@apache.org> wrote:
>>>>
>>>>> Mark,
>>>>>
>>>>> Thanks for taking some time to reply in this thread, glad it caught
>>>>> the attention of other folks!
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <
>>>>> mark@clearstorydata.com> wrote:
>>>>>
>>>>>> Do they care being able to insert into Kudu with SparkSQL
>>>>>>
>>>>>>
>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently delaying
>>>>>> a refactoring of some Spark SQL-oriented insert functionality while trying
>>>>>> to evaluate what to expect from Kudu.  Whether Kudu does a good job
>>>>>> supporting inserts with Spark SQL will be a key consideration as to whether
>>>>>> we adopt Kudu.
>>>>>>
>>>>>
>>>>> I'd like to know more about why SparkSQL inserts are necessary for you.
>>>>> Is it just that you currently do it that way into some database or parquet
>>>>> so with minimal refactoring you'd be able to use Kudu? Would re-writing
>>>>> those SQL lines into Scala and directly using the Java API's KuduSession be
>>>>> too much work?
>>>>>
>>>>> Additionally, what do you expect to gain from using Kudu vs. your
>>>>> current solution? If it's not completely clear, I'd love to help you think
>>>>> through it.
>>>>>
>>>>>
>>>>>>
>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <
>>>>>> jdcryans@apache.org> wrote:
>>>>>>
>>>>>>> Yup, starting to get a good idea.
>>>>>>>
>>>>>>> What are your DS folks looking for in terms of functionality related
>>>>>>> to Spark? A SparkSQL integration that's as fully featured as Impala's? Do
>>>>>>> they care being able to insert into Kudu with SparkSQL or just being able
>>>>>>> to query real fast? Anything more specific to Spark that I'm missing?
>>>>>>>
>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At
>>>>>>> Cloudera all our resources are committed to making things happen in time,
>>>>>>> and a more fully featured Spark integration isn't in our plans during that
>>>>>>> period. I'm really hoping someone in the community will help with Spark,
>>>>>>> the same way we got a big contribution for the Flume sink.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But,
>>>>>>>> since it’s not “production-ready”, upper management doesn’t want to fully
>>>>>>>> deploy it yet. They just want to keep an eye on it though. Kudu was so much
>>>>>>>> simpler and easier to use in every aspect compared to HBase. Impala was
>>>>>>>> great for the report writers and analysts to experiment with for the short
>>>>>>>> time it was up. But, once again, the only blocker was the lack of Spark
>>>>>>>> support for our Data Developers/Scientists. So, production-level data
>>>>>>>> population won’t happen until then.
>>>>>>>>
>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <
>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>
>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> J-D,
>>>>>>>>>
>>>>>>>>> The main thing I hear is that Cassandra is being used as an updatable
>>>>>>>>> hot data store to ensure that duplicates are taken care of and idempotency
>>>>>>>>> is maintained. Whether data was directly retrieved from Cassandra for
>>>>>>>>> analytics, reports, or searches, it was not clear as to what was its main
>>>>>>>>> use. Some also just used it for a staging area to populate downstream
>>>>>>>>> tables in parquet format. The last thing I heard was that CQL was terrible,
>>>>>>>>> so that rules out much use of direct queries against it.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real
>>>>>>>> analytics, just ease of use instead of plainly using the APIs. Even then,
>>>>>>>> Kudu should beat it easily on big scans. Same for HBase. We've done
>>>>>>>> benchmarks against the latter, not the former.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> As for our company, we have been looking for an updatable data
>>>>>>>>> store for a long time that can be quickly queried directly either using
>>>>>>>>> Spark SQL or Impala or some other SQL engine and still handle TB or PB of
>>>>>>>>> data without performance degradation and many configuration headaches. For
>>>>>>>>> now, we are using HBase to take on this role with Phoenix as a fast way to
>>>>>>>>> directly query the data. I can see Kudu as the best way to fill this gap
>>>>>>>>> easily, especially being the closest thing to other relational databases
>>>>>>>>> out there in familiarity for the many SQL analytics people in our company.
>>>>>>>>> The other alternative would be to go with AWS Redshift for the same
>>>>>>>>> reasons, but it would come at a cost, of course. If we went with either
>>>>>>>>> solution, Kudu or Redshift, it would get rid of the need to extract from
>>>>>>>>> HBase to parquet tables or export to PostgreSQL to support more of the SQL
>>>>>>>>> language used by analysts or the reporting software we use.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with
>>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I hope this helps.
>>>>>>>>>
>>>>>>>>
>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to
>>>>>>>>> refer to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My
>>>>>>>>> colleagues who were also there did say that the hype around Spark isn't
>>>>>>>>> dying down.
>>>>>>>>>
>>>>>>>>> There's definitely an overlap in the use cases that Cassandra,
>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that C* is just an
>>>>>>>>> interim solution for the use case you describe.
>>>>>>>>>
>>>>>>>>> Nothing significant happened in Kudu over the past month, it's a
>>>>>>>>> storage engine so things move slowly *smile*. I'd love to see more
>>>>>>>>> contributions on the Spark front. I know there's code out there that could
>>>>>>>>> be integrated in kudu-spark, it just needs to land in gerrit. I'm sure
>>>>>>>>> folks will happily review it.
>>>>>>>>>
>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn
>>>>>>>>> more about the use cases for which you envision using Kudu as a C*
>>>>>>>>> replacement.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi J-D,
>>>>>>>>>>
>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They
>>>>>>>>>> told me that everything was about Spark and there is a big buzz about the
>>>>>>>>>> SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that
>>>>>>>>>> Cassandra is just an interim solution as a low-latency, easily queried data
>>>>>>>>>> store. I was wondering if anything significant happened in regards to Kudu,
>>>>>>>>>> especially on the Spark front. Plus, can you come up with your own proposed
>>>>>>>>>> stack acronym to promote?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <
>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Ben,
>>>>>>>>>>
>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I
>>>>>>>>>> know of one person on the Kudu Slack who's working on a better RDD, but
>>>>>>>>>> that's about it.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <bkim@amobee.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>
>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target
>>>>>>>>>>> a version of Kudu to begin real testing of Spark against it for our devs.
>>>>>>>>>>> At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>>
>>>>>>>>>>> Just curious,
>>>>>>>>>>> *Benjamin Kim*
>>>>>>>>>>> *Data Solutions Architect*
>>>>>>>>>>>
>>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>>>>
>>>>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |
>>>>>>>>>>> www.amobee.com
>>>>>>>>>>>
>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <
>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's
>>>>>>>>>>> needed either.
>>>>>>>>>>>
>>>>>>>>>>> The kuduRDD is just leveraging the MR input format, ideally we'd
>>>>>>>>>>> use scans directly.
>>>>>>>>>>>
>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of
>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>
>>>>>>>>>>> The goal was to provide something for others to contribute to.
>>>>>>>>>>> We have some basic unit tests that others can easily extend. None of us on
>>>>>>>>>>> the team are Spark experts, but we'd be really happy to help anyone improve
>>>>>>>>>>> the kudu-spark code.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <
>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> J-D,
>>>>>>>>>>>>
>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu
>>>>>>>>>>>> RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark
>>>>>>>>>>>> SQL functionality (Dataframes) and doing the documentation, what more needs
>>>>>>>>>>>> to be done? Optimizations?
>>>>>>>>>>>>
>>>>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu
>>>>>>>>>>>> and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ben
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <
>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in
>>>>>>>>>>>> for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>
>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on
>>>>>>>>>>>> Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>>>
>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <
>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for
>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, will this mean
>>>>>>>>>>>>> that Spark will be able to work with Kudu both programmatically and as a
>>>>>>>>>>>>> client via Spark SQL? Or is there more work that needs to be done on the
>>>>>>>>>>>>> Spark side for it to work?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>
>>>>>>>>>>>>>
