kudu-user mailing list archives

From Jordan Birdsell <jordantbirds...@gmail.com>
Subject Re: Spark on Kudu
Date Tue, 20 Sep 2016 22:02:27 GMT
http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark

On Tue, Sep 20, 2016 at 5:00 PM Benjamin Kim <bbuild11@gmail.com> wrote:

> I see that the API has changed a bit so my old code doesn’t work anymore.
> Can someone direct me to some code samples?
>
> Thanks,
> Ben
>
>
> On Sep 20, 2016, at 1:44 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
> On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>
>> Now that Kudu 1.0.0 is officially out and ready for production use, where
>> do we find the spark connector jar for this release?
>>
>>
> It's available in the official ASF maven repository:
> https://repository.apache.org/#nexus-search;quick~kudu-spark
>
> <dependency>
>   <groupId>org.apache.kudu</groupId>
>   <artifactId>kudu-spark_2.10</artifactId>
>   <version>1.0.0</version>
> </dependency>
>
>
> -Todd
>
>
>
>> On Jun 17, 2016, at 11:08 AM, Dan Burkert <dan@cloudera.com> wrote:
>>
>> Hi Ben,
>>
>> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I
>> do not think we support that at this point.  I haven't looked deeply into
>> it, but we may hit issues specifying Kudu-specific options (partitioning,
>> column encoding, etc.).  Probably issues that can be worked through
>> eventually, though.  If you are interested in contributing to Kudu, this is
>> an area that could obviously use improvement!  Most or all of our Spark
>> features have been completely community driven to date.
>>
>>
>>> I am assuming that more Spark support along with semantic changes below
>>> will be incorporated into Kudu 0.9.1.
>>>
>>
>> As a rule we do not release new features in patch releases, but the good
>> news is that we are releasing regularly, and our next scheduled release is
>> for the August timeframe (see JD's roadmap
>> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>> email about what we are aiming to include).  Also, Cloudera does publish snapshot
>> versions of the Spark connector here
>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>,
>> so the jars are available if you don't mind using snapshots.
>>
>>
>>> Does anyone know of a better way to make unique primary keys, other than
>>> using a UUID to make every row unique, if there is no unique column (or
>>> combination thereof) to use?
>>>
>>
>> Not that I know of.  In general it's pretty rare to have a dataset
>> without a natural primary key (even if it's just all of the columns), but
>> in those cases UUID is a good solution.
>>
>>
>>> This is what I am using. I know auto incrementing is coming down the
>>> line (don’t know when), but is there a way to simulate this in Kudu using
>>> Spark out of curiosity?
>>>
>>
>> To my knowledge there is no plan to have auto increment in Kudu.
>> Distributed, consistent, auto incrementing counters is a difficult problem,
>> and I don't think there are any known solutions that would be fast enough
>> for Kudu (happy to be proven wrong, though!).
>>
>> - Dan
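
A minimal sketch of the UUID suggestion in the message above, in plain Scala with no Kudu or Spark dependency. The object and method names (`UuidKeys`, `randomKey`, `contentKey`) are hypothetical and not part of any Kudu API. It contrasts a random UUID with a name-based UUID derived from all of a row's column values, which is one way to read "even if it's just all of the columns":

```scala
import java.nio.charset.StandardCharsets
import java.util.UUID

object UuidKeys {
  // Random (version 4) UUID: unique per call, so re-loading the same
  // source row yields a new key and therefore a second copy of the row.
  def randomKey(): String = UUID.randomUUID().toString

  // Name-based (version 3) UUID computed from the row's column values:
  // identical values always give the same key, so a re-load hits the same row.
  // Assumption: the "|" separator does not occur inside the column values.
  def contentKey(columns: Seq[String]): String = {
    val joined = columns.mkString("|")
    UUID.nameUUIDFromBytes(joined.getBytes(StandardCharsets.UTF_8)).toString
  }
}
```

In a Spark job either function could presumably be wrapped in a UDF to add a key column to the DataFrame before writing; that wiring is left out here.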
>>
>>
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <dan@cloudera.com> wrote:
>>>
>>> I'm not sure exactly what the semantics will be, but at least one of
>>> them will be upsert.  These modes come from spark, and they were really
>>> designed for file-backed storage and not table storage.  We may want to do
>>> append = upsert, and overwrite = truncate + insert.  I think that may match
>>> the normal spark semantics more closely.
>>>
>>> - Dan
>>>
>>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuild11@gmail.com>
>>> wrote:
>>>
>>>> Dan,
>>>>
>>>> Thanks for the information. That would mean both “append” and
>>>> “overwrite” modes would be combined or not needed in the future.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <dan@cloudera.com> wrote:
>>>>
>>>> Right now append uses an update Kudu operation, which requires the row
>>>> already be present in the table. Overwrite maps to insert.  Kudu very
>>>> recently got upsert support baked in, but it hasn't yet been integrated
>>>> into the Spark connector.  So pretty soon these sharp edges will get a lot
>>>> better, since upsert is the way to go for most spark workloads.
>>>>
>>>> - Dan
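
To make the difference between the three operations concrete, here is a toy in-memory model in plain Scala. It is only a sketch: `ToyTable` is a hypothetical stand-in and not the Kudu client, and the "already present" message is illustrative (only the "key not found" text comes from the error quoted further down the thread).

```scala
import scala.collection.mutable

// Toy stand-in for a table: primary key -> value.
final class ToyTable {
  private val rows = mutable.Map.empty[String, String]

  // INSERT is rejected when the key already exists.
  def insert(key: String, value: String): Either[String, Unit] =
    if (rows.contains(key)) Left("key already present")
    else { rows(key) = value; Right(()) }

  // UPDATE is rejected when the key is missing.
  def update(key: String, value: String): Either[String, Unit] =
    if (!rows.contains(key)) Left("Not found: key not found")
    else { rows(key) = value; Right(()) }

  // UPSERT succeeds either way: it inserts or overwrites.
  def upsert(key: String, value: String): Either[String, Unit] = {
    rows(key) = value
    Right(())
  }

  def get(key: String): Option[String] = rows.get(key)
}
```

A write mode backed by update fails on an empty table, one backed by insert fails on re-loaded rows, and upsert avoids both cases, which is why it suits most load jobs.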
>>>>
>>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuild11@gmail.com>
>>>> wrote:
>>>>
>>>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows
>>>>> in 64s. I would assume that now I can use the “overwrite” mode on existing
>>>>> data. Now, I have to find answers to these questions. What would happen if
>>>>> I “append” to the data in the Kudu table if the data already exists? What
>>>>> would happen if I “overwrite” existing data when the DataFrame has data in
>>>>> it that does not exist in the Kudu table? I need to evaluate the best way
>>>>> to simulate the UPSERT behavior in HBase because this is what our use case
>>>>> is.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>>
>>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Now, I’m getting this error when trying to write to the table.
>>>>>
>>>>> import scala.collection.JavaConverters._
>>>>> val key_seq = Seq("my_id")
>>>>> val key_list = List("my_id").asJava
>>>>> kuduContext.createTable(tableName, df.schema, key_seq,
>>>>>   new CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>>>
>>>>> df.write
>>>>>     .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>>>     .mode("overwrite")
>>>>>     .kudu
>>>>>
>>>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame
>>>>> to Kudu; sample errors: Not found: key not found (error 0)Not found: key
>>>>> not found (error 0)Not found: key not found (error 0)Not found: key not
>>>>> found (error 0)Not found: key not found (error 0)
>>>>>
>>>>> Does the key field need to be first in the DataFrame?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <dan@cloudera.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dan,
>>>>>>
>>>>>> Thanks! It got further. Now, how do I set the Primary Key to be a
>>>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>>>>
>>>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
>>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>>>>
>>>>>> java.lang.IllegalArgumentException: Table partitioning must be
>>>>>> specified using setRangePartitionColumns or addHashPartitions
>>>>>>
>>>>>
>>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of
>>>>> primary key columns, so in this case you have specified the single PK
>>>>> column "my_id".  The `addHashPartitions` call adds hash partitioning to the
>>>>> table, in this case over the column "my_id" (which is good, it must be over
>>>>> one or more PK columns, so in this case "my_id" is the one and only valid
>>>>> combination).  However, the call to `addHashPartitions` also takes the
>>>>> number of buckets as the second param.  You shouldn't get the
>>>>> IllegalArgumentException as long as you are specifying either
>>>>> `addHashPartitions` or `setRangePartitionColumns`.
>>>>>
>>>>> - Dan
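
The two rules in the reply above (hash columns must be primary key columns, and a bucket count is required) can be sketched in plain Scala. Everything here is hypothetical illustration: `HashPartitionSketch` is not a Kudu class, the positive-bucket check is this sketch's own, and `bucketOf` uses Scala's built-in `hashCode`, whereas Kudu uses its own hash function.

```scala
object HashPartitionSketch {
  // Checks that every hash column is a primary key column and that a
  // usable bucket count was supplied.
  def validate(primaryKey: Seq[String],
               hashColumns: Seq[String],
               numBuckets: Int): Either[String, Unit] =
    if (hashColumns.isEmpty) Left("no hash columns given")
    else if (numBuckets < 1) Left("number of buckets must be positive")
    else hashColumns.find(c => !primaryKey.contains(c)) match {
      case Some(c) => Left(s"hash column '$c' is not part of the primary key")
      case None    => Right(())
    }

  // Illustrative bucket assignment: hash the key values, then reduce the
  // hash to the range [0, numBuckets). Not Kudu's actual hash function.
  def bucketOf(keyValues: Seq[String], numBuckets: Int): Int =
    java.lang.Math.floorMod(keyValues.hashCode(), numBuckets)
}
```

With a single-column primary key such as "my_id", the only column set that passes `validate` is `Seq("my_id")`, matching the "one and only valid combination" remark.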
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <dan@cloudera.com> wrote:
>>>>>>
>>>>>> Looks like we're missing an import statement in that example.  Could
>>>>>> you try:
>>>>>>
>>>>>> import org.kududb.client._
>>>>>>
>>>>>> and try again?
>>>>>>
>>>>>> - Dan
>>>>>>
>>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I encountered an error trying to create a table based on the
>>>>>>> documentation from a DataFrame.
>>>>>>>
>>>>>>> <console>:49: error: not found: type CreateTableOptions
>>>>>>>               kuduContext.createTable(tableName, df.schema,
>>>>>>> Seq("key"), new CreateTableOptions().setNumReplicas(1))
>>>>>>>
>>>>>>> Is there something I’m missing?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> It's only in Cloudera's maven repo:
>>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi J-D,
>>>>>>>>
>>>>>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark
>>>>>>>> jar for spark-shell to use. Can you show me where to find it?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> What's in this doc is what's gonna get released:
>>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> It will be in 0.9.0.
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>>> Hi Chris,
>>>>>>>>>>
>>>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George <
>>>>>>>>>> Christopher.George@rms.com> wrote:
>>>>>>>>>>
>>>>>>>>>> There is some code in review that needs some more refinement.
>>>>>>>>>> It will allow upsert/insert from a dataframe using the datasource
>>>>>>>>>> api. It will also allow the creation and deletion of tables from a
>>>>>>>>>> dataframe
>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>>>>>>
>>>>>>>>>> Example usages will look something like:
>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>>>>>>>>
>>>>>>>>>> -Chris George
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>>>>>
>>>>>>>>>> Also, does anyone have any sample code on how to update/insert
>>>>>>>>>> data in Kudu using DataFrames?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <
>>>>>>>>>> Christopher.George@rms.com> wrote:
>>>>>>>>>>
>>>>>>>>>> SparkSQL cannot support these types of statements, but we may be
>>>>>>>>>> able to implement similar functionality through the api.
>>>>>>>>>> -Chris
>>>>>>>>>>
>>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an
>>>>>>>>>> “upsert” if it were to be implemented.
>>>>>>>>>>
>>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>>>  WHEN MATCHED THEN
>>>>>>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>>>  WHEN NOT MATCHED THEN
>>>>>>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <
>>>>>>>>>> Christopher.George@rms.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it
>>>>>>>>>> into gerrit if you want to take a look.
>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>>>>>>>> It does pushdown predicates, which the existing input formatter
>>>>>>>>>> based rdd does not.
>>>>>>>>>>
>>>>>>>>>> Within the next two weeks I’m planning to implement a datasource
>>>>>>>>>> for spark that will have pushdown predicates and insertion/update
>>>>>>>>>> functionality (need to look more at cassandra and the hbase
>>>>>>>>>> datasource for the best way to do this). I agree that server side
>>>>>>>>>> upsert would be helpful. Having a datasource would give us useful
>>>>>>>>>> data frames and also make spark sql usable for kudu.
>>>>>>>>>>
>>>>>>>>>> My reasoning for having a spark datasource and not using Impala
>>>>>>>>>> is: 1. We have had trouble getting impala to run fast with high
>>>>>>>>>> concurrency when compared to spark. 2. We interact with datasources
>>>>>>>>>> which do not integrate with impala. 3. We have custom sql query
>>>>>>>>>> planners for extended sql functionality.
>>>>>>>>>>
>>>>>>>>>> -Chris George
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcryans@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> You guys make a convincing point, although on the upsert side
>>>>>>>>>> we'll need more support from the servers. Right now all you can do
>>>>>>>>>> is an INSERT then, if you get a dup key, do an UPDATE. I guess we
>>>>>>>>>> could at least add an API on the client side that would manage it,
>>>>>>>>>> but it wouldn't be atomic.
>>>>>>>>>>
>>>>>>>>>> J-D
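
The client-side workaround described above (INSERT, then UPDATE on a duplicate key) might look roughly like this against a toy in-memory store. This is a sketch under stated assumptions: `EmulatedUpsert`, `ToyStore` and the error types are hypothetical and do not correspond to the Kudu client API.

```scala
import scala.collection.mutable

object EmulatedUpsert {
  sealed trait WriteError
  case object DuplicateKey extends WriteError
  case object KeyNotFound extends WriteError

  // Toy store offering only INSERT and UPDATE, as in the message above.
  final class ToyStore {
    private val rows = mutable.Map.empty[String, String]

    def insert(k: String, v: String): Either[WriteError, Unit] =
      if (rows.contains(k)) Left(DuplicateKey) else { rows(k) = v; Right(()) }

    def update(k: String, v: String): Either[WriteError, Unit] =
      if (rows.contains(k)) { rows(k) = v; Right(()) } else Left(KeyNotFound)

    def get(k: String): Option[String] = rows.get(k)
  }

  // Try INSERT first; on a duplicate key fall back to UPDATE.
  // These are two separate operations, so a concurrent writer could delete
  // the row in between and the UPDATE would then fail: it is not atomic.
  def upsert(store: ToyStore, k: String, v: String): Either[WriteError, Unit] =
    store.insert(k, v) match {
      case Left(DuplicateKey) => store.update(k, v)
      case other              => other
    }
}
```

A server-side upsert removes that window because the insert-or-update decision is made in a single operation.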
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <
>>>>>>>>>> mark@clearstorydata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> It's pretty simple, actually.  I need to support versioned
>>>>>>>>>>> datasets in a Spark SQL environment.  Instead of a hack on top of a
>>>>>>>>>>> Parquet data store, I'm hoping (among other reasons) to be able to
>>>>>>>>>>> use Kudu's write and timestamp-based read operations to support not
>>>>>>>>>>> only appending data, but also updating existing data, and even some
>>>>>>>>>>> schema migration.  The most typical use case is a dataset that is
>>>>>>>>>>> updated periodically (e.g., weekly or monthly) in which the
>>>>>>>>>>> preliminary data in the previous window (week or month) is updated
>>>>>>>>>>> with values that are expected to remain unchanged from then on, and
>>>>>>>>>>> a new set of preliminary values for the current window needs to be
>>>>>>>>>>> added/appended.
>>>>>>>>>>>
>>>>>>>>>>> Using Kudu's Java API and developing additional functionality on
>>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the ease of
>>>>>>>>>>> integration with Spark SQL will gate how quickly we would move to
>>>>>>>>>>> using Kudu and how seriously we'd look at alternatives before making
>>>>>>>>>>> that decision.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <
>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Mark,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it
>>>>>>>>>>>> caught the attention of other folks!
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <
>>>>>>>>>>>> mark@clearstorydata.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently
>>>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert
>>>>>>>>>>>>> functionality while trying to evaluate what to expect from Kudu.
>>>>>>>>>>>>> Whether Kudu does a good job supporting inserts with Spark SQL
>>>>>>>>>>>>> will be a key consideration as to whether we adopt Kudu.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary
>>>>>>>>>>>> for you. Is it just that you currently do it that way into some
>>>>>>>>>>>> database or parquet, so with minimal refactoring you'd be able to
>>>>>>>>>>>> use Kudu? Would re-writing those SQL lines into Scala and directly
>>>>>>>>>>>> using the Java API's KuduSession be too much work?
>>>>>>>>>>>>
>>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu VS
>>>>>>>>>>>> your current solution? If it's not completely clear, I'd love to
>>>>>>>>>>>> help you think through it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What are your DS folks looking for in terms of functionality
>>>>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully
>>>>>>>>>>>>>> featured as Impala's? Do they care about being able to insert
>>>>>>>>>>>>>> into Kudu with SparkSQL or just being able to query real fast?
>>>>>>>>>>>>>> Anything more specific to Spark that I'm missing?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At
>>>>>>>>>>>>>> Cloudera all our resources are committed to making things happen
>>>>>>>>>>>>>> in time, and a more fully featured Spark integration isn't in
>>>>>>>>>>>>>> our plans during that period. I'm really hoping someone in the
>>>>>>>>>>>>>> community will help with Spark, the same way we got a big
>>>>>>>>>>>>>> contribution for the Flume sink.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <
>>>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions.
>>>>>>>>>>>>>>> But, since it’s not “production-ready”, upper management
>>>>>>>>>>>>>>> doesn’t want to fully deploy it yet. They just want to keep an
>>>>>>>>>>>>>>> eye on it though. Kudu was so much simpler and easier to use in
>>>>>>>>>>>>>>> every aspect compared to HBase. Impala was great for the report
>>>>>>>>>>>>>>> writers and analysts to experiment with for the short time it
>>>>>>>>>>>>>>> was up. But, once again, the only blocker was the lack of Spark
>>>>>>>>>>>>>>> support for our Data Developers/Scientists. So, production-level
>>>>>>>>>>>>>>> data population won’t happen until then.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <
>>>>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an
>>>>>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken
>>>>>>>>>>>>>>>> care of and idempotency is maintained. Whether data was
>>>>>>>>>>>>>>>> directly retrieved from Cassandra for analytics, reports, or
>>>>>>>>>>>>>>>> searches, it was not clear as to what was its main use. Some
>>>>>>>>>>>>>>>> also just used it for a staging area to populate downstream
>>>>>>>>>>>>>>>> tables in parquet format. The last thing I heard was that CQL
>>>>>>>>>>>>>>>> was terrible, so that rules out much use of direct queries
>>>>>>>>>>>>>>>> against it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real
>>>>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs.
>>>>>>>>>>>>>>> Even then, Kudu should beat it easily on big scans. Same for
>>>>>>>>>>>>>>> HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As for our company, we have been looking for an updatable
>>>>>>>>>>>>>>>> data store for a long time that can be quickly queried directly
>>>>>>>>>>>>>>>> either using Spark SQL or Impala or some other SQL engine and
>>>>>>>>>>>>>>>> still handle TB or PB of data without performance degradation
>>>>>>>>>>>>>>>> and many configuration headaches. For now, we are using HBase
>>>>>>>>>>>>>>>> to take on this role with Phoenix as a fast way to directly
>>>>>>>>>>>>>>>> query the data. I can see Kudu as the best way to fill this gap
>>>>>>>>>>>>>>>> easily, especially being the closest thing to other relational
>>>>>>>>>>>>>>>> databases out there in familiarity for the many SQL analytics
>>>>>>>>>>>>>>>> people in our company. The other alternative would be to go
>>>>>>>>>>>>>>>> with AWS Redshift for the same reasons, but it would come at a
>>>>>>>>>>>>>>>> cost, of course. If we went with either solution, Kudu or
>>>>>>>>>>>>>>>> Redshift, it would get rid of the need to extract from HBase to
>>>>>>>>>>>>>>>> parquet tables or export to PostgreSQL to support more of the
>>>>>>>>>>>>>>>> SQL language used by analysts or the reporting software we use.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off
>>>>>>>>>>>>>>> with Kudu. Have you folks tried Kudu with Impala yet with those
>>>>>>>>>>>>>>> use cases?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we
>>>>>>>>>>>>>>>> like to refer to "Impala + Kudu" as Kimpala, but yeah it's not
>>>>>>>>>>>>>>>> as sexy. My colleagues who were also there did say that the
>>>>>>>>>>>>>>>> hype around Spark isn't dying down.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There's definitely an overlap in the use cases that
>>>>>>>>>>>>>>>> Cassandra, HBase, and Kudu cater to. I wouldn't go as far as
>>>>>>>>>>>>>>>> saying that C* is just an interim solution for the use case you
>>>>>>>>>>>>>>>> describe.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month,
>>>>>>>>>>>>>>>> it's a storage engine so things move slowly *smile*. I'd love
>>>>>>>>>>>>>>>> to see more contributions on the Spark front. I know there's
>>>>>>>>>>>>>>>> code out there that could be integrated in kudu-spark, it just
>>>>>>>>>>>>>>>> needs to land in gerrit. I'm sure folks will happily review it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to
>>>>>>>>>>>>>>>> learn more about the use cases for which you envision using
>>>>>>>>>>>>>>>> Kudu as a C* replacement.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <
>>>>>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose.
>>>>>>>>>>>>>>>>> They told me that everything was about Spark and there is a
>>>>>>>>>>>>>>>>> big buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra,
>>>>>>>>>>>>>>>>> Kafka). I still think that Cassandra is just an interim
>>>>>>>>>>>>>>>>> solution as a low-latency, easily queried data store. I was
>>>>>>>>>>>>>>>>> wondering if anything significant happened in regards to Kudu,
>>>>>>>>>>>>>>>>> especially on the Spark front. Plus, can you come up with your
>>>>>>>>>>>>>>>>> own proposed stack acronym to promote?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any
>>>>>>>>>>>>>>>>> timeline. I know of one person on the Kudu Slack who's working
>>>>>>>>>>>>>>>>> on a better RDD, but that's about it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <
>>>>>>>>>>>>>>>>> bkim@amobee.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to
>>>>>>>>>>>>>>>>>> target a version of Kudu to begin real testing of Spark
>>>>>>>>>>>>>>>>>> against it for our devs. At least, I can tell them what
>>>>>>>>>>>>>>>>>> timeframe to anticipate.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>>>>>> *Benjamin Kim*
>>>>>>>>>>>>>>>>>> *Data Solutions Architect*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA
>>>>>>>>>>>>>>>>>> 90405  |  www.amobee.com
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if
>>>>>>>>>>>>>>>>>> it's needed either.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format,
>>>>>>>>>>>>>>>>>> ideally we'd use scans directly.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of
>>>>>>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The goal was to provide something for others to
>>>>>>>>>>>>>>>>>> contribute to. We have some basic unit tests that others can
>>>>>>>>>>>>>>>>>> easily extend. None of us on the team are Spark experts, but
>>>>>>>>>>>>>>>>>> we'd be really happy to assist anyone in improving the
>>>>>>>>>>>>>>>>>> kudu-spark code.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <
>>>>>>>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements
>>>>>>>>>>>>>>>>>>> (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides
>>>>>>>>>>>>>>>>>>> shoring up more Spark SQL functionality (Dataframes) and
>>>>>>>>>>>>>>>>>>> doing the documentation, what more needs to be done?
>>>>>>>>>>>>>>>>>>> Optimizations?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I believe that it’s a good place to start using Spark
>>>>>>>>>>>>>>>>>>> with Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get
>>>>>>>>>>>>>>>>>>> this in for 0.7.0:
>>>>>>>>>>>>>>>>>>> https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use
>>>>>>>>>>>>>>>>>>> SparkSQL on Kudu, but it will require a lot more work to make
>>>>>>>>>>>>>>>>>>> it fast/useful.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <
>>>>>>>>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted
>>>>>>>>>>>>>>>>>>>> for 0.8.0, but I see no progress on it. When this is
>>>>>>>>>>>>>>>>>>>> complete, will this mean that Spark will be able to work
>>>>>>>>>>>>>>>>>>>> with Kudu both programmatically and as a client via Spark
>>>>>>>>>>>>>>>>>>>> SQL? Or is there more work that needs to be done on the
>>>>>>>>>>>>>>>>>>>> Spark side for it to work?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>>>>>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
>
