kudu-user mailing list archives

From Dan Burkert <...@cloudera.com>
Subject Re: Spark on Kudu
Date Wed, 15 Jun 2016 00:57:59 GMT
Right now "append" uses an update Kudu operation, which requires that the row
already be present in the table. "Overwrite" maps to insert.  Kudu very
recently got upsert support baked in, but it hasn't yet been integrated
into the Spark connector.  So pretty soon these sharp edges will get a lot
better, since upsert is the way to go for most Spark workloads.
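
In the meantime, a rough sketch of what each mode does under the hood with
the current connector (same datasource calls as in your snippet below, and
assuming the kudu-spark implicits are imported for `.kudu`):

df.write
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
  .mode("append")   // today: issues UPDATEs, so every row must already exist
  .kudu             // with .mode("overwrite"): issues INSERTs, rows must be new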

- Dan

On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuild11@gmail.com> wrote:

> I tried to use the “append” mode, and it worked. Over 3.8 million rows in
> 64s. I would assume that now I can use the “overwrite” mode on existing
> data. Now, I have to find answers to these questions. What would happen if
> I “append” to the data in the Kudu table if the data already exists? What
> would happen if I “overwrite” existing data when the DataFrame has data in
> it that does not exist in the Kudu table? I need to evaluate the best way
> to simulate the UPSERT behavior we have in HBase, because that is our use
> case.
>
> Thanks,
> Ben
>
>
>
> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>
> Hi,
>
> Now, I’m getting this error when trying to write to the table.
>
> import scala.collection.JavaConverters._
> val key_seq = Seq("my_id")
> val key_list = List("my_id").asJava
> kuduContext.createTable(tableName, df.schema, key_seq, new
> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>
> df.write
>     .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>     .mode("overwrite")
>     .kudu
>
> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to
> Kudu; sample errors: Not found: key not found (error 0)Not found: key not
> found (error 0)Not found: key not found (error 0)Not found: key not found
> (error 0)Not found: key not found (error 0)
>
> Does the key field need to be first in the DataFrame?
>
> Thanks,
> Ben
>
> On Jun 14, 2016, at 4:28 PM, Dan Burkert <dan@cloudera.com> wrote:
>
>
>
> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>
>> Dan,
>>
>> Thanks! It got further. Now, how do I set the Primary Key to be a
>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>
>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>
>> java.lang.IllegalArgumentException: Table partitioning must be specified
>> using setRangePartitionColumns or addHashPartitions
>>
>
> Yep.  The `Seq("my_id")` part of that call is specifying the set of
> primary key columns, so in this case you have specified the single PK
> column "my_id".  The `addHashPartitions` call adds hash partitioning to the
> table, in this case over the column "my_id" (which is good, it must be over
> one or more PK columns, so in this case "my_id" is the one and only valid
> combination).  However, the call to `addHashPartitions` also takes the
> number of buckets as the second param.  You shouldn't get the
> IllegalArgumentException as long as you are specifying either
> `addHashPartitions` or `setRangePartitionColumns`.
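>
> Putting those pieces together, a sketch of the corrected call (the bucket
> count of 4 here is just an arbitrary example):
>
> import scala.collection.JavaConverters._
> import org.kududb.client._
>
> kuduContext.createTable(tableName, df.schema, Seq("my_id"),
>   new CreateTableOptions()
>     .setNumReplicas(1)
>     .addHashPartitions(List("my_id").asJava, 4))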
>
> - Dan
>
>
>>
>> Thanks,
>> Ben
>>
>>
>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <dan@cloudera.com> wrote:
>>
>> Looks like we're missing an import statement in that example.  Could you
>> try:
>>
>> import org.kududb.client._
>>
>> and try again?
>>
>> - Dan
>>
>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>>
>>> I encountered an error trying to create a table from a DataFrame, based
>>> on the documentation.
>>>
>>> <console>:49: error: not found: type CreateTableOptions
>>>               kuduContext.createTable(tableName, df.schema, Seq("key"),
>>> new CreateTableOptions().setNumReplicas(1))
>>>
>>> Is there something I’m missing?
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>> wrote:
>>>
>>> It's only in Cloudera's maven repo:
>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
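>>>
>>> To pull it into spark-shell straight from that repo, something like this
>>> should work (assuming Spark's --packages/--repositories support):
>>>
>>> spark-shell \
>>>   --repositories https://repository.cloudera.com/cloudera/cloudera-repos \
>>>   --packages org.kududb:kudu-spark_2.10:0.9.0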
>>>
>>> J-D
>>>
>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuild11@gmail.com>
>>> wrote:
>>>
>>>> Hi J-D,
>>>>
>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar
>>>> for spark-shell to use. Can you show me where to find it?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>> wrote:
>>>>
>>>> What's in this doc is what's gonna get released:
>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>
>>>> J-D
>>>>
>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuild11@gmail.com>
>>>> wrote:
>>>>
>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>> wrote:
>>>>>
>>>>> It will be in 0.9.0.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuild11@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Chris,
>>>>>>
>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On May 18, 2016, at 9:01 AM, Chris George <Christopher.George@rms.com>
>>>>>> wrote:
>>>>>>
>>>>>> There is some code in review that needs some more refinement.
>>>>>> It will allow upsert/insert from a dataframe using the datasource
>>>>>> api. It will also allow the creation and deletion of tables from a
>>>>>> dataframe:
>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>>
>>>>>> Example usages will look something like:
>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>>>>
>>>>>> -Chris George
>>>>>>
>>>>>>
>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>>>>
>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>
>>>>>> Also, does anyone have any sample code on how to update/insert data
>>>>>> in Kudu using DataFrames?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <Christopher.George@rms.com>
>>>>>> wrote:
>>>>>>
>>>>>> SparkSQL cannot support these types of statements, but we may be able
>>>>>> to implement similar functionality through the API.
>>>>>> -Chris
>>>>>>
>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>>>>
>>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert”
>>>>>> if it were to be implemented.
>>>>>>
>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>  WHEN MATCHED THEN
>>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>  WHEN NOT MATCHED THEN
>>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <
>>>>>> Christopher.George@rms.com> wrote:
>>>>>>
>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into
>>>>>> gerrit if you want to take a look.
>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>>>> It does push down predicates, which the existing input-format-based
>>>>>> RDD does not.
>>>>>>
>>>>>> Within the next two weeks I’m planning to implement a datasource for
>>>>>> Spark that will have pushdown predicates and insertion/update
>>>>>> functionality (need to look more at the Cassandra and HBase
>>>>>> datasources for the best way to do this). I agree that server-side
>>>>>> upsert would be helpful.
>>>>>> Having a datasource would give us useful DataFrames and also make
>>>>>> Spark SQL usable for Kudu.
>>>>>>
>>>>>> My reasoning for having a Spark datasource and not using Impala is:
>>>>>> 1. We have had trouble getting Impala to run fast with high
>>>>>> concurrency when compared to Spark.
>>>>>> 2. We interact with datasources which do not integrate with Impala.
>>>>>> 3. We have custom SQL query planners for extended SQL functionality.
>>>>>>
>>>>>> -Chris George
>>>>>>
>>>>>>
>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcryans@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> You guys make a convincing point, although on the upsert side we'll
>>>>>> need more support from the servers. Right now all you can do is an
>>>>>> INSERT and then, if you get a dup key, do an UPDATE. I guess we could
>>>>>> at least add an API on the client side that would manage it, but it
>>>>>> wouldn't be atomic.
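>>>>>>
>>>>>> To make that concrete, a non-atomic sketch of what such a client-side
>>>>>> helper could look like against the Java client (hypothetical; a real
>>>>>> version should check that the row error is specifically a duplicate
>>>>>> key before retrying as an update):
>>>>>>
>>>>>> import org.kududb.client._
>>>>>>
>>>>>> def insertThenUpdate(session: KuduSession, table: KuduTable)
>>>>>>                     (fill: PartialRow => Unit): Unit = {
>>>>>>   val insert = table.newInsert()
>>>>>>   fill(insert.getRow)
>>>>>>   val resp = session.apply(insert)  // AUTO_FLUSH_SYNC assumed
>>>>>>   if (resp.hasRowError) {           // e.g. duplicate primary key
>>>>>>     val update = table.newUpdate()  // second round-trip, so the
>>>>>>     fill(update.getRow)             // insert+update pair is not atomic
>>>>>>     session.apply(update)
>>>>>>   }
>>>>>> }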
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <
>>>>>> mark@clearstorydata.com> wrote:
>>>>>>
>>>>>>> It's pretty simple, actually.  I need to support versioned datasets
>>>>>>> in a Spark SQL environment.  Instead of a hack on top of a Parquet
>>>>>>> data store, I'm hoping (among other reasons) to be able to use Kudu's
>>>>>>> write and timestamp-based read operations to support not only
>>>>>>> appending data, but also updating existing data, and even some schema
>>>>>>> migration.  The most typical use case is a dataset that is updated
>>>>>>> periodically (e.g., weekly or monthly) in which the preliminary data
>>>>>>> in the previous window (week or month) is updated with values that
>>>>>>> are expected to remain unchanged from then on, and a new set of
>>>>>>> preliminary values for the current window needs to be added/appended.
>>>>>>>
>>>>>>> Using Kudu's Java API and developing additional functionality on top
>>>>>>> of what Kudu has to offer isn't too much to ask, but the ease of
>>>>>>> integration with Spark SQL will gate how quickly we would move to
>>>>>>> using Kudu and how seriously we'd look at alternatives before making
>>>>>>> that decision.
>>>>>>>
>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <
>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>
>>>>>>>> Mark,
>>>>>>>>
>>>>>>>> Thanks for taking some time to reply in this thread, glad it caught
>>>>>>>> the attention of other folks!
>>>>>>>>
>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <
>>>>>>>> mark@clearstorydata.com> wrote:
>>>>>>>>
>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently
>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert
>>>>>>>>> functionality while trying to evaluate what to expect from Kudu.
>>>>>>>>> Whether Kudu does a good job supporting inserts with Spark SQL will
>>>>>>>>> be a key consideration as to whether we adopt Kudu.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary for
>>>>>>>> you. Is it just that you currently do it that way into some database
>>>>>>>> or Parquet, so with minimal refactoring you'd be able to use Kudu?
>>>>>>>> Would re-writing those SQL lines into Scala and directly using the
>>>>>>>> Java API's KuduSession be too much work?
>>>>>>>>
>>>>>>>> Additionally, what do you expect to gain from using Kudu vs. your
>>>>>>>> current solution? If it's not completely clear, I'd love to help
>>>>>>>> you think through it.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>
>>>>>>>>>> What are your DS folks looking for in terms of functionality
>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully featured
>>>>>>>>>> as Impala's? Do they care about being able to insert into Kudu
>>>>>>>>>> with SparkSQL, or just being able to query real fast? Anything
>>>>>>>>>> more specific to Spark that I'm missing?
>>>>>>>>>>
>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At
>>>>>>>>>> Cloudera all our resources are committed to making things happen
>>>>>>>>>> in time, and a more fully featured Spark integration isn't in our
>>>>>>>>>> plans during that period. I'm really hoping someone in the
>>>>>>>>>> community will help with Spark, the same way we got a big
>>>>>>>>>> contribution for the Flume sink.
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <
>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions.
>>>>>>>>>>> But, since it’s not “production-ready”, upper management doesn’t
>>>>>>>>>>> want to fully deploy it yet. They just want to keep an eye on it
>>>>>>>>>>> though. Kudu was so much simpler and easier to use in every
>>>>>>>>>>> aspect compared to HBase. Impala was great for the report writers
>>>>>>>>>>> and analysts to experiment with for the short time it was up.
>>>>>>>>>>> But, once again, the only blocker was the lack of Spark support
>>>>>>>>>>> for our Data Developers/Scientists. So, production-level data
>>>>>>>>>>> population won’t happen until then.
>>>>>>>>>>>
>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <
>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <
>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> J-D,
>>>>>>>>>>>>
>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an
>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken
>>>>>>>>>>>> care of and idempotency is maintained. Whether data was directly
>>>>>>>>>>>> retrieved from Cassandra for analytics, reports, or searches, it
>>>>>>>>>>>> was not clear what its main use was. Some also just used it as a
>>>>>>>>>>>> staging area to populate downstream tables in parquet format.
>>>>>>>>>>>> The last thing I heard was that CQL was terrible, so that rules
>>>>>>>>>>>> out much use of direct queries against it.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real
>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs.
>>>>>>>>>>> Even then, Kudu should beat it easily on big scans. Same for
>>>>>>>>>>> HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> As for our company, we have been looking for an updatable data
>>>>>>>>>>>> store for a long time that can be quickly queried directly,
>>>>>>>>>>>> either using Spark SQL or Impala or some other SQL engine, and
>>>>>>>>>>>> still handle TB or PB of data without performance degradation
>>>>>>>>>>>> and many configuration headaches. For now, we are using HBase to
>>>>>>>>>>>> take on this role, with Phoenix as a fast way to directly query
>>>>>>>>>>>> the data. I can see Kudu as the best way to fill this gap
>>>>>>>>>>>> easily, especially being the closest thing to other relational
>>>>>>>>>>>> databases out there in familiarity for the many SQL analytics
>>>>>>>>>>>> people in our company. The other alternative would be to go with
>>>>>>>>>>>> AWS Redshift for the same reasons, but it would come at a cost,
>>>>>>>>>>>> of course. If we went with either solution, Kudu or Redshift, it
>>>>>>>>>>>> would get rid of the need to extract from HBase to parquet
>>>>>>>>>>>> tables or export to PostgreSQL to support more of the SQL
>>>>>>>>>>>> language used by analysts or the reporting software we use.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off
>>>>>>>>>>> with Kudu. Have you folks tried Kudu with Impala yet with those
>>>>>>>>>>> use cases?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Ben
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <
>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like
>>>>>>>>>>>> to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as
>>>>>>>>>>>> sexy. My colleagues who were also there did say that the hype
>>>>>>>>>>>> around Spark isn't dying down.
>>>>>>>>>>>>
>>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra,
>>>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that C*
>>>>>>>>>>>> is just an interim solution for the use case you describe.
>>>>>>>>>>>>
>>>>>>>>>>>> Nothing significant happened in Kudu over the past month; it's
>>>>>>>>>>>> a storage engine, so things move slowly *smile*. I'd love to see
>>>>>>>>>>>> more contributions on the Spark front. I know there's code out
>>>>>>>>>>>> there that could be integrated in kudu-spark, it just needs to
>>>>>>>>>>>> land in gerrit. I'm sure folks will happily review it.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to
>>>>>>>>>>>> learn more about the use cases for which you envision using
>>>>>>>>>>>> Kudu as a C* replacement.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <
>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>
>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They
>>>>>>>>>>>>> told me that everything was about Spark and there is a big buzz
>>>>>>>>>>>>> about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I
>>>>>>>>>>>>> still think that Cassandra is just an interim solution as a
>>>>>>>>>>>>> low-latency, easily queried data store. I was wondering if
>>>>>>>>>>>>> anything significant happened in regards to Kudu, especially on
>>>>>>>>>>>>> the Spark front. Plus, can you come up with your own proposed
>>>>>>>>>>>>> stack acronym to promote?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>
>>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I
>>>>>>>>>>>>> know of one person on the Kudu Slack who's working on a better
>>>>>>>>>>>>> RDD, but that's about it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <bkim@amobee.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to
>>>>>>>>>>>>>> target a version of Kudu to begin real testing of Spark
>>>>>>>>>>>>>> against it for our devs. At least, I can tell them what
>>>>>>>>>>>>>> timeframe to anticipate.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>> *Benjamin Kim*
>>>>>>>>>>>>>> *Data Solutions Architect*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405
>>>>>>>>>>>>>> |  www.amobee.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's
>>>>>>>>>>>>>> needed either.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally
>>>>>>>>>>>>>> we'd use scans directly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of
>>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The goal was to provide something for others to contribute to.
>>>>>>>>>>>>>> We have some basic unit tests that others can easily extend.
>>>>>>>>>>>>>> None of us on the team are Spark experts, but we'd be really
>>>>>>>>>>>>>> happy to assist anyone who wants to improve the kudu-spark
>>>>>>>>>>>>>> code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <
>>>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements
>>>>>>>>>>>>>>> (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides
>>>>>>>>>>>>>>> shoring up more Spark SQL functionality (DataFrames) and
>>>>>>>>>>>>>>> doing the documentation, what more needs to be done?
>>>>>>>>>>>>>>> Optimizations?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I believe that it’s a good place to start using Spark with
>>>>>>>>>>>>>>> Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this
>>>>>>>>>>>>>>> in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL
>>>>>>>>>>>>>>> on Kudu, but it will require a lot more work to make it
>>>>>>>>>>>>>>> fast/useful.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <
>>>>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for
>>>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete,
>>>>>>>>>>>>>>>> will this mean that Spark will be able to work with Kudu
>>>>>>>>>>>>>>>> both programmatically and as a client via Spark SQL? Or is
>>>>>>>>>>>>>>>> there more work that needs to be done on the Spark side for
>>>>>>>>>>>>>>>> it to work?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
