kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Bulk / Initial load of large tables into Kudu using Spark
Date Wed, 31 Jan 2018 01:34:52 GMT
On Mon, Jan 29, 2018 at 1:19 PM, Boris Tyukin <boris@boristyukin.com> wrote:

> thank you both. Does it make a difference from a performance perspective
> though if I do a bulk load through Impala versus Spark? Will the Kudu client
> with Spark be faster than Impala?
>

Recent versions of Impala pre-sort and pre-shuffle the data to avoid
compactions in Kudu during the insert. Spark does not currently have these
optimizations, so I would guess that Impala would be able to bulk load large
datasets more efficiently than Spark for the time being.

-Todd


>
> On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <patrick@cloudera.com>
>> wrote:
>>
>>> Hi Boris.
>>>
>>> 1) I would like to bypass Impala as data for my bulk load coming from
>>>> sqoop and avro files are stored on HDFS.
>>>>
>>> What's the objection to Impala? In the example below, Impala reads from
>>> an HDFS-resident table, and writes to the Kudu table.
>>>
>>>
>>>> 2) we do not want to deal with MapReduce.
>>>>
>>>
>>> You can still use Spark... the MR reference is in regards to the
>>> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
>>> these. See, for example:
>>>
https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
>>>
>>
>> While that's possible, I'd recommend using the DataFrames API instead, e.g.
>> see https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
>>
>> That should work as well as (or better than) the MR OutputFormat.
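>>
>> For Boris's case (Sqoop-produced Avro files on HDFS), a minimal sketch of
>> the DataFrames approach might look like the following. The master address,
>> paths, and table name are placeholders, and it assumes an Avro data source
>> is available to Spark and that the Kudu table already exists:
>>
>> import org.apache.kudu.spark.kudu._
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession.builder.appName("kudu-bulk-load").getOrCreate()
>>
>> // Read the Sqoop-exported Avro files from HDFS (hypothetical path).
>> val df = spark.read.format("avro").load("hdfs:///data/my_table_avro")
>>
>> // KuduContext issues the writes; "kudu-master:7051" and the table
>> // name are placeholders for your cluster.
>> val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)
>> kuduContext.insertRows(df, "my_kudu_table")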
>>
>> -Todd
>>
>>
>>
>>> However, you'll have to write (simple) Spark code, whereas with method
>>> #1 you do effectively the same thing under the covers using SQL statements
>>> via Impala.
>>>
>>>
>>>>
>>>> Thanks!
>>>> What’s the most efficient way to bulk load data into Kudu?
>>>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>>>>
>>>> The easiest way to load data into Kudu is if the data is already
>>>> managed by Impala. In this case, a simple INSERT INTO TABLE
>>>> some_kudu_table SELECT * FROM some_csv_table does the trick.
>>>>
>>>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>>>> HBase, or any other data store that has an InputFormat.
>>>>
>>>> No tool is provided to load data directly into Kudu’s on-disk data
>>>> format. We have found that for many workloads, the insert performance of
>>>> Kudu is comparable to bulk load performance of other systems.
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
