From: Patrick Angeles
Date: Mon, 29 Jan 2018 14:18:38 -0500
Subject: Re: Bulk / Initial load of large tables into Kudu using Spark
To: user@kudu.apache.org
Reply-To: user@kudu.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="001a1141a2a470ede20563ef1fb6"

--001a1141a2a470ede20563ef1fb6
Content-Type: text/plain; charset="UTF-8"

Hi Boris.

> 1) I would like to bypass Impala as data for my bulk load coming from
> sqoop and avro files are stored on HDFS.

What's the objection to Impala? In the example below, Impala reads from an
HDFS-resident table, and writes to the Kudu table.

> 2) we do not want to deal with MapReduce.

You can still use Spark... the MR reference is in regards to the
Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
these. See, for example:

https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma

However, you'll have to write (simple) Spark code, whereas with method #1
you do effectively the same thing under the covers using SQL statements via
Impala.

> Thanks!
>
> What’s the most efficient way to bulk load data into Kudu?
>
> The easiest way to load data into Kudu is if the data is already managed
> by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table
> SELECT * FROM some_csv_table does the trick.
>
> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
> HBase, or any other data store that has an InputFormat.
>
> No tool is provided to load data directly into Kudu’s on-disk data format.
> We have found that for many workloads, the insert performance of Kudu is
> comparable to bulk load performance of other systems.

--001a1141a2a470ede20563ef1fb6--
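
For the Spark route discussed in this thread, here is a minimal sketch of what the "(simple) Spark code" might look like. Note that it does not use the MapReduce OutputFormat mentioned above; it swaps in the kudu-spark DataFrame data source, which is assumed to be on the classpath under the format name "org.apache.kudu.spark.kudu". The Avro reader name "avro" is also an assumption and depends on the Spark version and Avro package in use. The function names, path, master address, and table name are hypothetical, and the Kudu table is assumed to already exist with a schema compatible with the Avro data.

```python
def kudu_options(kudu_master, kudu_table):
    """Build the option map for the kudu-spark data source (names illustrative)."""
    return {"kudu.master": kudu_master, "kudu.table": kudu_table}


def load_avro_into_kudu(spark, avro_path, kudu_master, kudu_table):
    """Read Avro files and append their rows to an existing Kudu table.

    `spark` is an already-created SparkSession, passed in by the caller.
    Assumption: the reader format "avro" is available in this Spark build.
    Assumption: the kudu-spark connector jar is on the classpath.
    """
    # Read the Avro files (for example, the ones sqoop wrote to HDFS).
    df = spark.read.format("avro").load(avro_path)

    # Append the rows to the Kudu table through the connector.
    (df.write
       .format("org.apache.kudu.spark.kudu")
       .options(**kudu_options(kudu_master, kudu_table))
       .mode("append")
       .save())
    return df
```

With a live session, a call might look like `load_avro_into_kudu(spark, "hdfs:///data/my_table_avro", "kudu-master:7051", "my_kudu_table")`, where all three values are placeholders. Passing the session in keeps the function free of cluster setup, so the same code can be driven from spark-submit or a notebook.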