From dev-return-17660-apmail-sqoop-dev-archive=sqoop.apache.org@sqoop.apache.org Mon Dec 1 23:17:22 2014 Return-Path: X-Original-To: apmail-sqoop-dev-archive@www.apache.org Delivered-To: apmail-sqoop-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B4D0210AC9 for ; Mon, 1 Dec 2014 23:17:22 +0000 (UTC) Received: (qmail 53216 invoked by uid 500); 1 Dec 2014 23:17:22 -0000 Delivered-To: apmail-sqoop-dev-archive@sqoop.apache.org Received: (qmail 53170 invoked by uid 500); 1 Dec 2014 23:17:22 -0000 Mailing-List: contact dev-help@sqoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@sqoop.apache.org Delivered-To: mailing list dev@sqoop.apache.org Received: (qmail 53157 invoked by uid 99); 1 Dec 2014 23:17:22 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 01 Dec 2014 23:17:22 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jarcec@gmail.com designates 209.85.192.175 as permitted sender) Received: from [209.85.192.175] (HELO mail-pd0-f175.google.com) (209.85.192.175) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 01 Dec 2014 23:16:56 +0000 Received: by mail-pd0-f175.google.com with SMTP id y10so11863402pdj.34 for ; Mon, 01 Dec 2014 15:16:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:content-type:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=ifo4g/MCT6z9lZjT7be1qeo8SD1m1cx6rBAVlmZxDVs=; b=RGurYJKDChPRBCIevYXa1wrnoETIDLTjcCLq5HwovZ3IU1mDHcormDkIGC13ZDwJCz 8txnlMyk2VWFUmOPyj/v94aB9fs80KrKgRHurCt0yb5MN1qjkhpS1KSJUhVPOrWd5ct1 jlpKUzkaUdgnymFeAjV9b0EJUtCmJDswSEP2cmXu7l9hhjKit9rnRlIV/p3R/SEph2l0 iz0q3yGNHXqbaRSKOPbd+jBkxsZp+dxsaQ+jxilpMh+KUecIJt7TkannV7mJ4/A1Tbcl AcBDk4Yu5rpJd13QcCU+ox0jz7/aIxjga4drQIoxdGvEgxl6K8hyuDLpHBnTwT53lzK0 fSUQ== X-Received: by 10.68.211.104 with SMTP id nb8mr200692pbc.29.1417475814719; Mon, 01 Dec 2014 15:16:54 -0800 (PST) Received: from [172.16.2.132] ([74.217.76.11]) by mx.google.com with ESMTPSA id nr15sm2924126pdb.73.2014.12.01.15.16.53 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 01 Dec 2014 15:16:54 -0800 (PST) Sender: Jarek Jarcec Cecho Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 8.1 \(1993\)) Subject: Re: Configurable NULL in IDF or Connector? From: Jarek Jarcec Cecho In-Reply-To: Date: Mon, 1 Dec 2014 15:16:53 -0800 Content-Transfer-Encoding: quoted-printable Message-Id: <3D977D1E-F810-43F2-A44E-57843BB69A9F@apache.org> References: <06CC5BAA-608C-41D8-8BA2-3C5F7710B849@apache.org> To: dev@sqoop.apache.org X-Mailer: Apple Mail (2.1993) X-Virus-Checked: Checked by ClamAV on apache.org Gwen, we=E2=80=99ve investigated mysqldump, pg_dump and few others already, = the results are on the wiki [1]. The resulting CSV-ish specification is = following those two very closely. In MySQL case specifically, I=E2=80=99ve looked into mysqldump output = rather then =E2=80=9CLOAD DATA=E2=80=9D/=E2=80=9CSELECT INTO OUTFILE" = statement because =E2=80=9CLOAD DATA=E2=80=9D requires the file to = exists on the database machine whereas mysqldump/mysqlimport allows us = to import data to the database from any machine on the Hadoop cluster. Jarcec Links: 1: = https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+= representation > On Dec 1, 2014, at 11:55 AM, Gwen Shapira = wrote: >=20 > Agreed. I hope we'll have at least one direct connector real soon now > to prove it. >=20 > Reading this: > http://dev.mysql.com/doc/refman/5.6/en/load-data.html > was a bit discouraging... >=20 > On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek = wrote: >> My understanding is that MySQL and PostgreSQL can output to CSV in = the >> suggested format. >>=20 >> NOTE: getTextData() and setTextData() APIs are effectively useless if >> reduced processing load is not possible. >>=20 >> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira = wrote: >>=20 >>> (hijacking the thread a bit for a related point) >>>=20 >>> I have some misgivings around how we manage the IDF now. >>>=20 >>> We go with a pretty specific CSV in order to avoid extra-processing >>> for MySQL/Postgres direct connectors. >>> I think the intent is to allow running LOAD DATA without any = processing. >>> Therefore we need to research and document the specific formats >>> required by MySQL and Postgres. Both DBs have pretty specific (and >>> often funky) formatting they need (If escaping is not used then NULL >>> is null, otherwise \N...) >>>=20 >>> If zero-processing load is not feasible, I'd re-consider the IDF and >>> lean toward a more structured format (Avro?). If the connectors = need >>> to parse the CSV and modify it, we are not gaining anything here. Or >>> at the very least benchmark to validate that CSV+processing is still >>> the fastest / least CPU option. >>>=20 >>> Gwen >>>=20 >>>=20 >>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek >>> wrote: >>>> Indeed. I created SQOOP-1678 is intended to address #1. Let me = re-define >>>> it... >>>>=20 >>>> Also, for #2... There are a few ways of generating output. It seems = NULL >>>> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes >>> sense. >>>>=20 >>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho = >>>> wrote: >>>>=20 >>>>> I do share the same point of view as Gwen. The CSV format for UDF = is >>> very >>>>> strict so that we have minimal surface area for inconsistencies = between >>>>> multiple connectors. This is because the IDF is an agreed upon = exchange >>>>> format when transferring data from one connector to the other. = That >>> however >>>>> shouldn't stop one connector (such as HDFS) to offer ways to save = the >>>>> resulting CSV differently. >>>>>=20 >>>>> We had similar discussion about separator and quote characters in >>>>> SQOOP-1522 that seems to be relevant to the NULL discussion here. >>>>>=20 >>>>> Jarcec >>>>>=20 >>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira >>> wrote: >>>>>>=20 >>>>>> I think its two different things: >>>>>>=20 >>>>>> 1. HDFS connector should give more control over the formatting of = the >>>>>> data in text files (nulls, escaping, etc) >>>>>> 2. IDF should give NULLs in a format that is optimized for >>>>>> MySQL/Postgres direct connectors (since thats one of the IDF = design >>>>>> goals). >>>>>>=20 >>>>>> Gwen >>>>>>=20 >>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek = >>>>> wrote: >>>>>>> Hey guys, >>>>>>>=20 >>>>>>> Any thoughts on where configurable NULL values should be? Either = the >>>>> IDF or >>>>>>> HDFS connector? >>>>>>>=20 >>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678 >>>>>>>=20 >>>>>>> -Abe >>>>>=20 >>>>>=20 >>>=20