Date: Mon, 1 Dec 2014 15:31:34 -0800
Subject: Re: Configurable NULL in IDF or Connector?
From: Gwen Shapira
To: dev@sqoop.apache.org

Performance numbers would be sweet at some point, for sure. Based on some
rough tests we did in the field (on another project), Avro serialization
does have significant overhead (I think Hive writing CSV was 20% faster
than writing Avro; I can dig up my results later). It may be even worse for
Sqoop, since Hive does serialization in batches.

This is not completely scientific, but it leads me to believe that, as much
as I love Avro, we'll need a good reason to use it internally.

On Mon, Dec 1, 2014 at 3:19 PM, Veena Basavaraj wrote:
> Jarcec,
>
> If we were more metrics-driven, with some tests and/or benchmarks to
> prove how much faster this would be, it would have been great. Just a
> suggestion.
>
> Gwen probably meant the same as well.
>
> Best,
> *./Vee*
>
> On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho
> wrote:
>
>> Gwen,
>> we've investigated mysqldump, pg_dump and a few others already; the
>> results are on the wiki [1]. The resulting CSV-ish specification follows
>> those two very closely.
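[Editor's note: a minimal sketch of the kind of encoding the CSV-ish IDF
specification describes. The function names below are invented for
illustration and are not Sqoop 2's actual code; the authoritative rules are
on the wiki page Jarcec links. The key point it shows is why a bare NULL
token stays distinguishable from the literal string "NULL" when strings are
always quoted.]

```python
# Sketch of a CSV-ish row encoding in the spirit of the Sqoop 2 IDF spec:
# strings are single-quoted with backslash escaping, SQL NULL becomes the
# bare token NULL. Because real strings are always quoted, the string
# "NULL" ('NULL') never collides with an actual null (NULL).
# NOTE: encode_field/encode_row are hypothetical names, not Sqoop 2 APIs.

def encode_field(value):
    if value is None:
        return "NULL"                       # unquoted marker for SQL NULL
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"          # quoted, so 'NULL' stays a string
    return str(value)                       # numbers etc. pass through as-is

def encode_row(row):
    return ",".join(encode_field(v) for v in row)

print(encode_row([1, "O'Malley", None, "NULL"]))
# 1,'O\'Malley',NULL,'NULL'
```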
>>
>> In the MySQL case specifically, I've looked into the mysqldump output
>> rather than the "LOAD DATA"/"SELECT INTO OUTFILE" statements, because
>> "LOAD DATA" requires the file to exist on the database machine, whereas
>> mysqldump/mysqlimport allows us to import data into the database from
>> any machine on the Hadoop cluster.
>>
>> Jarcec
>>
>> Links:
>> 1:
>> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
>>
>> > On Dec 1, 2014, at 11:55 AM, Gwen Shapira wrote:
>> >
>> > Agreed. I hope we'll have at least one direct connector real soon now
>> > to prove it.
>> >
>> > Reading this:
>> > http://dev.mysql.com/doc/refman/5.6/en/load-data.html
>> > was a bit discouraging...
>> >
>> > On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek
>> > wrote:
>> >> My understanding is that MySQL and PostgreSQL can output CSV in the
>> >> suggested format.
>> >>
>> >> NOTE: the getTextData() and setTextData() APIs are effectively
>> >> useless if a reduced processing load is not possible.
>> >>
>> >> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira
>> >> wrote:
>> >>
>> >>> (hijacking the thread a bit for a related point)
>> >>>
>> >>> I have some misgivings around how we manage the IDF now.
>> >>>
>> >>> We go with a pretty specific CSV format in order to avoid extra
>> >>> processing for the MySQL/Postgres direct connectors.
>> >>> I think the intent is to allow running LOAD DATA without any
>> >>> processing.
>> >>> Therefore we need to research and document the specific formats
>> >>> required by MySQL and Postgres. Both DBs have pretty specific (and
>> >>> often funky) formatting requirements (if escaping is not used, NULL
>> >>> is written as "NULL"; otherwise as \N...).
>> >>>
>> >>> If zero-processing load is not feasible, I'd reconsider the IDF and
>> >>> lean toward a more structured format (Avro?). If the connectors need
>> >>> to parse the CSV and modify it, we are not gaining anything here.
>> >>> Or,
>> >>> at the very least, benchmark to validate that CSV+processing is
>> >>> still the fastest / least-CPU option.
>> >>>
>> >>> Gwen
>> >>>
>> >>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek
>> >>> wrote:
>> >>>> Indeed. I created SQOOP-1678 to address #1. Let me re-define
>> >>>> it...
>> >>>>
>> >>>> Also, for #2... There are a few ways of generating output. It
>> >>>> seems NULL values range from "\N" to 0x0 to "NULL". I think
>> >>>> keeping NULL makes sense.
>> >>>>
>> >>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <
>> >>>> jarcec@apache.org> wrote:
>> >>>>
>> >>>>> I share the same point of view as Gwen. The CSV format for the
>> >>>>> IDF is very strict so that we have a minimal surface area for
>> >>>>> inconsistencies between multiple connectors. This is because the
>> >>>>> IDF is an agreed-upon exchange format when transferring data from
>> >>>>> one connector to the other. That, however, shouldn't stop one
>> >>>>> connector (such as HDFS) from offering ways to save the resulting
>> >>>>> CSV differently.
>> >>>>>
>> >>>>> We had a similar discussion about separator and quote characters
>> >>>>> in SQOOP-1522 that seems relevant to the NULL discussion here.
>> >>>>>
>> >>>>> Jarcec
>> >>>>>
>> >>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira wrote:
>> >>>>>>
>> >>>>>> I think it's two different things:
>> >>>>>>
>> >>>>>> 1. The HDFS connector should give more control over the
>> >>>>>> formatting of the data in text files (nulls, escaping, etc.).
>> >>>>>> 2. The IDF should give NULLs in a format that is optimized for
>> >>>>>> the MySQL/Postgres direct connectors (since that's one of the
>> >>>>>> IDF design goals).
>> >>>>>>
>> >>>>>> Gwen
>> >>>>>>
>> >>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek
>> >>>>>> wrote:
>> >>>>>>> Hey guys,
>> >>>>>>>
>> >>>>>>> Any thoughts on where configurable NULL values should be?
>> >>>>>>> Either the IDF or HDFS connector?
>> >>>>>>>
>> >>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>> >>>>>>>
>> >>>>>>> -Abe
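[Editor's note: Abe's original question — where a configurable NULL should
live — can be made concrete with a small sketch. Everything below is
hypothetical (invented names, not the SQOOP-1678 design): the idea is that
the IDF keeps one canonical null marker, and the output connector rewrites
it to whatever the target expects, e.g. \N for MySQL LOAD DATA or the
literal string NULL.]

```python
# Hypothetical connector-side null substitution. A null field arrives from
# the IDF layer as Python's None here; the connector's configured
# null_string decides how it is rendered in the output text file.
# format_record is an invented name, not a Sqoop 2 API.

def format_record(fields, null_string="\\N"):
    """Render one record as a text line, substituting the configured
    null string for null fields."""
    return ",".join(null_string if f is None else str(f) for f in fields)

# MySQL LOAD DATA-friendly output (default \N):
print(format_record([1, None, "abc"]))                      # 1,\N,abc
# Same record with a different configured null string:
print(format_record([1, None, "abc"], null_string="NULL"))  # 1,NULL,abc
```

Keeping the substitution in the connector rather than the IDF matches
Jarcec's point above: the exchange format stays strict, while each
connector remains free to write its output differently.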