Date: Mon, 1 Dec 2014 11:55:00 -0800
Subject: Re: Configurable NULL in IDF or Connector?
From: Gwen Shapira
To: "dev@sqoop.apache.org"

Agreed. I hope we'll have at least one direct connector real soon now to prove it.
Reading this: http://dev.mysql.com/doc/refman/5.6/en/load-data.html
was a bit discouraging...

On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek wrote:

> My understanding is that MySQL and PostgreSQL can output to CSV in the
> suggested format.
>
> NOTE: getTextData() and setTextData() APIs are effectively useless if
> reduced processing load is not possible.
>
> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira wrote:
>
>> (hijacking the thread a bit for a related point)
>>
>> I have some misgivings around how we manage the IDF now.
>>
>> We go with a pretty specific CSV in order to avoid extra-processing
>> for MySQL/Postgres direct connectors.
>> I think the intent is to allow running LOAD DATA without any processing.
>> Therefore we need to research and document the specific formats
>> required by MySQL and Postgres. Both DBs have pretty specific (and
>> often funky) formatting they need (If escaping is not used then NULL
>> is null, otherwise \N...)
>>
>> If zero-processing load is not feasible, I'd re-consider the IDF and
>> lean toward a more structured format (Avro?). If the connectors need
>> to parse the CSV and modify it, we are not gaining anything here. Or
>> at the very least benchmark to validate that CSV+processing is still
>> the fastest / least CPU option.
>>
>> Gwen
>>
>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek wrote:
>> > Indeed. I created SQOOP-1678, which is intended to address #1. Let me
>> > re-define it...
>> >
>> > Also, for #2... There are a few ways of generating output. It seems NULL
>> > values range from "\N" to 0x0 to "NULL". I think keeping NULL makes sense.
>> >
>> > On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho wrote:
>> >
>> >> I do share the same point of view as Gwen. The CSV format for the IDF is
>> >> very strict so that we have minimal surface area for inconsistencies
>> >> between multiple connectors. This is because the IDF is an agreed-upon
>> >> exchange format when transferring data from one connector to the other.
>> >> That, however, shouldn't stop one connector (such as HDFS) from offering
>> >> ways to save the resulting CSV differently.
>> >>
>> >> We had a similar discussion about separator and quote characters in
>> >> SQOOP-1522 that seems to be relevant to the NULL discussion here.
>> >>
>> >> Jarcec
>> >>
>> >> > On Dec 1, 2014, at 10:42 AM, Gwen Shapira wrote:
>> >> >
>> >> > I think it's two different things:
>> >> >
>> >> > 1. HDFS connector should give more control over the formatting of the
>> >> > data in text files (nulls, escaping, etc.)
>> >> > 2. IDF should give NULLs in a format that is optimized for
>> >> > MySQL/Postgres direct connectors (since that's one of the IDF design
>> >> > goals).
>> >> >
>> >> > Gwen
>> >> >
>> >> > On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek wrote:
>> >> >> Hey guys,
>> >> >>
>> >> >> Any thoughts on where configurable NULL values should be? Either the
>> >> >> IDF or HDFS connector?
>> >> >>
>> >> >> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>> >> >>
>> >> >> -Abe
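
The NULL question in this thread can be made concrete with a small sketch. It is an illustration only, not Sqoop code: `encode_field` and `encode_row` are hypothetical helpers, and the single-quoted strings and unquoted `NULL` token are assumptions about the IDF's CSV text format based on the discussion above. The sketch renders the same row once with `NULL` and once with `\N`, the marker the MySQL LOAD DATA documentation describes when field escaping is enabled.

```python
def encode_field(value, null_token="NULL"):
    """Render one field as CSV text (hypothetical helper, not a Sqoop API)."""
    if value is None:
        # A missing value is written as the bare, unquoted null token.
        return null_token
    if isinstance(value, str):
        # Assumption: strings are single-quoted, with backslash escaping,
        # so the text 'NULL' stays distinct from a real NULL.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return str(value)


def encode_row(row, null_token="NULL"):
    """Join the encoded fields of one row with commas."""
    return ",".join(encode_field(v, null_token) for v in row)


row = (1, None, "NULL")

# Assumed IDF-style text: the missing value is the unquoted word NULL.
idf_line = encode_row(row)

# Same row using \N as the null marker instead.
escaped_null_line = encode_row(row, null_token="\\N")

print(idf_line)
print(escaped_null_line)
```

If a direct connector had to rewrite every line from the first form into the second before loading, that rewrite would be exactly the extra processing the thread is trying to avoid.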