sqoop-dev mailing list archives

From Jarek Jarcec Cecho <jar...@apache.org>
Subject Re: Configurable NULL in IDF or Connector?
Date Mon, 01 Dec 2014 23:16:53 GMT
We’ve investigated mysqldump, pg_dump, and a few others already; the results are on the wiki
[1]. The resulting CSV-ish specification follows those two very closely.

In the MySQL case specifically, I’ve looked into the mysqldump output rather than the “LOAD DATA”/“SELECT
INTO OUTFILE” statements, because “LOAD DATA” requires the file to exist on the database
machine, whereas mysqldump/mysqlimport allow us to import data into the database from any machine
on the Hadoop cluster.


1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
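For reference, the NULL-handling quirk discussed in the quoted thread below can be sketched as follows. This is illustrative Python, not Sqoop code, and the function names are made up; it encodes the behavior described in the MySQL LOAD DATA / SELECT ... INTO OUTFILE documentation: NULL is written as \N when an escape character is in effect, and as the literal word NULL when escaping is disabled.

```python
# Sketch (not Sqoop code): the two MySQL NULL conventions for CSV-ish output,
# per the LOAD DATA / SELECT ... INTO OUTFILE docs.
def encode_field(value, escaped=True):
    """Encode one field for a CSV-ish row; None stands for SQL NULL."""
    if value is None:
        # With a FIELDS ESCAPED BY character, NULL is written as \N;
        # with escaping disabled, it is written as the literal word NULL.
        return "\\N" if escaped else "NULL"
    return str(value)

def encode_row(row, escaped=True, sep=","):
    """Join the encoded fields of one row with the given separator."""
    return sep.join(encode_field(v, escaped) for v in row)

print(encode_row([1, None, "abc"]))                 # 1,\N,abc
print(encode_row([1, None, "abc"], escaped=False))  # 1,NULL,abc
```

A direct connector that wants zero-processing load has to match whichever of these two conventions the target database expects, which is exactly the tension discussed below.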

> On Dec 1, 2014, at 11:55 AM, Gwen Shapira <gshapira@cloudera.com> wrote:
> Agreed. I hope we'll have at least one direct connector real soon now
> to prove it.
> Reading this:
> http://dev.mysql.com/doc/refman/5.6/en/load-data.html
> was a bit discouraging...
> On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
>> My understanding is that MySQL and PostgreSQL can output to CSV in the
>> suggested format.
>> NOTE: getTextData() and setTextData() APIs are effectively useless if
>> reduced processing load is not possible.
>> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <gshapira@cloudera.com> wrote:
>>> (hijacking the thread a bit for a related point)
>>> I have some misgivings around how we manage the IDF now.
>>> We go with a pretty specific CSV in order to avoid extra-processing
>>> for MySQL/Postgres direct connectors.
>>> I think the intent is to allow running LOAD DATA without any processing.
>>> Therefore we need to research and document the specific formats
>>> required by MySQL and Postgres. Both DBs have pretty specific (and
>>> often funky) formatting requirements (if escaping is not used, NULL
>>> is the literal NULL, otherwise \N...)
>>> If zero-processing load is not feasible, I'd re-consider the IDF and
>>> lean toward a more structured format (Avro?).  If the connectors need
>>> to parse the CSV and modify it, we are not gaining anything here. Or
>>> at the very least benchmark to validate that CSV+processing is still
>>> the fastest / least CPU option.
>>> Gwen
>>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <abe@cloudera.com>
>>> wrote:
>>>> Indeed. I created SQOOP-1678 to address #1. Let me re-define
>>>> it...
>>>> Also, for #2... There are a few ways of generating output. It seems NULL
>>>> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes
>>>> sense.
>>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <jarcec@apache.org>
>>>> wrote:
>>>>> I do share the same point of view as Gwen. The CSV format for the IDF is
>>>>> very
>>>>> strict so that we have minimal surface area for inconsistencies between
>>>>> multiple connectors. This is because the IDF is an agreed upon exchange
>>>>> format when transferring data from one connector to the other. That,
>>>>> however, shouldn't stop one connector (such as HDFS) from offering ways
>>>>> to save the resulting CSV differently.
>>>>> We had similar discussion about separator and quote characters in
>>>>> SQOOP-1522 that seems to be relevant to the NULL discussion here.
>>>>> Jarcec
>>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <gshapira@cloudera.com> wrote:
>>>>>> I think it's two different things:
>>>>>> 1. HDFS connector should give more control over the formatting of
>>>>>> data in text files (nulls, escaping, etc)
>>>>>> 2. IDF should give NULLs in a format that is optimized for
>>>>>> MySQL/Postgres direct connectors (since that's one of the IDF design
>>>>>> goals).
>>>>>> Gwen
>>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>>>>>> Hey guys,
>>>>>>> Any thoughts on where configurable NULL values should be? Either the
>>>>>>> IDF or the HDFS connector?
>>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>>>>>>> -Abe
