Gwen, we’ve investigated mysqldump, pg_dump and a few others already; the results are on the wiki [1]. The resulting CSV-ish specification follows those two very closely. In the MySQL case specifically, I’ve looked into mysqldump output rather than the “LOAD DATA”/“SELECT INTO OUTFILE” statements, because “LOAD DATA” requires the file to exist on the database machine, whereas mysqldump/mysqlimport allows us to import data into the database from any machine on the Hadoop cluster. (A rough, hypothetical sketch of a load that consumes this CSV as-is is appended after the quoted thread below.)

Jarcec

Links:
1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation

> On Dec 1, 2014, at 11:55 AM, Gwen Shapira wrote:
>
> Agreed. I hope we'll have at least one direct connector real soon now
> to prove it.
>
> Reading this:
> http://dev.mysql.com/doc/refman/5.6/en/load-data.html
> was a bit discouraging...
>
> On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek wrote:
>> My understanding is that MySQL and PostgreSQL can output to CSV in the
>> suggested format.
>>
>> NOTE: the getTextData() and setTextData() APIs are effectively useless if
>> a reduced-processing load is not possible.
>>
>> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira wrote:
>>
>>> (hijacking the thread a bit for a related point)
>>>
>>> I have some misgivings around how we manage the IDF now.
>>>
>>> We go with a pretty specific CSV in order to avoid extra processing
>>> for the MySQL/Postgres direct connectors.
>>> I think the intent is to allow running LOAD DATA without any processing.
>>> Therefore we need to research and document the specific formats
>>> required by MySQL and Postgres. Both DBs have pretty specific (and
>>> often funky) formatting they need (if escaping is not used then NULL
>>> is the literal word NULL, otherwise \N...).
>>>
>>> If a zero-processing load is not feasible, I'd reconsider the IDF and
>>> lean toward a more structured format (Avro?). If the connectors need
>>> to parse the CSV and modify it, we are not gaining anything here. Or
>>> at the very least benchmark to validate that CSV+processing is still
>>> the fastest / least CPU option.
>>>
>>> Gwen
>>>
>>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek wrote:
>>>> Indeed. I created SQOOP-1678, which is intended to address #1. Let me
>>>> re-define it...
>>>>
>>>> Also, for #2... There are a few ways of generating output. It seems NULL
>>>> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes
>>>> sense.
>>>>
>>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho wrote:
>>>>
>>>>> I do share the same point of view as Gwen. The CSV format for the IDF
>>>>> is very strict so that we have a minimal surface area for
>>>>> inconsistencies between multiple connectors. This is because the IDF
>>>>> is an agreed-upon exchange format when transferring data from one
>>>>> connector to the other. That, however, shouldn't stop one connector
>>>>> (such as HDFS) from offering ways to save the resulting CSV
>>>>> differently.
>>>>>
>>>>> We had a similar discussion about separator and quote characters in
>>>>> SQOOP-1522 that seems to be relevant to the NULL discussion here.
>>>>>
>>>>> Jarcec
>>>>>
>>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira wrote:
>>>>>>
>>>>>> I think it's two different things:
>>>>>>
>>>>>> 1. The HDFS connector should give more control over the formatting of
>>>>>> the data in text files (nulls, escaping, etc.).
>>>>>> 2. The IDF should give NULLs in a format that is optimized for the
>>>>>> MySQL/Postgres direct connectors (since that's one of the IDF design
>>>>>> goals).
>>>>>>
>>>>>> Gwen
>>>>>>
>>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek wrote:
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> Any thoughts on where configurable NULL values should be? Either the
>>>>>>> IDF or HDFS connector?
>>>>>>>
>>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>>>>>>>
>>>>>>> -Abe
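
To make the "zero-processing" load discussed above concrete, here is a minimal, hypothetical sketch of the MySQL side. It is not actual connector code: the class and method names are invented, the per-record CSV strings stand in for whatever getTextData() would hand back, and the separator/quote/escape values in the FIELDS/LINES clauses are only illustrative; they would have to match the IDF CSV specification exactly. The one MySQL-specific point relevant to the NULL discussion is that with an ESCAPED BY character in effect, LOAD DATA reads \N as SQL NULL, while with escaping disabled it expects the literal word NULL.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Hypothetical sketch only: spools IDF CSV records to a temporary file without
 * parsing or rewriting them, then asks MySQL to load that file. If the FIELDS /
 * LINES clauses below did not match the IDF CSV spec exactly, the connector
 * would have to rewrite the data and the "zero-processing" benefit is lost.
 */
public class MySqlDirectLoadSketch {

  public static void load(Iterable<String> csvRecords, Connection conn, String table)
      throws IOException, SQLException {
    // Write each record exactly as received -- no unescaping, no NULL rewriting.
    Path tmp = Files.createTempFile("sqoop-idf-", ".csv");
    try (BufferedWriter out = Files.newBufferedWriter(tmp)) {
      for (String record : csvRecords) {
        out.write(record);
        out.newLine();
      }
    }

    // LOAD DATA LOCAL INFILE reads the file from the machine running this code.
    // With ESCAPED BY '\\' in effect, MySQL interprets \N as SQL NULL (the
    // mysqldump-style encoding); the separator/quote choices are illustrative.
    String sql = "LOAD DATA LOCAL INFILE '" + tmp.toAbsolutePath() + "'"
        + " INTO TABLE " + table
        + " FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\\'' ESCAPED BY '\\\\'"
        + " LINES TERMINATED BY '\\n'";
    try (Statement stmt = conn.createStatement()) {
      stmt.execute(sql);
    } finally {
      Files.delete(tmp);
    }
  }
}

Note that Connector/J typically needs allowLoadLocalInfile enabled for the LOCAL variant to work, and a production version would have to quote the file path and table name properly; the point of the sketch is only that the CSV text passes through untouched.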