pig-dev mailing list archives

From "Le Clue (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4572) CSVExcelStorage treats newlines within fields as record separator when input file is split
Date Tue, 26 May 2015 07:48:17 GMT

     [ https://issues.apache.org/jira/browse/PIG-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Le Clue updated PIG-4572:
-------------------------
    Attachment: SmallTest.txt

Sample Input Data
3190 Bytes

> CSVExcelStorage treats newlines within fields as record separator when input file is split
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-4572
>                 URL: https://issues.apache.org/jira/browse/PIG-4572
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon ElasticMapReduce AMI 3.6.0
> Apache Pig version 0.14.0 and 0.12.0
> Hadoop 2.4.0
>            Reporter: Le Clue
>              Labels: CSVExcelStorage, pig
>         Attachments: SmallTest.txt
>
>
> When a field enclosed in double quotes contains a carriage return or linefeed, and the
> input file is larger than the DFS block size, the input split does not appear to honor
> CSVExcelStorage's treatment of newlines within fields.
> The input appears to be split at the linefeed closest to the byte range defined for the
> split, which causes fields to become skewed.
> For example, take a 3190-byte text file containing 21 identical records such as the one below:
> "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
> This is the second line.
> Thank you for listening."~"2012-08-24 09:16:02"
> Each line here is terminated by a CRLF.
> The file is run through the following Pig script:
> SET mapred.min.split.size 1024;
> SET mapred.max.split.size 1024;
> SET pig.noSplitCombination true;
> SET mapred.max.jobs.per.node 1;
> myinput_file = LOAD 's3://sourcebucket/inputfile.txt'
>   USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE', 'WINDOWS')
> AS(
>   name:chararray,
>   sysid:chararray,
>   message:chararray,
>   messagedate:chararray
> );
> myinput_tuples = FOREACH myinput_file GENERATE name;
> STORE myinput_tuples INTO '/output052/' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
> This results in 4 output part files (plus the _SUCCESS marker):
> -rw-r--r--   1 hadoop supergroup          0 2015-05-26 07:19 /output052/_SUCCESS
> -rw-r--r--   1 hadoop supergroup         63 2015-05-26 07:19 /output052/part-m-00000
> -rw-r--r--   1 hadoop supergroup         54 2015-05-26 07:19 /output052/part-m-00001
> -rw-r--r--   1 hadoop supergroup        769 2015-05-26 07:19 /output052/part-m-00002
> -rw-r--r--   1 hadoop supergroup         25 2015-05-26 07:19 /output052/part-m-00003
> [hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00000
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00001
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00002
> This is the second line.
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> [hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00003
> This is the second line.
> Skewing occurs in the third part file (part-m-00002).
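
The byte arithmetic behind the report can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Hadoop or piggybank code path: the file contents are reconstructed from the sample record quoted above (the attachment itself is not reproduced here), the last record is assumed to have no trailing CRLF so that the total comes to 3190 bytes, and the helper names (`resync_after_linefeed`, `inside_quoted_field`, `describe_splits`) are hypothetical. The reader model is deliberately simplified: every split after the first resumes just past the next linefeed, with no knowledge of quoting.

```python
# Hypothetical reconstruction of the sample input from the record quoted in the report.
RECORD = (
    b'"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:\r\n'
    b'This is the second line.\r\n'
    b'Thank you for listening."~"2012-08-24 09:16:02"\r\n'
)
# Assumption: the final record has no trailing CRLF, giving the reported 3190 bytes.
DATA = (RECORD * 21)[:-2]
SPLIT_SIZE = 1024  # matches mapred.min/max.split.size in the script above


def resync_after_linefeed(data: bytes, split_start: int) -> int:
    """Offset where a simplified line-oriented reader starts reading a split.

    The first split starts at 0; any later split skips to just past the
    next LF at or after its start offset, ignoring quotes entirely.
    """
    if split_start == 0:
        return 0
    lf = data.find(b"\n", split_start)
    return len(data) if lf == -1 else lf + 1


def inside_quoted_field(data: bytes, pos: int) -> bool:
    """True if an odd number of double quotes precede pos.

    Sufficient for this sample, which contains no escaped quotes.
    """
    return data.count(b'"', 0, pos) % 2 == 1


def describe_splits(data: bytes, split_size: int):
    """Return (split_start, resume_offset, resumes_inside_quotes) per split."""
    rows = []
    for start in range(0, len(data), split_size):
        pos = resync_after_linefeed(data, start)
        rows.append((start, pos, inside_quoted_field(data, pos)))
    return rows


if __name__ == "__main__":
    for start, pos, broken in describe_splits(DATA, SPLIT_SIZE):
        state = "inside a quoted field" if broken else "at a record boundary"
        print(f"split at {start:4d} resumes at {pos:4d}, {state}")
```

Under these assumptions, the splits starting at bytes 2048 and 3072 resume in the middle of the quoted message field, at "This is the second line.", which is consistent with the first line of part-m-00002 and part-m-00003 above. The split starting at 1024 happens to resume on a record boundary.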



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
