Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Date: Tue, 23 May 2017 11:52:04 +0000 (UTC)
From: "Till Schäfer (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.13073928.1495471405000.271833.1495540324111@Atlassian.JIRA>
In-Reply-To: <JIRA.13073928.1495471405000@Atlassian.JIRA>
References: <JIRA.13073928.1495471405000@Atlassian.JIRA> <JIRA.13073928.1495471405300@jira-lw-us.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-6891) TextInputFormat: duplicate
 records with custom delimiter
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Tue, 23 May 2017 11:52:08 -0000


    [ https://issues.apache.org/jira/browse/MAPREDUCE-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021088#comment-16021088 ]

Till Schäfer commented on MAPREDUCE-6891:
-----------------------------------------

It seems that you are right. I have retested this with Hadoop 2.7.3 and it is no longer reproducible.

> TextInputFormat: duplicate records with custom delimiter
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-6891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6891
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Till Schäfer
>
> When using a custom delimiter for TextInputFormat, the resulting blocks are not correct under some circumstances: the total number of records is wrong and some entries are duplicated.
> I have created a reproducible test case.
> Generate a file:
> {code:bash}
> for i in $(seq 1 10000000); do
>   echo -n $i >> long_delimiter-1to10000000-with_newline.txt;
>   echo "--------------------------------------------" >> long_delimiter-1to10000000-with_newline.txt;
> done
> {code}
> Java test to reproduce the error:
> {code:java}
> public static void longDelimiterBug(JavaSparkContext sc) {
>     Configuration hadoopConf = new Configuration();
>     String delimitedFile = "long_delimiter-1to10000000-with_newline.txt";
>     hadoopConf.set("textinputformat.record.delimiter", "--------------------------------------------\n");
>     JavaPairRDD<LongWritable, Text> input = sc.newAPIHadoopFile(delimitedFile, TextInputFormat.class,
>             LongWritable.class, Text.class, hadoopConf);
>     List<String> values = input.map(t -> t._2.toString()).collect();
>     Assert.assertEquals(10000000, values.size());
>     for (int i = 0; i < 10000000; i++) {
>         boolean correct = values.get(i).equals(Integer.toString(i + 1));
>         if (!correct) {
>             logger.error("Wrong value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
>         } else {
>             logger.info("Correct value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
>         }
>         Assert.assertTrue(correct);
>     }
> }
> {code}
> This example fails with the error
> {quote}
> java.lang.AssertionError: expected:<10000000> but was:<10042616>
> {quote}
> When the assertion on the size of the collection is commented out, my log output ends like this:
> {quote}
> [main] INFO  edu.udo.cs.schaefer.testspark.Main  - Correct value for index 663244: expected 663245 -> got 663245
> [main] ERROR edu.udo.cs.schaefer.testspark.Main  - Wrong value for index 663245: expected 663246 -> got 660111
> {quote}
> After the wrong value for index 663245, the values are in ascending order again, continuing with 660112, 660113, ...
> The error is not reproducible with _\n_ as delimiter, i.e. when not using a custom delimiter.
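
The symptoms reported above (a surplus of 10042616 - 10000000 = 42616 records, and values that restart at an earlier number) can be checked independently of Hadoop and Spark once the values have been collected. The following is only a minimal sketch: `SequenceCheck`, `firstMismatch` and `duplicateCount` are hypothetical helper names invented for illustration, and the short sample list is made up to stand in for the collected records.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Hadoop or Spark.
final class SequenceCheck {

    // Returns the index of the first element that differs from its
    // 1-based position, or -1 if the whole list is 1, 2, 3, ...
    static int firstMismatch(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            if (!values.get(i).equals(Integer.toString(i + 1))) {
                return i;
            }
        }
        return -1;
    }

    // Counts elements whose value already occurred earlier in the list.
    static int duplicateCount(List<String> values) {
        Set<String> seen = new HashSet<>();
        int duplicates = 0;
        for (String value : values) {
            if (!seen.add(value)) {
                duplicates++;
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up sample: records 3 and 4 appear twice, as if a range
        // of the input had been read a second time.
        List<String> values = Arrays.asList("1", "2", "3", "4", "3", "4", "5");
        System.out.println("first mismatch at index " + firstMismatch(values)); // 4
        System.out.println("duplicates: " + duplicateCount(values));            // 2
    }
}
```

Applied to the collected `values` list from the test above, `duplicateCount` would be expected to report the surplus directly if every extra record is a repeat of an earlier one; that is an assumption about the failure mode, not something the log excerpt confirms.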

