From: John Pauley <John.Pauley@threattrack.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: [hadoop] AvroMultipleOutputs org.apache.avro.file.DataFileWriter$AppendWriteException
Date: Tue, 4 Mar 2014 15:25:24 +0000

Outside hadoop: avro-1.7.6
Inside hadoop: avro-mapred-1.7.6-hadoop2


From: Stanley Shi <sshi@gopivotal.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Monday, March 3, 2014 at 8:30 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: [hadoop] AvroMultipleOutputs org.apache.avro.file.DataFileWriter$AppendWriteException

which avro version are you using when running outside of hadoop?

Regards,
Stanley Shi,



On Mon, Mar 3, 2014 at 11:49 PM, John Pauley <John.Pauley@threattrack.com> wrote:

This is cross posted to the avro-user list (http://mail-archives.apache.org/mod_mbox/avro-user/201402.mbox/%3cCF3612F6.94D2%25john.pauley@threattrack.com%3e).

Hello all,

I'm having an issue using AvroMultipleOutputs in a map/reduce job. The issue occurs when a schema has a field whose type is a union of null and a fixed (among other complex types), the field defaults to null, and the actual value is not null. Please find the full stack trace below, along with a sample map/reduce job that generates an Avro container file and uses it as the m/r input. Note that I can serialize/deserialize without issue using GenericDatumWriter/GenericDatumReader outside of hadoop… Any insight would be helpful.
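
For reference, a minimal sketch of the kind of standalone round-trip I mean (the class name and file path here are illustrative, not part of the sample job below):

<standalone_roundtrip>
package com.tts.ox.mapreduce.example.avro;

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class StandaloneRoundTrip {
    public static void main(String[] args) throws Exception {
        // same shape as the SCHEMA in the m/r job below
        Schema schema = new Schema.Parser().parse(
                "{\"namespace\": \"com.foo.bar\", \"name\": \"simple_schema\", \"type\": \"record\","
                + " \"fields\": ["
                + "   {\"name\": \"foo\", \"type\": {\"name\": \"bar\", \"type\": \"fixed\", \"size\": 2}},"
                + "   {\"name\": \"baz\", \"type\": [\"null\", \"bar\"], \"default\": null}"
                + " ]}");
        Schema barSchema = schema.getField("foo").schema();

        byte[] dummy = {(byte) 0x01, (byte) 0x02};
        GenericRecord record = new GenericRecordBuilder(schema)
                .set("foo", new GenericData.Fixed(barSchema, dummy))
                .set("baz", new GenericData.Fixed(barSchema, dummy)) // non-null baz
                .build();

        // write with GenericDatumWriter: no exception outside hadoop
        File file = new File("/tmp/avrotest/standalone.avro");
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(record);
        writer.close();

        // read back with GenericDatumReader
        DataFileReader<GenericRecord> reader =
                new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>(schema));
        while (reader.hasNext()) {
            System.out.println(reader.next());
        }
        reader.close();
    }
}
</standalone_roundtrip>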

Stack trace:
java.lang.Exception: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: in com.foo.bar.simple_schema in union null of union in field baz of com.foo.bar.simple_schema
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: in com.foo.bar.simple_schema in union null of union in field baz of com.foo.bar.simple_schema
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:296)
    at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:77)
    at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:39)
    at org.apache.avro.mapreduce.AvroMultipleOutputs.write(AvroMultipleOutputs.java:400)
    at org.apache.avro.mapreduce.AvroMultipleOutputs.write(AvroMultipleOutputs.java:378)
    at com.tts.ox.mapreduce.example.avro.AvroContainerFileDriver$SampleMapper.map(AvroContainerFileDriver.java:78)
    at com.tts.ox.mapreduce.example.avro.AvroContainerFileDriver$SampleMapper.map(AvroContainerFileDriver.java:62)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)
Caused by: java.lang.NullPointerException: in com.foo.bar.simple_schema in union null of union in field baz of com.foo.bar.simple_schema
    at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:145)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:290)
    ... 16 more
Caused by: java.lang.NullPointerException
    at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:457)
    at org.apache.avro.specific.SpecificData.getSchema(SpecificData.java:189)
    at org.apache.avro.reflect.ReflectData.isRecord(ReflectData.java:167)
    at org.apache.avro.generic.GenericData.getSchemaName(GenericData.java:608)
    at org.apache.avro.specific.SpecificData.getSchemaName(SpecificData.java:265)
    at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:597)
    at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:151)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:71)
    at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143)
    at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
    at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:175)
    at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
    at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143)

Sample m/r job:
<mr_job>
package com.tts.ox.mapreduce.example.avro;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.File;
import java.io.IOException;

public class AvroContainerFileDriver extends Configured implements Tool {
    //
    // define a schema with a union of null and fixed
    private static final String SCHEMA = "{\n" +
            "    \"namespace\": \"com.foo.bar\",\n" +
            "    \"name\": \"simple_schema\",\n" +
            "    \"type\": \"record\",\n" +
            "    \"fields\": [{\n" +
            "        \"name\": \"foo\",\n" +
            "        \"type\": {\n" +
            "            \"name\": \"bar\",\n" +
            "            \"type\": \"fixed\",\n" +
            "            \"size\": 2\n" +
            "        }\n" +
            "    }, {\n" +
            "        \"name\": \"baz\",\n" +
            "        \"type\": [\"null\", \"bar\"],\n" +
            "        \"default\": null\n" +
            "    }]\n" +
            "}";

    public static class SampleMapper extends Mapper<AvroKey<GenericRecord>, NullWritable, NullWritable, NullWritable> {
        private AvroMultipleOutputs amos;

        @Override
        protected void setup(Context context) {
            amos = new AvroMultipleOutputs(context);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            amos.close();
        }

        @Override
        protected void map(AvroKey<GenericRecord> record, NullWritable ignore, Context context)
                throws IOException, InterruptedException {
            // simply write the record to a container using AvroMultipleOutputs
            amos.write("avro", new AvroKey<GenericRecord>(record.datum()), NullWritable.get());
        }
    }

    @Override
    public int run(final String[] args) throws Exception {
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(SCHEMA);

        //
        // generate avro container file for input to mapper
        byte[] dummy = {(byte) 0x01, (byte) 0x02};
        GenericData.Fixed foo = new GenericData.Fixed(schema.getField("foo").schema(), dummy);
        GenericData.Fixed baz = new GenericData.Fixed(schema.getField("baz").schema().getTypes().get(1), dummy);

        GenericRecordBuilder builder = new GenericRecordBuilder(schema)
                .set(schema.getField("foo"), foo);
        GenericRecord record0 = builder.build(); // baz is null

        builder.set(schema.getField("baz"), baz);
        GenericRecord record1 = builder.build(); // baz is not null, bad news

        File file = new File("/tmp/avrotest/input/test.avro");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
        dataFileWriter.create(schema, file);
        dataFileWriter.append(record0);
        //
        // HELP: job succeeds when we do not have a record with non-null baz; comment out to succeed
        //
        dataFileWriter.append(record1);
        dataFileWriter.close();

        //
        // configure and run job
        Configuration configuration = new Configuration();
        String[] otherArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
        Job job = Job.getInstance(configuration, "Sample Avro Map Reduce");

        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, schema);

        job.setMapperClass(SampleMapper.class);
        job.setNumReduceTasks(0);

        AvroJob.setOutputKeySchema(job, schema);
        AvroMultipleOutputs.addNamedOutput(job, "avro", AvroKeyOutputFormat.class, schema);

        FileInputFormat.addInputPath(job, new Path("/tmp/avrotest/input"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/avrotest/output"));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new AvroContainerFileDriver(), args);
        System.exit(exitCode);
    }
}
</mr_job>
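
One detail that may matter: the bottom of the trace resolves the union through ReflectData/SpecificData rather than the plain GenericDatumWriter path I use standalone. If that reflect-based resolution is the trigger, a possible (untested) workaround is to pin the job's data model to GenericData. This assumes avro-mapred 1.7.6's AvroJob exposes setDataModelClass; I have not verified that it changes the outcome:

<workaround_sketch>
package com.tts.ox.mapreduce.example.avro;

import org.apache.avro.generic.GenericData;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class GenericModelJobFactory {
    // Build the Job exactly as in run() above, then pin the data model to
    // GenericData so datum writing does not go through ReflectData (the
    // path shown in the trace). Hypothetical fix, untested.
    public static Job newGenericModelJob(Configuration configuration) throws Exception {
        Job job = Job.getInstance(configuration, "Sample Avro Map Reduce");
        AvroJob.setDataModelClass(job, GenericData.class); // assumed available in 1.7.6
        return job;
    }
}
</workaround_sketch>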

Thanks,
John Pauley
Sr. Software Engineer
ThreatTrack Security

