Subject: Re: Avro+Parquet
From: Kristoffer Sjögren
To: user@crunch.apache.org
Date: Wed, 4 Jun 2014 16:38:01 +0200

Sorry for the late reply. It was my fault. A refactoring changed the Java
package name without changing the Avro schema namespace in the $SCHEMA
field, which caused Avro to fall back on generic records. Works fine now!

On Tue, Jun 3, 2014 at 1:15 PM, Kristoffer Sjögren wrote:
> Thanks for the quick answer! My initial test still fails, but I may
> have done something wrong here. I will do a more thorough test asap.
>
> On Mon, Jun 2, 2014 at 2:53 PM, Micah Whitacre wrote:
>> I don't believe it is a known issue. I modified an AvroParquetPipelineIT[1]
>> to verify the output to a target using a source:
>>
>> @Test
>> public void toAvroParquetFileTargetFromParquet() throws Exception {
>>   GenericRecord savedRecord = new GenericData.Record(Person.SCHEMA$);
>>   savedRecord.put("name", "John Doe");
>>   savedRecord.put("age", 42);
>>   savedRecord.put("siblingnames", Lists.newArrayList("Jimmy", "Jane"));
>>   populateGenericParquetFile(Lists.newArrayList(savedRecord), Person.SCHEMA$);
>>
>>   Pipeline pipeline = new MRPipeline(AvroParquetPipelineIT.class,
>>       tmpDir.getDefaultConfiguration());
>>   PCollection<Person> genericCollection = pipeline.read(
>>       new AvroParquetFileSource<Person>(
>>           new Path(avroFile.getAbsolutePath()), Avros.records(Person.class)));
>>   File outputFile = tmpDir.getFile("output");
>>   Target parquetFileTarget = new AvroParquetFileTarget(outputFile.getAbsolutePath());
>>   pipeline.write(genericCollection, parquetFileTarget);
>>   pipeline.run();
>>
>>   Person person = genericCollection.materialize().iterator().next();
>>
>>   PCollection<Person> persistedCollection = pipeline.read(
>>       new AvroParquetFileSource<Person>(
>>           new Path(outputFile.getAbsolutePath()), Avros.records(Person.class)));
>>   Person persistedPerson = persistedCollection.materialize().iterator().next();
>>
>>   Path parquetFile = new Path(new File(outputFile, "part-m-00000.parquet").getPath());
>>
>>   AvroParquetReader<Person> reader = new AvroParquetReader<Person>(parquetFile);
>>
>>   try {
>>     Person readPerson = reader.read();
>>     assertThat(readPerson, is(person));
>>     assertThat(readPerson, is(persistedPerson));
>>   } finally {
>>     reader.close();
>>   }
>> }
>>
>> The test passes without any issues. There have been a number of fixes
>> since the 0.8.0-cdh4.3.0 version. You might try upgrading to the latest
>> version available (0.8.2+71-cdh4.6.0) and see if the problem still exists.
>> If it does still exist, a junit/integration test would be helpful to debug
>> this issue.
>> >> >> [1] - >> https://github.com/apache/crunch/blob/1d9b6cf3db6daa1ee6e0fa48dfd5966e82= 1c71a3/crunch-core/src/it/java/org/apache/crunch/io/parquet/AvroParquetPipe= lineIT.java#L120 >> >> >> On Mon, Jun 2, 2014 at 6:53 AM, Kristoffer Sj=C3=B6gren wrote: >>> >>> Hi >>> >>> Im trying to read and write data using the avro+parquet combo that >>> ships with crunch 0.8.0-cdh4.3.0. >>> >>> - The writer job looks like this. >>> >>> PCollection lines =3D ... >>> PCollection p =3D lines.parallelDo(new DoFn() { >>> @Override >>> public void process(String input, Emitter emitter) { >>> User user =3D User.newBuilder().setName(input).build(); >>> emitter.emit(user); >>> } >>> }, Avros.records(User.class)); >>> >>> AvroParquetFileSourceTarget fout =3D new >>> AvroParquetFileSourceTarget(out, Avros.records(User.class)); >>> pipeline.write(p, fout); >>> >>> - The reader job looks like this. >>> >>> AvroParquetFileSource file =3D new >>> AvroParquetFileSource(out, Avros.records(User.class)); >>> PCollection users =3D pipeline.read(file); >>> // this line fails with a ClassCastException >>> PCollection lines =3D users.parallelDo(new DoFn()= { >>> @Override >>> public void process(User user, Emitter emitter) { >>> emitter.emit(user.getName().toString()); >>> } >>> }, Writables.strings()); >>> >>> >>> However, the reader fails with a java.lang.ClassCastException? Is this >>> a know issue or am I doing something wrong? 
>>>
>>> Cheers,
>>> -Kristoffer
>>>
>>>
>>> java.lang.ClassCastException:
>>> org.apache.avro.generic.GenericData$Record cannot be cast to
>>> mapred.jobs.User
>>> at mapred.jobs.ParquetReaderJob$1.process(ParquetReaderJob.java:22)
>>> at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
>>> at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
>>> at org.apache.crunch.MapFn.process(MapFn.java:34)
>>> at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
>>> at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:110)
>>> at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:60)
>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>> at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:396)
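The resolution at the top of the thread (a Java package rename that was not mirrored in the Avro schema namespace held by the generated class's SCHEMA$ field, so Avro fell back to GenericData.Record) can be illustrated with a minimal, self-contained sketch. This is a hypothetical stand-in, not Avro code: the schema literal, package names, and the NamespaceCheck helper are all made up for illustration; in a real generated class the namespace would come from User.SCHEMA$.getNamespace().

```java
// Hypothetical sketch of the failure mode, with no Avro dependency.
// Avro can only materialize specific records when the schema namespace
// matches the Java package of the generated class; otherwise readers get
// GenericData.Record and downstream casts throw ClassCastException.
public class NamespaceCheck {

    // Crude stand-in for User.SCHEMA$.getNamespace(): pulls the
    // "namespace" value out of a schema JSON literal.
    static String schemaNamespace(String schemaJson) {
        int key = schemaJson.indexOf("\"namespace\"");
        int start = schemaJson.indexOf('"', schemaJson.indexOf(':', key) + 1) + 1;
        return schemaJson.substring(start, schemaJson.indexOf('"', start));
    }

    public static void main(String[] args) {
        // Schema embedded at code-generation time, before the refactoring.
        String schema = "{\"type\":\"record\",\"name\":\"User\","
            + "\"namespace\":\"old.pkg\",\"fields\":[]}";
        // Package the class lives in after the rename (as in the stack trace).
        String javaPackage = "mapred.jobs";

        String ns = schemaNamespace(schema);
        System.out.println(ns.equals(javaPackage)
            ? "ok: specific records will be used"
            : "mismatch: Avro will fall back to GenericData.Record");
    }
}
```

Regenerating the schema (or editing its namespace field) so that it matches the new package restores specific-record resolution, which is the fix Kristoffer describes.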