avro-user mailing list archives

From "Shirahatti, Nikhil" <snik...@telenav.com>
Subject Re: Avro Map Reduce Question: GenericRecord, renaming reduce output
Date Fri, 08 Jun 2012 21:04:13 GMT
The magic number check is failing, so presumably the top of the file has
some junk in it?

if (!Arrays.equals(DataFileConstants.MAGIC, magic))
      throw new IOException("Not a data file.");



I checked the input file (verified by a read operation), which has the same
schema. It starts with:
Obj^A^B^Vavro.schema<E0>^D

The reduce output file, however, has 0<tab> before that header:
0       Obj^A^B^Vavro.schema<E0>^D


This was what I did not expect. Maybe my previous email was unclear.
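
To confirm where the stray bytes sit, something like the following could check
the first four bytes of a stream directly. This is only a minimal sketch: the
names MagicCheck and startsWithAvroMagic are made up for illustration, the
magic bytes are hard-coded as 'O', 'b', 'j', 1 (rather than taken from
DataFileConstants.MAGIC) so that it runs without Avro on the classpath, and the
two sample byte strings merely mimic the headers shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MagicCheck {
    // An Avro data file begins with these four bytes: 'O', 'b', 'j', 1.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Hypothetical helper: true if the stream starts with the Avro magic bytes.
    static boolean startsWithAvroMagic(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        try {
            new DataInputStream(in).readFully(head);
        } catch (EOFException e) {
            return false; // fewer than four bytes available
        }
        return Arrays.equals(MAGIC, head);
    }

    public static void main(String[] args) throws IOException {
        // Mimics the input file header: Obj^A^B^Vavro.schema
        byte[] clean = "Obj\u0001\u0002\u0016avro.schema"
                .getBytes(StandardCharsets.ISO_8859_1);
        // Mimics the reduce output: 0<tab> in front of the same header
        byte[] prefixed = "0\tObj\u0001\u0002\u0016avro.schema"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(startsWithAvroMagic(new ByteArrayInputStream(clean)));
        System.out.println(startsWithAvroMagic(new ByteArrayInputStream(prefixed)));
    }
}
```

With the 0<tab> in front, the first four bytes read are '0', tab, 'O', 'b',
so the comparison fails, which matches the "Not a data file." exception.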

Thanks,
Nikhil

On 6/8/12 1:35 PM, "Shirahatti, Nikhil" <snikhil@telenav.com> wrote:

>The reason is that when I try to read the file using a GenericDatumReader,
>I get the error: Not a data file.
>
>
>Code snippet:
>--------------
>DatumReader<GenericData.Record> reader =
>		new GenericDatumReader<GenericData.Record>(AVRO_SCHEMA);
>
>String MUXDEMUX_FILE = outpath.concat("part-r-00000");
>InputStream in = new BufferedInputStream(new FileInputStream(MUXDEMUX_FILE));
>DataFileStream<GenericData.Record> records =
>		new DataFileStream<GenericData.Record>(in, reader);
>for (GenericData.Record r : records)
>{
>	System.out.println(r.toString());
>}
>
>
>
>Nikhil
>
>On 6/8/12 12:17 PM, "Doug Cutting" <cutting@apache.org> wrote:
>
>>On Fri, Jun 8, 2012 at 11:49 AM, snikhil0 <snikhil@telenav.com> wrote:
>>> My expectation is that I can use the same input schema to read the
>>> output file. But alas this is not working.
>>> In the part-r-00000 I have a 0<tab>Obj<Avroschema>....datums......
>>> Why is this?
>>
>>That looks approximately like an Avro data file.  How is it not what you
>>expect?
>>
>>> Also how can I rename the reduce output file to something other than
>>> part-r-0000*?
>>
>>That's the standard name for Hadoop mapreduce output files.  You could
>>override it in the OutputFormat, but most folks do not.  The name of
>>the directory these are in is normally used to identify the result
>>set.  The files within the directory are just fragments of that result
>>set.
>>
>>Doug
>

