hadoop-common-user mailing list archives

From Youssef Hatem <youssef.ha...@rwth-aachen.de>
Subject Re: Problem with streaming exact binary chunks
Date Thu, 10 Oct 2013 12:24:24 GMT
Hello,

Thanks a lot for the information. It helped me figure out the solution to this problem.

I posted a sketch of the solution on StackOverflow (http://stackoverflow.com/a/19295610/337194)
for anybody who is interested.

Best regards,
Youssef Hatem

On Oct 9, 2013, at 14:08, Peter Marron wrote:

> Hi,
> 
> The only way that I could find was to override the various InputWriter and OutputWriter
> classes, as defined by the configuration settings:
> stream.map.input.writer.class
> stream.map.output.reader.class
> stream.reduce.input.writer.class
> stream.reduce.output.reader.class
> which was painful. Hopefully someone will tell you the _correct_ way to do this.
> If not I will provide more details.
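As a rough illustration of what such an override buys you, the sketch below mimics the framing of streaming's raw-bytes mode using only java.io, with a simplified stand-in for Hadoop's InputWriter: each record is emitted as a 4-byte big-endian length followed by the payload verbatim, so embedded \n and \t bytes survive. The class and method names here are illustrative, not Hadoop's actual API.

```java
import java.io.ByteArrayOutputStream;

// Simplified stand-in for a custom streaming InputWriter: frame each
// record as a 4-byte big-endian length followed by the raw payload,
// with no text-oriented escaping of '\n' or '\t'.
public class RawBytesFraming {

    // Length-prefix the bytes and emit them verbatim.
    public static byte[] frame(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = payload.length;
        out.write((n >>> 24) & 0xff);  // 4-byte big-endian length
        out.write((n >>> 16) & 0xff);
        out.write((n >>> 8) & 0xff);
        out.write(n & 0xff);
        out.write(payload, 0, n);      // payload bytes, untouched
        return out.toByteArray();
    }

    // Render bytes as the space-separated hex used in this thread.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] record = {1, 2, 3, '\n', '\n', 6, 7, 8};
        // The 0a bytes survive because nothing here treats them as
        // record separators.
        System.out.println(hex(frame(record)));
    }
}
```

Running this on the 8-byte record from the message below prints `00 00 00 08 01 02 03 0a 0a 06 07 08`, i.e. the bytes arrive intact.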
> 
> Regards,
> 
> Peter Marron
> Trillium Software UK Limited
> 
> Tel : +44 (0) 118 940 7609
> Fax : +44 (0) 118 940 7699
> E: Peter.Marron@TrilliumSoftware.com
> 
> -----Original Message-----
> From: Youssef Hatem [mailto:youssef.hatem@rwth-aachen.de] 
> Sent: 09 October 2013 12:14
> To: user@hadoop.apache.org
> Subject: Problem with streaming exact binary chunks
> 
> Hello,
> 
> I wrote a very simple InputFormat and RecordReader to send binary data to mappers. Binary
> data can contain anything (including \n, \t, \r); here is what next() may actually send:
> 
> public class MyRecordReader implements
>        RecordReader<BytesWritable, BytesWritable> {
>    ...
>    public boolean next(BytesWritable key, BytesWritable ignore)
>            throws IOException {
>        ...
> 
>        byte[] result = new byte[8];
>        for (int i = 0; i < result.length; ++i)
>            result[i] = (byte)(i+1);
>        result[3] = (byte)'\n';
>        result[4] = (byte)'\n';
> 
>        key.set(result, 0, result.length);
>        return true;
>    }
> }
> 
> As you can see, I am using BytesWritable to send eight bytes: 01 02 03 0a 0a 06 07 08.
> I also use HADOOP-1722 typed bytes (by setting -D stream.map.input=typedbytes).
> 
> According to the documentation of typed bytes, the mapper should receive the following
> byte sequence:
> 00 00 00 08 01 02 03 0a 0a 06 07 08
> 
> However, the bytes are somehow modified, and I get the following sequence instead:
> 00 00 00 08 01 02 03 09 0a 09 0a 06 07 08
> 
> 0a = '\n'
> 09 = '\t'
> 
> It seems that Hadoop (streaming?) parsed the newline character as a record separator and
> inserted '\t', which I assume is the key/value separator for streaming.
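That reading is consistent with the bytes: every 0a in the payload has acquired a 09 in front of it, while the 00 00 00 08 length header is untouched. As a purely illustrative reconstruction (not Hadoop's actual code path), inserting a '\t' before each '\n' in the payload reproduces the observed payload byte for byte:

```java
import java.io.ByteArrayOutputStream;

// Illustrative reconstruction of the suspected corruption: text-oriented
// framing that inserts a '\t' (09) before every '\n' (0a) it finds in
// what should be opaque binary data. This mimics the observed effect;
// it is not Hadoop's actual code.
public class NewlineEscapeDemo {

    public static byte[] corrupt(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : payload) {
            if (b == (byte) '\n') out.write('\t'); // separator added by text framing
            out.write(b);
        }
        return out.toByteArray();
    }

    // Render bytes as the space-separated hex used in this thread.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, '\n', '\n', 6, 7, 8};
        System.out.println(hex(corrupt(payload)));
    }
}
```

Applied to the payload 01 02 03 0a 0a 06 07 08, this yields 01 02 03 09 0a 09 0a 06 07 08, matching the corrupted sequence reported above.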
> 
> Is there any workaround to send *exactly* the same byte sequence, no matter what characters
> appear in it? Thanks in advance.
> 
> Best regards,
> Youssef Hatem

