hadoop-hdfs-user mailing list archives

From Peter Marron <Peter.Mar...@trilliumsoftware.com>
Subject RE: Problem with streaming exact binary chunks
Date Wed, 09 Oct 2013 12:08:15 GMT
Hi,

The only way that I could find was to override the various InputWriter and OutputReader classes,
as defined by the configuration settings
stream.map.input.writer.class
stream.map.output.reader.class
stream.reduce.input.writer.class
stream.reduce.output.reader.class
which was painful. Hopefully someone will tell you the _correct_ way to do this.
If not, I will provide more details.
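For reference, the byte layout such an overridden writer needs to emit is simply a 4-byte
big-endian length followed by the payload, passed through untouched. A minimal sketch of that
framing using only java.io (the class and method names here are illustrative, not Hadoop's):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawFraming {
    // Frame a payload as a 4-byte big-endian length followed by the
    // bytes verbatim -- no interpretation of '\n' or '\t' as separators.
    static byte[] frame(byte[] payload) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        try {
            out.writeInt(payload.length); // e.g. 00 00 00 08 for 8 bytes
            out.write(payload);           // payload passes through unmodified
            out.flush();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for in-memory streams
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = {0x01, 0x02, 0x03, 0x0a, 0x0a, 0x06, 0x07, 0x08};
        byte[] framed = frame(payload);
        // 4-byte length prefix plus the 8 payload bytes = 12 bytes total.
        System.out.println(framed.length + " bytes, length byte = " + framed[3]);
    }
}
```

A custom writer plugged in via the properties above would write each key/value in this
fashion instead of the text writer's line-oriented key '\t' value '\n' layout.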

Regards,

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: Peter.Marron@TrilliumSoftware.com

-----Original Message-----
From: Youssef Hatem [mailto:youssef.hatem@rwth-aachen.de] 
Sent: 09 October 2013 12:14
To: user@hadoop.apache.org
Subject: Problem with streaming exact binary chunks

Hello,

I wrote a very simple InputFormat and RecordReader to send binary data to mappers. The binary
data can contain anything (including \n, \t, \r); here is what next() may actually send:

public class MyRecordReader implements
        RecordReader<BytesWritable, BytesWritable> {
    ...
    public boolean next(BytesWritable key, BytesWritable ignore)
            throws IOException {
        ...

        // Build the 8-byte record 01 02 03 0a 0a 06 07 08,
        // deliberately embedding two '\n' bytes.
        byte[] result = new byte[8];
        for (int i = 0; i < result.length; ++i)
            result[i] = (byte)(i + 1);
        result[3] = (byte)'\n';
        result[4] = (byte)'\n';

        key.set(result, 0, result.length);
        return true;
    }
}

As you can see, I am using BytesWritable to send eight bytes: 01 02 03 0a 0a 06 07 08. I also
use HADOOP-1722 typed bytes (by setting -D stream.map.input=typedbytes).

According to the documentation of typed bytes the mapper should receive the following byte
sequence: 
00 00 00 08 01 02 03 0a 0a 06 07 08

However bytes are somehow modified and I get the following sequence instead:
00 00 00 08 01 02 03 09 0a 09 0a 06 07 08

0a = '\n'
09 = '\t'

It seems that Hadoop (streaming?) parses each newline character as a record separator and
inserts a '\t', which I assume is the key/value separator for streaming.
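That diagnosis can be checked mechanically: inserting a tab before each newline in the
expected payload reproduces exactly the observed byte sequence. A quick self-contained
check (assuming that insertion is the only transformation applied):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class SeparatorCheck {
    // Insert a '\t' (0x09) before every '\n' (0x0a), the transformation
    // the streaming text writer appears to be applying to the payload.
    static byte[] tabBeforeNewline(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : in) {
            if (b == '\n') out.write('\t');
            out.write(b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] expected = {0x01, 0x02, 0x03, 0x0a, 0x0a, 0x06, 0x07, 0x08};
        byte[] observed = {0x01, 0x02, 0x03, 0x09, 0x0a,
                           0x09, 0x0a, 0x06, 0x07, 0x08};
        // Prints true: the corruption matches tab-before-newline insertion.
        System.out.println(Arrays.equals(tabBeforeNewline(expected), observed));
    }
}
```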

Is there any workaround to send *exactly* the same byte sequence no matter what characters
are in it? Thanks in advance.

Best regards,
Youssef Hatem
