hadoop-common-user mailing list archives

From Vijay <tec...@gmail.com>
Subject Re: custom writable classes
Date Wed, 02 Feb 2011 18:11:44 GMT
Hadoop is not going to parse the line for you. Your mapper will take the
line, parse it and then turn it into your Writable so the next phase can
just work with your object.
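
A minimal sketch of that flow, under some loud assumptions: the class name LogRecord, its field layout, and the sample log line are all hypothetical, and the code uses only java.io so it runs without Hadoop on the classpath. In a real job the class would also declare "implements org.apache.hadoop.io.Writable" and parse() would be called from inside your map() method. The point is that parsing the text is your code's job, while write() and readFields() only need to agree on field order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WritableSketch {

    /** Hypothetical record type mirroring the shape of Hadoop's Writable. */
    static class LogRecord {
        String date, time, server, user, url, query;
        int port;

        /** Step 1 (done by the mapper): split the raw line into fields. */
        static LogRecord parse(String line) {
            String[] t = line.trim().split("\\s+");
            if (t.length < 7) {
                throw new IllegalArgumentException("expected 7 fields: " + line);
            }
            LogRecord r = new LogRecord();
            r.date = t[0];
            r.time = t[1];
            r.server = t[2];
            r.user = t[3];
            r.url = t[4];
            r.query = t[5];
            r.port = Integer.parseInt(t[6]);
            return r;
        }

        /** Step 2: serialize the fields in a fixed order. */
        public void write(DataOutput out) throws IOException {
            out.writeUTF(date);
            out.writeUTF(time);
            out.writeUTF(server);
            out.writeUTF(user);
            out.writeUTF(url);
            out.writeUTF(query);
            out.writeInt(port);
        }

        /** Step 3: read the fields back in exactly the order they were written. */
        public void readFields(DataInput in) throws IOException {
            date = in.readUTF();
            time = in.readUTF();
            server = in.readUTF();
            user = in.readUTF();
            url = in.readUTF();
            query = in.readUTF();
            port = in.readInt();
        }
    }

    /** Simulates the framework moving a record between phases as bytes. */
    static LogRecord roundTrip(LogRecord original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            LogRecord copy = new LogRecord();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample line in the DATE TIME SERVER USER URL QUERY PORT layout.
        LogRecord parsed = LogRecord.parse(
                "2011-02-02 18:11:44 web01 adeel /index.html id=7 80");
        LogRecord copy = roundTrip(parsed);
        System.out.println(copy.server + " " + copy.url + " " + copy.port);
    }
}
```

The round trip stands in for what happens between map and reduce: the bytes carry no field names, so the reader recovers each value only because it reads in the same order the writer wrote.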

Thanks,
Vijay
On Feb 2, 2011 9:51 AM, "Adeel Qureshi" <adeelmahmood@gmail.com> wrote:
> Thanks for your reply. So let's say my input files are formatted like this:
>
> each line looks like this
> DATE TIME SERVER USER URL QUERY PORT ...
>
> so to read this I would create a writable mapper
>
> public class MyMapper implements Writable {
>   Date date;
>   long time;
>   String server;
>   String user;
>   String url;
>   String query;
>   int port;
>
>   public void readFields(DataInput in) throws IOException {
>     date = readDate(in); // not concerned with the actual date reading function
>     time = readLong(in);
>     server = readText(in);
>     .....
>   }
> }
>
> but I still don't understand how Hadoop is going to know to parse my line into
> these tokens, instead of the map using the whole line as one token.
>
>
> On Wed, Feb 2, 2011 at 11:42 AM, Harsh J <qwertymaniac@gmail.com> wrote:
>
>> See it this way:
>>
>> readFields(...) provides a DataInput stream that reads bytes from a
>> binary stream, and write(...) provides a DataOutput stream that writes
>> bytes to a binary stream.
>>
>> Now your data structure may be a complex one, perhaps an array of
>> items or a mapping of some, or just a set of different types of
>> objects. All you need to do is think about how you would
>> _serialize_ your data structure into a binary stream, so that you may
>> _de-serialize_ it back from the same stream when required.
>>
>> About what goes where, I think looking up the definition of
>> 'serialization' will help. It is all in the ordering. If you wrote A
>> before B, you read A before B - simple as that.
>>
>> This, or you could use a neat serialization library like Apache Avro
>> (http://avro.apache.org) and solve it in a simpler way with a schema.
>> I'd recommend learning/using Avro for all
>> serialization/de-serialization needs, especially for Hadoop use-cases.
>>
>> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
>> > I have been trying to understand how to write a simple custom Writable class,
>> > and I find the available documentation very vague and unclear about certain
>> > things. Okay, so here is the sample Writable implementation in the javadoc of
>> > the Writable interface:
>> >
>> > public class MyWritable implements Writable {
>> >   // Some data
>> >   private int counter;
>> >   private long timestamp;
>> >
>> >   public void write(DataOutput out) throws IOException {
>> >     out.writeInt(counter);
>> >     out.writeLong(timestamp);
>> >   }
>> >
>> >   public void readFields(DataInput in) throws IOException {
>> >     counter = in.readInt();
>> >     timestamp = in.readLong();
>> >   }
>> >
>> >   public static MyWritable read(DataInput in) throws IOException {
>> >     MyWritable w = new MyWritable();
>> >     w.readFields(in);
>> >     return w;
>> >   }
>> > }
>> >
>> > So in the readFields function we are simply saying: read an int from the
>> > DataInput and put that in counter, then read a long and put that in the
>> > timestamp variable. What doesn't make sense to me is what the format of
>> > DataInput is here. What if there are multiple ints and multiple longs? How
>> > is the correct int going to go into counter? What if the data I am reading in
>> > my mapper is a string line, and I am using a regular expression to parse the
>> > tokens? How do I specify which field goes where? Simply saying readInt
>> > or readText: how does that get connected to the right stuff?
>> >
>> > In my case, like I said, I am reading from IIS log files, where my mapper
>> > input is a log line which contains the usual log information like date, time,
>> > user, server, url, qry, responseTime etc. I want to parse these into an
>> > object that can be passed to the reducer instead of dumping all that
>> > information as text.
>> >
>> > I would appreciate any help.
>> > Thanks
>> > Adeel
>> >
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>
