hadoop-common-user mailing list archives

From David Sinclair <dsincl...@chariotsolutions.com>
Subject Re: custom writable classes
Date Wed, 02 Feb 2011 20:53:13 GMT
You can easily make a custom Writable by delegating to the existing Writables.
If your Writable is just a bunch of strings, hold them as Text fields in your
class and call their own write/readFields methods from yours. For example:

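import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

class MyWritable implements Writable {
   // Initialize the fields so readFields() never hits a null Text.
   private Text fieldA = new Text();
   private Text fieldB = new Text();

   ....

   // Write the fields in a fixed order.
   public void write(DataOutput dataOutput) throws IOException {
         fieldA.write(dataOutput);
         fieldB.write(dataOutput);
   }

   // Read the fields back in exactly the order they were written.
   public void readFields(DataInput dataInput) throws IOException {
         fieldA.readFields(dataInput);
         fieldB.readFields(dataInput);
   }
}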

dave
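
The delegation idea above can be tried without a cluster. What follows is a minimal sketch, not Hadoop's own code: it assumes a locally declared `Writable` interface that mirrors the two method signatures of org.apache.hadoop.io.Writable, and it uses plain Strings with writeUTF/readUTF in place of Text so that it compiles with the JDK alone. The names `StringPairWritable` and `RoundTrip` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Assumption: local stand-in for org.apache.hadoop.io.Writable,
// declared here only so the sketch runs without Hadoop on the classpath.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record holding two strings.
class StringPairWritable implements Writable {
    String fieldA = "";
    String fieldB = "";

    // Serialize: fieldA first, then fieldB.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fieldA);
        out.writeUTF(fieldB);
    }

    // Deserialize in the same order the fields were written.
    public void readFields(DataInput in) throws IOException {
        fieldA = in.readUTF();
        fieldB = in.readUTF();
    }
}

class RoundTrip {
    // Serialize src to a byte array, then read it back into a fresh instance.
    static StringPairWritable copy(StringPairWritable src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            src.write(new DataOutputStream(bytes));
            StringPairWritable dst = new StringPairWritable();
            dst.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return dst;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `RoundTrip.copy(w)` on a populated instance should give back an object with the same two field values, which is roughly what happens to a map output value on its way to the reducer.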

On Wed, Feb 2, 2011 at 3:34 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:

> huh this is interesting .. obviously I am not thinking about this whole thing
> right ..
>
> so in your mapper you parse the line into tokens and set the appropriate
> values on your writable by constructor or setters .. and let hadoop do all
> the serialization and deserialization .. and you tell hadoop how to do that
> by the read and write methods .. okay that makes more sense .. one last
> thing i still dont understand is what is the proper implementation of read
> and write methods .. if i have a bunch of strings in my writable then what
> should be the read method implementation ..
>
> I really appreciate the help from all you guys ..
>
> On Wed, Feb 2, 2011 at 12:52 PM, David Sinclair <
> dsinclair@chariotsolutions.com> wrote:
>
> > So create your writable as normal, and hadoop takes care of the
> > serialization/deserialization between mappers and reducers.
> >
> > For example, MyWritable is the same as you had previously, then in your
> > mapper output that writable
> >
> > class MyMapper extends Mapper<LongWritable, Text, LongWritable, MyWritable>
> > {
> >
> >    private MyWritable writable = new MyWritable();
> >
> >    protected void map(LongWritable key, Text value, Context context)
> >            throws IOException, InterruptedException {
> >        // parse text
> >        writable.setCounter(parseddata);
> >        writable.setTimestamp(parseddata);
> >
> >        // don't know what your key is
> >        context.write(key, writable);
> >    }
> > }
> >
> > and make sure you set the key/value output
> >
> > job.setMapOutputKeyClass(LongWritable.class);
> > job.setMapOutputValueClass(MyWritable.class);
> >
> > dave
> >
> >
> > On Wed, Feb 2, 2011 at 1:39 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
> >
> > > i m reading text data and outputting text data so yeah its all text ..
> > > the reason why i wanted to use custom writable classes is not for the
> > > mapper purposes .. you are right .. the easiest thing for me is to receive
> > > the LongWritable and Text input in the mapper ... parse the text .. and
> > > deal with it .. but where I am having trouble is in passing the parsed
> > > information to the reducer .. right now I am putting a bunch of things as
> > > text and sending the same LongWritable and Text output to reducer but my
> > > text includes a bunch of things e.g. several fields separated by a
> > > delimiter .. this is the part that I am trying to improve .. instead of
> > > sending a bunch of delimited text I want to send an actual object to my
> > > reducer
> > >
> > > On Wed, Feb 2, 2011 at 12:33 PM, David Sinclair <
> > > dsinclair@chariotsolutions.com> wrote:
> > >
> > > > Are you storing your data as text or binary?
> > > >
> > > > If you are storing as text, your mapper is going to get keys of type
> > > > LongWritable and values of type Text. Inside your mapper you would
> > > > parse out the strings and wouldn't be using your custom writable; that
> > > > is, unless you wanted your mapper/reducer to produce these.
> > > >
> > > > If you are storing as binary, e.g. SequenceFiles, you use the
> > > > SequenceFileInputFormat and the sequence file reader will create the
> > > > writables according to the mapper.
> > > >
> > > > dave
> > > >
> > > > On Wed, Feb 2, 2011 at 1:16 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
> > > >
> > > > > okay so then the main question is how do I get the input line .. so
> > > > > that I could parse it .. I am assuming it will then be passed to me
> > > > > via data input stream ..
> > > > >
> > > > > So in my readFields function .. I am assuming I will get the whole
> > > > > line .. then I can parse it out and set my params .. something like this
> > > > >
> > > > > readFields(){
> > > > >  String line = in.readLine(); // read the whole line
> > > > >
> > > > >  //now apply the regular expression to parse it out
> > > > >  data = pattern.group(1);
> > > > >  time = pattern.group(2);
> > > > >  user = pattern.group(3);
> > > > > }
> > > > >
> > > > > Is that right ???
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Feb 2, 2011 at 12:11 PM, Vijay <techvd@gmail.com> wrote:
> > > > >
> > > > > > Hadoop is not going to parse the line for you. Your mapper will take
> > > > > > the line, parse it and then turn it into your Writable so the next
> > > > > > phase can just work with your object.
> > > > > >
> > > > > > Thanks,
> > > > > > Vijay
> > > > > > On Feb 2, 2011 9:51 AM, "Adeel Qureshi" <adeelmahmood@gmail.com> wrote:
> > > > > > > thanks for your reply .. so lets say my input files are formatted
> > > > > > > like this
> > > > > > >
> > > > > > > each line looks like this
> > > > > > > DATE TIME SERVER USER URL QUERY PORT ...
> > > > > > >
> > > > > > > so to read this I would create a writable mapper
> > > > > > >
> > > > > > > public class MyMapper implements Writable {
> > > > > > > Date date
> > > > > > > long time
> > > > > > > String server
> > > > > > > String user
> > > > > > > String url
> > > > > > > String query
> > > > > > > int port
> > > > > > >
> > > > > > > readFields(){
> > > > > > > date = readDate(in); //not concerned with the actual date reading function
> > > > > > > time = readLong(in);
> > > > > > > server = readText(in);
> > > > > > > .....
> > > > > > > }
> > > > > > > }
> > > > > > >
> > > > > > > but I still dont understand how is hadoop gonna know to parse my
> > > > > > > line into these tokens .. instead of map be using the whole line as
> > > > > > > one token
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Feb 2, 2011 at 11:42 AM, Harsh J <qwertymaniac@gmail.com> wrote:
> > > > > > >
> > > > > > >> See it this way:
> > > > > > >>
> > > > > > >> readFields(...) provides a DataInput stream that reads bytes from a
> > > > > > >> binary stream, and write(...) provides a DataOutput stream that
> > > > > > >> writes bytes to a binary stream.
> > > > > > >>
> > > > > > >> Now your data-structure may be a complex one, perhaps an array of
> > > > > > >> items or a mapping of some, or just a set of different types of
> > > > > > >> objects. All you need to do is to think about how would you
> > > > > > >> _serialize_ your data structure into a binary stream, so that you
> > > > > > >> may _de-serialize_ it back from the same stream when required.
> > > > > > >>
> > > > > > >> About what goes where, I think looking up the definition of
> > > > > > >> 'serialization' will help. It is all in the ordering. If you wrote A
> > > > > > >> before B, you read A before B - simple as that.
> > > > > > >>
> > > > > > >> This, or you could use a neat serialization library like Apache Avro
> > > > > > >> (http://avro.apache.org) and solve it in a simpler way with a
> > > > > > >> schema. I'd recommend learning/using Avro for all
> > > > > > >> serialization/de-serialization needs. Especially for Hadoop
> > > > > > >> use-cases.
> > > > > > >>
> > > > > > >> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
> > > > > > >> > I have been trying to understand how to write a simple custom
> > > > > > >> > writable class and I find the documentation available very vague
> > > > > > >> > and unclear about certain things. okay so here is the sample
> > > > > > >> > writable implementation in javadoc of Writable interface
> > > > > > >> >
> > > > > > >> > public class MyWritable implements Writable {
> > > > > > >> > // Some data
> > > > > > >> > private int counter;
> > > > > > >> > private long timestamp;
> > > > > > >> >
> > > > > > >> > public void write(DataOutput out) throws IOException {
> > > > > > >> > out.writeInt(counter);
> > > > > > >> > out.writeLong(timestamp);
> > > > > > >> > }
> > > > > > >> >
> > > > > > >> > public void readFields(DataInput in) throws IOException {
> > > > > > >> > counter = in.readInt();
> > > > > > >> > timestamp = in.readLong();
> > > > > > >> > }
> > > > > > >> >
> > > > > > >> > public static MyWritable read(DataInput in) throws IOException {
> > > > > > >> > MyWritable w = new MyWritable();
> > > > > > >> > w.readFields(in);
> > > > > > >> > return w;
> > > > > > >> > }
> > > > > > >> > }
> > > > > > >> >
> > > > > > >> > so in readFields function we are simply saying read an int from
> > > > > > >> > the datainput and put that in counter .. and then read a long and
> > > > > > >> > put that in timestamp variable .. what doesnt make sense to me is
> > > > > > >> > what is the format of DataInput here .. what if there are multiple
> > > > > > >> > ints and multiple longs .. how is the correct int gonna go in
> > > > > > >> > counter .. what if the data I am reading in my mapper is a string
> > > > > > >> > line .. and I am using regular expression to parse the tokens ..
> > > > > > >> > how do I specify which field goes where .. simply saying readInt
> > > > > > >> > or readText .. how does that get connected to the right stuff ..
> > > > > > >> >
> > > > > > >> > so in my case like I said I am reading from iis log files where my
> > > > > > >> > mapper input is a log line which contains usual log information
> > > > > > >> > like date, time, user, server, url, qry, responseTime etc .. I want
> > > > > > >> > to parse these into an object that can be passed to reducer
> > > > > > >> > instead of dumping all that information as text ..
> > > > > > >> >
> > > > > > >> > I would appreciate any help.
> > > > > > >> > Thanks
> > > > > > >> > Adeel
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Harsh J
> > > > > > >> www.harshj.com
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
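
The two points made in the thread (the mapper does the parsing, and fields are read back in the order they were written) can be sketched together. This is an illustration under stated assumptions rather than anyone's actual job code: the `Writable` interface is a local stand-in mirroring Hadoop's method signatures, `LogRecordWritable` and its field names are hypothetical, and the space-separated line layout (server, user, url, port, timestamp in milliseconds) is a simplified made-up format, not the IIS layout from the original question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Assumption: local stand-in for org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record with mixed field types.
class LogRecordWritable implements Writable {
    String server = "";
    String user = "";
    String url = "";
    int port;
    long timestamp;

    // The parsing step a mapper would perform on its Text input line.
    // Assumed layout: SERVER USER URL PORT TIMESTAMP_MILLIS
    static LogRecordWritable parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        LogRecordWritable r = new LogRecordWritable();
        r.server = tokens[0];
        r.user = tokens[1];
        r.url = tokens[2];
        r.port = Integer.parseInt(tokens[3]);
        r.timestamp = Long.parseLong(tokens[4]);
        return r;
    }

    // Write order defines the binary layout.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(server);
        out.writeUTF(user);
        out.writeUTF(url);
        out.writeInt(port);
        out.writeLong(timestamp);
    }

    // Read order must match write order exactly; no parsing happens here.
    public void readFields(DataInput in) throws IOException {
        server = in.readUTF();
        user = in.readUTF();
        url = in.readUTF();
        port = in.readInt();
        timestamp = in.readLong();
    }

    // Serialize to bytes and read back, as happens between map and reduce.
    static LogRecordWritable roundTrip(LogRecordWritable src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            src.write(new DataOutputStream(bytes));
            LogRecordWritable dst = new LogRecordWritable();
            dst.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return dst;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that readFields never sees the original text line; it only sees the bytes that write produced, which is why the regular-expression parsing belongs in the mapper and not in the Writable.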
