hadoop-common-user mailing list archives

From Adeel Qureshi <adeelmahm...@gmail.com>
Subject Re: custom writable classes
Date Wed, 02 Feb 2011 18:16:55 GMT
Okay, so then the main question is: how do I get the input line so that I
could parse it? I am assuming it will then be passed to me via the DataInput
stream.

So in my readFields function, I am assuming I will get the whole line; then I
can parse it out and set my params, something like this:

public void readFields(DataInput in) throws IOException {
  String line = in.readLine(); // read the whole line

  // now apply the regular expression to parse it out
  Matcher m = pattern.matcher(line);
  m.find();
  date = m.group(1);
  time = m.group(2);
  user = m.group(3);
}

Is that right ???
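
A minimal, self-consistent sketch of that idea, assuming the whole line is
what write() put on the stream in the first place (the class name, fields,
and regex here are illustrative, not the actual IIS log format):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Writable;

public class LogLineWritable implements Writable {
  // illustrative pattern: three whitespace-separated tokens
  private static final Pattern PATTERN =
      Pattern.compile("(\\S+)\\s+(\\S+)\\s+(\\S+).*");

  private String line = "";
  private String date;
  private String time;
  private String user;

  public void write(DataOutput out) throws IOException {
    out.writeUTF(line); // serialize the raw line ...
  }

  public void readFields(DataInput in) throws IOException {
    line = in.readUTF(); // ... so this reads back exactly what write() wrote
    Matcher m = PATTERN.matcher(line);
    if (m.matches()) {
      date = m.group(1);
      time = m.group(2);
      user = m.group(3);
    }
  }
}

The catch is that readFields() can only read back whatever write() serialized;
a bare readLine() on the DataInput is not guaranteed to see the original text
line unless a line is what was written.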



On Wed, Feb 2, 2011 at 12:11 PM, Vijay <techvd@gmail.com> wrote:

> Hadoop is not going to parse the line for you. Your mapper will take the
> line, parse it and then turn it into your Writable so the next phase can
> just work with your object.
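>
> For illustration, a minimal sketch of that flow; LogEntryWritable is a
> hypothetical custom Writable with a matching constructor, and the regex
> just splits off the first three tokens:
>
> import java.io.IOException;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> public class LogMapper
>     extends Mapper<LongWritable, Text, Text, LogEntryWritable> {
>   private static final Pattern PATTERN =
>       Pattern.compile("(\\S+)\\s+(\\S+)\\s+(\\S+).*");
>
>   @Override
>   protected void map(LongWritable key, Text value, Context context)
>       throws IOException, InterruptedException {
>     // the framework hands the mapper one raw line of input text
>     Matcher m = PATTERN.matcher(value.toString());
>     if (m.matches()) {
>       // parse here, then emit the populated custom Writable
>       LogEntryWritable entry =
>           new LogEntryWritable(m.group(1), m.group(2), m.group(3));
>       context.write(new Text(m.group(3)), entry);
>     }
>   }
> }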
>
> Thanks,
> Vijay
> On Feb 2, 2011 9:51 AM, "Adeel Qureshi" <adeelmahmood@gmail.com> wrote:
> > Thanks for your reply. So let's say my input files are formatted like this;
> > each line looks like
> >
> > DATE TIME SERVER USER URL QUERY PORT ...
> >
> > so to read this I would create a writable mapper:
> >
> > public class MyMapper implements Writable {
> >   Date date;
> >   long time;
> >   String server;
> >   String user;
> >   String url;
> >   String query;
> >   int port;
> >
> >   public void readFields(DataInput in) throws IOException {
> >     date = readDate(in); // not concerned with the actual date-reading function
> >     time = readLong(in);
> >     server = readText(in);
> >     .....
> >   }
> > }
> >
> > but I still don't understand how Hadoop is going to know to parse my line
> > into these tokens, instead of the map using the whole line as one token.
> >
> >
> > On Wed, Feb 2, 2011 at 11:42 AM, Harsh J <qwertymaniac@gmail.com> wrote:
> >
> >> See it this way:
> >>
> >> readFields(...) provides a DataInput stream that reads bytes from a
> >> binary stream, and write(...) provides a DataOutput stream that writes
> >> bytes to a binary stream.
> >>
> >> Now your data structure may be a complex one: perhaps an array of
> >> items, a mapping of some sort, or just a set of different types of
> >> objects. All you need to do is think about how you would
> >> _serialize_ your data structure into a binary stream, so that you may
> >> _de-serialize_ it back from the same stream when required.
> >>
> >> About what goes where, I think looking up the definition of
> >> 'serialization' will help. It is all in the ordering. If you wrote A
> >> before B, you read A before B - simple as that.
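> >>
> >> As a sketch of that rule (this class is hypothetical): a Writable holding
> >> a counter plus an array of names can be serialized length-first, and
> >> readFields(...) then mirrors write(...) field for field:
> >>
> >> import java.io.DataInput;
> >> import java.io.DataOutput;
> >> import java.io.IOException;
> >> import org.apache.hadoop.io.Writable;
> >>
> >> public class NamesWritable implements Writable {
> >>   private int counter;
> >>   private String[] names = new String[0];
> >>
> >>   public void write(DataOutput out) throws IOException {
> >>     out.writeInt(counter);        // A: the counter goes first
> >>     out.writeInt(names.length);   // B: then the array length
> >>     for (String n : names) {
> >>       out.writeUTF(n);            // C: then each element, in order
> >>     }
> >>   }
> >>
> >>   public void readFields(DataInput in) throws IOException {
> >>     counter = in.readInt();           // read A back first
> >>     names = new String[in.readInt()]; // then B, to size the array
> >>     for (int i = 0; i < names.length; i++) {
> >>       names[i] = in.readUTF();        // then each C, in the same order
> >>     }
> >>   }
> >> }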
> >>
> >> This, or you could use a neat serialization library like Apache Avro
> >> (http://avro.apache.org) and solve it in a simpler way with a schema.
> >> I'd recommend learning/using Avro for all
> >> serialization/de-serialization needs, especially for Hadoop use-cases.
> >>
> >> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
> >> > I have been trying to understand how to write a simple custom Writable
> >> > class, and I find the available documentation very vague and unclear about
> >> > certain things. Okay, so here is the sample Writable implementation in the
> >> > javadoc of the Writable interface:
> >> >
> >> > public class MyWritable implements Writable {
> >> >   // Some data
> >> >   private int counter;
> >> >   private long timestamp;
> >> >
> >> >   public void write(DataOutput out) throws IOException {
> >> >     out.writeInt(counter);
> >> >     out.writeLong(timestamp);
> >> >   }
> >> >
> >> >   public void readFields(DataInput in) throws IOException {
> >> >     counter = in.readInt();
> >> >     timestamp = in.readLong();
> >> >   }
> >> >
> >> >   public static MyWritable read(DataInput in) throws IOException {
> >> >     MyWritable w = new MyWritable();
> >> >     w.readFields(in);
> >> >     return w;
> >> >   }
> >> > }
> >> >
> >> > So in the readFields function we are simply saying: read an int from the
> >> > DataInput and put it in counter, then read a long and put it in the
> >> > timestamp variable. What doesn't make sense to me is: what is the format
> >> > of the DataInput here? What if there are multiple ints and multiple longs;
> >> > how is the correct int going to go into counter? What if the data I am
> >> > reading in my mapper is a string line, and I am using a regular expression
> >> > to parse the tokens; how do I specify which field goes where? Simply
> >> > saying readInt or readText, how does that get connected to the right stuff?
> >> >
> >> > So in my case, like I said, I am reading from IIS log files where my
> >> > mapper input is a log line which contains the usual log information like
> >> > date, time, user, server, url, qry, responseTime, etc. I want to parse
> >> > these into an object that can be passed to the reducer instead of dumping
> >> > all that information as text.
> >> >
> >> > I would appreciate any help.
> >> > Thanks
> >> > Adeel
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >> www.harshj.com
> >>
>
