hadoop-common-user mailing list archives

From Adeel Qureshi <adeelmahm...@gmail.com>
Subject Re: custom writable classes
Date Wed, 02 Feb 2011 18:39:06 GMT
I'm reading text data and outputting text data, so yes, it's all text. The
reason I wanted to use custom writable classes is not for the mapper's
sake; you are right, the easiest thing is to receive the LongWritable and
Text input in the mapper, parse the text, and deal with it. Where I am
having trouble is in passing the parsed information to the reducer. Right
now I am sending the same LongWritable and Text output to the reducer, but
my text packs together a bunch of things, e.g. several fields separated by
a delimiter. This is the part I am trying to improve: instead of sending a
bunch of delimited text, I want to send an actual object to my reducer.
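
For concreteness, here is a minimal sketch of the kind of value class being
discussed, assuming the org.apache.hadoop.io.Writable interface; the class
name and fields are illustrative, not from the thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative value class: the mapper fills it from a parsed log line
// and emits it, instead of a delimiter-joined string. (Getters omitted.)
public class LogEntryWritable implements Writable {
  private String user;
  private String url;
  private long time;

  public void set(String user, String url, long time) {
    this.user = user;
    this.url = url;
    this.time = time;
  }

  public void write(DataOutput out) throws IOException {
    // Serialization is just an agreed field order...
    out.writeUTF(user);
    out.writeUTF(url);
    out.writeLong(time);
  }

  public void readFields(DataInput in) throws IOException {
    // ...and deserialization reads the fields back in that same order.
    user = in.readUTF();
    url = in.readUTF();
    time = in.readLong();
  }
}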

On Wed, Feb 2, 2011 at 12:33 PM, David Sinclair <dsinclair@chariotsolutions.com> wrote:

> Are you storing your data as text or binary?
>
> If you are storing as text, your mapper is going to get Keys of
> type LongWritable and values of type Text. Inside your mapper you would
> parse out the strings and wouldn't be using your custom writable; that is
> unless you wanted your mapper/reducer to produce these.
>
> If you are storing as binary, e.g. SequenceFiles, you use
> the SequenceFileInputFormat, and the sequence file reader will create the
> writables of the types recorded in the file and hand them to the mapper.
>
> dave
>
> On Wed, Feb 2, 2011 at 1:16 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
>
> > Okay, so then the main question is how do I get the input line so that
> > I could parse it. I am assuming it will be passed to me via the data
> > input stream.
> >
> > So in my readFields function, I am assuming I will get the whole line;
> > then I can parse it out and set my params, something like this:
> >
> > readFields() {
> >   String line = in.readLine(); // read the whole line
> >
> >   // now apply the regular expression to parse it out
> >   date = pattern.group(1);
> >   time = pattern.group(2);
> >   user = pattern.group(3);
> > }
> >
> > Is that right ???
> >
> >
> >
> > On Wed, Feb 2, 2011 at 12:11 PM, Vijay <techvd@gmail.com> wrote:
> >
> > > Hadoop is not going to parse the line for you. Your mapper will take
> > > the line, parse it, and then turn it into your Writable so the next
> > > phase can just work with your object.
> > >
> > > Thanks,
> > > Vijay
> > > On Feb 2, 2011 9:51 AM, "Adeel Qureshi" <adeelmahmood@gmail.com> wrote:
> > > > Thanks for your reply. So let's say my input files are formatted
> > > > like this; each line looks like:
> > > >
> > > > DATE TIME SERVER USER URL QUERY PORT ...
> > > >
> > > > so to read this I would create a writable mapper:
> > > >
> > > > public class MyMapper implements Writable {
> > > >   Date date;
> > > >   long time;
> > > >   String server;
> > > >   String user;
> > > >   String url;
> > > >   String query;
> > > >   int port;
> > > >
> > > >   readFields() {
> > > >     date = readDate(in); // not concerned with the actual date reading function
> > > >     time = readLong(in);
> > > >     server = readText(in);
> > > >     .....
> > > >   }
> > > > }
> > > >
> > > > but I still don't understand how Hadoop is going to know to parse
> > > > my line into these tokens, instead of the map just using the whole
> > > > line as one token.
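
Hadoop itself won't tokenize the line; as Vijay notes above, the mapper
does the parsing and hands the framework an already-filled Writable. A
minimal sketch, reusing the illustrative LogEntryWritable from earlier and
an assumed three-field line format:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: the regex here, not Hadoop, splits the line.
public class LogLineMapper
    extends Mapper<LongWritable, Text, Text, LogEntryWritable> {

  // Assumed format: USER URL TIME, whitespace-separated.
  private static final Pattern LINE = Pattern.compile("(\\S+) (\\S+) (\\d+)");
  private final LogEntryWritable entry = new LogEntryWritable();
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Matcher m = LINE.matcher(value.toString());
    if (m.matches()) {
      entry.set(m.group(1), m.group(2), Long.parseLong(m.group(3)));
      outKey.set(m.group(1)); // key by user, for example
      context.write(outKey, entry);
    }
  }
}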
> > > >
> > > >
> > > > On Wed, Feb 2, 2011 at 11:42 AM, Harsh J <qwertymaniac@gmail.com> wrote:
> > > >
> > > >> See it this way:
> > > >>
> > > >> readFields(...) provides a DataInput stream that reads bytes from
> > > >> a binary stream, and write(...) provides a DataOutput stream that
> > > >> writes bytes to a binary stream.
> > > >>
> > > >> Now your data-structure may be a complex one, perhaps an array of
> > > >> items or a mapping of some, or just a set of different types of
> > > >> objects. All you need to do is to think about how you would
> > > >> _serialize_ your data structure into a binary stream, so that you
> > > >> may _de-serialize_ it back from the same stream when required.
> > > >>
> > > >> About what goes where, I think looking up the definition of
> > > >> 'serialization' will help. It is all in the ordering. If you wrote
> > > >> A before B, you read A before B - simple as that.
> > > >>
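
To make the ordering rule concrete, here is a sketch with assumed fields,
the same shape as the MyWritable example quoted further down:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative: the ordering is the whole contract between the methods.
public class OrderDemo implements Writable {
  private int counter;     // "A", written first
  private long timestamp;  // "B", written second

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);    // A
    out.writeLong(timestamp); // B
  }

  public void readFields(DataInput in) throws IOException {
    // A then B, matching write(). Reading the long first would consume
    // the int's four bytes plus half the long's and garble both fields.
    counter = in.readInt();
    timestamp = in.readLong();
  }
}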
> > > >> This, or you could use a neat serialization library like Apache
> > > >> Avro (http://avro.apache.org) and solve it in a simpler way with a
> > > >> schema. I'd recommend learning/using Avro for all
> > > >> serialization/de-serialization needs, especially for Hadoop
> > > >> use-cases.
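
A minimal sketch of the Avro route, assuming a recent avro jar on the
classpath; the schema and field names are illustrative:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Illustrative: the schema, rather than hand-written write()/readFields()
// methods, defines the serialized layout of the record.
public class AvroSketch {
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"LogEntry\",\"fields\":["
      + "{\"name\":\"user\",\"type\":\"string\"},"
      + "{\"name\":\"url\",\"type\":\"string\"},"
      + "{\"name\":\"time\",\"type\":\"long\"}]}");

  public static void main(String[] args) {
    GenericRecord entry = new GenericData.Record(SCHEMA);
    entry.put("user", "someuser");
    entry.put("url", "/index.html");
    entry.put("time", 1234L);
    System.out.println(entry); // prints the record as JSON
  }
}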
> > > >>
> > > >> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
> > > >> > I have been trying to understand how to write a simple custom
> > > >> > writable class, and I find the documentation available very vague
> > > >> > and unclear about certain things. Okay, so here is the sample
> > > >> > writable implementation in the javadoc of the Writable interface:
> > > >> >
> > > >> > public class MyWritable implements Writable {
> > > >> >   // Some data
> > > >> >   private int counter;
> > > >> >   private long timestamp;
> > > >> >
> > > >> >   public void write(DataOutput out) throws IOException {
> > > >> >     out.writeInt(counter);
> > > >> >     out.writeLong(timestamp);
> > > >> >   }
> > > >> >
> > > >> >   public void readFields(DataInput in) throws IOException {
> > > >> >     counter = in.readInt();
> > > >> >     timestamp = in.readLong();
> > > >> >   }
> > > >> >
> > > >> >   public static MyWritable read(DataInput in) throws IOException {
> > > >> >     MyWritable w = new MyWritable();
> > > >> >     w.readFields(in);
> > > >> >     return w;
> > > >> >   }
> > > >> > }
> > > >> >
> > > >> > So in the readFields function we are simply saying: read an int
> > > >> > from the DataInput and put that in counter, then read a long and
> > > >> > put that in the timestamp variable. What doesn't make sense to me
> > > >> > is the format of the DataInput here. What if there are multiple
> > > >> > ints and multiple longs? How is the correct int going to go in
> > > >> > counter? What if the data I am reading in my mapper is a string
> > > >> > line, and I am using a regular expression to parse the tokens?
> > > >> > How do I specify which field goes where? Simply saying readInt or
> > > >> > readText, how does that get connected to the right stuff?
> > > >> >
> > > >> > So in my case, like I said, I am reading from IIS log files where
> > > >> > my mapper input is a log line which contains the usual log
> > > >> > information like date, time, user, server, url, qry, responseTime,
> > > >> > etc. I want to parse these into an object that can be passed to
> > > >> > the reducer instead of dumping all that information as text.
> > > >> >
> > > >> > I would appreciate any help.
> > > >> > Thanks
> > > >> > Adeel
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Harsh J
> > > >> www.harshj.com
> > > >>
> > >
> >
>
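
One practical detail the thread leaves implicit: for a custom object to
make the trip from mapper to reducer, the job has to declare it as the map
output value class. A minimal sketch of the driver wiring, under the same
illustrative class names used above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Illustrative driver: Hadoop needs the declared value type so it can
// instantiate it and call readFields() on the reduce side.
public class LogJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "parse iis logs");
    job.setJarByClass(LogJobDriver.class);
    job.setMapperClass(LogLineMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LogEntryWritable.class);
    // reducer class, input/output formats and paths omitted for brevity
  }
}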
