hadoop-mapreduce-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: Not able to understand writing custom writable
Date Fri, 09 Aug 2013 19:37:19 GMT
The overarching responsibility of a record reader is to return one record
at a time, which conventionally means one line. But as we see in this case,
that is not always true. An XML file can span multiple physical lines that
functionally map to one record. For this, we have to write our own record
reader to perform the mapping or conversion from multiple physical
lines/records to a single functional record.

In the constructor a split is passed in, which represents a chunk of
contiguous records from the source XML file. To process this split, which
can contain many physical and functional records, some bootstrapping is
done: one-time variables are initialized and the total length of the data
is calculated. A reader is also opened, and when I say reader I mean one
for reading the normal XML file with the help of Java I/O classes (or
Hadoop's abstraction over them). Also noticeable is the fact that the whole
file is opened for reading, but for this particular split the reading
starts from the split's start point, as is evident from the 'seek' method
call. Each split gets the same file and opens it for reading, but actually
starts reading only from the point where its split is supposed to begin,
hence the call to 'seek' as mentioned.
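
The constructor steps described above can be sketched with plain Java I/O. This is only an illustration under assumptions, not the Mahout code: RandomAccessFile stands in for Hadoop's seekable stream, and the class name SplitSeekSketch, the readSplit helper and the splitStart/splitLength parameters are hypothetical stand-ins for what the split would provide.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical sketch of the constructor logic. RandomAccessFile plays the
// role of Hadoop's seekable input stream (an assumption for illustration).
class SplitSeekSketch {
    final RandomAccessFile in;
    final long start; // first byte of this split
    final long end;   // first byte past this split

    SplitSeekSketch(File file, long splitStart, long splitLength) throws IOException {
        this.start = splitStart;
        this.end = splitStart + splitLength;
        this.in = new RandomAccessFile(file, "r"); // the whole file is opened...
        this.in.seek(start);                       // ...but reading begins at the split's start
    }

    long position() throws IOException {
        return in.getFilePointer();
    }

    // Illustration only: reads exactly the bytes from the current position to
    // the split's end. A real record reader scans for tags instead.
    String readSplit() throws IOException {
        byte[] bytes = new byte[(int) (end - in.getFilePointer())];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    void close() throws IOException {
        in.close();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("records", ".xml");
        f.deleteOnExit();
        Files.write(f.toPath(), "<tag1>a</tag1><tag1>b</tag1>".getBytes(StandardCharsets.UTF_8));
        // Two "splits" of 14 bytes each over the same file; this is the second one.
        SplitSeekSketch second = new SplitSeekSketch(f, 14, 14);
        System.out.println(second.readSplit());
        second.close();
    }
}
```

Every split would construct its own instance over the same file and differ only in the offset handed to seek.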

In the 'next' method, which is overridden and is called by the framework
to read, atomically, the next functional (and physical) record, it is
regular Java I/O and XML tag logic. Nothing specific to M/R. What you are
trying to do is to read everything between an XML tag pair (start and end)
specified by you and set in the constructor. Everything in between this
start and end tag (which can be more nested tags or just text) is
considered one record. You are then simply using Java I/O classes plus
String/Byte manipulation and comparison for parsing and constructing your
record. When the 'read' stream is exhausted (that is, the check
fsin.getPos() < end no longer holds) it means that you have processed all
the data for this split and we are done.

Given an XML file like this:
<tag1>     //split 1
</tag1>    //byte position of '>' is the key of 1st record
<tag1>     //split 2
</tag1>    //byte position of '>' is the key of 2nd record
<tag1>     //split 3
</tag1>    //byte position of '>' is the key of 3rd record

Let us say you have 3 splits. For the first split your start tag would be
'tag1' and end tag '/tag1' (the one which contains the 'a' and 'b'
sub-tags), and after processing this split you will have 1 record. The
first call to the match (readUntilMatch) method with 'withinBlock=false'
will just seek to the end of the start tag (tag1) and will not buffer
anything. The next call to the same match method, with the second parameter
now set to true, will begin reading where the start tag ended and continue
reading till the end tag (</tag1>) or the end of the split (or of the file,
if it was the last split), and this time it will save or buffer the data
encountered. That is exactly what we want, i.e. the data between the start
and end tag, which will form our value, our one functional record. There is
some logic for handling corner cases and sanity checks in there as well.
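
To make the two-pass matching concrete, here is a minimal sketch modeled on the description above. It is not the Mahout class: a ByteArrayInputStream and a manual pos counter stand in for Hadoop's stream and fsin.getPos(), and the names TagScanSketch and lastKey are made up for this example. In this sketch the buffered value includes the start and end tags themselves.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical, Hadoop-free sketch of the next/readUntilMatch logic.
class TagScanSketch {
    private final InputStream in;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final byte[] startTag;
    private final byte[] endTag;
    private final long end; // first byte past this split
    private long pos;       // bytes consumed so far (stands in for fsin.getPos())
    long lastKey;           // position just past the end tag of the last record

    TagScanSketch(byte[] data, String startTag, String endTag, long end) {
        this.in = new ByteArrayInputStream(data);
        this.startTag = startTag.getBytes(StandardCharsets.UTF_8);
        this.endTag = endTag.getBytes(StandardCharsets.UTF_8);
        this.end = end;
    }

    // Returns the next record's text, or null once the split is exhausted.
    String next() throws IOException {
        if (pos < end) {
            // First pass: skip ahead to the end of the start tag, buffering nothing.
            if (readUntilMatch(startTag, false)) {
                try {
                    buffer.write(startTag);
                    // Second pass: buffer everything up to and including the end tag.
                    if (readUntilMatch(endTag, true)) {
                        lastKey = pos;
                        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                    }
                } finally {
                    buffer.reset();
                }
            }
        }
        return null;
    }

    // Reads byte by byte until `match` has been seen in full. Returns false at
    // end of stream, or when a search for a start tag runs past the split's end.
    private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) return false;
            pos++;
            if (withinBlock) buffer.write(b);
            if (b == match[i]) {
                i++;
                if (i >= match.length) return true;
            } else {
                i = 0;
            }
            if (!withinBlock && i == 0 && pos >= end) return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<doc><tag1><a>1</a></tag1>\n<tag1><b>2</b></tag1></doc>"
                .getBytes(StandardCharsets.UTF_8);
        TagScanSketch reader = new TagScanSketch(xml, "<tag1>", "</tag1>", xml.length);
        for (String rec = reader.next(); rec != null; rec = reader.next()) {
            System.out.println(reader.lastKey + "\t" + rec);
        }
    }
}
```

Note that once inside a block the scan does not stop at the split boundary, so a record that straddles two splits is still read whole by the split in which it starts.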

You will also notice that the createKey and createValue methods are
overridden as well. These are called first by the framework, and then the
call to the 'next' method is made. Notice that we are passing 'key' and
'value' to the 'next' method, and these objects are then actually set with
the functional record that you compute after parsing the multi-line XML
chunk.

Something like:
LongWritable key = createKey();
Text value = createValue();
if (next(key, value)) {
    // continue reading more (functional) records
}

Also note that the key of the constructed and returned functional record is
the physical byte position in the file at the point where the record was
completed. This means that the keys of any 2 consecutive functional records
will not differ by 1. The difference will most probably (I say 'probably'
as I might be off about the edge/-1 cases here) be the number of bytes read
between one record's end tag and the next record's end tag. Also, the byte
position of the end tag is used, not that of the start tag.
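
A small worked example of how such keys come out. It uses String.indexOf rather than the byte-by-byte scan, purely to show the arithmetic; the class name RecordKeySketch and the sample data are hypothetical, and the content is assumed to be ASCII so that character offsets equal byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the key each record would receive if the
// key is the offset just past that record's end tag.
class RecordKeySketch {
    static List<Long> keysOf(String xml, String endTag) {
        List<Long> keys = new ArrayList<>();
        int from = 0;
        while (true) {
            int at = xml.indexOf(endTag, from);
            if (at < 0) break;
            from = at + endTag.length(); // offset just past the closing '>'
            keys.add((long) from);
        }
        return keys;
    }

    public static void main(String[] args) {
        String xml = "<tag1>aa</tag1>\n<tag1>bbbb</tag1>\n<tag1>c</tag1>\n";
        System.out.println(keysOf(xml, "</tag1>"));
    }
}
```

For this sample the keys are 15, 33 and 48: the gaps (18 and 15) are the byte lengths of the following record plus the newline separating it from the previous one, not 1.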

The getProgress method, also called by the framework, just gives you an
estimate, using simple math, of how much of the split has been processed.
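
One way that simple math could look, assuming the estimate is the fraction of the split's bytes consumed so far. The class name ProgressSketch is hypothetical, and the empty-split guard and the clamp to 1.0 are additions of this sketch, not something taken from the original code.

```java
// Hypothetical sketch of a progress estimate for one split.
class ProgressSketch {
    // pos: current read position; start/end: byte bounds of the split.
    static float progress(long pos, long start, long end) {
        if (end == start) {
            return 1.0f; // guard added in this sketch: an empty split counts as done
        }
        // Clamp added in this sketch: the reader may run past `end` while
        // finishing a record that straddles the split boundary.
        return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    public static void main(String[] args) {
        // Halfway through a split covering bytes 100..200.
        System.out.println(progress(150, 100, 200));
    }
}
```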


On Fri, Aug 9, 2013 at 1:01 PM, jamal sasha <jamalshasha@gmail.com> wrote:

> Hi,
>    I am trying to understand, how to write my own writable.
> So basically trying to understand how to process records spanning multiple
> lines.
> Can someone break down for me what things need to be
> considered in each method?
> I am trying to understand this example:
> https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
> Can someone explain to me in simple language what each code block is
> supposed to do?
> My apologies for asking such a "vaguely" posed question.
> Thanks
