hadoop-common-dev mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: \r\n problem in LineRecordReader.java
Date Wed, 13 Jun 2007 16:35:54 GMT
Hi Bwolen,

Thanks for investigating this.

You should file a Jira issue for this (with your traces, etc.). Let me
know if you want me to file it.

 > and I think this is precisely why, when you have mark()/reset() (as in
 > the \r\n case), the read can be small.  I can track down Java's
 > BufferedInputStream code to check for details, but it seems pretty
 > clear from the actual code execution stack.

As you noted, this could be caused by "marking" the input stream. I think
Hadoop's dependence on how many bytes a BufferedInputStream read returns
should be fixed. Alternatively, we could make this stream non-markable
(though we would still be depending on BufferedInputStream behavior that
is not part of its contract).
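
To illustrate what "not depending on the read length" could look like, here
is a minimal sketch, not the actual Hadoop code. The class and method names
(ReadFullySketch, readFully) are made up for this example. It assumes a
stream that follows the InputStream contract, i.e. read() with len > 0
returns at least one byte or -1 at end of stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of Hadoop.
public class ReadFullySketch {

    /**
     * Keeps calling read() until len bytes have arrived or the stream ends.
     * Returns the number of bytes actually read, which is smaller than len
     * only when end of stream was reached first.
     */
    static int readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        int total = 0;
        while (total < len) {
            // A single read() may legally return fewer bytes than requested.
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "line one\r\nline two\r\n".getBytes(StandardCharsets.US_ASCII);
        // Small buffer plus mark()/reset(), loosely mimicking the \r\n lookahead.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 4);
        in.mark(2);
        in.read();
        in.reset();

        byte[] buf = new byte[data.length];
        int n = readFully(in, buf, 0, buf.length);
        System.out.println(n + " of " + data.length + " bytes read");
    }
}
```

With a loop like this, the caller no longer cares whether an individual
read() came back short, whatever the reason for the short read was.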

You might find HADOOP-1470 and HADOOP-1134 (last 4-6 comments) relevant.


Bwolen Yang wrote:
>> taking values at runtime (I get them through exceptions when the result
>> is 0 and print out the values).
> The \r\n problem was observed on the 0.13.0 release.
> To study the behavior, I instrumented the Hadoop source from the head of
> the tree.
> More specifically, attached are two sample stacks.  (I have readbuffer
> throw when it gets 0 bytes, and have inputchecker catch the
> exception and rethrow both.  This way, I capture the values from both
> caller and callee.)
> On a separate note, if the (len>=bytesPerSum) assumption exists, would
> it be ok to throw an exception when it is violated?   Most of the time
> (e.g., in crawl/indexing), people won't notice that some part of the
> input data is getting thrown away.   It would be a lot easier to debug
> as the code changes (and the assumption gets violated), and the cost in
> this case is probably not too bad, as a good part of the cost is
> probably in the network and in going to disk.
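
For reference, the fail-fast check suggested above could look roughly like
the sketch below. This is only an illustration, not existing Hadoop code:
the class and method names (ChunkReadCheck, checkReadLength) are invented,
and 512 is just an example value for bytesPerSum.

```java
import java.io.IOException;

// Hypothetical example class; not part of Hadoop.
public class ChunkReadCheck {

    /**
     * Throws instead of silently continuing when a read returned fewer
     * bytes than one checksum chunk (the len >= bytesPerSum assumption).
     */
    static void checkReadLength(int len, int bytesPerSum) throws IOException {
        if (len < bytesPerSum) {
            throw new IOException("read returned " + len
                    + " bytes, expected at least bytesPerSum=" + bytesPerSum);
        }
    }

    public static void main(String[] args) {
        int bytesPerSum = 512; // example value only
        try {
            checkReadLength(512, bytesPerSum); // assumption holds, no exception
            checkReadLength(0, bytesPerSum);   // assumption violated, throws
        } catch (IOException e) {
            System.out.println("assumption violated: " + e.getMessage());
        }
    }
}
```

The check is a single integer comparison per read, so its cost should be
small next to the network and disk work mentioned above.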
