hadoop-common-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject Managing stdout in streaming
Date Tue, 01 Feb 2011 20:55:25 GMT
So streaming uses stdout to carry the mapper/reducer output, one record per line, with each
key/value pair split at the first TAB.

(Presumably additional TABs are permitted and simply become embedded in the value string; I
haven't experimented with this yet.)
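
To make the format concrete, here is a minimal sketch in C of splitting one record at the first
TAB. The name split_record is hypothetical, and treating a line with no TAB as a key with an
empty value is an assumption on my part, not something I have verified against the cluster.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: split one streaming record in place at the first TAB.
 * Afterwards *key points at the key and *value at the remainder of the line,
 * which may itself contain further TABs.
 * Assumption: a line with no TAB is treated as a key with an empty value. */
void split_record(char *line, char **key, char **value)
{
    char *tab;

    assert(line != NULL && key != NULL && value != NULL);

    line[strcspn(line, "\n")] = '\0';    /* drop the trailing newline, if any */

    tab = strchr(line, '\t');            /* only the FIRST TAB is a delimiter */
    *key = line;
    if (tab != NULL) {
        *tab = '\0';                     /* terminate the key */
        *value = tab + 1;                /* later TABs stay inside the value */
    } else {
        *value = line + strlen(line);    /* empty string */
    }
}
```

Under this reading, a line such as "k1<TAB>v1<TAB>v2" yields the key "k1" and the value
"v1<TAB>v2".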

Obviously, one must be very careful not to write any debugging or logging output to stdout.
It seems fairly straightforward to simply use stderr instead, so that all such output
appears in the job tracker logs.
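
For code I control, the separation could look roughly like the sketch below. Both helper names
(emit_record and log_debug) are made up for illustration; the only point is that records go to
one stream and diagnostics to another.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write one record in the TAB-delimited,
 * line-per-record form. Pass stdout as `out` in a real mapper/reducer.
 * Returns 0 on success, -1 if the write failed. */
int emit_record(FILE *out, const char *key, const char *value)
{
    assert(out != NULL && key != NULL && value != NULL);
    return fprintf(out, "%s\t%s\n", key, value) < 0 ? -1 : 0;
}

/* Hypothetical helper: diagnostics go to stderr only, never stdout,
 * so they cannot be mistaken for records. */
void log_debug(const char *msg)
{
    assert(msg != NULL);
    fprintf(stderr, "DEBUG: %s\n", msg);
}
```

A mapper would then call emit_record(stdout, key, value) for data and log_debug(...) for
everything else.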

Buuuuut, what if I'm using a third-party library and I can't tell it to send output elsewhere?
I know that it is possible to redirect stdout using tricks like freopen(), but I believe
it can be quite tricky to redirect stdout back to its original stream. So if I directed stdout
away from the original stream while the library does its processing, I'm not sure how I would
latch it back onto that stream for the purpose of generating my mapper/reducer output data (in
the Hadoop streaming TAB-delimited line-per-record format).
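
One possibility I can imagine on Linux, sketched below without having tried it on the cluster,
is to work at the file-descriptor level with the POSIX calls dup() and dup2() rather than
freopen(): keep a duplicate of the original stdout descriptor, point descriptor 1 at stderr
while the library runs, then point it back. The helper names are hypothetical, and the sketch
assumes the library writes through descriptor 1 or the C stdio stdout stream.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: send everything written to stdout to stderr instead.
 * Returns a duplicate of the original stdout descriptor (to be handed to
 * restore_stdout later), or -1 on failure. */
int stdout_to_stderr(void)
{
    int saved;

    fflush(stdout);                      /* push out records buffered so far */

    saved = dup(STDOUT_FILENO);          /* remember where stdout pointed */
    if (saved < 0)
        return -1;

    if (dup2(STDERR_FILENO, STDOUT_FILENO) < 0) {
        close(saved);
        return -1;
    }
    return saved;
}

/* Hypothetical helper: reattach stdout to the descriptor saved above.
 * Returns 0 on success, -1 on failure. */
int restore_stdout(int saved)
{
    int rc;

    assert(saved >= 0);

    fflush(stdout);                      /* library chatter still in the stdio
                                            buffer goes to stderr first */
    rc = dup2(saved, STDOUT_FILENO);
    close(saved);                        /* the duplicate is no longer needed */
    return rc < 0 ? -1 : 0;
}
```

The intended use would be: saved = stdout_to_stderr(); then the noisy library call; then
restore_stdout(saved); and only after that the printf() calls that emit the real records.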

Any thoughts on this?  The cluster is running Linux, incidentally.  I realize details like
that become important when one starts fiddling with redirecting streams and such.

Thank you.

Keith Wiley               kwiley@keithwiley.com               www.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
  -- Keith Wiley
