Mailing-List: contact chukwa-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: chukwa-user@hadoop.apache.org
User-Agent: Microsoft-Entourage/12.25.0.100505
Date: Wed, 21 Jul 2010 09:35:48 -0700
Subject: Re: ChukwaRecordOutputFormat only works with ChukwaRecordPartitioner
From: Eric Yang <eyang@yahoo-inc.com>
To: <chukwa-user@hadoop.apache.org>
Message-ID: <C86C6FF4.9D23%eyang@yahoo-inc.com>
Thread-Topic: ChukwaRecordOutputFormat only works with ChukwaRecordPartitioner
Thread-Index: Acso8sYL+MRS+swfy0GkMQXXJJUsVg==
In-Reply-To: <681A10D4-2345-4C9A-8445-E322CFE3E7E4@tynt.com>
Mime-version: 1.0
Content-type: multipart/alternative;
	boundary="B_3362549749_19744162"

--B_3362549749_19744162
Content-type: text/plain;
	charset="ISO-8859-1"
Content-transfer-encoding: quoted-printable

I think this is going in the right direction.  Does this filename convention
allow dfs -getmerge to work on the directory?  If it does, then I am fine
with it.  If it doesn't, it may be better to name the output file
MyDataType_20100720_0_35.R_part0, so that the partition label comes last,
as it does in the default MapReduce output names (part-00000 and so on).
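
To make the two naming options concrete, here is a minimal sketch. The
class and method names are made up for illustration and are not part of
Chukwa, and the time-bucket string is passed in as-is rather than produced
by Util.generateTimeOutput.

```java
// Hypothetical helper for comparing the two conventions; not Chukwa code.
final class PartNameSketch {

    // Convention from the quoted patch: partition label sits between the
    // data type and the time bucket.
    static String partInMiddle(String dataType, int partition, String timeBucket) {
        return dataType + "_part" + partition + "_" + timeBucket + ".R.evt";
    }

    // Alternative suggested above: partition label is a trailing suffix,
    // closer to how MapReduce names its default output files.
    static String partAsSuffix(String dataType, int partition, String timeBucket) {
        return dataType + "_" + timeBucket + ".R_part" + partition;
    }

    public static void main(String[] args) {
        System.out.println(partInMiddle("MyDataType", 0, "20100720_0_35"));
        System.out.println(partAsSuffix("MyDataType", 0, "20100720_0_35"));
    }
}
```

Either form keeps the names unique per reducer; the open question is only
which one the downstream tools handle better.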

Regards,
Eric

On 7/20/10 11:48 PM, "Corbin Hoenes" <corbin@tynt.com> wrote:

> I was looking at replacing the ChukwaRecordPartitioner with a
> HashbasedRecordParitioner. We discussed this here earlier, and there is an
> issue in JIRA: https://issues.apache.org/jira/browse/CHUKWA-481
>
> I patched Chukwa to allow a pluggable partitioner and configured it to use
> the hash-based partitioner.  But the job started failing to rename the
> _temporary files during the commit phase after the reduce had finished,
> because multiple reducers were now trying to move files with the same
> filename into /chukwa/demuxProcessing/mrOutput.  So I added a bit more to
> the filename in ChukwaRecordOutputFormat:
>
> // Label for the reducer partition this key/record is assigned to.
> private String getParition(ChukwaRecordKey key, ChukwaRecord record) {
>   return "part" + paritioner.getPartition(key, record,
>       conf.getInt("mapred.reduce.tasks", 0));
> }
>
> @Override
> protected String generateFileNameForKeyValue(ChukwaRecordKey key,
>     ChukwaRecord record, String name) {
>
>   // <cluster>/<dataType>/<dataType>_part<N><time suffix>
>   String output = RecordUtil.getClusterName(record) + "/"
>       + key.getReduceType() + "/" + key.getReduceType() + "_"
>       + getParition(key, record)
>       + Util.generateTimeOutput(record.getTime());
>
>   return output;
> }
>
> So my filenames are now
> /chukwa/demuxProcessing/mrOutput/MyCluster/MyDataType/MyDataType_part0_20100720_0_35.R.evt
>
> I just added the partition label to the filename, and now when
> PostProcessorManager picks up that directory it can mv each file into the
> correct time bucket in /chukwa/repos (it increments a count for each file
> in that directory).
>
> Is there a better solution? I am not sure how general-purpose mine is.
>
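
One detail worth checking in the snippet above: it falls back to 0 when
mapred.reduce.tasks is unset. If the hash-based partitioner follows the
usual (hash & Integer.MAX_VALUE) % numReduceTasks pattern, as Hadoop's
HashPartitioner does, a count of 0 would raise an ArithmeticException. The
sketch below shows one possible guard. It is only an illustration under
that assumption: the class name is hypothetical and a plain String stands
in for ChukwaRecordKey.

```java
// Hypothetical stand-in for a hash-based partitioner; not Chukwa code.
final class HashPartitionSketch {

    // Maps a key to a partition in [0, n), treating a missing or zero
    // reducer count as a single partition instead of dividing by zero.
    static int partitionFor(String key, int numReduceTasks) {
        int n = Math.max(1, numReduceTasks);
        // Masking with Integer.MAX_VALUE clears the sign bit, so negative
        // hash codes still yield a non-negative partition number.
        return (key.hashCode() & Integer.MAX_VALUE) % n;
    }

    // Builds the "partN" label used in the output file name.
    static String partLabel(String key, int numReduceTasks) {
        return "part" + partitionFor(key, numReduceTasks);
    }

    public static void main(String[] args) {
        System.out.println(partLabel("MyDataType", 4));
        System.out.println(partLabel("MyDataType", 0));
    }
}
```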


--B_3362549749_19744162--