From: Alan Paulsen <phearm@gmail.com>
To: user@avro.apache.org
Subject: RE: Question related to AvroJob.setMapOutputSchema(org.apache.hadoop.mapred.JobConf job, Schema s)
Date: Wed, 2 Oct 2013 20:50:18 -0500

Hi Yong,

Sorry for the delay.

You can specify all the schemas you will use in your driver. You can use SpecificRecord if you have generated the class files; GenericRecord can be used without any code generation. Take a look at some of the unit tests to get a better understanding of using SpecificRecords and GenericRecords.

In other words, if you are generating your Avro records in your reducers, you can write to your multiple outputs using the appropriate named output.

If you are generating the records in your mapper, you need to set the map output schema to a union of all of your schemas in order to pass the Avro records along to your reducers.

Thanks,

Alan

From: java8964 java8964 [mailto:java8964@hotmail.com]
Sent: Tuesday, October 01, 2013 6:22 AM
To: user@avro.apache.org
Subject: RE: Question related to AvroJob.setMapOutputSchema(org.apache.hadoop.mapred.JobConf job, Schema s)

Hi, Alan:

Thanks for your suggestion.
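[Editor's note: as a sketch of the union approach Alan describes, a driver might build one union schema covering every record type the mappers can emit and hand it to the mapred AvroJob API. The class name and the two inline schemas below are illustrative, not from the thread.]

```java
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.hadoop.mapred.JobConf;

public class UnionDriver {
    public static void main(String[] args) {
        // Two inline record schemas standing in for the five data sets.
        Schema users = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Users\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\"}]}");
        Schema clicks = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Clicks\",\"fields\":"
            + "[{\"name\":\"url\",\"type\":\"string\"}]}");

        // One union covering every record type the mappers may emit.
        Schema union = Schema.createUnion(Arrays.asList(users, clicks));

        JobConf conf = new JobConf(UnionDriver.class);
        // Any record matching a member schema can now travel map -> reduce.
        AvroJob.setMapOutputSchema(conf, union);
    }
}
```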
I will take a look at AvroMultipleOutputs. But in this case, I still need to specify the schema in my driver, right? You mean I should use a union schema in this case? But in my mapper, should I use SpecificRecord or GenericRecord? I can use (K,V) in my reducer, but in the mapper, I need the concrete Record object to serialize my data, right?

Yong

From: phearm@gmail.com
To: user@avro.apache.org
Subject: RE: Question related to AvroJob.setMapOutputSchema(org.apache.hadoop.mapred.JobConf job, Schema s)
Date: Mon, 30 Sep 2013 22:21:57 -0500

Hi Yong,

It sounds like you might need to use AvroMultipleOutputs here. You can set all five of your output schemas in your driver, then route each record to the appropriate output in your reducer.

See the following for mapred:
http://avro.apache.org/docs/1.7.5/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html

And the following for mapreduce:
http://avro.apache.org/docs/1.7.5/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html

If your mapper is generating the Avro records, then you will probably have to set AvroJob.setMapOutputSchema to a union of all five of your schemas.

Thanks,

Alan

From: java8964 java8964 [mailto:java8964@hotmail.com]
Sent: Monday, September 30, 2013 9:37 PM
To: user@avro.apache.org
Subject: Question related to AvroJob.setMapOutputSchema(org.apache.hadoop.mapred.JobConf job, Schema s)

Hi,

I am new to Avro. Currently, I am working on an existing project, and I want to see if using Avro makes sense.

The project does ETL on data from 5 data sets. The ETL logic is not complex; it applies a different transformation to each data set and partitions the data daily in the reducer.

There was originally one MR job to handle all 5 data sets. The data files follow a naming convention that distinguishes the data sets.
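[Editor's note: the AvroMultipleOutputs routing Alan suggests might look roughly like the reducer below, written against the mapreduce API linked above. The class name, the underscore-separated "dataset_date" key format, and the named-output names are illustrative assumptions; check the linked Javadoc for the exact signatures.]

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Routes each record to the named output registered for its data set.
public class RoutingReducer
        extends Reducer<Text, AvroValue<GenericRecord>, AvroKey<GenericRecord>, NullWritable> {

    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
        amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void reduce(Text key, Iterable<AvroValue<GenericRecord>> values, Context context)
            throws IOException, InterruptedException {
        // Assumes a "dataset_date" key; the data-set part names the output.
        String dataset = key.toString().split("_", 2)[0];
        for (AvroValue<GenericRecord> value : values) {
            amos.write(dataset, new AvroKey<GenericRecord>(value.datum()), NullWritable.get());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        amos.close();
    }
}
```

In the driver, each data set would be registered once with its own schema (for example with AvroMultipleOutputs.addNamedOutput and an Avro output format class), so the reducer never needs to know the concrete Record class.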
So in the mapper, the job uses the file name to determine which data set a record belongs to, and generates the key as "dataset_name + date" to partition the data first by data set, then by day.

Now, if I want to store the data in Avro format, it is straightforward to write an MR job for only one data set by following the many online examples. I have no problem changing the MR job to store the data in Avro format for one data set.

But if I still want to use one MR job for all 5 data sets, I have a problem. I tried both "SpecificRecord" and "GenericRecord", but I don't know how to solve it.

For example, I created 5 avsc files for the 5 data sets, and generated the Record classes for all of them. But in the mapper/reducer, I don't want to name any concrete Record class; the same mapper/reducer should be able to handle all data sets. So I tried to use the SpecificRecord class in my mapper/reducer, but in that case I don't have a SpecificRecord.SCHEMA$ to use in my driver for AvroJob.setMapOutputSchema(conf, schema), even though I really prefer "SpecificRecord".

That made me try "GenericRecord". I changed my mapper and reducer to use the "GenericRecord" class. But still, I don't know what schema I should use in my driver class for AvroJob.setMapOutputSchema(conf, schema). The question is: is there a generic abstract schema I can use in AvroJob.setMapOutputSchema or AvroJob.setOutputSchema? My mapper class will correctly generate either a "GenericRecord" or "SpecificRecord" object at runtime based on the file name, and the reducer will write the correct "GenericRecord" or "SpecificRecord" object to the right output location without knowing the concrete Record class. What stops me now is what kind of schema object I can use in AvroJob. I don't know at driver time what my output schema is, but the mapper/reducer will figure that out at runtime. Can I do this in Avro?
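[Editor's note: the runtime approach described above can be sketched with GenericRecord, which needs no generated classes. The schema and field names below are illustrative; in the real mapper the schema would be selected from the input file name rather than inlined.]

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RuntimeRecord {
    public static void main(String[] args) {
        // In the real mapper this schema would be chosen at runtime from
        // the input file name; here it is simply inlined.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Clicks\",\"fields\":"
            + "[{\"name\":\"url\",\"type\":\"string\"},"
            + "{\"name\":\"count\",\"type\":\"long\"}]}");

        // No generated class needed: fields are set by name.
        GenericRecord record = new GenericData.Record(schema);
        record.put("url", "http://example.com");
        record.put("count", 42L);

        System.out.println(record.getSchema().getFullName());
    }
}
```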
Thanks,

Yong