From: Fengyun RAO <raofengyun@gmail.com>
To: user@avro.apache.org
Date: Wed, 30 Apr 2014 14:41:22 +0800
Subject: Re: MapReduce: Using Avro Input/Output Formats without Specifying a schema

Take MapReduce as an example: a job needs a Runner, a Mapper, and a Reducer,
and the Mapper has to emit a single type (that is, a single Avro schema).

If you have a set of CSV files with different schemas, what output type would
you expect?

If all the CSV files share the same schema, you can create that schema
dynamically in the Runner before submitting the MR job. If you look into
Schema.java you will find create(), createRecord(), and similar APIs, so you
can simply read the header of one CSV file and construct the schema with them.
For example,

    AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));

sets a map output key schema that is simply a string.
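A rough, untested sketch of what that could look like in the Runner (the path,
record name, and helper class below are only placeholders, and it assumes every
CSV column becomes a required string field):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CsvSchemaHelper {

        // Reads the first line of the given CSV file and turns each column name
        // into a required string field of a record schema.
        public static Schema schemaFromCsvHeader(Configuration conf, Path csvPath)
                throws Exception {
            FileSystem fs = csvPath.getFileSystem(conf);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(csvPath), StandardCharsets.UTF_8))) {
                String header = in.readLine();                     // e.g. "id,name,price"
                SchemaBuilder.FieldAssembler<Schema> fields =
                        SchemaBuilder.record("CsvRow").namespace("example").fields();
                for (String column : header.split(",")) {
                    fields = fields.requiredString(column.trim()); // one string field per column
                }
                return fields.endRecord();
            }
        }
    }

    // In the Runner, before submitting the job (path is a placeholder):
    //   Schema schema = CsvSchemaHelper.schemaFromCsvHeader(conf, new Path("/data/in/first.csv"));
    //   AvroJob.setMapOutputValueSchema(job, schema);
    //   AvroJob.setOutputKeySchema(job, schema);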
2014-04-30 4:56 GMT+08:00 Ryan Tabora <ratabora@gmail.com>:

> Hi all,
>
> Whether you're using Hive or MapReduce, the Avro input/output formats
> require you to specify a schema at the beginning of the job, or in the
> table definition, in order to work with them. Is there any way to
> configure the jobs so that the input/output formats can dynamically
> determine the schema from the data itself?
>
> Think about a job like this. I have a set of CSV files that I want to
> serialize into Avro files. These CSV files are self-describing, and each
> CSV file has a unique schema. If I want to write a job that scans over
> all of this data and serializes it into Avro, I can't do that with
> today's tools (as far as I know). If I can't specify the schema up front,
> what can I do? Am I forced to write my own Avro input/output formats?
>
> The Avro schema is stored within the Avro data file itself, so why can't
> these input/output formats be smart enough to figure that out? Am I
> fundamentally doing something against the principles of the Avro format?
> I would be surprised if no one has run into this issue before.
>
> Regards,
> Ryan Tabora
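For reference on that last point: the writer schema embedded in an Avro
container file can indeed be read back without specifying it up front. A
minimal sketch, with the file name as a placeholder:

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class PrintWriterSchema {
        public static void main(String[] args) throws Exception {
            File avroFile = new File("part-00000.avro");         // placeholder file name
            try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                    avroFile, new GenericDatumReader<GenericRecord>())) {
                Schema writerSchema = reader.getSchema();        // schema stored in the file header
                System.out.println(writerSchema.toString(true)); // pretty-printed JSON
            }
        }
    }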