incubator-hcatalog-user mailing list archives

From Thomas Weise <...@yahoo-inc.com>
Subject Re: HCatOutputFormat schema issues
Date Tue, 01 Nov 2011 17:42:48 GMT
We should fix the documentation then?

http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html


On 11/1/11 9:13 AM, "Ashutosh Chauhan" <hashutosh@apache.org> wrote:

Hey Charles,

After you have called HCatOutputFormat.setOutput(), you can call HCatOutputFormat.getTableSchema(),
which returns the schema of the table. You can then use that schema directly, without having to
construct it manually.

Hope it helps,
Ashutosh
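
A minimal sketch of the dynamic name-to-position mapping asked about in this thread. It assumes the
ordered column names have already been read from the schema returned by
HCatOutputFormat.getTableSchema(); how to extract them is not shown in the thread, so the list is
hard-coded here and the sketch runs without HCatalog on the classpath. The class and method names
(SchemaMapping, positions, toRow) are hypothetical, not part of HCatalog.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SchemaMapping {

    // Build a column-name -> position lookup from an ordered list of column names.
    // Assumption: in a real job this list comes from the table schema.
    public static Map<String, Integer> positions(List<String> columnNames) {
        Map<String, Integer> pos = new HashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            pos.put(columnNames.get(i).toLowerCase(), i);
        }
        return pos;
    }

    // Place named values into a row laid out in table column order.
    // Columns that receive no value are left as null.
    public static Object[] toRow(Map<String, Integer> pos, Map<String, Object> values) {
        Object[] row = new Object[pos.size()];
        for (Map.Entry<String, Object> e : values.entrySet()) {
            Integer i = pos.get(e.getKey().toLowerCase());
            if (i == null) {
                throw new IllegalArgumentException("Unknown column: " + e.getKey());
            }
            row[i] = e.getValue();
        }
        return row;
    }

    public static void main(String[] args) {
        // Column order of the "inventory" table from the original question.
        List<String> columns = Arrays.asList("word", "author", "frequency");
        Map<String, Integer> pos = positions(columns);

        // Values supplied by name, in arbitrary order.
        Map<String, Object> values = new HashMap<>();
        values.put("frequency", 42);
        values.put("word", "whale");
        values.put("author", "melville");

        System.out.println(Arrays.toString(toRow(pos, values))); // prints [whale, melville, 42]
    }
}
```

With a lookup like this, the job code refers to columns by name only, so a change in the table's
column order does not require touching hardcoded positions.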

On Mon, Oct 31, 2011 at 20:18, Charles Menguy <cmenguy@proclivitysystems.com> wrote:
Hi Ashutosh,

Thank you very much for your answer.

I can certainly understand your argument. Is there, however, a way to get the schema from the
output table, so that we could create a dynamic mapping between the fields we want to write
and the actual schema? If not, is there a standard way to accomplish what I described, other
than hardcoding the positions of the columns in the code (bad for code reusability)? Any
alternative would be helpful as well.

Thanks in advance !

Charles

On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <hashutosh@apache.org> wrote:
Hey Charles,

Yeah, you need to call setSchema() on HCatOutputFormat explicitly. We could assume defaults, but
we don't, for the following reason. Rows being written may or may not contain partition
columns, and HCatOutputFormat transparently weeds out partition columns if they are present in
the row. If we assumed defaults, we would have to assume that the data does not contain
partition columns (we don't store partition columns in the data), which is a dangerous
assumption that would screw things up when we read the data back. So instead we ask the user to
set the schema. You are also correct that the order of columns should be the same as the one
you declared when creating the table.

Hope it helps,
Ashutosh
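
To illustrate the point about partition columns, here is a small sketch of what "weeding out"
partition columns from a row amounts to. It is not HCatalog's implementation, only a stand-alone
model of the idea: the declared schema says which columns the row carries, and any that are
partition columns are dropped before the data is stored. The class name PartitionColumnFilter, the
method dataColumnsOnly, and the partition column "ds" are hypothetical (the inventory table in this
thread is not partitioned).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PartitionColumnFilter {

    // Keep only the values whose column is not a partition column.
    // rowColumns is the schema the writer declared for the row, in row order.
    public static List<Object> dataColumnsOnly(List<String> rowColumns,
                                               List<Object> row,
                                               Set<String> partitionColumns) {
        if (rowColumns.size() != row.size()) {
            throw new IllegalArgumentException("Row does not match the declared schema");
        }
        List<Object> kept = new ArrayList<>();
        for (int i = 0; i < row.size(); i++) {
            if (!partitionColumns.contains(rowColumns.get(i))) {
                kept.add(row.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The writer declares that each row also carries the (hypothetical) partition column "ds".
        List<String> rowColumns = Arrays.asList("word", "author", "frequency", "ds");
        List<Object> row = Arrays.<Object>asList("whale", "melville", 42, "2011-10-31");
        Set<String> partitionColumns = new HashSet<>(Arrays.asList("ds"));

        System.out.println(dataColumnsOnly(rowColumns, row, partitionColumns));
        // prints [whale, melville, 42]
    }
}
```

Without a declared row schema there is no way to tell whether the fourth value is a partition
value or a mistake, which is the ambiguity the explicit schema call avoids.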


On Mon, Oct 31, 2011 at 14:54, Charles Menguy <cmenguy@proclivitysystems.com> wrote:
Hi,

I've been playing with HCatalog for the past couple weeks now, and I have a few questions
regarding schemas in MR jobs.

From what I read in the documentation, schemas are optional, and if not specified it defaults
to the table level schemas. Here are some extracts from the documentation:

"You can use the setOutputSchema method to include a projection schema, to specify specific
output fields. If a schema is not specified, this default to the table level schema."

"The schema for the data being written out is specified by the setSchema method. If this is
not called on the HCatOutputFormat, then by default it is assumed that the the partition has
the same schema as the current table level schema"

Now when I try to omit the schema for HCatInputFormat, it works fine and assumes the default.
But when I try to omit the schema for HCatOutputFormat, I get the following error:

org.apache.hcatalog.common.HCatException : 9001 : Exception occurred while processing HCat
request : It seems that setSchema() is not called on HCatOutputFormat. Please make sure that
method is called.

From what I read, it expects me to explicitly define the schema with
HCatOutputFormat.setSchema(...), but this is exactly what I would like to omit so that the
defaults are assumed.

This is actually important because it seems that to define the schema, you have to know the
order of your table's columns when you build your List<HCatFieldSchema>, and that order may
not always be obvious.

Here is how I create my output table in Hive, which works fine when I'm manipulating it while
specifying the schema:
hive> create table inventory(word STRING, author STRING, frequency INT) stored as RCFILE;

I would like to know if I'm doing something wrong, or if this is simply something not yet
implemented in 0.2? Any thoughts would be useful.

Thanks,

Charles






