incubator-hcatalog-user mailing list archives

From Ashutosh Chauhan <hashut...@apache.org>
Subject Re: HCatOutputFormat schema issues
Date Tue, 01 Nov 2011 17:51:30 GMT
Yeah. Created https://issues.apache.org/jira/browse/HCATALOG-150

Ashutosh

On Tue, Nov 1, 2011 at 10:42, Thomas Weise <thw@yahoo-inc.com> wrote:

> Should we fix the documentation, then?
>
> http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html
>
>
>
> On 11/1/11 9:13 AM, "Ashutosh Chauhan" <hashutosh@apache.org> wrote:
>
> Hey Charles,
>
> After you have called HCatOutputFormat.setOutput(), you can call
> HCatOutputFormat.getTableSchema(), which returns the schema of the table.
> You can then use that schema directly instead of constructing one
> manually.
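
[Editor's sketch] Once the table schema is available, the dynamic mapping Charles asks about can be built by indexing column names by position. The following is a minimal, self-contained sketch in plain Java. It does not call HCatalog: the list of column names stands in for whatever the schema returned by getTableSchema() exposes, and the class and method names (ColumnMapping, positionsByName, toRow) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnMapping {

    // Index the column names by their position in the table declaration.
    // Assumption: columnNames is in the same order as the table's columns.
    public static Map<String, Integer> positionsByName(List<String> columnNames) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            positions.put(columnNames.get(i), i);
        }
        return positions;
    }

    // Place each named value at the position the table expects,
    // so the calling code never hardcodes a column index.
    public static Object[] toRow(Map<String, Integer> positions,
                                 Map<String, Object> valuesByName) {
        Object[] row = new Object[positions.size()];
        for (Map.Entry<String, Object> entry : valuesByName.entrySet()) {
            Integer position = positions.get(entry.getKey());
            if (position == null) {
                throw new IllegalArgumentException("Unknown column: " + entry.getKey());
            }
            row[position] = entry.getValue();
        }
        return row;
    }

    public static void main(String[] args) {
        // Column order of the "inventory" table from this thread.
        List<String> columns = Arrays.asList("word", "author", "frequency");
        Map<String, Integer> positions = positionsByName(columns);

        // Values supplied in arbitrary order, keyed by column name.
        Map<String, Object> values = new HashMap<>();
        values.put("frequency", 3);
        values.put("word", "whale");
        values.put("author", "melville");

        // prints [whale, melville, 3]
        System.out.println(Arrays.toString(toRow(positions, values)));
    }
}
```

In a real job, the resulting array would still have to be wrapped in the record type that HCatOutputFormat expects; that step is left out here because it depends on the HCatalog version in use.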
>
> Hope it helps,
> Ashutosh
>
> On Mon, Oct 31, 2011 at 20:18, Charles Menguy <
> cmenguy@proclivitysystems.com> wrote:
>
> Hi Ashutosh,
>
> Thank you very much for your answer.
>
> I can certainly understand your argument. Is there, however, a way to get
> the schema from the output table, so that we could create a dynamic
> mapping between the fields we want to write and the actual schema? If
> not, is there a standard way to accomplish what I described, other than
> hardcoding the positions of the columns in the code (bad for code
> reusability)? Any alternative would be helpful as well.
>
> Thanks in advance !
>
> Charles
>
> On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <hashutosh@apache.org>
> wrote:
>
> Hey Charles,
>
> Yeah, you need to call setSchema() on HCatOutputFormat explicitly.
> We could assume defaults, but we don't, for the following reason. The rows
> being written may or may not contain partition columns. HCatOutputFormat
> transparently weeds out partition columns if they are present in the row.
> If we assumed defaults, we would have to assume that the data does not
> contain partition columns (we don't store partition columns in the data),
> which is a dangerous assumption that would break things when the data is
> read back. So instead we ask the user to set the schema. You are also
> correct that the order of columns should be the same as the one you
> declared when creating the table.
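
[Editor's sketch] The reasoning above can be illustrated with a small, self-contained Java example. It is not HCatalog's implementation, only a sketch of why the writer needs the row's schema: the same filter drops a partition value when the declared schema includes the partition column and leaves the row untouched when it does not. The class and method names and the partition column "ds" are hypothetical; the inventory table in this thread is not partitioned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class PartitionColumnFilter {

    // Keep only the values whose column is not a partition column.
    // rowColumns is the schema the caller declared for the row; without it
    // there is no way to tell which values, if any, are partition values.
    public static List<Object> dropPartitionColumns(List<String> rowColumns,
                                                    List<Object> row,
                                                    Set<String> partitionColumns) {
        if (rowColumns.size() != row.size()) {
            throw new IllegalArgumentException("Row does not match its declared schema");
        }
        List<Object> stored = new ArrayList<>();
        for (int i = 0; i < row.size(); i++) {
            if (!partitionColumns.contains(rowColumns.get(i))) {
                stored.add(row.get(i));
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        Set<String> partitionColumns = Collections.singleton("ds"); // hypothetical

        // Case 1: the declared schema includes the partition column.
        List<Object> withPartition = dropPartitionColumns(
                Arrays.asList("word", "author", "frequency", "ds"),
                Arrays.<Object>asList("whale", "melville", 3, "2011-11-01"),
                partitionColumns);

        // Case 2: the declared schema has data columns only.
        List<Object> withoutPartition = dropPartitionColumns(
                Arrays.asList("word", "author", "frequency"),
                Arrays.<Object>asList("whale", "melville", 3),
                partitionColumns);

        System.out.println(withPartition);    // prints [whale, melville, 3]
        System.out.println(withoutPartition); // prints [whale, melville, 3]
    }
}
```

Both cases store the same three values, but only because the caller stated which columns each row carries. A default schema would force the writer to guess between the two cases.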
>
> Hope it helps,
> Ashutosh
>
>
> On Mon, Oct 31, 2011 at 14:54, Charles Menguy <
> cmenguy@proclivitysystems.com> wrote:
>
> Hi,
>
> I've been playing with HCatalog for the past couple weeks now, and I have
> a few questions regarding schemas in MR jobs.
>
> From what I read in the documentation, schemas are optional and, if not
> specified, default to the table-level schema. Here are some extracts
> from the documentation:
> You can use the setOutputSchema method to include a projection schema, to
> specify specific output fields. If a schema is not specified, this default
> to the table level schema.
> The schema for the data being written out is specified by the setSchema method.
> If this is not called on the HCatOutputFormat, then by default it is
> assumed that the the partition has the same schema as the current table
> level schema
>
> Now when I try to omit the schema for HCatInputFormat, it works fine and
> assumes the default.
> But when I try to omit the schema for HCatOutputFormat, I get the
> following error: org.apache.hcatalog.common.HCatException : 9001 :
> Exception occurred while processing HCat request : It seems that
> setSchema() is not called on HCatOutputFormat. Please make sure that method
> is called.
> From what I read, it expects me to define the schema explicitly with
> HCatOutputFormat.setSchema(...), but that is exactly the call I would like
> to omit so that the defaults are assumed.
>
> This is actually important because, to define the schema, you have to
> list the fields in your List<HCatFieldSchema> in the same order as the
> table columns, and that order may not always be obvious.
>
> Here is how I create my output table in Hive, which works fine when I'm
> manipulating it while specifying the schema:
> hive> create table inventory(word STRING, author STRING, frequency INT)
> stored as RCFILE;
>
> I would like to know if I'm doing something wrong, or if this is simply
> something not yet implemented in 0.2? Any thoughts would be useful.
>
> Thanks,
>
> Charles
>
>
>
>
>
>
>
