incubator-hcatalog-user mailing list archives

From Charles Menguy <cmen...@proclivitysystems.com>
Subject Re: HCatOutputFormat schema issues
Date Wed, 02 Nov 2011 14:17:27 GMT
This works fine with a non-partitioned table: I can just set the
schema to the table's schema using something like
HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job));

For a partitioned table, however, as you explained, the getTableSchema call
returns only the non-partition columns, so this approach fails as
expected: you have to explicitly add the partition columns to the schema.
Once they are added, it works fine. For now I add the partition columns to
the table schema manually, which is a bit tedious. Is there by any chance a
way to get the list of partition columns from HCatOutputFormat or anywhere
else, so that I can fetch them, append them to the table schema, set the
schema, and be done? Or will I still have to do it manually?
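
As a rough illustration of the manual step described above, here is a minimal, self-contained Java sketch. It deliberately does not use the HCatalog API: Column is a hypothetical stand-in for HCatFieldSchema, withPartitionColumns and columnNames are invented helpers, and the partition column ds is made up for the example (the inventory columns come from the table definition quoted later in this thread). Matching names case-insensitively is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PartitionSchemaSketch {

    /** Hypothetical stand-in for HCatFieldSchema: a column name plus a type name. */
    static final class Column {
        final String name;
        final String type;

        Column(String name, String type) {
            this.name = name;
            this.type = type;
        }

        @Override
        public String toString() {
            return name + ":" + type;
        }
    }

    /**
     * Returns the table's data columns followed by the partition columns.
     * A partition column whose name already appears among the data columns
     * is skipped (names are compared case-insensitively, an assumption here).
     */
    static List<Column> withPartitionColumns(List<Column> tableColumns,
                                             List<Column> partitionColumns) {
        // LinkedHashMap keeps insertion order, so the declared column order is preserved.
        Map<String, Column> merged = new LinkedHashMap<>();
        for (Column c : tableColumns) {
            merged.put(c.name.toLowerCase(Locale.ROOT), c);
        }
        for (Column p : partitionColumns) {
            merged.putIfAbsent(p.name.toLowerCase(Locale.ROOT), p);
        }
        return new ArrayList<>(merged.values());
    }

    /** Convenience helper: the column names of a schema, in order. */
    static List<String> columnNames(List<Column> columns) {
        List<String> names = new ArrayList<>();
        for (Column c : columns) {
            names.add(c.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Data columns of the inventory table from this thread.
        List<Column> table = Arrays.asList(
                new Column("word", "STRING"),
                new Column("author", "STRING"),
                new Column("frequency", "INT"));
        // Hypothetical partition column; the thread does not name one.
        List<Column> partitions = Arrays.asList(new Column("ds", "STRING"));

        System.out.println(withPartitionColumns(table, partitions));
        // prints [word:STRING, author:STRING, frequency:INT, ds:STRING]
    }
}
```

In a real job the merged list would still have to be converted into whatever schema object HCatOutputFormat.setSchema expects; that conversion is not shown here.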

I also noticed that there is no way to read back the schema that was
actually set on HCatOutputFormat. You can get the table schema by calling
getTableSchema, which is great, but I don't see a way to retrieve the schema
passed to setSchema. This is not critical, but I just wanted to mention it.

Thanks for the support on this particular issue, it was very helpful!

Charles

On Tue, Nov 1, 2011 at 2:29 PM, Ashutosh Chauhan <hashutosh@apache.org> wrote:

> Sure. Try it out and let us know how it goes. In the meanwhile, we will
> get docs fixed.
>
> Ashutosh
>
> On Tue, Nov 1, 2011 at 10:59, Charles Menguy <
> cmenguy@proclivitysystems.com> wrote:
>
>> Thanks for the information, Ashutosh. I'll try what you're suggesting; it
>> sounds like a good solution for now.
>>
>> And yes, I agree with Thomas: it would be a good idea to fix the following
>> line in the documentation, as it is pretty confusing:
>> The schema for the data being written out is specified by the setSchema method.
>> If this is not called on the HCatOutputFormat, then by default it is
>> assumed that the partition has the same schema as the current table
>> level schema.
>>
>> Thanks for the help!
>>
>> Charles
>>
>> On Tue, Nov 1, 2011 at 1:42 PM, Thomas Weise <thw@yahoo-inc.com> wrote:
>>
>>> We should fix the documentation then?
>>>
>>> http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html
>>>
>>>
>>>
>>> On 11/1/11 9:13 AM, "Ashutosh Chauhan" <hashutosh@apache.org> wrote:
>>>
>>> Hey Charles,
>>>
>>> After you have called HCatOutputFormat.setOutput(), you can call
>>> HCatOutputFormat.getTableSchema(), which returns the schema of the table.
>>> You can then use that schema directly instead of constructing it
>>> manually.
>>>
>>> Hope it helps,
>>> Ashutosh
>>>
>>> On Mon, Oct 31, 2011 at 20:18, Charles Menguy <
>>> cmenguy@proclivitysystems.com> wrote:
>>>
>>> Hi Ashutosh,
>>>
>>> Thank you very much for your answer.
>>>
>>> I can certainly understand your argument. Is there however a way to get
>>> the schema from the output table, so we could potentially create a
>>> dynamic mapping of fields you want to write to and the actual schema? If
>>> not, is there any standard way to be able to accomplish what I described,
>>> other than hardcoding the positions of the columns in the code (bad for
>>> code reusability)? Any alternative would be helpful as well.
>>>
>>> Thanks in advance!
>>>
>>> Charles
>>>
>>> On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <hashutosh@apache.org>
>>> wrote:
>>>
>>> Hey Charles,
>>>
>>> Yeah, you need to call setSchema() on HCatOutputFormat explicitly.
>>> We could assume defaults, but we don't, for the following reason.
>>> The rows being written may or may not contain partition columns.
>>> HCatOutputFormat will transparently weed out partition columns if they
>>> are present in the row. If we assumed defaults, we would have to assume that
>>> the data does not contain partition columns (we don't store partition columns
>>> in the data), which is a dangerous assumption that will screw things up
>>> when we read the data back. So instead we ask the user to set the schema. You
>>> are also correct that the order of columns should be the same as the one you
>>> declared when creating the table.
>>>
>>> Hope it helps,
>>> Ashutosh
>>>
>>>
>>> On Mon, Oct 31, 2011 at 14:54, Charles Menguy <
>>> cmenguy@proclivitysystems.com> wrote:
>>>
>>> Hi,
>>>
>>> I've been playing with HCatalog for the past couple weeks now, and I
>>> have a few questions regarding schemas in MR jobs.
>>>
>>> From what I read in the documentation, schemas are optional, and if one is
>>> not specified it defaults to the table level schema. Here are some extracts
>>> from the documentation:
>>> You can use the setOutputSchema method to include a projection schema,
>>> to specify specific output fields. If a schema is not specified, this
>>> defaults to the table level schema.
>>> The schema for the data being written out is specified by the setSchema method.
>>> If this is not called on the HCatOutputFormat, then by default it is
>>> assumed that the partition has the same schema as the current table
>>> level schema
>>>
>>> Now when I try to omit the schema for HCatInputFormat, it works fine and
>>> assumes the default.
>>> But when I try to omit the schema for HCatOutputFormat, I get the
>>> following error: org.apache.hcatalog.common.HCatException : 9001 :
>>> Exception occurred while processing HCat request : It seems that
>>> setSchema() is not called on HCatOutputFormat. Please make sure that method
>>> is called.
>>> From what I read, it expects me to explicitly define the schema with
>>> HCatOutputFormat.setSchema(...), but that is exactly the call I would like
>>> to omit so that the defaults are assumed.
>>>
>>> This is actually important because it seems that to define the schema,
>>> you have to know the order of your table columns so that you can list them
>>> in the same order in your List<HCatFieldSchema>, which may not always be
>>> obvious.
>>>
>>> Here is how I create my output table in Hive, which works fine when I'm
>>> manipulating it while specifying the schema:
>>> hive> create table inventory(word STRING, author STRING, frequency INT)
>>> stored as RCFILE;
>>>
>>> I would like to know if I'm doing something wrong, or if this is simply
>>> something not yet implemented in 0.2? Any thoughts would be useful.
>>>
>>> Thanks,
>>>
>>> Charles
>>>
