incubator-hcatalog-user mailing list archives

From Travis Crawford <traviscrawf...@gmail.com>
Subject Re: Attach metadata to files in HDFS
Date Wed, 27 Jun 2012 21:03:40 GMT
Yeah sounds like per-country partitions would work here!

--travis
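
To illustrate the per-country partition idea, here is a minimal Python sketch that groups comma-delimited log lines under Hive-style `key=value` partition directories. Only the `countryfield` name comes from the thread. The table root `/data/logs`, the sample records, and the helper names `partition_dir` and `group_by_partition` are hypothetical, and the sketch only computes paths. It does not write to HDFS or register partitions in the metastore.

```python
from collections import defaultdict


def partition_dir(table_root, country):
    """Return the Hive-style partition directory for one country value."""
    return f"{table_root}/countryfield={country}"


def group_by_partition(table_root, tagged_lines):
    """Group (country, csv_line) pairs by the partition directory they belong to."""
    groups = defaultdict(list)
    for country, line in tagged_lines:
        groups[partition_dir(table_root, country)].append(line)
    return dict(groups)


# Hypothetical sample: the country is known at ingestion time, not stored in the line.
sample = [
    ("US", "2012-06-27,GET,/index"),
    ("DE", "2012-06-27,GET,/home"),
    ("US", "2012-06-27,POST,/login"),
]

for path, lines in sorted(group_by_partition("/data/logs", sample).items()):
    print(path, len(lines))
```

With this layout the country value lives in the directory name rather than in every row, so it would no longer need to be appended to each record before ingestion.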


On Wed, Jun 27, 2012 at 1:45 PM, agateaaa <agateaaa@gmail.com> wrote:
> Thanks for your response.
>
> We convert the log files into comma delimited format fairly early on in the
> ingestion process
> so we were looking at HCatalog as a way to associate additional metadata say
> 'countryfield' with
> the data.
>
> From your response, it looks like having a partition on 'countryfield' is the
> way to do it.
>
> Yes, using per-partition properties is probably not useful, as we also need
> to be able to query the data based on the additional metadata.
>
> Agatea
>
>
> On Wed, Jun 27, 2012 at 1:12 PM, Travis Crawford <traviscrawford@gmail.com>
> wrote:
>>
>> In your system, what sort of serde are you using? It needs to produce
>> records with the correct schema, so you could do something like this:
>>
>> * Set the table schema to contain all possible fields, even ones
>> records might not contain.
>> * Use a serde that knows how to parse all possible records, and fills
>> in missing fields.
>>
>> Each partition would then have a schema matching the table and things
>> would work. We do something similar, but use Thrift for our records.
>> Since Thrift objects can have optional fields, and Thrift is backwards
>> compatible, the Thrift serde handles producing messages of the correct
>> schema.
>>
>> The HiveMetaStore does have per-partition properties, but I don't
>> recommend storing per-file metadata there.
>>
>> --travis
>>
>>
>> On Wed, Jun 27, 2012 at 10:20 AM, agateaaa <agateaaa@gmail.com> wrote:
>> > Hi,
>> >
>> > I am evaluating HCatalog and have a specific use case. I like the fact
>> > that HCatalog gives a consistent interface to the data in HDFS
>> > across different tools like Hive, Pig and MapReduce.
>> >
>> > We want to be able to associate metadata with the log files that we are
>> > currently storing on HDFS.
>> >
>> > We are pulling in thousands of log files, and since the data in the log
>> > files lacks certain fields, we end up adding those fields to the data
>> > before ingesting it into HDFS and processing it further.
>> >
>> > I was reading through the documentation, mailing lists and articles on
>> > HCatalog I could find [1] and [2] below
>> > which imply that it is possible to associate metadata with your data
>> > using
>> > HCatalog.
>> >
>> > My questions are
>> >
>> > 1.) Can I define a schema and associate it with individual files or
>> > group of
>> > files on hdfs ?
>> >
>> > 2.) Can I change this metadata schema over time and not affect existing
>> > files?
>> >
>> > 3.) Are these metadata fields available in pig scripts processing that
>> > data
>> > so we could filter data using fields in
>> > the metadata defined for these files?
>> >
>> >
>> > I have used hive before and one possible solution I see is to use
>> > partitions
>> > to define your metadata fields but I was just
>> > wondering if there is any other HCatalog way of defining this metadata
>> > which
>> > does not involve partitions.
>> >
>> > Thanks in advance for your help,
>> >
>> > Agatea
>> >
>> > [1] http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hcatalog-tables-and-metadata-for-hadoop/
>> > [2] http://www.mail-archive.com/hcatalog-user@incubator.apache.org/msg00004.html
>> >
>
>
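
The serde approach described earlier in the thread (a table schema listing every possible field, plus a parser that fills in whatever a record lacks) can be sketched in a few lines of Python. This is only an illustration of the logic, not an actual Hive SerDe, which would be written in Java against Hive's serde interfaces. The field names in `TABLE_SCHEMA` (other than `countryfield`) and the function `parse_record` are hypothetical.

```python
# Hypothetical table schema: every field any record might carry.
TABLE_SCHEMA = ["timestamp", "method", "path", "countryfield"]


def parse_record(line, source_fields, schema=TABLE_SCHEMA):
    """Parse one comma-delimited line into a row matching the full table schema.

    source_fields names the columns this particular log source actually emits,
    in order. Schema fields the source does not provide are filled with None.
    """
    # Naive split: assumes no quoted or escaped commas inside values.
    values = line.rstrip("\n").split(",")
    present = dict(zip(source_fields, values))
    return [present.get(name) for name in schema]


# A source that lacks the country field still yields a full-width row.
row = parse_record("2012-06-27,GET,/index", ["timestamp", "method", "path"])
print(row)
```

Because every row comes out with the same width and field order, each partition can share the table schema even when individual sources omit fields.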
