incubator-hcatalog-user mailing list archives

From Travis Crawford <traviscrawf...@gmail.com>
Subject Re: Attach metadata to files in HDFS
Date Wed, 27 Jun 2012 20:12:10 GMT
In your system, what sort of serde are you using? It needs to produce
records with the correct schema, so you could do something like this:

* Set the table schema to contain all possible fields, even ones
records might not contain.
* Use a serde that knows how to parse all possible records, and fills
in missing fields.

Each partition would then have a schema matching the table and things
would work. We do something similar, but use Thrift for our records.
Since Thrift objects can have optional fields and Thrift is backwards
compatible, the Thrift serde handles producing messages of the correct
schema.
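
As a rough illustration of the "superset schema plus fill-in" idea, here is a
minimal Python sketch. It is not a Hive serde and does not use any HCatalog
API; the field names, the tab-separated key=value record format, and the
deserialize function are all assumptions made up for the example. It only
shows how old and new records can both come out with the same columns.

```python
# Hypothetical superset schema: every field any record might carry.
TABLE_SCHEMA = ["timestamp", "host", "level", "message", "datacenter"]


def deserialize(line, schema=TABLE_SCHEMA):
    """Parse a tab-separated key=value line into a row matching the schema.

    Fields missing from the line come back as None; keys that are not in
    the schema are dropped.
    """
    parsed = {}
    for token in line.rstrip("\n").split("\t"):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that have no '='
            parsed[key] = value
    # Emit one value per schema column, in schema order.
    return [parsed.get(name) for name in schema]


# An older record that lacks 'level' and 'datacenter' (made-up data).
old_record = "timestamp=2012-06-27T20:12:10Z\thost=web01\tmessage=started"
# A newer record that carries every field.
new_record = old_record + "\tlevel=INFO\tdatacenter=east"

old_row = deserialize(old_record)
new_row = deserialize(new_record)
```

Both rows have one value per table column, so files written before a field
existed and files written after it can be read through the same table schema.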

The HiveMetaStore does have per-partition properties, but I don't
recommend storing per-file metadata there.

--travis


On Wed, Jun 27, 2012 at 10:20 AM, agateaaa <agateaaa@gmail.com> wrote:
> Hi,
>
> I am evaluating HCatalog and have a specific use case. I like the fact that
> HCatalog gives a consistent interface to the data in HDFS
> across different tools like Hive, Pig, and MapReduce.
>
> We want to be able to associate metadata with the log files that we are
> currently storing on hdfs.
>
> We are pulling in thousands of log files and since the data in the log files
> lacks certain
> fields we end up adding those fields to the data before ingesting them in
> hdfs before processing them further.
>
> I read through the documentation, mailing lists, and articles on
> HCatalog that I could find. References [1] and [2] below
> imply that it is possible to associate metadata with your data using
> HCatalog.
>
> My questions are
>
> 1.) Can I define a schema and associate it with individual files or group of
> files on hdfs ?
>
> 2.) Can I change this metadata schema over time and not affect existing
> files?
>
> 3.) Are these metadata fields available in pig scripts processing that data
> so we could filter data using fields in
> the metadata defined for these files?
>
>
> I have used hive before and one possible solution I see is to use partitions
> to define your metadata fields but I was just
> wondering if there is any other HCatalog way of defining this metadata which
> does not involve partitions.
>
> Thanks in advance for your help,
>
> Agatea
>
> [1]
> http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hcatalog-tables-and-metadata-for-hadoop/
> [2]
> http://www.mail-archive.com/hcatalog-user@incubator.apache.org/msg00004.html
>
