incubator-hcatalog-user mailing list archives

From agateaaa <agate...@gmail.com>
Subject Re: Attach metadata to files in HDFS
Date Wed, 27 Jun 2012 20:45:46 GMT
Thanks for your response.

We convert the log files into comma-delimited format fairly early in the
ingestion process, so we were looking at HCatalog as a way to associate
additional metadata, say 'countryfield', with the data.

From your response, it looks like partitioning on 'countryfield' is the
way to do it.
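
As a rough sketch of what that layout could look like: this assumes the usual
Hive convention of key=value partition directories, where the partition value
lives in the directory name rather than in the data files. The /data/logs
root, the sample rows and the helper names are made up for illustration.

```python
from collections import defaultdict


def partition_dir(table_root, column, value):
    """Return the Hive-style partition directory for one partition value."""
    return f"{table_root.rstrip('/')}/{column}={value}"


def group_by_partition(rows, table_root, column="countryfield"):
    """Group (partition_value, csv_line) pairs by their partition directory."""
    grouped = defaultdict(list)
    for value, line in rows:
        grouped[partition_dir(table_root, column, value)].append(line)
    return dict(grouped)


# Hypothetical sample: the country is known at ingestion time, not stored in the line.
rows = [
    ("US", "2012-06-27,GET,/index.html"),
    ("DE", "2012-06-27,GET,/about.html"),
    ("US", "2012-06-27,POST,/login"),
]

layout = group_by_partition(rows, "/data/logs")
for directory, lines in sorted(layout.items()):
    print(directory, len(lines))
```

A filter on countryfield would then only need to read the matching directory.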

Yes, per-partition properties are probably not useful for us, as we also
need to be able to query the data based on the additional metadata.

Agatea

On Wed, Jun 27, 2012 at 1:12 PM, Travis Crawford
<traviscrawford@gmail.com> wrote:

> In your system, what sort of serde are you using? It needs to produce
> records with the correct schema, so you could do something like this:
>
> * Set the table schema to contain all possible fields, even ones
> records might not contain.
> * Use a serde that knows how to parse all possible records, and fills
> in missing fields.
>
> Each partition would then have a schema matching the table and things
> would work. We do something similar, but use Thrift for our records.
> Since Thrift objects can have optional fields and Thrift is backwards
> compatible, the Thrift serde handles producing messages of the correct
> schema.
>
> The HiveMetaStore does have per-partition properties, but I don't
> recommend storing per-file metadata there.
>
> --travis
>
>
> On Wed, Jun 27, 2012 at 10:20 AM, agateaaa <agateaaa@gmail.com> wrote:
> > Hi,
> >
> > I am evaluating HCatalog and have a specific use case. I like the fact that
> > HCatalog gives a consistent interface to the data in HDFS
> > across different tools like Hive, Pig and MapReduce.
> >
> > We want to be able to associate metadata with the log files that we are
> > currently storing on HDFS.
> >
> > We are pulling in thousands of log files, and since the data in the log files
> > lacks certain fields, we end up adding those fields to the data before
> > ingesting it into HDFS and processing it further.
> >
> > I was reading through the documentation, mailing lists and articles on
> > HCatalog I could find ([1] and [2] below),
> > which imply that it is possible to associate metadata with your data using
> > HCatalog.
> >
> > My questions are
> >
> > 1.) Can I define a schema and associate it with individual files or a group of
> > files on HDFS?
> >
> > 2.) Can I change this metadata schema over time and not affect existing
> > files?
> >
> > 3.) Are these metadata fields available in Pig scripts processing that data,
> > so we could filter data using fields in
> > the metadata defined for these files?
> >
> >
> > I have used Hive before, and one possible solution I see is to use partitions
> > to define the metadata fields, but I was just
> > wondering if there is any other HCatalog way of defining this metadata which
> > does not involve partitions.
> >
> > Thanks in advance for your help,
> >
> > Agatea
> >
> > [1] http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hcatalog-tables-and-metadata-for-hadoop/
> > [2] http://www.mail-archive.com/hcatalog-user@incubator.apache.org/msg00004.html
> >
>
