hive-user mailing list archives

From "Moore, Douglas" <>
Subject Re: Question
Date Fri, 05 Dec 2014 19:10:17 GMT
We use Hive to manage hundreds of millions of machine log data files. These files are
semi-structured, in that we don't care about the full structure of the file up front, nor do
they have a format that's easy to understand.

Even for data with less structure (e.g. medical notes), there is always metadata about the data
and its context.
This metadata and the 'blob' of data can fit in a row of a Hive table. We use UDFs and UDTFs
to parse the blob portion of the data on an as-needed basis.
Another pattern is using a sequence file: the value contains the blob, and the key contains the
concatenated metadata object (think Avro encoding).
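
As a rough sketch of the key/value pattern just described (not our actual code): the metadata is
serialized into the key and the blob is left untouched in the value, so records can be selected on
metadata alone. JSON stands in for an Avro-style encoding here purely for brevity, and the names
make_record, read_metadata, and the host/date fields are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def make_record(metadata: Dict[str, str], blob: bytes) -> Tuple[bytes, bytes]:
    """Build a (key, value) pair: serialized metadata as the key, raw blob as the value."""
    # Assumption: JSON is used only as a stand-in for an Avro-like encoding.
    key = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return key, blob


def read_metadata(key: bytes) -> Dict[str, str]:
    """Decode the metadata key without touching the blob."""
    return json.loads(key.decode("utf-8"))


records: List[Tuple[bytes, bytes]] = [
    make_record({"host": "web01", "date": "2014-12-05"}, b"ERROR disk full\nINFO retry"),
    make_record({"host": "db02", "date": "2014-12-05"}, b"INFO started"),
]

# Select on metadata only; the blobs are passed through unparsed.
web_blobs = [value for key, value in records if read_metadata(key)["host"] == "web01"]
```

The point of the layout is that filtering and routing only ever decode the small key, while the
blob is parsed later, and only if something actually needs its contents.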

Storage can be on HDFS or in HBase. The choice depends more on read and write access pattern
requirements than on how much structure the data has. The choice of processing tool (Pig / Hive
/ MapReduce) is likewise better guided by the type of data flows (data pipelines) you need
to build than by how much structure the data has. The one exception is nested data; I
find Pig handles this more easily than Hive does.

The trick to managing semi-structured data via Hive/Pig is to use UDFs to parse
what you need when you need it. All of the tools above support UDFs. MapReduce does it too
because it's already operating at the 'assembly language' level anyway.
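
To illustrate the parse-on-demand idea, here is a minimal sketch in plain Python rather than an
actual Hive or Pig UDF. It assumes a hypothetical log layout of "<LEVEL> <message>" per line;
extract_messages plays the role of a UDTF (many values per row) and first_error the role of a
UDF (one value per row). Both names and the sample rows are made up for the example.

```python
import re
from typing import Dict, List, Optional

# Assumed (hypothetical) line layout: an upper-case level, whitespace, then the message.
_LINE = re.compile(r"^(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def extract_messages(blob: str, level: str) -> List[str]:
    """UDTF-like helper: return every message in the blob logged at the given level."""
    messages = []
    for line in blob.splitlines():
        match = _LINE.match(line)
        if match and match.group("level") == level:
            messages.append(match.group("message"))
    return messages


def first_error(blob: str) -> Optional[str]:
    """UDF-like helper: return the first ERROR message, or None if there is none."""
    errors = extract_messages(blob, "ERROR")
    return errors[0] if errors else None


# Each row keeps its metadata (host) next to the unparsed blob.
rows: List[Dict[str, str]] = [
    {"host": "web01", "blob": "INFO boot ok\nERROR disk full\nERROR retry failed"},
    {"host": "db02", "blob": "INFO started"},
]

# The blob is only parsed at query time, and only for the field being asked about.
errors_by_host = {row["host"]: extract_messages(row["blob"], "ERROR") for row in rows}
```

Nothing about the blob's structure is fixed when the row is stored; a different question later
just means a different parsing function over the same stored data.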

- Douglas

From: Bill Busch <<>>
Reply-To: <<>>
Date: Wed, 3 Dec 2014 20:59:46 -0500
To: "<>" <<>>
Subject: RE: Question

MapReduce can be used for both structured and unstructured data. Hive is a storage and retrieval
mechanism (e.g. a database). The trouble with an RDBMS is that you either have to parse the
unstructured data into a structured row/column format OR store it as an object. Both approaches
have performance and semantic issues. Hence, there is a whole world of NoSQL databases out there
that are not row-column structured. These databases can handle more schema-less/unstructured
objects and will allow you to manipulate your information more elegantly.
I would check out the Wikipedia page on NoSQL databases and focus on key-value, columnar,
or document databases.

Date: Thu, 4 Dec 2014 07:06:16 +0530
Subject: Re: Question

Thanks Gabriel for the prompt response

I see online blogs saying MapReduce is for unstructured data, Pig for semi-structured data,
and Hive only for structured data. Can you please justify this?

Thanks in advance

On Thu, Dec 4, 2014 at 6:56 AM, Gabriel Eisbruch <<>>
Hi Mohan,
   We are using Hive for unstructured (or semi-structured) data using map columns. For example,
we use standard columns for fixed data and map columns for dynamic data.
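
A small sketch of that split, in plain Python rather than Hive DDL: fields that are always
present go into fixed columns, and everything else is collected into a single string-to-string
map (roughly what a MAP<STRING,STRING> column would hold). The column names event_id and
event_time, the attributes map, and the to_row helper are all hypothetical.

```python
from typing import Any, Dict

# Assumed fixed columns; everything else is treated as dynamic data.
FIXED_COLUMNS = ("event_id", "event_time")


def to_row(event: Dict[str, Any]) -> Dict[str, Any]:
    """Split an event into fixed columns plus one map column for the dynamic fields."""
    row: Dict[str, Any] = {name: event.get(name) for name in FIXED_COLUMNS}
    # Dynamic fields are stringified so the map has a single value type.
    row["attributes"] = {
        key: str(value) for key, value in event.items() if key not in FIXED_COLUMNS
    }
    return row


row = to_row(
    {"event_id": 7, "event_time": "2014-12-03T22:19:00", "browser": "firefox", "retries": 2}
)
```

New dynamic fields then show up as new map keys, without any change to the table's columns.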


2014-12-03 22:19 GMT-03:00 Mohan Krishna <<>>:
Is Hive only for structured data, or does it handle unstructured data as well?
