hive-user mailing list archives

From John Sichi
Subject Re: SerDe and Rows
Date Wed, 26 May 2010 02:39:29 GMT
Sorry for my slow response; answers below.

On May 20, 2010, at 5:23 PM, Sanjit Jhala wrote:

> Thanks John, that does look quite interesting. It looks like in addition to containing
> a bunch of cells, the row class needs to provide some mechanism (e.g. a map) to efficiently
> look up the cell corresponding to a given qualified column (i.e. column family + qualifier).
> In the case where a Hive column matches an entire column family, do you use this same map,
> using the property that the column family is a prefix of the map key, or is there an additional
> map that maps the column family to a set of qualifiers or directly to a set of cells?

There is a separate map (LazyHBaseCellMap).  LazyHBaseRow instantiates this for Hive column
values which correspond to HBase column families.
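
To make the two lookup paths concrete, here is a rough Python sketch. It is not the actual LazyHBaseRow or LazyHBaseCellMap code (those are Java classes in Hive's HBase handler); the names ToyRow, get_cell and get_family_map are made up for illustration, and building both maps eagerly is a simplifying assumption.

```python
from collections import defaultdict


class ToyRow:
    """Toy stand-in for a row of cells; hypothetical, not the real LazyHBaseRow."""

    def __init__(self, cells):
        # cells: iterable of (family, qualifier, value) tuples
        self._by_column = {}                 # (family, qualifier) -> value
        self._by_family = defaultdict(dict)  # family -> {qualifier: value}
        for family, qualifier, value in cells:
            self._by_column[(family, qualifier)] = value
            self._by_family[family][qualifier] = value

    def get_cell(self, family, qualifier):
        """Lookup for a Hive column mapped to a single qualified HBase column."""
        return self._by_column.get((family, qualifier))

    def get_family_map(self, family):
        """Lookup for a Hive column mapped to a whole column family (a separate map)."""
        return dict(self._by_family.get(family, {}))


row = ToyRow([
    ("info", "name", "alice"),
    ("info", "city", "paris"),
    ("stats", "visits", "7"),
])
```

Here row.get_cell("info", "name") returns the single cell value, while row.get_family_map("info") returns a qualifier-to-value map for the whole family, which is roughly the role the separate cell map plays.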

> The wiki also indicates that in future multiple versions of a cell could be exposed to
> the storage handler since Hive can deal with non-unique rows. I can definitely see how you
> should be able to store non-unique Hive rows in Hypertable (since Hypertable supports multi-versioned
> cells); however, since the fundamental unit of storage in the BigTable design is a cell, I
> don't understand how you propose to map multiple cell versions back to non-unique Hive rows.
> Maybe you're thinking of mapping them to a single Hive row, where the columns are of the List
> type? And then maybe the query language allows you to filter by the first, last or any value
> in the list?

Yeah, I realized this recently when I started thinking about it again :)

Exposing per-cell timestamps is possible, and there are a number of ways to do it, including
the one you mention.  But they're all unwieldy, so we should probably defer them until there's
a very good use case.
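
For reference, the list-typed approach from the quoted question might look roughly like the Python sketch below. Everything here is hypothetical: versions_to_list and matches are invented names, and ordering the list oldest-to-newest is an assumption (a real store may well return newest first).

```python
def versions_to_list(versions):
    """Turn {timestamp: value} for one cell into a list ordered oldest to newest."""
    return [versions[ts] for ts in sorted(versions)]


def matches(values, mode, target):
    """Filter on the first, last, or any value in a cell's version list."""
    if not values:
        return False
    if mode == "first":
        return values[0] == target
    if mode == "last":
        return values[-1] == target
    if mode == "any":
        return target in values
    raise ValueError("unknown mode: " + mode)


cell_versions = {30: "c", 10: "a", 20: "b"}
as_list = versions_to_list(cell_versions)
```

Even in this toy form, every predicate on the column has to say which version it means, which hints at why these schemes get unwieldy.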

A simpler scheme I'm thinking about is to map a Hive partition to a particular timestamp.
Then for queries, the partition would specify a point in time (we would need to validate that only
equality predicates are used on the partition key, since returning multiple versions of a row
isn't well-defined, as you correctly point out).  For inserts, all cells created would get
the same timestamp.  Maybe this would cover the majority of use cases?
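
A minimal Python sketch of the partition-as-timestamp idea follows. This is only a toy model, not anything that exists in Hive: ToyVersionedTable, insert_partition and scan_partition are hypothetical names, and treating a query as an exact match on the timestamp (rather than "latest version at or before it") is one possible reading that I am assuming here.

```python
class ToyVersionedTable:
    """Toy multi-versioned cell store; hypothetical, for illustration only."""

    def __init__(self):
        # (row_key, column) -> {timestamp: value}
        self._cells = {}

    def insert_partition(self, ts, rows):
        """Write every cell of every row with the single partition timestamp ts."""
        for row_key, columns in rows.items():
            for column, value in columns.items():
                self._cells.setdefault((row_key, column), {})[ts] = value

    def scan_partition(self, predicate_op, ts):
        """Return rows as of the partition timestamp; only '=' is accepted."""
        if predicate_op != "=":
            # A range over timestamps could return several versions of one row.
            raise ValueError("only equality predicates are allowed on the partition key")
        result = {}
        for (row_key, column), versions in self._cells.items():
            if ts in versions:  # assumption: exact timestamp match
                result.setdefault(row_key, {})[column] = versions[ts]
        return result


table = ToyVersionedTable()
table.insert_partition(100, {"r1": {"a": "x"}})
table.insert_partition(200, {"r1": {"a": "y"}, "r2": {"a": "z"}})
```

With this, table.scan_partition("=", 100) sees only the cells written at timestamp 100, and a range predicate is rejected up front so a query never has to return two versions of the same row.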

