hive-dev mailing list archives

From "Zheng Shao (JIRA)" <>
Subject [jira] Commented: (HIVE-207) Change SerDe API to allow skipping unused columns
Date Mon, 05 Jan 2009 22:31:44 GMT


Zheng Shao commented on HIVE-207:

Our current SerDe framework is designed to allow lazy initialization. That is why we allow
the in-memory objects to be heterogeneous and let users specify the ObjectInspector that is
used to get the fields out of each object.
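
To make the idea concrete, here is a minimal sketch of a row that stays in its serialized form
until an inspector is asked for a field. All names here (RowInspector, LazyTextRow,
LazyTextRowInspector, ArrayRowInspector) are hypothetical and are not the actual Hive SerDe or
ObjectInspector API; the comma-separated text format is likewise only an assumption for
illustration.

```java
// Hypothetical sketch only: these types are NOT the real Hive SerDe API.
interface RowInspector<R> {
    // Returns the field at the given position of a row of type R.
    Object getField(R row, int index);
}

// A row kept as raw text; fields are parsed only when first requested.
final class LazyTextRow {
    final String raw;
    final String[] cache;   // parsed fields, null until first access
    int parsedCount = 0;    // number of fields actually parsed so far

    LazyTextRow(String raw, int numFields) {
        this.raw = raw;
        this.cache = new String[numFields];
    }
}

final class LazyTextRowInspector implements RowInspector<LazyTextRow> {
    private final char separator;

    LazyTextRowInspector(char separator) {
        this.separator = separator;
    }

    @Override
    public Object getField(LazyTextRow row, int index) {
        if (row.cache[index] == null) {
            row.cache[index] = extract(row.raw, index);
            row.parsedCount++;
        }
        return row.cache[index];
    }

    // Scans to the requested field without materializing the other fields.
    private String extract(String raw, int index) {
        int start = 0;
        for (int i = 0; i < index; i++) {
            int next = raw.indexOf(separator, start);
            if (next < 0) {
                throw new IllegalArgumentException("row has no field " + index);
            }
            start = next + 1;
        }
        int end = raw.indexOf(separator, start);
        return end < 0 ? raw.substring(start) : raw.substring(start, end);
    }
}

// A second, eager representation: the same interface works on a plain array,
// which is what lets heterogeneous in-memory objects coexist.
final class ArrayRowInspector implements RowInspector<Object[]> {
    @Override
    public Object getField(Object[] row, int index) {
        return row[index];
    }
}

final class LazyRowSketch {
    public static void main(String[] args) {
        LazyTextRow row = new LazyTextRow("42,alice,2009-01-05", 3);
        LazyTextRowInspector inspector = new LazyTextRowInspector(',');
        System.out.println(inspector.getField(row, 1)); // prints "alice"
        System.out.println(row.parsedCount);            // prints 1
    }
}
```

The caller only ever talks to the inspector, so the lazy row and the eager array can be mixed
in the same pipeline without the caller knowing which one it holds.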

The major difficulty you will see when implementing a new SerDe is probably that you need
to parse and understand the DDL (which is in Thrift). The only easy way to do that is to reuse
the DynamicSerDe code and write a new Protocol instead of a new SerDe. Then you can reuse
the code in DynamicSerDe to parse the Thrift DDL. You may want to take a look at TBinaryProtocol.
(Let us know if you have any other good ideas to represent the types of columns without Thrift.)

Your idea of skipping columns is an alternative way of achieving efficiency. The good thing
is that you still get most of the efficiency gains (through column pruning) while having a
simple, homogeneous in-memory representation. The bad thing is that there are some potential
optimizations that your framework won't be able to do:
1. For different rows, we might want to deserialize different columns because there is an IF
   or CASE statement.
2. Some operations can be calculated without deserializing the whole field, such as the size
   of a list or a sub-field of a field. These are very common if the field is of a complex type.
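
Both cases can be sketched in a few lines. This is only an illustration under assumed
conventions: the class PartialDecodeSketch, the row layout [flag, a, b], and the '|'-separated
list encoding are hypothetical and do not come from Hive.

```java
// Hypothetical sketch only: not Hive code, just an illustration of the two cases.
final class PartialDecodeSketch {
    static int decodeCalls = 0;   // counts how many fields were really decoded

    static int decodeInt(String raw) {
        decodeCalls++;
        return Integer.parseInt(raw);
    }

    // Case 1: IF(flag = 1, a, b) over raw fields [flag, a, b].
    // Which of a or b gets decoded depends on the row, so a static
    // "columns used by the query" list cannot prune either of them.
    static int evalIf(String[] rawFields) {
        int flag = decodeInt(rawFields[0]);
        return flag == 1 ? decodeInt(rawFields[1]) : decodeInt(rawFields[2]);
    }

    // Case 2: size of a separator-delimited list, computed by counting
    // separators only; no element is ever decoded.
    static int listSize(String rawList, char separator) {
        if (rawList.isEmpty()) {
            return 0;
        }
        int count = 1;
        for (int i = 0; i < rawList.length(); i++) {
            if (rawList.charAt(i) == separator) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The third field is never decoded for this row, so its bad value is harmless.
        System.out.println(evalIf(new String[] { "1", "10", "not-a-number" })); // prints 10
        System.out.println(listSize("x|y|z", '|'));                             // prints 3
    }
}
```

With column skipping decided once per query, both a and b would have to be deserialized for
every row, and the list would have to be fully materialized just to report its size.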

As a result, the use of ObjectInspector provides the best potential performance.

> Change SerDe API to allow skipping unused columns
> -------------------------------------------------
>                 Key: HIVE-207
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: David Phillips
> A deserializer shouldn't have to deserialize columns that are never used by the query
> processor.  A serializer shouldn't have to examine unused columns that are known to always
> be null.
> As an example, we store data as a Protocol Buffer structure with ~60 fields.  Running
> a "select count(1)" currently requires deserializing all fields, which includes checking if
> they exist and formatting the data appropriately.  This is expensive and unnecessary.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
