hadoop-hive-dev mailing list archives

From "David Phillips (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-207) Change SerDe API to allow skipping unused columns
Date Mon, 05 Jan 2009 20:57:44 GMT

    [ https://issues.apache.org/jira/browse/HIVE-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660922#action_12660922 ]

David Phillips commented on HIVE-207:
-------------------------------------

Thanks, I'll give that a try.

I haven't dug into the ObjectInspector code yet, but my initial impression is that it feels overly
complex and backwards.  Part of that may be that the standard deserializers are spread over
multiple classes.  It also seems strange that the deserializer can override the declared column
types of the table: my deserializer returns a MetadataListStructObjectInspector, which causes
all column types to be string.

Here's a rough idea for a new interface:

{noformat}
interface ColumnSet {
  String[] getTableColumnNames();
  ColumnType[] getTableColumnTypes();
  int[] getUsedColumns();
  void setColumnValue(int n, Object o);
}

interface Deserializer {
  void initialize(Configuration conf, Properties tbl, ColumnSet cols);
  void deserialize(Writable blob, ColumnSet cols);
}
{noformat}

For each entry in the index list returned by getUsedColumns(), the deserializer would call
setColumnValue() if that column's value is non-null.  The ColumnSet would be pre-initialized to
null for all values, so the deserializer wouldn't need to worry about caching objects or
implementing complex interfaces.  It simply makes a single call for each column value.
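
As a rough illustration of that flow, here is a minimal, self-contained sketch. It is not Hive code and makes several simplifying assumptions: ColumnSet is reduced to the methods the example needs (getTableColumnTypes() and ColumnType are left out), initialize() is omitted, a plain String of tab-separated fields stands in for the Writable blob, and ArrayColumnSet, TsvDeserializer, getColumnValue() and ColumnSetSketch are hypothetical names invented for the example.

```java
// Simplified stand-in for the proposed ColumnSet (column types omitted).
interface ColumnSet {
    String[] getTableColumnNames();
    int[] getUsedColumns();
    void setColumnValue(int n, Object o);
}

// Hypothetical array-backed ColumnSet; every value starts out null.
class ArrayColumnSet implements ColumnSet {
    private final String[] names;
    private final int[] used;
    private final Object[] values;

    ArrayColumnSet(String[] names, int[] used) {
        this.names = names;
        this.used = used;
        this.values = new Object[names.length];
    }

    public String[] getTableColumnNames() { return names; }
    public int[] getUsedColumns() { return used; }
    public void setColumnValue(int n, Object o) { values[n] = o; }

    // Read side, as the query processor might use it (assumed, not in the proposal).
    Object getColumnValue(int n) { return values[n]; }
}

// Hypothetical deserializer: a String of tab-separated fields stands in
// for the Writable blob of the proposed interface.
class TsvDeserializer {
    void deserialize(String blob, ColumnSet cols) {
        String[] fields = blob.split("\t", -1);
        // Touch only the columns the query uses; empty fields stay null.
        for (int n : cols.getUsedColumns()) {
            if (n < fields.length && !fields[n].isEmpty()) {
                cols.setColumnValue(n, fields[n]);
            }
        }
    }
}

class ColumnSetSketch {
    public static void main(String[] args) {
        String[] names = {"id", "name", "email", "age"};
        // The query only needs columns 0 and 3.
        ArrayColumnSet cols = new ArrayColumnSet(names, new int[] {0, 3});
        new TsvDeserializer().deserialize("1\talice\t\t42", cols);
        for (int n : cols.getUsedColumns()) {
            System.out.println(names[n] + "=" + cols.getColumnValue(n));
        }
    }
}
```

In this sketch the unused columns ("name" and "email") are never examined, so a query that uses no columns at all would make no setColumnValue() calls.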

There could be overloads of setColumnValue() for standard types like int, Integer, String,
etc.  Creating the actual ColumnSet object dynamically at runtime might have some performance
advantages.

> Change SerDe API to allow skipping unused columns
> -------------------------------------------------
>
>                 Key: HIVE-207
>                 URL: https://issues.apache.org/jira/browse/HIVE-207
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: David Phillips
>
> A deserializer shouldn't have to deserialize columns that are never used by the query
> processor.  A serializer shouldn't have to examine unused columns that are known to always
> be null.
> As an example, we store data as a Protocol Buffer structure with ~60 fields.  Running
> a "select count(1)" currently requires deserializing all fields, which includes checking if
> they exist and formatting the data appropriately.  This is expensive and unnecessary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

