hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-207) Change SerDe API to allow skipping unused columns
Date Thu, 08 Jan 2009 20:14:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662103#action_12662103 ]

Zheng Shao commented on HIVE-207:
---------------------------------

@Jeff: see http://wiki.apache.org/hadoop/Hive/DeveloperGuide#head-075e4c5524138d2674250e664dfb0f40ed57f9ca

@Joydeep: I see. You mean making the translation from "SQL DDL" to hierarchical type
information (TypeInfo classes) a job of the shared utility code? I like this idea. It saves
developers a lot of time in understanding "thrift DDL".

{code}
import java.util.List;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;

/** Name and type of one column, as declared in the SQL DDL. */
class DDLColumnInfo {
  String columnName;
  TypeInfo columnType;
}

interface SerDe {
  /**
   * columns provides the column information from the SQL DDL.
   * If the user created the table with no column information, we will pass null.
   * The ObjectInspector returned by getObjectInspector() needs to have the same
   * column names and types as columns (if not null).
   */
  void initialize(Configuration conf, Properties tbl, List<DDLColumnInfo> columns);

  ObjectInspector getObjectInspector() throws SerDeException;
}
{code}

By adding an additional parameter List<DDLColumnInfo>, the developers of SerDe do not
need to parse the SQL DDL or Thrift DDL.
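
To illustrate the point, here is a minimal, self-contained sketch of how a SerDe might consume the column list directly instead of parsing any DDL string. Everything in it is a stand-in for illustration: TypeInfo is reduced to a type name, ExampleSerDe and describeFields() are hypothetical, and the Configuration parameter is left out so the sketch compiles without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class DdlColumnInfoSketch {

  /** Hypothetical stand-in for Hive's TypeInfo; it only carries a type name. */
  static final class TypeInfo {
    final String typeName;

    TypeInfo(String typeName) {
      this.typeName = typeName;
    }
  }

  /** Column name and type, mirroring the class proposed in this comment. */
  static final class DDLColumnInfo {
    final String columnName;
    final TypeInfo columnType;

    DDLColumnInfo(String columnName, TypeInfo columnType) {
      this.columnName = columnName;
      this.columnType = columnType;
    }
  }

  /** Hypothetical SerDe that builds its schema from the list it is handed. */
  static final class ExampleSerDe {
    private List<String> fieldDescriptions = new ArrayList<>();

    void initialize(Properties tableProperties, List<DDLColumnInfo> columns) {
      fieldDescriptions = new ArrayList<>();
      if (columns == null) {
        // The table was created without column information, so a real SerDe
        // would have to derive the schema from its own metadata here.
        return;
      }
      for (DDLColumnInfo column : columns) {
        // No DDL parsing: the name and type arrive already structured.
        fieldDescriptions.add(column.columnName + ":" + column.columnType.typeName);
      }
    }

    List<String> describeFields() {
      return fieldDescriptions;
    }
  }

  /** Initializes the example SerDe with two columns and returns its schema. */
  static List<String> demo() {
    ExampleSerDe serde = new ExampleSerDe();
    serde.initialize(new Properties(), Arrays.asList(
        new DDLColumnInfo("userid", new TypeInfo("int")),
        new DDLColumnInfo("name", new TypeInfo("string"))));
    return serde.describeFields();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In a real SerDe the loop body would presumably build the fields of the ObjectInspector returned by getObjectInspector(), so that its column names and types match the list as the interface comment requires.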

We already have the TypeInfo classes; we just need to move them from ql to serde. It seems
trivial to do, and all future SerDes can take advantage of List<DDLColumnInfo> (although
I don't want to change DynamicSerDe at this point unless necessary).

Can you confirm this is what we want?


> Change SerDe API to allow skipping unused columns
> -------------------------------------------------
>
>                 Key: HIVE-207
>                 URL: https://issues.apache.org/jira/browse/HIVE-207
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: David Phillips
>
> A deserializer shouldn't have to deserialize columns that are never used by the query
> processor.  A serializer shouldn't have to examine unused columns that are known to always
> be null.
> As an example, we store data as a Protocol Buffer structure with ~60 fields.  Running
> a "select count(1)" currently requires deserializing all fields, which includes checking if
> they exist and formatting the data appropriately.  This is expensive and unnecessary.


