hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-207) Change SerDe API to allow skipping unused columns
Date Thu, 08 Jan 2009 23:42:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662175#action_12662175 ]

Joydeep Sen Sarma commented on HIVE-207:
----------------------------------------

I like Zheng's proposal.

My only concern is whether a SerDe might find it expensive to bootstrap
off DDLColumnInfo. For example, the following would be equivalent, but would let the SerDe
cache a transformed/serialized version of the DDLInfo in its own properties, which
might be easier (i.e. cheaper) for the SerDe to bootstrap off:

- At table creation time, we know the SerDe/Deserializer.
- We call a new method in the Deserializer that translates the DDLInfo into some properties:

interface SerDe {
...
Properties schemaToProperties(List<DDLColumnInfo> columns) throws UnsupportedSchemaException;
...
}

We store the returned Properties as part of SerDeProperties in the metastore (these are available
in the SerDe initialize() call). The initialize() call signature can be what Zheng proposed.
But if the SerDe wants, it can cache a serialized version of the schema in the properties
that it finds easier to handle. This also gives the SerDe an opportunity to reject
any Hive schemas that it cannot support (for example, if a SerDe cannot support maps, it
can reject DDL statements with maps in this step).
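
To make the idea concrete, here is a rough sketch of what such a hook could look like. It is only an illustration under assumptions: the DDLColumnInfo fields (name, type), the SchemaAwareSerDe interface name, the FlatColumnSerDe class, and the "example.*" property keys are all hypothetical and are not part of the existing SerDe API or of Zheng's proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for DDLColumnInfo; the real fields are not specified in this thread.
final class DDLColumnInfo {
    final String name;
    final String type; // e.g. "int", "string", "map<string,int>"

    DDLColumnInfo(String name, String type) {
        this.name = name;
        this.type = type;
    }
}

// Thrown when a SerDe cannot represent the schema handed to it.
class UnsupportedSchemaException extends Exception {
    UnsupportedSchemaException(String message) {
        super(message);
    }
}

// Hypothetical interface carrying only the proposed method.
interface SchemaAwareSerDe {
    Properties schemaToProperties(List<DDLColumnInfo> columns) throws UnsupportedSchemaException;
}

// Example SerDe that cannot handle maps and caches a flat, cheap-to-parse schema.
final class FlatColumnSerDe implements SchemaAwareSerDe {
    // Assumed property keys, chosen for this sketch only.
    static final String COLUMNS_KEY = "example.columns";
    static final String TYPES_KEY = "example.column.types";

    @Override
    public Properties schemaToProperties(List<DDLColumnInfo> columns)
            throws UnsupportedSchemaException {
        List<String> names = new ArrayList<>();
        List<String> types = new ArrayList<>();
        for (DDLColumnInfo column : columns) {
            // Reject the DDL at table creation time instead of failing at query time.
            if (column.type.startsWith("map<")) {
                throw new UnsupportedSchemaException(
                        "column '" + column.name + "' has unsupported type " + column.type);
            }
            names.add(column.name);
            types.add(column.type);
        }
        // Naive encoding: names joined by ',' and types joined by ';'.
        Properties props = new Properties();
        props.setProperty(COLUMNS_KEY, String.join(",", names));
        props.setProperty(TYPES_KEY, String.join(";", types));
        return props;
    }
}
```

The returned Properties would be stored with the table, so initialize() could split the two strings rather than walk the full DDLColumnInfo list again.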

Thoughts?




> Change SerDe API to allow skipping unused columns
> -------------------------------------------------
>
>                 Key: HIVE-207
>                 URL: https://issues.apache.org/jira/browse/HIVE-207
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: David Phillips
>
> A deserializer shouldn't have to deserialize columns that are never used by the query
> processor.  A serializer shouldn't have to examine unused columns that are known to always
> be null.
> As an example, we store data as a Protocol Buffer structure with ~60 fields.  Running
> a "select count(1)" currently requires deserializing all fields, which includes checking if
> they exist and formatting the data appropriately.  This is expensive and unnecessary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

