hive-dev mailing list archives

From "Ning Zhang (JIRA)" <>
Subject [jira] Commented: (HIVE-1900) a mapper should be able to span multiple partitions
Date Fri, 07 Jan 2011 18:59:48 GMT


Ning Zhang commented on HIVE-1900:

I remember running into this problem before. Enabling a mapper to read from multiple partitions
is trivial, but there are some pitfalls to watch for:

 1) Partitioning columns are not present in the data file itself. The partitioning column
value is appended by the RecordReader (or something like that), which assumes that all records
come from the same partition. That assumption will be broken here. An example query you can
try is:

   select ds, count(1) from srcpart where ds is not null group by ds;

 2) The merge job should be treated specially so that it does not combine input from multiple partitions.

 3) Auto-gathering of stats from the FileSinkOperator needs to be addressed for this case so that
stats are maintained for multiple partitions.
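
To illustrate the first pitfall, here is a minimal sketch in Python rather than Hive's actual
reader code. It assumes a hypothetical combined split holding files from two partitions and
compares resolving the ds value once per split with resolving it once per file. The paths,
row values and function names are all made up for illustration.

```python
from collections import Counter

# Hypothetical combined split: (file path, rows). The rows themselves carry
# no ds column; the value exists only in the partition directory name.
combined_split = [
    ("/warehouse/srcpart/ds=2008-04-08/part-0", ["a", "b"]),
    ("/warehouse/srcpart/ds=2008-04-09/part-0", ["c"]),
]


def partition_value(path, column="ds"):
    """Extract a partition column value from a path segment such as ds=2008-04-08."""
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key == column:
            return value
    raise ValueError("no %s= segment in %s" % (column, path))


def read_per_split(split):
    """Resolve ds once for the whole split (the single-partition assumption)."""
    ds = partition_value(split[0][0])
    for _path, rows in split:
        for row in rows:
            yield ds, row


def read_per_file(split):
    """Resolve ds again whenever the reader moves to a new file."""
    for path, rows in split:
        ds = partition_value(path)
        for row in rows:
            yield ds, row


def count_by_ds(records):
    """Rough equivalent of: select ds, count(1) ... group by ds."""
    return dict(Counter(ds for ds, _row in records))


print(count_by_ds(read_per_split(combined_split)))
print(count_by_ds(read_per_file(combined_split)))
```

With the per-split reader, every row is attributed to the first file's partition, so the
group-by reports a single ds value. The per-file reader keeps the two partitions apart.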

> a mapper should be able to span multiple partitions
> ---------------------------------------------------
>                 Key: HIVE-1900
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
> Currently, a mapper only spans a single partition, which creates a problem in the presence
> of many small partitions (which is becoming a common use case at Facebook).
> If the plan is the same, a mapper should be able to span files across multiple partitions

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
