hadoop-hive-dev mailing list archives

From "Namit Jain (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file
Date Fri, 26 Feb 2010 17:47:28 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838959#action_12838959 ]

Namit Jain commented on HIVE-1197:
----------------------------------

Currently, the split that a mapper processes is determined by a variety of parameters, including
the DFS block size, the minimum split size, etc.
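For context, a rough sketch of the knobs involved (Hadoop 0.20-era property names; values shown are illustrative, not recommendations): FileInputFormat computes the split size as approximately max(minSplitSize, min(goalSize, blockSize)), where goalSize is the total input size divided by the requested number of map tasks.

    mapred.min.split.size = 1          (minimum bytes per split)
    dfs.block.size        = 67108864   (HDFS block size; 64 MB was the default)
    mapred.map.tasks      = ...        (influences goalSize = totalSize / numSplits)

None of these guarantee that a split stays within a single file, which is the gap this issue addresses.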

It might be useful to have an option where the user wants a mapper to scan exactly 1 file. This would
be especially useful for sort-merge join.
If the data is partitioned into buckets, and each bucket is sorted, a sort-merge
join can join the corresponding buckets together.
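As a minimal sketch in plain Java (not Hive's actual operator code), a sort-merge join walks two sorted buckets in lockstep, advancing whichever side holds the smaller key; each bucket is scanned exactly once and no hash table is needed. Keys are assumed unique within a bucket here, for simplicity:

```java
import java.util.ArrayList;
import java.util.List;

public class SortMergeJoin {
    // Inner sort-merge join of two buckets already sorted on the join key.
    static List<int[]> join(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                                // advance the smaller side
            } else if (left[i] > right[j]) {
                j++;
            } else {
                out.add(new int[]{left[i], right[j]});  // matching keys
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] t1 = {1, 2, 5, 6, 9, 16, 22, 30};   // a bucket of T1
        int[] t2 = {2, 4, 8};                      // corresponding bucket of T2
        for (int[] row : join(t1, t2)) {
            System.out.println(row[0] + "," + row[1]);  // prints 2,2
        }
    }
}
```

Because each pair of corresponding buckets can be joined independently this way, the join parallelizes naturally across buckets.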

For example, consider the following scenario:

table T1: sorted and bucketed by column 'key' into 1000 buckets
table T2: sorted and bucketed by column 'key' into 1000 buckets


and the query:


select * from T1 join T2 on (T1.key = T2.key), executed as a map-side join.

Instead of joining table T1 with T2 as a whole, the 1000 corresponding buckets can be joined with each other individually.
Since the data is sorted on the join key, a sort-merge join can be used.
Say the buckets are named b0001, b0002, ..., b1000.
Say table T1 is the big table, and the buckets from T2 are read as part of the mapper
which is spawned to process T1.
Under the current approach, it would be very difficult to perform outer joins.

For example, if bucket b1 for T1 contains:


1
2
5
6
9
16
22
30

and the corresponding bucket for T2 contains:

2
4
8


If there are 2 mappers for bucket b1 of T1, processing 4 records each ((1, 2, 5, 6) and (9, 16, 22, 30)
respectively),
it would be very difficult to perform an outer join: each mapper would need to peek at the record
before its split and the record after it.
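To make the difficulty concrete, here is a hedged sketch (plain Java, again not Hive's operator code) of a full outer sort-merge join over the sample buckets, emitting null for the unmatched side. The algorithm is only correct when one scan sees the entire left bucket: a mapper handed just (9, 16, 22, 30) could not know whether 4 and 8 from T2's bucket were already matched or emitted by the mapper that handled (1, 2, 5, 6).

```java
import java.util.ArrayList;
import java.util.List;

public class OuterMergeJoin {
    // Full outer sort-merge join of two sorted buckets with unique keys.
    // A null entry represents the NULL side of an unmatched row.
    static List<Integer[]> fullOuterJoin(int[] left, int[] right) {
        List<Integer[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length || j < right.length) {
            if (j == right.length || (i < left.length && left[i] < right[j])) {
                out.add(new Integer[]{left[i++], null});       // left-only row
            } else if (i == left.length || left[i] > right[j]) {
                out.add(new Integer[]{null, right[j++]});      // right-only row
            } else {
                out.add(new Integer[]{left[i++], right[j++]}); // matching keys
            }
        }
        return out;
    }
}
```

On the sample data ({1, 2, 5, 6, 9, 16, 22, 30} and {2, 4, 8}) this yields 10 rows: one match (2, 2), seven left-only rows, and two right-only rows; the whole-file guarantee is what makes the left-only/right-only decisions safe.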

Moreover, it would be very difficult to ensure that the result also has 1000 buckets; another
map-reduce job would be needed for that.

This is easily solved if we are guaranteed that the whole bucket (i.e., the file corresponding
to the bucket)
will be processed by a single mapper.

> create a new input format where a mapper spans a file
> -----------------------------------------------------
>
>                 Key: HIVE-1197
>                 URL: https://issues.apache.org/jira/browse/HIVE-1197
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>             Fix For: 0.6.0
>
>
> This will be needed for Sort merge joins.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

