hive-user mailing list archives

From Thulasi Ram Naidu Peddineni
Subject Running hive on large number of files in S3
Date Thu, 20 Oct 2011 20:10:18 GMT
Hi All,
    I have a use case where I join table1 with table2. Both are
external tables with data in S3. table2 has many partitions (say 10K),
with a size of around 2GB, and table1 has around 5-10 partitions of
around 1-2MB. When I join these two tables, the query takes a long
time to execute (more than 20 minutes). From my observation, the
actual job execution does not take much time; the bottleneck is
starting the job itself. This makes me think that Hive is prefetching
all the data from S3 before it does the processing. Can someone
explain why the Hive job does not start promptly on an external table
with so many partitions?
  One more observation: if I reduce the number of partitions while
keeping the same amount of data, the whole query executes faster.

What is the recommended approach in such a scenario?
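
A rough sketch of the partition-count observation, in Python. It assumes
(this is not confirmed anywhere in the message) that job startup is
dominated by one sequential S3 listing request per partition directory,
and that each request takes a fixed time. The 100 ms latency and the
function name are hypothetical, chosen only for illustration.

```python
# Hypothetical model: startup cost = one sequential listing per partition.
# Neither the latency figure nor the function comes from Hive itself.

ASSUMED_MS_PER_LISTING = 100  # assumed S3 round trip per partition, in ms


def estimated_startup_ms(num_partitions, ms_per_listing=ASSUMED_MS_PER_LISTING):
    """Return the modelled startup time in milliseconds.

    The data size does not appear in the formula: under this assumption
    only the number of partition directories matters.
    """
    return num_partitions * ms_per_listing


# Same amount of data, laid out in 10K partitions versus 100 partitions.
many_partitions_ms = estimated_startup_ms(10_000)
few_partitions_ms = estimated_startup_ms(100)

print("10K partitions:", many_partitions_ms / 60_000, "minutes")
print("100 partitions:", few_partitions_ms / 1_000, "seconds")
```

Under these assumed numbers the 10K-partition layout spends 100 times
longer in startup than the 100-partition layout holding the same data.
That matches the shape of the behaviour described, but the real
per-request latency would have to be measured.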

Thulasi Ram P