hive-user mailing list archives

From: Jerome Boulon
Subject: Re: Running hive on large number of files in S3
Date: Thu, 20 Oct 2011 20:16:09 GMT
I don't think that your job is actually prefetching the data while you're
If you have a large number of partitions, then getting the list of files to
compute the splits
(aka prefetching the filenames from S3) is what is taking forever.
If you have premium support from Amazon, you may want to ask for help in
this area.
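
As a rough back-of-the-envelope sketch of why the listing step can dominate:
assume the file list is built with one S3 list request per partition, issued
one after another, and that each request takes a fixed amount of time. Both
the one-request-per-partition model and the 0.12 s latency below are
illustrative assumptions, not measurements, and the function name is
hypothetical.

```python
# Rough model of the time spent listing files before a job can start.
# Assumptions (not measured): partitions are listed sequentially, and
# every list request takes the same fixed time.

def estimate_listing_seconds(num_partitions, seconds_per_list_call,
                             calls_per_partition=1):
    """Estimate total listing time for a partitioned table."""
    return num_partitions * calls_per_partition * seconds_per_list_call

# Hypothetical figures: 0.12 s per list request.
many = estimate_listing_seconds(10_000, 0.12)  # table with ~10K partitions
few = estimate_listing_seconds(100, 0.12)      # same data in 100 partitions

print(f"10,000 partitions: about {many / 60:.0f} minutes of listing")
print(f"100 partitions: about {few:.0f} seconds of listing")
```

Under these assumptions the listing cost grows with the number of
partitions, not with the amount of data, which would be consistent with the
query running faster when the same data sits in fewer partitions.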


On 10/20/11 1:10 PM, "Thulasi Ram Naidu Peddineni" wrote:

>Hi All,
>    I have a use case where I will be joining table1 with table2.
>These are external tables with data in S3. table2 has many partitions
>(say 10K), around 2GB in size, and table1 has around 5-10 partitions,
>around 1-2MB. When I join these two tables, I observe that the query
>takes a lot of time to execute (more than 20 minutes).
>From my observation, the actual job execution is not taking a lot of
>time; the bottleneck is starting the job itself. This makes me
>think that Hive is prefetching all the data from S3 and then doing the
>processing. Can someone explain why the Hive job is not starting
>on time on an external table with too many partitions?
>  One more observation: if I reduce the number of partitions
>with the same amount of data, the whole query executes faster.
>What is the recommended approach in such a scenario?
>Thulasi Ram P
