hive-user mailing list archives

From Andre Araujo <ara...@pythian.com>
Subject Re: Hive dynamic partitions generate multiple files
Date Wed, 29 Jan 2014 09:43:35 GMT
Hi, Cosmin,

Functionally, the subsequent queries will work just fine (they will
return the correct results). But you're correct in saying that it's not
optimal.
If the jobs always generate very small files, you might end up with a huge
number of small files, which will have an impact on the NameNode's memory
usage as well.
In that case I think you could periodically "coalesce" the recent
partitions. Once a week/month you can select from the more recent
partitions and insert overwrite, which will convert all those small files
into bigger ones.

However, if the jobs are creating files that are already around the cluster
block size, it should be fine to leave them as is.
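
As a rough sketch of that periodic coalescing, assuming a hypothetical table
export_table partitioned by (yr, mon, dy) with data columns col_a and col_b
(none of these names come from the thread), rewriting one partition onto
itself could look like this:

```sql
-- Hypothetical table and column names, for illustration only.
-- Ask Hive to merge small output files at the end of the job.
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;

-- Read one existing partition and overwrite it with its own contents,
-- so the many small files are replaced by the output of this single job.
INSERT OVERWRITE TABLE export_table PARTITION (yr = '2014', mon = '01', dy = '23')
SELECT col_a, col_b
FROM export_table
WHERE yr = '2014' AND mon = '01' AND dy = '23';
```

How many files come out depends on the number of tasks in the job and on the
merge settings, so it is probably worth trying this on a copy of the data
first.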

Maybe someone else has some other ideas...


On 29 January 2014 18:05, Cosmin Cătălin Sanda <cosmincatalin@gmail.com> wrote:

> Hi Andre,
>
> The reason is that I want those partitions to go into other queries. If
> the individual files are only a few MB then the performance will be
> sub-optimal. As far as I understood, the individual files need to be at
> least around 140MB for the Maps to work properly.
>
> ------------------------------------
> *Cosmin Catalin SANDA*
> Software Systems Engineer
> Phone: +45.27.30.60.35
>
>
>
> On Wed, Jan 29, 2014 at 2:53 AM, Andre Araujo <araujo@pythian.com> wrote:
>
>> Why do you need exactly one file? This is transparent to Hive and it
>> should treat it seamlessly. Unless you have external requirements (reading
>> files from somewhere else) it shouldn't matter.
>>
>> HDFS support for file append is not a solid standard afaik, and will
>> depend on the distribution and version you're using. In some versions file
>> append is not available and the only way to add data to an existing Hive
>> table is to create an additional file under the table's directory in HDFS.
>> I haven't looked at the code but it may be that Hive developers chose this
>> to be the default way for appending data so it works with all HDFS
>> distributions and versions.
>>
>> If you need to merge multiple files under the same partition you can
>> select everything from that partition and INSERT OVERWRITE the data again.
>>
>> But again, unless you have requirements external to Hive, you shouldn't
>> be concerned about that.
>>
>>
>> On 29 January 2014 11:32, Cosmin Cătălin Sanda <cosmincatalin@gmail.com> wrote:
>>
>>> Hi Andre,
>>>
>>> So the thing is like this: the first time the query runs, it generates
>>> one file per dynamic partition. The next time the query runs and needs
>>> to write to the same partition, it generates another file instead of
>>> merging with the existing one.
>>>
>>> E.g.:
>>> 1. The partitioned S3 path looks like this: s3://bucket/export/2014/01/23
>>> 2. I run the query on some data and ultimately end up with a file in
>>> the above-mentioned partition.
>>> 3. I run the same query on some other data, which ends up writing to the
>>> same partition as above. Instead of taking the existing file and merging
>>> with it, the query generates a second file in the same partition.
>>>
>>>
>>> On Wed, Jan 29, 2014 at 1:16 AM, Andre Araujo <araujo@pythian.com> wrote:
>>>
>>>> Hi, Cosmin,
>>>>
>>>> Have you tried using DISTRIBUTE BY to distribute the query's data by
>>>> the partitioning columns?
>>>> That way all the data for each partition should be sent to the same
>>>> reducer and should be written to a single file in each partition, I think.
>>>>
>>>> If your data is being distributed by different criteria, you will
>>>> potentially have multiple reducers writing to the same partitions.
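>>>>
>>>> A minimal sketch of the DISTRIBUTE BY idea, assuming hypothetical tables
>>>> export_table (partitioned by yr, mon, dy) and staging_table, and
>>>> hypothetical columns col_a and col_b; adapt the names to the real schema:
>>>>
>>>> ```sql
>>>> -- Hypothetical table and column names, for illustration only.
>>>> SET hive.exec.dynamic.partition = true;
>>>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>>>
>>>> -- The dynamic partition columns come last in the SELECT list.
>>>> -- DISTRIBUTE BY routes all rows of a given partition to one reducer,
>>>> -- so a single run should write one file per partition.
>>>> INSERT INTO TABLE export_table PARTITION (yr, mon, dy)
>>>> SELECT col_a, col_b, yr, mon, dy
>>>> FROM staging_table
>>>> DISTRIBUTE BY yr, mon, dy;
>>>> ```
>>>>
>>>> This only controls the files written by one run of the query; a later
>>>> INSERT INTO on the same partition would still add its own file.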
>>>>
>>>> Andre
>>>>
>>>>
>>>>
>>>> On 29 January 2014 10:51, Cosmin Cătălin Sanda <cosmincatalin@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>  I have a number of Hive jobs that run during a day. Each individual
>>>>> job is outputting data to Amazon S3. The Hive jobs use dynamic partitioning.
>>>>>
>>>>> The problem is that when different jobs need to write to the same
>>>>> dynamic partition, they will each generate one file.
>>>>>
>>>>> What I would like is for the subsequent jobs to load the existing data
>>>>> and merge it with the new data. Can this be achieved somehow? Is there an
>>>>> option that needs to be enabled? I already set:
>>>>>
>>>>> SET hive.merge.mapredfiles = true;
>>>>> SET hive.exec.dynamic.partition = true;
>>>>> SET hive.exec.dynamic.partition.mode = nonstrict;
>>>>>
>>>>> I should mention that the query that actually outputs to S3 is an INSERT
>>>>> INTO TABLE query. The Hive version is 0.8.1
>>>>>
>>>>>
>>>>> Thank you,
>>>>> Cosmin
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> André Araújo
>>>> Big Data Consultant/Solutions Architect
>>>> The Pythian Group - Australia - www.pythian.com
>>>>
>>>> Office (calls from within Australia): 1300 366 021 x1270
>>>> Office (international): +61 2 8016 7000  x270 *OR* +1 613 565 8696
>>>> x1270
>>>> Mobile: +61 410 323 559
>>>> Fax: +61 2 9805 0544
>>>> IM: pythianaraujo @ AIM/MSN/Y! or araujo@pythian.com @ GTalk
>>>>
>>>> “Success is not about standing at the top, it's the steps you leave
>>>> behind.” — Iker Pou (rock climber)
>

