hive-user mailing list archives

From Guy Bayes <>
Subject Re: load data unit of work
Date Wed, 15 Jun 2011 19:28:55 GMT
I think if you load a file, validate it, and then ALTER TABLE ADD PARTITION
into the final table at the end, then in the event of a crash you only have a
partially loaded ETL file that no one will be querying anyway.

That should work, though I am not speaking from personal experience, at
least not with Hive.
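
The pattern Guy describes could be sketched in HiveQL roughly as follows (the table, columns, and paths here are hypothetical, and this assumes an external, partitioned table):

```sql
-- Hypothetical external table, partitioned by load date.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (load_date STRING)
LOCATION '/warehouse/events';

-- 1. Copy the raw file into a staging directory that is not yet a
--    registered partition, and validate it there (e.g. with
--    hadoop fs -put plus whatever checks the ETL needs).

-- 2. Only after validation succeeds, expose the data by registering
--    the partition. Until this statement runs, no query sees the file.
ALTER TABLE events ADD PARTITION (load_date='2011-06-15')
LOCATION '/staging/events/2011-06-15';

-- 3. If a later check finds the load was bad, unregister the
--    partition; queries stop seeing it immediately, and a background
--    job can delete the orphaned files at its leisure.
ALTER TABLE events DROP PARTITION (load_date='2011-06-15');
```

Because queries only see data through registered partitions, a crash before the ADD PARTITION leaves an orphaned staging directory behind but nothing visible to users.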

On Wed, Jun 15, 2011 at 12:11 PM, W S Chung <> wrote:

> If the loading failure is severe enough, like the whole machine
> crashing, there might not be an opportunity to catch the exception and
> clean up the partition right away. The best I can think of is to clean up
> the partition in a background job that runs reasonably regularly. In that
> case, before the cleanup, is there any way I can prevent any query from
> seeing the data in the partition that should not be there?
> Or will this really happen? If the metadata is only updated after a
> successful load, the partition may not exist unless the load runs to its
> end.
> On Tue, Jun 14, 2011 at 12:21 PM, Guy Bayes <> wrote:
>> The easiest way to achieve a level of robustness is probably to load into a
>> partition and then truncate the partition in the event of failure.
>> Cleaning up after an incomplete load is a problem in many traditional
>> RDBMSs; you cannot always rely on rollback functionality.
>> There are no explicit deletes in Hive, though, so whatever you need to do to
>> massage and clean the data file is best done prior to inserting it into its
>> final destination.
>> Many of the things you bring up are more ETL best practices than
>> properties of an RDBMS implementation, though.
>>  Guy
>> On Tue, Jun 14, 2011 at 8:57 AM, W S Chung <> wrote:
>>> My question is a "what if" question, not a production issue. It seems
>>> natural, when replacing a traditional database with Hive, to ask
>>> how much robustness is sacrificed for scalability. My concern is that if
>>> a file is partially loaded, there might not be an easy way to clean up the
>>> already loaded data before re-loading it. The lack of a unique index
>>> does not make it easy to avoid duplicate data either, although
>>> duplicated data can perhaps be deleted after the load.
>>> On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek <> wrote:
>>>> Hi,
>>>> I think this is a problem with open source in general and sometimes it
>>>> can be very frustrating.
>>>> However, your question is more of a "what if" question - you're not in
>>>> the trouble of finding a horrible bug after you deployed to production, am
>>>> I right?
>>>> Regarding your question, I would guess that if LOAD DATA INPATH crashes
>>>> while moving files into the Hive warehouse, the data which was moved will
>>>> appear as legitimate loaded data. Or the files will be moved but the
>>>> metadata will not be updated. In any case, you should detect the crash and
>>>> redo the operation. The easiest answer might actually be to look into the
>>>> source code - sometimes it can be easier to find than one would expect.
>>>> Not a complete answer, but hope this helps a bit.
>>>> Martin
>>>> On 14/06/2011 00:47, W S Chung wrote:
>>>>> I submitted a question like this before, but somehow that question was
>>>>> never delivered. I cannot even find my question on Google. Since I cannot
>>>>> find any admin e-mail or feedback form on the Hive website where I can ask
>>>>> why my last question was not delivered, there is not much option other than
>>>>> to post the question again and hope that it gets through this time. Sorry
>>>>> for the double posting if you have seen my last e-mail.
>>>>> What is the behaviour if a client of Hive crashes in the middle of
>>>>> running a "load data inpath" for either a local file or a file on HDFS?
>>>>> Will the file be partially loaded in the db? Thanks.
