orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: The Orc magic string
Date Sat, 15 Jun 2019 19:25:03 GMT
Is this the expected behavior of ORC ACID writers?  If so, is it documented somewhere?

-dain

----
Dain Sundstrom
Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io)

> On Jun 14, 2019, at 6:17 PM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> 
> The Hive ACID format uses a side file that provides a sequence of the 8-byte file offsets
> for completed file footers. If the side file is there, the last offset is passed to the reader,
> which treats it as the end of the file.
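A minimal sketch of reading such a side file, for illustration only. It assumes the offsets are stored as big-endian signed 64-bit integers and that a partially written trailing entry should be ignored; neither detail is stated in the message above, and the function name is hypothetical.

```python
import struct


def last_flushed_length(side_file_bytes: bytes):
    """Return the last complete 8-byte offset in the side file, or None.

    Assumption: each entry is a big-endian signed 64-bit integer.
    """
    # Ignore a torn trailing write that left fewer than 8 bytes.
    complete = len(side_file_bytes) // 8 * 8
    if complete == 0:
        return None
    return struct.unpack(">q", side_file_bytes[complete - 8:complete])[0]
```

A reader would then treat the returned value as the logical end of the data file and parse the tail that ends there.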
> 
> In the case where you don't have that, searching for the string “\003ORC” works really
> well for finding the tails. In the corrupted files I've seen, I've never needed more than that.
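The backward search described above could be sketched as follows. This is an illustrative helper, not code from the ORC project: the function name is hypothetical, and it only reports where the byte string occurs. A caller would still need to try parsing a tail at each candidate, because the same bytes can also occur in user data.

```python
MAGIC = b"\x03ORC"  # the string from the message: byte 0x03 followed by "ORC"


def candidate_tail_offsets(data: bytes):
    """Yield offsets of MAGIC in data, scanning from the end toward the start."""
    end = len(data)
    while True:
        pos = data.rfind(MAGIC, 0, end)
        if pos < 0:
            return
        yield pos
        # The next match must start strictly before this one.
        end = pos + len(MAGIC) - 1
```

For a file truncated partway through an append, the first candidate that parses as a valid tail marks the end of the last complete write.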

> 
> .. Owen
> 
>> On Jun 14, 2019, at 09:52, Xiening Dai <xndai.git@live.com> wrote:
>> 
>> Hi all,
>> 
>> In the ORC append scenario, the append operation (including writing the additional
>> data and the new footer) needs to be atomic. Otherwise, if it fails partway through, the file
>> tail will be unrecognizable. Unfortunately, not all file systems can guarantee atomic writes.
>> When a failure does happen, in order to recover the data written before the append, we need
>> to locate the previous file footer by searching backward. The only way to search for the
>> footer is by looking for the “ORC” magic string. But the current magic string has only three
>> characters, and it's likely that the same string appears in user data, which would result in
>> parsing a wrong footer, after which the behavior is undefined.
>> 
>> So I am thinking we could change the magic string to a 16-byte UUID. That way
>> we can safely use it to locate the footer. The idea is very similar to the sync marker
>> in Avro.
>> 
>> Thanks.

