orc-dev mailing list archives

From Owen O'Malley <owen.omal...@gmail.com>
Subject Re: The Orc magic string
Date Sat, 15 Jun 2019 01:17:51 GMT
The Hive ACID format uses a side file that records a sequence of 8-byte file offsets, one for
each completed file footer. If the side file is present, the last offset is passed to the
reader, which treats it as the end of the file.
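
As a rough illustration of the side-file idea (a sketch, not the actual Hive code), the
function below picks the last complete 8-byte offset out of the side file's contents. The
name last_footer_offset is hypothetical, and the big-endian signed encoding is an assumption
based on how Java's DataOutputStream writes a long. A partially written trailing entry is
ignored.

```python
import struct


def last_footer_offset(side_file_bytes: bytes):
    """Return the last complete 8-byte offset, or None if there is none.

    Assumption: each entry is a big-endian signed 64-bit integer.
    """
    # Drop any incomplete trailing entry left behind by an interrupted write.
    usable = len(side_file_bytes) - (len(side_file_bytes) % 8)
    if usable == 0:
        return None
    (offset,) = struct.unpack(">q", side_file_bytes[usable - 8:usable])
    return offset
```

A reader would then treat the returned offset as the logical end of the data file and ignore
anything written after it.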

If you don't have that side file, searching for the string “\003ORC” works really
well for finding the tails. In the corrupted files I've seen, I've never needed more than that.
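
A minimal sketch of that search, under stated assumptions: it scans the raw bytes backward
for “\003ORC” and yields candidate end-of-file offsets, newest first. The function name
candidate_tail_ends is hypothetical. The "+ 1" assumes that exactly one byte (the postscript
length) follows the magic at the end of a valid tail. Each candidate would still need to be
validated by actually parsing the tail at that offset.

```python
MARKER = b"\x03ORC"  # the bytes written as "\003ORC" above


def candidate_tail_ends(data: bytes):
    """Yield candidate end-of-file offsets, scanning from the end backward."""
    pos = len(data)
    while True:
        # Find the last occurrence that lies entirely before the previous match.
        pos = data.rfind(MARKER, 0, pos)
        if pos < 0:
            return
        # Assumption: one trailing byte (postscript length) follows the magic.
        end = pos + len(MARKER) + 1
        if end <= len(data):
            yield end
```

For recovery, one would try each yielded offset in turn and keep the first one whose tail
parses cleanly.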


.. Owen

> On Jun 14, 2019, at 09:52, Xiening Dai <xndai.git@live.com> wrote:
> 
> Hi all,
> 
> In the ORC append scenario, the append operation (including writing the additional data
> and the new footer) needs to be atomic. Otherwise, if it fails partway through, the file tail
> would be unrecognizable. Unfortunately, not all file systems can guarantee atomic writes. When
> a failure does happen, in order to recover the data from before the append, we would need to
> locate the previous file footer by searching backward. And the only way to search for the
> footer is by looking for the “ORC” magic string. But the current magic string has only three
> characters, and it's likely that the same string appears in user data, which would result in
> parsing a wrong footer, and the behavior is undefined.
> 
> So I am thinking we could change the magic string to a 16-byte UUID. This
> way we can safely use it to locate the footer. The idea is very similar to the sync marker
> in Avro.
> 
> Thanks.
