orc-dev mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: The Orc magic string
Date Fri, 21 Jun 2019 15:34:36 GMT
It is expected but, like most of Hive's ACID layout, it is badly documented.
The code is in OrcAcidUtils
<https://github.com/apache/orc/blob/1c5a020382059b9fea3344ffe428b1f8986b0a12/java/core/src/java/org/apache/orc/impl/OrcAcidUtils.java#L42>.

.. Owen
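
As a rough illustration of the side file described in the quoted message
below (a sequence of 8-byte file offsets for completed file footers), the
following Python sketch picks out the last complete offset. It is not the
OrcAcidUtils code: the function name is hypothetical, and it assumes each
offset is a big-endian signed 64-bit integer and that a trailing partial
entry (from an interrupted write) should be ignored.

```python
import struct

OFFSET_SIZE = 8  # each side-file entry is one 8-byte file offset


def last_complete_offset(side_file_bytes: bytes):
    """Return the last fully written offset, or None if there is none.

    Assumption: entries are big-endian signed 64-bit integers.
    Assumption: a trailing partial entry is ignored.
    """
    # Round the length down to a whole number of 8-byte entries.
    usable = len(side_file_bytes) - len(side_file_bytes) % OFFSET_SIZE
    if usable == 0:
        return None
    (offset,) = struct.unpack_from(">q", side_file_bytes, usable - OFFSET_SIZE)
    return offset
```

A reader would then treat the returned offset as the logical end of the
data file, as the message describes.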


On Sat, Jun 15, 2019 at 12:25 PM Dain Sundstrom <dain@iq80.com> wrote:

> Is this expected behavior of ORC acid writers?  If so, is it documented
> somewhere?
>
> -dain
>
> ----
> Dain Sundstrom
> Co-founder @ Presto Software Foundation, Co-creator of Presto (
> https://prestosql.io)
>
> > On Jun 14, 2019, at 6:17 PM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> >
> > The Hive ACID format uses a side file that provides a sequence of the
> > 8-byte file offsets for completed file footers. If the side file is
> > there, the last offset is passed to the reader, which treats it as the
> > end of the file.
> >
> > In the case where you don't have that, searching for the string
> > “\003ORC” works really well for finding the tails. In the corrupted
> > files I've seen, I've never needed more than that.
> >
> > .. Owen
> >
> >> On Jun 14, 2019, at 09:52, Xiening Dai <xndai.git@live.com> wrote:
> >>
> >> Hi all,
> >>
> >> In the ORC append scenario, the append operation (writing both the
> >> additional data and the new footer) needs to be atomic. Otherwise, if it
> >> fails partway through, the file tail is unrecognizable. Unfortunately,
> >> not all file systems can guarantee atomic writes. When a failure does
> >> happen, recovering the data from before the append requires locating the
> >> previous file footer by searching backward, and the only way to search
> >> for the footer is to look for the “ORC” magic string. But the current
> >> magic string has only three characters, so the same string is likely to
> >> appear in user data, which would result in parsing a wrong footer with
> >> undefined behavior.
> >>
> >> So I am wondering whether we can change the magic string to a 16-byte
> >> UUID. That way we can safely use it to locate the footer. The idea is
> >> very similar to the sync marker in Avro.
> >>
> >> Thanks.
>
>
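
The fallback discussed in the thread, scanning backward for the byte string
"\003ORC" when no side file is available, might look roughly like the Python
sketch below. The function name is hypothetical, and the sketch only reports
where the pattern occurs. Each hit is a candidate: as the original post
points out, the same bytes can appear in user data, so a caller would still
need to parse and validate the tail at each candidate before trusting it.

```python
MAGIC_TAIL = b"\x03ORC"  # the byte 0x03 followed by the ASCII text "ORC"


def find_tail_candidates(data: bytes) -> list:
    """Return the offsets of MAGIC_TAIL in data, last occurrence first."""
    hits = []
    end = len(data)
    while True:
        # rfind returns the highest start index of a match inside data[0:end].
        pos = data.rfind(MAGIC_TAIL, 0, end)
        if pos < 0:
            break
        hits.append(pos)
        # Shrink the window so the next match must start before pos.
        end = pos + len(MAGIC_TAIL) - 1
    return hits
```

Trying the candidates in the returned order (latest first) matches the
recovery goal in the thread: find the most recent footer that was written
completely.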
