MIME-Version: 1.0
References: <CALFZbYDARE05SpFs0VwkMUkZtU491pmdxRPUdTh=6sRgNv6voA@mail.gmail.com>
 <CAJPUwMBZBR6BJy4eV88MbTGV8BzhF0UDWDM=6fEPUPVsf36U_g@mail.gmail.com>
 <c44707cb-0d99-b4cc-3bfd-79c54e320ada@crvm.io> <a011831f-3e6f-4dcb-902f-fd3fdbbaa5cb@Spark>
 <1fb8087e-1463-857a-507c-5aa63002a47c@crvm.io> <b563b429-a757-42c8-895d-7092ad6e4fb4@Spark>
 <5d11bdb4-1cf8-116d-828e-9931ddb53430@crvm.io>
In-Reply-To: <5d11bdb4-1cf8-116d-828e-9931ddb53430@crvm.io>
From: Daniel Nugent <nugend@gmail.com>
Date: Tue, 26 Jan 2021 13:30:19 -0500
Message-ID: <CAPZSVyo=UFkEtWMaXwFsV_gNWWpxqp6veJA3iji5kuUaHBsWJg@mail.gmail.com>
Subject: Re: Question the nature of the "Zero Copy" advantages of Apache Arrow
To: user@arrow.apache.org
Content-Type: multipart/alternative; boundary="0000000000006a229405b9d1d999"

--0000000000006a229405b9d1d999
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Right, I recognize that using mmap directly isn't necessarily the most
straightforward approach, which is why I suggested using a RAM disk with
uncompressed Arrow files. That saves the trouble of passing addresses
around and puts a nice file system API on top of any dataset operations
Arrow already supports that you might want to do (you can get a
reasonable approach to appends this way, for example).

But if you've already got an on-disk, uncompressed Arrow buffer that's
bigger than memory, the Arrow API should take care of using the mmap
system calls to map it into memory (at least I think this is currently
supported in all the Arrow libraries? You may have to double check. I
know it's in C/C++/Python for sure, probably Rust, and I think R?).

Then you're only dealing with virtual allocations, and you can open that
larger-than-memory file in as many analytics packages as you like. Every
mapping is backed by the same OS page cache, so there will only be one
copy of any portion of that file in physical memory at any given time.
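
The sharing itself comes from the OS rather than from Arrow, so it can be
sketched with nothing but Python's standard mmap module. This is only an
illustration: the scratch file is hypothetical, and two mappings inside
one process stand in for two separate analytics processes:

```python
import mmap
import os
import tempfile

# Hypothetical scratch file standing in for an on-disk Arrow file.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * mmap.PAGESIZE)
os.close(fd)

# Two independent handles and mappings, as two processes would have.
with open(path, "r+b") as fw, open(path, "rb") as fr:
    with mmap.mmap(fw.fileno(), 0, access=mmap.ACCESS_WRITE) as writer:
        with mmap.mmap(fr.fileno(), 0, access=mmap.ACCESS_READ) as reader:
            # Write through one mapping ...
            writer[0:5] = b"arrow"
            # ... and observe it through the other, with no read() call.
            seen = bytes(reader[0:5])

os.remove(path)
print(seen)
```

Both mappings are backed by the same page-cache pages, which is why the
write through one shows up in the other without the file being re-read.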

-Dan Nugent


On Tue, Jan 26, 2021 at 1:15 PM Thomas Browne <thomas@crvm.io> wrote:

> Yes, I think the term "zero copy" was confusing to me. It doesn't
> quite do what it says on the tin since, if I understand correctly, the
> term still allows for an actual copy to occur; it's just a direct
> binary copy without a [de]serialisation process.
>
> I hear you on Plasma.
>
> On the issue of MAP_SHARED, got it, but that means I'm having to talk C
> from other languages.
>
> I think Jorge's answer (
> https://arrow.apache.org/docs/format/CDataInterface.html) is pretty good
> though. Good enough for me anyway. Thanks everyone.
> On 26/01/2021 18:09, Daniel Nugent wrote:
>
> I think you might be a bit confused about what zero copy means if that's
> what you're concerned about. If you have a bigger than memory file, then
> Plasma wasn't going to help since its design always involved copying the
> arrow buffers to memory.
>
> If you have larger than memory arrow files in the first place, just open
> them using mmap (should be automatically done for non-compressed arrow
> files).
>
> --
> -Dan Nugent
> On Jan 26, 2021, 13:07 -0500, Thomas Browne <thomas@crvm.io> wrote:
>
> Don't I lose the benefit of mmapping huge files with a ramdisk? Cos the
> file now has to fit on my ramdisk.
>
> Personally working with financial tick data which can be enormous.
> On 26/01/2021 18:00, Daniel Nugent wrote:
>
> Is there a problem with just using a RAM disk as the method for sharing
> the arrow buffers? It just seems easier and less finicky than a separate
> API to program against.
>
> It also makes storing the data permanently a lot more straightforward, I
> think.
>
> --
> -Dan Nugent
> On Jan 26, 2021, 12:47 -0500, Thomas Browne <thomas@crvm.io> wrote:
>
> So one of the big advantages of Arrow is the common format in memory, on
> the wire, across languages.
>
> I get that this makes it very easy and fast to transfer data between
> nodes, and between languages, which will all share the in-memory format
> and therefore the (often expensive) serialisation step is removed.
>
> However, is it true that one of the core objectives of the project is
> also to allow shared memory objects across different languages on the
> same node? For example, a fast C-based ingest system constantly
> populates a pyarrow buffer, which can be read directly by any other
> application on that node, through pointer sharing?
>
> If this is a core objective, what is the canonical way for brokering the
> "pointers" to this data between languages? Is it the Plasma store? And
> if so, are there plans for Plasma to be implemented in other client
> languages?
>
> In short: is Plasma (or, if not Plasma, the functionality it provides,
> implemented some other way) a core objective of the project?
>
> Or instead is Flight supposed to be used between languages on the same
> node, and if so, does Flight provide true zero-copy (i.e. the same
> buffer, not copying the buffer) if run between processes on the same node?
>
> Many thanks.
>
>

--0000000000006a229405b9d1d999--