From user-return-1072-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Mon Mar 15 16:41:03 2021 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mxout1-ec2-va.apache.org (mxout1-ec2-va.apache.org [3.227.148.255]) by mx-eu-01.ponee.io (Postfix) with ESMTPS id 23F27180642 for ; Mon, 15 Mar 2021 17:41:03 +0100 (CET) Received: from mail.apache.org (mailroute1-lw-us.apache.org [207.244.88.153]) by mxout1-ec2-va.apache.org (ASF Mail Server at mxout1-ec2-va.apache.org) with SMTP id 1C52441CCD for ; Mon, 15 Mar 2021 16:41:02 +0000 (UTC) Received: (qmail 7862 invoked by uid 500); 15 Mar 2021 16:41:01 -0000 Mailing-List: contact user-help@arrow.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@arrow.apache.org Delivered-To: mailing list user@arrow.apache.org Received: (qmail 7852 invoked by uid 99); 15 Mar 2021 16:41:01 -0000 Received: from spamproc1-he-de.apache.org (HELO spamproc1-he-de.apache.org) (116.203.196.100) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 15 Mar 2021 16:41:01 +0000 Received: from localhost (localhost [127.0.0.1]) by spamproc1-he-de.apache.org (ASF Mail Server at spamproc1-he-de.apache.org) with ESMTP id 087E61FF46F for ; Mon, 15 Mar 2021 16:41:01 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamproc1-he-de.apache.org X-Spam-Flag: NO X-Spam-Score: 0 X-Spam-Level: X-Spam-Status: No, score=0 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamproc1-he-de.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-he-de.apache.org ([116.203.227.195]) by localhost (spamproc1-he-de.apache.org [116.203.196.100]) (amavisd-new, port 10024) with ESMTP id twwC3c0EYoJ8 for ; Mon, 15 Mar 2021 16:41:00 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=2a00:1450:4864:20::634; helo=mail-ej1-x634.google.com; envelope-from=emkornfield@gmail.com; receiver= Received: from mail-ej1-x634.google.com (mail-ej1-x634.google.com [IPv6:2a00:1450:4864:20::634]) by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id 80AE77FB7A for ; Mon, 15 Mar 2021 16:41:00 +0000 (UTC) Received: by mail-ej1-x634.google.com with SMTP id dx17so67412324ejb.2 for ; Mon, 15 Mar 2021 09:41:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:reply-to:from:date:message-id :subject:to; bh=mQidf0QPdz6KTpZCoddQ+XbSmhCHGpy1sRbBN4mfwiQ=; b=W9ywYVNSbWmQYe/uEg+UBNZ4CiV504rCDBgVY50ocPllU1vB6sB9JsOGydIFihAyJs zspAJgIytQandgTfHxgOfaRvqGJPXxPRbB1goMHZWkALOJzN2ibHC4wqwdMVUqyUwYPp oL1mEYGv9RHgKiID2Ci7oyvXxZb0NuOrzz2zEQpIN/hc7suWXsUPYbiyGAvdbdvWAjC6 VEqn35wXdKwjgP1q9AWAUxZcGxVTmTj9jwRQsSvLFfJ1Y9R051TgjhIhrUs39t6ElHOp tiJRwvhjGFViX7iWdtRPS3Vc6hNIKhlwBQV0wLKVzD4KAEbFPd4aJW0+Jd42ECWjL2Dh paEA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to; bh=mQidf0QPdz6KTpZCoddQ+XbSmhCHGpy1sRbBN4mfwiQ=; b=fH25iXO0F6ILbdmt6B3ISn/B8xJHmlhUHduOFU7qv4A1cTPwOLhpNT3yhB7I1PfBLy UkvPqHO0gqv5fQRdQ8Ue7myhWP6vC/GBD0bLfcBMOubHr3Jl7zpZTIVIb2r679agpn86 fuPVLP0qtajOPiE71vta01/DAXLsFXobfUXXVjWlmjQdOT654EgFggzCm8r5vzrjBUgD Kvo+6gQEv2wbU00VmSsrj0LFoIvoCoyvXgBGzFgep93KKD6nFbyu/0ERTRW1LZ78UcL0 sunOpVe3ytZZfRDsApi29ErOcLw2dGgFPzjcZTB40yprcKBSTn73+v6ckDdxKvREd5X1 3jMA== X-Gm-Message-State: AOAM530el6ToJ9E9ICSSdft7ujXzW6ASfIdaD+bBhP0C2W4GWc+I+TV2 KJhW8DgzgYFK7IHvY0WuGjg3J1w5smvxesJTHeg/hEb8 X-Google-Smtp-Source: ABdhPJwBT0jfRVn3yoe6HRsdK/DFvZlrTlO8Evkc8idg9Bjq3FQvHn/cfV6cqNCQI3Oa0DwlhzsVvgTuRK04tlUCP8c= X-Received: by 2002:a17:906:73c2:: with SMTP id n2mr24692135ejl.224.1615826459859; Mon, 15 Mar 2021 09:40:59 -0700 (PDT) MIME-Version: 1.0 References: <3FFEE84E-6C7E-4495-9D75-294B4B6A7CAA@huggingface.co> In-Reply-To: Reply-To: emkornfield@gmail.com From: Micah Kornfield Date: Mon, 15 Mar 2021 09:40:49 -0700 Message-ID: Subject: Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ? To: user@arrow.apache.org Content-Type: multipart/alternative; boundary="000000000000289db605bd95eacd" --000000000000289db605bd95eacd Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable I am a little surprised that the difference is as pronounced in your benchmark for only 1024 chunks though, I'm not familiar enough with the memory mapped code path to be able to answer why that would be the case. On Mon, Mar 15, 2021 at 9:37 AM Micah Kornfield wrote: > Hi Quentin, > >> > More specifically, it looks like the time it takes to access an exampl= e >> of a ChunkedArray is a function of the index of the chunk in which the >> example is at. > > > This is accurate the O(1) access applies to a single Array. Chunked > arrays are stored as a C++ vector of Arrays. There is currently no > indexing structure on top of the vector to allow for anything better than > O(chunk) to an arbitrary element. > > This would probably be a good addition to the library. In the short ter= m > as a work-around you could build the index externally, and use that. (chu= nk > methods are exposed on ChunkedArray). > > -Micah > > > > > On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest > wrote: > >> Hi ! >> >> I=E2=80=99m working on the huggingface/datasets library and we are using= Arrow >> for natural language processing datasets. >> Our datasets are usually text datasets, often bigger than memory. >> >> We use memory mapping of arrow files to load the datasets. >> >> > Unfortunately it looks like accessing examples is not O(1) >> >> More specifically, it looks like the time it takes to access an example >> of a ChunkedArray is a function of the index of the chunk in which the >> example is at. >> If the example is in the last chunk, then it takes more time to access i= t >> than accessing an example in the first chunk. >> >> For example, with a Table consisting of 1 column =E2=80=9Ctext=E2=80=9D = defined by: >> - 1024 chunks >> - each chunk is 1024 rows >> - each row is a text of 1024 characters >> >> Then the time it takes to access one example are: >> >> ``` >> >> Time to access example at i=3D0% : 6.7=CE=BCs >> Time to access example at i=3D10% : 7.2=CE=BCs >> Time to access example at i=3D20% : 9.1=CE=BCs >> Time to access example at i=3D30% : 11.4=CE=BCs >> Time to access example at i=3D40% : 13.8=CE=BCs >> Time to access example at i=3D50% : 16.2=CE=BCs >> Time to access example at i=3D60% : 18.7=CE=BCs >> Time to access example at i=3D70% : 21.1=CE=BCs >> Time to access example at i=3D80% : 26.8=CE=BCs >> Time to access example at i=3D90% : 25.2=CE=BCs >> >> ``` >> >> The time measured are the average times to do `table[=E2=80=9Ctext=E2=80= =9D][j]` >> depending on the index we want to fetch (from the first example at 0% to >> the example at 90% of the length of the table). >> >> You can take a look at the code that produces this benchmark here >> . >> >> > Why do I observe this behavior ? Is there a way to get O(1) access >> without bringing the whole table in memory ? >> >> >> Thanks in advance for you help ! >> >> Quentin >> > --000000000000289db605bd95eacd Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
I am a little surprised that the difference is as pronounc= ed in your benchmark for only 1024 chunks though, I'm not familiar enou= gh with the memory mapped code path to be able to answer why that would be = the case.

On Mon, Mar 15, 2021 at 9:37 AM Micah Kornfield <emkornfield@gmail.com> wrote:
Hi Qu= entin,=C2=A0
> Mo= re specifically, it looks like the time it takes to access an example of a = ChunkedArray is a function of the index of the chunk in which the example i= s at.
=C2=A0
This is accurate the O(1) access ap= plies to a single Array.=C2=A0 Chunked arrays are stored as a C++ vector of= Arrays.=C2=A0 There is currently no indexing structure on top of the vecto= r to allow for anything better than O(chunk) to an arbitrary element.
=

This would probably be a good addition to the library.= =C2=A0 =C2=A0In the short term as a work-around you could build the index e= xternally, and use that. (chunk methods are exposed on ChunkedArray).
=

-Micah



On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <quentin@huggingface.co= > wrote:
Hi !

I=E2=80=99m working on the huggingface/datasets li= brary and we are using Arrow for natural language processing datasets.
Our datasets are usually text datasets, often bigger than memory.

We use memory mapping of arrow files to load the data= sets.

> Unfortunately it looks like accessing e= xamples is not O(1)

More specifically, it looks li= ke the time it takes to access an example of a ChunkedArray is a function o= f the index of the chunk in which the example is at.
If the examp= le is in the last chunk, then it takes more time to access it than accessin= g an example in the first chunk.

For example, with= a Table consisting of 1 column =E2=80=9Ctext=E2=80=9D defined by:
- 1024 chunks
- each chunk is 1024 rows
- each row is= a text of 1024 characters

Then the time it takes = to access one example are:

```
Time to access example at i=
=3D0%	: 6.7=CE=BCs
Time to access example at i=3D10%	: 7.2=CE=BCs
Time to access example at i=3D20%	: 9.1=CE=BCs
Time to access example at i=3D30%	: 11.4=CE=BCs
Time to access example at i=3D40%	: 13.8=CE=BCs
Time to access example at i=3D50%	: 16.2=CE=BCs
Time to access example at i=3D60%	: 18.7=CE=BCs
Time to access example at i=3D70%	: 21.1=CE=BCs
Time to access example at i=3D80%	: 26.8=CE=BCs
Time to access example at i=3D90%	: 25.2=CE=BCs
```

The time measured are the average times to do `table[=E2= =80=9Ctext=E2=80=9D][j]` depending on the index we want to fetch (from the = first example at 0% to the example at 90% of the length of the table).

You can take a look at the code that produces this ben= chmark=C2=A0her= e.

> Why do I observe this behavior ? Is th= ere a way to get O(1) access without bringing the whole table in memory ?


Thanks in advance for you help !

Quentin
--000000000000289db605bd95eacd--