arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?
Date Mon, 15 Mar 2021 18:02:09 GMT
>
> Do you know if the iteration is done on the Python side or on the C++
> side?

It appears to be in Cython [1], which, looking at the definition, I would
expect to compile down to pretty straightforward C code.
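
For illustration, a minimal pure-Python sketch of that linear scan over
chunks (this is not the pyarrow source, just the access pattern it amounts to):

```
import pyarrow as pa

def chunked_getitem(chunked_array: pa.ChunkedArray, i: int):
    # Walk the chunks in order until the requested index falls inside one of
    # them; the cost therefore grows with the chunk index, O(n_chunks) worst case.
    for chunk in chunked_array.iterchunks():
        if i < len(chunk):
            return chunk[i]
        i -= len(chunk)
    raise IndexError(i)
```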

> May I create a post on JIRA about adding an indexing structure for
> ChunkedArray?

Yes, please do. I'm sure others will have thoughts on the best way to
incorporate this (it might also be a good first contribution to the project
if you are interested). It would also be good to provide some context from
this thread on the relative slowness (the bottleneck might still be
something else that others more familiar with the code could point to).


[1]
https://github.com/apache/arrow/blob/8e8a0009c44b96e1eb34f87e9fd631ac9309479d/python/pyarrow/table.pxi#L164

On Mon, Mar 15, 2021 at 10:11 AM Quentin Lhoest <quentin@huggingface.co>
wrote:

> Thanks for your response !
>
> This is accurate; the O(1) access applies to a single Array.  Chunked
> arrays are stored as a C++ vector of Arrays.  There is currently no
> indexing structure on top of the vector to allow for anything better than
> O(n_chunks) access to an arbitrary element.
>
>
> That makes sense, thanks.
>
> I am a little surprised that the difference is as pronounced in your
> benchmark for only 1024 chunks, though; I'm not familiar enough with the
> memory-mapped code path to say why that would be the case.
>
>
> For reference, I'm getting approximately the same results without memory
> mapping, though:
> ```
>
> Time to access example at i=0%	: 9.2μs
> Time to access example at i=10%	: 7.6μs
> Time to access example at i=20%	: 8.7μs
> Time to access example at i=30%	: 10.8μs
> Time to access example at i=40%	: 13.0μs
> Time to access example at i=50%	: 15.4μs
> Time to access example at i=60%	: 17.6μs
> Time to access example at i=70%	: 20.5μs
> Time to access example at i=80%	: 22.9μs
> Time to access example at i=90%	: 24.1μs
>
> ```
>
> In the previous code, I just replaced `pa.memory_map(…)` with
> `open(…, "rb")`.
>
> Also, regarding the time difference, it must come from iterating over the
> chunks until the requested example is reached.
>
> In terms of timing, 20μs is a reasonable time to iterate over a list of
> 1000 elements in Python, since for loops are pretty slow.
> I would expect it to be much faster if it were done on the C++ side, though.
>
> Do you know if the iteration is done on the python side, or on the C++
> side ?
>
> This would probably be a good addition to the library. In the short term,
> as a workaround, you could build the index externally and use that (chunk
> methods are exposed on ChunkedArray).
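
A minimal sketch of such an external index, assuming the hypothetical helper
names `build_offsets` and `fast_getitem` (only `iterchunks` and `chunk` are
pyarrow APIs):

```
import bisect
from itertools import accumulate

import pyarrow as pa

def build_offsets(chunked_array: pa.ChunkedArray):
    # Cumulative end offset of each chunk, computed once per ChunkedArray.
    return list(accumulate(len(c) for c in chunked_array.iterchunks()))

def fast_getitem(chunked_array: pa.ChunkedArray, offsets, i: int):
    # Binary search finds the owning chunk in O(log n_chunks)
    # instead of scanning the chunks linearly.
    chunk_index = bisect.bisect_right(offsets, i)
    start = offsets[chunk_index - 1] if chunk_index else 0
    return chunked_array.chunk(chunk_index)[i - start]

# Usage: build the offsets once, then reuse them for every lookup, e.g.
#   offsets = build_offsets(table["text"])
#   value = fast_getitem(table["text"], offsets, 123456)
```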
>
>
> May I create a post on JIRA about adding an indexing structure for
> ChunkedArray?
>
>
> Best,
>
> Quentin
>
> On Mar 15, 2021, at 5:40 PM, Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
> I am a little surprised that the difference is as pronounced in your
> benchmark for only 1024 chunks, though; I'm not familiar enough with the
> memory-mapped code path to say why that would be the case.
>
> On Mon, Mar 15, 2021 at 9:37 AM Micah Kornfield <emkornfield@gmail.com>
> wrote:
> Hi Quentin,
> > More specifically, it looks like the time it takes to access an example
> > of a ChunkedArray is a function of the index of the chunk in which the
> > example is located.
>
> This is accurate; the O(1) access applies to a single Array.  Chunked
> arrays are stored as a C++ vector of Arrays.  There is currently no
> indexing structure on top of the vector to allow for anything better than
> O(n_chunks) access to an arbitrary element.
>
> This would probably be a good addition to the library. In the short term,
> as a workaround, you could build the index externally and use that (chunk
> methods are exposed on ChunkedArray).
>
> -Micah
>
>
>
>
> On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <quentin@huggingface.co>
> wrote:
> Hi !
>
> I’m working on the huggingface/datasets library and we are using Arrow for
> natural language processing datasets.
> Our datasets are usually text datasets, often bigger than memory.
>
> We use memory mapping of arrow files to load the datasets.
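
For reference, a minimal sketch of that loading pattern ("dataset.arrow" is
a placeholder path):

```
import pyarrow as pa

# Open the Arrow IPC file through a memory map so the data stays on disk and
# pages are only brought in on access.
with pa.memory_map("dataset.arrow") as source:
    table = pa.ipc.open_file(source).read_all()

# Columns of a table read this way are ChunkedArrays
# (one chunk per record batch in the file).
text_column = table["text"]
```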
>
> > Unfortunately it looks like accessing examples is not O(1)
>
> More specifically, it looks like the time it takes to access an example of
> a ChunkedArray is a function of the index of the chunk in which the example
> is located.
> If the example is in the last chunk, then it takes more time to access it
> than an example in the first chunk.
>
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
>
> Then the times to access one example are:
>
> ```
> Time to access example at i=0% : 6.7μs
> Time to access example at i=10% : 7.2μs
> Time to access example at i=20% : 9.1μs
> Time to access example at i=30% : 11.4μs
> Time to access example at i=40% : 13.8μs
> Time to access example at i=50% : 16.2μs
> Time to access example at i=60% : 18.7μs
> Time to access example at i=70% : 21.1μs
> Time to access example at i=80% : 26.8μs
> Time to access example at i=90% : 25.2μs
>
> ```
>
> The times measured are the average times to do `table["text"][j]`, depending
> on the index we want to fetch (from the first example at 0% to the example
> at 90% of the length of the table).
>
> You can take a look at the code that produces this benchmark here.
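
A rough sketch of this kind of benchmark (in memory rather than memory-mapped;
the sizes match the description above and need roughly 1 GiB of RAM):

```
import time
import pyarrow as pa

# 1024 chunks x 1024 rows per chunk x 1024-character strings, as described above.
chunks = [pa.array(["x" * 1024] * 1024) for _ in range(1024)]
table = pa.table({"text": pa.chunked_array(chunks)})

for pct in range(0, 100, 10):
    j = int(len(table) * pct / 100)
    start = time.perf_counter()
    for _ in range(100):
        table["text"][j]  # access cost grows with the chunk index of j
    avg_us = (time.perf_counter() - start) / 100 * 1e6
    print(f"Time to access example at i={pct}%\t: {avg_us:.1f}μs")
```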
>
> > Why do I observe this behavior? Is there a way to get O(1) access
> > without bringing the whole table into memory?
>
>
> Thanks in advance for your help!
>
> Quentin
>
>
>
