arrow-user mailing list archives

From Jacob Quinn <quinn.jac...@gmail.com>
Subject Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?
Date Mon, 15 Mar 2021 18:03:06 GMT
In case it's helpful at all, on the Julia side, we have an index included
<https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14>
with the ChainedVector type (the Julia equivalent of ChunkedArray), where
each element of the `inds` field is the last index of the corresponding
array chunk. We then use that index when indexing
<https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L32>
with `searchsortedfirst`, which does a binary search over the indices. We
also overload iteration
<https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L91>
so that sequential access just moves from chunk to chunk without having to
pay the binary-search indexing cost.
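
For anyone following along in Python, here is a minimal sketch of the same
scheme, with plain lists standing in for Arrow chunks (`ChunkIndex` and its
methods are illustrative names, not pyarrow API):

```python
import bisect
from itertools import accumulate

class ChunkIndex:
    """Toy chunked sequence with a precomputed end-offset index,
    mirroring ChainedVector's `inds` field."""

    def __init__(self, chunks):
        self.chunks = chunks
        # inds[k] is the global index one past the last element of chunk k
        self.inds = list(accumulate(len(c) for c in chunks))

    def __getitem__(self, i):
        # binary search: first chunk whose end-offset exceeds i
        k = bisect.bisect_right(self.inds, i)
        start = self.inds[k - 1] if k > 0 else 0
        return self.chunks[k][i - start]

    def __iter__(self):
        # sequential access walks chunk to chunk; no per-element binary search
        for chunk in self.chunks:
            yield from chunk

arr = ChunkIndex([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
print(arr[4])     # random access via binary search -> 5
print(list(arr))  # sequential iteration -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

An equivalent index over a pyarrow ChunkedArray could reuse the same
cumulative offsets, at the cost of keeping them in sync if chunks change.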

Anyway, maybe that at least sparks some ideas.

-Jacob

On Mon, Mar 15, 2021 at 11:11 AM Quentin Lhoest <quentin@huggingface.co>
wrote:

> Thanks for your response!
>
> > This is accurate; the O(1) access applies to a single Array. Chunked
> > arrays are stored as a C++ vector of Arrays. There is currently no
> > indexing structure on top of the vector to allow for anything better than
> > O(n_chunks) access to an arbitrary element.
>
>
> That makes sense, thanks.
>
> > I am a little surprised that the difference is as pronounced in your
> > benchmark for only 1024 chunks, though; I'm not familiar enough with the
> > memory-mapped code path to be able to answer why that would be the case.
>
>
> For reference, I’m getting approximately the same results without memory
> mapping, though:
> ```
>
> Time to access example at i=0%	: 9.2μs
> Time to access example at i=10%	: 7.6μs
> Time to access example at i=20%	: 8.7μs
> Time to access example at i=30%	: 10.8μs
> Time to access example at i=40%	: 13.0μs
> Time to access example at i=50%	: 15.4μs
> Time to access example at i=60%	: 17.6μs
> Time to access example at i=70%	: 20.5μs
> Time to access example at i=80%	: 22.9μs
> Time to access example at i=90%	: 24.1μs
>
> ```
>
> In the previous code I just replaced `pa.memory_map(…)` with `open(…,
> "rb")`.
>
> Also, regarding the time difference, it must come from an iteration over
> the chunks until the requested example is reached.
>
> In terms of timing, 20μs is a reasonable time to iterate over a list of
> 1000 elements in Python, since for loops are pretty slow.
> I would expect it to be much faster if the iteration were done on the C++
> side, though.
>
> Do you know if the iteration is done on the Python side or on the C++
> side?
>
> > This would probably be a good addition to the library. In the short term,
> > as a workaround, you could build the index externally and use that (chunk
> > methods are exposed on ChunkedArray).
>
>
> May I open a JIRA issue about adding an indexing structure for
> ChunkedArray?
>
>
> Best,
>
> Quentin
>
> On Mar 15, 2021, at 5:40 PM, Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
> I am a little surprised that the difference is as pronounced in your
> benchmark for only 1024 chunks, though; I'm not familiar enough with the
> memory-mapped code path to be able to answer why that would be the case.
>
> On Mon, Mar 15, 2021 at 9:37 AM Micah Kornfield <emkornfield@gmail.com>
> wrote:
> Hi Quentin,
> > More specifically, it looks like the time it takes to access an example
> > of a ChunkedArray is a function of the index of the chunk in which the
> > example resides.
>
> This is accurate; the O(1) access applies to a single Array. Chunked
> arrays are stored as a C++ vector of Arrays. There is currently no
> indexing structure on top of the vector to allow for anything better than
> O(n_chunks) access to an arbitrary element.
>
> This would probably be a good addition to the library. In the short term,
> as a workaround, you could build the index externally and use that (chunk
> methods are exposed on ChunkedArray).
>
> -Micah
>
>
>
>
> On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <quentin@huggingface.co>
> wrote:
> Hi!
>
> I’m working on the huggingface/datasets library and we are using Arrow for
> natural language processing datasets.
> Our datasets are usually text datasets, often bigger than memory.
>
> We use memory mapping of arrow files to load the datasets.
>
> > Unfortunately it looks like accessing examples is not O(1)
>
> More specifically, it looks like the time it takes to access an example of
> a ChunkedArray is a function of the index of the chunk in which the example
> resides.
> If the example is in the last chunk, it takes more time to access than an
> example in the first chunk.
>
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
>
> Then the average times it takes to access one example are:
>
> ```
> Time to access example at i=0% : 6.7μs
> Time to access example at i=10% : 7.2μs
> Time to access example at i=20% : 9.1μs
> Time to access example at i=30% : 11.4μs
> Time to access example at i=40% : 13.8μs
> Time to access example at i=50% : 16.2μs
> Time to access example at i=60% : 18.7μs
> Time to access example at i=70% : 21.1μs
> Time to access example at i=80% : 26.8μs
> Time to access example at i=90% : 25.2μs
>
> ```
>
> The times measured are the average times to do `table["text"][j]` depending
> on the index we want to fetch (from the first example at 0% to the example
> at 90% of the length of the table).
>
> You can take a look at the code that produces this benchmark here.
>
> > Why do I observe this behavior? Is there a way to get O(1) access
> > without bringing the whole table in memory?
>
>
> Thanks in advance for your help!
>
> Quentin
>
>
>
