arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?
Date Mon, 15 Mar 2021 16:37:52 GMT
Hi Quentin,

> > More specifically, it looks like the time it takes to access an example
> of a ChunkedArray is a function of the index of the chunk in which the
> example is at.


This is accurate the O(1) access applies to a single Array.  Chunked arrays
are stored as a C++ vector of Arrays.  There is currently no indexing
structure on top of the vector to allow for anything better than O(chunk)
to an arbitrary element.

This would probably be a good addition to the library.   In the short term
as a work-around you could build the index externally, and use that. (chunk
methods are exposed on ChunkedArray).

-Micah




On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <quentin@huggingface.co>
wrote:

> Hi !
>
> I’m working on the huggingface/datasets library and we are using Arrow for
> natural language processing datasets.
> Our datasets are usually text datasets, often bigger than memory.
>
> We use memory mapping of arrow files to load the datasets.
>
> > Unfortunately it looks like accessing examples is not O(1)
>
> More specifically, it looks like the time it takes to access an example of
> a ChunkedArray is a function of the index of the chunk in which the example
> is at.
> If the example is in the last chunk, then it takes more time to access it
> than accessing an example in the first chunk.
>
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
>
> Then the time it takes to access one example are:
>
> ```
>
> Time to access example at i=0%	: 6.7μs
> Time to access example at i=10%	: 7.2μs
> Time to access example at i=20%	: 9.1μs
> Time to access example at i=30%	: 11.4μs
> Time to access example at i=40%	: 13.8μs
> Time to access example at i=50%	: 16.2μs
> Time to access example at i=60%	: 18.7μs
> Time to access example at i=70%	: 21.1μs
> Time to access example at i=80%	: 26.8μs
> Time to access example at i=90%	: 25.2μs
>
> ```
>
> The time measured are the average times to do `table[“text”][j]` depending
> on the index we want to fetch (from the first example at 0% to the example
> at 90% of the length of the table).
>
> You can take a look at the code that produces this benchmark here
> <https://pastebin.com/pSkYHQn9>.
>
> > Why do I observe this behavior ? Is there a way to get O(1) access
> without bringing the whole table in memory ?
>
>
> Thanks in advance for you help !
>
> Quentin
>

Mime
View raw message