arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: question about read/write feather format and memory mapped write
Date Mon, 03 Aug 2020 03:24:18 GMT
>
> I found that the data buffers read in C++ are
> [nullptr, 500 numbers, nullptr, 500 numbers]; with chunk_size = 10
> I get [nullptr, 40 numbers, nullptr, 40 numbers, ...], which confuses
> me: why is there a useless nullptr Buffer before every data Buffer?

The buffers in ArrayData reflect the Arrow columnar layout: for a primitive
array, the first buffer is the validity bitmap and the second holds the
values. The nullptr is the elided validity buffer; Arrow omits it when the
array contains no null values.
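
For example, with the table already read back in C++ (a small sketch reusing
the `table` from your snippet below, and assuming the R integer columns
arrived as Int32, so each 500-row chunk carries a 2000-byte values buffer):

    // For a primitive column chunk such as Int32, the layout is:
    //   buffers[0] -> validity bitmap (nullptr here: no nulls, so it is elided)
    //   buffers[1] -> the values themselves
    auto chunk = std::static_pointer_cast<arrow::Int32Array>(
        table->column(0)->chunk(0));
    std::cout << "null count:  " << chunk->null_count() << std::endl;      // 0
    std::cout << "value bytes: " << chunk->values()->size() << std::endl;  // 500 * 4
    std::cout << "first value: " << chunk->Value(0) << std::endl;          // 1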

Regarding pre-allocation, this has been discussed before, but no one has
contributed an implementation for it yet. The last conversation was [1]. It
doesn't mention memory mapping, but I think that could potentially fit in
with the right abstractions.

[1] https://www.mail-archive.com/dev@arrow.apache.org/msg19862.html
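
On the memory-mapped read side, the pieces that exist today already give you
a zero-copy path. A minimal sketch (assuming the Arrow 1.0 C++ API and the
uncompressed 'arrow.data' file from your R example; with an uncompressed file
the table's buffers can point directly into the mapping, which is also why
they are exposed as immutable):

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/feather.h>
    #include <arrow/result.h>

    // Memory-map the Feather file and read it without copying the data.
    arrow::Status ReadMapped(std::shared_ptr<arrow::Table>* table) {
      ARROW_ASSIGN_OR_RAISE(
          auto file, arrow::io::MemoryMappedFile::Open(
                         "arrow.data", arrow::io::FileMode::READ));
      ARROW_ASSIGN_OR_RAISE(auto reader, arrow::ipc::feather::Reader::Open(file));
      return reader->Read(table);
    }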

On Sun, Aug 2, 2020 at 6:42 PM comic fans <comicfans44@gmail.com> wrote:

> Hello everyone, I'm trying to write out a data frame in Feather format
> from R and read it in C++.
>
> My R code looks like this:
>
> arrow::write_feather(data.frame(a = 1:1000, b = 1000:1),
>                      'arrow.data', chunk_size = 500, compression = 'uncompressed')
>
> and my C++ code looks like this:
>
>      // `table` was read back from the Feather file
>      auto column0 = table->column(0);
>      for (int i = 0; i < column0->num_chunks(); ++i) {
>          auto array = column0->chunk(i);
>          auto buffers = array->data()->buffers;
>          for (size_t j = 0; j < buffers.size(); ++j) {
>              if (!buffers[j]) {
>                  std::cout << j << " null" << std::endl;
>              } else {
>                  std::cout << j << " " << buffers[j]->size() << std::endl;
>              }
>          }
>      }
>
> I found that the data buffers read in C++ are
> [nullptr, 500 numbers, nullptr, 500 numbers]; with chunk_size = 10
> I get [nullptr, 40 numbers, nullptr, 40 numbers, ...], which confuses
> me: why is there a useless nullptr Buffer before every data Buffer?
>
> Another question is how to use Arrow as a zero-copy TSDB. My
> intentions:
>
> 1. historic and newly written data must be in contiguous memory and
>    cannot be chunked (so I can't keep the historic read-only part
>    and the newly writable part in different buffers)
> 2. historic data may be very big, so I need it memory mapped
> 3. I also want to use the memory map to persist newly written data
>    (I don't have strict transaction requirements; OS-scheduled
>    flushes are OK for me)
> 4. how much new data will be written is known, so a preallocated
>    memory-mapped file is OK
> 5. all components live in the same process, with no cross-process
>    communication needed (so Apache Plasma is not needed)
> 6. easy data exchange with R
>
> At first I thought Arrow was a good fit, but after reading some docs I
> realized that buffers in Arrow can't be modified: if I write a feather
> file with the array size preallocated, all of the data becomes read-only
> when I reload it (through the memory-mapped file interface). I abuse
> Arrow by const_cast-ing the data pointer and writing into it; since it's
> memory mapped, the modification does change the file as I intend, but
> I'd like to know if there is a better way to achieve my goal. Does Arrow
> intend to support such a use case and I've missed some API?
> Any advice would be helpful.
>
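
To avoid the const_cast for the writable part, one rough option (not an
existing Arrow feature, just a sketch that assumes you manage the mapping
yourself with POSIX mmap and know the value count up front) is to wrap the
writable region in an arrow::MutableBuffer and build the array on top of it:

    #include <arrow/api.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Preallocate a file, mmap it writable, and expose the region to Arrow
    // as a MutableBuffer.  The Int32 array built on top is backed by the
    // mapping, so writes through buffer->mutable_data() land in the file
    // (flush timing is left to the OS).  Error handling omitted for brevity.
    arrow::Result<std::shared_ptr<arrow::Array>> MapWritableInt32(
        const char* path, int64_t num_values) {
      const int64_t nbytes = num_values * sizeof(int32_t);
      int fd = open(path, O_RDWR | O_CREAT, 0644);
      ftruncate(fd, nbytes);  // size is known up front, so preallocate
      auto* addr = static_cast<uint8_t*>(
          mmap(nullptr, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
      auto buffer = std::make_shared<arrow::MutableBuffer>(addr, nbytes);
      // No validity bitmap (null_count = 0), matching the layout discussed above.
      auto data = arrow::ArrayData::Make(arrow::int32(), num_values,
                                         {nullptr, buffer}, /*null_count=*/0);
      return arrow::MakeArray(data);
    }

Nothing here is transactional; the OS decides when dirty pages reach disk,
which matches point 3 in your list.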
