arrow-user mailing list archives

From comic fans <comicfan...@gmail.com>
Subject question about read/write feather format and memory mapped write
Date Mon, 03 Aug 2020 01:42:01 GMT
Hello everyone,

I'm trying to write a data frame in Feather format from R and read it
back in C++.

My R code looks like this:

arrow::write_feather(data.frame(a = 1:1000, b = 1000:1),
                     'arrow.data', chunk_size = 500, compression = 'uncompressed')

And my C++ code looks like this:

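     // table is the arrow::Table read from 'arrow.data' (reading code omitted)
     auto column0 = table->column(0);
     // Print the size of every buffer in every chunk of the first column
     for (int i = 0; i < column0->num_chunks(); ++i) {
         auto array = column0->chunk(i);
         const auto& buffers = array->data()->buffers;
         for (std::size_t j = 0; j < buffers.size(); ++j) {
             if (!buffers[j]) {
                 std::cout << j << " null" << std::endl;
             } else {
                 std::cout << j << " " << buffers[j]->size() << std::endl;
             }
         }
     }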

I found that the buffers read in C++ are
[nullptr, 500 numbers, nullptr, 500 numbers]. With chunk_size = 10
I get [nullptr, 40, nullptr, 40, ...], which confuses me: why is there
a seemingly useless nullptr buffer before every data buffer?
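
One possible reading of this output, sketched without the Arrow library:
in the Arrow columnar format a primitive array carries two buffers, a
validity bitmap followed by the values, and the bitmap may be left
unallocated when no value is null. The sketch below only mimics that
layout. `Buffer`, `MakeInt32ChunkBuffers` and `PrintBufferSizes` are
hypothetical names, not Arrow API, and allocation padding is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

// Hypothetical stand-in for arrow::Buffer; only the byte size matters here.
using Buffer = std::vector<std::uint8_t>;

// Buffers of one int32 chunk, mimicking the Arrow columnar layout:
//   [0] validity bitmap (one bit per value), left null when nothing is null
//   [1] values (4 bytes per int32)
std::vector<std::shared_ptr<Buffer>> MakeInt32ChunkBuffers(std::size_t length,
                                                           bool has_nulls) {
    std::vector<std::shared_ptr<Buffer>> buffers(2);
    if (has_nulls) {
        // Round the bit count up to whole bytes.
        buffers[0] = std::make_shared<Buffer>((length + 7) / 8, 0xFF);
    }
    buffers[1] = std::make_shared<Buffer>(length * sizeof(std::int32_t));
    return buffers;
}

// Same printing loop as in the message above, applied to the stand-in buffers.
void PrintBufferSizes(const std::vector<std::shared_ptr<Buffer>>& buffers) {
    for (std::size_t j = 0; j < buffers.size(); ++j) {
        if (!buffers[j]) {
            std::cout << j << " null" << std::endl;
        } else {
            std::cout << j << " " << buffers[j]->size() << std::endl;
        }
    }
}
```

Under these assumptions, `PrintBufferSizes(MakeInt32ChunkBuffers(10, false))`
reports a null slot 0 and a 40-byte slot 1 (10 values of 4 bytes each),
which matches the pattern seen with chunk_size = 10.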

My other question is how to use Arrow as a zero-copy TSDB. My
requirements are:

1. Historic and newly written data must be in contiguous memory and
    cannot be chunked (so I can't put the read-only historic part
    and the writable new part in different buffers).
2. Historic data may be very large, so I need it memory mapped.
3. I also want to use the memory map to persist newly written data
    (I don't have strict transaction requirements; an OS-scheduled
    flush is OK for me).
4. The amount of new data to write is known in advance, so a
    preallocated memory mapped file is OK.
5. All components live in the same process, so no cross-process
    communication is needed (and Apache Plasma is not needed).
6. Data must be easy to exchange with R.
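
To make requirements 1 to 4 concrete, here is a minimal sketch using
plain POSIX mmap rather than any Arrow API. `MappedColumn`, `OpenColumn`,
`Append` and `CloseColumn` are hypothetical names, the column is assumed
to hold int32 values, and the number of values written so far is kept
only in memory (a real design would have to persist it somewhere).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: one contiguous int32 column backed by a preallocated file.
struct MappedColumn {
    std::int32_t* values = nullptr;  // start of the contiguous mapping
    std::size_t capacity = 0;        // preallocated number of values
    std::size_t length = 0;          // values written so far (in memory only)
};

// Creates or reopens `path`, sizes it for `capacity` values and maps it
// read-write. MAP_SHARED lets the OS write dirty pages back to the file.
bool OpenColumn(const char* path, std::size_t capacity, MappedColumn* out) {
    assert(capacity > 0);  // mmap rejects zero-length mappings
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return false;
    std::size_t bytes = capacity * sizeof(std::int32_t);
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        close(fd);
        return false;
    }
    void* addr = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;
    out->values = static_cast<std::int32_t*>(addr);
    out->capacity = capacity;
    out->length = 0;
    return true;
}

// Appends into the preallocated tail, so old and new data stay contiguous.
bool Append(MappedColumn* col, std::int32_t v) {
    if (col->length >= col->capacity) return false;
    col->values[col->length++] = v;
    return true;
}

// Unmaps the column; the file keeps whatever was written.
void CloseColumn(MappedColumn* col) {
    munmap(col->values, col->capacity * sizeof(std::int32_t));
    *col = MappedColumn{};
}
```

A caller would open the column once with the known capacity, call
`Append` for each new value, and read `values[0 .. length)` as a single
contiguous range with no copy.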

At first I thought Arrow was a good fit, but after reading some of the
docs I realized that buffers in Arrow can't be modified: if I create a
Feather file with the array size preallocated, all the data becomes
read-only when I reload it (through the memory mapped file interface).
I worked around this by applying const_cast to the data pointer and
writing through it. Since the file is memory mapped, the modifications
do change the file as I intend, but I'd like to know whether there is
a better way to achieve my goal. Does Arrow intend to support this use
case, and did I miss some API?
Any advice would be helpful.
