arrow-user mailing list archives

From comic fans <comicfan...@gmail.com>
Subject Re: question about read/write feather format and memory mapped write
Date Mon, 03 Aug 2020 04:07:22 GMT
Great, thanks. I need to read the RecordBatch docs to see whether it fits my
requirements.

On Mon, Aug 3, 2020 at 11:25 AM Micah Kornfield <emkornfield@gmail.com> wrote:
>>
>> I found that the buffers C++ reads are
>> [nullptr, 500 numbers, nullptr, 500 numbers]; with chunk_size = 10
>> I get [nullptr, 40 numbers, nullptr, 40 numbers, ...]. This confuses me:
>> why is there a seemingly useless nullptr Buffer before every data Buffer?
>
> The buffers in ArrayData reflect the Arrow columnar layout. The nullptr elides
the validity buffer when there are no null values.
>
> Regarding pre-allocation, this has been discussed before, but no one has
contributed an implementation for it. The last conversation was [1]. It doesn't
mention memory mapping, but I think that could potentially fit in with the
right abstractions.
>
> [1] https://www.mail-archive.com/dev@arrow.apache.org/msg19862.html
>
> On Sun, Aug 2, 2020 at 6:42 PM comic fans <comicfans44@gmail.com> wrote:
>>
>> Hello everyone. I'm trying to write out a data frame in Feather format
>> from R and read it in C++.
>>
>> my R code looks like this:
>>
>> arrow::write_feather(data.frame(a = 1:1000, b = 1000:1),
>>   'arrow.data', chunk_size = 500, compression = 'uncompressed')
>>
>> and my C++ code looks like this:
>>
>>     auto column0 = table->column(0);
>>     for (int i = 0; i < column0->num_chunks(); ++i) {
>>         auto array = column0->chunk(i);
>>         // ArrayData holds the physical buffers of the Arrow layout.
>>         auto buffers = array->data()->buffers;
>>         for (size_t j = 0; j < buffers.size(); ++j) {
>>             if (!buffers[j]) {
>>                 std::cout << j << " null" << std::endl;
>>             } else {
>>                 std::cout << j << " " << buffers[j]->size() << std::endl;
>>             }
>>         }
>>     }
>>
>> I found that the buffers C++ reads are
>> [nullptr, 500 numbers, nullptr, 500 numbers]; with chunk_size = 10
>> I get [nullptr, 40 numbers, nullptr, 40 numbers, ...]. This confuses me:
>> why is there a seemingly useless nullptr Buffer before every data Buffer?
>>
>> Another question: how can I use Arrow as a zero-copy TSDB? My
>> requirements:
>>
>> 1. historic and newly written data must live in contiguous memory,
>>     not chunked (so I can't put the historic read-only part and
>>     the newly writable part in different buffers)
>> 2. historic data may be very large, so I need it memory mapped
>> 3. I also want to use the memory map to persist newly written data
>>    (no strict transaction requirements; an OS-scheduled flush
>>     is OK for me)
>> 4. the amount of new data to write is known in advance, so a
>>     preallocated memory-mapped file is OK
>> 5. all components live in the same process, so no cross-process
>>     communication is needed (hence Apache Plasma is not needed)
>> 6. easy data exchange with R
>>
>> At first I thought Arrow was a good fit, but after reading the docs I
>> realized that buffers in Arrow can't be modified: if I write a Feather
>> file with the array sizes preallocated, all data becomes read-only when
>> I reload it (through the memory-mapped file interface). I abused Arrow
>> by const_cast-ing the data pointer and writing through it; since the
>> file is memory mapped, the modification does change the file as I
>> intended. Still, I'd like to know whether there is a better way to
>> achieve my goal. Does Arrow intend to support such a use case and I
>> missed some API? Any advice would be helpful.
