arrow-user mailing list archives

From Nirmala S <nanna.tech.st...@gmail.com>
Subject Re: Caching layer using arrow
Date Mon, 08 Apr 2019 14:59:48 GMT
Sure, will try to contribute.

With the ORC adaptor we only get columns. A typical case is an underlying schema made up
of multiple columns of different data types (date, float, int, string). Is there any optimisation
to read the data row-wise without actually reading the whole file as a Table?

I looked into the following:

ORCFileReader::Read(..) gives a Table.
ORCFileReader::ReadStripe gives a RecordBatch, on which I can operate at column level.

Is there a way to get something similar to a RecordBatch, but as a row?
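
To make the question concrete, here is a minimal sketch of the kind of row-style interface I have in mind, layered over columnar data. It does not use the Arrow API: `Batch`, `Column`, `Cell` and `RowView` are hypothetical stand-ins (plain std::vector columns instead of arrow::RecordBatch and its arrays), so it only illustrates the idea of a non-owning row cursor that copies a cell out only when it is asked for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a typed Arrow array: one vector per column.
using Column = std::variant<std::vector<int64_t>, std::vector<double>,
                            std::vector<std::string>>;
// A single value taken from any of the column types above.
using Cell = std::variant<int64_t, double, std::string>;

// Hypothetical stand-in for arrow::RecordBatch: equal-length columns.
struct Batch {
  std::vector<Column> columns;
  std::size_t num_rows = 0;
};

// Non-owning view of one row. The batch must outlive the view.
class RowView {
 public:
  RowView(const Batch& batch, std::size_t row) : batch_(&batch), row_(row) {
    assert(row < batch.num_rows);
  }

  // Number of cells in the row, i.e. the number of columns.
  std::size_t size() const { return batch_->columns.size(); }

  // Copies out the cell at the given column for this row.
  Cell operator[](std::size_t col) const {
    return std::visit(
        [this](const auto& values) -> Cell { return values[row_]; },
        batch_->columns[col]);
  }

 private:
  const Batch* batch_;
  std::size_t row_;
};
```

Usage would be something like `RowView row(batch, i); auto cell = row[0];`, with the data itself staying in columnar form underneath.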


> On 29-Mar-2019, at 8:23 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> 
> hi,
> 
> On Fri, Mar 29, 2019 at 9:49 AM Nirmala S <nanna.tech.stuff@gmail.com> wrote:
>> 
>> Thanks Wes. I do have a couple more questions:
>> - When a table is read using the ORC adaptor, it gets read into a memory pool (in my case
>> default_memory_pool). How do I free this area once the file is processed?
> 
> With the default memory pool, the memory is freed automatically when
> the RecordBatch data structures are destructed.
> 
>> - Is there any way to read the ORC file metadata from the adaptor?
> 
> Doesn't look like it yet. This would be a nice contribution to the library.
> 
>> 
>> 
>>> On 29-Mar-2019, at 7:18 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>> 
>>> The Arrow APIs are batch-based, so if you want to go record-by-record
>>> you would need to develop an interface on top of the
>>> arrow::RecordBatch data structure
>>> 
>>> On Wed, Mar 27, 2019 at 2:06 AM Nirmala S <nanna.tech.stuff@gmail.com> wrote:
>>>> 
>>>> Now I see there is an ORC adaptor for Arrow which can read an ORC file as a table.
>>>> With this in place, I intend to use TableBatchReader to read it.
>>>> 
>>>> How do I get a single record from TableBatchReader?
>>>> 
>>>> 
>>>>> On 22-Mar-2019, at 12:18 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>> 
>>>>> hi Nirmala,
>>>>> 
>>>>> There aren't any tools in the libraries to help you "out of the box",
>>>>> so you'll probably have to devise your own metadata storage and state
>>>>> management scheme for such a system.
>>>>> 
>>>>> best
>>>>> Wes
>>>>> 
>>>>> On Thu, Mar 21, 2019 at 9:53 AM Nirmala S <nanna.tech.stuff@gmail.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>>      I am trying to build a caching layer using Arrow on top of ORC
>>>>>> files. The application will ask the cache for a column of data (which can be of
>>>>>> any data type, fixed or variable length), and the cache needs to check its metadata
>>>>>> to see if the column is already present. If yes, it can return the data to the
>>>>>> application. If not, the data needs to be fetched from the ORC files, cached and then
>>>>>> returned to the application. The application is multi-threaded and written in C++.
>>>>>> It has a read-only workload.
>>>>>> 
>>>>>>      This being the case, what is the best method to maintain the
>>>>>> metadata and the data in Arrow? Is there any good practice?
>>>>>> 
>>>>>>      If the cache size is smaller than the ORC file size, should I be
>>>>>> putting in logic to swap the data using some algorithm like LRU, or is this already
>>>>>> present in Arrow?
>>>>>> 
>>>>>> 
>>>>>> Thanks in advance
>>>>>> Nirmala
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>> 
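
On the LRU question from the start of the thread: since the libraries offer nothing out of the box, the state management has to live in the application. Below is a minimal sketch of one possible shape for it, assuming columns are looked up by name and the budget is counted in bytes. It is not part of Arrow; `LruColumnCache` and `ColumnData` are hypothetical names, and `ColumnData` is a plain byte vector standing in for whatever Arrow column data would actually be cached. A single mutex guards all state because the application is multi-threaded.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical payload type; a real cache would hold Arrow column data.
using ColumnData = std::vector<char>;

class LruColumnCache {
 public:
  explicit LruColumnCache(std::size_t capacity_bytes)
      : capacity_(capacity_bytes) {}

  // Returns the cached column, or nullptr on a miss.
  std::shared_ptr<const ColumnData> Get(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(name);
    if (it == index_.end()) return nullptr;
    // Move the entry to the front to mark it as most recently used.
    order_.splice(order_.begin(), order_, it->second);
    return it->second->second;
  }

  // Inserts or replaces a column, then evicts least recently used entries
  // until the total fits the budget. The newest entry is always kept, even
  // if it alone exceeds the budget.
  void Put(const std::string& name, std::shared_ptr<const ColumnData> data) {
    assert(data != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(name);
    if (it != index_.end()) {
      used_ -= it->second->second->size();
      order_.erase(it->second);
      index_.erase(it);
    }
    used_ += data->size();
    order_.emplace_front(name, std::move(data));
    index_[name] = order_.begin();
    while (used_ > capacity_ && order_.size() > 1) {
      used_ -= order_.back().second->size();
      index_.erase(order_.back().first);
      order_.pop_back();
    }
  }

  // Total size in bytes of the columns currently held.
  std::size_t used_bytes() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return used_;
  }

 private:
  using Entry = std::pair<std::string, std::shared_ptr<const ColumnData>>;
  mutable std::mutex mutex_;
  std::size_t capacity_;
  std::size_t used_ = 0;
  std::list<Entry> order_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Returning shared_ptr means an evicted column stays alive for any reader still holding it, and its memory is released when the last holder lets go, which matches the point above about memory being freed when the data structures are destructed.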

