arrow-user mailing list archives

From: Fernando Herrera <fernando.j.herr...@gmail.com>
Subject: Re: [RUST] Reading parquet
Date: Sun, 24 Jan 2021 12:40:56 GMT
Thanks Andrew,

I did read the examples you mentioned, but I don't think they cover what I
want to do. I need to create two hash maps from the parquet file and then do
further comparisons on those maps. In both cases I need to build a set of
unique ngrams from strings stored in the parquet file.
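
Roughly, per string column I want something like the sketch below (simplified;
getting the column out of the RecordBatch is not shown):

    use std::collections::HashSet;

    use arrow::array::{Array, StringArray};

    // Collect the unique character ngrams of length `n` from a string column.
    fn unique_ngrams(column: &StringArray, n: usize) -> HashSet<String> {
        let mut ngrams = HashSet::new();
        for i in 0..column.len() {
            if column.is_valid(i) {
                let chars: Vec<char> = column.value(i).chars().collect();
                for window in chars.windows(n) {
                    ngrams.insert(window.iter().collect());
                }
            }
        }
        ngrams
    }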

By the way, would it make sense to create a struct Table similar to the one
in pyarrow to collect several Record Batches?

Also, how should an object that implements Array (a dyn Array) be downcast to
the other array types? I'm doing it now using as_any and then downcast_ref to
the type I want, but I have to write the type in the code and I want to find
a way for it to be done automatically.
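
For reference, this is roughly what it looks like now (a simplified sketch;
the second function shows matching on the DataType, which still means writing
out each concrete type):

    use arrow::array::{Array, ArrayRef, Int64Array, StringArray};
    use arrow::datatypes::DataType;

    // What I do now: downcast the dynamically typed column to the concrete
    // array type before reading values from it.
    fn print_strings(column: &ArrayRef) {
        let strings = column
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("expected a StringArray");
        for i in 0..strings.len() {
            println!("{}", strings.value(i));
        }
    }

    // Dispatching on the runtime DataType avoids fixing a single concrete
    // type at the call site, but each variant still has to be written out.
    fn describe(column: &ArrayRef) {
        match column.data_type() {
            DataType::Utf8 => {
                let a = column.as_any().downcast_ref::<StringArray>().unwrap();
                println!("utf8 column, {} values", a.len());
            }
            DataType::Int64 => {
                let a = column.as_any().downcast_ref::<Int64Array>().unwrap();
                println!("int64 column, {} values", a.len());
            }
            other => println!("unhandled data type {:?}", other),
        }
    }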

Thanks,
Fernando

On Sun, 24 Jan 2021, 12:01 Andrew Lamb, <alamb@influxdata.com> wrote:

> Hi Fernando,
>
> Keeping the data in memory as `RecordBatch`es sounds like the way to go if
> you want it all to be in memory.
>
> Another way to work in Rust with data from parquet files is to use the
> `DataFusion` library; depending on your needs it might save you some time
> building up your analytics (e.g. it has aggregations, filtering and sorting
> built in).
>
> Here are some examples of how to use DataFusion with a parquet file (with
> the DataFrame and the SQL APIs):
>
> https://github.com/apache/arrow/blob/master/rust/datafusion/examples/dataframe.rs
>
> https://github.com/apache/arrow/blob/master/rust/datafusion/examples/parquet_sql.rs
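>
> For instance, the SQL route against a parquet file looks roughly like the
> sketch below (the path, table name and column names are placeholders, and
> the exact API differs a little between DataFusion versions):
>
>     use datafusion::error::Result;
>     use datafusion::prelude::*;
>
>     #[tokio::main]
>     async fn main() -> Result<()> {
>         let mut ctx = ExecutionContext::new();
>         // Register the parquet file as a table that SQL can reference.
>         ctx.register_parquet("example", "data/example.parquet")?;
>         // Aggregation, filtering and sorting run inside DataFusion.
>         let df = ctx.sql("SELECT col, COUNT(*) AS n FROM example GROUP BY col")?;
>         let results = df.collect().await?;
>         println!("got {} record batches", results.len());
>         Ok(())
>     }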
>
> If you already have RecordBatches you can register an in-memory table as
> well.
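>
> A rough sketch of that (the table name is a placeholder; depending on the
> DataFusion version, register_table takes the provider as a Box or an Arc):
>
>     use arrow::record_batch::RecordBatch;
>     use datafusion::datasource::MemTable;
>     use datafusion::prelude::*;
>
>     // Assumes `batches` is non-empty and all batches share one schema.
>     fn register_batches(
>         ctx: &mut ExecutionContext,
>         batches: Vec<RecordBatch>,
>     ) -> datafusion::error::Result<()> {
>         let schema = batches[0].schema();
>         // All batches go into a single in-memory partition.
>         let table = MemTable::try_new(schema, vec![batches])?;
>         ctx.register_table("my_table", Box::new(table));
>         Ok(())
>     }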
>
> Hope that helps,
> Andrew
>
>
> On Sat, Jan 23, 2021 at 7:33 AM Fernando Herrera <
> fernando.j.herrera@gmail.com> wrote:
>
>> Hi all,
>>
>> A quick question regarding reading a parquet file. What is the best way
>> to read a parquet file and keep it in memory to do data analysis?
>>
>> What I'm doing now is using the record reader from the
>> ParquetFileArrowReader to read all the record batches from the file. I keep
>> the batches in memory in a vector of record batches so that I have access
>> to them for the aggregations I need to do on the file.
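>>
>> Roughly like the sketch below (simplified, and the exact reader API moves
>> around a bit between arrow versions; the path is a placeholder):
>>
>>     use std::fs::File;
>>     use std::sync::Arc;
>>
>>     use arrow::record_batch::RecordBatch;
>>     use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
>>     use parquet::file::reader::SerializedFileReader;
>>
>>     fn read_batches(path: &str) -> Vec<RecordBatch> {
>>         let file = File::open(path).unwrap();
>>         let file_reader = SerializedFileReader::new(file).unwrap();
>>         let mut reader = ParquetFileArrowReader::new(Arc::new(file_reader));
>>         // Read the whole file as record batches of up to 1024 rows each.
>>         let batches = reader.get_record_reader(1024).unwrap();
>>         batches.map(|batch| batch.unwrap()).collect()
>>     }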
>>
>> Is there another way to do this?
>>
>> Thanks,
>> Fernando
>>
>
