arrow-dev mailing list archives

From "Xavier Lange (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ARROW-5153) [Rust] Use IntoIter trait for write_batch/write_mini_batch
Date Tue, 09 Apr 2019 19:17:00 GMT
Xavier Lange created ARROW-5153:
-----------------------------------

             Summary: [Rust] Use IntoIter trait for write_batch/write_mini_batch
                 Key: ARROW-5153
                 URL: https://issues.apache.org/jira/browse/ARROW-5153
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Xavier Lange


Writing data to a parquet file requires a lot of copying and intermediate Vec creation. Take
a record struct like:

    struct MyData {
        name: String,
        address: Option<String>,
    }

Over the course of working with sets of this data, you'll have the bulk data in a Vec<MyData>,
the names column in a Vec<&String>, and the address column in a Vec<Option<&String>>.
This puts extra memory pressure on the system: at a minimum we have to allocate a Vec the
same size as the bulk data, even when we are only holding references.
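
For concreteness, here is a minimal sketch of the copy-heavy flow this forces today
(write_batch_slice is a hypothetical stand-in, not the actual parquet API):

    // Stand-in for the current slice-based write_batch; illustrative only.
    fn write_batch_slice(values: &[&String]) {
        for v in values {
            println!("{}", v);
        }
    }

    struct MyData {
        name: String,
        address: Option<String>,
    }

    fn main() {
        let bulk = vec![
            MyData { name: "alice".into(), address: Some("1 Main St".into()) },
            MyData { name: "bob".into(), address: None },
        ];
        // An extra Vec the same length as the bulk data, allocated only
        // to satisfy the &[...] parameter:
        let names: Vec<&String> = bulk.iter().map(|x| &x.name).collect();
        write_batch_slice(&names);
    }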

What I'm proposing is an IntoIterator style. This maintains backward compatibility, since a
slice automatically implements IntoIterator. ColumnWriterImpl#write_batch would go from taking
"values: &[T::T]" to taking "values: impl IntoIterator<Item = T::T>". Then you can do things like

    write_batch(bulk.iter().map(|x| &x.name), None, None)
    write_batch(bulk.iter().map(|x| x.address.as_ref()),
                Some(bulk.iter().map(|x| x.address.is_some())),
                None)

and you can see there's no need for an intermediate Vec, so there are no short-term allocations
made just to write out the data.
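
As a rough sketch of what the signature change could look like (a hypothetical free function;
the real method lives on ColumnWriterImpl and also takes repetition levels):

    struct MyData {
        name: String,
        address: Option<String>,
    }

    // Hypothetical generic signature: values and definition levels are both
    // accepted as IntoIterator, so slices and lazy maps work alike.
    fn write_batch<V, D>(values: V, def_levels: Option<D>)
    where
        V: IntoIterator,
        V::Item: std::fmt::Debug,
        D: IntoIterator<Item = i16>,
    {
        for v in values {
            println!("value: {:?}", v);
        }
        if let Some(levels) = def_levels {
            for l in levels {
                println!("def level: {}", l);
            }
        }
    }

    fn main() {
        let bulk = vec![
            MyData { name: "alice".into(), address: Some("1 Main St".into()) },
            MyData { name: "bob".into(), address: None },
        ];
        // The maps are consumed lazily; no column-sized Vec is ever built.
        write_batch(bulk.iter().map(|x| &x.name), None::<Vec<i16>>);
        write_batch(
            bulk.iter().map(|x| x.address.as_deref()),
            Some(bulk.iter().map(|x| x.address.is_some() as i16)),
        );
    }

One caveat on the backward-compat claim: &[T] implements IntoIterator with Item = &T rather
than Item = T, so existing slice-based callers would need at most an added .iter().cloned()
at the call site (or a blanket impl over borrowed items) to keep compiling.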

I am writing data with many columns and I think this would really help to speed things up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
