arrow-dev mailing list archives

From "Yogesh Tewari (Jira)" <j...@apache.org>
Subject [jira] [Created] (ARROW-7048) [Java] Support for combining multiple vectors under VectorSchemaRoot
Date Sat, 02 Nov 2019 02:49:00 GMT
Yogesh Tewari created ARROW-7048:
------------------------------------

             Summary: [Java] Support for combining multiple vectors under VectorSchemaRoot
                 Key: ARROW-7048
                 URL: https://issues.apache.org/jira/browse/ARROW-7048
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Java
            Reporter: Yogesh Tewari


Hi,

 

pyarrow.Table.combine_chunks provides the nice ability to combine multiple record batches
under a single pyarrow.Table.

 

I am currently working on a downstream application that reads data from BigQuery. The BigQuery
Storage API supports data output in Arrow format, but it streams the data in many batches of
1024 rows or fewer.

It would be really nice to have the Arrow Java API provide this functionality under an abstraction
like VectorSchemaRoot.

After getting guidance from [~emkornfield@gmail.com], I tried to write my own implementation
by copying data vector by vector using TransferPair's copyValueSafe.

But, unless I am missing something obvious, it turns out that it only copies one value at a time.
That means a lot of looping, calling copyValueSafe for millions of rows from a source vector index
to a target vector index. Ideally, I would want to concatenate/link the underlying buffers rather
than copying one cell at a time.
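
To make the difference concrete, here is a rough plain-Java illustration. It is not Arrow code: each int[] merely stands in for one batch's data buffer of a single fixed-width column, and the class and method names (CombineChunksSketch, combinePerValue, combineBulk) are hypothetical. It only contrasts a per-value loop with one bulk copy per batch; it ignores validity bitmaps, variable-width data and memory ownership, which a real implementation would have to handle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in: each int[] plays the role of one batch's data buffer
// for a single fixed-width column. No Arrow classes are used here.
final class CombineChunksSketch {

    // Per-value copy: one assignment per row, analogous to looping over copyValueSafe.
    static int[] combinePerValue(List<int[]> batches) {
        int[] combined = new int[totalRows(batches)];
        int target = 0;
        for (int[] batch : batches) {
            for (int source = 0; source < batch.length; source++) {
                combined[target++] = batch[source];
            }
        }
        return combined;
    }

    // Bulk copy: one System.arraycopy call per batch instead of one step per row.
    static int[] combineBulk(List<int[]> batches) {
        int[] combined = new int[totalRows(batches)];
        int offset = 0;
        for (int[] batch : batches) {
            System.arraycopy(batch, 0, combined, offset, batch.length);
            offset += batch.length;
        }
        return combined;
    }

    // Total number of rows across all batches.
    static int totalRows(List<int[]> batches) {
        int total = 0;
        for (int[] batch : batches) {
            total += batch.length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<int[]> batches = new ArrayList<>();
        batches.add(new int[] {1, 2, 3});
        batches.add(new int[] {4, 5});
        batches.add(new int[] {6});
        // Prints [1, 2, 3, 4, 5, 6]
        System.out.println(Arrays.toString(combineBulk(batches)));
        // Prints true: both approaches produce the same combined column
        System.out.println(Arrays.equals(combineBulk(batches), combinePerValue(batches)));
    }
}
```

Both methods return the same result; the bulk version simply makes one call per batch rather than one per cell, which is the kind of behaviour I am hoping for at the buffer level.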

 

E.g., if I have:
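{code:java}
List<VectorSchemaRoot> batchList = new ArrayList<>();
try (ArrowStreamReader reader = new ArrowStreamReader(
        new ByteArrayInputStream(out.toByteArray()), allocator)) {
    // The reader owns a single root, which is reloaded with new values
    // on every call to loadNextBatch
    VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
    Schema schema = readBatch.getSchema();
    while (reader.loadNextBatch()) {
        // Copy the current batch into its own root so the next load does not overwrite it;
        // the caller is responsible for closing these copies
        VectorSchemaRoot copy = VectorSchemaRoot.create(schema, allocator);
        try (ArrowRecordBatch recordBatch = new VectorUnloader(readBatch).getRecordBatch()) {
            new VectorLoader(copy).load(recordBatch);
        }
        batchList.add(copy);
    }
}

//VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}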
 

Could there be a method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?

I did read the VectorSchemaRoot discussion on https://issues.apache.org/jira/browse/ARROW-6896 and
am not sure if it's the right thing to use here.

 

 

P.S. Feel free to update the title of this feature request to more appropriate wording.

 

Cheers,

Yogesh

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
