orc-dev mailing list archives

From "Xiening Dai (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-262) Support async prefetch in Orc reader
Date Tue, 07 Nov 2017 20:36:00 GMT
Xiening Dai created ORC-262:

             Summary: Support async prefetch in Orc reader
                 Key: ORC-262
                 URL: https://issues.apache.org/jira/browse/ORC-262
             Project: ORC
          Issue Type: Improvement
          Components: C++
            Reporter: Xiening Dai

Currently, the RowReader::next() method reads a batch of rows and returns them to the runtime
for processing. The call is synchronous, meaning that the execution thread is blocked while
the reader is loading data from disk. We could parallelize execution and data loading
through async prefetch, using the logic described below.

In SeekableFileInputStream::Next(), we first check whether the requested data block has already
been prefetched. If it has, we simply return the buffer to the caller; otherwise we issue a sync
call to read the data from the file stream. Regardless of how the requested block was loaded, we
always issue another async call to prefetch the next block within the current stream.
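
The logic above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the ORC implementation: BlockSource and PrefetchingReader are hypothetical names, the "file" is an in-memory string, and std::async stands in for whatever async IO mechanism the reader would really use.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <utility>

// Hypothetical stand-in for a file; readBlock() is a blocking (sync) read.
class BlockSource {
 public:
  explicit BlockSource(std::string data) : data_(std::move(data)) {}
  std::string readBlock(std::size_t offset, std::size_t length) const {
    if (offset >= data_.size()) return std::string();
    return data_.substr(offset, length);
  }
  std::size_t size() const { return data_.size(); }

 private:
  std::string data_;
};

// Hypothetical reader: returns the prefetched block if it matches the
// request, otherwise reads synchronously; either way it then starts an
// async prefetch of the following block.
class PrefetchingReader {
 public:
  PrefetchingReader(const BlockSource& source, std::size_t blockSize)
      : source_(source), blockSize_(blockSize) {
    assert(blockSize > 0);
  }

  // Returns the next block, or an empty string at end of stream.
  std::string next() {
    if (position_ >= source_.size()) return std::string();
    std::string block;
    if (pending_.valid() && pendingOffset_ == position_) {
      block = pending_.get();  // hit: blocks only if the read is still in flight
      ++hits_;
    } else {
      if (pending_.valid()) pending_.get();  // discard a stale prefetch
      block = source_.readBlock(position_, blockSize_);  // miss: sync read
      ++misses_;
    }
    position_ += block.size();
    if (position_ < source_.size()) {
      pendingOffset_ = position_;
      const BlockSource* src = &source_;
      pending_ = std::async(std::launch::async,
                            [src, off = position_, len = blockSize_] {
                              return src->readBlock(off, len);
                            });
    }
    return block;
  }

  // Moving the read position makes any in-flight prefetch stale.
  void seek(std::size_t offset) { position_ = offset; }

  std::size_t hits() const { return hits_; }
  std::size_t misses() const { return misses_; }

 private:
  const BlockSource& source_;
  std::size_t blockSize_;
  std::size_t position_ = 0;
  std::size_t pendingOffset_ = 0;
  std::future<std::string> pending_;
  std::size_t hits_ = 0;
  std::size_t misses_ = 0;
};
```

In this sketch only the first read after construction or after a seek() is a miss; sequential reads are served from the prefetched block. The source must outlive the reader, since the background task reads from it.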

Additionally, orc::InputStream will need a new method that performs an async read for a given
offset and length.
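
One possible shape for such a method is sketched below. Everything here is an assumption made for illustration: the sketch::InputStream class only mimics a stream interface with a blocking read, and the readAsync name, its std::future return type, and the MemoryInputStream helper are hypothetical rather than part of the ORC API.

```cpp
#include <cstdint>
#include <cstring>
#include <future>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Simplified, hypothetical stream interface with a blocking read.
class InputStream {
 public:
  virtual ~InputStream() = default;
  virtual std::uint64_t getLength() const = 0;
  virtual void read(void* buf, std::uint64_t length, std::uint64_t offset) = 0;

  // Hypothetical addition: start a read of `length` bytes at `offset` on a
  // background thread and return a future holding the bytes. This default
  // wraps the blocking read(); a real file system could override it with
  // native async IO. The stream must outlive the returned future, and
  // read() must be safe to call concurrently.
  virtual std::future<std::vector<char>> readAsync(std::uint64_t offset,
                                                   std::uint64_t length) {
    return std::async(std::launch::async, [this, offset, length] {
      std::vector<char> buf(length);
      read(buf.data(), length, offset);
      return buf;
    });
  }
};

// In-memory stream used only to exercise the sketch.
class MemoryInputStream : public InputStream {
 public:
  explicit MemoryInputStream(std::string data) : data_(std::move(data)) {}

  std::uint64_t getLength() const override { return data_.size(); }

  void read(void* buf, std::uint64_t length, std::uint64_t offset) override {
    if (offset > data_.size() || length > data_.size() - offset) {
      throw std::out_of_range("read past end of stream");
    }
    if (length == 0) return;
    std::memcpy(buf, data_.data() + offset, length);
  }

 private:
  std::string data_;
};

}  // namespace sketch
```

A caller would start the read with readAsync(offset, length), continue processing the current batch, and call get() on the future only when the block is actually needed. An exception thrown by read() is rethrown from get().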

According to our experiment, async prefetch can significantly reduce the IO wait time on a
heavily loaded distributed file system. By carefully choosing the prefetch data block size,
we can maximize the overlap between runtime execution and data loading, and achieve a relatively
high cache hit rate (~85%).

This message was sent by Atlassian JIRA
