hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-11245) LLAP: Fix the LLAP to ORC APIs
Date Wed, 12 Aug 2015 01:22:46 GMT

    [ https://issues.apache.org/jira/browse/HIVE-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692715#comment-14692715 ]

Sergey Shelukhin commented on HIVE-11245:
-----------------------------------------

Most of the work was done in 3 sub-tasks.
1) Three groups of classes were added to the storage API.
a) DiskRange. ORC already depends on it, so it was an oversight on master that it had not been
moved to storage-api. It has been moved on the llap branch.
b) EncodedColumnBatch and MemoryBuffer. This is the same as moving VectorizedRowBatch (VRB) and
*ColumnVector, but for encoded data.
c) DataCache, Pool and Allocator APIs (the only import in any of them is MemoryBuffer, so
they are very generic). The right place to implement a format-agnostic cache, allocator, and
object pool is Hive, but input formats need to use them deep inside their core functionality,
where Hive has no insight. Therefore it makes sense to have connective interfaces.
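
To illustrate the idea behind the connective interfaces in (c), here is a minimal Java sketch. It is not the storage-api code: the interface names come from this comment, but every method signature, the HeapBuffer and SimpleCache classes, and the readRange helper are hypothetical and heavily simplified. It only shows how a reader could take its cache and allocator from the host instead of owning them.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class StorageApiSketch {
    /** Hypothetical: a handle to a chunk of memory owned by the host (Hive). */
    interface MemoryBuffer {
        ByteBuffer bytes();
    }

    /** Hypothetical: the reader asks the host for memory instead of allocating its own. */
    interface Allocator {
        MemoryBuffer allocate(int size);
        void deallocate(MemoryBuffer buffer);
    }

    /** Hypothetical: a format-agnostic cache keyed by file and offset. */
    interface DataCache {
        MemoryBuffer get(String fileKey, long offset); // returns null on a miss
        void put(String fileKey, long offset, MemoryBuffer buffer);
        Allocator allocator();
    }

    /** Trivial host-side buffer backed by a heap ByteBuffer. */
    static final class HeapBuffer implements MemoryBuffer {
        private final ByteBuffer data;
        HeapBuffer(int size) { data = ByteBuffer.allocate(size); }
        public ByteBuffer bytes() { return data; }
    }

    /** Trivial host-side cache; a real one would evict and track references. */
    static final class SimpleCache implements DataCache {
        private final Map<String, MemoryBuffer> entries = new HashMap<>();
        private final Allocator allocator = new Allocator() {
            public MemoryBuffer allocate(int size) { return new HeapBuffer(size); }
            public void deallocate(MemoryBuffer buffer) { /* heap memory is garbage collected */ }
        };
        public MemoryBuffer get(String fileKey, long offset) {
            return entries.get(fileKey + "@" + offset);
        }
        public void put(String fileKey, long offset, MemoryBuffer buffer) {
            entries.put(fileKey + "@" + offset, buffer);
        }
        public Allocator allocator() { return allocator; }
    }

    /** Reader-side logic: consult the cache first, otherwise copy the disk bytes in and cache them. */
    static MemoryBuffer readRange(DataCache cache, String fileKey, long offset, byte[] diskBytes) {
        MemoryBuffer cached = cache.get(fileKey, offset);
        if (cached != null) {
            return cached;
        }
        MemoryBuffer fresh = cache.allocator().allocate(diskBytes.length);
        fresh.bytes().put(diskBytes).flip();
        cache.put(fileKey, offset, fresh);
        return fresh;
    }
}
```

In this sketch the reader code (readRange) only sees the three interfaces, so the host can swap in an off-heap allocator or a different cache policy without touching the format code.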

2) The ....orc.encoded package was created with a fully separate path for the "record reader",
as discussed, although I don't think it was necessary. That required making some things in
RecordReaderUtils, etc. public, because Java's visibility model is stupid.
The package contains 9 files, most of which are very small.
* EncodedOrcFile - equivalent to OrcFile, static factory for Reader.
* Reader - interface, equivalent to orc.Reader, produces EncodedReader.
* EncodedReader - interface, equivalent to RecordReader (although not in signatures), for
reading encoded data.
* Consumer - interface used in EncodedReader call to return data asynchronously (logically,
a queue for returned data with "done" and "error" markers).
* OrcBatchKey, OrcCacheKey - simple data structures to use as keys when passing data and for the cache.
* ReaderImpl - equivalent to orc.ReaderImpl, the Reader interface implementation.
* EncodedReaderImpl - equivalent to RecordReaderImpl (although not in signatures), main class
that contains the code. Package-private, so it's not even visible.
* CacheChunk - part of EncodedReaderImpl that has to be visible for tests, so it's in a separate
file.
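
The Consumer description above ("a queue for returned data with done and error markers") can be sketched as follows. This is an illustration only: the method names consumeData, setDone and setError, the QueueConsumer class, and the produce helper are assumptions made for the example, not the signatures in the encoded package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class ConsumerSketch {
    /** Hypothetical callback interface: data items, then exactly one of done or error. */
    interface Consumer<T> {
        void consumeData(T data);
        void setDone();
        void setError(Throwable t);
    }

    /** Wrapper that marks a queue entry as a reported error rather than data. */
    private static final class Failure {
        final Throwable cause;
        Failure(Throwable cause) { this.cause = cause; }
    }

    /** Queue-backed consumer: the producer thread enqueues, the caller drains. */
    static final class QueueConsumer<T> implements Consumer<T> {
        private static final Object DONE = new Object();
        private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

        public void consumeData(T data) { queue.add(data); }
        public void setDone() { queue.add(DONE); }
        public void setError(Throwable t) { queue.add(new Failure(t)); }

        /** Blocks until the done marker arrives; rethrows a reported error. */
        @SuppressWarnings("unchecked")
        List<T> drain() {
            List<T> received = new ArrayList<>();
            while (true) {
                Object next;
                try {
                    next = queue.take();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for data", e);
                }
                if (next == DONE) {
                    return received;
                }
                if (next instanceof Failure) {
                    throw new IllegalStateException("read failed", ((Failure) next).cause);
                }
                received.add((T) next);
            }
        }
    }

    /** Stand-in for an asynchronous reader: pushes batches from another thread. */
    static void produce(Consumer<String> consumer, String... batches) {
        Thread worker = new Thread(() -> {
            try {
                for (String batch : batches) {
                    consumer.consumeData(batch);
                }
                consumer.setDone();
            } catch (RuntimeException e) {
                consumer.setError(e);
            }
        });
        worker.start();
    }
}
```

A caller would pass a QueueConsumer to the reader call and then drain it, receiving batches in the order they were produced.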

3) The remaining item is moving the TreeReader bits that depend on the orc.encoded package into
the encoded package. Either I or [~prasanth_j] can do this.

> LLAP: Fix the LLAP to ORC APIs
> ------------------------------
>
>                 Key: HIVE-11245
>                 URL: https://issues.apache.org/jira/browse/HIVE-11245
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Owen O'Malley
>            Assignee: Sergey Shelukhin
>            Priority: Blocker
>
> Currently the LLAP branch has refactored the ORC code to have different code paths depending
> on whether the data is coming from the cache or a FileSystem.
> We need to introduce a concept of a DataSource that is responsible for getting the necessary
> bytes regardless of whether they are coming from a FileSystem, in memory cache, or both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
