hive-issues mailing list archives

From "Matt McCline (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way
Date Tue, 29 Nov 2016 07:41:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704543#comment-15704543 ]

Matt McCline commented on HIVE-15147:
-------------------------------------

#1 Ok, we can take the core logic of VectorMapOperator that handles vectorizing any input
format and make it available as a shared class.

#2 I bumped into the LLAP stage being after vectorization when I was trying to understand
anomalies in EXPLAIN VECTORIZATION.  I thought Vectorization was the last stage -- but it
is not.  Also, the Vectorizer is a little dangerous because it starts modifying the operators
while there is still a small chance, even after the Vectorizer validation stage, that the vertex
cannot be vectorized.  So it might make sense to build the vectorized operator tree separately from
the original operator input tree.  I suspect the reason that wasn't done originally is that all
the OperatorDesc objects would need to be clone-able.

If the vectorized operator tree were separate, you could change your mind later and fall back
to the original tree.  I'm not sure this is a good idea.
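
To make the "separate tree" idea concrete, here is a minimal sketch in Python rather than Hive's Java. Everything in it is hypothetical: Operator stands in for a Hive operator together with its OperatorDesc, and vectorize_in_place and try_vectorize are illustrative names, not Vectorizer methods. It only shows the shape of the approach: clone first, mutate the clone, and keep the original if vectorization fails late.

```python
import copy
from dataclasses import dataclass, field
from typing import List


class CannotVectorize(Exception):
    """Raised when an operator turns out not to be vectorizable."""


@dataclass
class Operator:
    # Hypothetical stand-in for a Hive operator plus its OperatorDesc.
    name: str
    vectorizable: bool = True
    vectorized: bool = False
    children: List["Operator"] = field(default_factory=list)


def vectorize_in_place(op: Operator) -> None:
    # Mutates the tree as it walks it, so a late failure leaves the
    # tree half-converted. This is the hazard described above.
    if not op.vectorizable:
        raise CannotVectorize(op.name)
    op.vectorized = True
    for child in op.children:
        vectorize_in_place(child)


def try_vectorize(root: Operator) -> Operator:
    # Work on a deep copy; on failure the untouched original is returned.
    # This is why every descriptor would need to be clone-able.
    candidate = copy.deepcopy(root)
    try:
        vectorize_in_place(candidate)
    except CannotVectorize:
        return root
    return candidate
```

The cost of this design is the deep copy itself, which is the clone-ability requirement mentioned above.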

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15147
>                 URL: https://issues.apache.org/jira/browse/HIVE-15147
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-15147.WIP.noout.patch
>
>
> The primary goal for the first pass is caching text files. Nothing would prevent other
formats from using the same path, in principle, although, as was originally done with ORC,
it may be better to have native caching support optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded cache that
is columnar due to ORC file structure, we will transform data into columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was compressed
with some poor compression codec, such as csv. Using the original IF and serde, as well as
an ORC writer (with some heavyweight optimizations disabled, potentially), we can "uncompress"
the csv/whatever data into its "original" ORC representation, then cache it efficiently, by
column, and also reuse a lot of the existing code.
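
As a rough illustration of the row-to-column step described above, the sketch below regroups CSV rows by column using only the Python standard library. It is not the proposed implementation: plain Python lists stand in for ORC-encoded column streams, and rows_to_columns and its parameters are hypothetical names.

```python
import csv
import io
from typing import Dict, List


def rows_to_columns(text: str, names: List[str]) -> Dict[str, List[str]]:
    """Parse row-oriented CSV text and regroup the values by column."""
    columns: Dict[str, List[str]] = {name: [] for name in names}
    for row in csv.reader(io.StringIO(text)):
        # Each parsed row contributes one value to every column list.
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns
```

Once the data is grouped this way, each column can be cached and evicted independently, which is what the row-based text layout does not allow.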
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we slice the file
horizontally, to avoid caching entire columns). As with ORC uncompressed files, the specific
offsets don't really matter as long as they are consistent between reads. The problem is that
the file offsets will actually need to be propagated to the new reader from the original inputformat.
Row counts are easier to use but there's a problem of how to actually map them to missing
ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has been evicted
or is otherwise missing, "all the columns" have to be read for the corresponding slice to
cache and read that one column. The vague plan is to handle this implicitly, similarly to
how ORC reader handles CB-RG overlaps - it will just so happen that a missing column in disk
range list to retrieve will expand the disk-range-to-read into the whole horizontal slice
of the file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is evicted, the entire
file has to be re-read. Gzipped text is a ridiculous feature, so this is by design.
> 4) In the future, it would be possible to also build some form of metadata/indexes for this
cached data to do PPD, etc. This is out of scope for now.
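
Points 2) and 3) can be sketched as a small range computation. This is a minimal Python illustration under stated assumptions, not LLAP code: slices are modeled as (start, end) file offsets, the cache as a map from slice to the set of cached column names, and ranges_to_read and its splittable flag are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Slice = Tuple[int, int]  # (start_offset, end_offset) in the source file


def ranges_to_read(
    slices: List[Slice],
    cached: Dict[Slice, Set[str]],
    wanted: Set[str],
    splittable: bool = True,
) -> List[Slice]:
    """Return the file ranges that must be re-read from disk."""
    missing = []
    for s in slices:
        have = cached.get(s, set())
        # Row-based data cannot be read one column at a time, so a single
        # missing column forces a read of the whole horizontal slice.
        if not wanted <= have:
            missing.append(s)
    if missing and not splittable:
        # E.g. gzipped text: any miss means re-reading the entire file.
        return [(slices[0][0], slices[-1][1])]
    return missing
```

For example, with slices (0, 100), (100, 200), (200, 300) and only column "a" cached for the middle slice, a request for columns "a" and "b" has to re-read all of (100, 200), not just the bytes belonging to "b".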



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
