hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way
Date Tue, 29 Nov 2016 02:54:59 GMT

     [ https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-15147:
------------------------------------
    Attachment: HIVE-15147.WIP.noout.patch

Very early WIP patch. It adds a test with a huge out file; I am excluding the out file for
now, since it is 2 MB and will change before commit.

This contains the requisite infrastructure and the basic pipeline that seems to work. Main
remaining items:
1) Wire up the actual cache instead of just the allocator, and re-enable refcount usage.
2) Unexpected problem - figure out what to do with vectorizability. The problem is that right
now we can vectorize the pipeline for any random InputFormat/serde only as long as we can
run it in LLAP with this change; but we only decide to run in LLAP if the pipeline is vectorized.
This creates a catch-22: in the vectorizer we assume we will run in LLAP, but if for some
reason we won't, we need to go back and un-vectorize; or, if we decide on LLAP status first,
we'd have to trust the vectorizer. Perhaps we can have an LLAP pre-decider that runs before
vectorization. Alternatively, we can have a converter with the same logic as the LLAP IO change
from the operators' vantage point - so even if we cannot use LLAP IO, we can still run the
vectorized pipeline. IIRC [~mmccline] has some feature that allows one to vectorize
non-vectorizable inputs; we could just use that.
3) Decide how to split the file horizontally. Right now the entire text file is treated
as a single giant RG. File offsets would need to be exposed somehow; then we can cache data
based on those offsets and consult them before reading. For formats that support it (e.g.
text), it's trivial to trick the source IF into reading only parts of the file by faking
the splits with the intended offsets (see the sketch after this list). However, it's hard
to get the offsets; we might need to override or C/P the text input format/RR for that.
4) Optionally, go beyond what (3) would require in terms of metadata cache.
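
A minimal sketch of the fake-split trick from (3), assuming the mapred-era TextInputFormat
API; the start/length values here are hypothetical stand-ins for offsets that would come
from cache metadata, not anything in the patch:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class FakeSplitSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    Path file = new Path(args[0]);
    long start = 1048576L, length = 1048576L; // hypothetical slice boundaries

    // Fake a split that covers only the intended byte range. For text,
    // LineRecordReader skips to the first full line after 'start' (unless
    // start == 0) and stops after the first line crossing 'start + length',
    // so any consistent offsets produce a consistent row set.
    FileSplit split = new FileSplit(file, start, length, (String[]) null);
    TextInputFormat inputFormat = new TextInputFormat();
    inputFormat.configure(conf);
    RecordReader<LongWritable, Text> reader =
        inputFormat.getRecordReader(split, conf, Reporter.NULL);

    LongWritable key = reader.createKey();
    Text value = reader.createValue();
    while (reader.next(key, value)) {
      // 'key' is the byte offset of each line - exactly the kind of offset
      // we would want to cache on, but it is not exposed through the generic
      // InputFormat interface, hence the difficulty described above.
      System.out.println(key.get() + "\t" + value);
    }
    reader.close();
  }
}
{code}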

cc [~gopalv]

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15147
>                 URL: https://issues.apache.org/jira/browse/HIVE-15147
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-15147.WIP.noout.patch
>
>
> The primary goal for the first pass is caching text files. Nothing would prevent other
> formats from using the same path, in principle, although, as was originally done with ORC,
> it may be better to have native caching support optimized for each particular format.
> Given that caching pure text is not smart, and we already have an ORC-encoded cache that
> is columnar due to the ORC file structure, we will transform the data into columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was compressed
> with some poor compression codec, such as csv. Using the original IF and serde, as well as
> an ORC writer (with some heavyweight optimizations potentially disabled), we can "uncompress"
> the csv/whatever data into its "original" ORC representation, then cache it efficiently, by
> column, and also reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be determined somehow (i.e., how do we slice the file
> horizontally, to avoid caching entire columns). As with uncompressed ORC files, the specific
> offsets don't really matter as long as they are consistent between reads. The problem is
> that the file offsets will actually need to be propagated to the new reader from the
> original InputFormat. Row counts are easier to use, but then there's the problem of how to
> actually map them to the missing ranges that have to be read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has been evicted
> or is otherwise missing, "all the columns" have to be read for the corresponding slice to
> cache and read that one column. The vague plan is to handle this implicitly, similarly to
> how the ORC reader handles CB-RG overlaps - it will just so happen that a missing column in
> the disk-range list to retrieve will expand the disk range to read into the whole horizontal
> slice of the file.
> 3) Granularity etc. won't work for gzipped text. If anything at all is evicted, the entire
> file has to be re-read. Gzipped text is a ridiculous feature, so this is by design.
> 4) In the future, it would also be possible to build some form of metadata/indexes for this
> cached data to do PPD, etc. This is out of scope for now.
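
A minimal sketch of the "uncompress into ORC" idea from the description above, assuming the
old org.apache.hadoop.hive.ql.io.orc writer API; the schema, serde choice, and output path
are hypothetical placeholders, and the real change would feed the LLAP cache rather than
write out a file:

{code:java}
import java.util.Properties;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class TextToOrcSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();

    // Hypothetical two-column text table; real table properties would come
    // from the metastore, including the actual field delimiter.
    Properties tbl = new Properties();
    tbl.setProperty("columns", "a,b");
    tbl.setProperty("columns.types", "string:string");
    LazySimpleSerDe serde = new LazySimpleSerDe();
    serde.initialize(conf, tbl);

    // Read the whole file through the original IF + serde...
    Path in = new Path(args[0]);
    long len = FileSystem.get(conf).getFileStatus(in).getLen();
    TextInputFormat inputFormat = new TextInputFormat();
    inputFormat.configure(conf);
    RecordReader<LongWritable, Text> reader = inputFormat.getRecordReader(
        new FileSplit(in, 0, len, (String[]) null), conf, Reporter.NULL);

    // ...and "uncompress" it into ORC via a regular ORC writer.
    // (The patch might also disable some heavyweight ORC encodings here.)
    Writer orc = OrcFile.createWriter(new Path(args[1]),
        OrcFile.writerOptions(conf).inspector(serde.getObjectInspector()));

    LongWritable key = reader.createKey();
    Text value = reader.createValue();
    while (reader.next(key, value)) {
      orc.addRow(serde.deserialize(value));
    }
    reader.close();
    orc.close();
  }
}
{code}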


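And a tiny illustration of the offset bookkeeping behind points (1) and (2) above: cached
slices are keyed by consistent file offsets, and whatever is not in cache expands into
ranges to read from disk. All names here are hypothetical, not from the patch:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-file bookkeeping of which byte ranges are cached. */
public class SliceMapSketch {
  // start offset -> end offset of non-overlapping cached slices
  private final TreeMap<Long, Long> cached = new TreeMap<>();

  public void markCached(long start, long end) {
    cached.put(start, end);
  }

  /**
   * Returns the sub-ranges of [start, end) that are not in cache and thus
   * have to be read from disk. As noted above, the exact slice boundaries
   * don't matter as long as they are consistent between reads.
   */
  public List<long[]> missingRanges(long start, long end) {
    List<long[]> missing = new ArrayList<>();
    long pos = start;
    // A cached slice may begin at or before 'start' and still cover part
    // of the requested range.
    Map.Entry<Long, Long> floor = cached.floorEntry(start);
    if (floor != null) {
      pos = Math.max(pos, floor.getValue());
    }
    for (Map.Entry<Long, Long> e
        : cached.subMap(start, false, end, false).entrySet()) {
      if (pos >= end) {
        break;
      }
      if (e.getKey() > pos) {
        missing.add(new long[] { pos, Math.min(e.getKey(), end) });
      }
      pos = Math.max(pos, e.getValue());
    }
    if (pos < end) {
      missing.add(new long[] { pos, end });
    }
    return missing;
  }
}
{code}

Point (2) then falls out naturally: for a row-based format, a request for any column of a
slice maps to the slice's full byte range, so one missing column expands the read to the
whole horizontal slice.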

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
