hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way
Date Tue, 08 Nov 2016 01:27:58 GMT

     [ https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-15147:
------------------------------------
    Description: 
The primary goal for the first pass is caching text formats. Nothing would prevent other formats from using the same path, in principle, although, as was originally done with ORC, it may be better to have native caching support optimized for each particular format.
Given that caching pure text is not smart, and we already have an ORC-encoded cache that is columnar due to the ORC file structure, we will try to reuse that. The general idea is to treat all the data in the world as merely ORC that was compressed with some poor compression codec, such as CSV. Using the original InputFormat and SerDe, as well as the ORC writer (potentially with some heavyweight optimizations removed), we can "uncompress" the data into "original" ORC, then reuse a lot of the existing code.
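For illustration, here is a minimal sketch of the "uncompress into ORC" step, written against the stock ORC writer API. The two-column schema, the hardcoded CSV rows, and the local output path are all assumptions for the demo; the real code would feed the output of the text InputFormat/SerDe into LLAP's in-memory cache buffers rather than a file.

{code:java}
// Illustrative only: parse CSV rows and re-encode them through the standard
// ORC writer - the "uncompress the data into 'original' ORC" step above.
// Schema, input rows, and output path are assumptions for this demo.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class TextToOrcSketch {
  public static void main(String[] args) throws Exception {
    TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/cached-slice.orc"),
        OrcFile.writerOptions(new Configuration()).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector id = (LongColumnVector) batch.cols[0];
    BytesColumnVector name = (BytesColumnVector) batch.cols[1];
    String[] csvRows = { "1,alice", "2,bob" }; // stand-in for text IF+SerDe output
    for (String row : csvRows) {
      String[] fields = row.split(",");
      int r = batch.size++;
      id.vector[r] = Long.parseLong(fields[0]);
      name.setVal(r, fields[1].getBytes(StandardCharsets.UTF_8));
      if (batch.size == batch.getMaxSize()) { // flush a full batch
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size > 0) {
      writer.addRowBatch(batch); // flush the remainder
    }
    writer.close();
  }
}
{code}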
Various other points:
1) Granularity within the file will have to be determined somehow (horizontal slicing of the file, to avoid caching entire columns). We can base it on arbitrary disk offsets determined during reading, but those offsets will actually have to be propagated to the reader from the original InputFormat. Row counts are easier to use, but then there is the problem of how to actually map them to the missing ranges that have to be read from disk.
2) Obviously, for row-based formats, if any one of the needed columns is evicted, "all the columns" have to be read for the corresponding slice. The vague plan is to handle this implicitly, similarly to how the ORC reader handles CB-RG (compression buffer vs. row group) overlaps - it will just so happen that a missing column expands the disk range to read into the whole horizontal slice of the file (see the sketch after this list for points 1 and 2).
3) Granularity, etc. won't work for gzipped text, because a gzip stream is not splittable and cannot be decompressed from an arbitrary offset: if anything at all is evicted, the entire file has to be re-read. Gzipped text is a ridiculous feature, so this is by design.
4) In the future, it would also be possible to build some form of metadata/indexes for this cached data to do PPD (predicate pushdown), etc. This is out of scope for this stage.
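
To make points 1 and 2 concrete, here is a minimal sketch under assumed names (FileSlice and planDiskReads are hypothetical, not actual LLAP classes): slices are keyed by the disk-offset range the original InputFormat reported, and a miss on any needed column pushes the whole slice's byte range back onto the disk-read list.

{code:java}
// Hypothetical sketch of points 1 and 2: slices are keyed by the disk-offset
// range reported by the original InputFormat, and a miss on any one needed
// column puts the whole slice's byte range back on the disk-read list,
// because row-based data cannot be fetched one column at a time.
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class CacheLookupSketch {
  /** One horizontal slice of the file, delimited by disk offsets (point 1). */
  static final class FileSlice {
    final long startOffset;
    final long endOffset;          // exclusive end of the byte range
    final BitSet cachedColumns;    // columns that survived eviction
    FileSlice(long start, long end, BitSet cached) {
      startOffset = start; endOffset = end; cachedColumns = cached;
    }
  }

  /** Computes the byte ranges that must be (re-)read from disk. */
  static List<long[]> planDiskReads(List<FileSlice> slices, BitSet neededColumns) {
    List<long[]> toRead = new ArrayList<>();
    for (FileSlice slice : slices) {
      BitSet missing = (BitSet) neededColumns.clone();
      missing.andNot(slice.cachedColumns);
      if (!missing.isEmpty()) {
        // Point 2: one evicted column expands the read to the whole slice.
        toRead.add(new long[] { slice.startOffset, slice.endOffset });
      }
    }
    return toRead;
  }
}
{code}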


> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15147
>                 URL: https://issues.apache.org/jira/browse/HIVE-15147
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
