hive-dev mailing list archives

From "Eric Hanson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-6234) Implement fast vectorized InputFormat extension for text files
Date Fri, 24 Jan 2014 23:22:37 GMT

     [ https://issues.apache.org/jira/browse/HIVE-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Hanson updated HIVE-6234:
------------------------------

    Attachment: Vectorized Text InputFormat design.pdf
                Vectorized Text InputFormat design.docx

Attaching version 01 of the design specification for this feature.

> Implement fast vectorized InputFormat extension for text files
> --------------------------------------------------------------
>
>                 Key: HIVE-6234
>                 URL: https://issues.apache.org/jira/browse/HIVE-6234
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Eric Hanson
>            Assignee: Eric Hanson
>         Attachments: Vectorized Text InputFormat design.docx, Vectorized Text InputFormat design.pdf
>
>
> Implement support for vectorized scan input of text files (plain text with configurable
> record and field separators). This should work for CSV files, tab-delimited files, etc.
>
> The goal is to provide high-performance reading of these files using vectorized scans,
> and to do so as an extension of existing Hive. Then, if vectorized query execution is
> enabled, existing tables based on text files will benefit immediately, without the need
> to use a different input format. After upgrading to a Hive build that supports this,
> faster vectorized processing over existing text tables should just work when
> vectorization is enabled.
>
> Another goal is to go beyond simply layering a vectorized row batch iterator over
> the existing row iterator. It should be possible to, say, read a chunk of data
> into a byte buffer (several thousand or even a million rows) and then read data from it
> directly into vectorized row batches. Object creation should be minimized to save
> allocation time and GC overhead. If it is possible to save CPU for values like dates and
> numbers by caching the translation from string to the final data type, that should
> ideally be implemented.
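
To make the "byte buffer straight into row batches" idea concrete, here is a minimal sketch. It is not taken from the attached design document and does not use Hive's actual vectorization classes: VectorizedTextSketch, LongBatch, and fillBatch are hypothetical names invented for illustration. It assumes ASCII decimal integer fields, single-byte field and record separators, and no quoting, escaping, or overflow checks.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: parse delimited text bytes directly into column arrays. */
public class VectorizedTextSketch {

    /** Stand-in for a vectorized row batch: one primitive array per column. */
    public static final class LongBatch {
        public final long[][] cols; // cols[c][r] is the value of column c in row r
        public int size;            // number of rows filled by the last fillBatch call

        public LongBatch(int numCols, int capacity) {
            cols = new long[numCols][capacity];
        }
    }

    /**
     * Parses rows from buf[start, end) into the batch until the batch is full
     * or the buffer is exhausted, and returns the offset of the first
     * unconsumed byte. No String or per-row object is created; digits are
     * accumulated into a long as the bytes are scanned. The end of the buffer
     * is treated as a record terminator (an assumption of this sketch).
     */
    public static int fillBatch(byte[] buf, int start, int end,
                                byte fieldSep, byte recordSep, LongBatch batch) {
        final int numCols = batch.cols.length;
        final int capacity = batch.cols[0].length;
        batch.size = 0;
        int pos = start;
        while (pos < end && batch.size < capacity) {
            int row = batch.size;
            int col = 0;
            long value = 0;
            boolean negative = false;
            boolean rowDone = false;
            while (!rowDone) {
                // Synthesize a record separator when the buffer runs out.
                byte b = (pos >= end) ? recordSep : buf[pos++];
                if (b == fieldSep || b == recordSep) {
                    if (col < numCols) { // extra fields beyond numCols are dropped
                        batch.cols[col][row] = negative ? -value : value;
                    }
                    col++;
                    value = 0;
                    negative = false;
                    rowDone = (b == recordSep);
                } else if (b == '-') {
                    negative = true;
                } else if (b >= '0' && b <= '9') {
                    value = value * 10 + (b - '0');
                }
                // Any other byte is ignored in this sketch.
            }
            // Fields missing from a short row are filled with 0.
            for (; col < numCols; col++) {
                batch.cols[col][row] = 0;
            }
            batch.size++;
        }
        return pos;
    }

    public static void main(String[] args) {
        byte[] data = "1,-2\n30,40\n5,6".getBytes(StandardCharsets.US_ASCII);
        LongBatch batch = new LongBatch(2, 2); // 2 columns, up to 2 rows per batch
        int pos = 0;
        while (pos < data.length) {
            pos = fillBatch(data, pos, data.length, (byte) ',', (byte) '\n', batch);
            for (int r = 0; r < batch.size; r++) {
                System.out.println(batch.cols[0][r] + "\t" + batch.cols[1][r]);
            }
        }
    }
}
```

The batch is allocated once and refilled on every call, so steady-state scanning allocates nothing per row. A real implementation would also need null handling, other data types, and records that straddle buffer boundaries, none of which this sketch attempts.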



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
