hive-dev mailing list archives

From "Eric Hanson (JIRA)" <>
Subject [jira] [Created] (HIVE-6234) Implement fast vectorized InputFormat extension for text files
Date Mon, 20 Jan 2014 22:01:21 GMT
Eric Hanson created HIVE-6234:

             Summary: Implement fast vectorized InputFormat extension for text files
                 Key: HIVE-6234
             Project: Hive
          Issue Type: Sub-task
            Reporter: Eric Hanson
            Assignee: Eric Hanson

Implement support for vectorized scan input of text files (plain text with configurable record
and field separators). This should work for CSV files, tab-delimited files, etc.

The goal is to provide high-performance reading of these files using vectorized scans, and
to do so as an extension of existing Hive. Then, if vectorized query execution is enabled, existing
tables based on text files will benefit immediately, without the need to switch to a different
input format.

Another goal is to go beyond simply layering a vectorized row batch iterator on top
of the existing row iterator. It should be possible to, say, read a chunk of data into a byte
buffer (several thousand or even a million rows), and then read data from it into vectorized
row batches directly. Object creation should be minimized to save allocation time and GC
overhead. If it is possible to save CPU for values like dates and numbers by caching the translation
from string to the final data type, that should ideally be implemented.
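
The idea of parsing a byte buffer straight into column batches could be sketched roughly as follows. This is only an illustration under stated assumptions, not Hive code: TextBatchSketch, LongBatch and fillBatch are hypothetical names, every column is assumed to be an integer, the input is assumed to be well-formed ASCII, and each chunk is assumed to end on a record boundary.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; none of these names are part of Hive. */
public class TextBatchSketch {

    /** Minimal stand-in for a row batch: one primitive long array per column. */
    public static final class LongBatch {
        public final long[][] cols; // cols[column][row]
        public int size;            // number of valid rows in this batch

        public LongBatch(int numCols, int capacity) {
            cols = new long[numCols][capacity];
        }
    }

    /**
     * Parses records from buf[offset, end) directly into the batch, without
     * creating a String or any other object per row or per field.
     * Stops when the batch is full or the input is exhausted and returns the
     * offset of the first unconsumed byte.
     * Assumes well-formed input: decimal digits with an optional leading '-'.
     */
    public static int fillBatch(byte[] buf, int offset, int end,
                                byte fieldSep, byte recordSep, LongBatch batch) {
        batch.size = 0;
        int capacity = batch.cols[0].length;
        int pos = offset;
        while (pos < end && batch.size < capacity) {
            int col = 0;
            long value = 0;
            boolean negative = false;
            // Scan one record byte by byte, accumulating the current field.
            while (pos < end && buf[pos] != recordSep) {
                byte b = buf[pos++];
                if (b == fieldSep) {
                    store(batch, col++, negative ? -value : value);
                    value = 0;
                    negative = false;
                } else if (b == '-') {
                    negative = true;
                } else {
                    value = value * 10 + (b - '0');
                }
            }
            store(batch, col, negative ? -value : value); // last field of the record
            if (pos < end) {
                pos++; // skip the record separator
            }
            batch.size++;
        }
        return pos;
    }

    /** Writes a value into the current row; fields beyond the known columns are ignored. */
    private static void store(LongBatch batch, int col, long value) {
        if (col < batch.cols.length) {
            batch.cols[col][batch.size] = value;
        }
    }

    public static void main(String[] args) {
        byte[] data = "1,10\n2,-20\n3,30\n".getBytes(StandardCharsets.US_ASCII);
        LongBatch batch = new LongBatch(2, 2); // reused across calls
        int pos = 0;
        while (pos < data.length) {
            pos = fillBatch(data, pos, data.length, (byte) ',', (byte) '\n', batch);
            for (int r = 0; r < batch.size; r++) {
                System.out.println(batch.cols[0][r] + "\t" + batch.cols[1][r]);
            }
        }
    }
}
```

The batch and its arrays are allocated once and reused for every call, so the per-row cost is limited to the byte scan itself. Caching of string-to-type translations for dates and numbers is not shown in this sketch.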

This message was sent by Atlassian JIRA
