Date: Mon, 20 Jan 2014 22:01:21 +0000 (UTC)
From: "Eric Hanson (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Created] (HIVE-6234) Implement fast vectorized InputFormat extension for text files

Eric Hanson created HIVE-6234:
---------------------------------

             Summary: Implement fast vectorized InputFormat extension for text files
                 Key: HIVE-6234
                 URL: https://issues.apache.org/jira/browse/HIVE-6234
             Project: Hive
          Issue Type: Sub-task
            Reporter: Eric Hanson
            Assignee: Eric Hanson

Implement support for vectorized scan input of text files (plain text with configurable record and field separators). This should work for CSV files, tab-delimited files, etc.

The goal is to provide high-performance reading of these files using vectorized scans, and to do so as an extension of existing Hive. Then, if vectorized query execution is enabled, existing tables based on text files will benefit immediately, without the need to switch to a different input format.

Another goal is to go beyond simply layering a vectorized row batch iterator on top of the existing row iterator. It should be possible to, say, read a chunk of data into a byte buffer (several thousand or even a million rows), and then read data from it into vectorized row batches directly. Object creation should be minimized to save allocation time and GC overhead. If it is possible to save CPU for values like dates and numbers by caching the translation from string to the final data type, that should ideally be implemented.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
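
To make the "byte buffer straight into row batches" idea above concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: `TextBatchSketch`, `LongBatch`, and `fill` are hypothetical names, not Hive classes, and the sketch assumes every field is a well-formed decimal integer and every record ends with the record separator. It parses bytes in place and writes into primitive column arrays, so no per-row or per-field objects are allocated.

```java
/** Hypothetical illustration only; this is not Hive's vectorized batch API. */
final class TextBatchSketch {

    /** Tiny stand-in for a vectorized batch: one primitive array per column. */
    static final class LongBatch {
        final long[][] cols; // cols[c][r] holds the value of column c in row r
        int size;            // number of complete rows currently in the batch

        LongBatch(int numCols, int capacity) {
            cols = new long[numCols][capacity];
        }
    }

    /**
     * Parses delimited integers from buf[start, end) directly into the batch.
     * Stops when the batch is full or the remaining bytes hold only a partial
     * record. Returns the offset of the first unconsumed byte so the caller
     * can resume from there with the next batch or the next chunk.
     */
    static int fill(byte[] buf, int start, int end,
                    byte fieldSep, byte recordSep, LongBatch batch) {
        int numCols = batch.cols.length;
        int capacity = batch.cols[0].length;
        int pos = start;
        batch.size = 0;
        while (pos < end && batch.size < capacity) {
            int row = batch.size;
            int col = 0;
            long value = 0;
            boolean negative = false;
            boolean complete = false;
            int p = pos;
            for (; p < end; p++) {
                byte b = buf[p];
                if (b == fieldSep || b == recordSep) {
                    // End of a field: store it (extra fields are ignored).
                    if (col < numCols) {
                        batch.cols[col][row] = negative ? -value : value;
                    }
                    col++;
                    value = 0;
                    negative = false;
                    if (b == recordSep) {
                        complete = true;
                        p++; // step past the record separator
                        break;
                    }
                } else if (b == '-') {
                    negative = true;
                } else {
                    // Assumption: the byte is an ASCII digit.
                    value = value * 10 + (b - '0');
                }
            }
            if (!complete) {
                break; // partial record at the end of the buffer; leave it unconsumed
            }
            // Zero-fill columns missing from a short record.
            for (int c = col; c < numCols; c++) {
                batch.cols[c][row] = 0;
            }
            batch.size++;
            pos = p;
        }
        return pos;
    }
}
```

A caller would read a large chunk of the file into `buf`, call `fill` repeatedly until it stops making progress, then carry the unconsumed tail over to the next chunk. A real implementation would also need null handling, other column types, and the string-to-type caching mentioned above; none of that is attempted here.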