Date: Fri, 10 Apr 2015 00:22:14 +0000 (UTC)
From: "Steven Phillips (JIRA)"
To: dev@drill.apache.org
Subject: [jira] [Created] (DRILL-2743) Parquet file metadata caching

Steven Phillips created DRILL-2743:
--------------------------------------

             Summary: Parquet file metadata caching
                 Key: DRILL-2743
                 URL: https://issues.apache.org/jira/browse/DRILL-2743
             Project: Apache Drill
          Issue Type: New Feature
          Components: Storage - Parquet
            Reporter: Steven Phillips
            Assignee: Steven Phillips

To run a query against parquet files, we first have to recursively search the directory tree for all of the files, get the block locations for each file, and read the footer from each file. All of this is done during the planning phase. When there are many files, this can result in a very large delay in running the query, and it does not scale.
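
For illustration only, the planning-time work described above amounts to roughly the following. This is a Python sketch, not Drill's code: `plan_scan` and `read_footer_bytes` are hypothetical names, files are assumed to end in `.parquet`, and the block-location lookup is left out because it depends on the file system. The one thing it relies on from the Parquet format is that a file ends with the footer, then a 4-byte little-endian footer length, then the magic bytes `PAR1`.

```python
import os
import struct

MAGIC = b"PAR1"  # Parquet files begin and end with this 4-byte marker


def read_footer_bytes(path):
    """Return the raw (still Thrift-encoded) footer of one Parquet file."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)  # trailer: 4-byte footer length + magic
        trailer = f.read(8)
        if trailer[4:] != MAGIC:
            raise ValueError(f"{path} is not a Parquet file")
        (footer_len,) = struct.unpack("<I", trailer[:4])
        f.seek(-(8 + footer_len), os.SEEK_END)
        return f.read(footer_len)


def plan_scan(root):
    """Hypothetical planner step: walk the tree and read every footer in turn."""
    footers = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".parquet"):  # assumed naming convention
                path = os.path.join(dirpath, name)
                footers[path] = read_footer_bytes(path)
    return footers
```

The cost is one open, two seeks, and one read per file, all performed serially before any execution starts, which is why the delay grows with the number of files.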

However, there isn't really any need to read the footers during planning. If we instead treat each parquet file as a single work unit, all we need to know are the block locations for the file, the number of rows, and the columns. We should store only the information which we need for planning in a file located in the top directory for a given parquet table, and then we can delay reading of the footers until execution time, which can be done in parallel.
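
A minimal sketch of what such a planning-only metadata file might look like, written in Python for brevity. Everything here is an assumption made for illustration: the issue does not specify a file name, a format, or a schema, so `CACHE_NAME`, the JSON encoding, the `FileSummary` fields, and the representation of block locations as a list of host names are all hypothetical.

```python
import json
import os
from dataclasses import dataclass, asdict
from typing import List

CACHE_NAME = ".table_metadata.json"  # hypothetical name, not given in the issue


@dataclass
class FileSummary:
    path: str               # path relative to the table's top directory
    row_count: int          # number of rows in the file
    columns: List[str]      # column names in the file
    block_hosts: List[str]  # assumed form of "block locations": host names


def write_cache(table_root, summaries):
    """Store only the planning-time facts in one file at the top of the table."""
    with open(os.path.join(table_root, CACHE_NAME), "w") as f:
        json.dump([asdict(s) for s in summaries], f)


def load_cache(table_root):
    """Planner side: one small read instead of one footer read per file."""
    with open(os.path.join(table_root, CACHE_NAME)) as f:
        return [FileSummary(**entry) for entry in json.load(f)]
```

With a file like this in place, the planner would only call something like `load_cache`, and each execution fragment would read the footer of the file it was assigned.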