Date: Tue, 11 Aug 2015 16:52:45 +0000 (UTC)
From: "Hari Sekhon (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Commented] (DRILL-3625) Dynamic Format Detection in DFS backend for unmapped file extensions / files without extensions

    [ https://issues.apache.org/jira/browse/DRILL-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682071#comment-14682071 ]

Hari Sekhon commented on DRILL-3625:
------------------------------------

That would create the same problem I mentioned with mapping the file extension in the DFS configuration: it's not generic. What if one log file is json and another is csv?
Dynamic Format Detection is still needed when traversing filesystems like this, since in the real world there will almost certainly be a mix of formats, and marking everything as json under the dfs workspace is not flexible.

> Dynamic Format Detection in DFS backend for unmapped file extensions / files without extensions
> -----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-3625
>                 URL: https://issues.apache.org/jira/browse/DRILL-3625
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - JSON, Storage - Other, Storage - Parquet, Storage - Text & CSV
>    Affects Versions: 1.1.0
>            Reporter: Hari Sekhon
>            Assignee: Steven Phillips
>
> When querying a json file that doesn't have a ".json" extension, such as ".log", I get this exception:
> {code}0: jdbc:drill:zk=local> select * from dfs.down.`auditOut.log` limit 1;
> Aug 11, 2015 4:01:38 PM org.apache.calcite.sql.validate.SqlValidatorException
> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'dfs.down.auditOut.log' not found
> Aug 11, 2015 4:01:38 PM org.apache.calcite.runtime.CalciteException
> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 15 to line 1, column 17: Table 'dfs.down.auditOut.log' not found
> Error: PARSE ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs.down.auditOut.log' not found
> [Error Id: 5610210b-3eb2-497f-9443-c725b29733b6 on :31010] (state=,code=0)
> {code}
> However, when the file is renamed to have a .json extension, the query succeeds.
> While I could reconfigure the DFS plugin to map all files with a .log extension to json, this doesn't seem like the right thing to do. I could of course rename the file to have a .json extension, which is the better option, but this raises another question: why doesn't this just work as-is?
> Hence I'd like to raise this as a feature request: when an unmapped extension or a file without any extension is encountered, Drill should do a few quick checks on the file type and then use the appropriate storage backend for the file.
> Adding this "Dynamic Format Detection", as I have dubbed it, would tie in nicely with Drill's style and existing features like the dynamic schema detection already used for json.
> This may also come in handy for dealing with outputs from MapReduce jobs, where the files may be named part-m-NNNNN or part-r-NNNNN without any extension. If those files were text, for example, the text storage backend could be invoked on them immediately in Drill.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
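
For illustration only, the "few quick checks" proposed in the issue might look something like the Python sketch below. This is not Drill code and not how Drill implements anything: the function name `sniff_format`, the returned format labels, and the heuristics are all assumptions made for this example. It looks at the first bytes of a file, checks for the Parquet magic bytes, then for a parsable leading JSON value, then for a consistent delimiter count per line, and otherwise falls back to plain text.

```python
import json

PARQUET_MAGIC = b"PAR1"  # Parquet files begin (and end) with these four bytes

# Hypothetical mapping from delimiter to a format label; names are illustrative.
DELIMITERS = {",": "csv", "\t": "tsv", "|": "psv"}


def sniff_format(path, sample_size=4096):
    """Guess a format label from the first bytes of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(sample_size)

    if head.startswith(PARQUET_MAGIC):
        return "parquet"
    if b"\x00" in head:
        # NUL bytes suggest a binary format this sketch does not recognise.
        return "unknown"

    # errors="replace" tolerates a multi-byte character cut off by the sample.
    text = head.decode("utf-8", errors="replace")

    stripped = text.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            # Parse only the first JSON value; later records may follow it.
            json.JSONDecoder().raw_decode(stripped)
            return "json"
        except ValueError:
            pass  # not valid JSON within the sample; keep checking

    lines = [line for line in text.splitlines() if line]
    if len(head) == sample_size and lines:
        lines = lines[:-1]  # the last line may be truncated by the sample

    for delimiter, label in DELIMITERS.items():
        counts = {line.count(delimiter) for line in lines}
        # Same non-zero delimiter count on every sampled line.
        if len(counts) == 1 and counts.pop() > 0:
            return label

    return "text"
```

A caller could use the returned label to pick a reader for files such as `auditOut.log` or `part-m-NNNNN`. The delimiter check is deliberately naive (quoted fields containing the delimiter would defeat it), so a real implementation would need more robust checks and a fallback when detection is ambiguous.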