drill-issues mailing list archives

From "Hari Sekhon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3625) Dynamic Format Detection in DFS backend for unmapped file extensions / files without extensions
Date Tue, 11 Aug 2015 16:52:45 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682071#comment-14682071 ]

Hari Sekhon commented on DRILL-3625:
------------------------------------

That would create the same problem as mapping the file extension in the DFS configuration
that I mentioned: it's not generic. What if one log file is json and another is csv?

Dynamic Format Detection is still needed when traversing filesystems like this, since in
the real world there will almost certainly be a mix of formats, and marking everything
as json under the dfs workspace is not flexible.
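
For illustration, here is a minimal sketch of what such content-based detection could look like. This is not Drill's actual code or API, just an assumed approach in plain Java; the class and method names (FormatSniffer, guessFormat) are hypothetical:

{code}
// Hypothetical sketch only -- not Drill's API. Peeks at the first bytes of a
// file and guesses which format reader it should be handed to.
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FormatSniffer {

    public static String guessFormat(String path) throws IOException {
        byte[] head = new byte[4];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            n = in.read(head);                 // read up to the first 4 bytes
        }
        if (n >= 4 && "PAR1".equals(new String(head, 0, 4, StandardCharsets.US_ASCII))) {
            return "parquet";                  // Parquet files begin with the magic bytes "PAR1"
        }
        if (n >= 1 && (head[0] == '{' || head[0] == '[')) {
            return "json";                     // JSON documents typically begin with '{' or '['
        }
        return "text";                         // fall back to delimited text (csv, tsv, logs, ...)
    }
}
{code}

A probe along these lines would route a file like auditOut.log to the json reader regardless of its extension, deciding per file rather than per workspace.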

> Dynamic Format Detection in DFS backend for unmapped file extensions / files without extensions
> -----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-3625
>                 URL: https://issues.apache.org/jira/browse/DRILL-3625
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - JSON, Storage - Other, Storage - Parquet, Storage - Text & CSV
>    Affects Versions: 1.1.0
>            Reporter: Hari Sekhon
>            Assignee: Steven Phillips
>
> When querying a json file that doesn't have a ".json" extension, such as ".log", I get this exception:
> {code}0: jdbc:drill:zk=local> select * from dfs.down.`auditOut.log` limit 1;
> Aug 11, 2015 4:01:38 PM org.apache.calcite.sql.validate.SqlValidatorException <init>
> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'dfs.down.auditOut.log' not found
> Aug 11, 2015 4:01:38 PM org.apache.calcite.runtime.CalciteException <init>
> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 15 to line 1, column 17: Table 'dfs.down.auditOut.log' not found
> Error: PARSE ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs.down.auditOut.log' not found
> [Error Id: 5610210b-3eb2-497f-9443-c725b29733b6 on <host>:31010] (state=,code=0)
> {code}
> However, after renaming the file to have a .json extension, the query succeeds.
> Now, while I could reconfigure the DFS plugin to map all files with a *.log extension to json (a config sketch follows after this description), that doesn't seem like the right thing to do. I could of course rename the file to have a .json extension, which is the better option, but that highlights another question: why doesn't this just work as-is?
> Hence I'd like to raise a feature request: when an unmapped extension or a file without any extension is encountered, Drill should do a few quick checks on the file type and then use the appropriate storage backend for the file.
> Adding this "Dynamic Format Detection", as I have dubbed it, would tie in nicely with Drill's style and existing features such as the dynamic schema detection already used for json.
> This may also come in handy for dealing with output from MapReduce jobs, where files may be named part-m-NNNNN or part-r-NNNNN without any extension; if those files were text, for example, the text storage backend could be invoked on them immediately in Drill.
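
For reference, the extension-mapping workaround described above amounts to editing the json format entry of the dfs storage plugin configuration (for example via the Storage page of the Drill Web UI) so that it claims the extra extension, roughly like the fragment below. This is only a sketch of the relevant fragment, and it is exactly the kind of static, per-plugin mapping that the comment above argues is not generic enough:

{code}
"formats": {
  "json": {
    "type": "json",
    "extensions": ["json", "log"]
  }
}
{code}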



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
