drill-issues mailing list archives

From "Charles Givre (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-7233) Format Plugin for HDF5
Date Fri, 03 May 2019 12:58:00 GMT

     [ https://issues.apache.org/jira/browse/DRILL-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charles Givre updated DRILL-7233:
---------------------------------
    Labels: doc-impacting  (was: )

> Format Plugin for HDF5
> ----------------------
>
>                 Key: DRILL-7233
>                 URL: https://issues.apache.org/jira/browse/DRILL-7233
>             Project: Apache Drill
>          Issue Type: New Feature
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>              Labels: doc-impacting
>             Fix For: 1.17.0
>
>
> h2. Drill HDF5 Format Plugin
> Per Wikipedia, Hierarchical Data Format (HDF) is a set of file formats designed to store
and organize large amounts of data. Originally developed at the National Center for Supercomputing
Applications, it is supported by The HDF Group, a non-profit corporation whose mission is
to ensure continued development of HDF5 technologies and the continued accessibility of data
stored in HDF.
> This plugin enables Apache Drill to query HDF5 files.
> h3. Configuration
> There are three configuration variables in this plugin:
> type: This should be set to hdf5.
> extensions: This is a list of the file extensions used to identify HDF5 files. Typically
HDF5 uses .h5 or .hdf5 as file extensions. This defaults to .h5.
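> defaultPath: The path of the dataset within the HDF5 file that Drill should query. When it is set, Drill returns only the data at that path rather than the file metadata. Leave it as null to have Drill return the file metadata.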
> h3. Example Configuration
> For most uses, the configuration below will suffice to enable Drill to query HDF5 files.
> {{"hdf5": {
>       "type": "hdf5",
>       "extensions": [
>         "h5"
>       ],
>       "defaultPath": null
>     }}}
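> As a sketch of a variant configuration, the Python snippet below builds a format entry that registers both extensions mentioned above and sets a default dataset path. The helper name hdf5_format_config and the path /dset are illustrative assumptions; only the three keys (type, extensions, defaultPath) come from this plugin.

```python
import json


def hdf5_format_config(extensions=("h5",), default_path=None):
    """Build the "hdf5" format entry as a plain dict (hypothetical helper)."""
    return {
        "hdf5": {
            "type": "hdf5",
            "extensions": list(extensions),
            # None serializes to JSON null, matching the example configuration.
            "defaultPath": default_path,
        }
    }


# Register both common extensions and point the plugin at one dataset.
config = hdf5_format_config(extensions=("h5", "hdf5"), default_path="/dset")
print(json.dumps(config, indent=2))
```

> The printed JSON can be pasted into the formats section of a storage plugin configuration; calling the helper with no arguments reproduces the example configuration above.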
> h3. Usage
> Since HDF5 can be viewed as a file system within a file, a single file can contain many
datasets. For instance, if you have a simple HDF5 file, a star query will produce the following
result:
> {{apache drill> select * from dfs.test.`dset.h5`;
> +-------+-----------+-----------+--------------------------------------------------------------------------+
> | path  | data_type | file_name |                                 int_data                                 |
> +-------+-----------+-----------+--------------------------------------------------------------------------+
> | /dset | DATASET   | dset.h5   | [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] |
> +-------+-----------+-----------+--------------------------------------------------------------------------+}}
> The actual data in this file is mapped to a column called int_data. To access the data effectively, use Drill's FLATTEN() function on the int_data column, which produces one row per inner array:
> {{apache drill> select flatten(int_data) as int_data from dfs.test.`dset.h5`;
> +---------------------+
> |      int_data       |
> +---------------------+
> | [1,2,3,4,5,6]       |
> | [7,8,9,10,11,12]    |
> | [13,14,15,16,17,18] |
> | [19,20,21,22,23,24] |
> +---------------------+}}
> Once you have the data in this form, you can access it similarly to how you might access
nested data in JSON or other files.
> {{apache drill> SELECT int_data[0] as col_0,
> . .semicolon> int_data[1] as col_1,
> . .semicolon> int_data[2] as col_2
> . .semicolon> FROM ( SELECT flatten(int_data) AS int_data
> . . . . . .)> FROM dfs.test.`dset.h5`
> . . . . . .)> );
> +-------+-------+-------+
> | col_0 | col_1 | col_2 |
> +-------+-------+-------+
> | 1     | 2     | 3     |
> | 7     | 8     | 9     |
> | 13    | 14    | 15    |
> | 19    | 20    | 21    |
> +-------+-------+-------+}}
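> To make the two steps above concrete, here is a plain-Python sketch of what the queries do to the sample data. It is not Drill code: the flatten helper below only mimics the effect of FLATTEN() on a single array column, and the variable names are illustrative.

```python
# The 4x6 dataset from dset.h5, as it appears in the int_data column.
int_data = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]


def flatten(column_values):
    """Yield one output row per element of each array in the column."""
    for array in column_values:
        for element in array:
            yield element


# The star query returns a single row whose int_data cell holds the whole array.
table = [int_data]

# Step 1: FLATTEN(int_data) turns that one row into four rows of six values.
flattened = list(flatten(table))

# Step 2: int_data[0], int_data[1], int_data[2] pick the first three values.
projected = [(row[0], row[1], row[2]) for row in flattened]

for row in projected:
    print(row)
```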
> A better way to query the actual data in an HDF5 file is to set the defaultPath option. If defaultPath is defined, either at query time or in the plugin configuration, Drill returns only the data at that path rather than the file metadata.
> ** Note: Once you have determined which dataset you are querying, it is advisable to use this method to query HDF5 data. **
> You can set the defaultPath variable in either the plugin configuration, or at query
time using the table() function as shown in the example below:
> {{SELECT * 
> FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))}}
> This query will return the result below:
> {{apache drill> SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'));
> +-----------+-----------+-----------+-----------+-----------+-----------+
> | int_col_0 | int_col_1 | int_col_2 | int_col_3 | int_col_4 | int_col_5 |
> +-----------+-----------+-----------+-----------+-----------+-----------+
> | 1         | 2         | 3         | 4         | 5         | 6         |
> | 7         | 8         | 9         | 10        | 11        | 12        |
> | 13        | 14        | 15        | 16        | 17        | 18        |
> | 19        | 20        | 21        | 22        | 23        | 24        |
> +-----------+-----------+-----------+-----------+-----------+-----------+
> 4 rows selected (0.223 seconds)}}
> If the data in defaultPath is a column, the column name will be the last part of the path. If the data is multidimensional, the columns are named <data_type>_col_n, where n is the zero-based column index. For example, the second column of an integer dataset is called int_col_1.
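> The naming rule can be sketched in plain Python as follows. The function name_columns is a hypothetical illustration of the <data_type>_col_n pattern seen in the output above, not the plugin's actual implementation.

```python
def name_columns(rows, data_type):
    """Map each row of a 2-D dataset to {<data_type>_col_n: value}.

    n counts from 0, matching the int_col_0 ... int_col_5 header above.
    """
    return [
        {f"{data_type}_col_{n}": value for n, value in enumerate(row)}
        for row in rows
    ]


dset = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

records = name_columns(dset, "int")
print(list(records[0]))  # column names of the first row
print(records[1]["int_col_1"])  # second column of the second row
```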
> h3. Attributes
> Occasionally, HDF5 paths will contain attributes. Drill will map these to a map data
structure called attributes, as shown in the query below.
> {{apache drill> SELECT attributes FROM dfs.test.`browsing.h5`;
> +----------------------------------------------------------------------------------+
> |                                    attributes                                    |
> +----------------------------------------------------------------------------------+
> | {}                                                                               |
> | {"__TYPE_VARIANT__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH"}           |
> | {}                                                                               |
> | {}                                                                               |
> | {"important":false,"__TYPE_VARIANT__timestamp__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH","timestamp":1550033296762} |
> | {}                                                                               |
> | {}                                                                               |
> | {}                                                                               |
> +----------------------------------------------------------------------------------+
> 8 rows selected (0.292 seconds)}}
> You can access the individual fields within the attributes map by using the structure
table.map.key. Note that you will have to give the table an alias for this to work properly.
> {{apache drill> SELECT path, data_type, file_name  
> FROM dfs.test.`browsing.h5` AS t1 WHERE t1.attributes.important = false;
> +---------+-----------+-------------+
> |  path   | data_type |  file_name  |
> +---------+-----------+-------------+
> | /groupB | GROUP     | browsing.h5 |
> +---------+-----------+-------------+}}
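> The filter in the query above can be pictured with the plain-Python sketch below. The /groupB row and its attributes are taken from the output above; the /groupA row and the helper where_attribute_equals are hypothetical and only illustrate that rows without the key do not match.

```python
# /groupB comes from the query output; /groupA is a made-up row with no attributes.
rows = [
    {
        "path": "/groupA",
        "data_type": "GROUP",
        "file_name": "browsing.h5",
        "attributes": {},
    },
    {
        "path": "/groupB",
        "data_type": "GROUP",
        "file_name": "browsing.h5",
        "attributes": {
            "important": False,
            "__TYPE_VARIANT__timestamp__": "TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH",
            "timestamp": 1550033296762,
        },
    },
]


def where_attribute_equals(rows, key, expected):
    """Keep rows whose attributes map contains `key` with value `expected`.

    Rows that lack the key are dropped, much as a comparison against a
    missing (NULL) field is not true in SQL.
    """
    return [
        row
        for row in rows
        if key in row["attributes"] and row["attributes"][key] == expected
    ]


matches = where_attribute_equals(rows, "important", False)
print([(r["path"], r["data_type"], r["file_name"]) for r in matches])
```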
> h3. Known Limitations
> There are several limitations with the HDF5 format plugin in Drill.
> * Drill cannot read unsigned 64-bit integers. When the plugin encounters this data type, it will write an INFO message to the log.
> * Drill cannot read compressed fields in HDF5 files.
> * HDF5 files can contain nested datasets of up to n dimensions. Since Drill works best with two-dimensional data, datasets with more than two dimensions are flattened.
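> One likely reason for the first limitation (an assumption, not something stated in this issue) is that Drill's BIGINT is a signed 64-bit integer, so the upper half of the unsigned 64-bit range has no lossless representation. The Python sketch below only illustrates that range arithmetic; the helper name is hypothetical.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_in_signed_64(value):
    """Return True if `value` can be stored in a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


# Small unsigned values are fine; values above INT64_MAX are not.
for value in (24, INT64_MAX, INT64_MAX + 1, UINT64_MAX):
    print(value, fits_in_signed_64(value))
```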



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)