db-ddlutils-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dave Oshinsky <doshin...@commvault.com>
Subject recommended way to use DECIMAL in Parquet
Date Tue, 12 Jan 2016 23:20:45 GMT
I've been experimenting with transferring information from Oracle RDBMS to Parquet files, and
then querying the Parquet using Drill.  Oracle tables often contain a lot of DECIMAL information,
since that seems to be a very common way to represent numbers there.  My initial approach
was to archive the DECIMAL values to variable length "bytes" in the Parquet.  However, I found
that Drill does  not expect to see DECIMAL values as "bytes" (variable length), but rather
as one of the fixed length types.  See https://issues.apache.org/jira/browse/DRILL-4184 for
more info on this.

I then tried archiving DECIMAL values to fixed length byte arrays in Parquet (rather than
variable length ones), with the length being the full precision (size) of each DECIMAL field.
 This increases my Parquet file sizes by about 10%, but now Drill queries with a where clause
on the DECIMAL values work, but with some oddities as shown here:

0: jdbc:drill:zk=local> select acct_no, curbal from dfs.`c:/dao/ acct2M_after.parquet`
where acct_no = 70000008.0;
+-----------+----------+
|  acct_no  |  curbal  |
+-----------+----------+
| 70000008  | 7000.00  |
+-----------+----------+
1 row selected (25.411 seconds)
0: jdbc:drill:zk=local> select acct_no, curbal from dfs.`c:/dao/acct2M_after.parquet` where
acct_no = 70000008;
+----------+---------+
| acct_no  | curbal  |
+----------+---------+
+----------+---------+
No rows selected (24.331 seconds)
0: jdbc:drill:zk=local> select acct_no, curbal from dfs.`c:/dao/acct2M_after.parquet` where
cast(acct_no AS DECIMAL) = 70000008;
+-----------+----------+
|  acct_no  |  curbal  |
+-----------+----------+
| 70000008  | 7000.00  |
+-----------+----------+
1 row selected (25.742 seconds)
0: jdbc:drill:zk=local>

Note how the query only finds the row if the value in the where clause has ".0" appended,
or the value is explicitly casted to DECIMAL.

Two questions:

1)      Are there any plans to allow for the more efficient storage of DECIMAL as "bytes"
(rather than fixed length byte arrays, at the full/maximum precision)?

2)      Is it expected that the ".0" must be appended in the where clause (or an explicit
cast applied to DECIMAL), or should I create a JIRA for that?  Intuitively speaking, a number
is a number (whether it's decimal or not) so this behavior was a bit unexpected to me.

Thanks,
Dave Oshinsky





***************************Legal Disclaimer***************************
"This communication may contain confidential and privileged material for the
sole use of the intended recipient. Any unauthorized review, use or distribution
by others is strictly prohibited. If you have received the message by mistake,
please advise the sender by reply email and delete the message. Thank you."
**********************************************************************
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message