impala-dev mailing list archives

From "Yuanhao Luo" <luoyuan...@software.ict.ac.cn>
Subject Re: Re: IMPALA-2428 Support multiple-character string as the field delimiter
Date Mon, 25 Jul 2016 14:30:57 GMT
Hello, Jim Apple:
I can't find any tests in testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
for the case where the escape character has the same value as the field delimiter, or for the
case where it has the same value as the tuple delimiter.
I ran some tests on branch cdh5-trunk (commit id: 50a7ba059). The logs below show that, even
though we already emit the warnings "WARNINGS: Field delimiter and escape character have same
value. Escape character will be ignored" and "WARNINGS: Line delimiter and escape character
have same value: . Escape character will be ignored" for these two corner cases, the code
doesn't work as expected.


Take a detailed look at the logs below:


Escape character has the same value as the field delimiter
[nobida147:21000] > create table text_at_at_newline(col1 string, col2 string, col3 int,
col4 int) row format delimited fields terminated by '@' escaped by '@' lines terminated by
'\n';
Query: create table text_at_at_newline(col1 string, col2 string, col3 int, col4 int) row format
delimited fields terminated by '@' escaped by '@' lines terminated by '\n'
Query submitted at: 2016-07-25 18:31:06 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=cf4828e76dfa6124:96ccf769adaf0898
++
||
++
++
WARNINGS: Field delimiter and escape character have same value: byte 64. Escape character
will be ignored
Fetched 0 row(s) in 0.12s
[nobida147:21000] > load data inpath '/user/root/text-at-at-newline.txt' into table text_at_at_newline;
Query: load data inpath '/user/root/text-at-at-newline.txt' into table text_at_at_newline
Query submitted at: 2016-07-25 18:31:22 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=2c427e992f6ae0c5:9efc46932708ee94
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Fetched 1 row(s) in 4.59s
[nobida147:21000] > select * from text_at_at_newline;
Query: select * from text_at_at_newline
Query submitted at: 2016-07-25 18:31:33 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=534103c6ffd694af:163ec3646650dc87
+------+------+------+------+
| col1 | col2 | col3 | col4 |
+------+------+------+------+
| one  | two  | 3    | 4    |
| one  | ,one | NULL | 3    |
| one  | \    | NULL | 3    |
| one  | \    | NULL | NULL |
| one  | \    | NULL | NULL |
+------+------+------+------+
WARNINGS: Error converting column: 2 TO INT (Data is: two)
file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_at_at_newline/text-at-at-newline.txt
record: one@,one@two@3@4
Error converting column: 2 TO INT (Data is: two)
file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_at_at_newline/text-at-at-newline.txt
record: one@\@two@3@4
Error converting column: 2 TO INT (Data is: ,one)
Error converting column: 3 TO INT (Data is: two)
file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_at_at_newline/text-at-at-newline.txt
record: one@\@,one@two@3@4
Error converting column: 2 TO INT (Data is: \)
Error converting column: 3 TO INT (Data is: two)
file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_at_at_newline/text-at-at-newline.txt
record: one@\@\@two@3@4
Fetched 5 row(s) in 0.44s
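
The four records printed in the warnings above can be replayed with a minimal Python sketch. This is not Impala's scanner: `parse_record` and `to_int` are hypothetical helpers, and the sketch assumes the escape character is simply ignored (as the warning states), that each record is split on '@' only, that fields beyond the fourth are dropped, and that a failed INT conversion yields NULL (None here).

```python
def to_int(text):
    """Return int(text), or None (NULL) when text is missing or not an integer."""
    if text is None:
        return None
    try:
        return int(text)
    except ValueError:
        return None


def parse_record(record, field_delim="@", num_cols=4):
    """Split on the field delimiter only; the escape character is ignored."""
    fields = record.split(field_delim)[:num_cols]
    # Pad short records so that missing columns become NULL.
    fields += [None] * (num_cols - len(fields))
    col1, col2, col3, col4 = fields
    # col1 and col2 are STRING columns, col3 and col4 are INT columns.
    return (col1, col2, to_int(col3), to_int(col4))


# The records exactly as they appear in the WARNINGS output above.
records = [
    r"one@,one@two@3@4",
    r"one@\@two@3@4",
    r"one@\@,one@two@3@4",
    r"one@\@\@two@3@4",
]
rows = [parse_record(r) for r in records]
for row in rows:
    print(row)
```

Under those assumptions the sketch yields the same values as the last four result rows, so the output shown is what a scan that ignores the escape character would produce.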
Escape character has the same value as the tuple delimiter
[nobida147:21000] > create table text_comma_backslash_backslash(col1 string, col2 string,
col3 int, col4 int) row format delimited fields terminated by ',' escaped by '\\' lines terminated
by '\\';
Query: create table text_comma_backslash_backslash(col1 string, col2 string, col3 int, col4
int) row format delimited fields terminated by ',' escaped by '\\' lines terminated by '\\'
Query submitted at: 2016-07-25 18:39:08 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=f9482b9325a29355:751546a6b3640ebf
++
||
++
++
WARNINGS: Line delimiter and escape character have same value: byte 92. Escape character will
be ignored
Fetched 0 row(s) in 0.12s
[nobida147:21000] > load data inpath '/user/root/text-comma-backslash-backslash.txt' into
table text_comma_backslash_backslash;
Query: load data inpath '/user/root/text-comma-backslash-backslash.txt' into table text_comma_backslash_backslash
Query submitted at: 2016-07-25 18:39:42 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=a442f76958434721:4ddd2cdc78e3bbbd
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Fetched 1 row(s) in 4.09s
[nobida147:21000] > select * from text_comma_backslash_backslash;
Query: select * from text_comma_backslash_backslash
Query submitted at: 2016-07-25 18:39:58 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=6e41d94e5e4458f3:d2482d108394d6a0
+------+------+------+------+
| col1 | col2 | col3 | col4 |
+------+------+------+------+
| one  | two  | 3    | 4    |
| one  | NULL | NULL | NULL |
|      | one  | NULL | 3    |
|      | NULL | NULL | NULL |
|      | two  | 3    | NULL |
|      | NULL | NULL | NULL |
|      | NULL | NULL | NULL |
|      | one  | NULL | 3    |
|      | NULL | NULL | NULL |
|      | NULL | NULL | NULL |
|      | NULL | NULL | NULL |
|      | two  | 3    | 4    |
+------+------+------+------+
WARNINGS: Error converting column: 2 TO INT (Data is: two)
file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_comma_backslash_backslash/text-comma-backslash-backslash.txt
record: ,one,two,3,4one
Error converting column: 3 TO INT (Data is: 4one)
file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_comma_backslash_backslash/text-comma-backslash-backslash.txt
record: ,two,3,4one
Error converting column: 2 TO INT (Data is: two)
file: hdfs://localhost:20500/test-warehouse/single_byte_test1.db/text_comma_backslash_backslash/text-comma-backslash-backslash.txt
record: ,one,two,3,4one
Fetched 12 row(s) in 0.44s
These corner cases are somewhat difficult for me to fix, so in my next patch I'm going to
tighten the restrictions as follows:

1. Delimiters can't be an empty string.
2. The tuple delimiter can't be the first byte of the field delimiter.
3. The escape char can't be the first byte of the field delimiter.
4. The escape char and the tuple delimiter can't be the same.
5. Delimiters can't contain '\0'.

What do you think? Looking forward to your reply.
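
The five proposed restrictions can be sketched as a single check. This is an illustration in Python, not the patch itself: `check_delimiters` is a hypothetical name, the delimiters are modelled as byte strings, the tuple delimiter and escape char are assumed to be single bytes, and an unspecified escape char is modelled as None.

```python
from typing import List, Optional


def check_delimiters(field_delim: bytes, tuple_delim: bytes,
                     escape_char: Optional[bytes] = None) -> List[str]:
    """Return the violated rules; an empty list means the settings pass."""
    problems = []
    # Rule 1: delimiters can't be an empty string.
    if len(field_delim) == 0 or len(tuple_delim) == 0:
        problems.append("rule 1: delimiter is an empty string")
    # Rule 2: the tuple delimiter can't be the first byte of the field delimiter.
    if field_delim and tuple_delim and field_delim[:1] == tuple_delim:
        problems.append("rule 2: tuple delimiter is the first byte of field delimiter")
    # Rule 3: the escape char can't be the first byte of the field delimiter.
    if escape_char and field_delim[:1] == escape_char:
        problems.append("rule 3: escape char is the first byte of field delimiter")
    # Rule 4: the escape char and the tuple delimiter can't be the same.
    if escape_char and escape_char == tuple_delim:
        problems.append("rule 4: escape char equals tuple delimiter")
    # Rule 5: delimiters can't contain '\0'.
    if b"\0" in field_delim or b"\0" in tuple_delim:
        problems.append("rule 5: delimiter contains '\\0'")
    return problems


print(check_delimiters(b"#@#", b"\n", b"\\"))  # a multi-byte delimiter that passes
print(check_delimiters(b"@", b"\n", b"@"))     # first corner case above
print(check_delimiters(b",", b"\\", b"\\"))    # second corner case above
```

Returning a list rather than raising on the first violation is only a convenience for the sketch; it lets one call report every rule a setting breaks.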


What's more, in my tests I found that sql-parser.cup can't correctly parse Unicode and octal
escapes for extended ASCII characters (decimal values 128 to 255). For example, to set "#@#"
as the field terminator, we can use fields terminated by '\u0023\100\043', whose three escapes
refer to ASCII #, @ and # respectively, and the parse result is right. However, when I set
double thorn (the extended ASCII character with decimal value 254) as the field terminator,
for example fields terminated by '\u00fe\376', it turns into '\u00A4376' when I run 'describe
extended table'. I have already reported this issue as IMPALA-3777.
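
To make the expected result concrete, here is a small Python sketch of how the two escape forms in the examples above would decode if each escape maps to the code point it names. It is not the logic in sql-parser.cup: `decode_escapes` is a hypothetical helper that handles only \uXXXX and one-to-three-digit octal escapes.

```python
import re

# \uXXXX (four hex digits) or \ooo (one to three octal digits).
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\([0-7]{1,3})")


def decode_escapes(literal: str) -> str:
    """Replace each Unicode or octal escape with the character it names."""
    def replace(match):
        hex_digits, octal_digits = match.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(octal_digits, 8))
    return _ESCAPE.sub(replace, literal)


print(decode_escapes(r"\u0023\100\043"))                     # the "#@#" example
print([ord(c) for c in decode_escapes(r"\u00fe\376")])       # both escapes name value 254
```

Under this reading, '\u00fe\376' should decode to two characters that both have decimal value 254, which is why the '\u00A4376' output stands out.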

 




------------------ Original Message ------------------
From: "luoyuanhao";<luoyuanhao@software.ict.ac.cn>;
Sent: Monday, July 25, 2016, 10:53 AM
To: "jbapple"<jbapple@cloudera.com>;
Cc: "dev@impala"<dev@impala.incubator.apache.org>;
Subject: Re: IMPALA-2428 Support multiple-character string as the field delimiter



Hello, Jim Apple. I have tested the original delimiter settings, and the logs show the
differences between my commit and the original settings, as below:


My commit                                                  | Original setting
-----------------------------------------------------------+------------------------------------------------
Field terminators can't be an empty string.                | All terminators can't be empty. (I will enhance restriction to this in my next patch)
Tuple delimiter can't be the first byte of field delimiter | Field delimiter and line delimiter can't be the same value (So these two restrictions are actually the same one)
Escaped char can't be the first byte of field delimiter    | Warning: Escaped char will be ignored (I will relax my restriction to this in my next patch)
No restriction for escaped char and line terminator        | Warning: Escaped char will be ignored (I will add this warning in my next patch)
Terminator contains '\0'                                   | ImpalaRuntimeException (logs for detail. I add this restriction to fix this runtime exception.)



Detailed logs:
Terminator is an empty string
[nobida147:21000] > create table field_null(id int) row format delimited fields terminated
by "";
Query: create table field_null(id int) row format delimited fields terminated by ""
Query submitted at: 2016-07-25 10:20:41 (Coordinator: http://0.0.0.0:25000)
ERROR: AnalysisException: ESCAPED BY values and LINE/FIELD terminators must be specified as
a single character or as a decimal value in the range [-128:127]: 
[nobida147:21000] > create table line_null(id int) row format delimited lines terminated
by "";
Query: create table line_null(id int) row format delimited lines terminated by ""
Query submitted at: 2016-07-25 10:20:54 (Coordinator: http://0.0.0.0:25000)
ERROR: AnalysisException: ESCAPED BY values and LINE/FIELD terminators must be specified as
a single character or as a decimal value in the range [-128:127]: 
[nobida147:21000] > create table escape_null(id int) row format delimited escaped by "";
Query: create table escape_null(id int) row format delimited escaped by ""
Query submitted at: 2016-07-25 10:21:13 (Coordinator: http://0.0.0.0:25000)
ERROR: AnalysisException: ESCAPED BY values and LINE/FIELD terminators must be specified as
a single character or as a decimal value in the range [-128:127]:
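
The rule quoted in this error message can be sketched in Python as follows. This is not Impala's implementation: `parse_terminator` is a hypothetical helper, and trying the decimal form before the single-character form is an assumption, as is leaving non-ASCII single characters unhandled.

```python
import re

MESSAGE = ("ESCAPED BY values and LINE/FIELD terminators must be specified as "
           "a single character or as a decimal value in the range [-128:127]: ")


def parse_terminator(spec: str) -> int:
    """Return the byte value named by spec, or raise ValueError with MESSAGE."""
    # Assumption: a decimal value in [-128:127] is tried first.
    if re.fullmatch(r"[+-]?[0-9]+", spec):
        value = int(spec)
        if -128 <= value <= 127:
            return value
    # Otherwise accept exactly one character and use its code.
    if len(spec) == 1:
        return ord(spec)
    raise ValueError(MESSAGE + spec)


for spec in [",", "@", "\\"]:
    print(repr(spec), "-> byte", parse_terminator(spec))

try:
    parse_terminator("")
except ValueError as err:
    print("ERROR:", err)
```

The three sample characters map to the byte values 44, 64 and 92 that appear in the warnings and errors in this thread, and the empty string is rejected with the same message as in the log above.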



Field delimiter and line delimiter have same value 
[nobida147:21000] > create table line_equal_field(id int) row format delimited fields terminated
by "," lines terminated by ",";
Query: create table line_equal_field(id int) row format delimited fields terminated by ","
lines terminated by ","
Query submitted at: 2016-07-25 10:23:45 (Coordinator: http://0.0.0.0:25000)
ERROR: AnalysisException: Field delimiter and line delimiter have same value: byte 44


Field delimiter and escape character have same value
[nobida147:21000] > create table escape_equal_field(id int) row format delimited fields
terminated by "," escaped by ",";
Query: create table escape_equal_field(id int) row format delimited fields terminated by ","
escaped by ","
Query submitted at: 2016-07-25 10:22:48 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=924c6b616e183f62:7c4779a423b29d96
++
||
++
++
WARNINGS: Field delimiter and escape character have same value: byte 44. Escape character
will be ignored
Fetched 0 row(s) in 0.16s

Line delimiter and escape character have same value
[nobida147:21000] > create table escape_equal_line(id int) row format delimited escaped
by "," lines terminated by ',';
Query: create table escape_equal_line(id int) row format delimited escaped by "," lines terminated
by ','
Query submitted at: 2016-07-25 10:23:21 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=f443df31f58860bb:1c01f402050f35b3
++
||
++
++
WARNINGS: Line delimiter and escape character have same value: byte 44. Escape character will
be ignored
Fetched 0 row(s) in 0.13s


Delimiter contains '\0'
[nobida147:21000] > create table contains_zero(id int) row format delimited fields terminated
by "\0";
Query: create table contains_zero(id int) row format delimited fields terminated by "\0"
Query submitted at: 2016-07-25 10:08:39 (Coordinator: http://0.0.0.0:25000)
ERROR: 
ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore: 
CAUSED BY: MetaException: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO
"SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?) 
  at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
  at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732)
  at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752)
  at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:902)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)
  at com.sun.proxy.$Proxy0.createTable(Unknown Source)
  at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1469)
  at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1502)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:138)
  at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
  .  .  .
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
  at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:118)
  at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8":
0x00
  at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
  at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
  at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:334)
  at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:205)
  at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeUpdate(ParamLoggingPreparedStatement.java:399)
  at org.datanucleus.store.rdbms.SQLController.executeStatementUpdate(SQLController.java:439)
  at org.datanucleus.store.rdbms.scostore.JoinMapStore.internalPut(JoinMapStore.java:1069)
  ... 70 more


In conclusion, there's no difference from the current restrictions on field terminators. (In
my next patch, I will just inherit the current restrictions and add one, that delimiters can't
contain '\0', to fix the ImpalaRuntimeException shown in the log above.)


------------------ Original Message ------------------
From: "jbapple";<jbapple@cloudera.com>;
Sent: Sunday, July 24, 2016, 9:20 PM
To: "Yuanhao Luo"<luoyuanhao@software.ict.ac.cn>;
Cc: "dev@impala"<dev@impala.incubator.apache.org>;
Subject: Re: IMPALA-2428 Support multiple-character string as the field delimiter



We must be very careful about breaking changes. We may want to put
this change in Impala 3.0, rather than 2.x, if it breaks existing DDL
statements.

> Field terminator can't be an empty string

How is this different from the current restrictions on field terminators?

If field terminators can currently be empty strings, what kind of
queries or DDL statements does this break?

Do we currently have tests for those? Do we expect that many users are
using them?

These questions are also of interest to me on your three other restrictions.