airflow-commits mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-179) DbApiHook string serialization fails when string contains non-ASCII characters
Date Fri, 27 May 2016 17:59:12 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304466#comment-15304466 ]

ASF subversion and git services commented on AIRFLOW-179:
---------------------------------------------------------

Commit 87b4b8fa19cb660317198d74f6d51fdde0a7e067 in incubator-airflow's branch refs/heads/master
from [~john.bodley@gmail.com]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=87b4b8f ]

[AIRFLOW-179] DbApiHook string serialization fails when string contains non-ASCII characters

Dear Airflow Maintainers,

Please accept this PR that addresses the following issues:
- https://issues.apache.org/jira/browse/AIRFLOW-179

In addition to correctly serializing non-ASCII characters, the literal transformation also
corrects an issue with escaping single quotes (').
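
To make the single-quote behaviour concrete, the short sketch below evaluates the expression that the old `_serialize_cell` used, as it appears in the traceback further down in this message. The helper name `old_serialize` is hypothetical; only the expression inside it is taken from the traceback.

```python
def old_serialize(cell):
    # Expression copied from the traceback: double every single quote,
    # then wrap the result in single quotes.
    return "'" + str(cell).replace("'", "''") + "'"


print(old_serialize("you're"))  # prints 'you''re'
```

The output shows the quote being doubled rather than backslash-escaped.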

Note that it was my intention to add another unit test to `test_hive_to_mysql` in `tests/core.py`.
On inspection, however, the indentation of the various methods seems wrong: methods are nested
and it's not apparent which class they belong to. Additionally, a number of the test
cases seem unrelated to the corresponding class.

For testing purposes I simply ran a pipeline which previously failed with the following exception,

    [2016-05-26 22:03:39,256] {models.py:1286} ERROR - 'ascii' codec can't decode byte 0xc3 in position 230: ordinal not in range(128)
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1245, in run
        result = task_copy.execute(context=context)
      File "/usr/local/lib/python2.7/dist-packages/airflow/operators/hive_to_mysql.py", line 88, in execute
        mysql.insert_rows(table=self.mysql_table, rows=results)
      File "/usr/local/lib/python2.7/dist-packages/airflow/hooks/dbapi_hook.py", line 176, in insert_rows
        l.append(self._serialize_cell(cell))
      File "/usr/local/lib/python2.7/dist-packages/airflow/hooks/dbapi_hook.py", line 196, in _serialize_cell
        return "'" + str(cell).replace("'", "''") + "'"
      File "/usr/local/lib/python2.7/dist-packages/future/types/newstr.py", line 102, in __new__
        return super(newstr, cls).__new__(cls, value)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 230: ordinal not in range(128)

and verified that, with the fix in place, the task succeeded and the resulting output
was correct. Note that, from reading the code base, it currently seems that only `MySqlHook`
objects call the `insert_rows` method.
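
The failure mode in the traceback can be reproduced in isolation. The sketch below is an illustration, not Airflow code: it decodes UTF-8 encoded bytes with the ASCII codec, which is what Python 2 does implicitly when a byte string is coerced to text. The sample value is an assumption, chosen because its UTF-8 encoding contains the byte 0xc3.

```python
# UTF-8 bytes for a string with one non-ASCII character ("é" encodes as 0xc3 0xa9).
raw = "café".encode("utf-8")

try:
    # The ASCII codec only accepts bytes below 0x80, so this raises.
    raw.decode("ascii")
except UnicodeDecodeError as err:
    message = str(err)
    print(message)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode("utf-8")
```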

Author: John Bodley <john.bodley@airbnb.com>

Closes #1550 from johnbodley/dbapi_hook_serialization.


> DbApiHook string serialization fails when string contains non-ASCII characters
> ------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-179
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-179
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: hooks
>            Reporter: John Bodley
>            Assignee: John Bodley
>             Fix For: Airflow 1.8
>
>
> The DbApiHook.insert_rows(...) method tries to serialize all values to strings using
> the ASCII codec. This is problematic if a cell contains non-ASCII characters, e.g.
>     >>> from airflow.hooks import DbApiHook
>     >>> DbApiHook._serialize_cell('Nguyễn Tấn Dũng')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>       File "/usr/local/lib/python2.7/dist-packages/airflow/hooks/dbapi_hook.py", line 196, in _serialize_cell
>         return "'" + str(cell).replace("'", "''") + "'"
>       File "/usr/local/lib/python2.7/dist-packages/future/types/newstr.py", line 102, in __new__
>         return super(newstr, cls).__new__(cls, value)
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 4: ordinal not in range(128)
> Rather than manually trying to serialize and escape values to an ASCII string, one should
> serialize the value to a string using the character set of the corresponding target database,
> leveraging the connection to convert the object to the SQL string literal.
> Additionally, the escaping logic for single quotes (') within the _serialize_cell method
> seems wrong, i.e.
>     str(cell).replace("'", "''")
> would escape the string "you're" to be "'you''re'" as opposed to "'you\'re'".
> Note that an exception should still be thrown if the target encoding is not compatible with
> the source encoding.
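
One way to let the connection handle quoting and encoding is driver-level parameter binding. The sketch below is a swapped-in illustration using the standard-library `sqlite3` module rather than MySQL or Airflow's `DbApiHook`; the table name `people`, the sample rows, and the choice of `latin-1` as an incompatible target charset are all assumptions. Note that `sqlite3` uses `?` placeholders, whereas MySQL drivers typically use `%s`.

```python
import sqlite3

rows = [("Nguyễn Tấn Dũng",), ("you're",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
# The driver binds each value; no manual quoting or str() coercion is involved.
conn.executemany("INSERT INTO people (name) VALUES (?)", rows)
conn.commit()

stored = [name for (name,) in conn.execute("SELECT name FROM people ORDER BY rowid")]
print(stored)
conn.close()

# A strict encode into a charset that cannot represent the value still raises,
# which matches the expectation that incompatible encodings should fail loudly.
try:
    "Nguyễn Tấn Dũng".encode("latin-1")
    incompatible = False
except UnicodeEncodeError:
    incompatible = True
```

Both values round-trip unchanged, including the embedded single quote.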



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
