impala-dev mailing list archives

From Alexander Behm <alex.b...@cloudera.com>
Subject Re: Query non-ASCII data, filter out rows with problematic chars?
Date Tue, 16 May 2017 15:43:56 GMT
Do you have a stack trace for that exception? It might be in the impalad
logs. That would help identify where it goes wrong and may point towards a
fix/workaround.
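
For reference, the wording of the error matches what Python's ASCII codec
raises when asked to encode non-ASCII text. The sketch below only reproduces
that class of error in isolation; it assumes, without confirmation from a
stack trace, that the failure happens on the Python side when results are
written out rather than in impalad. The helper name try_encode is
hypothetical and is not part of impala-shell.

```python
# Minimal reproduction of the "'ascii' codec can't encode ..." error class.
# try_encode is a hypothetical helper for illustration, not an impala-shell API.

def try_encode(text, encoding):
    """Return (True, encoded bytes) on success, or (False, error message)."""
    try:
        return True, text.encode(encoding)
    except UnicodeEncodeError as exc:
        return False, str(exc)

# Sample text containing the characters mentioned in the thread.
sample = u"statement: 语句 \u4e0e"

# Encoding as ASCII fails on the first character outside range(128).
ok_ascii, detail_ascii = try_encode(sample, "ascii")

# Encoding as UTF-8 succeeds for the same text.
ok_utf8, detail_utf8 = try_encode(sample, "utf-8")

print(ok_ascii, detail_ascii)
print(ok_utf8, detail_utf8)
```

If the stack trace does point at output encoding, one thing that may be
worth trying is running the script with Python's PYTHONIOENCODING environment
variable set to utf-8. That is an untested assumption here, not a confirmed
fix.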

On Mon, May 15, 2017 at 11:51 PM, John Russell <jrussell@cloudera.com>
wrote:

> Round 2 of diagnosis.  The Chinese characters, e.g. 语句, come through fine
> when I run the query interactively in impala-shell, but not in impala-shell
> -q through a bash script.  I tried bash idioms like:
>
> stty iutf8
>
> export LC_CTYPE=C
> export LANG=C
>
> export LC_CTYPE=zh_CN.utf8
> export LANG=zh_CN.utf8
>
> to no avail.  This is different from IMPALA-532 where the problem is due
> to specifying a non-existent locale.
>
> Thanks,
> John
>
> > On May 15, 2017, at 11:34 PM, John Russell <jrussell@cloudera.com> wrote:
> >
> > I'm running some impala-shell queries against Parquet files with
> > user-entered strings that are causing character encoding problems.  I get
> > Chinese characters coming through just fine in results.  There must be some
> > more exotic or non-UTF8 characters somewhere in the input.  The errors look
> > like the following (citing different positions, sometimes echoing a u''
> > codepoint, always mentioning range(128)):
> >
> > Unknown Exception : 'ascii' codec can't encode characters in position 875-876: ordinal not in range(128)
> > Could not execute command: select int_col, string_col from report where string_col like "%${var:component}%" limit 250
> >
> > Unknown Exception : 'ascii' codec can't encode character u'\u4e0e' in position 3698: ordinal not in range(128)
> > Could not execute command: select int_col, string_col from report where string_col like "%${var:component}%" limit 250
> >
> > Is there a WHERE technique or string regularizer function I could use to
> > skip over strings containing unrecognizable characters? SET MAX_ERRORS=0
> > and/or ABORT_ON_ERROR=0 in advance of the queries didn't help.  If I reduce
> > the LIMIT to something very low, the queries tend to work -- they seem to
> > fail on the first instance encountered of any problematic character.  The
> > impala-shell commands are being issued from a bash script.
> > ${var:component} is a Hadoop-related name like 'impala' or 'kafka'.
> >
> > Thanks,
> > John
>
>
