drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Khurram Faraaz <kfar...@maprtech.com>
Subject Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException
Date Fri, 28 Oct 2016 16:09:50 GMT
Thanks Andries and Veera.

1. Yes, my CSV file does have empty strings in some rows in columns[4].
2. it worked for parquet because I had used the case expression to cast
empty strings to NULL.
3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and to_Date returned results
with both representations.

Question - Shouldn't Drill handle such empty strings that are within rows
in CSV files ?
                 Why should user have to take care such cases ?

Regards,
Khurram

On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram <
vnaranammalpuram@maprtech.com> wrote:

> That should work and a lot faster too. Thanks for the pointer.
>
> -Veera
>
> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > Good catch on empty string Veera!
> >
> > Wouldn't it be cheaper to check for an empty string?
> > case when columns[] ='' then null else to_date(columns[],'yyyy-MM-dd')
> end
> >
> > I don't think the option to read csv empty columns (or empty string in
> any
> > text reader) as null is in the reader yet. So we can't check with
> columns[]
> > is null.
> >
> >
> > --Andries
> >
> >
> > > On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
> > vnaranammalpuram@maprtech.com> wrote:
> > >
> > > Do you have zero length strings in your data? I have seen cases where
> the
> > > system option to cast empty strings to NULL doesn't work as advertised.
> > You
> > > should re-open DRILL-3214.
> > >
> > > When I run into this problem, I usually use a regex to workaround. The
> > > PROJECT takes a performance hit when you do this for larger data sets
> but
> > > it works.
> > >
> > > $cat nulls.psv
> > > date_col|string_col
> > > |test
> > > 2016-10-28|test2
> > > $ sqlline
> > > apache drill 1.8.0
> > > "a little sql for your nosql"
> > > 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
> > > +-------------+-------------+
> > > |  date_col   | string_col  |
> > > +-------------+-------------+
> > > |             | test        |
> > > | 2016-10-28  | test2       |
> > > +-------------+-------------+
> > > 2 rows selected (0.303 seconds)
> > > 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from `nulls.psv`;
> > > Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
> > se-node10.se.lab:31010]
> > > (state=,code=0)
> > > 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
> > > to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted from
> > > `nulls.psv`;
> > > +---------------------+
> > > | date_col_converted  |
> > > +---------------------+
> > > | null                |
> > > | 2016-10-28          |
> > > +---------------------+
> > > 2 rows selected (0.521 seconds)
> > > 0: jdbc:drill:> alter system set
> > > `drill.exec.functions.cast_empty_string_to_null` = true;
> > > +-------+----------------------------------------------------------+
> > > |  ok   |                         summary                          |
> > > +-------+----------------------------------------------------------+
> > > | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
> > > +-------+----------------------------------------------------------+
> > > 1 row selected (0.304 seconds)
> > > 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from
> `nulls.psv`;
> > > Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
> > se-node10.se.lab:31010]
> > > (state=,code=0)
> > > 0: jdbc:drill:>
> > >
> > > -Veera
> > >
> > > On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz <kfaraaz@maprtech.com>
> > > wrote:
> > >
> > >> All,
> > >>
> > >> Question is - why does it work for a parquet column and fails when CSV
> > >> column is used ?
> > >>
> > >> Drill 1.9.0 commit : a29f1e29
> > >>
> > >> This is a simple project of column from a csv file, works.
> > >> {noformat}
> > >> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM `typeall_l.csv`
> t1
> > >> limit 5;
> > >> +-------------+
> > >> |   EXPR$0    |
> > >> +-------------+
> > >> | 2011-11-04  |
> > >> | 1986-10-22  |
> > >> | 1992-09-10  |
> > >> | 2016-08-07  |
> > >> | 1986-01-25  |
> > >> +-------------+
> > >> 5 rows selected (0.26 seconds)
> > >> {noformat}
> > >>
> > >> Using TO_DATE function with columns[x] as first input fails, with an
> > >> IllegalArgumentException
> > >> {noformat}
> > >> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-dd')
> > FROM
> > >> `typeall_l.csv` t1 limit 5;
> > >> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >>
> > >> Fragment 0:0
> > >>
> > >> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on
> > centos-01.qa.lab:31010]
> > >> (state=,code=0)
> > >> {noformat}
> > >>
> > >> However, interestingly same query over parquet column returns correct
> > >> results, on same data.
> > >>
> > >> {noformat}
> > >> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd') FROM
> > >> typeall_l limit 5;
> > >> +-------------+
> > >> |   EXPR$0    |
> > >> +-------------+
> > >> | 2011-01-04  |
> > >> | 1986-01-22  |
> > >> | 1992-01-10  |
> > >> | 2016-01-07  |
> > >> | 1986-01-25  |
> > >> +-------------+
> > >> 5 rows selected (0.286 seconds)
> > >> {noformat}
> > >>
> > >> When the date string is passed as first input, to_date function
> returns
> > >> correct results.
> > >> {noformat}
> > >> 0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','yyyy-mm-
> dd')
> > >> from (values(1));
> > >> +-------------+
> > >> |   EXPR$0    |
> > >> +-------------+
> > >> | 2011-01-04  |
> > >> +-------------+
> > >> 1 row selected (0.235 seconds)
> > >> {noformat}
> > >>
> > >> Thanks,
> > >> Khurram
> > >>
> > >
> > >
> > >
> > > --
> > > Veera Naranammalpuram
> > > Product Specialist - SQL on Hadoop
> > > *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> > > *(Email) vnaranammalpuram@maprtech.com <naranammalpuram@maprtech.com>*
> > > *(Mobile) 917 683 8116 - can text *
> > > *Timezone: ET (UTC -5:00 / -4:00)*
> >
> >
>
>
> --
> Veera Naranammalpuram
> Product Specialist - SQL on Hadoop
> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> *(Email) vnaranammalpuram@maprtech.com <naranammalpuram@maprtech.com>*
> *(Mobile) 917 683 8116 - can text *
> *Timezone: ET (UTC -5:00 / -4:00)*
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message