flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: The null in Flink
Date Wed, 02 Dec 2015 10:27:15 GMT
Hi Chenliang!

I have to dig into this again, it was a while back. I think (vaguely) the
reason why this worked was that in the end (at the root of a tree that is a
logical expression) if the result is UNKNOWN, it is treated like FALSE.

For example a predicate like "WHERE t.a > 10 && t.b == 'pending' ". If one
boolean atom is UNKNOWN, the other is TRUE, the whole term becomes UNKNOWN
and the row is filtered out (as if the predicate was false) - the result of
the query contains no rows where predicate results are UNKNOWN.

Stephan



On Tue, Dec 1, 2015 at 4:09 AM, Li, Chengxiang <chengxiang.li@intel.com>
wrote:

> Stephen,
> For the 3rd topic, you mentioned that "If the boolean expressions are
> monotonous (have no NOT), then the UNKNOWN value can be the same as FALSE
> ", as UNKNOWN means it could be TRUE as well, does it a proper way to
> handle it just as FALSE?
>
> Aljoscha,
> I agree with you, Table can only be transformed from Tuple/Case Class
> DataSet now, and Tuple/Case Class does not allow null field value, so read
> files from data source to Row DataSet is necessary for NULL value handling.
>
> -----Original Message-----
> From: Aljoscha Krettek [mailto:aljoscha@apache.org]
> Sent: Friday, November 27, 2015 6:41 PM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Oh, this is probably the Jira for what I mentioned:
> https://issues.apache.org/jira/browse/FLINK-2988
>
> > On 27 Nov 2015, at 11:02, Aljoscha Krettek <aljoscha@apache.org> wrote:
> >
> > Hi,
> > just some information. The Table API code generator already has
> preliminary support for generating code that is NULL-aware. So for example
> if you have expressions like 1 + NULL the result would also be null.
> >
> > I think one of the missing pieces is a way to get data that contains
> null values into the system. For example, right now the expected way to
> read csv files is via tuples and they don’t support null values. I think we
> need a way to directly read CSV files into a Row DataSet (or Table).
> >
> > Cheers,
> > Aljoscha
> >> On 26 Nov 2015, at 12:31, Stephan Ewen <sewen@apache.org> wrote:
> >>
> >> Hi!
> >>
> >> Thanks for the good discussion! Here are some thoughts from my side:
> >>
> >> 1)
> >> I would agree with Chengxiang that it helps to have as much NULL
> >> handling in the table API as possible, since most SQL constructs will
> >> be permitted there are well.
> >>
> >> 2)
> >> A question that I have is whether we want to actually follow the SQL
> >> standard exactly. There is a lot of criticism on NULL in the SQL
> >> standard, and there have been many good proposals for more meaningful
> >> semantics (for example differentiate between the meanings "value
> >> missing", "value unknown", "value not applicable", etc).
> >>
> >> Going with the SQL way is easiest and makes SQL addition on top of
> >> the table API much easier. Also, there is only one type of NULL,
> >> meaning that null-values can be encoded efficiently in bitmaps.
> >> Further more, the fact that the Table API users have the power of a
> >> programming language at hand (rather than the limited set of SQL
> >> operators), they should be able to easily define their own constants
> >> for special meanings like "value not applicable" or so.
> >>
> >> Just curious if anyone has experience with some of the other
> >> null-sematic proposals that have been around.
> >>
> >> 3)
> >> One comment concerning the three-value-logic for boolean expressions:
> >>
> >> A while back, I worked on a SQL engine, and we were able to not
> >> implement three-value logic with trick. If I recall correctly, it was
> like this:
> >>
> >> If the boolean expressions are monotonous (have no NOT), then the
> >> UNKNOWN value can be the same as FALSE. So the query planner had to
> >> rewrite all expression trees to have no NOT, which means pushing the
> >> NOT down into the leaf comparison operations (for example push NOT into
> == to become !=).
> >> These leaf comparison operators needed to be NULL aware to return
> >> FALSE on comparisons with NULL.
> >>
> >>
> >> Greetings,
> >> Stephan
> >>
> >>
> >> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang
> >> <chengxiang.li@intel.com>
> >> wrote:
> >>
> >>> Thanks, Timo.
> >>> We may put the NULL related function support to SQL API, but for
> >>> Scalar expression and Boolean expression, it already been supported
> >>> in Table API, without NULL value handling support, query with Scalar
> >>> expression and Boolean expression would fail while encounter NULL
> value.
> >>>
> >>> Thanks
> >>> Chengxiang
> >>>
> >>> -----Original Message-----
> >>> From: Timo Walther [mailto:twalthr@apache.org]
> >>> Sent: Wednesday, November 25, 2015 7:33 PM
> >>> To: dev@flink.apache.org
> >>> Subject: Re: The null in Flink
> >>>
> >>> Hi Chengxiang,
> >>>
> >>> I totally agree that the Table API should fully support NULL values.
> >>> The Table API is a logical API and therefore we should be as close
> >>> to ANSI SQL as possible. Rows need to be nullable in the near future.
> >>>
> >>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to
> >>> much like SQL magic. I think all other SQL magic (DBMS specific
> >>> corner cases) should be handled by the SQL API on top of the Table API.
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 25.11.2015 11:31, Li, Chengxiang wrote:
> >>>> Hi
> >>>> In this mail list, there are some discussions about null value
> >>>> handling
> >>> in Flink, and I saw several related JIRAs as well(like FLINK-2203,
> >>> FLINK-2210), but unfortunately, got reverted due to immature design,
> >>> and no further action since then. I would like to pick this topic up
> >>> here, as it's quite an important part of data analysis and many
> features depend on it.
> >>> Hopefully, through a plenary discussion, we can generate an
> >>> acceptable solution and move forward. Stephan has explained very
> >>> clearly about how and why Flink handle "Null values in the
> >>> Programming Language APIs", so I mainly talk about the second part
> >>> of "Null values in the high-level
> >>> (logical) APIs ".
> >>>>
> >>>> 1. Why should Flink support Null values handling in Table API?
> >>>>     i.  Data source may miss column value in many cases, if no Null
> >>> values handling in Table API, user need to write an extra ETL to
> >>> handle missing values manually.
> >>>>     ii. Some Table API operators generate Null values on their own,
> >>> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values
> >>> handling in Table API is the prerequisite of these features.
> >>>>
> >>>> 2. The semantic of Null value handling in Table API.
> >>>> Fortunately, there are already mature DBMS  standards we can follow
> >>>> for
> >>> Null value handling, I list several semantic of Null value handling
> here.
> >>> To be noted that, this may not cover all the cases, and the
> >>> semantics may vary in different DBMSs, so it should totally open to
> discuss.
> >>>>     I,  NULL compare. In ascending order, NULL is smaller than any
> >>> other value, and NULL == NULL return false.
> >>>>     ii. NULL exists in GroupBy Key, all NULL values are grouped as
> >>>> a
> >>> single group.
> >>>>     iii. NULL exists in Aggregate columns, ignore NULL in
> >>>> aggregation
> >>> function.
> >>>>                iv. NULL exists in both side Join key, refer to #i,
> >>> NULL == NULL return false, no output for NULL Join key.
> >>>>                v.  NULL in Scalar expression, expression within
> >>> NULL(eg. 1 + NULL) return NULL.
> >>>>                vi. NULL in Boolean expression, add an extra result:
> >>> UNKNOWN, more semantic for Boolean expression in reference #1.
> >>>>                vii. More related function support, like COALESCE,
> >>>> NVL,
> >>> NANVL, and so on.
> >>>>
> >>>> 3. NULL value storage in Table API.
> >>>>  Just set null to Row field value. To mark NULL value in serialized
> >>> binary record data, normally it use extra flag for each field to
> >>> mark whether its value is NULL, which would change the data layout
> >>> of Row object. So any logic that access serialized Row data directly
> >>> should updated to sync with new data layout, for example, many
> >>> methods in RowComparator.
> >>>>
> >>>> Reference:
> >>>> 1. Nulls: Nothing to worry about:
> >>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-0
> >>> 97727.html
> >>> .
> >>>> 2. Null related functions:
> >>>> https://oracle-base.com/articles/misc/null-related-functions
> >>>>
> >>>> -----Original Message-----
> >>>> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On
> >>>> Behalf Of Stephan Ewen
> >>>> Sent: Thursday, June 18, 2015 8:43 AM
> >>>> To: dev@flink.apache.org
> >>>> Subject: Re: The null in Flink
> >>>>
> >>>> Hi!
> >>>>
> >>>> I think we actually have two discussions here, both of them important:
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 1) Null values in the Programming Language APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> Fields in composite types may simply be null pointers.
> >>>>
> >>>> In object types:
> >>>>  - primitives members are naturally non-nullable
> >>>>  - all other members are nullable
> >>>>
> >>>> => If you want to avoid the overhead of nullability, go with
> >>>> primitive
> >>> types.
> >>>>
> >>>> In Tuples, and derives types (Scala case classes):
> >>>>  - Fields are non-nullable.
> >>>>
> >>>> => The reason here is that we initially decided to keep tuples as
a
> >>>> very
> >>> fast data type. Because tuples cannot hold primitives in Java/Scala,
> >>> we would not have a way to make fast non-nullable fields. The
> >>> performance of nullable fields affects the key-operations, especially
> on normalized keys.
> >>>> We can work around that with some effort, but have not one it so far.
> >>>>
> >>>> => In Scala, the Option types is a natural way of elegantly working
> >>> around that.
> >>>>
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 2) Null values in the high-level (logial) APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> This is mainly what Ted was referring to, if I understood him
> correctly.
> >>>>
> >>>> Here, we need to figure out what form of semantical null values in
> >>>> the
> >>> Table API and later, in SQL.
> >>>>
> >>>> Besides deciding what semantics to follow here in the logical APIs,
> >>>> we
> >>> need to decide what these values confert to/from when switching
> >>> between logical/physical APIs.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning
> >>>> <ted.dunning@gmail.com>
> >>> wrote:
> >>>>
> >>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels
> >>>>> <mxm@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Just to give an idea what null values could cause in Flink:
> >>>>> DataSet.count()
> >>>>>> returns the number of elements of all values in a Dataset (null
> >>>>>> or
> >>>>>> not) while #834 would ignore null values and aggregate the
> >>>>>> DataSet without
> >>>>> them.
> >>>>> Compare R's na.action.
> >>>>>
> >>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
> >>>>>
> >>>
> >>>
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message