drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neeraja Rentachintala <nrentachint...@maprtech.com>
Subject Re: Working with Case-Sensitive Data-sources
Date Mon, 14 Mar 2016 23:29:14 GMT
How is this handled for MongoDB storage plugin, which I believe a case
sensitive DB as well?

On Mon, Mar 14, 2016 at 4:27 PM, Jacques Nadeau <jacques@dremio.com> wrote:

> I don't think it is that simple since there are some types of things that
> we can't pushdown that will cause inconsistent results.
>
> For example, assuming that all values of x are positive, the following two
> queries should return the same result
>
> select * from hbase where x = 5
> select * from hbase where abs(x) = 5
>
> However, if the field x is sometimes 'x' and sometimes 'X', we're going to
> different results between the first query and the second. That is why I
> think we need to guarantee that even when optimization rules fails, we have
> the same plan meaning. In essence, all plans should be valid. If you get to
> a place where a rule changes the data, then the original plan was
> effectively invalid.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
>
> > Project pushdown should always happen. If you see project pushdown
> > does not happen for your HBase query, then it's a bug.
> >
> > However, if you submit two physical plans, one with project pushdown,
> > another one without project pushdown, but they return different
> > results for HBase query. I'll not call this a bug.
> >
> >
> >
> > On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau <jacques@dremio.com>
> > wrote:
> > > Agree with Zelaine, plan changes/optimizations shouldn't change
> results.
> > > This is a bug.
> > >
> > > Drill is focused on being case-insensitive, case-preserving. Each
> storage
> > > plugin implements its own case sensitivity policy when working with
> > > columns/fields and should be documented. It isn't practical to make
> HBase
> > > case-insensitive so it should behave case sensitivity. DFS formats (as
> > > opposed to HBase) are entirely under Drill's control and thus target
> > > case-insensitive, case-preserving operation.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <jinfengni99@gmail.com>
> > wrote:
> > >
> > >> Abhishek
> > >>
> > >> Great question. Here is what I understand regarding the case sensitive
> > >> policy.
> > >>
> > >> Drill's case sensitivity policy (case insensitive and case preserving)
> > >> applies to the execution engine in Drill; it does not enforce the case
> > >> sensitivity policy to all the storage plugin. A storage plugin could
> > >> decide and implement it's own policy.
> > >>
> > >> Why would the pushdown impact the case sensitivity when query HBase?
> > >> Without project pushdown, HBase storage plugin will return all the
> > >> data, and it's up to Drill's execution Project operator to apply the
> > >> case insensitive policy.  With the project pushdown, Drill will pass
> > >> the list of column names to HBase storage plugin, and HBase decides to
> > >> apply it's case sensitivity policy when scan the data.
> > >>
> > >> Adding an option to make case sensitive storage plugin honor case
> > >> insensitive policy seems to be a good idea. The question is whether
> > >> the underneath storage (like HBase) will support such mode.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <zfong@maprtech.com>
> > wrote:
> > >> > Abhishek,
> > >> >
> > >> > I guess you're arguing that Drill's current behavior of honoring the
> > case
> > >> > sensitive nature of the underlying data source (in this case, HBase
> > and
> > >> > MapR-DB) will be confusing for Drill users who are accustomed to
> > Drill's
> > >> > case insensitive behavior.
> > >> >
> > >> > I can see arguments both ways.
> > >> >
> > >> > But the part I think is confusing is that the behavior differs
> > depending
> > >> on
> > >> > whether or not projections and filters are pushed down to the data
> > >> source.
> > >> > If the push down is done, then the behavior is case sensitive
> > >> > (corresponding to the data source).  But if pushdown doesn't happen,
> > then
> > >> > the behavior is case insensitive.  That difference seems
> inconsistent
> > and
> > >> > undesirable -- unless you argue that there are instances where you
> > would
> > >> > want one behavior vs the other.  But it seems like that should be
> > >> > orthogonal and separate from whether pushdowns are applied.
> > >> >
> > >> > -- Zelaine
> > >> >
> > >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <agirish@mapr.com>
> > >> wrote:
> > >> >
> > >> >> Hello all,
> > >> >>
> > >> >> As I understand, Drill by design is case-insensitive, w.r.t column
> > names
> > >> >> within a table or file [1]. While this provides great flexibility
> and
> > >> works
> > >> >> well with many data-sources, there are issues when working with
> > >> >> case-sensitive data-sources such as HBase / MapR-DB.
> > >> >>
> > >> >> Consider the following JSON file:
> > >> >>
> > >> >> {"_id": "ID1",
> > >> >>  *"Name"* : "ABC",
> > >> >>  "Age" : "25",
> > >> >>  "Phone" : null
> > >> >> }
> > >> >> {"_id": "ID2",
> > >> >>  *"name"* : "PQR",
> > >> >>  "Age" : "30",
> > >> >>  "Phone" : "408-123-456"
> > >> >> }
> > >> >> {"_id": "ID3",
> > >> >>  *"NAME"* : "XYZ",
> > >> >>  "Phone" : ""
> > >> >> }
> > >> >>
> > >> >> Note that the case of the name field within the JSON file is of
> > >> mixed-case.
> > >> >>
> > >> >> From Drill, while querying the JSON file directly (or corresponding
> > >> content
> > >> >> in Parquet or Text formats), we get results which we as Drill
users
> > have
> > >> >> come to expect:
> > >> >>
> > >> >> > select NAME from mfs.`/tmp/json/a.json`;
> > >> >> +-------+
> > >> >> | NAME  |
> > >> >> +-------+
> > >> >> | ABC   |
> > >> >> | PQR   |
> > >> >> | XYZ   |
> > >> >> +-------+
> > >> >>
> > >> >>
> > >> >> However, while querying a case-sensitive datasource (*with pushdown
> > >> >> enabled*)
> > >> >> the following results are returned. The case provided in the query
> > text
> > >> is
> > >> >> honored and would determine the results. This could come as a
> *slight
> > >> >> surprise to certain Drill users* exploring/migrating to new
> Databases
> > >> >> (using new Storage / Format plugins within Drill)
> > >> >>
> > >> >> > select *Name* from mfs.`/tmp/json/a`;
> > >> >> +-------+
> > >> >> | Name  |
> > >> >> +-------+
> > >> >> | ABC   |
> > >> >> +-------+
> > >> >>
> > >> >> > select *name* from mfs.`/tmp/json/a`;
> > >> >> +-------+
> > >> >> | name  |
> > >> >> +-------+
> > >> >> | PQR   |
> > >> >> +-------+
> > >> >>
> > >> >> > select *NAME* from mfs.`/tmp/json/a`;
> > >> >> +-------+
> > >> >> | NAME  |
> > >> >> +-------+
> > >> >> | XYZ   |
> > >> >> +-------+
> > >> >>
> > >> >>
> > >> >> > select *nAME* from mfs.`/tmp/json/a`;
> > >> >> +-------+
> > >> >> | nAME  |
> > >> >> +-------+
> > >> >> +-------+
> > >> >> No rows selected
> > >> >>
> > >> >> There is no easy way to get all matching rows (irrespective of
the
> > case
> > >> of
> > >> >> the column name). In the above example, the first row matching
the
> > >> provided
> > >> >> case is returned.
> > >> >>
> > >> >>
> > >> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`;
> > >> >> +-------+--------+--------+
> > >> >> | Name  | name0  | NAME1  |
> > >> >> +-------+--------+--------+
> > >> >> | ABC   | ABC    | ABC    |
> > >> >> +-------+--------+--------+
> > >> >>
> > >> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`;
> > >> >> +-------+--------+--------+
> > >> >> | NAME  | Name0  | name1  |
> > >> >> +-------+--------+--------+
> > >> >> | XYZ   | XYZ    | XYZ    |
> > >> >> +-------+--------+--------+
> > >> >>
> > >> >>
> > >> >> If Pushdown features are disabled, the behavior seen above would
> > indeed
> > >> >> match JSON files. However, this could come at a cost of not fully
> > >> utilizing
> > >> >> the power of the underlying data-source, and could lead to
> > performance
> > >> >> issues.
> > >> >>
> > >> >> *In-consistent Results can happen when:*
> > >> >>
> > >> >> (1) Dataset has mixed-cases for fields. Example seen above. While
> > this
> > >> >> might not be very common, the concerns are still valid*, *since
> > >> substantial
> > >> >> Drill users are exploring Drill for ETL cases where Data is not
> > >> completely
> > >> >> sanitized.
> > >> >>
> > >> >> (2) Data is consistent w.r.t case, but the query text has
> > non-matching
> > >> >> case. While some could term this as user error, it could still
> cause
> > >> issues
> > >> >> when users, applications or the underlying datasources change.
> > >> >>
> > >> >> In both the above cases, Drill would silently perform the query
and
> > >> return
> > >> >> results which could be either *none, partial, complete/correct
or
> > >> entirely*
> > >> >> *wrong*.
> > >> >>
> > >> >> Some specific questions:
> > >> >>
> > >> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive
> > >> Data-sources.
> > >> >> *For users who prefer the flexibility, how can Drill ensure that
> the
> > >> >> underlying data-source can return case-insensitive results.
> > >> >>
> > >> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY
> > >> support
> > >> >> case-sensitive behavior for data-sources. Users coming from
> > >> case-sensitive
> > >> >> databases might want results matching the provided case. Example
> > using
> > >> the
> > >> >> above data:
> > >> >>
> > >> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`;
> > >> >> +------+-------+--------+--------+
> > >> >>
> > >> >> | _id  | Name  | name   | NAME   |
> > >> >> +------+-------+--------+--------+
> > >> >> | ID1  | ABC   | null   | null   |
> > >> >> +------+-------+--------+--------+
> > >> >> | ID2  | null  | PQR    | null   |
> > >> >> +------+-------+--------+--------+
> > >> >> | ID3  | null  | null   | XYZ    |
> > >> >> +------+-------+--------+--------+
> > >> >>
> > >> >>
> > >> >> (3) How does Drill currently work with *MongoDB*, which i guess
is
> a
> > >> >> case-sensitive database? Have these issues ever been discussed
> > >> previously?
> > >> >>
> > >> >>
> > >> >> Thanks in advance. I'd appreciate any helpful response.
> > >> >>
> > >> >> Regards,
> > >> >> Abhishek
> > >> >>
> > >> >>
> > >> >> [1]
> > https://drill.apache.org/docs/lexical-structure/#case-sensitivity
> > >> >>
> > >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message