spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Reynold Xin" <r...@databricks.com>
Subject Re: FYI: The evolution on `CHAR` type behavior
Date Tue, 17 Mar 2020 00:29:22 GMT
I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders
of magnitude higher usage than char.

On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:

> 
> Thank you, Stephen and Reynold.
> 
> 
> To Reynold.
> 
> 
> The way I see the following is a little different.
> 
> 
>       > CHAR is an undocumented data type without clearly defined
> semantics.
> 
> Let me describe in Apache Spark User's View point.
> 
> 
> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at
> Apache Spark 1.x without much documentation. In addition, there still
> exists an effort which is trying to keep it in 3.0.0 age.
> 
>        https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 (
> https://issues.apache.org/jira/browse/SPARK-31088 )
>        Add back HiveContext and createExternalTable
> 
> Historically, we tried to make many SQL-based customer migrate their
> workloads from Apache Hive into Apache Spark through `HiveContext`.
> 
> Although Apache Spark didn't have a good document about the inconsistent
> behavior among its data sources, Apache Hive has been providing its
> documentation and many customers rely the behavior.
> 
>       - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ LanguageManual+Types
> ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types )
> 
> At that time, frequently in on-prem Hadoop clusters by well-known vendors,
> many existing huge tables were created by Apache Hive, not Apache Spark.
> And, Apache Spark is used for boosting SQL performance with its *caching*.
> This was true because Apache Spark was added into the Hadoop-vendor
> products later than Apache Hive.
> 
> 
> Until the turning point at Apache Spark 2.0, we tried to catch up more
> features to be consistent at least with Hive tables in Apache Hive and
> Apache Spark because two SQL engines share the same tables.
> 
> For the following, technically, while Apache Hive doesn't changed its
> existing behavior in this part, Apache Spark evolves inevitably by moving
> away from the original Apache Spark old behaviors one-by-one.
> 
> 
>       >  the value is already fucked up
> 
> 
> The following is the change log.
> 
>       - When we switched the default value of `convertMetastoreParquet`.
> (at Apache Spark 1.2)
>       - When we switched the default value of `convertMetastoreOrc` (at
> Apache Spark 2.4)
>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
> `PARQUET` table at Apache Spark 3.0)
> 
> To sum up, this has been a well-known issue in the community and among the
> customers.
> 
> Bests,
> Dongjoon.
> 
> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy < scoy@ infomedia. com. au (
> scoy@infomedia.com.au ) > wrote:
> 
> 
>> Hi there,
>> 
>> 
>> I’m kind of new around here, but I have had experience with all of all the
>> so called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
>> Server as well as Postgresql.
>> 
>> 
>> They all support the notion of “ANSI padding” for CHAR columns - which
>> means that such columns are always space padded, and they default to
>> having this enabled (for ANSI compliance).
>> 
>> 
>> MySQL also supports it, but it defaults to leaving it disabled for
>> historical reasons not unlike what we have here.
>> 
>> 
>> In my opinion we should push toward standards compliance where possible
>> and then document where it cannot work.
>> 
>> 
>> If users don’t like the padding on CHAR columns then they should change to
>> VARCHAR - I believe that was its purpose in the first place, and it does
>> not dictate any sort of “padding".
>> 
>> 
>> I can see why you might “ban” the use of CHAR columns where they cannot be
>> consistently supported, but VARCHAR is a different animal and I would
>> expect it to work consistently everywhere.
>> 
>> 
>> 
>> 
>> Cheers,
>> 
>> 
>> Steve C
>> 
>> 
>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun < dongjoon. hyun@ gmail. com (
>>> dongjoon.hyun@gmail.com ) > wrote:
>>> 
>>> Hi, Reynold.
>>> (And +Michael Armbrust)
>>> 
>>> 
>>> If you think so, do you think it's okay that we change the return value
>>> silently? Then, I'm wondering why we reverted `TRIM` functions then?
>>> 
>>> 
>>> > Are we sure "not padding" is "incorrect"?
>>> 
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> 
>>> 
>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < gourav. sengupta@ gmail.
>>> com ( gourav.sengupta@gmail.com ) > wrote:
>>> 
>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> 100% agree with Reynold.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Gourav Sengupta
>>>> 
>>>> 
>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < rxin@ databricks. com (
>>>> rxin@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> 
>>>>> Are we sure "not padding" is "incorrect"?
>>>>> 
>>>>> 
>>>>> 
>>>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>>>> databases don't actually pad.
>>>>> 
>>>>> 
>>>>> 
>>>>> https:/ / docs. snowflake. net/ manuals/ sql-reference/ data-types-text.
html
>>>>> (
>>>>> https://aus01.safelinks.protection.outlook.com/?url=https:%2F%2Fdocs.snowflake.net%2Fmanuals%2Fsql-reference%2Fdata-types-text.html%23:~:text%3DCHAR%2520%252C%2520CHARACTER%2C(1)%2520is%2520the%2520default.%26text%3DSnowflake%2520currently%2520deviates%2520from%2520common%2Cspace-padded%2520at%2520the%2520end.&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062044368&sdata=BvnZTTPTZBAi8oGWIvJk2fC%2FYSgdvq%2BAxtOj0nVzufk%3D&reserved=0
>>>>> ) : "Snowflake currently deviates from common CHAR semantics in that
>>>>> strings shorter than the maximum length are not space-padded at the end."
>>>>> 
>>>>> 
>>>>> 
>>>>> MySQL: https:/ / stackoverflow. com/ questions/ 53528645/ why-char-dont-have-padding-in-mysql
>>>>> (
>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F53528645%2Fwhy-char-dont-have-padding-in-mysql&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062044368&sdata=3OGLht%2Fa28GcKhAGwJPXIR%2BMODiIwXGVuNuResZqwXM%3D&reserved=0
>>>>> )
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon. hyun@ gmail.
com
>>>>> ( dongjoon.hyun@gmail.com ) > wrote:
>>>>> 
>>>>>> Hi, Reynold.
>>>>>> 
>>>>>> 
>>>>>> Please see the following for the context.
>>>>>> 
>>>>>> 
>>>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-31136 (
>>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FSPARK-31136&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062054364&sdata=pWQ9QhfVY4Uzyc8oIJ1QONQ0zOBAQ2DGSemyBj%2BvFeM%3D&reserved=0
>>>>>> )
>>>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE
TABLE
>>>>>> syntax"
>>>>>> 
>>>>>> 
>>>>>> I raised the above issue according to the new rubric, and the banning
was
>>>>>> the proposed alternative to reduce the potential issue.
>>>>>> 
>>>>>> 
>>>>>> Please give us your opinion since it's still PR.
>>>>>> 
>>>>>> 
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>> 
>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@ databricks.
com (
>>>>>> rxin@databricks.com ) > wrote:
>>>>>> 
>>>>>> 
>>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse
the hell out
>>>>>>> of both new and old users?
>>>>>>> 
>>>>>>> 
>>>>>>> For old users, their old code that was working for char(3) would
now stop
>>>>>>> working. 
>>>>>>> 
>>>>>>> 
>>>>>>> For new users, depending on whether the underlying metastore
char(3) is
>>>>>>> either supported but different from ansi Sql (which is not that
big of a
>>>>>>> deal if we explain it) or not supported. 
>>>>>>> 
>>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon.
hyun@ gmail. com
>>>>>>> ( dongjoon.hyun@gmail.com ) > wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> Hi, All.
>>>>>>>> 
>>>>>>>> Apache Spark has been suffered from a known consistency issue
on `CHAR`
>>>>>>>> type behavior among its usages and configurations. However,
the evolution
>>>>>>>> direction has been gradually moving forward to be consistent
inside Apache
>>>>>>>> Spark because we don't have `CHAR` offically. The following
is the
>>>>>>>> summary.
>>>>>>>> 
>>>>>>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different
result.
>>>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides
a fallback to
>>>>>>>> Hive behavior.)
>>>>>>>> 
>>>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS
ORC;
>>>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS
PARQUET;
>>>>>>>> 
>>>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>>>> 
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>     a   3
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>     a   3
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>     a 2
>>>>>>>> 
>>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback
to Hive
>>>>>>>> behavior.)
>>>>>>>> 
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>     a   3
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>     a 2
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>     a 2
>>>>>>>> 
>>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS`
clause) became
>>>>>>>> consistent.
>>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true`
provides a
>>>>>>>> fallback to Hive behavior.)
>>>>>>>> 
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>     a 2
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>     a 2
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>     a 2
>>>>>>>> 
>>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR`
type in the
>>>>>>>> following syntax to be safe.
>>>>>>>> 
>>>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>>>    https:/ / github. com/ apache/ spark/ pull/ 27902 (
>>>>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fspark%2Fpull%2F27902&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062054364&sdata=lhwUP5TcTtaO%2BLUTmx%2BPTjT0ASXPrQ7oKLL0N6EG0Ug%3D&reserved=0
>>>>>>>> )
>>>>>>>> 
>>>>>>>> This email is sent out to inform you based on the new policy
we voted.
>>>>>>>> The recommendation is always using Apache Spark's native
type `String`.
>>>>>>>> 
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>> 
>>>>>>>> References:
>>>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>>>      https:/ / lists. apache. org/ thread. html/ 96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.
>>>>>>>> spark. apache. org%3E (
>>>>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fthread.html%2F96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%2540%253Cdev.spark.apache.org%253E&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062064358&sdata=6hkno6zKTkcIrO%2FJo4hTYihsYvNynMuWcxhzL0fZR68%3D&reserved=0
>>>>>>>> )
>>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for
CREATE TABLE
>>>>>>>> syntax", 2019/12/06
>>>>>>>>    https:/ / lists. apache. org/ thread. html/ 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
>>>>>>>> spark. apache. org%3E (
>>>>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fthread.html%2F493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%2540%253Cdev.spark.apache.org%253E&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062064358&sdata=QJnEU3mvUJff53Gw8F%2FAbxzd%2F8ZA1hhuoQwicX4ZXyI%3D&reserved=0
>>>>>>>> )
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> This email contains confidential information of and is the copyright of
>> Infomedia. It must not be forwarded, amended or disclosed without consent
>> of the sender. If you received this message by mistake, please advise the
>> sender and delete all copies. Security of transmission on the internet
>> cannot be guaranteed, could be infected, intercepted, or corrupted and you
>> should ensure you have suitable antivirus protection in place. By sending
>> us your or any third party personal details, you consent to (or confirm
>> you have obtained consent from such third parties) to Infomedia’s privacy
>> policy. http:/ / www. infomedia. com. au/ privacy-policy/ (
>> http://www.infomedia.com.au/privacy-policy/ )
>> 
> 
>
Mime
View raw message