[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables #27902

cloud-fan · 2020-03-13T09:17:30Z

What changes were proposed in this pull request?

Spark introduced the CHAR type for Hive compatibility, but it only works for Hive tables. The CHAR type has never been documented, and it is treated as STRING for non-Hive tables.

However, this leads to confusing behavior:

Apache Spark 3.0.0-preview2

spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a 	2

Apache Spark 2.4.5

spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a  	3

According to the SQL standard, CHAR(3) should guarantee that all values have length 3. Because Spark treats CHAR(3) as STRING, it does not provide that guarantee.

This PR forbids the CHAR type in non-Hive tables because it is not supported correctly.
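To make the length guarantee concrete, here is a minimal sketch in plain Scala (no Spark dependency). The object `CharSemantics` and the method `toFixedLength` are hypothetical names for illustration only, and rejecting over-long input is a simplifying assumption rather than Spark or Hive behavior. It shows the difference between padding a value to a fixed length and storing it as-is, as a STRING column would.

```scala
object CharSemantics {
  // Hypothetical helper: the value a CHAR(n) column would be expected to hold.
  // Shorter inputs are right-padded with spaces up to length n.
  // Assumption: inputs longer than n are rejected outright (a simplification).
  def toFixedLength(value: String, n: Int): String = {
    require(n > 0, "CHAR length must be positive")
    if (value.length > n) {
      throw new IllegalArgumentException(
        s"Value of length ${value.length} does not fit CHAR($n)")
    }
    value + (" " * (n - value.length))
  }

  // A STRING column stores the value unchanged, so no length is guaranteed.
  def asString(value: String): String = value
}
```

With the input `'a '` from the examples above, `CharSemantics.toFixedLength("a ", 3)` has length 3, while `CharSemantics.asString("a ")` keeps length 2. Those are the two results shown in the spark-sql sessions.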

Why are the changes needed?

To avoid confusing or wrong behavior.

Does this PR introduce any user-facing change?

Yes. Users can no longer create or alter non-Hive tables with the CHAR type.

How was this patch tested?

New tests.

cloud-fan · 2020-03-13T09:17:57Z

cc @dongjoon-hyun @HeartSaVioR @HyukjinKwon

SparkQA · 2020-03-13T10:14:11Z

Test build #119751 has finished for PR 27902 at commit cdbfea3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-13T13:22:56Z

Test build #119756 has finished for PR 27902 at commit d6d57a5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-03-13T16:09:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala

+
+  def failCharType(dt: DataType): Unit = {
+    if (HiveStringType.containsCharType(dt)) {
+      throw new AnalysisException("Cannot use CHAR/VARCHAR type in non-Hive tables.")


Shall we have a directional warning like "Use STRING type instead of CHAR/VARCHAR type in non-Hive tables."?
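As an illustration of the kind of check under discussion, the following is a self-contained sketch using a toy type model. The types `SketchType`, `CharT`, `ArrayT`, `MapT`, and `StructT` are hypothetical stand-ins and are not Spark's `DataType` classes. It assumes the check should recurse into nested types, and the exception message is one possible directional wording, not the text that was merged.

```scala
// Toy model of a column type tree; NOT Spark's DataType hierarchy.
sealed trait SketchType
case object StringT extends SketchType
case object LongT extends SketchType
final case class CharT(length: Int) extends SketchType
final case class ArrayT(element: SketchType) extends SketchType
final case class MapT(key: SketchType, value: SketchType) extends SketchType
final case class StructT(fields: Seq[(String, SketchType)]) extends SketchType

object CharTypeCheck {
  // Returns true if a CHAR type appears anywhere in the (possibly nested) type.
  def containsCharType(dt: SketchType): Boolean = dt match {
    case CharT(_)        => true
    case ArrayT(element) => containsCharType(element)
    case MapT(key, value) => containsCharType(key) || containsCharType(value)
    case StructT(fields) => fields.exists { case (_, t) => containsCharType(t) }
    case _               => false
  }

  // Fails with a message that tells the user what to use instead.
  // Hypothetical wording; the real check throws an AnalysisException.
  def failCharType(dt: SketchType): Unit = {
    if (containsCharType(dt)) {
      throw new IllegalArgumentException(
        "Cannot use CHAR type in non-Hive-Serde tables. Use STRING type instead.")
    }
  }
}
```

For example, `CharTypeCheck.failCharType(StructT(Seq("a" -> ArrayT(CharT(3)))))` throws, while `CharTypeCheck.failCharType(MapT(StringT, LongT))` returns normally.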

dongjoon-hyun · 2020-03-13T16:12:52Z

I support this approach in Apache Spark 3.0.0.
cc @marmbrus and @gatorsmile since this causes many migration failures.
cc @rxin since he is a release manager for 3.0.0.

dongjoon-hyun · 2020-03-13T16:15:26Z

BTW, @cloud-fan . I updated the PR description to give both examples of 2.4.5 and 3.0.0-preview2 to be fair.

dongjoon-hyun · 2020-03-13T16:19:04Z

docs/sql-migration-guide.md

    - You need to migrate your custom SerDes to Hive 2.3 or build your own Spark with `hive-1.2` profile. See HIVE-15167 for more details.

    - The decimal string representation can be different between Hive 1.2 and Hive 2.3 when using `TRANSFORM` operator in SQL for script transformation, which depends on hive's behavior. In Hive 1.2, the string representation omits trailing zeroes. But in Hive 2.3, it is always padded to 18 digits with trailing zeroes if necessary.

+  - Since Spark 3.0, columns of CHAR/VARCHAR type are not allowed in non-Hive tables, and CREATE/ALTER TABLE commands will fail if CHAR/VARCHAR type is detected. In Spark version 2.4 and earlier, CHAR/VARCHAR type are treated as STRING type and the length parameter is simply ignored.


BTW, VARCHAR is a little different and has more official documentation. Could you check them together?

$ git grep 'CHAR'
sql-data-sources-jdbc.md: The database column data types to use instead of the defaults, when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g: <code>"name CHAR(64), comments VARCHAR(1024)")</code>. The specified types should be valid spark sql data types. This option applies only to writing.
sql-ref-syntax-aux-describe-table.md: state VARCHAR(20),
sql-ref-syntax-aux-show-columns.md: name VARCHAR(100),
sql-ref-syntax-aux-show-tblproperties.md:CREATE TABLE customer(cust_code INT, name VARCHAR(100), cust_addr STRING)
sql-ref-syntax-dml-insert-into.md: CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
sql-ref-syntax-dml-load.md: CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT);

SparkQA · 2020-03-13T16:30:40Z

Test build #119762 has finished for PR 27902 at commit f2b5825.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-03-17T19:11:22Z

Hi, @rxin . What is your current opinion on this?

SparkQA · 2020-03-17T21:19:03Z

Test build #119940 has finished for PR 27902 at commit 478d164.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-20T14:04:06Z

Test build #120098 has finished for PR 27902 at commit 32b5023.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-20T14:39:45Z

Test build #120096 has finished for PR 27902 at commit 33efaac.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-03-20T17:42:24Z

This PR still bans VARCHAR. The failure comes from the following.

  test("self-join on a partitioned table should not trigger DPP") {
    withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
      SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST_ONLY.key -> "false",
      SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
      withTable("fact") {
        sql(
          s"""
             |CREATE TABLE fact (
             |  col1 varchar(14), col2 bigint, col3 bigint, col4 decimal(18,8), partCol1 varchar(1)
             |) USING $tableFormat PARTITIONED BY (partCol1)
        """.stripMargin)

SparkQA · 2020-03-23T11:02:35Z

Test build #120183 has finished for PR 27902 at commit ba28637.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-03-23T11:07:42Z

retest this please

SparkQA · 2020-03-23T16:33:00Z

Test build #120195 has finished for PR 27902 at commit ba28637.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-03-23T18:00:52Z

Retest this please.

SparkQA · 2020-03-23T22:22:03Z

Test build #120217 has finished for PR 27902 at commit ba28637.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-03-24T06:11:39Z

retest this please

SparkQA · 2020-03-24T07:05:01Z

Test build #120241 has finished for PR 27902 at commit ba28637.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-03-24T09:38:20Z

retest this please

SparkQA · 2020-03-24T15:59:11Z

Test build #120263 has finished for PR 27902 at commit ba28637.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

If we don't allow char, can we remove HiveStringType.replaceCharType?

SparkQA · 2020-03-25T14:59:26Z

Test build #120328 has finished for PR 27902 at commit 98095bb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @cloud-fan and @viirya .
Merged to master.

dongjoon-hyun · 2020-03-25T16:27:17Z

Hi, @cloud-fan . There was a conflict on branch-3.0. Could you make a backporting PR, please?

Closes #27902 from cloud-fan/char.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

cloud-fan · 2020-03-25T18:02:37Z

The conflict arises because branch-3.0 doesn't have REPLACE COLUMN. It was easy to fix; I've backported to 3.0 with local tests passing.

dongjoon-hyun · 2020-03-25T23:28:16Z

Thank you, @cloud-fan !

cloud-fan force-pushed the char branch from d6d57a5 to f2b5825 Compare March 13, 2020 13:49

dongjoon-hyun reviewed Mar 13, 2020

dongjoon-hyun added the SQL label Mar 13, 2020

dongjoon-hyun reviewed Mar 13, 2020

cloud-fan force-pushed the char branch from f2b5825 to 478d164 Compare March 17, 2020 16:42

dongjoon-hyun reviewed Mar 20, 2020

cloud-fan changed the title ~~[SPARK-31147][SQL] forbid CHAR/VARCHAR type in non-Hive tables~~ [SPARK-31147][SQL] forbid CHAR type in non-Hive-Serde tables Mar 20, 2020

cloud-fan force-pushed the char branch from 33efaac to 32b5023 Compare March 20, 2020 12:58

forbid CHAR type in non-Hive-Serde tables

ba28637

cloud-fan force-pushed the char branch from 32b5023 to ba28637 Compare March 23, 2020 07:43

HyukjinKwon changed the title ~~[SPARK-31147][SQL] forbid CHAR type in non-Hive-Serde tables~~ [SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables Mar 25, 2020

viirya reviewed Mar 25, 2020

viirya reviewed Mar 25, 2020

typo

98095bb

dongjoon-hyun approved these changes Mar 25, 2020

dongjoon-hyun closed this in 4f274a4 Mar 25, 2020
