Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables #27902

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Mar 13, 2020

What changes were proposed in this pull request?

Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as STRING type for non-Hive tables.

However, this leads to confusing behaviors

Apache Spark 3.0.0-preview2

spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a 	2

Apache Spark 2.4.5

spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a  	3

According to the SQL standard, CHAR(3) should guarantee all the values are of length 3. Since CHAR(3) is treated as STRING so Spark doesn't guarantee it.

This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

Why are the changes needed?

avoid confusing/wrong behavior

Does this PR introduce any user-facing change?

yes, now users can't create/alter non-Hive tables with CHAR type.

How was this patch tested?

new tests

@cloud-fan
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Mar 13, 2020

Test build #119751 has finished for PR 27902 at commit cdbfea3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 13, 2020

Test build #119756 has finished for PR 27902 at commit d6d57a5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


def failCharType(dt: DataType): Unit = {
if (HiveStringType.containsCharType(dt)) {
throw new AnalysisException("Cannot use CHAR/VARCHAR type in non-Hive tables.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we have a directional warning like Use STRING type instead of CHAR/VARCHAR type in non-Hive tables.?

@dongjoon-hyun
Copy link
Member

I support this approach in Apache Spark 3.0.0.
cc @marmbrus and @gatorsmile since this causes many migration failures.
cc @rxin since he is a release manager for 3.0.0.

@dongjoon-hyun
Copy link
Member

BTW, @cloud-fan . I updated the PR description to give both examples of 2.4.5 and 3.0.0-preview2 to be fair.

- You need to migrate your custom SerDes to Hive 2.3 or build your own Spark with `hive-1.2` profile. See HIVE-15167 for more details.

- The decimal string representation can be different between Hive 1.2 and Hive 2.3 when using `TRANSFORM` operator in SQL for script transformation, which depends on hive's behavior. In Hive 1.2, the string representation omits trailing zeroes. But in Hive 2.3, it is always padded to 18 digits with trailing zeroes if necessary.

- Since Spark 3.0, columns of CHAR/VARCHAR type are not allowed in non-Hive tables, and CREATE/ALTER TABLE commands will fail if CHAR/VARCHAR type is detected. In Spark version 2.4 and earlier, CHAR/VARCHAR type are treated as STRING type and the length parameter is simply ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, VARCHAR is a little different and have more official documents. Could you check them together?

$ git grep 'CHAR'
sql-data-sources-jdbc.md:     The database column data types to use instead of the defaults, when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g: <code>"name CHAR(64), comments VARCHAR(1024)")</code>. The specified types should be valid spark sql data types. This option applies only to writing.
sql-ref-syntax-aux-describe-table.md:    state VARCHAR(20),
sql-ref-syntax-aux-show-columns.md:  name VARCHAR(100),
sql-ref-syntax-aux-show-tblproperties.md:CREATE TABLE customer(cust_code INT, name VARCHAR(100), cust_addr STRING)
sql-ref-syntax-dml-insert-into.md: CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
sql-ref-syntax-dml-load.md: CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT);

@SparkQA
Copy link

SparkQA commented Mar 13, 2020

Test build #119762 has finished for PR 27902 at commit f2b5825.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Hi, @rxin . What is your current opinion on this?

@SparkQA
Copy link

SparkQA commented Mar 17, 2020

Test build #119940 has finished for PR 27902 at commit 478d164.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-31147][SQL] forbid CHAR/VARCHAR type in non-Hive tables [SPARK-31147][SQL] forbid CHAR type in non-Hive-Serde tables Mar 20, 2020
@SparkQA
Copy link

SparkQA commented Mar 20, 2020

Test build #120098 has finished for PR 27902 at commit 32b5023.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 20, 2020

Test build #120096 has finished for PR 27902 at commit 33efaac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

This PR still bans VARCHAR. The failure comes from the following.

  test("self-join on a partitioned table should not trigger DPP") {
    withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
      SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST_ONLY.key -> "false",
      SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
      withTable("fact") {
        sql(
          s"""
             |CREATE TABLE fact (
             |  col1 varchar(14), col2 bigint, col3 bigint, col4 decimal(18,8), partCol1 varchar(1)
             |) USING $tableFormat PARTITIONED BY (partCol1)
        """.stripMargin)

@SparkQA
Copy link

SparkQA commented Mar 23, 2020

Test build #120183 has finished for PR 27902 at commit ba28637.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 23, 2020

Test build #120195 has finished for PR 27902 at commit ba28637.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Mar 23, 2020

Test build #120217 has finished for PR 27902 at commit ba28637.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 24, 2020

Test build #120241 has finished for PR 27902 at commit ba28637.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 24, 2020

Test build #120263 has finished for PR 27902 at commit ba28637.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-31147][SQL] forbid CHAR type in non-Hive-Serde tables [SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables Mar 25, 2020
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't allow char, can we remove HiveStringType.replaceCharType?

docs/sql-migration-guide.md Outdated Show resolved Hide resolved
@SparkQA
Copy link

SparkQA commented Mar 25, 2020

Test build #120328 has finished for PR 27902 at commit 98095bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @cloud-fan and @viirya .
Merged to master.

@dongjoon-hyun
Copy link
Member

Hi, @cloud-fan . There was a conflict on branch-3.0. Could you make a backporting PR, please?

cloud-fan added a commit that referenced this pull request Mar 25, 2020
Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as STRING type for non-Hive tables.

However, this leads to confusing behaviors

**Apache Spark 3.0.0-preview2**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a 	2
```

**Apache Spark 2.4.5**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a  	3
```

According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING so Spark doesn't guarantee it.

This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

avoid confusing/wrong behavior

yes, now users can't create/alter non-Hive tables with CHAR type.

new tests

Closes #27902 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@cloud-fan
Copy link
Contributor Author

the conflict is caused by that 3.0 doesn't have REPLACE COLUMN. It's easy to fix, I've backported to 3.0 with local test passing

@dongjoon-hyun
Copy link
Member

Thank you, @cloud-fan !

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as STRING type for non-Hive tables.

However, this leads to confusing behaviors

**Apache Spark 3.0.0-preview2**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a 	2
```

**Apache Spark 2.4.5**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a  	3
```

According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING so Spark doesn't guarantee it.

This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

### Why are the changes needed?

avoid confusing/wrong behavior

### Does this PR introduce any user-facing change?

yes, now users can't create/alter non-Hive tables with CHAR type.

### How was this patch tested?

new tests

Closes apache#27902 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants