[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables #27902
Conversation
Test build #119751 has finished for PR 27902 at commit
Test build #119756 has finished for PR 27902 at commit
```scala
def failCharType(dt: DataType): Unit = {
  if (HiveStringType.containsCharType(dt)) {
    throw new AnalysisException("Cannot use CHAR/VARCHAR type in non-Hive tables.")
  }
}
```
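For context, a minimal sketch of what a recursive check like `HiveStringType.containsCharType` presumably does: walk a nested type and report whether any leaf is a CHAR type. This is an illustrative Python re-implementation, and the tuple-based type encoding is hypothetical — it is not Spark's actual `DataType` classes.

```python
# Hypothetical tuple encoding of nested SQL types, e.g.:
#   ("char", 3), ("string",), ("array", elem_type),
#   ("map", key_type, value_type), ("struct", [field_types]).

def contains_char_type(dt) -> bool:
    """True if dt, or any type nested inside it, is a CHAR type."""
    kind = dt[0]
    if kind == "char":
        return True
    if kind == "array":
        return contains_char_type(dt[1])
    if kind == "map":
        return contains_char_type(dt[1]) or contains_char_type(dt[2])
    if kind == "struct":
        return any(contains_char_type(f) for f in dt[1])
    return False  # string, int, and other leaf types

def fail_char_type(dt) -> None:
    """Sketch of the guard: reject any schema containing CHAR."""
    if contains_char_type(dt):
        raise ValueError("Cannot use CHAR/VARCHAR type in non-Hive tables.")
```

The recursion matters because a CHAR can hide inside an array, map, or struct column, not just at the top level of the schema.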
Shall we have a directional warning like `Use STRING type instead of CHAR/VARCHAR type in non-Hive tables.`?
I support this approach in Apache Spark 3.0.0.
BTW, @cloud-fan. I updated the PR description to give both examples of 2.4.5 and 3.0.0-preview2 to be fair.
docs/sql-migration-guide.md
Outdated
- You need to migrate your custom SerDes to Hive 2.3 or build your own Spark with `hive-1.2` profile. See HIVE-15167 for more details.
- The decimal string representation can be different between Hive 1.2 and Hive 2.3 when using the `TRANSFORM` operator in SQL for script transformation, which depends on Hive's behavior. In Hive 1.2, the string representation omits trailing zeroes. But in Hive 2.3, it is always padded to 18 digits with trailing zeroes if necessary.
- Since Spark 3.0, columns of CHAR/VARCHAR type are not allowed in non-Hive tables, and CREATE/ALTER TABLE commands will fail if CHAR/VARCHAR type is detected. In Spark version 2.4 and earlier, CHAR/VARCHAR types are treated as STRING type and the length parameter is simply ignored.
BTW, VARCHAR is a little different and has more official documents. Could you check them together?
```
$ git grep 'CHAR'
sql-data-sources-jdbc.md: The database column data types to use instead of the defaults, when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g: <code>"name CHAR(64), comments VARCHAR(1024)")</code>. The specified types should be valid spark sql data types. This option applies only to writing.
sql-ref-syntax-aux-describe-table.md: state VARCHAR(20),
sql-ref-syntax-aux-show-columns.md: name VARCHAR(100),
sql-ref-syntax-aux-show-tblproperties.md:CREATE TABLE customer(cust_code INT, name VARCHAR(100), cust_addr STRING)
sql-ref-syntax-dml-insert-into.md: CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
sql-ref-syntax-dml-load.md: CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT);
```
Test build #119762 has finished for PR 27902 at commit
Hi, @rxin. What is your current opinion on this?
Test build #119940 has finished for PR 27902 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
Test build #120098 has finished for PR 27902 at commit
Test build #120096 has finished for PR 27902 at commit
This PR still bans
Test build #120183 has finished for PR 27902 at commit
retest this please
Test build #120195 has finished for PR 27902 at commit
Retest this please.
Test build #120217 has finished for PR 27902 at commit
retest this please
Test build #120241 has finished for PR 27902 at commit
retest this please
Test build #120263 has finished for PR 27902 at commit
If we don't allow char, can we remove `HiveStringType.replaceCharType`?
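For readers following the review, a sketch of what a helper like `replaceCharType` does: rewrite every CHAR/VARCHAR leaf of a nested type to STRING. This is an illustrative Python re-implementation with a hypothetical tuple encoding of types, not Spark's actual code; whether such a helper can now be dropped is exactly the question raised above.

```python
def replace_char_type(dt):
    """Return dt with every CHAR/VARCHAR leaf rewritten to STRING."""
    kind = dt[0]
    if kind in ("char", "varchar"):
        return ("string",)
    if kind == "array":                      # ("array", elem_type)
        return ("array", replace_char_type(dt[1]))
    if kind == "map":                        # ("map", key_type, value_type)
        return ("map", replace_char_type(dt[1]), replace_char_type(dt[2]))
    if kind == "struct":                     # ("struct", [field_types])
        return ("struct", [replace_char_type(f) for f in dt[1]])
    return dt                                # other leaf types unchanged
```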
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
Test build #120328 has finished for PR 27902 at commit
+1, LGTM. Thank you, @cloud-fan and @viirya .
Merged to master.
Hi, @cloud-fan. There was a conflict on branch-3.0. Could you make a backporting PR, please?
### What changes were proposed in this pull request?

Spark introduced CHAR type for Hive compatibility, but it only works for Hive tables. CHAR type was never documented and is treated as STRING type for non-Hive tables. However, this leads to confusing behaviors:

**Apache Spark 3.0.0-preview2**
```
spark-sql> CREATE TABLE t(a CHAR(3));
spark-sql> INSERT INTO TABLE t SELECT 'a ';
spark-sql> SELECT a, length(a) FROM t;
a 2
```

**Apache Spark 2.4.5**
```
spark-sql> CREATE TABLE t(a CHAR(3));
spark-sql> INSERT INTO TABLE t SELECT 'a ';
spark-sql> SELECT a, length(a) FROM t;
a 3
```

According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING, Spark doesn't guarantee it. This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

### Why are the changes needed?

Avoid confusing/wrong behavior.

### Does this PR introduce any user-facing change?

Yes, now users can't create/alter non-Hive tables with CHAR type.

### How was this patch tested?

New tests.

Closes #27902 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
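To illustrate the SQL-standard semantics the description appeals to: `CHAR(n)` blank-pads stored values to exactly `n` characters, so `length()` always returns `n` — which is what the Spark 2.4.5 output reflects. A minimal sketch of that padding rule (hypothetical helper, not Spark code; real engines also have engine-specific rules for trailing spaces and over-long input):

```python
def store_char(value: str, n: int) -> str:
    """SQL-standard CHAR(n) semantics: blank-pad to exactly n characters.

    Inserting 'a ' (length 2) into CHAR(3) yields 'a  ' (length 3);
    values longer than n are rejected in this strict sketch.
    """
    if len(value) > n:
        raise ValueError(f"value of length {len(value)} exceeds CHAR({n})")
    return value.ljust(n)  # pad on the right with spaces up to n
```

Treating the column as plain STRING, as 3.0.0-preview2 does, skips this padding entirely, which is why `length(a)` returns 2 there.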
The conflict is caused by the fact that branch-3.0 doesn't have REPLACE COLUMN. It's easy to fix; I've backported to 3.0 with local tests passing.
Thank you, @cloud-fan!