Skip to content

Commit 5adbc03

Browse files
committed
[SPARK-48816][SQL] Shorthand for interval converters in UnivocityParser
### What changes were proposed in this pull request? Directly call `IntervalUtils.castStringToDTInterval/castStringToYMInterval` instead of creating Cast expressions to evaluate. - Benchmarks indicated a 10% time-saving. - Bad record recording might not work if the cast handles the exceptions early ### Why are the changes needed? - pref improvement - Bugfix for bad record recording ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing existing tests and benchmark tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47227 from yaooqinn/SPARK-48816. Authored-by: Kent Yao <[email protected]> Signed-off-by: Kent Yao <[email protected]>
1 parent 82fd7bb commit 5adbc03

File tree

4 files changed

+134
-85
lines changed

4 files changed

+134
-85
lines changed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

+5-3
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,7 @@ import com.univocity.parsers.csv.CsvParser
2626
import org.apache.spark.SparkUpgradeException
2727
import org.apache.spark.internal.Logging
2828
import org.apache.spark.sql.catalyst.{InternalRow, NoopFilters, OrderedFilters}
29-
import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, ExprUtils, GenericInternalRow, Literal}
29+
import org.apache.spark.sql.catalyst.expressions.{ExprUtils, GenericInternalRow}
3030
import org.apache.spark.sql.catalyst.util._
3131
import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
3232
import org.apache.spark.sql.errors.{ExecutionErrors, QueryExecutionErrors}
@@ -260,12 +260,14 @@ class UnivocityParser(
260260

261261
case ym: YearMonthIntervalType => (d: String) =>
262262
nullSafeDatum(d, name, nullable, options) { datum =>
263-
Cast(Literal(datum), ym).eval(EmptyRow)
263+
IntervalUtils.castStringToYMInterval(
264+
UTF8String.fromString(datum), ym.startField, ym.endField)
264265
}
265266

266267
case dt: DayTimeIntervalType => (d: String) =>
267268
nullSafeDatum(d, name, nullable, options) { datum =>
268-
Cast(Literal(datum), dt).eval(EmptyRow)
269+
IntervalUtils.castStringToDTInterval(
270+
UTF8String.fromString(datum), dt.startField, dt.endField)
269271
}
270272

271273
case udt: UserDefinedType[_] =>

sql/core/benchmarks/CSVBenchmark-jdk21-results.txt

+48-41
Original file line numberDiff line numberDiff line change
@@ -2,69 +2,76 @@
22
Benchmark to measure CSV read/write performance
33
================================================================================================
44

5-
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
5+
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
66
AMD EPYC 7763 64-Core Processor
77
Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
88
------------------------------------------------------------------------------------------------------------------------
9-
One quoted string 23353 23432 75 0.0 467067.4 1.0X
9+
One quoted string 23962 24182 316 0.0 479231.3 1.0X
1010

11-
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
11+
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
1212
AMD EPYC 7763 64-Core Processor
1313
Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
1414
------------------------------------------------------------------------------------------------------------------------
15-
Select 1000 columns 56825 57244 679 0.0 56825.1 1.0X
16-
Select 100 columns 20482 20568 86 0.0 20481.7 2.8X
17-
Select one column 16968 17000 36 0.1 16967.7 3.3X
18-
count() 3366 3378 11 0.3 3366.4 16.9X
19-
Select 100 columns, one bad input field 28347 28379 30 0.0 28346.6 2.0X
20-
Select 100 columns, corrupt record field 32401 32450 42 0.0 32401.2 1.8X
15+
Select 1000 columns 56724 57115 570 0.0 56724.1 1.0X
16+
Select 100 columns 20740 20855 115 0.0 20739.7 2.7X
17+
Select one column 17304 17377 114 0.1 17304.3 3.3X
18+
count() 3719 3740 21 0.3 3719.0 15.3X
19+
Select 100 columns, one bad input field 24943 24999 69 0.0 24943.2 2.3X
20+
Select 100 columns, corrupt record field 28306 28341 31 0.0 28306.2 2.0X
2121

22-
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
22+
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
2323
AMD EPYC 7763 64-Core Processor
2424
Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
2525
------------------------------------------------------------------------------------------------------------------------
26-
Select 10 columns + count() 11174 11195 18 0.9 1117.4 1.0X
27-
Select 1 column + count() 7666 7694 24 1.3 766.6 1.5X
28-
count() 2042 2048 5 4.9 204.2 5.5X
26+
Select 10 columns + count() 10977 10982 5 0.9 1097.7 1.0X
27+
Select 1 column + count() 7406 7554 131 1.4 740.6 1.5X
28+
count() 1550 1558 9 6.5 155.0 7.1X
2929

30-
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
30+
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
3131
AMD EPYC 7763 64-Core Processor
3232
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
3333
------------------------------------------------------------------------------------------------------------------------
34-
Create a dataset of timestamps 854 882 27 11.7 85.4 1.0X
35-
to_csv(timestamp) 6166 6174 13 1.6 616.6 0.1X
36-
write timestamps to files 6480 6575 158 1.5 648.0 0.1X
37-
Create a dataset of dates 948 949 1 10.6 94.8 0.9X
38-
to_csv(date) 4471 4474 3 2.2 447.1 0.2X
39-
write dates to files 4599 4616 15 2.2 459.9 0.2X
34+
Create a dataset of timestamps 845 847 3 11.8 84.5 1.0X
35+
to_csv(timestamp) 5546 5597 57 1.8 554.6 0.2X
36+
write timestamps to files 5760 5768 8 1.7 576.0 0.1X
37+
Create a dataset of dates 1053 1064 9 9.5 105.3 0.8X
38+
to_csv(date) 4115 4122 9 2.4 411.5 0.2X
39+
write dates to files 4102 4108 5 2.4 410.2 0.2X
4040

41-
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
41+
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
4242
AMD EPYC 7763 64-Core Processor
4343
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
4444
-----------------------------------------------------------------------------------------------------------------------------------------------------
45-
read timestamp text from files 1200 1213 12 8.3 120.0 1.0X
46-
read timestamps from files 11576 11601 22 0.9 1157.6 0.1X
47-
infer timestamps from files 23234 23253 16 0.4 2323.4 0.1X
48-
read date text from files 1115 1162 44 9.0 111.5 1.1X
49-
read date from files 10978 11006 43 0.9 1097.8 0.1X
50-
infer date from files 22588 22604 13 0.4 2258.8 0.1X
51-
timestamp strings 1224 1236 21 8.2 122.4 1.0X
52-
parse timestamps from Dataset[String] 13566 13595 41 0.7 1356.6 0.1X
53-
infer timestamps from Dataset[String] 25057 25094 36 0.4 2505.7 0.0X
54-
date strings 1618 1626 7 6.2 161.8 0.7X
55-
parse dates from Dataset[String] 12784 12816 34 0.8 1278.4 0.1X
56-
from_csv(timestamp) 12008 12088 69 0.8 1200.8 0.1X
57-
from_csv(date) 11930 11938 12 0.8 1193.0 0.1X
58-
infer error timestamps from Dataset[String] with default format 14366 14394 35 0.7 1436.6 0.1X
59-
infer error timestamps from Dataset[String] with user-provided format 14380 14412 52 0.7 1438.0 0.1X
60-
infer error timestamps from Dataset[String] with legacy format 14439 14453 21 0.7 1443.9 0.1X
45+
read timestamp text from files 1107 1119 16 9.0 110.7 1.0X
46+
read timestamps from files 9511 9553 49 1.1 951.1 0.1X
47+
infer timestamps from files 19084 19114 27 0.5 1908.4 0.1X
48+
read date text from files 1036 1046 14 9.7 103.6 1.1X
49+
read date from files 8299 8309 15 1.2 829.9 0.1X
50+
infer date from files 17290 17294 4 0.6 1729.0 0.1X
51+
timestamp strings 1188 1197 7 8.4 118.8 0.9X
52+
parse timestamps from Dataset[String] 11442 11458 14 0.9 1144.2 0.1X
53+
infer timestamps from Dataset[String] 21076 21116 39 0.5 2107.6 0.1X
54+
date strings 1651 1659 10 6.1 165.1 0.7X
55+
parse dates from Dataset[String] 10181 10186 5 1.0 1018.1 0.1X
56+
from_csv(timestamp) 10023 10062 34 1.0 1002.3 0.1X
57+
from_csv(date) 9335 9351 15 1.1 933.5 0.1X
58+
infer error timestamps from Dataset[String] with default format 11187 11205 16 0.9 1118.7 0.1X
59+
infer error timestamps from Dataset[String] with user-provided format 11201 11216 13 0.9 1120.1 0.1X
60+
infer error timestamps from Dataset[String] with legacy format 11210 11227 17 0.9 1121.0 0.1X
6161

62-
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
62+
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
6363
AMD EPYC 7763 64-Core Processor
6464
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
6565
------------------------------------------------------------------------------------------------------------------------
66-
w/o filters 4302 4383 137 0.0 43020.6 1.0X
67-
pushdown disabled 4206 4220 13 0.0 42058.8 1.0X
68-
w/ filters 776 784 10 0.1 7756.3 5.5X
66+
w/o filters 4365 4377 13 0.0 43653.8 1.0X
67+
pushdown disabled 4348 4370 22 0.0 43477.7 1.0X
68+
w/ filters 695 713 29 0.1 6950.2 6.3X
69+
70+
OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
71+
AMD EPYC 7763 64-Core Processor
72+
Interval: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
73+
------------------------------------------------------------------------------------------------------------------------
74+
Read as Intervals 7089 7096 7 0.4 2362.1 1.0X
75+
Read Raw Strings 2071 2075 6 1.4 690.1 3.4X
6976

7077

0 commit comments

Comments
 (0)