[AURON #2080] Support Hive Parquet table to NativeParquetHiveTableScanExec by guixiaowen · Pull Request #2081 · apache/auron

guixiaowen · 2026-03-10T08:00:36Z

Which issue does this PR close?

Closes #2080

Rationale for this change

When the Spark execution plan contains a HiveTableScanExec and the table's storage type is Parquet, convert it so that the native execution plan NativeParquetHiveTableScanExec runs instead.

Two configuration parameters control the conversion:

spark.auron.enable.hiveTable: default is true. When set to true, this conversion is enabled.
spark.auron.enable.parquetHiveTableScanExec: default is false. When set to true, this conversion is enabled.

What changes are included in this PR?

Are there any user-facing changes?

No.

How was this patch tested?

Unit tests (UT).


merrily01

Hi @guixiaowen

Could you take a look and get the CI fixed?

https://github.com/apache/auron/actions/runs/26717696445/job/82051444906?pr=2081

weiqingy

Thanks for taking this on — reusing NativeHiveTableScanBase and wiring the scan in through the AuronConvertProvider SPI (mirroring PaimonConvertProvider) is a clean fit. A few questions inline.

weiqingy · 2026-06-22T06:39:51Z

+import org.apache.spark.sql.hive.execution.HiveTableScanExec
+
+class HiveConvertProvider extends AuronConvertProvider with Logging {
+  override def isEnabled: Boolean =


AuronConvertProvider declares isEnabled(exec: SparkPlan): Boolean, and the dispatcher calls it with the plan (AuronConverters.scala:216 and :307). This override drops the parameter, so it doesn't implement the trait method — the abstract isEnabled(SparkPlan) stays unimplemented and override matches nothing, which the compiler rejects. That looks like the CI break flagged above. PaimonConvertProvider keeps the parameter and branches inside — would matching that signature work here?

override def isEnabled(exec: SparkPlan): Boolean = getBooleanConf("spark.auron.enable.hiveTable", defaultValue = true)
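
To illustrate the signature point, here is a minimal sketch with stand-in types. The `SparkPlan` and `AuronConvertProvider` declared below are simplified placeholders, not the real Spark and Auron definitions, and `getBooleanConf` is a hypothetical lookup backed by system properties rather than the helper used in the PR.

```scala
// Stand-ins for illustration only; the real SparkPlan and AuronConvertProvider
// live in Spark and Auron and carry more members than shown here.
trait SparkPlan

trait AuronConvertProvider {
  def isEnabled(exec: SparkPlan): Boolean
}

object SignatureSketch {
  // Hypothetical conf lookup standing in for getBooleanConf.
  def getBooleanConf(key: String, defaultValue: Boolean): Boolean =
    sys.props.get(key).map(_.toBoolean).getOrElse(defaultValue)

  class HiveConvertProvider extends AuronConvertProvider {
    // Keeps the SparkPlan parameter, so this overrides the abstract method.
    // Writing `override def isEnabled: Boolean = ...` instead would not compile:
    // it overrides nothing, and isEnabled(SparkPlan) would stay unimplemented.
    override def isEnabled(exec: SparkPlan): Boolean =
      getBooleanConf("spark.auron.enable.hiveTable", defaultValue = true)
  }
}
```

The plan argument is unused in this sketch; keeping it is only about matching the trait's method signature.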

weiqingy · 2026-06-22T06:39:51Z

+  private def getMaxSplitBytes(sparkSession: SparkSession): Long = {
+    val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
+    val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
+    Math.min(defaultMaxSplitBytes, openCostInBytes)


filesMaxPartitionBytes defaults to 128MB and filesOpenCostInBytes to 4MB, so Math.min here always returns the open-cost (4MB) and every split is capped at it — which would fan a large table out into many tiny partitions. Spark computes the cap as min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)) rather than a straight min of the two. Was min intended here? FilePartition.maxSplitBytes already implements that formula if it helps, though note it's keyed on Seq[PartitionDirectory] rather than the Seq[PartitionedFile] you have here, so it's not a literal drop-in.
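
For comparison, a self-contained sketch of the two formulas with plain numbers and no Spark dependency. The names `straightMin`, `sparkStyle`, `fileSizes` and `minPartitionNum` are chosen for illustration, and the `bytesPerCore` computation (file lengths padded by the open cost, divided by the minimum partition count) is my reading of what Spark does, so treat it as an assumption rather than a copy of `FilePartition.maxSplitBytes`.

```scala
object SplitBytesSketch {
  val MiB: Long = 1024L * 1024L

  // Straight min of the two settings, as in the diff under review.
  def straightMin(maxPartitionBytes: Long, openCostInBytes: Long): Long =
    math.min(maxPartitionBytes, openCostInBytes)

  // min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)), where
  // bytesPerCore spreads the open-cost-padded total over minPartitionNum.
  def sparkStyle(
      maxPartitionBytes: Long,
      openCostInBytes: Long,
      fileSizes: Seq[Long],
      minPartitionNum: Int): Long = {
    require(minPartitionNum > 0, "minPartitionNum must be positive")
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / minPartitionNum
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

With the default settings (128 MiB and 4 MiB), ten 1 GiB files and a minimum partition count of 8, `sparkStyle` yields the 128 MiB cap, while `straightMin` yields 4 MiB regardless of table size. For a single 1 MiB file, `sparkStyle` also falls back to 4 MiB.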

weiqingy · 2026-06-22T06:39:51Z

+  private def enableHiveTableScanExec: Boolean =
+    getBooleanConf("spark.auron.enable.parquetHiveTableScanExec", defaultValue = false)
+
+  override def isSupported(exec: SparkPlan): Boolean =


isSupported returns true for any table whose provider is hive, but convert only builds a native plan when isParquetTable holds and otherwise returns exec unchanged (line 49). So a non-Parquet Hive table (ORC, text) with both flags on would be reported as supported and then handed back un-converted, instead of falling through to the normal conversion path. Would gating isSupported on isParquetTable too keep the two in lockstep?
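
One way to keep the two in lockstep is to route both through a single predicate. The sketch below models that with stand-in types: `HiveScan`, `canConvert` and the `enabled` flag are hypothetical and simplified, and the real check would inspect the `HiveTableScanExec` and read the Auron configuration instead.

```scala
object LockstepSketch {
  // Stand-in for a Hive table scan node; fields are illustrative only.
  final case class HiveScan(provider: String, isParquetTable: Boolean)

  // Single predicate shared by isSupported and convert.
  private def canConvert(enabled: Boolean, scan: HiveScan): Boolean =
    enabled && scan.provider == "hive" && scan.isParquetTable

  def isSupported(enabled: Boolean, scan: HiveScan): Boolean =
    canConvert(enabled, scan)

  // Returns the native plan name only when isSupported is also true;
  // None means the scan falls through to the normal conversion path.
  def convert(enabled: Boolean, scan: HiveScan): Option[String] =
    if (canConvert(enabled, scan)) Some("NativeParquetHiveTableScanExec") else None
}
```

Under this shape, an ORC or text Hive table is reported as unsupported and never reaches `convert`, so there is no case where a plan is claimed and then handed back unchanged.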

weiqingy · 2026-06-22T06:39:51Z

                  </excludes>
                </relocation>
-                <relocation>
-                  <pattern>io.netty</pattern>


This io.netty relocation was added in #1597 to fix NoClassDefFoundError: io/netty/buffer/Unpooled under shaded-spark. That change bundled netty-buffer/netty-common into the assembly and relocated them so they wouldn't clash. This PR removes the relocation but keeps those bundled deps (pom.xml:48-55), so netty ships at its original io.netty.* coordinates again, which is the state #1597 set out to fix. Does dropping it reopen that shaded-spark conflict? Since the relocation only affects the shaded assembly, not the unit tests or the Hive Parquet path here, it'd help to know what in this PR needs it gone, along with the arrow-memory-netty exclusions added to the shims pom.xml.

github-actions Bot added spark build thirdparty-iceberg labels Mar 10, 2026

guixiaowen marked this pull request as draft March 10, 2026 08:01

github-actions Bot removed the thirdparty-iceberg label Mar 10, 2026

guixiaowen force-pushed the hiveTableScanExec branch 2 times, most recently from 69c67ac to 2cafb1d Compare March 10, 2026 08:15

guixiaowen changed the title ~~[AURON #2080]Support Hive Parquet table to native~~ [AURON #2080] Support Hive Parquet table to native Mar 10, 2026

guixiaowen changed the title ~~[AURON #2080] Support Hive Parquet table to native~~ [AURON #2080] Support Hive Parquet table to NativeParquetHiveTableScanExec Mar 10, 2026

github-actions Bot added thirdparty-uniffle and removed thirdparty-uniffle labels Mar 11, 2026

guixiaowen force-pushed the hiveTableScanExec branch from 6b0631f to 3ade2b1 Compare April 22, 2026 05:36

github-actions Bot added the thirdparty-uniffle label Apr 22, 2026

guixiaowen force-pushed the hiveTableScanExec branch 4 times, most recently from 9fa8a84 to 8c67b23 Compare April 28, 2026 12:15

github-actions Bot added the dev-tools label Apr 28, 2026

guixiaowen force-pushed the hiveTableScanExec branch from 8c67b23 to 9fa8a84 Compare April 28, 2026 12:18

github-actions Bot added dev-tools and removed dev-tools labels Apr 28, 2026

guixiaowen force-pushed the hiveTableScanExec branch 2 times, most recently from 3849d4c to fe6a8bd Compare April 28, 2026 12:32

guixiaowen marked this pull request as ready for review April 29, 2026 11:50

guixiaowen force-pushed the hiveTableScanExec branch from 9842c63 to a30703b Compare April 29, 2026 11:53

github-actions Bot removed the thirdparty-uniffle label Apr 30, 2026

github-actions Bot added dev-tools and removed dev-tools labels May 7, 2026

[AURON apache#2080] Support Hive Parquet table to NativeParquetHiveTa…

bd4ecca

…bleScanExec

guixiaowen force-pushed the hiveTableScanExec branch from 2013481 to bd4ecca Compare May 31, 2026 16:11

SteNicholas assigned richox, cxzl25 and merrily01 Jun 15, 2026

merrily01 reviewed Jun 18, 2026


weiqingy reviewed Jun 22, 2026


guixiaowen wants to merge 1 commit into apache:master from guixiaowen:hiveTableScanExec

5 participants
