[AURON #2080] Support Hive Parquet table to NativeParquetHiveTableScanExec #2081

Open
guixiaowen wants to merge 1 commit into apache:master from guixiaowen:hiveTableScanExec
@guixiaowen guixiaowen commented Mar 10, 2026

Contributor

Which issue does this PR close?

Closes #2080

Rationale for this change

When the Spark execution plan contains a HiveTableScanExec and the table's storage format is Parquet, convert it so that the native execution plan NativeParquetHiveTableScanExec runs instead.

Two configuration parameters control the conversion:

spark.auron.enable.hiveTable: default is true. When set to true, Hive table conversion is enabled.
spark.auron.enable.parquetHiveTableScanExec: default is false. When set to true, conversion to NativeParquetHiveTableScanExec is enabled.

What changes are included in this PR?

Are there any user-facing changes?

No.

How was this patch tested?

Unit tests (UT).

@guixiaowen guixiaowen marked this pull request as draft March 10, 2026 08:01
@guixiaowen guixiaowen force-pushed the hiveTableScanExec branch 2 times, most recently from 69c67ac to 2cafb1d Compare March 10, 2026 08:15
@guixiaowen guixiaowen changed the title [AURON #2080]Support Hive Parquet table to native [AURON #2080] Support Hive Parquet table to native Mar 10, 2026
@guixiaowen guixiaowen changed the title [AURON #2080] Support Hive Parquet table to native [AURON #2080] Support Hive Parquet table to NativeParquetHiveTableScanExec Mar 10, 2026
@guixiaowen guixiaowen force-pushed the hiveTableScanExec branch 4 times, most recently from 9fa8a84 to 8c67b23 Compare April 28, 2026 12:15
@guixiaowen guixiaowen force-pushed the hiveTableScanExec branch 2 times, most recently from 3849d4c to fe6a8bd Compare April 28, 2026 12:32
@guixiaowen guixiaowen marked this pull request as ready for review April 29, 2026 11:50
@guixiaowen guixiaowen force-pushed the hiveTableScanExec branch from 2013481 to bd4ecca Compare May 31, 2026 16:11

@merrily01 merrily01 left a comment

Member


@weiqingy weiqingy left a comment

Contributor


Thanks for taking this on — reusing NativeHiveTableScanBase and wiring the scan in through the AuronConvertProvider SPI (mirroring PaimonConvertProvider) is a clean fit. A few questions inline.

import org.apache.spark.sql.hive.execution.HiveTableScanExec

class HiveConvertProvider extends AuronConvertProvider with Logging {
  override def isEnabled: Boolean =

Contributor


AuronConvertProvider declares isEnabled(exec: SparkPlan): Boolean, and the dispatcher calls it with the plan (AuronConverters.scala:216 and :307). This override drops the parameter, so it doesn't implement the trait method — the abstract isEnabled(SparkPlan) stays unimplemented and override matches nothing, which the compiler rejects. That looks like the CI break flagged above. PaimonConvertProvider keeps the parameter and branches inside — would matching that signature work here?

override def isEnabled(exec: SparkPlan): Boolean =
  getBooleanConf("spark.auron.enable.hiveTable", defaultValue = true)

private def getMaxSplitBytes(sparkSession: SparkSession): Long = {
  val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
  val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
  Math.min(defaultMaxSplitBytes, openCostInBytes)

Contributor


filesMaxPartitionBytes defaults to 128MB and filesOpenCostInBytes to 4MB, so Math.min here always returns the open-cost (4MB) and every split is capped at it — which would fan a large table out into many tiny partitions. Spark computes the cap as min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)) rather than a straight min of the two. Was min intended here? FilePartition.maxSplitBytes already implements that formula if it helps, though note it's keyed on Seq[PartitionDirectory] rather than the Seq[PartitionedFile] you have here, so it's not a literal drop-in.
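
For illustration, the formula described above can be sketched in plain Scala with no Spark dependency. The object name, the parameter names, and the way `bytesPerCore` is derived (total file bytes plus a per-file open cost, divided by a minimum partition count) are assumptions made for this sketch, not code from this PR or a copy of Spark's implementation.

```scala
object SplitSizeSketch {
  val MB: Long = 1024L * 1024L

  // Straight min of the two settings, as in the snippet under review.
  def straightMin(maxPartitionBytes: Long, openCostInBytes: Long): Long =
    math.min(maxPartitionBytes, openCostInBytes)

  // min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)).
  // Assumption: bytesPerCore = sum(fileSize + openCost) / minPartitionNum.
  def maxSplitBytes(
      maxPartitionBytes: Long,
      openCostInBytes: Long,
      fileSizes: Seq[Long],
      minPartitionNum: Int): Long = {
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / math.max(minPartitionNum, 1)
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

With the default settings (128MB and 4MB), `straightMin` yields 4MB regardless of table size, while `maxSplitBytes` for ten 1GB files and a minimum partition count of 8 yields 128MB. The open cost then acts as a floor for small inputs rather than as the cap.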

private def enableHiveTableScanExec: Boolean =
  getBooleanConf("spark.auron.enable.parquetHiveTableScanExec", defaultValue = false)

override def isSupported(exec: SparkPlan): Boolean =

Contributor


isSupported returns true for any table whose provider is hive, but convert only builds a native plan when isParquetTable holds and otherwise returns exec unchanged (line 49). So a non-Parquet Hive table (ORC, text) with both flags on would be reported as supported and then handed back un-converted, instead of falling through to the normal conversion path. Would gating isSupported on isParquetTable too keep the two in lockstep?
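
A minimal sketch of that gating, in plain Scala with hypothetical stand-ins: `TableInfo`, the string-based format check, and the string result of `convert` are all assumptions for illustration and do not reflect the real `SparkPlan` or catalog types used in the PR.

```scala
object SupportGateSketch {
  // Hypothetical stand-in for the table metadata the provider inspects.
  final case class TableInfo(provider: String, inputFormat: String)

  def isHiveTable(t: TableInfo): Boolean =
    t.provider.equalsIgnoreCase("hive")

  // Assumption: a Parquet table is recognised by its input format class name.
  def isParquetTable(t: TableInfo): Boolean =
    t.inputFormat.toLowerCase.contains("parquet")

  // Gate on the same predicate that convert uses, so the two stay in lockstep.
  def isSupported(t: TableInfo, scanEnabled: Boolean): Boolean =
    scanEnabled && isHiveTable(t) && isParquetTable(t)

  // Returns the name of the plan node that would be produced.
  def convert(t: TableInfo): String =
    if (isParquetTable(t)) "NativeParquetHiveTableScanExec" else "HiveTableScanExec"
}
```

With this shape, an ORC or text Hive table is reported as unsupported up front, so it can fall through to the normal conversion path instead of being claimed and returned unchanged.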

</excludes>
</relocation>
<relocation>
<pattern>io.netty</pattern>

Contributor


This io.netty relocation was added in #1597 to fix NoClassDefFoundError: io/netty/buffer/Unpooled under shaded-spark — that change bundled netty-buffer/netty-common into the assembly and relocated them so they wouldn't clash. This PR removes the relocation but keeps those bundled deps (pom.xml:48-55), so netty ships at its original io.netty.* coordinates again, which is the state #1597 fixed away. Does dropping it reopen that shaded-spark conflict? Since the relocation only affects the shaded assembly — not the unit tests or the Hive Parquet path here — it'd help to know what in this PR needs it gone, along with the arrow-memory-netty exclusions added to the shims pom.xml.
