[GLUTEN-12225][CORE] Fix arrow.c shading: exclude memory/vector packages so public API stays unshaded#12226
[GLUTEN-12225][CORE] Fix arrow.c shading: exclude memory/vector packages so public API stays unshaded#12226sezruby wants to merge 2 commits into
Conversation
…API stays unshaded The bundled Arrow C-Data classes (org.apache.arrow.c.*) are correctly excluded from relocation because their native JNI binds to the original class names. However, their public API signatures take and return org.apache.arrow.memory.* and org.apache.arrow.vector.* types, which were being relocated to org.apache.gluten.shaded.*. The result: bundled ArrowArrayStream/ArrowSchema/ArrowArray/Data classes are compiled against the shaded BufferAllocator/VectorSchemaRoot, so any caller passing a vanilla Apache Arrow allocator gets NoSuchMethodError. Triggered for any Spark workload that combines gluten with another library using Arrow C-Data (Iceberg's Arrow vector layer, Lance Java's writer, Snowflake JDBC's Arrow result decoder, etc.) when gluten's bundle wins classloader resolution against vanilla Arrow. Fix: extend the relocation excludes to also keep org.apache.arrow.memory.** and org.apache.arrow.vector.** unshaded. The bundled C-Data API now matches the public Apache Arrow API. Adds dev/check-arrow-c-shading.sh which runs javap on the produced bundle jar and asserts that public method signatures reference unshaded Arrow types. Wired into package/pom.xml's verify phase via exec-maven-plugin so regressions are caught in CI. Tested against the upstream gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.6.0.jar — script exits 1 with a clear diagnosis on the broken bundle. Closes apache#12225
|
@sezruby, thanks for the PR. This fix makes sense. I recall there's a related issue that occurs at compile time when an external project introduces the Gluten JAR as a dependency: a Scala type mismatch caused by the Maven Shade Plugin not rewriting ScalaSignature annotations. My understanding is this PR also fixes that case (see https://chungmin.hashnode.dev/unraveling-a-scala-type-mismatch-mystery). One small concern is potential Arrow version conflicts, since these packages are no longer shaded. That said, the memory and vector APIs should be stable across minor versions, so I assume the risk should be low in practice. |
| <exclude>org.apache.arrow.memory.**</exclude> | ||
| <exclude>org.apache.arrow.vector.**</exclude> |
There was a problem hiding this comment.
nit: Should we directly exclude all arrow packages? E.g., org.apache.arrow.*
There was a problem hiding this comment.
The full org.apache.arrow.* exclusion would lose gluten's isolation from the user's Arrow version everywhere, not just on the C-Data boundary. The C-Data classes have to be unshaded because their JNI native lib hardcodes the original class names; arrow.memory.* and arrow.vector.* follow because they appear in arrow.c.* public method signatures. Anything else under org.apache.arrow.* (flight, algorithm, adapter, etc.) is internal to gluten's columnar batch handling and safer to keep shaded so it doesn't conflict with user Arrow. The narrow exclusion is the minimum that makes the public C-Data API self-consistent without giving up isolation elsewhere.
This sounds a real risk. Moving forward, can we completely remove Arrow from the bundled Gluten Jar, and let user rely on Spark's bundled Arrow instead? I assume we don't have any customized Arrow code now: #12130 |
Worth doing, but a couple of things worth confirming first: #12130 removed the dead Arrow-CSV / Arrow-Dataset JVM paths. That's progress, but there are still ~34 files under If we drop the bundled Arrow:
So I think it's the right long-term direction. As an immediate fix this PR makes the current shading approach internally consistent (which is independently a valid bug fix, since the partial-shading is a latent bug regardless of whether Arrow gets unbundled later). Happy to take either path — let me know if you'd prefer I close this and pursue the Arrow-unbundling work instead, or merge this as the short-term fix and treat unbundling as a follow-up.
@philo-he Good link, the partial-shading also breaks downstream Scala consumers that pull the gluten jar as a dep, since their compile-time Arrow type doesn't match what gluten expects across the API boundary. This PR fixes that as a side effect by keeping the boundary types unshaded so they match the public Apache Arrow types every other Scala consumer compiles against. cc author @clee704 |
What changes were proposed in this pull request?
Extend
package/pom.xml'sorg.apache.arrowrelocation excludes to also keeporg.apache.arrow.memory.**andorg.apache.arrow.vector.**unshaded.The bundled Arrow C-Data classes (
org.apache.arrow.c.*) are correctly excluded from relocation because their native JNI binds to the original class names. However, their public API signatures take and returnorg.apache.arrow.memory.*andorg.apache.arrow.vector.*types — which were being relocated. The result: the bundledArrowArrayStream/ArrowSchema/ArrowArray/Dataclasses get compiled against the shadedBufferAllocator/VectorSchemaRoot, so any caller passing a vanilla Apache Arrow allocator hitsNoSuchMethodError.This affects any Spark workload that combines gluten with another library using Arrow C-Data (Iceberg's Arrow vector layer, Lance Java's writer, Snowflake JDBC's Arrow result decoder, etc.) when gluten's bundle wins classloader resolution against vanilla Arrow.
How was this patch tested?
Adds
dev/check-arrow-c-shading.shwhich runsjavapon the produced bundle jar and asserts that public method signatures reference unshaded Arrow types. Wired intopackage/pom.xml'sverifyphase viaexec-maven-pluginso regressions are caught in CI.Tested against the upstream
gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.6.0.jar:After applying the relocation exclude change, a freshly-built bundle should pass the same check (script exits 0). The repro from #12225 (3 lines calling
ArrowArrayStream.allocateNew(new RootAllocator(...))) goes fromNoSuchMethodErrortoOK.Closes
#12225