Upgrade Apache Calcite from 1.40.0 to 1.42.0#18658
Conversation
b8bf3a6 to
3cbe3eb
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18658 +/- ##
============================================
+ Coverage 64.45% 64.46% +0.01%
- Complexity 1282 1291 +9
============================================
Files 3352 3372 +20
Lines 207171 208583 +1412
Branches 32348 32573 +225
============================================
+ Hits 133534 134465 +931
- Misses 62910 63312 +402
- Partials 10727 10806 +79
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
xiangfu0
left a comment
There was a problem hiding this comment.
Found 1 high-signal issue; see inline comment.
| case UINTEGER: | ||
| // UBIGINT (0..2^64-1) has no wider signed type, so values above Long.MAX_VALUE wrap (two's-complement) - this is | ||
| // unavoidable and acceptable since Pinot has no unsigned storage type. | ||
| case UBIGINT: |
There was a problem hiding this comment.
Accepting UBIGINT here turns BIGINT UNSIGNED into a lossy signed LONG. Any value above Long.MAX_VALUE will now silently wrap, so this is a wrong-result regression rather than a harmless type downgrade. Since Pinot cannot represent the full unsigned 64-bit range, this needs to fail validation/planning instead of being mapped to LONG.
There was a problem hiding this comment.
Good catch — agreed, and fixed. BIGINT UNSIGNED (UBIGINT) now fails fast at planning instead of being mapped to a lossy LONG: convertToColumnDataType throws a clear "Unsigned BIGINT is not supported … CAST to BIGINT or DECIMAL instead" error.
I kept the other unsigned types accepted, since those map losslessly and don't have the wrap problem — TINYINT/SMALLINT UNSIGNED → INT, INTEGER UNSIGNED → LONG. UBIGINT is the only one with no signed Pinot type wide enough for its full 0..2⁶⁴−1 range, so it's the one that has to be rejected.
The rejection is applied consistently across every site that touched the type — both (P)RelToPlanNodeConverter.convertToColumnDataType, the single-stage DataTypeConversionFunctions.cast, TypeSystem.deriveSumType, and ArithmeticFunctionUtils.normalizeNumericType — and is covered by regression tests (QueryCompilationTest#testUnsignedBigintCastIsRejected for both the column- and literal-cast planning paths, plus updated RelToPlanNodeConverterTest/PRelToPlanNodeConverterTest/DataTypeConversionFunctionsTest unit assertions). PR description updated too. Thanks!
Bump calcite-core/babel to 1.42.0 (folds in both the 1.40->1.41 and 1.41->1.42 deltas, since master was never moved to 1.41) and pin joou-java-6 to 0.9.5 to resolve a dependency-convergence conflict between calcite-core and the transitive avatica-core 1.28.0. Sync the customized SQL parser (Parser.jj + the fmpp configs) to upstream 1.42, preserving all PINOT CUSTOMIZATION regions. The new babel feature flags (includeStarExclude, includeSelectBy, includeIntervalWithoutQualifier) are intentionally kept OFF: the grammar is synced but inactive, as the multi-stage engine has no downstream support for those features yet. The upstream colon-path field-access grammar was likewise synced and is inert under Pinot's BABEL conformance (isColonFieldAccessAllowed() returns false) - gated by conformance rather than an fmpp flag. The sync also makes the bitwise infix operators '<<', '&' and infix '^' parseable (upstream-stock 1.42 additions); they are not registered in PinotOperatorTable, so they are not yet supported end-to-end. Handle 1.42 behavioral changes: - CALCITE-7189: Validator disables non-strict GROUP BY (BABEL enables it in 1.41+, which NPEs for window functions combined with GROUP BY). - CALCITE-7379: PinotRelDecorrelator relaxes the post-decorrelation type assertion for the nullability-only divergence that still fires on some Pinot correlated-subquery shapes; it fails fast on structural changes. - CALCITE-7351: drop the now-final getMaxNumericScale/getMaxNumericPrecision overrides (they delegate to the type-specific overrides Pinot keeps). - Filtered MIN/MAX are now nullable via SqlOperatorBinding.hasEmptyGroup(). - Unsigned integer types (CALCITE-1466) parse under BABEL. The representable ones are mapped to the narrowest lossless signed type (TINYINT/SMALLINT UNSIGNED -> INT, INTEGER UNSIGNED -> LONG) throughout (converters, literal folding, SUM return type, arithmetic normalization). BIGINT UNSIGNED (UBIGINT) has no signed type wide enough for its full 0..2^64-1 range, so it is rejected at planning rather than silently wrapping values above Long.MAX_VALUE. Regenerate the EXPLAIN plan snapshots and update/extend the affected tests.
3cbe3eb to
a8c9fe5
Compare
Summary
Upgrades Apache Calcite from 1.40.0 to 1.42.0. Pinot's master was never moved to 1.41, so this folds in both the 1.40→1.41 and 1.41→1.42 deltas in a single bump.
The bulk of the upgrade is a faithful re-sync of Pinot's customized SQL parser grammar to upstream 1.42, plus a handful of targeted workarounds for behavioral changes Calcite introduced across these two releases. No public Pinot API or wire/segment format changes.
Changes
Dependency
pom.xml:calcite.version→1.42.0(coverscalcite-coreandcalcite-babel).joou-java-6to0.9.5independencyManagementto resolve a dependency-convergence conflict:calcite-core1.42 pullsjoou-java-60.9.5 while the transitiveavatica-core1.28.0 still wants 0.9.4.LICENSE-binary: bumped the Calcite/Avatica entries to 1.42.0 / 1.28.0 (calcite-core,calcite-babel,calcite-linq4j,avatica-core,avatica-metrics) and addedorg.jooq:joou-java-6:0.9.5— matching the binary distribution'sDEPENDENCIESmanifest.SQL parser codegen sync (
pinot-common/src/main/codegen)templates/Parser.jj,config.fmpp, anddefault_config.fmppto upstream Calcite 1.42, preserving everyPINOT CUSTOMIZATIONregion.includeStarExclude(SELECT * EXCLUDE/REPLACE),includeSelectBy(SELECT ... BY), andincludeIntervalWithoutQualifier— are intentionally kept OFF. The grammar is synced but inactive: the multi-stage engine has no downstream support for these features, so enabling them would parse syntax the planner/runtime can't execute. They are wired throughdefault_config.fmppso a future change can flip them on deliberately.UNSIGNEDis added as a non-reserved keyword (needed for the unsigned integer types below).Behavioral workarounds for 1.41/1.42 changes
ANY_VALUE()), but the implementation NPEs when a window function is combined with GROUP BY (e.g.SELECT MIN(col) OVER() FROM t GROUP BY col).Validatornow uses aSqlDelegatingConformanceover BABEL that overridesisNonStrictGroupBy()tofalse. This is also the semantically correct behavior for Pinot, which requires all non-aggregated columns to appear in GROUP BY. The feature remains present (un-reverted) in 1.42, so the override is retained.Litmus.THROWtype assertion still fires for a nullability-only divergence.PinotRelDecorrelator(new, see below) relaxes that one assertion: it logs a warning when the row types differ only in nullability and continues, but still fails fast on any structural type change.RelDataTypeSystem#getMaxNumericScale/getMaxNumericPrecisionbecamefinal.TypeSystemdrops the now-illegal overrides; the equivalent behavior is preserved via the type-specificgetMaxScale/getMaxPrecision(DECIMAL)overrides Pinot already defines.SqlOperatorBinding#hasEmptyGroup().PinotMinMaxReturnTypeInferencenow also treats a possibly-empty group as nullable (alongside the existinggetGroupCount() == 0/hasFilter()checks), matching the runtime's null-on-empty behavior forMIN(x) FILTER (WHERE ...).Unsigned integer types (CALCITE-1466)
BABEL now parses
UTINYINT/USMALLINT/UINTEGER/UBIGINT. Pinot has no native unsigned storage, so each is mapped to the narrowest signed type that holds its full range without loss — andUBIGINT(BIGINT UNSIGNED), which has no such type, is rejected rather than silently wrapping:UTINYINT/USMALLINT→INT,UINTEGER→LONG(a signedINTwould wrapUINTEGERvalues above 2³¹); applied inRelToPlanNodeConverter/ v2PRelToPlanNodeConverter(convertToColumnDataType), the single-stageDataTypeConversionFunctions.cast,TypeSystem.deriveSumType(widens to signedBIGINTsoSUMdoesn't overflow a 32-bitINT), andArithmeticFunctionUtils.normalizeNumericType(keeps arithmetic integral instead of widening toDOUBLE).UBIGINTis rejected at planning (convertToColumnDataTypethrows): its0..2⁶⁴−1range exceeds signedLONG(2⁶³−1), so mapping it toLONGwould silently wrap values aboveLong.MAX_VALUEinto negatives — a silent wrong result. Failing fast (with a clear message suggestingCAST … AS BIGINT/DECIMAL) is safer;UBIGINTwas a parse error pre-1.41 anyway, and only ever arises from an explicitCAST(… AS BIGINT UNSIGNED). (Per review feedback from @xiangfu0.)PinotEvaluateLiteralRulefolds a constant unsigned cast into its signed-equivalent type by delegating toconvertToColumnDataType— so the representable types fold, and aUBIGINTliteral cast is rejected on the same path.New class
org.apache.pinot.calcite.sql2rel.PinotRelDecorrelator— a minimal subclass of Calcite'sRelDecorrelatorthat exists solely to relax the CALCITE-7379 assertion described above. It lives under theorg.apache.calcite.sql2relpackage because the relevant members (CorelMap,decorrelate, etc.) are package/protected-subclass visible in Calcite.Testing & validation
pinot-query-plannerunit suite: 1262/1262 pass.pinot-query-runtimeresult-correctness vs H2 —ResourceBasedQueriesTest3571 pass / 0 fail (6 pre-existing skips),QueryRunnerTest130/130.OfflineClusterIntegrationTestrun locally; updatedtestQueryWithRepeatedColumnsV2to reflect that 1.42 now accepts repeated columns in GROUP BY but still rejects ambiguous repeated columns in ORDER BY.SUM/arithmetic return types over unsigned operands, and single-stage unsigned casts.pinot-query-planner/src/test/resources/queries/*.jsonEXPLAIN-plan snapshots are mechanically regenerated (label/whitespace deltas from upstream rule changes), not hand-edited.Behavior & compatibility notes
UNSIGNEDcasts — new accepted/rejected query surface (user-facing). As a consequence of CALCITE-1466, BABEL now parsesCAST(x AS <type> UNSIGNED)on both engines. Pinot accepts the representable ones, mapping to the narrowest lossless signed type (TINYINT/SMALLINT UNSIGNED→INT,INTEGER UNSIGNED→LONG), and rejectsBIGINT UNSIGNEDat planning with a clearIllegalArgumentException(no signed type holds its full range). Net: some... UNSIGNEDcasts that were previously parse errors now succeed, andBIGINT UNSIGNEDnow produces a specific planning-time rejection. Worth a release note.SELECT x, COUNT(*) FROM t GROUP BY x, x) previously failed validation and now succeeds — Calcite 1.42 de-duplicates the repeated key. This is a benign relaxation (no previously-working query breaks), but during a mixed-version rolling upgrade the same query is rejected by a 1.40 broker and accepted by a 1.42 broker. Repeated columns inORDER BYare still rejected as ambiguous (covered by the updatedtestQueryWithRepeatedColumnsV2).JoinPlans.json/PhysicalOptimizerPlans.json). This is a sound, plan-equivalent rewrite — Calcite only applies it where the join's right input is already distinct (e.g. fed by an aggregate) — and result-equivalence is covered by the H2-comparison suites (ResourceBasedQueriesTest) and the integration tests, which all pass.<<,&,^). The faithful parser sync makes these parseable for the first time (upstream-stock 1.42 additions toBinaryRowOperator, mapping toSqlStdOperatorTable.BIT_LEFT_SHIFT/BITAND_OPERATOR/BITXOR_OPERATOR); previously they were parse errors. No operator wiring is added by this PR. End-to-end status differs by engine: in the multi-stage enginePinotOperatorTableis a curated allow-list that does not register them, so such expressions fail at operator resolution (validation) — same status as the other synced-but-inert 1.42 grammar. In the single-stage engine the canonical names of&/^(bitand/bitxor) coincide with Pinot's existingbitAnd/bitXorscalar functions, soa & b/a ^ bmay resolve to those (a MySQL-style infix alias), while<<has no Pinot equivalent and errors. This v1/v2 divergence is an inherent consequence of faithfully syncing the upstream grammar; explicitly wiring up or rejecting these operators is left as a separate, deliberate decision.CAST(x AS INTEGER UNSIGNED)(and the other representable unsigned types) is now parseable in v1 too.DataTypeConversionFunctions.castmaps them to their signed equivalent (UTINYINT/USMALLINT → INT, UINTEGER → LONG), mirroring the multi-stage converter, and rejectsBIGINT UNSIGNED(UBIGINT) with the same clear error. Covered byDataTypeConversionFunctionsTest#testCastToUnsignedTypes.Notes for reviewers
Parser.jja faithful upstream sync rather than a divergent fork. Flipping the three fmpp flags is a separate, future decision.AddOptionalColonPath/ColonBracketSegment) was also synced and is likewise inert under Pinot — but it is gated by the upstream conformance methodSqlConformance.isColonFieldAccessAllowed()(which returnsfalsefor BABEL), not by an fmpp flag. These productions are byte-for-byte from upstream Calcite 1.42.0 (verified againstcalcite-1.42.0Parser.jj), so they intentionally carry noPINOT CUSTOMIZATIONmarkers.PINOT CUSTOMIZATIONmarkers from the prior grammar are preserved.BIGINT UNSIGNED(UBIGINT) is rejected rather than mapped to a lossyLONG(per @xiangfu0's review) — see the unsigned-types section. The representable unsigned types (TINYINT/SMALLINT/INTEGER UNSIGNED) map losslessly, so no unsigned value silently wraps.switches (the two converters,DataTypeConversionFunctions,ArithmeticFunctionUtils.normalizeNumericType,TypeSystem.deriveSumType). Each maps to a different target enum and usescaselabels (which must be compile-time constants, so they can't delegate to a shared predicate), so the per-switch listing is largely inherent. The one genuinely-collapsible duplication isRelToPlanNodeConverter.convertToColumnDataTypeand the v2PRelToPlanNodeConverter.convertToColumnDataType, which are byte-for-byte identicalpublic staticcopies — a pre-existing smell this diff merely extends. Collapsing those two (have v2 delegate to v1) is worth a dedicated refactor; left out here to keep the diff scoped to the version bump.hasEmptyGroup()) is a planning-time type-nullability correction — it lets the query validate under 1.42's stricter nullability checks. Pinot'sDataSchema/ColumnDataTypeerases nullability, so this does not change runtime values; the empty-filtered-group → NULL runtime semantics are pre-existing and unchanged. The fix is covered by a compile-time regression test (testFilteredMinMaxAggregateNullability).