docs: M5-3 proposed - SQS partitioned-FIFO reverse encoder #914

Open
bootjp wants to merge 8 commits into main from backup/m5-3-partitioned-design

bootjp (Owner) commented Jun 3, 2026

Summary

  • Phase 0b M5-3 design doc: lift ErrSQSEncodeUnsupportedPartitioned, plumb a partition field through sqsMessageRecord, and emit the partitioned data/vis/byage/dedup key family for queues with partition_count > 1.
  • Group lock rows remain unemitted (carry-over from M5-2 — emitting any row falsely blocks the group).
  • Design-doc-first per CLAUDE.md: this PR is doc-only; implementation lands in a follow-up.

Decision gates documented

  • Single messages.jsonl per queue with "partition": <N> per line (Option A, recommended) vs per-partition file layout (Option B).
  • Backward / forward compatibility analyzed: omitempty on Partition covers pre-M5-3 dumps; format_version stays at 1.
  • M5-2's "full reconstruction of vis+byage+dedup, never emit group" decision inherited unchanged.

Fail-closed invariants added

  • meta.PartitionCount > 1 AND message.Partition >= PartitionCount → new ErrSQSEncodeOutOfRangePartition.
  • meta.PartitionCount == 1 AND message.Partition != 0 → ErrSQSInvalidMessage (reused).

Out of scope (documented)

  • Cross-partition rebalancing on restore (PartitionCount mismatch between dump and target — needs a future --repartition flag).
  • In-flight cross-partition receives (same vis=0 rule as M5-2).
  • Group lock rows (same prohibition as M5-2).

Test plan (planned for the impl PR)

| Test | Verifies |
|---|---|
| `TestSQSEncodePartitionedQueueRoundTrip` | 2-partition queue with 3 messages → all data + side records emitted with `\|p\|` prefix |
| `TestSQSEncodePartitionedDedupBuildsGroupSegment` | FIFO partitioned dedup row's `<group-seg>` matches `message_group_id` |
| `TestSQSEncodeRejectsOutOfRangePartition` | `Partition >= PartitionCount` → `ErrSQSEncodeOutOfRangePartition` |
| `TestSQSEncodeRejectsNonZeroPartitionOnClassicQueue` | `PartitionCount=1` + `Partition=2` → `ErrSQSInvalidMessage` |
| `TestSQSEncodeLegacyDumpsWithoutPartitionStillRoundTrip` | pre-M5-3 `messages.jsonl` (no `partition` field) round-trips unchanged |
| `TestSQSEncodePartitionedSideRecordsByteCrossCheckLiveAdapter` | byte-identical key bytes vs `sqsPartitionedMsg{...}Key(...)` |

Self-review (5 lenses)

  1. Data loss — Closes the silent-rejection surface where partitioned queues couldn't round-trip. New out-of-range + classic-vs-partitioned guards catch malformed dumps.
  2. Concurrency / distributed failures — Encoder is offline + single-goroutine. No new concurrency surface. Decoder write path adds one field; no new locking.
  3. Performance — O(messages-in-dump) walk, same as M5-1 + M5-2. Cross-check vs live constructors at test time only.
  4. Data consistency — Partition number recovered byte-equal from messages.jsonl. Group-by-partition sort before emit ensures byte-identical re-encodes.
  5. Test coverage — 6 tests across encoder, decoder round-trip, and live-adapter cross-check. Legacy-compat regression explicitly pinned.

Test plan

  • Doc only — no code changes.
  • Bot review cycle (claude / codex / CodeRabbit)
  • Implementation PR follows after this doc lands.

Phase 0b M5-3 design doc. Lifts the ErrSQSEncodeUnsupportedPartitioned
gate that M5-1 (#849) and M5-2 (#892) explicitly deferred.

Adds:
- sqsMessageRecord.Partition field + decoder plumbing.
- Partitioned key-family emission (data/vis/byage/dedup) when
  meta.PartitionCount > 1.
- Out-of-range and classic-vs-partitioned mismatch fail-closed guards.

Group lock rows remain unemitted (carry-over from M5-2 — emitting any
row falsely blocks the group).

CLAUDE.md design-doc-first workflow: this doc lands before the
implementation PR.

coderabbitai Bot commented Jun 3, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 30 minutes and 3 seconds.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 744ecbb5db

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

| `data` | `!sqs\|msg\|data\|<queue-seg><gen-BE><msgID-seg>` | `!sqs\|msg\|data\|p\|<queue-seg>\|<part-BE><gen-BE><msgID-seg>` |
| `vis` | `!sqs\|msg\|vis\|<queue-seg><gen-BE><visibleAt-BE><msgID-seg>` | `!sqs\|msg\|vis\|p\|<queue-seg>\|<part-BE><gen-BE><visibleAt-BE><msgID-seg>` |
| `byage` | `!sqs\|msg\|byage\|<queue-seg><gen-BE><sendTs-BE><msgID-seg>` | `!sqs\|msg\|byage\|p\|<queue-seg>\|<part-BE><gen-BE><sendTs-BE><msgID-seg>` |
| `dedup` | `!sqs\|msg\|dedup\|<queue-seg><gen-BE><dedupID-seg>` | `!sqs\|msg\|dedup\|p\|<queue-seg>\|<part-BE><gen-BE><group-seg><dedupID-seg>` |

P2: Add the missing group/dedup delimiter

For partitioned FIFO dedup keys, this shape omits the literal terminator between the encoded group ID and dedup ID. The live constructor appends sqsPartitionedQueueTerminator after encodeSQSSegment(groupID) and before encodeSQSSegment(dedupID) in adapter/sqs_keys.go, with comments explaining that back-to-back raw-base64 segments are ambiguous and can make distinct (groupID, dedupID) pairs collapse onto the same key. If the M5-3 implementation follows this table, restored partitioned queues can reintroduce false duplicate suppression across groups; the documented shape should include the separator.
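
The ambiguity described above can be reproduced in isolation. The sketch below assumes the segment encoder is plain `base64.RawURLEncoding` (as the later commit message states); `encodeSegment`, `joinRaw`, and `joinDelimited` are hypothetical stand-ins, not the adapter's functions. The colliding pair uses a control byte and is chosen only to show the byte-level collision; whether such IDs are reachable through the API is not established here.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeSegment is a hypothetical stand-in for the adapter's segment encoder.
// Assumption: it is unpadded URL-safe base64.
func encodeSegment(s string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(s))
}

// joinRaw concatenates the group and dedup segments with no delimiter,
// which is the shape the v1 table documented.
func joinRaw(groupID, dedupID string) string {
	return encodeSegment(groupID) + encodeSegment(dedupID)
}

// joinDelimited places a literal '|' between the segments.
// '|' is outside the base64url alphabet, so the boundary is unambiguous.
func joinDelimited(groupID, dedupID string) string {
	return encodeSegment(groupID) + "|" + encodeSegment(dedupID)
}

func main() {
	// Two distinct (groupID, dedupID) pairs.
	g1, d1 := "a", "ad"
	g2, d2 := "a\x06", "Y"

	fmt.Println(joinRaw(g1, d1) == joinRaw(g2, d2))             // true: both are "YQYWQ"
	fmt.Println(joinDelimited(g1, d1) == joinDelimited(g2, d2)) // false: "YQ|YWQ" vs "YQY|WQ"
}
```

Without the delimiter, a decoder cannot tell where the group segment ends, because unpadded base64 segments carry no length marker.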


## Decoder lift (M5-1 follow-up)

The decoder's `decodeSQSMessageValue` (`internal/backup/sqs.go:719`) already runs after the partition trailer has been parsed by `sqsParsePartitionedMsgKey` (`sqs.go:600`). M5-3 plumbs the partition number through `sqsMessageRecord.Partition` and writes it to `messages.jsonl`.

P2: Specify returning the partition from key parsing

This describes plumbing the partition after parsing, but the current decoder path only validates the partitioned trailer and returns the encoded queue segment: parseSQSMessageDataKey calls parseSQSPartitionedQueueAndTrailer, whose return type is just (string, error). Since the live message value decoded by decodeSQSMessageValue does not contain the partition, an implementation that only updates sqsMessageRecord/decodeSQSMessageValue as written here still has no source for the partition and will emit all new dumps with the zero default. The M5-3 slice should explicitly require changing the data-key parser/call site to return the parsed uint32 partition from the key.
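
A minimal sketch of what returning the partition from the trailer could look like, assuming the trailer layout from the key table (`<part-BE>` as a big-endian u32 followed by `<gen-BE>` as a big-endian u64). `parsePartitionTrailer`, `assumedTrailerBytes`, and `errShortTrailer` are hypothetical names for illustration; the real parser's signature and bounds check may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// assumedTrailerBytes is an assumption: 4 bytes of partition + 8 bytes of generation.
const assumedTrailerBytes = 4 + 8

var errShortTrailer = errors.New("partitioned trailer too short")

// parsePartitionTrailer decodes the fixed-width prefix of a partitioned key
// trailer and returns the partition instead of discarding it.
func parsePartitionTrailer(trailer []byte) (partition uint32, gen uint64, rest []byte, err error) {
	if len(trailer) < assumedTrailerBytes {
		return 0, 0, nil, errShortTrailer
	}
	partition = binary.BigEndian.Uint32(trailer[:4])
	gen = binary.BigEndian.Uint64(trailer[4:assumedTrailerBytes])
	return partition, gen, trailer[assumedTrailerBytes:], nil
}

func main() {
	// partition=1, generation=7, followed by an opaque message-ID segment.
	trailer := []byte{0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 7, 'i', 'd'}
	p, g, rest, err := parsePartitionTrailer(trailer)
	fmt.Println(p, g, string(rest), err)
}
```

The caller can then copy the returned partition onto the decoded record, since the message value itself carries no partition.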

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request proposes a design document for the SQS partitioned-FIFO reverse encoder (Phase 0b M5-3) to support queues with multiple partitions. The feedback highlights critical improvements to the design: using a pointer *uint32 for the Partition field to prevent silent corruption and ensure JSON consistency, including sequence_number in the sorting criteria to guarantee deterministic re-encodes, and adding a fail-closed validation check when partition information is missing in partitioned queues.

Comment on lines +26 to +30
```go
// sqsMessageRecord adds (in M5-3):
Partition uint32 `json:"partition,omitempty"`
```

`omitempty` is load-bearing — every classic-queue dump produced before M5-3 lands has no `partition` field, and the encoder MUST default to `partition=0` (the only valid value for `partition_count == 1`). New partitioned-queue dumps populate `partition` from the live key's partition trailer.

medium

Using a plain uint32 with omitempty means that for partitioned queues (PartitionCount > 1), any message belonging to partition 0 will have its partition field omitted from the JSON. This has two major drawbacks:

  1. Inconsistent JSON representation: Some messages in a partitioned queue dump will have "partition": N while others will completely lack the field, making the schema inconsistent and harder for external tools to parse.
  2. Silent Corruption Risk: If an operator attempts to restore a legacy classic dump (which has no partition fields) into a partitioned queue (e.g., by manually changing PartitionCount in _queue.json), the encoder will silently default all messages to Partition = 0. This leads to silent routing/FIFO consistency corruption because the messages are not distributed across partitions according to their MessageGroupID hash.

Recommendation:
Use a pointer *uint32 for the Partition field:

// sqsMessageRecord adds (in M5-3):
Partition *uint32 `json:"partition,omitempty"`
  • For classic queues, Partition can be left as nil, so it is completely omitted, preserving backward compatibility.
  • For partitioned queues, Partition is populated as a pointer to the partition number (even if it is 0). This ensures that "partition": 0 is explicitly written, making the JSON consistent and allowing the encoder to detect and reject legacy dumps being restored to partitioned queues without repartitioning.


1. **Drop `ErrSQSEncodeUnsupportedPartitioned`.** Remove the `meta.PartitionCount > 1` gate at line 162.
2. **Branch on `PartitionCount`.** When `> 1`, use partitioned key constructors (duplicated from `adapter/sqs_keys.go` following the established M3b-3 GSI pattern). When `<= 1`, classic constructors as today.
3. **Group-by-partition before emit.** Sort messages by `(partition, send_timestamp_millis, message_id)` so per-partition order is stable across runs — required for byte-identical re-encodes.

medium

The sorting criteria for partitioned FIFO queues should include sequence_number to ensure strict stability and correctness, matching the classic FIFO sorting logic in sortMessagesForEmit.

Without sequence_number, messages with the same send_timestamp_millis within the same partition might be reordered across runs, breaking the deterministic byte-identical re-encode guarantee.

Recommendation:
Update the sorting criteria to:
Sort messages by (partition, send_timestamp_millis, sequence_number, message_id)


The encoder fails closed with the existing per-adapter sentinels on:

- `meta.PartitionCount > 1` AND any message has `Partition == 0` AND the dump's record count for partition 0 doesn't match the live partition assignment. (Detectable only if the encoder can recompute the partition; deferred to a self-test invariant rather than a runtime check.)

medium

If we adopt the *uint32 pointer representation for the Partition field, we can implement a robust, simple runtime validation check instead of deferring it to a self-test invariant.

Specifically, if meta.PartitionCount > 1 and any message has Partition == nil (indicating a legacy dump without partition info), the encoder should immediately fail closed. This prevents accidental/silent corruption when restoring a classic dump into a partitioned queue.

Recommendation:
Update this invariant to:

  • meta.PartitionCount > 1 AND any message has Partition == nil -> fail closed with a new sentinel (e.g., ErrSQSEncodeMissingPartition).

Two doc fixes from the first-round review:

1. Codex P2 (L16): the partitioned-dedup key shape was missing the
   sqsPartitionedQueueTerminator between <group-seg> and <dedupID-seg>.
   The live sqsPartitionedMsgDedupKey (adapter/sqs_keys.go:389) emits
   this delimiter because base64.RawURLEncoding has no padding, so
   back-to-back raw-base64 segments would let distinct (group,dedup)
   pairs collapse onto the same key. Add the delimiter to the table
   + a notation paragraph explaining why every partitioned key carries
   the terminator after <queue-seg> AND why dedup additionally carries
   one between group/dedup.

2. Codex P2 (L38): v1 said the partition is 'plumbed through
   sqsMessageRecord.Partition' but parseSQSMessageDataKey /
   parseSQSPartitionedQueueAndTrailer currently return only the
   encoded queue (with the partition u32 parsed-then-discarded
   inside parseSQSPartitionedQueueAndTrailer). v2 explicitly
   requires extending both signatures to return the partition, and
   clarifies the message value itself never carried partition — the
   key is the source of truth.

bootjp commented Jun 3, 2026

@claude review

v2 addresses both codex P2 findings from the v1 review:

Codex P2 (L16): the partitioned-dedup key was missing the sqsPartitionedQueueTerminator between <group-seg> and <dedupID-seg>. The live sqsPartitionedMsgDedupKey (adapter/sqs_keys.go:389) emits this delimiter because base64.RawURLEncoding has no padding and back-to-back segments would let distinct (group,dedup) pairs collapse onto the same key. The table now shows the delimiter; the notation paragraph explains why every partitioned key carries a terminator after <queue-seg> AND why dedup additionally carries one between group/dedup.

Codex P2 (L38) — v1 was under-specified on partition plumbing. parseSQSMessageDataKey / parseSQSPartitionedQueueAndTrailer currently return only the encoded queue; partition uint32 is parsed-then-discarded inside the partitioned trailer parser. v2 explicitly requires extending both signatures to return the partition; clarifies the message value itself never carried partition — the key is the source of truth.

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 3m 55s


Review: M5-3 SQS partitioned-FIFO reverse encoder (v2)

  • Load tools
  • Read design doc (v2)
  • Read relevant source files for context
  • Verify v2 fixes for Codex P2 findings
  • Assess Gemini suggestions
  • Post comprehensive review

Codex P2 fixes — both verified ✅

P2 (L16) — terminator in dedup key.
The live sqsPartitionedMsgDedupKey (adapter/sqs_keys.go:389-408) emits:

prefix(p|) + encodedQueue + '|' + u32(part) + u64(gen) + encodedGroupID + '|' + encodedDedupID

The v2 table (line 16) now shows <group-seg>|<dedupID-seg>, and the notation paragraph (lines 18-19) correctly explains both the (3) queue-segment terminator and (4) the dedup-only group/dedup separator. Cross-checked against the live constructor. ✅

P2 (L38) — partition plumbing.
The doc now explicitly requires extending both parseSQSMessageDataKey and parseSQSPartitionedQueueAndTrailer to return the partition, with the updated signatures shown (lines 51-52). It clarifies that the message value itself never carries partition and the key is the authoritative source. The isPartitioned bool on parseSQSMessageDataKey is a clean way to avoid re-reading the p| discriminator at the call site. ✅

The current code at sqs.go:634 performs only a length validation (len(trailer) <= fixedTrailerBytes) — the partition bytes exist in trailer[:4] but binary.BigEndian.Uint32 is never called. The implementation PR will need to add that decode step explicitly; worth a note in the impl PR but fine for the design doc.


Remaining open items from Gemini — my take

1. *uint32 vs uint32 omitempty for Partition (Gemini L30) — I agree this is a real gap.

With uint32 + omitempty, a partition-0 message in a 2-partition queue serializes as {} — identical to a legacy classic-queue message. The encoder has no way to distinguish them at runtime: both decode to rec.Partition == 0. The doc defers detection to "a self-test invariant rather than a runtime check" (line 86), which is the weakest possible gate.

The specific scenario Gemini flags: an operator manually edits _queue.json to set partition_count=2 and restores a classic dump — all messages silently land in partition 0, and the FIFO group hash-to-partition invariant breaks without any error.

Adopting *uint32 allows a clean fail-closed rule at line 86: meta.PartitionCount > 1 AND rec.Partition == nil → ErrSQSEncodeMissingPartition. Partition-0 messages in a genuine partitioned dump would write "partition": 0 explicitly (pointer is non-nil), so the encoder can tell them apart.

The tradeoff is pointer-dereference noise in the encoder loop. That's real but minor. Recommend the implementation PR adopt *uint32 and add ErrSQSEncodeMissingPartition. If the doc author intentionally rejects this, the rationale should be stated explicitly (e.g., "we rely on PartitionCount as the single source of truth and accept the self-test-only gap").

2. sequence_number in the sort key (Gemini L74) — agree, add it.

The doc specifies (partition, send_timestamp_millis, message_id). Partitioned queues are always FIFO, so sequence_number is always populated (sqsMessageRecord.SequenceNumber uint64 omitempty, non-zero for every FIFO message). For byte-identical re-encodes, message_id alone is a stable tiebreaker (IDs are unique), but sequence_number is the definitive in-order tie between messages that share a send timestamp (burst scenario). The classic path uses it; the partitioned path should too.

Recommend: (partition, send_timestamp_millis, sequence_number, message_id) — exactly as Gemini suggests. Four fields; no ambiguity.

3. Fail-closed for nil partition (Gemini L86) — follows from #1.

If *uint32 is adopted, this is implementable and should be added. If uint32 is kept, this cannot be implemented at runtime and the doc should say so plainly rather than the current implicit deferral.


New finding: parseSQSGenericKey call site update not documented

When parseSQSPartitionedQueueAndTrailer gains a partition return value (line 52), the parseSQSGenericKey wrapper at sqs.go:571 also calls it:

func parseSQSGenericKey(key []byte, prefix string) (string, error) {
    ...
    return parseSQSPartitionedQueueAndTrailer(rest, false /*hasMsgID*/, key)
    ...
}

This function is called by HandleSideRecord (sqs.go:367). HandleSideRecord doesn't need partition (it routes to _internals/ by queue only), so the partition can be discarded there — but the signature change to parseSQSPartitionedQueueAndTrailer forces a mechanical update to parseSQSGenericKey too. The "Files to add / modify" section lists sqs.go generically; worth an explicit call-out so the implementation doesn't miss this.


Implementation note: two call sites for parseSQSMessageDataKey

HandleMessageData (sqs.go:341) is the primary call site:

encQueue, err := parseSQSMessageDataKey(key)
...
rec, err := decodeSQSMessageValue(value)

After the signature change, HandleMessageData needs to receive (encQueue, partition, isPartitioned, err) and, when isPartitioned, set rec.Partition = &partition (with *uint32) or rec.Partition = partition (with uint32) before appending to the queue state. decodeSQSMessageValue decodes only the value bytes and has no awareness of the key, so partition needs to be wired in at the call site, not inside the decode function. The doc's phrasing "decodeSQSMessageValue then receives the parsed partition from its caller" could mean a signature change to decodeSQSMessageValue(value []byte, partition uint32) (*sqsMessageRecord, error) or a post-decode assignment — either works, but the intent should be specified in the doc to prevent the implementation from choosing the wrong abstraction level.


Minor / editorial

  • Line 30 (omitempty paragraph): the sentence "the encoder MUST default to partition=0 (the only valid value for partition_count == 1)" is accurate for the classic path but slightly misleading for the partitioned-partition-0 case. A parenthetical like "(classic queues only; partitioned queues always populate this field)" would prevent future readers from applying the classic assumption to the partitioned branch.
  • Line 74 (encode_sqs.go change 3): "Sort messages by (partition, send_timestamp_millis, message_id)" — with Gemini's suggested sequence_number addition, the tuple should be updated here too.
  • The "Files to add / modify" section (line 100-109): add sqs.go with an explicit note about parseSQSGenericKey + the HandleMessageData call site.

Summary

The two Codex P2 findings are correctly and completely addressed in v2 — no issues there. The open Gemini suggestions are all valid:

| Item | Severity | Recommendation |
|---|---|---|
| `*uint32` for `Partition` field | Medium (silent data corruption risk) | Adopt in impl PR; add `ErrSQSEncodeMissingPartition` |
| `sequence_number` in sort key | Low (byte-identical re-encode correctness) | Add as tiebreaker |
| Fail-closed for nil partition | Medium (follows from `*uint32` decision) | Implement if `*uint32` is adopted |
| `parseSQSGenericKey` not mentioned | Low (mechanical update to a call site) | Add to "Files to modify" |
| `decodeSQSMessageValue` partition wiring | Low (unclear abstraction level) | Specify in doc |

The design doc can merge as-is with these noted as items for the implementation PR, or the doc can be updated to resolve the *uint32 decision and the sequence_number sort key before merge so the implementation PR has a fully-specified contract. Either is workable; I lean toward resolving them in the doc before the impl PR starts, since the *uint32 decision affects the fail-closed invariants, the JSON schema, and the validation test inputs.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48bcab7512


```go
// sqsMessageRecord adds (in M5-3):
Partition uint32 `json:"partition,omitempty"`
```

P1: Require partition presence for partitioned dumps

Because partition 0 is a valid partition, omitempty makes a partition-0 message indistinguishable from a message whose partition was never captured. The current decoder already recognizes partitioned data keys but discards the parsed partition, so an existing format_version=1 dump of a partitioned queue can contain no partition fields; after this design lifts the encoder gate and defaults missing values to 0, replaying that dump would silently move every message from partitions 1..N-1 into partition 0. The implementation needs a way to preserve presence (for example no omitempty on partitioned dumps or a pointer/presence bit) and fail closed when partition_count > 1 but the field is absent.
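
The presence problem is easy to demonstrate with `encoding/json` alone. The two struct types below are illustrative (they are not the real `sqsMessageRecord`, and the `message_id` field is included only to make the output readable); they differ solely in whether `Partition` is a value or a pointer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// valueRecord mirrors the v1/v2 proposal: plain uint32 with omitempty.
type valueRecord struct {
	MessageID string `json:"message_id"`
	Partition uint32 `json:"partition,omitempty"`
}

// pointerRecord uses a pointer so that "absent" and "explicitly 0" differ.
type pointerRecord struct {
	MessageID string  `json:"message_id"`
	Partition *uint32 `json:"partition,omitempty"`
}

// mustJSON marshals v and panics on error; fine for a demo.
func mustJSON(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	zero := uint32(0)

	// Partition 0 disappears: indistinguishable from a legacy record.
	fmt.Println(mustJSON(valueRecord{MessageID: "m1", Partition: 0}))

	// A non-nil pointer to 0 is emitted; a nil pointer is omitted.
	fmt.Println(mustJSON(pointerRecord{MessageID: "m1", Partition: &zero}))
	fmt.Println(mustJSON(pointerRecord{MessageID: "m1"}))

	// Decoding a legacy line leaves the pointer nil, so absence is detectable.
	var legacy pointerRecord
	if err := json.Unmarshal([]byte(`{"message_id":"m1"}`), &legacy); err != nil {
		panic(err)
	}
	fmt.Println(legacy.Partition == nil)
}
```

With the pointer form, an encoder can reject a record whose `Partition` is nil when the queue metadata says `partition_count > 1`.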

…x P1 + claude review)

Three findings from the v2 review:

1. Codex P1 (L27) + Gemini #1: uint32+omitempty made partition-0
   indistinguishable from absent. Replace with *uint32: nil = pre-M5-3
   dump or never-captured, non-nil = explicitly captured (incl. 0).
   New ErrSQSEncodeMissingPartition fail-closed sentinel when
   partition_count>1 AND rec.Partition==nil. Decision matrix table
   added to make all 6 (partition_count × Partition shape) cases
   explicit.

2. Gemini #2 (sort): include sequence_number in the per-partition
   sort key. Now (partition, send_timestamp_millis, sequence_number,
   message_id) — matches the classic path's tiebreakers.

3. Claude review (call-site plumbing): explicit notes that
   HandleMessageData wires rec.Partition = &partition only when
   isPartitioned; decodeSQSMessageValue keeps its current
   (value)->(*record, error) signature; parseSQSGenericKey wrapper
   updated mechanically to discard the new partition return.

New test pinned: TestSQSEncodeRejectsMissingPartitionOnPartitionedQueue.

bootjp commented Jun 3, 2026

@claude review

v3 addresses codex P1 + 3 gemini mediums + 2 claude findings:

  1. Codex P1 (L27) + Gemini #1: Switched Partition uint32 to *uint32. Pre-M5-3 dumps have Partition == nil; M5-3+ dumps explicitly carry the partition (including 0). Decision matrix table added covering all 6 cases of (partition_count, Partition).

  2. Gemini #2 (sort key): Sort key is now (partition, send_timestamp_millis, sequence_number, message_id), which matches the classic path's tiebreakers.

  3. Gemini #3 (fail-closed nil): New ErrSQSEncodeMissingPartition sentinel fires when partition_count > 1 AND rec.Partition == nil (i.e., a partitioned-queue dump that's missing partition info). Pinned by TestSQSEncodeRejectsMissingPartitionOnPartitionedQueue.

  4. Claude (HandleMessageData): Explicit note that the decoder write site wires rec.Partition = &partition only when isPartitioned.

  5. Claude (parseSQSGenericKey): Wrapper updated mechanically to discard the new partition return; decodeSQSMessageValue keeps its (value) -> (*record, error) signature.
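
The fail-closed rules listed in this thread can be condensed into one check. This is a sketch under assumptions: `resolvePartition` is a hypothetical helper, the sentinel names come from the discussion but their error messages are invented here, and accepting an explicit `0` on a classic queue is an assumption, since the decision matrix itself is not quoted in this conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel names are taken from the discussion; the messages are placeholders.
var (
	ErrSQSEncodeMissingPartition    = errors.New("sqs encode: partition missing on partitioned queue")
	ErrSQSEncodeOutOfRangePartition = errors.New("sqs encode: partition out of range")
	ErrSQSInvalidMessage            = errors.New("sqs encode: invalid message")
)

// resolvePartition validates a record's optional partition against the
// queue's partition count and returns the partition to encode with.
func resolvePartition(partitionCount uint32, p *uint32) (uint32, error) {
	if partitionCount > 1 {
		switch {
		case p == nil:
			// Legacy or never-captured partition: refuse to guess.
			return 0, ErrSQSEncodeMissingPartition
		case *p >= partitionCount:
			return 0, ErrSQSEncodeOutOfRangePartition
		}
		return *p, nil
	}
	// Classic queue: absent or explicit 0 (assumed acceptable) maps to 0.
	if p != nil && *p != 0 {
		return 0, ErrSQSInvalidMessage
	}
	return 0, nil
}

func main() {
	one, two := uint32(1), uint32(2)
	cases := []struct {
		count uint32
		p     *uint32
	}{
		{2, nil}, {2, &one}, {2, &two}, {1, nil}, {1, &two},
	}
	for _, tc := range cases {
		got, err := resolvePartition(tc.count, tc.p)
		fmt.Println(tc.count, got, err)
	}
}
```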

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 3m 25s


Review: M5-3 SQS partitioned-FIFO reverse encoder (v3)

  • Load tools
  • Read design doc (v3)
  • Read relevant source files (sqs.go, encode_sqs.go, encode_sqs_side.go, sqs_keys.go)
  • Verify v3 fixes for all 5 claimed improvements
  • Post comprehensive review

All 5 claimed v3 improvements verified ✅

1. Codex P1 / Gemini #1: *uint32 for Partition.
Doc line 27 correctly shows Partition *uint32 `json:"partition,omitempty"`. Decision matrix (lines 34–41) covers all 6 (partition_count, Partition) combinations. The nil-vs-explicit-zero distinction is correctly documented, with the rationale. The omitempty tag on a pointer is correct Go: a non-nil *uint32 pointing at 0 serializes as "partition": 0, which is what the partitioned-zero-partition case needs. ✅

2. Gemini #2 — sort key.
Line 93: (partition, send_timestamp_millis, sequence_number, message_id). Cross-checked against the live sortMessagesForEmit (sqs.go:815–827), which uses (send_timestamp_millis, sequence_number, message_id). Adding partition as the leading tiebreaker is correct and matches the live classic tiebreaker set exactly. ✅

3. Gemini #3: ErrSQSEncodeMissingPartition.
New sentinel documented at lines 39, 43, 105. Test TestSQSEncodeRejectsMissingPartitionOnPartitionedQueue (line 142) pins the regression. ✅

4. Claude (HandleMessageData) — call-site wiring.
Line 74 explicitly specifies: receive (encQueue, partition uint32, isPartitioned bool, err), set rec.Partition = &partition only when isPartitioned. Classic path leaves rec.Partition == nil. Correct. ✅

5. Claude (parseSQSGenericKey) — wrapper mechanical update.
Line 76 documents that parseSQSGenericKey discards the new partition return value (side-record handlers route by queue only). The compile-time obligation is called out. ✅


Minor inaccuracies — fix before or alongside the implementation PR

1. "Classic path uses the same four-field sort" (line 93) — classic is 3-field.

The live sortMessagesForEmit (sqs.go:815–827) is:

case a.SendTimestampMillis != b.SendTimestampMillis: ...
case a.SequenceNumber != b.SequenceNumber: ...
default: a.MessageID < b.MessageID

That's 3 fields, not 4. The partitioned path is the 4-field version (adds partition as the leading key). The sentence "the classic path uses the same four-field sort" is inaccurate — rephrase to "the classic path uses the same 3-field tiebreaker tuple; the partitioned path prepends partition as the leading key."
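
For reference, the 4-field ordering can be sketched as the classic 3-field comparator with `partition` prepended. The `record` type and `sortForPartitionedEmit` are hypothetical, and `Partition` is shown as an already-validated `uint32` rather than the `*uint32` the doc proposes for the JSON record.

```go
package main

import (
	"fmt"
	"sort"
)

// record is an illustrative subset of the fields used for ordering.
type record struct {
	Partition           uint32
	SendTimestampMillis int64
	SequenceNumber      uint64
	MessageID           string
}

// sortForPartitionedEmit orders by
// (partition, send_timestamp_millis, sequence_number, message_id).
// MessageID is unique, so the order is total and stable across runs.
func sortForPartitionedEmit(recs []record) {
	sort.Slice(recs, func(i, j int) bool {
		a, b := recs[i], recs[j]
		switch {
		case a.Partition != b.Partition:
			return a.Partition < b.Partition
		case a.SendTimestampMillis != b.SendTimestampMillis:
			return a.SendTimestampMillis < b.SendTimestampMillis
		case a.SequenceNumber != b.SequenceNumber:
			return a.SequenceNumber < b.SequenceNumber
		default:
			return a.MessageID < b.MessageID
		}
	})
}

func main() {
	recs := []record{
		{Partition: 1, SendTimestampMillis: 10, SequenceNumber: 1, MessageID: "a"},
		{Partition: 0, SendTimestampMillis: 20, SequenceNumber: 2, MessageID: "b"},
		{Partition: 0, SendTimestampMillis: 20, SequenceNumber: 1, MessageID: "c"},
		{Partition: 0, SendTimestampMillis: 10, SequenceNumber: 5, MessageID: "d"},
	}
	sortForPartitionedEmit(recs)
	for _, r := range recs {
		fmt.Println(r.Partition, r.SendTimestampMillis, r.SequenceNumber, r.MessageID)
	}
}
```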

2. decodeSQSMessageValue return type on line 75 — value, not pointer.

Line 75 says: "decodeSQSMessageValue keeps its current (value []byte) → (*sqsMessageRecord, error) signature"

The live signature (sqs.go:719) is:

func decodeSQSMessageValue(value []byte) (sqsMessageRecord, error)

It returns a value type, not a pointer. The design doc's description is wrong; either a typo or a copy of the proposed M5-3 change. Clarify whether M5-3 keeps the value return or switches to a pointer return — the current doc implies the latter while claiming "keeps its current signature."


Low-severity implementation gaps — note for the impl PR

3. encodeMessageSideRecords needs a partition parameter, not documented.

encodeQueueMessages calls encodeMessageSideRecords per message for vis/byage/dedup key construction (encode_sqs_side.go). For partitioned queues, encodeMessageSideRecords needs to know (partition uint32, isPartitioned bool) to select the partitioned vs classic constructor. The doc adds three partitioned constructors to encode_sqs_side.go (lines 95–99) but doesn't spell out that encodeMessageSideRecords' signature or call site also needs updating. Worth a one-liner in §"Encoder lift" so the impl PR is mechanically complete.

4. addMessage / partitioned dispatch not named.

addMessage (encode_sqs.go:238–273) constructs the !sqs|msg|data| key via sqsMsgDataKeyBytes(queueName, sqsRestoreGeneration, rec.MessageID). For partitioned queues this must call sqsPartitionedMsgDataKeyBytes. The doc describes "Branch on PartitionCount" broadly (lines 91–92) but doesn't name addMessage or indicate whether M5-3 adds an addPartitionedMessage peer or threads a partition *uint32 parameter through the existing function. Either approach is fine; just name it so the impl PR doesn't have to re-derive it.
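
As one illustration of the second option (threading a `partition *uint32` through the existing function), the dispatch could look roughly like the sketch below. The helpers `classicDataKey`, `partitionedDataKey`, and `dataKeyFor` are hypothetical placeholders; their string layouts are illustrative only and do not reproduce the real key encoding of `sqsMsgDataKeyBytes` or `sqsPartitionedMsgDataKeyBytes` (the generation argument is omitted, for instance).

```go
package main

import (
	"fmt"
	"strconv"
)

// classicDataKey is a hypothetical placeholder for sqsMsgDataKeyBytes.
// The layout is illustrative, not the real byte encoding.
func classicDataKey(queue, messageID string) string {
	return "!sqs|msg|data|" + queue + "|" + messageID
}

// partitionedDataKey is a hypothetical placeholder for sqsPartitionedMsgDataKeyBytes.
// The layout is illustrative, not the real byte encoding.
func partitionedDataKey(queue string, partition uint32, messageID string) string {
	p := strconv.FormatUint(uint64(partition), 10)
	return "!sqs|msg|data|p|" + p + "|" + queue + "|" + messageID
}

// dataKeyFor selects the constructor from the partition pointer alone:
// nil means classic, non-nil (including a pointer to 0) means partitioned.
func dataKeyFor(queue string, partition *uint32, messageID string) string {
	if partition != nil {
		return partitionedDataKey(queue, *partition, messageID)
	}
	return classicDataKey(queue, messageID)
}

func main() {
	zero := uint32(0)
	fmt.Println(dataKeyFor("orders.fifo", nil, "m1"))
	fmt.Println(dataKeyFor("orders.fifo", &zero, "m1"))
}
```

A pointer parameter lets one function serve both shapes while still distinguishing "no partition" from "partition 0", which a plain `uint32` cannot do.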


Editorial

  • "Files to modify" for sqs.go (line 124) — comment says sqsMessageRecord +Partition; decodeSQSMessageValue plumbs partition but the bigger changes are the parseSQSMessageDataKey and parseSQSPartitionedQueueAndTrailer signature extensions. The §"Decoder lift" section covers them, but the summary table comment is misleading by omission. Consider: sqsMessageRecord +Partition; parseSQSMessageDataKey + parseSQSPartitionedQueueAndTrailer new return values; HandleMessageData + parseSQSGenericKey call-site updates.

Summary

All 5 v3 improvements are correctly implemented in the doc — the *uint32 decision, the 6-row decision matrix, the ErrSQSEncodeMissingPartition sentinel, and both implementation call-site notes from the prior review. The doc is ready to merge with the two minor inaccuracies corrected (classic sort is 3-field, decodeSQSMessageValue return type). Items 3 and 4 can land as a one-paragraph addition to §"Encoder lift" or be resolved in the implementation PR itself — either is workable.

| Item | Severity | Recommendation |
| --- | --- | --- |
| "classic path four-field sort" wording (line 93) | Minor (inaccuracy) | Correct to "3-field, partition prepended" |
| decodeSQSMessageValue return type (line 75) | Minor (wrong type) | Clarify value vs pointer intention |
| encodeMessageSideRecords partition param | Low (impl gap) | Note in §"Encoder lift" or impl PR |
| addMessage / addPartitionedMessage not named | Low (impl gap) | Name in §"Encoder lift" |
| "Files to modify" sqs.go comment | Editorial | Expand to include parser signature changes |

…cher (claude v914 v3)

Four findings from the v3 claude review, all editorial / mechanical:

1. Minor #1: "classic path four-field sort" wording at L93 was wrong.
   The live sortMessagesForEmit (sqs.go:815) is a 3-field tuple
   (send_timestamp_millis, sequence_number, message_id). Partitioned
   path prepends partition as the leading key. Reworded.

2. Minor #2: decodeSQSMessageValue return type at L75 — live sig is
   (value []byte) (sqsMessageRecord, error), a value not a pointer.
   Doc previously said *sqsMessageRecord — corrected.

3. Low #3: encodeMessageSideRecords partition param plumbing now
   spelled out explicitly in the §"Encoder lift" side-record section.
   Branches on partition != nil rather than re-reading meta.PartitionCount.

4. Low #4: addMessage / partitioned dispatch named. v4 threads
   partition *uint32 through addMessage (rather than spawning a peer
   addPartitionedMessage) so addMessage + encodeMessageSideRecords
   share the same dispatch source.

5. Editorial: Files-to-modify comment for sqs.go now lists all four
   touched functions (sqsMessageRecord +Partition, parseSQSMessageDataKey
   + parseSQSPartitionedQueueAndTrailer new returns, HandleMessageData
   wiring, parseSQSGenericKey wrapper discard).
@bootjp
Owner Author

bootjp commented Jun 3, 2026

@claude review

v4 addresses all four v3 findings:

  1. Minor #1 (classic sort field count): Live sortMessagesForEmit (sqs.go:815) is a 3-field tuple (send_timestamp_millis, sequence_number, message_id). Doc previously said "four-field sort"; corrected to "3-field, partition prepended."

  2. Minor #2 (decodeSQSMessageValue return type): Live signature is (value []byte) (sqsMessageRecord, error), a value, not a pointer. Doc said *sqsMessageRecord; corrected.

  3. Low #3 (encodeMessageSideRecords partition param): Now explicit in §"Encoder lift": encodeMessageSideRecords takes partition *uint32 and branches on != nil rather than re-reading meta.PartitionCount.

  4. Low #4 (addMessage dispatch): v4 threads partition *uint32 through addMessage (rather than spawning a peer addPartitionedMessage), so addMessage + encodeMessageSideRecords share the same dispatch source.

  5. Editorial: sqs.go files-list comment now lists all four touched functions: sqsMessageRecord +Partition, parseSQSMessageDataKey + parseSQSPartitionedQueueAndTrailer new returns, HandleMessageData wiring, parseSQSGenericKey wrapper discard.

@claude

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 2m 39s —— View job


Review: M5-3 SQS partitioned-FIFO reverse encoder (v4)

  • Load tools
  • Read design doc (v4)
  • Verify v4 fixes for all 5 claimed improvements
  • Cross-check against live source files
  • Post comprehensive review

All 5 v4 improvements verified ✅

1. Classic sort field count (line 93).
Live sortMessagesForEmit (sqs.go:815–824) is exactly:

case a.SendTimestampMillis != b.SendTimestampMillis: ...
case a.SequenceNumber != b.SequenceNumber: ...
default: a.MessageID < b.MessageID

Doc now correctly says "3-field, partition prepended" — not the erroneous "four-field sort" from v3. ✅

2. decodeSQSMessageValue return type (line 75).
Live signature (sqs.go:719): func decodeSQSMessageValue(value []byte) (sqsMessageRecord, error) — value, not pointer. Doc now correctly states "note the return is a value, not a pointer." ✅

3. encodeMessageSideRecords partition param (line 103).
Doc now explicitly specifies: gains partition *uint32, branches by != nil rather than re-reading meta.PartitionCount. ✅

4. addMessage dispatch (line 95).
Doc specifies threading partition *uint32 through addMessage (rather than a peer function), so the side-record call-site shares the same dispatch source. ✅

5. Editorial: sqs.go files-list comment (line 128).
Now comprehensively lists all four affected functions: sqsMessageRecord +Partition, parseSQSMessageDataKey + parseSQSPartitionedQueueAndTrailer new returns, HandleMessageData wiring, parseSQSGenericKey wrapper discard. ✅


One remaining naming error — correct before or during the implementation PR

encodeMessageSideRecords does not exist; the live function is addSideRecords.

The doc refers to the per-message side-record dispatcher as encodeMessageSideRecords in three places:

  • Line 97 (leading sentence of the encode_sqs_side.go additions)
  • Line 103 ("encodeMessageSideRecords (the per-message side-record dispatcher…) gains the same partition *uint32 parameter")
  • Implicitly in the encode_sqs_side.go entry in §"Files to add/modify" (line 130)

The actual live function is addSideRecords (encode_sqs_side.go:133), called at encode_sqs.go:214:

if err := e.addSideRecords(b, meta.Name, &meta, &records[i]); err != nil {

This is the only concrete naming error remaining. An implementation PR author searching for encodeMessageSideRecords will not find it. The fix is a one-line rename in the doc. Either correct it before merge or add a comment to the implementation PR noting the correct name — the latter works too since the doc is _proposed_.



Everything else looks correct

  • *uint32 decision matrix (lines 34–41): all 6 (partition_count, Partition) combinations are correctly mapped. The omitempty + pointer combination behaves correctly: a non-nil *uint32(0) serializes as "partition": 0. ✅
  • sqsPartitionedMsgDataKeyBytes naming (line 129): the *KeyBytes suffix is the established convention in encode_sqs.go (sqsMsgDataKeyBytes at line 397), and the new sqsPartitionedMsgDataKeyBytes follows it correctly. ✅
  • Dedup key shape in table (line 16): <group-seg>|<dedupID-seg> matches sqsPartitionedMsgDedupKey exactly (sqs_keys.go:389–408 emits encodeSQSSegment(groupID) + terminator + encodeSQSSegment(dedupID)). ✅
  • Partition-from-key sourcing (line 70): correctly observes that decodeSQSMessageValue never sees the key, so partition must be wired in at HandleMessageData, not inside the decoder. ✅
  • parseSQSGenericKey wrapper (line 76): correctly specified as a mechanical discard; the wrapper currently calls parseSQSPartitionedQueueAndTrailer at sqs.go:577 and routes by queue only. ✅
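
The `*uint32` plus `omitempty` behaviour noted in the first bullet can be checked with a small standalone program. This is a sketch: the `record` struct and the `message_id` tag are assumptions for illustration, while the `partition` field name follows the dump format described in the PR summary.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record is a hypothetical reduction of sqsMessageRecord for this sketch.
type record struct {
	MessageID string  `json:"message_id"`
	Partition *uint32 `json:"partition,omitempty"`
}

// encode marshals a record to its JSON line.
func encode(r record) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// decodePartition parses a JSON line and returns the partition pointer,
// which stays nil when the field is absent.
func decodePartition(line string) *uint32 {
	var r record
	if err := json.Unmarshal([]byte(line), &r); err != nil {
		panic(err)
	}
	return r.Partition
}

func main() {
	zero := uint32(0)
	// A nil pointer is omitted entirely (the pre-M5-3 dump shape).
	fmt.Println(encode(record{MessageID: "m1"}))
	// A non-nil pointer to 0 is still emitted, so partition 0 stays explicit.
	fmt.Println(encode(record{MessageID: "m1", Partition: &zero}))
}
```

With a plain `uint32` and `omitempty`, partition 0 would be dropped on encode and would be indistinguishable from a legacy line on decode, which is the ambiguity the pointer type avoids.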

Summary

All 5 v3 findings are cleanly resolved in v4. The doc is ready to merge with one correction: encodeMessageSideRecords → addSideRecords (lines 97–103 and the encode_sqs_side.go comment in §"Files to modify"). Fix in the doc before merge, or note it in the implementation PR; either is workable.

| Item | Severity | Status |
| --- | --- | --- |
| Classic sort "four-field" wording | Minor | ✅ Fixed |
| decodeSQSMessageValue return type | Minor | ✅ Fixed |
| encodeMessageSideRecords partition param | Low | ✅ Fixed |
| addMessage dispatch strategy | Low | ✅ Fixed |
| sqs.go files-list completeness | Editorial | ✅ Fixed |
| encodeMessageSideRecords → addSideRecords name | Low–Medium | ❌ Remaining |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 810634d596

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The full decision matrix lives in the table under §"Dump-format change." The encoder fails closed with these sentinels:

- `meta.PartitionCount > 1` AND `rec.Partition == nil` (pre-M5-3 dump under lifted gate, or M5-3 decoder bug) → **new sentinel `ErrSQSEncodeMissingPartition`**. The operator must re-decode with an M5-3 decoder; replaying a legacy dump into a partitioned queue would silently move every message to partition 0 (codex P1 / gemini #914 v2).
- `meta.PartitionCount > 1` AND `*rec.Partition >= meta.PartitionCount` → **new sentinel `ErrSQSEncodeOutOfRangePartition`** (out-of-range partition number, dump is malformed).


P2: Reject nonzero partitions for perQueue HT-FIFO

For valid HT-FIFO queues with partition_count > 1 but fifo_throughput_limit == "perQueue", the live router forces every group to partition 0 (adapter/sqs_partitioning.go:71-72) and ReceiveMessage only scans one effective partition (adapter/sqs_keys_dispatch.go:125-126). This invariant only checks *rec.Partition < meta.PartitionCount, so a dump line with partition 1..N-1 would be accepted and restored under |p|<nonzero>, where it will not be found by the live receive fanout. Please fail closed against effectivePartitionCount/perQueue semantics, or explicitly require partition 0 in that mode.


… + claude v914 v4)

Two findings from the v4 review:

1. Codex P2 v914 v4 (L110): For FIFO partitioned queues with
   FifoThroughputLimit == "perQueue", the live partitionFor at
   adapter/sqs_partitioning.go:71-72 forces every group to partition 0
   regardless of PartitionCount, and ReceiveMessage only scans the
   partition-0 lane (adapter/sqs_keys_dispatch.go:125-126). The v4
   invariants only checked *rec.Partition < PartitionCount, so a dump
   line with partition 1..N-1 would be accepted and restored onto
   |p|<nonzero> lanes the live receive never visits — silent data
   loss on first read. v5 adds new sentinel
   ErrSQSEncodePartitionRoutingMismatch and the
   TestSQSEncodeRejectsNonzeroPartitionOnPerQueueHTFIFO regression.
   Also documents the effectivePartitionCount(meta) helper concept
   so the per-gate inline checks stay one-line clean.

2. Claude v914 v4 (Low-Medium): The per-message side-record dispatcher
   is named addSideRecords (encode_sqs_side.go:133), called from
   encode_sqs.go:214 as e.addSideRecords(b, meta.Name, &meta,
   &records[i]). v4 used encodeMessageSideRecords in three places —
   that function does not exist. Renamed throughout the doc.

The perQueue gate is semantic-changing (new fail-closed branch). Callers
of the new helper effectivePartitionCount: only addMessage and
addSideRecords in M5-3's encoder. The validation site itself is the
encoder's per-message loop, which is the only place that has both
meta and the candidate rec.Partition.
@bootjp
Owner Author

bootjp commented Jun 3, 2026

@claude review

v5 addresses both v4 findings:

  1. Codex P2 (perQueue HT-FIFO routing): Live partitionFor at adapter/sqs_partitioning.go:71-72 forces every group to partition 0 when FifoThroughputLimit == "perQueue", regardless of PartitionCount; ReceiveMessage only scans partition 0. The v4 gates would accept *rec.Partition=1 for such queues — silent data loss. v5 adds ErrSQSEncodePartitionRoutingMismatch and TestSQSEncodeRejectsNonzeroPartitionOnPerQueueHTFIFO, plus documents an effectivePartitionCount(meta) helper.

  2. Claude Low-Medium (function name): Renamed encodeMessageSideRecords → addSideRecords everywhere (live function lives at encode_sqs_side.go:133, called at encode_sqs.go:214).

@claude

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 3m 20s —— View job


Review: M5-3 SQS partitioned-FIFO reverse encoder (v5)

  • Load tools
  • Read design doc (v5)
  • Read relevant source files (sqs_partitioning.go, sqs_keys_dispatch.go, encode_sqs_side.go, encode_sqs.go, sqs_keys.go:337+)
  • Verify v5 fixes for both claimed improvements
  • Post comprehensive review

Both v5 improvements verified ✅

1. Codex P2 (perQueue HT-FIFO routing).
The live partitionFor at adapter/sqs_partitioning.go:71-72 is confirmed:

if meta.FifoThroughputLimit == htfifoThroughputPerQueue {
    return 0
}

Every group is forced to partition 0 for perQueue mode, regardless of PartitionCount. v5 correctly adds ErrSQSEncodePartitionRoutingMismatch (line 112), the test TestSQSEncodeRejectsNonzeroPartitionOnPerQueueHTFIFO (line 152), and the effectivePartitionCount(meta) helper recommendation (line 114). ✅

2. addSideRecords name.
Live function is addSideRecords at encode_sqs_side.go:133, called at encode_sqs.go:214. The doc uses the correct name throughout. ✅


New finding: effectivePartitionCount naming collision — note for the impl PR

The doc at line 114 recommends "An impl-side helper effectivePartitionCount(meta) uint32". An identically-named function already exists in adapter/sqs_keys_dispatch.go:121-129:

func effectivePartitionCount(meta *sqsQueueMeta) uint32 {
    if meta == nil || meta.PartitionCount <= 1 { return 1 }
    if meta.FifoThroughputLimit == htfifoThroughputPerQueue { return 1 }
    return meta.PartitionCount
}

The backup package cannot import this — importing adapter from internal/backup/encode_sqs.go would violate the M3b-3 circular-dependency pattern (same reason sqsFifoDedupWindowMillis is mirrored in encode_sqs_side.go:15 rather than imported from adapter/sqs_fifo.go). The impl PR must define a new copy in internal/backup/ operating on sqsQueueMetaPublic (not *adapter.sqsQueueMeta). An impl author reading the doc alongside sqs_keys_dispatch.go might attempt to import and reuse the adapter version — adding a one-liner clarification ("define a local copy in internal/backup/, per M3b-3 pattern; cannot import from adapter") prevents that mistake.

The constant htfifoThroughputPerQueue = "perQueue" has the same issue — the adapter's copy is unexported and unimportable; a mirrored constant is needed in the backup package. This follows the same pattern as sqsFifoDedupWindowMillis.


Low-severity note: key dispatch vs. effective-partition-count

The "Branch on PartitionCount" (line 92) correctly uses raw PartitionCount > 1 for key-shape selection. For a perQueue queue with PartitionCount=2, the live adapter writes partitioned-shape keys (|p| prefix, partition=0) — effectivePartitionCount == 1 does NOT mean "use classic keys". The effective count governs the validation gates and the ReceiveMessage scan fanout, not key construction.

This distinction is implicit in the doc but could trip up an impl author. A one-liner in §"Encoder lift": "The effective partition count governs validation gates only; the key-shape branch (PartitionCount > 1) is intentionally separate, ensuring perQueue queues with PartitionCount=2 restore under partitioned-shape keys, matching what the live adapter wrote."
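
Both notes above can be sketched together: a local mirror of the helper in the backup package, next to a separate key-shape predicate. This is an assumption-laden illustration, not the impl PR's code: `queueMeta` is a hypothetical reduction of the dump-side meta struct, and `perQueueThroughput` and `usesPartitionedKeys` are names invented for the example. The `perMessageGroupId` value used in `main` is the other AWS-defined setting for this attribute.

```go
package main

import "fmt"

// perQueueThroughput is a local mirror of the adapter's unexported
// htfifoThroughputPerQueue constant (the name here is hypothetical).
const perQueueThroughput = "perQueue"

// queueMeta is a hypothetical reduction of the dump-side queue meta struct.
type queueMeta struct {
	PartitionCount      uint32
	FifoThroughputLimit string
}

// effectivePartitionCount mirrors the adapter helper's logic on the local type:
// perQueue mode collapses routing to a single partition.
func effectivePartitionCount(meta *queueMeta) uint32 {
	if meta == nil || meta.PartitionCount <= 1 {
		return 1
	}
	if meta.FifoThroughputLimit == perQueueThroughput {
		return 1
	}
	return meta.PartitionCount
}

// usesPartitionedKeys is the key-shape branch. It reads the raw PartitionCount
// and deliberately ignores the throughput limit.
func usesPartitionedKeys(meta *queueMeta) bool {
	return meta != nil && meta.PartitionCount > 1
}

func main() {
	perQ := &queueMeta{PartitionCount: 2, FifoThroughputLimit: "perQueue"}
	perGroup := &queueMeta{PartitionCount: 4, FifoThroughputLimit: "perMessageGroupId"}
	// perQueue with PartitionCount=2: effective count 1, but still partitioned-shape keys.
	fmt.Println(effectivePartitionCount(perQ), usesPartitionedKeys(perQ))
	fmt.Println(effectivePartitionCount(perGroup), usesPartitionedKeys(perGroup))
}
```

Keeping the two predicates as separate functions makes it harder to substitute one for the other by accident when selecting a key constructor.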


Summary

Both v5 improvements are correctly reflected in the doc. The design is ready to merge with two notes for the implementation PR:

| Item | Severity | Recommendation |
| --- | --- | --- |
| perQueue routing mismatch gate | | ✅ Fixed in v5 |
| addSideRecords name | | ✅ Fixed in v5 |
| effectivePartitionCount naming collision with adapter | Low–Medium | Note in doc or impl PR: must define a new copy in internal/backup/, cannot import from adapter; mirror "perQueue" constant too |
| Key dispatch vs. effective-count separation | Low | One-liner clarification in §"Encoder lift" |

The doc can land as-is with these two items tracked in the implementation PR, or the naming-collision note can be added to §"Encoder lift" before merge.

…distinction (claude v914 v5)

Two impl-PR notes from claude's v5 review, both low-severity but worth
pinning in the doc so the impl PR doesn't trip on them:

1. effectivePartitionCount naming collision: the adapter already has
   an identically-named function operating on *adapter.sqsQueueMeta
   (sqs_keys_dispatch.go, sqs_admin_peek.go callers). The backup
   package cannot import it (M3b-3 circular-dep pattern, same as
   sqsFifoDedupWindowMillis mirrored at encode_sqs_side.go:15). v6
   spells out: define a NEW copy in internal/backup/ operating on
   sqsQueueMetaPublic, and mirror the htfifoThroughputPerQueue
   constant (= "perQueue") locally.

2. Key-shape vs validation distinction: 'Branch on PartitionCount'
   in Encoder lift uses raw meta.PartitionCount > 1 for key shape,
   NOT effectivePartitionCount. perQueue queues with PartitionCount=2
   still use partitioned-shape keys (|p| prefix, partition=0) because
   that's what the live adapter writes. effectivePartitionCount
   controls only validation gates and ReceiveMessage fan-out. v6
   adds an explicit note so an impl author doesn't conflate them
   and accidentally restore perQueue dumps under classic keys.
@bootjp
Owner Author

bootjp commented Jun 3, 2026

@claude review

v6 addresses both v5 impl-PR notes:

  1. effectivePartitionCount naming collision: Doc now explicitly mandates a NEW copy in internal/backup/ operating on sqsQueueMetaPublic (since internal/backup/ can't import adapter/ per the M3b-3 circular-dependency pattern, same as sqsFifoDedupWindowMillis). Also mandates mirroring the "perQueue" constant locally — the adapter's htfifoThroughputPerQueue is unexported.

  2. Key-shape vs. effective-count distinction: §"Validation invariants" now explicitly notes that "Branch on PartitionCount" uses raw meta.PartitionCount > 1 for key construction, not effectivePartitionCount. perQueue queues with PartitionCount=2 still write partitioned-shape keys (|p| prefix, partition=0) because that's what the live adapter wrote. effectivePartitionCount only governs validation gates and ReceiveMessage fan-out.

@claude

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 1m 58s —— View job


Review: M5-3 SQS partitioned-FIFO reverse encoder (v6)

  • Load tools
  • Read design doc (v6)
  • Read relevant source files (sqs_keys_dispatch.go, sqs_partitioning.go, encode_sqs.go, encode_sqs_side.go, sqs.go)
  • Verify v6 fixes for both claimed improvements
  • Post comprehensive review

Both v6 improvements verified ✅

1. effectivePartitionCount naming collision (Claude v5 finding).
Line 116 now explicitly mandates defining a NEW copy in internal/backup/ operating on sqsQueueMetaPublic, with the rationale cross-referencing the established M3b-3 pattern (sqsFifoDedupWindowMillis mirrored at encode_sqs_side.go:15). Cross-checked:

  • Live effectivePartitionCount at adapter/sqs_keys_dispatch.go:121-129 — confirmed to take *sqsQueueMeta (unexported adapter type), unimportable from internal/backup/. ✅
  • sqsFifoDedupWindowMillis mirrored at encode_sqs_side.go:15 — confirmed; the doc's precedent citation is accurate. ✅
  • htfifoThroughputPerQueue = "perQueue" at adapter/sqs_partitioning.go:37 — confirmed unexported; doc correctly requires a local mirror. ✅

2. Key-shape vs. effective-count distinction (Claude v5 finding).
Line 118 now explicitly states: raw meta.PartitionCount > 1 governs key construction; effectivePartitionCount governs validation gates and ReceiveMessage fan-out only. Cross-checked against live dispatch helpers in sqs_keys_dispatch.go: sqsMsgDataKeyDispatch (line 39), sqsMsgVisKeyDispatch (line 48), and siblings all branch on meta.PartitionCount > 1 — NOT on effectivePartitionCount. A perQueue queue with PartitionCount=2 still emits partitioned-shape |p| keys at partition=0 via the live adapter (partitionFor returns 0 for perQueue, lines 71-73 of sqs_partitioning.go). The doc's claim is verified. ✅


One remaining minor issue — wrong file path for sqsQueueMetaPublic

Line 116 reads: "The new copy operates on sqsQueueMetaPublic (the public/dump struct, encode_sqs.go:71-87)."

The struct name is correct (sqsQueueMetaPublic), but the file path is wrong. sqsQueueMetaPublic lives at internal/backup/sqs.go:148-162, not encode_sqs.go:71-87. The struct at encode_sqs.go:66-80 is sqsStoredQueueMeta — the live-adapter-compatible JSON struct used for the !sqs|queue|meta| value.

An implementation PR author searching encode_sqs.go:71-87 won't find the struct.


Everything else looks correct

  • Decision matrix (lines 34–41): all 6 (partition_count, Partition) cases correctly mapped; nil-vs-explicit-zero distinction verified against *uint32 + omitempty Go semantics. ✅
  • ErrSQSEncodePartitionRoutingMismatch gate (line 112): correctly requires meta.FifoThroughputLimit == "perQueue" AND *rec.Partition != 0. The live partitionFor at sqs_partitioning.go:71-72 forces all groups to partition 0 in perQueue mode; accepting any other partition value would restore messages to |p|N| lanes effectivePartitionCount never scans. ✅
  • addSideRecords name (lines 97–103): confirmed live at encode_sqs_side.go:133, called at encode_sqs.go:214. Correct throughout v6. ✅
  • parseSQSGenericKey wrapper discard (line 76): correctly specified. Wrapper at sqs.go:571 calls parseSQSPartitionedQueueAndTrailer; the partition return can be discarded since HandleSideRecord routes by queue name only. ✅
  • Sort key (line 93): (partition, send_timestamp_millis, sequence_number, message_id) — prepends partition to the live 3-field tiebreaker from sortMessagesForEmit (sqs.go:815). Wording "3-field tiebreaker" is correct after v4 fixed the "four-field" error. ✅

Summary

Both v6 improvements — naming-collision note and key-shape/effective-count distinction — are correctly and fully documented. The design doc is ready to merge with one minor correction: the sqsQueueMetaPublic file reference on line 116 should point to sqs.go:148-162, not encode_sqs.go:71-87. Fix in the doc before merge, or note it in the implementation PR.

| Item | Severity | Status |
| --- | --- | --- |
| effectivePartitionCount naming collision + new copy note | | ✅ Fixed in v6 |
| Key-shape vs. effective-count distinction | | ✅ Fixed in v6 |
| sqsQueueMetaPublic file reference (encode_sqs.go:71-87 → sqs.go:148-162) | Minor (wrong file path) | ❌ Remaining |

claude v6 caught that v6's reference to sqsQueueMetaPublic at
encode_sqs.go:71-87 was wrong:

- sqsQueueMetaPublic actually lives at internal/backup/sqs.go:148
- encode_sqs.go:66 defines a DIFFERENT struct, sqsStoredQueueMeta,
  used for the !sqs|queue|meta| record value (live-adapter shape),
  not the dump-side public projection.

An impl PR author searching encode_sqs.go:71-87 would have found
the wrong struct. v7 corrects the file path and adds a one-line
note distinguishing the two structs.
@bootjp
Owner Author

bootjp commented Jun 3, 2026

@claude review

v7 fixes the file path:

  • sqsQueueMetaPublic is at internal/backup/sqs.go:148, not encode_sqs.go:71-87.
  • encode_sqs.go:66 defines a different struct (sqsStoredQueueMeta) used for the !sqs|queue|meta| record value (live-adapter shape), not the dump-side public projection.
  • Doc now points to the correct file and notes the distinction.

@claude

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 1m 56s —— View job


Review: M5-3 SQS partitioned-FIFO reverse encoder (v7)

  • Load tools
  • Read design doc (v7)
  • Verify v7 file path fix for sqsQueueMetaPublic
  • Cross-check all line number references against live source files
  • Post comprehensive review

v7 fix verified ✅

sqsQueueMetaPublic file path (Claude v6 finding).

v7 doc (line 116) now correctly reads: "The new copy operates on sqsQueueMetaPublic (the public/dump struct, internal/backup/sqs.go:148). Note: encode_sqs.go:66 defines a different struct, sqsStoredQueueMeta, used for the !sqs|queue|meta| value — not the dump-side struct."

Cross-checked against source:

  • internal/backup/sqs.go:148 → type sqsQueueMetaPublic struct {
  • internal/backup/encode_sqs.go:66 → type sqsStoredQueueMeta struct {

The distinction (dump-side projection vs. live-adapter JSON shape for !sqs|queue|meta| value records) is correctly noted and will prevent an implementation author from wiring the wrong type into the effectivePartitionCount helper. ✅


All other previously-verified items remain correct

| Item | Status |
| --- | --- |
| sqsPartitionedMsgDedupKey group/dedup terminator (Codex P2) | ✅ |
| Partition plumbing: parseSQSMessageDataKey + parseSQSPartitionedQueueAndTrailer signature extensions (Codex P2) | ✅ |
| *uint32 for Partition + 6-row decision matrix (Codex P1 / Gemini) | ✅ |
| ErrSQSEncodeMissingPartition sentinel + test pinning (Gemini) | ✅ |
| Sort key (partition, send_timestamp_millis, sequence_number, message_id) + "3-field tiebreaker" wording (Gemini / Claude v3) | ✅ |
| decodeSQSMessageValue returns value type, not pointer (Claude v3) | ✅ |
| addSideRecords name everywhere (Claude v4) | ✅ |
| parseSQSGenericKey wrapper mechanical discard (Claude v2) | ✅ |
| addMessage threads partition *uint32; addSideRecords branches on != nil (Claude v3) | ✅ |
| effectivePartitionCount naming collision: new copy in internal/backup/, mirrors "perQueue" constant (Claude v5) | ✅ |
| Key-shape (PartitionCount > 1) vs. validation gates (effectivePartitionCount) distinction (Claude v5) | ✅ |
| ErrSQSEncodePartitionRoutingMismatch for perQueue + nonzero partition (Codex P2 v4) | ✅ |
| sqs.go files-list comment completeness (Claude v3 editorial) | ✅ |
| encode_sqs.go:66 / sqs.go:148 file path distinction (Claude v6, this PR) | ✅ |

No new issues found

All line number references verified against the live codebase:

| Doc reference | Live location | Status |
| --- | --- | --- |
| sqsMessageRecord at sqs.go:233 | sqs.go:233 | ✅ |
| parseSQSMessageDataKey at sqs.go:534 | sqs.go:534 | ✅ |
| parseSQSPartitionedQueueAndTrailer at sqs.go:611 | sqs.go:611 | ✅ |
| parseSQSGenericKey at sqs.go:571 | sqs.go:571 | ✅ |
| decodeSQSMessageValue at sqs.go:719 | sqs.go:719 | ✅ |
| sortMessagesForEmit at sqs.go:815 | sqs.go:815 | ✅ |
| addMessage at encode_sqs.go:241 | encode_sqs.go:241 | ✅ |
| addSideRecords at encode_sqs_side.go:133 | encode_sqs_side.go:133 | ✅ |
| sqsFifoDedupWindowMillis mirror at encode_sqs_side.go:15 | encode_sqs_side.go:15 | ✅ |
| sqsQueueMetaPublic at sqs.go:148 | sqs.go:148 | ✅ |
| sqsStoredQueueMeta at encode_sqs.go:66 | encode_sqs.go:66 | ✅ |

Summary

The v7 fix is correct and complete. The design doc is ready to merge — all 14 findings from 7 review rounds are resolved, every line number reference checks out against the live source, and the mechanical implementation notes are precise enough that the implementation PR has a fully-specified contract.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c7782fc0c


- `meta.PartitionCount == 1` (classic) AND `rec.Partition != nil && *rec.Partition != 0` → reuses **`ErrSQSInvalidMessage`** (dump is internally inconsistent).
- `meta.PartitionCount > 1` AND `meta.FifoThroughputLimit == "perQueue"` AND `*rec.Partition != 0` → **new sentinel `ErrSQSEncodePartitionRoutingMismatch`**. The live router (`adapter/sqs_partitioning.go:71-72` in `partitionFor`) forces every group to partition 0 whenever `FifoThroughputLimit == "perQueue"`, regardless of `PartitionCount`; ReceiveMessage only scans the partition-0 lane (`adapter/sqs_keys_dispatch.go:125-126`). Accepting `*rec.Partition >= 1` for a `perQueue` queue would restore messages onto `|p|1|...` lanes the live receive fan-out never visits — silent data loss on first read. Codex P2 v914 v4 caught this gap. Pinned by `TestSQSEncodeRejectsNonzeroPartitionOnPerQueueHTFIFO`.

The effective partition count is `1` whenever `FifoThroughputLimit == "perQueue"`, regardless of the declared `PartitionCount`. The first two gates above only fire when the effective count is `> 1` (i.e. partitioned by message group, not collapsed by per-queue throughput limiting). An impl-side helper `effectivePartitionCount(meta) uint32` is therefore preferable to inlining the perQueue check at every gate.


P2: Do not collapse missing-partition checks for perQueue

For PartitionCount > 1 queues with FifoThroughputLimit == "perQueue", this note exempts the missing/out-of-range gates by switching them to effectivePartitionCount, but the live key dispatch still selects the partitioned keyspace solely from meta.PartitionCount > 1 (adapter/sqs_keys_dispatch.go) and this doc also says key construction must use raw PartitionCount. If a perQueue dump lacks rec.Partition, the proposed partition != nil dispatch can fall back to classic keys (or otherwise lacks the required partition-0 value), making restored rows invisible to the partitioned readers. Please keep the presence/range checks based on meta.PartitionCount > 1, with only the nonzero routing mismatch using the perQueue collapse.


…odex P2 v914 v7)

v6 introduced a subtle data-loss bug by saying the missing-partition and
out-of-range gates should use effectivePartitionCount. Codex P2 v914 v7
caught it:

For perQueue queues with PartitionCount=2:
- effectivePartitionCount collapses to 1
- v6 said: missing-partition gate doesn't fire (since effective=1)
- So a perQueue dump with rec.Partition == nil would slip past validation
- addMessage branches on partition != nil; nil -> classic sqsMsgDataKeyBytes
- But the LIVE key dispatch selects the partitioned keyspace from raw
  meta.PartitionCount > 1 (sqs_keys_dispatch.go)
- Result: restored messages get classic keys for a queue whose readers
  scan only the partitioned keyspace -> silent data loss

v8 reverts: ALL four gates use raw meta.PartitionCount > 1 as the
partitioned-queue predicate. effectivePartitionCount stays useful for
diagnostics and the ReceiveMessage scan fan-out, but MUST NOT appear
in any encoder-side validation gate predicate. Updated the
'Key-shape vs validation distinction' section to unify both choices
(key shape AND validation) on raw PartitionCount.

Caller audit: this is a semantic change in the gate predicate from
'effective' back to 'raw'. The only callers of these predicates are
the encoder's per-message loop (single site), which is what we're
fixing. The effectivePartitionCount helper itself is unchanged — it
still mirrors adapter/sqs_keys_dispatch.go's helper for the
ReceiveMessage fan-out cross-check.
@bootjp
Owner Author

bootjp commented Jun 3, 2026

@claude review

v8 addresses codex P2 v7 — a real data-loss bug in my v6 effectivePartitionCount reframing.

The bug v6 introduced: I said the missing-partition and out-of-range gates should use effectivePartitionCount. For a perQueue queue with PartitionCount=2:

  • effectivePartitionCount collapses to 1
  • v6's gate would not fire on rec.Partition == nil
  • addMessage's partition != nil branch falls to classic sqsMsgDataKeyBytes
  • But the live key dispatch (sqs_keys_dispatch.go) selects the partitioned keyspace from raw meta.PartitionCount > 1
  • Result: restored messages get classic keys → invisible to the live partitioned reader → silent data loss on first read.

v8 fix: All four validation gates use raw meta.PartitionCount > 1 as the partitioned-queue predicate. The effectivePartitionCount helper stays useful for diagnostics and the ReceiveMessage scan fan-out, but never appears in a gate predicate.

Caller audit: semantic-changing fix to the predicate. Only call site is the encoder's per-message loop in addMessage / addSideRecords (single site, which is what we're fixing). effectivePartitionCount itself is unchanged.
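
The failure mode above can be reproduced in a small sketch. Everything here is a stand-in: `queueMeta`, the key layouts, and the helper names `encoderKey`, `readerKey`, `v6GateFires`, and `v8GateFires` are hypothetical, and only the branching logic follows what this thread describes (the encoder branches on `partition != nil`, while the live reader picks the key shape from raw `PartitionCount > 1`).

```go
package main

import "fmt"

// queueMeta is a hypothetical stand-in for the queue metadata.
type queueMeta struct {
	PartitionCount      uint32
	FifoThroughputLimit string
}

// effectivePartitionCount collapses perQueue queues to one partition.
func effectivePartitionCount(meta queueMeta) uint32 {
	if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
		return 1
	}
	return meta.PartitionCount
}

// Key layouts are illustrative only, not the real encoding.
func classicKey(queue, msgID string) string { return "msg|" + queue + "|" + msgID }

func partitionedKey(queue string, part uint32, msgID string) string {
	return fmt.Sprintf("msg|p|%d|%s|%s", part, queue, msgID)
}

// readerKey models the live dispatch: key shape from raw PartitionCount.
func readerKey(meta queueMeta, queue string, part uint32, msgID string) string {
	if meta.PartitionCount > 1 {
		return partitionedKey(queue, part, msgID)
	}
	return classicKey(queue, msgID)
}

// encoderKey models the addMessage branch: a nil partition yields a classic key.
func encoderKey(queue string, partition *uint32, msgID string) string {
	if partition != nil {
		return partitionedKey(queue, *partition, msgID)
	}
	return classicKey(queue, msgID)
}

// v6GateFires is the reverted predicate (effective count).
func v6GateFires(meta queueMeta, partition *uint32) bool {
	return effectivePartitionCount(meta) > 1 && partition == nil
}

// v8GateFires is the corrected predicate (raw count).
func v8GateFires(meta queueMeta, partition *uint32) bool {
	return meta.PartitionCount > 1 && partition == nil
}

func main() {
	meta := queueMeta{PartitionCount: 2, FifoThroughputLimit: "perQueue"}
	var missing *uint32 // dump record without a partition field

	fmt.Println("v6 gate fires:", v6GateFires(meta, missing))
	fmt.Println("v8 gate fires:", v8GateFires(meta, missing))

	written := encoderKey("q", missing, "m1")
	read := readerKey(meta, "q", 0, "m1")
	fmt.Println("restored key visible to reader:", written == read)
}
```

With the v6 predicate the record passes validation and is written under a key the reader never looks up; with the v8 predicate the same record is rejected before any key is built.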

@claude

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 2m 6s


Review: M5-3 SQS partitioned-FIFO reverse encoder (v8)

  • Load tools
  • Read design doc (v8)
  • Read relevant source files (sqs_keys_dispatch.go, sqs_partitioning.go, encode_sqs.go, sqs.go)
  • Verify v8 fix for the effectivePartitionCount gate predicate bug
  • Cross-check gate ordering and coverage against live source
  • Post comprehensive review

Codex P2 v7 fix — verified ✅

The v6 bug was real. For a perQueue queue with PartitionCount=2:

  • v6 proposed using effectivePartitionCount in the missing-partition gate
  • effectivePartitionCount returns 1 for perQueue queues (sqs_keys_dispatch.go:121-129, FifoThroughputLimit == "perQueue" → return 1)
  • So the v6 gate effectivePartitionCount > 1 AND rec.Partition == nil would NOT fire for perQueue
  • addMessage's partition != nil branch falls to classic sqsMsgDataKeyBytes
  • But live dispatch (sqs_keys_dispatch.go:39-44) branches on meta.PartitionCount > 1 — NOT effectivePartitionCount:
    func sqsMsgDataKeyDispatch(meta *sqsQueueMeta, ...) []byte {
        if meta != nil && meta.PartitionCount > 1 {
            return sqsPartitionedMsgDataKey(...)  // partitioned shape, always
        }
        return sqsMsgDataKey(...)
    }
    A perQueue queue with PartitionCount=2 gets partitioned keys written by the live adapter (|p|0|...), so restoring with classic keys makes messages invisible. Data loss confirmed.

v8 correctly fixes this. All four gates (lines 109-112) now use raw meta.PartitionCount > 1:

| Gate | Predicate | Sentinel |
| --- | --- | --- |
| missing partition | PartitionCount > 1 AND rec.Partition == nil | ErrSQSEncodeMissingPartition |
| out-of-range partition | PartitionCount > 1 AND *rec.Partition >= PartitionCount | ErrSQSEncodeOutOfRangePartition |
| classic + non-zero | PartitionCount == 1 AND *rec.Partition != 0 | ErrSQSInvalidMessage |
| perQueue + nonzero | PartitionCount > 1 AND perQueue AND *rec.Partition != 0 | ErrSQSEncodePartitionRoutingMismatch |

Line 114 explicitly states the fix and the rationale, calling out the data-loss failure mode of v6. ✅
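
As a rough illustration of the gate table, the four checks could be expressed as one ordered switch. This is a sketch under assumptions, not the planned implementation: `queueMeta` and `validatePartition` are hypothetical names, the sentinel definitions are placeholders that only reuse the names from the table, and treating `PartitionCount == 0` like a classic queue is a guess.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel names follow the gate table; the definitions are placeholders.
var (
	ErrSQSEncodeMissingPartition         = errors.New("sqs encode: missing partition")
	ErrSQSEncodeOutOfRangePartition      = errors.New("sqs encode: partition out of range")
	ErrSQSInvalidMessage                 = errors.New("sqs encode: invalid message")
	ErrSQSEncodePartitionRoutingMismatch = errors.New("sqs encode: partition routing mismatch")
)

// queueMeta is a hypothetical stand-in for the queue metadata the encoder reads.
type queueMeta struct {
	PartitionCount      uint32
	FifoThroughputLimit string
}

// validatePartition applies the four gates in table order.
// Every gate keys off the raw PartitionCount, never an effective count.
func validatePartition(meta queueMeta, partition *uint32) error {
	partitioned := meta.PartitionCount > 1
	switch {
	case partitioned && partition == nil:
		return ErrSQSEncodeMissingPartition
	case partitioned && *partition >= meta.PartitionCount:
		return ErrSQSEncodeOutOfRangePartition
	case !partitioned && partition != nil && *partition != 0:
		// Assumption: PartitionCount 0 is handled like the classic case.
		return ErrSQSInvalidMessage
	case partitioned && meta.FifoThroughputLimit == "perQueue" && *partition != 0:
		return ErrSQSEncodePartitionRoutingMismatch
	}
	return nil
}

func main() {
	ptr := func(v uint32) *uint32 { return &v }
	perQueue := queueMeta{PartitionCount: 2, FifoThroughputLimit: "perQueue"}

	fmt.Println(validatePartition(perQueue, nil))    // missing partition
	fmt.Println(validatePartition(perQueue, ptr(1))) // routing mismatch
	fmt.Println(validatePartition(perQueue, ptr(0))) // accepted
}
```

Ordering matters in this sketch: the nil check runs first, so the later cases can dereference `partition` safely whenever `partitioned` is true.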


Gate ordering is correct for the perQueue+nil case

For a perQueue queue with PartitionCount=2 and rec.Partition == nil:

  • Gate 1 (ErrSQSEncodeMissingPartition): fires immediately because 2 > 1 AND nil — correct, this is a legacy dump replayed under a partitioned queue. ✅
  • Gates 2–4 are never reached.

For a perQueue queue with PartitionCount=2 and *rec.Partition == 1:

  • Gate 1: doesn't fire (non-nil).
  • Gate 2: doesn't fire (1 < 2).
  • Gate 3: doesn't fire (not classic).
  • Gate 4: fires (PartitionCount > 1 AND perQueue AND 1 != 0) → ErrSQSEncodePartitionRoutingMismatch. ✅ This is what TestSQSEncodeRejectsNonzeroPartitionOnPerQueueHTFIFO pins.

For a perQueue queue with PartitionCount=2 and *rec.Partition == 0:

  • All four gates pass → emit partitioned-shape keys at partition 0. ✅ Matches what the live adapter wrote via partitionFor (which forces 0 for perQueue, sqs_partitioning.go:71-72).

The gate ordering and coverage are complete.


Key-shape / effective-count separation — verified ✅

Lines 114–118 now explicitly document:

  • Raw meta.PartitionCount > 1 governs BOTH key construction AND all four validation gates.
  • effectivePartitionCount is scoped to diagnostics and ReceiveMessage scan fan-out only.
  • A perQueue queue with PartitionCount=2 still writes |p| prefix keys (because PartitionCount > 1); the live partitionFor separately collapses every group to partition 0.

This is the correct mental model and matches the live dispatch helpers. ✅
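
One way to picture the separation: the key shape comes from the raw count, while the number of partitions a receive-side scan visits comes from the effective count. The sketch below is illustrative only; `queueMeta`, `usesPartitionedKeys`, `scanPrefixes`, and the prefix strings are hypothetical, and the local `effectivePartitionCount` is an assumed approximation of the helper this thread refers to.

```go
package main

import "fmt"

// queueMeta is a hypothetical stand-in for the queue metadata.
type queueMeta struct {
	PartitionCount      uint32
	FifoThroughputLimit string
}

// usesPartitionedKeys is the key-shape predicate: raw PartitionCount only.
func usesPartitionedKeys(meta queueMeta) bool { return meta.PartitionCount > 1 }

// effectivePartitionCount collapses perQueue queues to a single partition.
func effectivePartitionCount(meta queueMeta) uint32 {
	if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
		return 1
	}
	return meta.PartitionCount
}

// scanPrefixes lists the prefixes a receive-side scan would visit.
// The prefix strings are illustrative, not the real key encoding.
func scanPrefixes(meta queueMeta, queue string) []string {
	if !usesPartitionedKeys(meta) {
		return []string{"msg|" + queue + "|"}
	}
	n := effectivePartitionCount(meta)
	prefixes := make([]string, 0, n)
	for p := uint32(0); p < n; p++ {
		prefixes = append(prefixes, fmt.Sprintf("msg|p|%d|%s|", p, queue))
	}
	return prefixes
}

func main() {
	fmt.Println(scanPrefixes(queueMeta{PartitionCount: 1}, "q"))
	fmt.Println(scanPrefixes(queueMeta{PartitionCount: 2, FifoThroughputLimit: "perQueue"}, "q"))
	fmt.Println(scanPrefixes(queueMeta{PartitionCount: 4, FifoThroughputLimit: "perMessageGroupId"}, "q"))
}
```

In this model a perQueue queue with `PartitionCount = 2` yields exactly one prefix, and that prefix is still the partitioned shape at partition 0.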


All previously verified items remain correct

| Item | Status |
| --- | --- |
| Dedup key group/dedup terminator (Codex P2 v1) | |
| Partition plumbing from key parsers (Codex P2 v1) | |
| *uint32 for Partition + 6-row decision matrix (Codex P1 / Gemini) | |
| ErrSQSEncodeMissingPartition + test (Gemini v2) | |
| Sort key (partition, send_ts, sequence_number, message_id) (Gemini v2 / Claude v3) | |
| decodeSQSMessageValue returns value, not pointer (Claude v3) | |
| addSideRecords name (Claude v4) | |
| parseSQSGenericKey wrapper mechanical discard (Claude v2) | |
| addMessage threads partition *uint32; addSideRecords branches on != nil (Claude v3) | |
| effectivePartitionCount naming collision: new copy in internal/backup/ (Claude v5) | |
| sqsQueueMetaPublic at sqs.go:148, not encode_sqs.go:71-87 (Claude v6) | |
| ErrSQSEncodePartitionRoutingMismatch for perQueue + nonzero (Codex P2 v4) | |
| Raw PartitionCount > 1 predicate in ALL gates, not effectivePartitionCount (Codex P2 v7) | Fixed in v8 |

No new issues found

The design doc is ready to merge. The v8 fix is correct, complete, and well-explained. All 15 findings from 8 review rounds are resolved, gate ordering is correct for all six (PartitionCount, Partition) matrix combinations including the perQueue+nil corner case, and every line-number reference checked out against the live source in prior rounds.
