Proposal Details
This is a proposal to introduce intrinsic support for ARM64 SVE (Scalable Vector Extension) instructions. It is a child proposal of #73787.
SVE is a recent extension to the ARM64 architecture. Its defining feature is a Vector Length Agnostic (VLA) programming model, which allows developers to write SIMD code once and have it scale automatically to the hardware's available vector length, much like standard scalar code. This proposal aims to provide a clean, accessible API that feels idiomatic to Go, mirrors the existing AMD64 archsimd API where semantics overlap, and integrates smoothly with Midway.
This proposal only covers SVE and some of SVE2. Specifically, loads from and stores to register lists are not supported. Each supported type maps to a single Z or P register, and we assume these registers are at most 256 bits and 32 bits long, respectively. As a result, PN registers are also not supported. SVE2.1 and SME are not within the scope of this proposal.
Question: how many machines support SVE2? Is it worth supporting full-fledged SVE2?
Naming alignment
Where an operation has a direct semantic counterpart in simd/archsimd (the AMD64 API in #73787), this proposal uses the same method name (Add, Sub, Mul, Min, Max, Sqrt, Abs, Neg, And, Or, Xor, AndNot, Not, ShiftLeft/ShiftRight, RotateLeft/RotateRight, AddSaturated/SubSaturated, MulAdd, OnesCount, LeadingZeros, Equal/NotEqual/Greater/GreaterEqual, Masked, IfElse, etc.). The element/vector type names follow the Midway-style plural convention (Int8s, Float32s, ...) because that matches SVE's length-agnostic nature. Signatures use Midway types throughout so that user code can switch between architectures with minimal surface change.
For names that are already specified in Midway, the semantics are the same as in Midway unless a comment documents otherwise. Each // Asm comment gives the title of the corresponding Arm64 instruction in the spec table.
Predication
All SVE API entries come without predication; the user can use .Masked(m) and .IfElse(m) to request zeroing predication and merging predication, respectively.
Many instructions take a predicate P; a lot of them also come with an unpredicated form, but some do not. For instructions that come with an unpredicated form, the intrinsic will map to that one; and when combined with Masked and IfElse, the compiler will try to peephole it to a predicated form if it exists. For instructions that do not come with an unpredicated form, an all-active predicate will be constructed in place by the compiler and provided to the predicated instruction, and the intrinsic maps to this two-instruction sequence. The compiler peepholes can strip away this all-true predicate when the user calls Masked and IfElse right after.
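A scalar model may help make the two predication styles concrete. The sketch below is plain Go rather than archsimd: the names addZeroing and addMerging are hypothetical, and it assumes that zeroing predication clears inactive lanes while merging predication takes inactive lanes from a fallback vector.

```go
package main

import "fmt"

// addZeroing models an Add under zeroing predication:
// inactive lanes of the result are cleared.
func addZeroing(x, y []int8, m []bool) []int8 {
	out := make([]int8, len(x))
	for i := range x {
		if m[i] {
			out[i] = x[i] + y[i]
		}
	}
	return out
}

// addMerging models an Add under merging predication:
// inactive lanes take their value from fallback.
func addMerging(x, y, fallback []int8, m []bool) []int8 {
	out := make([]int8, len(x))
	for i := range x {
		if m[i] {
			out[i] = x[i] + y[i]
		} else {
			out[i] = fallback[i]
		}
	}
	return out
}

func main() {
	x := []int8{1, 2, 3, 4}
	y := []int8{10, 20, 30, 40}
	m := []bool{true, false, true, false}
	fmt.Println(addZeroing(x, y, m))    // [11 0 33 0]
	fmt.Println(addMerging(x, y, x, m)) // [11 2 33 4]
}
```

In the proposed API the same effect would come from chaining Masked or IfElse after the unpredicated operation, which the compiler may then fold into one predicated instruction.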
MOVPRFX
For destructive operations, a MOVPRFX will always be generated to prepare the destination register.
API Overview
Types
Scalable vector types use ElementType + "s". For example, a scalable vector of int8 elements is typed as Int8s. For scalable predicates (masks), the naming convention is "Mask" + LaneBitWidth + "s", such as Mask8s.
| Element Type | Bit Width | Vector Type | Mask Type |
| --- | --- | --- | --- |
| Signed Integer | 8-bit | Int8s | Mask8s |
| Signed Integer | 16-bit | Int16s | Mask16s |
| Signed Integer | 32-bit | Int32s | Mask32s |
| Signed Integer | 64-bit | Int64s | Mask64s |
| Unsigned Integer | 8-bit | Uint8s | Mask8s |
| Unsigned Integer | 16-bit | Uint16s | Mask16s |
| Unsigned Integer | 32-bit | Uint32s | Mask32s |
| Unsigned Integer | 64-bit | Uint64s | Mask64s |
| Floating-Point | 32-bit | Float32s | Mask32s |
| Floating-Point | 64-bit | Float64s | Mask64s |
No 16-bit floats (yet). SVE hardware supports half-precision (fp16) and brain-float (bf16) lanes, but Go has no float16 / bfloat16 primitive scalar type, and we want the SVE vector element types in archsimd to remain a one-to-one mapping to Go's primitive scalar types. If/when Go gains those scalar types, Float16s and BFloat16s (with mask Mask16s) can be added without disturbing the rest of this surface.
Because Go does not currently support dynamic stack allocations, scalable vectors and predicates are assumed to fit within a predefined maximum bound (currently set to 32 bytes for vector types and 4 bytes for predicate types).
All types come with these utility methods:
func (x <Vector>) Len() int // number of elements (vector types)
func (x <Vector>) String() string // human-readable form (vector and mask types)
ARM provides the RDVL instruction to read the hardware's actual vector length at runtime. When the archsimd package is imported, RDVL will be checked during initialization; if the hardware vector length exceeds 32 bytes, the package will panic. We believe 32 bytes (256 bits) covers the vast majority of SVE chips currently on the market (e.g., Neoverse V1). Please let us know if this constraint needs to be expanded. The APIs supported are based on ISA_A64_xml_A_profile-2025-12.
Memory Operations (Loads and Stores)
Vector Loads
Consecutive loads:
func LoadInt8s(s []int8) Int8s // Emulated (predicate construction + "LD1B (scalar plus immediate, single register)")
func LoadInt8sPart(s []int8) Int8s // Emulated (predicate construction + "LD1B (scalar plus immediate, single register)")
// ... analogous for all other vector types
Gather loads (from a slice plus a vector of indices):
// GatherInt8sPart gathers value into the result vector.
// result[i] = base[idx[i]].
// Out of bound elements will be zeroed.
//
// Asm: Emulated (predicate construction + "LD1B (scalar plus vector)")
func (idx Uint8s) GatherInt8sPart(base []int8) Int8s
// ... analogous for all other vector types
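As a rough scalar model of the gather semantics documented above (result[i] = base[idx[i]], with out-of-bound elements zeroed), the following plain-Go sketch may be useful. The function name gather is hypothetical and a slice stands in for a vector register.

```go
package main

import "fmt"

// gather models a gather load: each lane reads base at its own index,
// and lanes whose index is out of bounds produce zero.
func gather(idx []uint8, base []int8) []int8 {
	out := make([]int8, len(idx))
	for i, j := range idx {
		if int(j) < len(base) {
			out[i] = base[j]
		}
	}
	return out
}

func main() {
	base := []int8{10, 20, 30, 40}
	idx := []uint8{3, 0, 9, 1} // index 9 is out of bounds
	fmt.Println(gather(idx, base)) // [40 10 0 20]
}
```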
Vector Stores
Consecutive stores:
func (x Int8s) Store(s []int8) // Asm: Emulated (predicate construction + "ST1B (scalar plus immediate, single register)")
func (x Int8s) StorePart(s []int8) // Asm: Emulated (predicate construction + "ST1B (scalar plus immediate, single register)")
// ... analogous for all other vector types
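The Part variants are not specified in detail here, so the following is only a sketch of one plausible reading: a partial load reads at most len(s) elements and leaves the remaining lanes zero, and a partial store writes only the lanes that fit. The names loadPart and storePart and the fixed lane count are hypothetical; this is plain Go, not archsimd.

```go
package main

import "fmt"

// loadPart models a partial consecutive load into a vector with `lanes`
// elements. Only min(len(s), lanes) elements are read; the remaining
// lanes are zero (assumed behavior of the *Part loads).
func loadPart(s []int8, lanes int) []int8 {
	out := make([]int8, lanes)
	copy(out, s)
	return out
}

// storePart models the matching partial store: only the lanes that fit
// in s are written. It returns the number of lanes stored.
func storePart(v, s []int8) int {
	return copy(s, v)
}

func main() {
	src := []int8{1, 2, 3, 4, 5, 6}
	dst := make([]int8, len(src))
	const lanes = 4 // hypothetical vector length in int8 lanes
	for i := 0; i < len(src); i += lanes {
		v := loadPart(src[i:], lanes)
		storePart(v, dst[i:])
	}
	fmt.Println(dst) // [1 2 3 4 5 6]
}
```

The second iteration handles the two-element tail without reading or writing past the end of either slice, which is the property the compiler-constructed predicate is meant to provide.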
Scatter stores:
// ScatterInt8sPart stores value into a slice.
// base[idx[i]] = x[i].
// Out of bound elements will be skipped.
//
// Asm: Emulated (predicate construction + "ST1B (scalar plus vector)")
func (x Int8s) ScatterInt8sPart(idx Uint8s, base []int8)
// ... analogous for all other vector types
Note: with the proper predicate constructed by the compiler, these gather/scatter loads/stores can be bound-safe for Go.
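The scatter semantics above (base[idx[i]] = x[i], with out-of-bound elements skipped) can be sketched the same way in plain Go. The name scatter is hypothetical, and the example deliberately avoids duplicate indices, since the ordering of conflicting writes is not specified in this text.

```go
package main

import "fmt"

// scatter models a scatter store: each lane writes to base at its own
// index, and lanes whose index is out of bounds are skipped.
func scatter(x []int8, idx []uint8, base []int8) {
	for i, j := range idx {
		if int(j) < len(base) {
			base[j] = x[i]
		}
	}
}

func main() {
	base := make([]int8, 4)
	scatter([]int8{1, 2, 3, 4}, []uint8{2, 7, 0, 1}, base) // index 7 is skipped
	fmt.Println(base) // [3 4 1 0]
}
```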
Mask Loads and Stores
A predicate is loaded from / stored to a bitmask, where bit i corresponds to lane i's active state:
// LoadMask8s loads a predicate from a bitmask. The bits are concatenated
// together in little-endian order.
// If the bits slice doesn't have enough elements to fill the full mask, it will panic.
//
// Asm: Emulated (predicate construction + "LDR (predicate)")
func LoadMask8s(bits []uint16) Mask8s
// StoreMask8s stores the predicate to a slice of bits. The bits are concatenated
// together in little-endian order.
// If the slice doesn't have enough elements to store the full mask, it will panic.
//
// Asm: Emulated (predicate construction + "STR (predicate)")
func (m Mask8s) Store(bits []uint16)
// ... analogous for all other mask types
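To illustrate the stated bitmask convention (bit i corresponds to lane i, with the uint16 words concatenated in little-endian order), here is a small plain-Go model. The helper names maskToBits and bitsToMask are hypothetical, and a []bool stands in for a mask register.

```go
package main

import "fmt"

// maskToBits packs lane states into 16-bit words: lane i becomes
// bit i%16 of word i/16.
func maskToBits(m []bool) []uint16 {
	bits := make([]uint16, (len(m)+15)/16)
	for i, active := range m {
		if active {
			bits[i/16] |= 1 << (i % 16)
		}
	}
	return bits
}

// bitsToMask is the inverse. It panics if bits is too short to describe
// every lane, mirroring the documented panic.
func bitsToMask(bits []uint16, lanes int) []bool {
	if len(bits)*16 < lanes {
		panic("bits slice too short for mask")
	}
	m := make([]bool, lanes)
	for i := range m {
		m[i] = bits[i/16]&(1<<(i%16)) != 0
	}
	return m
}

func main() {
	m := make([]bool, 32)
	m[0], m[3], m[17] = true, true, true
	bits := maskToBits(m)
	fmt.Println(bits) // [9 2]
	back := bitsToMask(bits, 32)
	fmt.Println(back[0], back[3], back[17], back[1]) // true true true false
}
```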
Mask Operations
Generation
// Mask8sFromCount returns a mask that activates the first count elements.
//
// Asm: WHILELO (predicate)
func Mask8sFromCount(count int) Mask8s
// Mask8sAllTrue returns a mask that has all its elements active.
//
// Asm: PTRUE (predicate)
func Mask8sAllTrue() Mask8s
// Mask8sAllFalse (PFALSE (predicate)) is not needed: the zero value of a mask type is already all-false.
// First returns a mask that activates only the first active element of m.
//
// Asm: PFIRST
func (m Mask8s) First() Mask8s
// Next returns a mask that activates the next element after the last active element of m.
// All other elements will be set to inactive.
// If m is all inactive, the returned mask will have its first element activated.
//
// Asm: PNEXT
func (m Mask8s) Next() Mask8s
// ... analogous for all other mask types
Note: PTRUE has more variants (e.g., a fixed power-of-two count). They can be exposed later if useful.
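The generation functions can be modeled on []bool lanes as follows. This is a sketch in plain Go with hypothetical names (fromCount, first, next). It follows the comments above, and it additionally assumes that Next returns an all-inactive mask when the last lane of m is already active, which the text does not spell out.

```go
package main

import "fmt"

// fromCount models Mask8sFromCount: the first count lanes are active.
func fromCount(count, lanes int) []bool {
	m := make([]bool, lanes)
	for i := 0; i < count && i < lanes; i++ {
		m[i] = true
	}
	return m
}

// first keeps only the first active lane of m.
func first(m []bool) []bool {
	out := make([]bool, len(m))
	for i, active := range m {
		if active {
			out[i] = true
			break
		}
	}
	return out
}

// next activates only the lane after the last active lane of m.
// Assumption: if the last lane of m is active, the result is all inactive.
func next(m []bool) []bool {
	last := -1
	for i, active := range m {
		if active {
			last = i
		}
	}
	out := make([]bool, len(m))
	if last+1 < len(m) {
		out[last+1] = true
	}
	return out
}

func main() {
	m := fromCount(2, 4)
	fmt.Println(m)        // [true true false false]
	fmt.Println(first(m)) // [true false false false]
	fmt.Println(next(m))  // [false false true false]
}
```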
Logic and Bitwise Operations
Element-wise logical operations between masks of the same lane width.
func (m Mask8s) And(n Mask8s) Mask8s // Asm: AND (predicate)
func (m Mask8s) Or(n Mask8s) Mask8s // Asm: ORR (predicate)
func (m Mask8s) Xor(n Mask8s) Mask8s // Asm: EOR (predicate)
func (m Mask8s) AndNot(n Mask8s) Mask8s // Asm: BIC (predicate)
func (m Mask8s) Not() Mask8s // Asm: NOT (predicate)
// ... analogous for all other mask types
Tests and Reductions
// CountActive returns the number of active elements of m
//
// Asm: CNTP (predicate)
func (m Mask8s) CountActive() int
// FirstIsActive returns true if the first element in m is active.
//
// Asm: Emulated (with PTEST)
func (m Mask8s) FirstIsActive() bool
// LastIsActive returns true if the last element in m is active.
//
// Asm: Emulated (with PTEST)
func (m Mask8s) LastIsActive() bool
// ... analogous for wider masks
Note: PTEST could be peepholed with PFIRST and PNEXT.
Note: FIRSTP looks like a useful instruction; however, it is SVE2. Should we support it?
Conversions
SVE predicates are layout-identical bitmasks (1 bit per byte), but are typed by lane width in Go for type safety. Conversions change one mask type into another.
Widen Elements
Unpack and widen (by padding 0 bits) a predicate of narrower lanes into wider lanes.
// UnpackWidenLo unpacks the lanes from the lower half of the source mask m
// and widens them by padding 0s to twice their lane width.
//
// Asm: PUNPKLO
func (m Mask8s) UnpackWidenLo() Mask16s
// UnpackWidenHi unpacks the lanes from the upper half of the source mask m
// and widens them by padding 0s to twice their lane width.
//
// Asm: PUNPKHI
func (m Mask8s) UnpackWidenHi() Mask16s
// ... analogous for wider masks
Narrow Elements
Pack two wider-lane masks (low/high halves) into a single narrower-lane mask.
// PackNarrow packs two wider-lane masks into a single narrower-lane mask.
// lo will be placed in the lower half of the result, and hi will be placed in the upper half.
//
// Asm: UZP1 (predicates)
func (lo Mask16s) PackNarrow(hi Mask16s) Mask8s
// ... analogous for wider masks
Note: UZP1 on predicates can also deinterleave lanes when the operand and result types are the same; that is exposed below under PackEven/PackOdd.
Same-width Packing
Pack the even- or odd-indexed lanes of two masks into a single mask.
// PackEven extracts the even-indexed elements from lo and hi and concatenates them.
// lo's even elements are placed in the lower half of the result, and hi's even
// elements are placed in the upper half.
//
// Asm: UZP1 (predicates)
func (lo Mask8s) PackEven(hi Mask8s) Mask8s
// PackOdd extracts the odd-indexed elements from lo and hi and concatenates them.
// lo's odd elements are placed in the lower half of the result, and hi's odd
// elements are placed in the upper half.
//
// Asm: UZP2 (predicates)
func (lo Mask8s) PackOdd(hi Mask8s) Mask8s
// ... analogous for wider masks
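The PackEven/PackOdd lane movement described in the comments above can be sketched with a generic plain-Go model. The function names are hypothetical and a slice stands in for a mask (or vector) register; integers are used in the demo only to make the lane origins visible.

```go
package main

import "fmt"

// packEven concatenates the even-indexed lanes of lo (lower half of the
// result) and of hi (upper half of the result).
func packEven[T any](lo, hi []T) []T {
	half := len(lo) / 2
	out := make([]T, len(lo))
	for i := 0; i < half; i++ {
		out[i] = lo[2*i]
		out[half+i] = hi[2*i]
	}
	return out
}

// packOdd does the same with the odd-indexed lanes.
func packOdd[T any](lo, hi []T) []T {
	half := len(lo) / 2
	out := make([]T, len(lo))
	for i := 0; i < half; i++ {
		out[i] = lo[2*i+1]
		out[half+i] = hi[2*i+1]
	}
	return out
}

func main() {
	lo := []int{1, 2, 3, 4}
	hi := []int{5, 6, 7, 8}
	fmt.Println(packEven(lo, hi)) // [1 3 5 7]
	fmt.Println(packOdd(lo, hi))  // [2 4 6 8]
}
```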
No-op conversions
These conversions reinterpret the bits of a mask. They are no-op operations.
func (m Mask8s) AsMask16s() Mask16s
func (m Mask8s) AsMask32s() Mask32s
func (m Mask8s) AsMask64s() Mask64s
// ... analogous for wider masks
These conversions do not exist on amd64 or arm64. However, this proposal still includes them because SVE predicates always apply on a per-byte basis, regardless of arrangement. Larger mask types use only the bits whose indices are a multiple of the lane size in bytes. A user may therefore want to reinterpret the bits of a predicate so that it applies to a different vector arrangement. Discussion is welcome!
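To make the reinterpretation concrete, the sketch below models a raw predicate as an integer with one bit per vector byte and asks which lanes are active under different lane widths. The helper laneActive and the uint32 representation (matching the 4-byte bound on predicates) are assumptions for illustration only.

```go
package main

import "fmt"

// laneActive reports whether lane i is active when the raw predicate
// bits p are interpreted for lanes that are laneBytes wide. Only the bit
// at index i*laneBytes is consulted.
func laneActive(p uint32, laneBytes, i int) bool {
	return p&(1<<(i*laneBytes)) != 0
}

func main() {
	// Raw predicate with bits 0 and 4 set, for a 16-byte vector.
	p := uint32(0b0001_0001)

	// As a byte-lane mask: lanes 0 and 4 are active.
	fmt.Println(laneActive(p, 1, 0), laneActive(p, 1, 4)) // true true
	// As a 32-bit-lane mask: bit 4 now governs lane 1.
	fmt.Println(laneActive(p, 4, 0), laneActive(p, 4, 1)) // true true
	// As a 16-bit-lane mask: bit 4 governs lane 2, and lane 1 (bit 2) is inactive.
	fmt.Println(laneActive(p, 2, 1), laneActive(p, 2, 2)) // false true
}
```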
Permutations
// Reverse reverses the order of elements in m.
// result[i] = m[m.Len() - 1 - i]
//
// Asm: REV (predicate)
func (m Mask8s) Reverse() Mask8s
// InterleaveLo interleaves the lower half of x with the lower half of y.
//
// Asm: ZIP1 (predicate)
func (x Mask8s) InterleaveLo(y Mask8s) Mask8s
// InterleaveHi interleaves the upper half of x with the upper half of y.
//
// Asm: ZIP2 (predicate)
func (x Mask8s) InterleaveHi(y Mask8s) Mask8s
// InterleaveEven interleaves the even-indexed lanes of x and y:
// result[2i] = x[2i]
// result[2i+1] = y[2i]
//
// Asm: TRN1 (predicate)
func (x Mask8s) InterleaveEven(y Mask8s) Mask8s
// InterleaveOdd interleaves the odd-indexed lanes of x and y:
// result[2i] = x[2i+1]
// result[2i+1] = y[2i+1]
//
// Asm: TRN2 (predicate)
func (x Mask8s) InterleaveOdd(y Mask8s) Mask8s
// ... analogous for wider masks
Comparisons Producing Masks
All scalable vector types support element-wise comparisons that yield the corresponding mask type:
func (x Int8s) Equal(y Int8s) Mask8s // Asm: CMPEQ (vectors)
func (x Int8s) NotEqual(y Int8s) Mask8s // Asm: CMPNE (vectors)
func (x Int8s) Greater(y Int8s) Mask8s // Asm: CMPGT (vectors)
func (x Int8s) GreaterEqual(y Int8s) Mask8s // Asm: CMPGE (vectors)
// ... analogous for all other integer / float vector types
Floating-point types additionally provide:
func (x Float32s) IsNaN() Mask32s // Asm: FCMUO (vectors)
func (x Float64s) IsNaN() Mask64s // Asm: FCMUO (vectors)
Vector Operations
SVE vector operations are presented in their unconditional form. For operations that lack an unpredicated SVE form, the compiler constructs an implicit all-true predicate at lowering (see the Predication section). Predication is applied through the Masked / IfElse chaining described at the end of this section, and the compiler folds the chain into a single predicated instruction.
Element-wise Arithmetic
func (x Int8s) Add(y Int8s) Int8s // Asm: ADD (vectors, unpredicated)
func (x Int8s) Sub(y Int8s) Int8s // Asm: SUB (vectors, unpredicated)
func (x Int8s) Mul(y Int8s) Int8s // Asm: MUL (vectors, unpredicated)
func (x Int8s) Abs() Int8s // Asm: ABS
func (x Int8s) Min(y Int8s) Int8s // Asm: SMIN (vectors)
func (x Int8s) Max(y Int8s) Int8s // Asm: SMAX (vectors)
// ... analogous for all other vector types
Division and square root are float-only (integer SDIV/UDIV exist only for 32-/64-bit lanes in SVE and are listed separately):
func (x Float32s) Div(y Float32s) Float32s // Asm: FDIV
func (x Float32s) Sqrt() Float32s // Asm: FSQRT
func (x Float32s) Reciprocal() Float32s // Asm: FRECPE
func (x Float32s) ReciprocalSqrt() Float32s // Asm: FRSQRTE
// TODO: should we also support FRECPS, FRECPX, FRSQRTS?
// ... analogous for Float64s
Integer division (32- and 64-bit only):
func (x Int32s) Div(y Int32s) Int32s // Asm: SDIV
func (x Uint32s) Div(y Uint32s) Uint32s // Asm: UDIV
// ... analogous for 64-bit
Multiply-high (upper half of a widening multiply, without widening the result type):
func (x Int8s) MulHigh(y Int8s) Int8s // Asm: SMULH (unpredicated)
func (x Uint8s) MulHigh(y Uint8s) Uint8s // Asm: UMULH (unpredicated)
// ... analogous for 16/32/64-bit integer lanes
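Per lane, multiply-high can be modeled in plain Go by widening, multiplying, and keeping the upper half. The function names below are hypothetical scalar stand-ins for one 8-bit lane.

```go
package main

import "fmt"

// mulHighInt8 returns the upper 8 bits of the 16-bit signed product.
func mulHighInt8(x, y int8) int8 {
	return int8((int16(x) * int16(y)) >> 8)
}

// mulHighUint8 returns the upper 8 bits of the 16-bit unsigned product.
func mulHighUint8(x, y uint8) uint8 {
	return uint8((uint16(x) * uint16(y)) >> 8)
}

func main() {
	fmt.Println(mulHighInt8(100, 100))  // 39  (10000 = 0x2710)
	fmt.Println(mulHighInt8(-128, 127)) // -64 (-16256 >> 8)
	fmt.Println(mulHighUint8(200, 200)) // 156 (40000 = 0x9C40)
}
```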
Fused multiply-add (single-rounding):
func (x Float32s) MulAdd(y, z Float32s) Float32s // Asm: FMLA (vectors)
// MulSub performs a fused (x * y) - z.
//
// Asm: FMLS (vectors)
func (x Float32s) MulSub(y, z Float32s) Float32s
// ... analogous for Float64s
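The practical effect of single rounding can be seen with Go's standard math.FMA on float64 scalars. This is only a scalar illustration of the fused semantics, not the archsimd API, and the helper names are hypothetical. The explicit float64 conversion in the unfused version matters because the Go spec otherwise allows the compiler to fuse x*y + z on some architectures.

```go
package main

import (
	"fmt"
	"math"
)

// mulAddUnfused rounds the product before adding. The explicit
// conversion prevents the compiler from fusing the two operations.
func mulAddUnfused(x, y, z float64) float64 {
	return float64(x*y) + z
}

// mulAddFused models MulAdd: x*y + z with a single rounding.
func mulAddFused(x, y, z float64) float64 {
	return math.FMA(x, y, z)
}

// mulSubFused models MulSub as documented above: (x * y) - z,
// also with a single rounding.
func mulSubFused(x, y, z float64) float64 {
	return math.FMA(x, y, -z)
}

func main() {
	x := 1 + math.Ldexp(1, -30) // 1 + 2^-30
	y := 1 - math.Ldexp(1, -30) // 1 - 2^-30
	// The exact product is 1 - 2^-60, which rounds to 1 in float64.
	fmt.Println(mulAddUnfused(x, y, -1))                      // 0
	fmt.Println(mulAddFused(x, y, -1) == math.Ldexp(-1, -60)) // true
}
```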
Saturating Arithmetic (integer types)
func (x Int8s) AddSaturated(y Int8s) Int8s // Asm: SQADD (vectors, unpredicated)
func (x Uint8s) AddSaturated(y Uint8s) Uint8s // Asm: UQADD (vectors, unpredicated)
func (x Int8s) SubSaturated(y Int8s) Int8s // Asm: SQSUB (vectors, unpredicated)
func (x Uint8s) SubSaturated(y Uint8s) Uint8s // Asm: UQSUB (vectors, unpredicated)
// ... analogous for 16/32/64-bit integer lanes
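Saturating arithmetic clamps to the lane type's range instead of wrapping. A per-lane scalar model in plain Go, with hypothetical function names, could look like this:

```go
package main

import "fmt"

// addSaturatedInt8 clamps the sum to [-128, 127] instead of wrapping.
func addSaturatedInt8(x, y int8) int8 {
	sum := int16(x) + int16(y)
	if sum > 127 {
		return 127
	}
	if sum < -128 {
		return -128
	}
	return int8(sum)
}

// addSaturatedUint8 clamps the sum to 255.
func addSaturatedUint8(x, y uint8) uint8 {
	sum := uint16(x) + uint16(y)
	if sum > 255 {
		return 255
	}
	return uint8(sum)
}

// subSaturatedUint8 clamps the difference to 0.
func subSaturatedUint8(x, y uint8) uint8 {
	if y > x {
		return 0
	}
	return x - y
}

func main() {
	fmt.Println(addSaturatedInt8(100, 100))   // 127
	fmt.Println(addSaturatedInt8(-100, -100)) // -128
	fmt.Println(addSaturatedUint8(200, 100))  // 255
	fmt.Println(subSaturatedUint8(5, 10))     // 0
}
```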
Bitwise Logic (integer types)
func (x Int8s) And(y Int8s) Int8s // Asm: AND (vectors, unpredicated)
func (x Int8s) Or(y Int8s) Int8s // Asm: ORR (vectors, unpredicated)
func (x Int8s) Xor(y Int8s) Int8s // Asm: EOR (vectors, unpredicated)
func (x Int8s) AndNot(y Int8s) Int8s // Asm: BIC (vectors, unpredicated)
func (x Int8s) Not() Int8s // Asm: NOT (vectors)
// ... analogous for all other integer vector types
Shifts and Rotations (integer types)
Shifts come in two forms: vector-by-scalar (a single shift amount applied to every lane) and vector-by-vector (a per-lane shift amount). Right shift is logical for unsigned lanes and arithmetic for signed lanes.
func (x Int8s) ShiftAllLeft(shift uint64) Int8s // Asm: LSL (immediate, unpredicated)
func (x Int8s) ShiftLeft(y Uint8s) Int8s // Asm: LSL (vectors)
func (x Int8s) ShiftAllRight(shift uint64) Int8s // Asm: ASR (immediate, unpredicated)
func (x Int8s) ShiftRight(y Uint8s) Int8s // Asm: ASR (vectors)
func (x Uint8s) ShiftAllRight(shift uint64) Uint8s // Asm: LSR (immediate, unpredicated)
func (x Uint8s) ShiftRight(y Uint8s) Uint8s // Asm: LSR (vectors)
// ... analogous for 16/32/64-bit integer lanes
Rotations are available only as part of the semantics of the RAX1 and XAR instructions; how should we support them?
Bit Manipulation (integer types)
func (x Uint8s) OnesCount() Uint8s // Asm: CNT
func (x Uint8s) LeadingZeros() Uint8s // Asm: CLZ
// ... analogous for signed lanes
Floating-Point Rounding
func (x Float32s) Round() Float32s // Asm: FRINTN
func (x Float32s) Ceil() Float32s // Asm: FRINTP
func (x Float32s) Floor() Float32s // Asm: FRINTM
func (x Float32s) Trunc() Float32s // Asm: FRINTZ
// ... analogous for Float64s
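One detail worth noting: FRINTN rounds to nearest with ties to even, so under the mapping listed above a scalar model of Round would be math.RoundToEven rather than math.Round (which rounds ties away from zero). The sketch below uses hypothetical per-lane helpers in plain Go.

```go
package main

import (
	"fmt"
	"math"
)

// roundLane models FRINTN on one float32 lane: nearest, ties to even.
func roundLane(x float32) float32 { return float32(math.RoundToEven(float64(x))) }

// ceilLane, floorLane, and truncLane model FRINTP, FRINTM, and FRINTZ.
func ceilLane(x float32) float32  { return float32(math.Ceil(float64(x))) }
func floorLane(x float32) float32 { return float32(math.Floor(float64(x))) }
func truncLane(x float32) float32 { return float32(math.Trunc(float64(x))) }

func main() {
	fmt.Println(roundLane(2.5), math.Round(2.5)) // 2 3
	fmt.Println(roundLane(3.5), math.Round(3.5)) // 4 4
	fmt.Println(ceilLane(-1.5), floorLane(-1.5), truncLane(-1.5)) // -1 -2 -1
}
```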
Type Conversions and Lane-Width Changes
Reinterpret (no instruction; bit-cast)
func (x Int8s) ToBits() Uint8s
func (x Uint8s) ReshapeToUint16s() Uint16s
func (x Uint8s) ReshapeToUint32s() Uint32s
func (x Uint8s) ReshapeToUint64s() Uint64s
// ... analogous for all same-width integer vectors
Widen (sign- or zero-extend lanes)
// UnpackWidenLo takes the lower half of the vector and sign-extends it to 16-bit lanes.
//
// Asm: SUNPKLO
func (x Int8s) UnpackWidenLo() Int16s
// UnpackWidenHi takes the higher half of the vector and sign-extends it to 16-bit lanes.
//
// Asm: SUNPKHI
func (x Int8s) UnpackWidenHi() Int16s
// UnpackWidenLo takes the lower half of the vector and zero-extends it to 16-bit lanes.
//
// Asm: UUNPKLO
func (x Uint8s) UnpackWidenLo() Uint16s
// UnpackWidenHi takes the higher half of the vector and zero-extends it to 16-bit lanes.
//
// Asm: UUNPKHI
func (x Uint8s) UnpackWidenHi() Uint16s
// ... analogous for Int16s↔Int32s, Int32s↔Int64s, Uint16s↔Uint32s, Uint32s↔Uint64s
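Because the result occupies a register of the same size, widening halves the lane count: each call consumes one half of the source lanes. A plain-Go sketch with hypothetical names, showing the difference between sign and zero extension on the same bit pattern:

```go
package main

import "fmt"

// unpackWidenLo sign-extends the lower half of x to 16-bit lanes.
func unpackWidenLo(x []int8) []int16 {
	out := make([]int16, len(x)/2)
	for i := range out {
		out[i] = int16(x[i])
	}
	return out
}

// unpackWidenHi sign-extends the upper half of x to 16-bit lanes.
func unpackWidenHi(x []int8) []int16 {
	half := len(x) / 2
	out := make([]int16, half)
	for i := range out {
		out[i] = int16(x[half+i])
	}
	return out
}

// unpackWidenLoUnsigned zero-extends the lower half of x.
func unpackWidenLoUnsigned(x []uint8) []uint16 {
	out := make([]uint16, len(x)/2)
	for i := range out {
		out[i] = uint16(x[i])
	}
	return out
}

func main() {
	fmt.Println(unpackWidenLo([]int8{-1, 2, -3, 4}))             // [-1 2]
	fmt.Println(unpackWidenHi([]int8{-1, 2, -3, 4}))             // [-3 4]
	fmt.Println(unpackWidenLoUnsigned([]uint8{255, 2, 253, 4})) // [255 2]
}
```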
Narrow (truncate to half-width lanes)
// PackTrunc truncates lo and hi and packs them into the lower and higher halves of
// the result vector.
//
// Asm: UZP1
func (lo Int16s) PackTrunc(hi Int16s) Int8s
// ... analogous for Uint16s, Int32s, Uint32s, Int64s, Uint64s
Same-width Packing
Pack the even- or odd-indexed lanes of two vectors into a single vector.
// PackEven extracts the even-indexed elements from lo and hi and concatenates them.
// lo's even elements are placed in the lower half of the result, and hi's even
// elements are placed in the upper half.
//
// Asm: UZP1 (vectors)
func (lo Int8s) PackEven(hi Int8s) Int8s
// PackOdd extracts the odd-indexed elements from lo and hi and concatenates them.
// lo's odd elements are placed in the lower half of the result, and hi's odd
// elements are placed in the upper half.
//
// Asm: UZP2 (vectors)
func (lo Int8s) PackOdd(hi Int8s) Int8s
// ... analogous for Int16s, Int32s, Int64s, Uint16s, Uint32s, Uint64s
Integer ↔ Float
func (x Int32s) ConvertToFloat32s() Float32s // Asm: SCVTF (predicated)
func (x Uint32s) ConvertToFloat32s() Float32s // Asm: UCVTF (predicated)
func (x Float32s) ConvertToInt32s() Int32s // Asm: FCVTZS
func (x Float32s) ConvertToUint32s() Uint32s // Asm: FCVTZU
func (x Int64s) ConvertToFloat64s() Float64s // Asm: SCVTF (predicated)
func (x Uint64s) ConvertToFloat64s() Float64s // Asm: UCVTF (predicated)
func (x Float64s) ConvertToInt64s() Int64s // Asm: FCVTZS
func (x Float64s) ConvertToUint64s() Uint64s // Asm: FCVTZU
Cross-width float conversions:
// UnpackWidenEvenToFloat64s performs the following operation:
// result[i] = float64(x[2i])
//
// Asm: FCVT
func (x Float32s) UnpackWidenEvenToFloat64s() Float64s
// EvenNarrowToFloat32s performs the following operation:
// result[2i] = float32(x[i])
// result[2i+1] = 0
//
// Asm: FCVT
func (x Float64s) EvenNarrowToFloat32s() Float32s
Horizontal Reductions
Compute a scalar value across all active lanes of a scalable vector:
// SumReduce reduces x to the sum of all elements in x.
//
// Asm: SADDV
func (x Int8s) SumReduce() int8
// MinReduce reduces x to the minimum value among all active elements.
//
// Asm: SMINV
func (x Int8s) MinReduce() int8
// MaxReduce reduces x to the maximum value among all active elements.
//
// Asm: SMAXV
func (x Int8s) MaxReduce() int8
// AndReduce reduces x to the logical AND of all active elements.
//
// Asm: ANDV
func (x Int8s) AndReduce() int8
// OrReduce reduces x to the logical OR of all active elements.
//
// Asm: ORV
func (x Int8s) OrReduce() int8
// XorReduce reduces x to the logical XOR of all active elements.
//
// Asm: EORV
func (x Int8s) XorReduce() int8
// ... analogous for all other vector types (And/Or/Xor are integer-only)
Broadcast, Index, and Lane Access
func BroadcastInt8s(v int8) Int8s // Asm: DUP (scalar)
// ... analogous for all other vector types.
// ArithSeqInt8s creates an arithmetic sequence with the given start and step:
// result[i] = start + i * step
//
// A non-constant value of step may result in significantly worse performance for this operation.
//
// Asm: INDEX (scalar, immediate)
func ArithSeqInt8s(start, step int8) Int8s
// ... analogous for Int16s, Int32s, Int64s, Uint*s
Element Getters:
// GetElemLastActive extracts the last active element in x governed by m
// If m is all false, the highest-numbered element is extracted.
//
// Asm: LASTB
func (x Int8s) GetElemLastActive(m Mask8s) int8
// GetElemAfterLastActive extracts the element right after the last active element in x governed by m.
// Letting j be the index of the last active element in m (or -1 if m is all false),
// the result is x[(j + 1) mod x.Len()].
//
// Asm: LASTA
func (x Int8s) GetElemAfterLastActive(m Mask8s) int8
// ... analogous for all other vector types
Note: These are the intrinsic SVE GetElem operations, and they look strange. Should we support a clean version, func (x Int8s) GetElem(i int) int8, that mimics the amd64 API? It would be an emulation based on these intrinsics.
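The two getters, and one possible way an index-based GetElem could be emulated on top of them, can be modeled in plain Go as below. All names are hypothetical, slices stand in for registers, and the getElem emulation (a mask of the first i lanes fed to the after-last-active getter) is just one option, not a committed design.

```go
package main

import "fmt"

// lastActive returns the index of the last active lane, or -1 if none.
func lastActive(m []bool) int {
	last := -1
	for i, active := range m {
		if active {
			last = i
		}
	}
	return last
}

// getElemLastActive models LASTB: the last active element, or the
// highest-numbered element if m is all false.
func getElemLastActive(x []int8, m []bool) int8 {
	j := lastActive(m)
	if j < 0 {
		return x[len(x)-1]
	}
	return x[j]
}

// getElemAfterLastActive models LASTA: x[(j + 1) mod len(x)].
func getElemAfterLastActive(x []int8, m []bool) int8 {
	return x[(lastActive(m)+1)%len(x)]
}

// getElem emulates an index-based getter: with the first i lanes active,
// the element after the last active one is x[i].
func getElem(x []int8, i int) int8 {
	m := make([]bool, len(x))
	for k := 0; k < i && k < len(x); k++ {
		m[k] = true
	}
	return getElemAfterLastActive(x, m)
}

func main() {
	x := []int8{10, 20, 30, 40}
	m := []bool{true, false, true, false}
	fmt.Println(getElemLastActive(x, m))      // 30
	fmt.Println(getElemAfterLastActive(x, m)) // 40
	fmt.Println(getElem(x, 2))                // 30
}
```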
Element Setters:
// SetElem sets the element at index % x.Len() to v.
//
// Asm: Emulated (Predicate construction + "CPY (scalar)")
func (x Int8s) SetElem(index uint8, v int8) Int8s
// ... analogous for all other vector types
Permutations
// Reverse reverses the order of elements in x.
// result[i] = x[x.Len() - 1 - i]
//
// Asm: REV (vectors)
func (x Int8s) Reverse() Int8s
func (x Int8s) InterleaveLo(y Int8s) Int8s // Asm: ZIP1 (vectors)
func (x Int8s) InterleaveHi(y Int8s) Int8s // Asm: ZIP2 (vectors)
// InterleaveEven interleaves the even-indexed lanes of x and y:
// result[2i] = x[2i]
// result[2i+1] = y[2i]
//
// Asm: TRN1 (vectors)
func (x Int8s) InterleaveEven(y Int8s) Int8s
// InterleaveOdd interleaves the odd-indexed lanes of x and y:
// result[2i] = x[2i+1]
// result[2i+1] = y[2i+1]
//
// Asm: TRN2 (vectors)
func (x Int8s) InterleaveOdd(y Int8s) Int8s
func (x Int8s) PermuteOrZero(idx Uint8s) Int8s // Asm: TBL
func (x Int32s) Compress(m Mask32s) Int32s // Asm: COMPACT (32/64-bit only on base SVE)
// Splice splices x and y with m: the range in x governed by m's first and last active element will
// be copied to the result's lower part, and the remaining high part will be copied from y's low part.
// For example:
// x = [1, 2, 3, 4], m = [T, F, T, F], y = [5, 6, 7, 8]
// result = [1, 2, 3, 5]
//
// Asm: SPLICE
func (x Int8s) Splice(y Int8s, m Mask8s) Int8s
// ... analogous for other element widths
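The three less obvious permutations can be sketched as scalar loops in plain Go. The function names are hypothetical; the model assumes that Compress packs active elements toward the low lanes and zero-fills the rest, and that Splice with an all-false mask yields y unchanged.

```go
package main

import "fmt"

// permuteOrZero models TBL: out-of-range indices produce zero.
func permuteOrZero(x []int8, idx []uint8) []int8 {
	out := make([]int8, len(x))
	for i, j := range idx {
		if int(j) < len(x) {
			out[i] = x[j]
		}
	}
	return out
}

// compress models COMPACT: active elements are packed into the low
// lanes and the remaining lanes are zero.
func compress(x []int32, m []bool) []int32 {
	out := make([]int32, len(x))
	n := 0
	for i, active := range m {
		if active {
			out[n] = x[i]
			n++
		}
	}
	return out
}

// splice models SPLICE: x[first..last] (bounded by the first and last
// active lanes of m) followed by the low lanes of y.
func splice(x, y []int8, m []bool) []int8 {
	first, last := -1, -1
	for i, active := range m {
		if active {
			if first < 0 {
				first = i
			}
			last = i
		}
	}
	out := make([]int8, 0, len(x))
	if first >= 0 {
		out = append(out, x[first:last+1]...)
	}
	return append(out, y[:len(x)-len(out)]...)
}

func main() {
	fmt.Println(permuteOrZero([]int8{10, 20, 30, 40}, []uint8{3, 3, 0, 9}))        // [40 40 10 0]
	fmt.Println(compress([]int32{1, 2, 3, 4}, []bool{false, true, false, true})) // [2 4 0 0]
	fmt.Println(splice([]int8{1, 2, 3, 4}, []int8{5, 6, 7, 8},
		[]bool{true, false, true, false})) // [1 2 3 5]
}
```

The last line reproduces the Splice example given in the comment above.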
CPU Feature Check
Two new CPU features will be added: cpu.ARM64.HasSVE and cpu.ARM64.HasSVE2.
As we support more extensions, more feature flags can be added, for example for the SVE2 crypto extensions.
Example
Below are two examples. The first is a Vector Length Agnostic (VLA) loop that adds two slices of int8s together. SVE allows the loop stride to safely scale to the hardware's vector length while automatically masking the tail end of the slice.
func AddSlice(x, y []int8) []int8 {
	// Any stride <= the hardware VL works. We use 5 here as an example.
	stride := min(5, len(x), len(y))
	commonLen := min(len(x), len(y))
	res := make([]int8, commonLen)
	for i := 0; i < commonLen; i += stride {
		xv := archsimd.LoadInt8sPart(x[i:])
		yv := archsimd.LoadInt8sPart(y[i:])
		zv := xv.Add(yv)
		zv.StorePart(res[i:])
	}
	return res
}
The second example is a row-major float32 matrix multiply C = A * B, where A is M×K, B is K×N, and C is M×N. The j-loop walks along a row of C in scalable chunks — the stride is the hardware vector length, and LoadFloat32sPart / StorePart mask the tail of each row automatically, so the code is correct for every N without a separate scalar epilogue. The accumulator stays in a Float32s register across the entire k-loop, and MulAdd lowers to a single fused multiply-add per iteration. The comments below trace the inner k-loop for i = 0, j = 0 at a vector length of 4 float32 lanes.
func MatMul(a, b, c []float32, M, K, N int) {
	// The stride scales with the hardware's vector length for float32 lanes.
	var probe archsimd.Float32s
	stride := probe.Len()
	// Suppose a = b = [[ 1,  2,  3,  4],
	//                  [ 5,  6,  7,  8],
	//                  [ 9, 10, 11, 12],
	//                  [13, 14, 15, 16]]
	for i := 0; i < M; i++ {
		// Vectorize across the columns of the result row.
		for j := 0; j < N; j += stride {
			var acc archsimd.Float32s // zero
			for k := 0; k < K; k++ {
				// k = 0: aik = [1, 1, 1, 1]
				// k = 1: aik = [2, 2, 2, 2]
				// k = 2: aik = [3, 3, 3, 3]
				// k = 3: aik = [4, 4, 4, 4]
				aik := archsimd.BroadcastFloat32s(a[i*K+k])
				// k = 0: bkj = [ 1,  2,  3,  4]
				// k = 1: bkj = [ 5,  6,  7,  8]
				// k = 2: bkj = [ 9, 10, 11, 12]
				// k = 3: bkj = [13, 14, 15, 16]
				bkj := archsimd.LoadFloat32sPart(b[k*N+j : k*N+N])
				// k = 0: acc += 1 * [ 1,  2,  3,  4]
				// k = 1: acc += 2 * [ 5,  6,  7,  8]
				// k = 2: acc += 3 * [ 9, 10, 11, 12]
				// k = 3: acc += 4 * [13, 14, 15, 16]
				acc = aik.MulAdd(bkj, acc) // acc += aik * bkj
			}
			acc.StorePart(c[i*N+j : i*N+N])
		}
	}
}
Impact on Tooling
We expect the impact on tooling to be minimal: these are all concrete types with a fixed size, and they will flow through the compiler just like the amd64 types.
@AWSjswinney @cherrymui @dr2chase @aclements @amusman