## Subgroup Operations
<!--
Expose a subset of subgroup operations in WGSL shaders:
  - support: https://github.com/gpuweb/gpuweb/issues/78
  - investigation: https://github.com/gpuweb/WSL/issues/5
  - PR: https://github.com/gpuweb/gpuweb/pull/954
-->

[Motivation]: We should have subgroup operations.
  + <Standard Library>: Common across all three APIs, see
                        [#667](https://github.com/gpuweb/gpuweb/issues/667),
                        [#78](https://github.com/gpuweb/gpuweb/issues/78), and
                        [WSL#5](https://github.com/gpuweb/WSL/issues/5).
  + <Performance>: Proven to make general-purpose algorithms run
                   [multiple times](https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/)
                   faster.
  + <Viable>: There is a safe subset of subgroup operations.
  +> [Extension]
  +> [Shuffle Operations]
  +> [Quad Operations]
  +> [Explicit Operations]
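
To make the performance point concrete: a subgroup-wide reduction happens
entirely in registers, so only one invocation per subgroup has to touch
memory. A minimal Python sketch of the idea, modeling the subgroup's lanes
as a list (the `subgroup_add` helper is a hypothetical stand-in for a
shader builtin, not real API):

```python
# Model a subgroup as a list of per-lane values; a subgroup-wide add
# gives every lane the sum of all lanes' values, entirely in registers.
def subgroup_add(lane_values):
    total = sum(lane_values)
    return [total] * len(lane_values)

# Warp-aggregated atomics: instead of 8 atomic adds to memory (one per
# lane), reduce within the subgroup and let a single lane do one atomic.
lanes = [1, 2, 3, 4, 5, 6, 7, 8]
reduced = subgroup_add(lanes)
print(reduced[0])  # every lane holds 36; only lane 0 issues the atomic
```

This is the pattern behind the warp-aggregated-atomics speedup cited
above: memory traffic per subgroup drops from one atomic per lane to one.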


[Extension]: Subgroup operations should be exposed as an extension.
  + <Target>: Subgroup operations are not available on all WebGPU
              target hardware.

[Host Interface]: Subgroup size control and properties should be exposed
                  to the host API.
  - <Pipeline Properties>: Exact subgroup size can't be queried in DirectX 12.
  - <Size Control>: Only possible for Vulkan with
                    [VK_EXT_subgroup_size_control](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_subgroup_size_control.html).


[Compute Only]: Subgroup operations should only work in compute kernels.
  + <Hardware>: Restricting to compute only increases market (e.g., Adreno).
  + <Definition>: Operations are better defined for compute, and this
                  restriction makes the concern of helper invocations
                  irrelevant.
    +> <Viable>

[Shuffle Operations]: There should be shuffle and relative shuffle operations.
  - <DirectX 12>: DirectX 12 doesn't have shuffle operations.
  - <Narrow Support>: Omitting shuffle operations broadens hardware
                      support (e.g., ARM).
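
To illustrate what shuffle and relative shuffle provide, here is a Python
sketch modeling lanes as a list; `subgroup_shuffle` and
`subgroup_shuffle_xor` are hypothetical models of the builtins under
discussion, not real API:

```python
# Shuffle: each lane i reads the value held by lane indices[i].
def subgroup_shuffle(values, indices):
    return [values[indices[i]] for i in range(len(values))]

# Relative (xor) shuffle: lane i reads the value held by lane i ^ mask.
def subgroup_shuffle_xor(values, mask):
    return [values[i ^ mask] for i in range(len(values))]

vals = [10, 20, 30, 40]
print(subgroup_shuffle_xor(vals, 1))  # [20, 10, 40, 30]

# A butterfly reduction built from xor-shuffles: after log2(n) rounds
# every lane holds the subgroup-wide sum, with no shared memory at all.
step, n = 1, len(vals)
while step < n:
    vals = [vals[i] + vals[i ^ step] for i in range(n)]
    step *= 2
print(vals)  # [100, 100, 100, 100]
```

The butterfly loop is why shuffles matter for performance: without them,
the same exchange has to round-trip through workgroup shared memory.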

[Quad Operations]: Quad operations should be their own extension.
  - <Varying Support>: Not ubiquitously supported (e.g., PowerVR).
  - <Quadgroup Extension>: Making quad operations their own extension
                           both enlarges their potential market and
                           grows the market for subgroup operations.

[Explicit Operations]: Operations that take in active mask or lane index
                       should exist.
  - <Undefined Behavior>: Explicit operations, such as broadcasting
                          from a given lane index, depend on a
                          successful query of the active lanes, which
                          can't be guaranteed on the targeted
                          underlying APIs.
  - <Avoiding Helps>: Implicitly active operations are useful enough
                      while making concerns of divergence or
                      reconvergence irrelevant.
    +> <Viable>
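
The hazard above can be sketched in Python: a broadcast that takes an
explicit lane index is only defined when that lane happens to be active,
whereas an implicitly active variant (reading from the lowest active lane)
always has a well-defined source. Both helpers are hypothetical models,
not real API:

```python
# Explicit broadcast: undefined if the chosen source lane is inactive.
def subgroup_broadcast(values, active, src_lane):
    if not active[src_lane]:
        raise RuntimeError("undefined: source lane is inactive")
    return [values[src_lane] if a else None for a in active]

# Implicit variant: the source is the lowest active lane, which always
# exists, so no query of the active mask is needed.
def subgroup_broadcast_first(values, active):
    first = active.index(True)
    return [values[first] if a else None for a in active]

values = [7, 8, 9, 6]
active = [False, True, True, False]   # lanes 0 and 3 diverged away
print(subgroup_broadcast_first(values, active))  # [None, 8, 8, None]
# subgroup_broadcast(values, active, 0) would be undefined behavior.
```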

[Non-Uniform]: We should support the non-uniform subgroup model.
  + <Ubiquitous>: All WebGPU target APIs use non-uniform model.
  - <Ambiguous Divergence>: When invocations in a subgroup are
                            executing "together", for how long is that
                            guaranteed?
  - <Ambiguous Reconvergence>: Once invocations diverge, what are the
                               guarantees about where they reconverge?
    + Vulkan has weak guarantees; D3D12 and Metal don't have any.
    - We can just say invocations never reconverge, which matches the
      AMD ISA and CUDA models.
  - <Ambiguous Forward Progress>: How do blocks affect the progress of
                                  other blocks, and how do invocations
                                  within a block affect the progress of
                                  other invocations?
    + D3D12, Metal, and Vulkan are all silent on both of these
      questions.
  - <Ambiguous Helpers>: Do helper invocations participate in subgroup
                         operations?
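
The ambiguity questions above all stem from subgroup operations seeing
only the currently active lanes. A small Python model of what happens
under a divergent branch (the `subgroup_add` helper is hypothetical):

```python
# Under non-uniform control flow a subgroup op only sees the active
# lanes; lanes that took the other branch do not contribute.
def subgroup_add(values, active):
    total = sum(v for v, a in zip(values, active) if a)
    return [total if a else None for a in active]

values = [1, 2, 3, 4]
taken = [i % 2 == 0 for i in range(4)]  # lanes 0 and 2 take the branch

inside = subgroup_add(values, taken)
print(inside)  # [4, None, 4, None]: the sum is 1 + 3, not 1+2+3+4

# Whether lanes 1 and 3 rejoin the others for a later subgroup_add
# (reconvergence), and when they make progress at all, is exactly what
# the underlying APIs leave under- or unspecified.
```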