SplatGPT Part 2A: Engineering GPT for Set Completion
Introducing SplatGPT, a deep learning model using set-transformers to effortlessly navigate Splatoon 3’s intricate 160B gear-build possibilities
Recap
Last time, we saw why Splatoon 3 loadouts overwhelm naive recommenders:
160 billion gear-weapon options when the dataset is at least three orders of magnitude smaller, non-linear stacking curves, and "noise" that's indistinguishable from signal without taking the full context into account.
We distilled three hard requirements:
- Permutation Invariance
- Non-linear interaction capture
- Deep contextual understanding
Introducing SplatGPT: a Set Transformer-based model that predicts probabilities over bucketed ability tokens, then assembles legal builds with a beam search variant.
Embracing Sets and Structured Prediction
The failure of naive recommendation systems isn't just about data scarcity or noise; it's a representation mismatch. A Splatoon gear build is not an ordered list. It is a set whose value comes from which abilities appear, at what quantities, and how they interact, not where they sit. Modeling the data as a set automatically satisfies our first requirement: permutation invariance.
The other two requirements, capturing non-linear interactions and understanding deep contextual relationships, are strengths of Transformer architectures like those found in Large Language Models (LLMs). However, standard LLMs are typically designed for ordered sequences: they emit token after token and score themselves on next-token accuracy. Even without positional embeddings, the causal mask inherently encodes an order bias.
So I asked: "What if I built an LLM that uses sets instead of sequences of tokens?" SplatGPT was my solution: fuse GPT-2's residual stack with the permutation invariance of Set Transformers (Lee et al., 2019). This hybrid approach, however, demanded a shift in the prediction task itself: the fused model no longer predicts the "next token" but instead predicts the entire collection of abilities likely to be in the set, all at once. This naturally frames the problem as multilabel classification: for every possible "ability token" (more on this later), the model predicts the probability of that token being a member of the target set. Adopting this set-based, multilabel paradigm changes the training process, the loss function, the data handling, and everything else downstream.
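To preview what this framing looks like in practice, here is a minimal sketch of encoding a target set as independent binary labels and scoring it. The tiny vocabulary, the bucket-style token names (introduced properly in the next section), and the use of binary cross-entropy, the textbook multilabel loss, are illustrative assumptions; the actual training setup is covered in Part 2B.

```python
# A toy sketch of the multilabel, set-completion framing (not the real vocabulary).
import torch
import torch.nn.functional as F

VOCAB = ["ISM_12_14_AP", "QSJ_3_5_AP", "SSU_6_8_AP", "<NULL>"]  # illustrative token vocabulary
token_to_id = {tok: i for i, tok in enumerate(VOCAB)}

def to_multi_hot(target_tokens: list[str]) -> torch.Tensor:
    """Encode the full target set as one independent 0/1 label per vocabulary token."""
    y = torch.zeros(len(VOCAB))
    y[[token_to_id[t] for t in target_tokens]] = 1.0
    return y

# The model emits one logit per vocabulary token; sigmoid + binary cross-entropy
# treats membership of each token in the target set as its own prediction.
logits = torch.randn(len(VOCAB))                         # stand-in for model output
target = to_multi_hot(["ISM_12_14_AP", "QSJ_3_5_AP"])
loss = F.binary_cross_entropy_with_logits(logits, target)
```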
Build Space and Token Space
Predicting probabilities for "ability tokens", however, buries the lede a little bit. How exactly do we define these tokens to accurately capture Splatoon's gear mechanics? A naive approach might be to create tokens for each ability and the type of slot it occupies (main or sub). But this runs into a significant issue: within the game's design, three "sub" slots (3 AP each, totaling 9 AP) have nearly the same internal point value as a single "main" slot (10 AP). Indeed, the community often sees these quantities as essentially equivalent for build purposes. Listing mains versus subs separately would throw away this crucial fungibility of AP. Additionally, enforcing slot limits directly in the model would be a heavy ask, adding needless complexity.
I solve this with two separate spaces:
| Space | What lives here | Who uses it |
| --- | --- | --- |
| Build Space | The literal gear UI: 3 mains + 9 subs, slot rules, brand quirks. | Input from / output to the player |
| Token Space | Discretised AP buckets plus one token for each main-only ability. | The model's embeddings and logits |
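Before walking through the example below, here's a minimal Python sketch of the "melting-down" direction, from Build Space to Token Space. The `melt_down` helper and the bucket boundaries are illustrative assumptions rather than the real discretisation, and main-only abilities (which get a single dedicated token) are omitted for brevity.

```python
# Minimal sketch: collapsing a Build Space description into Token Space.
# Bucket boundaries below are hypothetical; the real discretisation may differ.
from collections import defaultdict

MAIN_AP, SUB_AP = 10, 3
AP_BUCKETS = [(3, 5), (6, 8), (9, 11), (12, 14), (15, 20), (21, 57)]  # assumed boundaries

def melt_down(build: list[tuple[str, str]]) -> list[str]:
    """build: (ability, slot_type) pairs, slot_type in {"main", "sub"}. Returns bucket tokens."""
    totals: dict[str, int] = defaultdict(int)
    for ability, slot_type in build:
        totals[ability] += MAIN_AP if slot_type == "main" else SUB_AP
    tokens = []
    for ability, ap in totals.items():
        for lo, hi in AP_BUCKETS:
            if lo <= ap <= hi:
                tokens.append(f"{ability}_{lo}_{hi}_AP")
                break
    return tokens

print(melt_down([("ISM", "main"), ("ISM", "sub"), ("QSJ", "sub")]))
# ['ISM_12_14_AP', 'QSJ_3_5_AP']
```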
Example: From Build to Tokens and Back
Consider the input build fed into the system:
- Player Input Build (Build Space):
  - Headgear Main: Ink Saver (Main) (10 AP)
  - Clothing Sub 1: Ink Saver (Main) (3 AP)
  - Clothing Sub 2: Quick Super Jump (3 AP)
  - (Other slots empty for simplicity)
- "Melting-down" to Token Space:
  - Ink Saver (Main): 13 AP (falls into the `ISM_12_14_AP` bucket, as an example)
  - Quick Super Jump: 3 AP (falls into the `QSJ_3_5_AP` bucket)
  - The model receives this as `["ISM_12_14_AP", "QSJ_3_5_AP"]`
Let's say that SplatGPT returns the exact same combination for this example. We now have to reconstruct a build in Build Space, which has two equivalent representations:
- "Reconstruction Step": Finding Valid Builds for the Predicted Tokens
- Option 1 (Mirroring Original Structure):
- 1 main slot to Ink Saver (Main) (10 AP)
- 1 sub slot to Ink Saver (Main) (3 AP)
- 1 sub slot to Quick Super Jump (3 AP)
- Result: ISM = 13 AP, QSJ = 3 AP. Both targets hit.
- Option 2 (Prioritizing Sub Slots for ISM):
- 4 sub slots to Ink Saver (Main) (12 AP)
- 1 sub slot to Quick Super Jump (3 AP)
- Result: ISM = 12 AP, QSJ = 3 AP. Both targets hit.
Both of these options are valid ways to achieve the AP targets predicted by the model in Token Space. This becomes significantly more complex once you consider a full build, requiring the reconstruction algorithm to pick one arrangement based on its own heuristics or to offer multiple choices to the user. This example also demonstrates how Token Space abstracts away the main/sub distinction altogether, allowing the model to focus entirely on total AP while reconstruction handles the combinatorial arrangement of the final predictions.
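To make that reconstruction ambiguity concrete, here is a toy enumerator, not the actual beam-search-based reconstruction, that lists main/sub assignments whose AP totals land inside the predicted buckets. The `assignments_for` helper, the simplification to a global 3-main/9-sub budget, and the omission of per-gear-piece layouts and main-only abilities are all assumptions for illustration.

```python
# A toy enumerator from Token Space AP targets (bucket ranges) back into main/sub counts.
# Only the global slot budget (3 mains, 9 subs) is enforced; the bucket ranges are the
# ones used in the worked example above.
from itertools import product

MAIN_AP, SUB_AP = 10, 3
MAX_MAINS, MAX_SUBS = 3, 9

def assignments_for(targets: dict[str, tuple[int, int]]):
    """targets: ability -> (lo, hi) AP range. Yields {ability: (mains, subs)} plans."""
    abilities = list(targets)
    per_ability = []
    for ability in abilities:
        lo, hi = targets[ability]
        per_ability.append([(m, s)
                            for m in range(MAX_MAINS + 1)
                            for s in range(MAX_SUBS + 1)
                            if lo <= m * MAIN_AP + s * SUB_AP <= hi])
    for combo in product(*per_ability):
        total_mains = sum(m for m, _ in combo)
        total_subs = sum(s for _, s in combo)
        if total_mains <= MAX_MAINS and total_subs <= MAX_SUBS:  # respect the global budget
            yield dict(zip(abilities, combo))

# Exactly the two options from the worked example come out:
# {'ISM': (0, 4), 'QSJ': (0, 1)} and {'ISM': (1, 1), 'QSJ': (0, 1)}
for plan in assignments_for({"ISM": (12, 14), "QSJ": (3, 5)}):
    print(plan)
```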
Introducing SplatGPT
At its heart, SplatGPT processes a given weapon and a (potentially incomplete) set of gear abilities, aiming to predict a complete and optimal set of ability tokens. The components below walk through the architecture; a consolidated code sketch follows the list.
- Embedding Layer
  Discretised ability tokens and weapon IDs are embedded in the same vector space. We add the weapon embedding to every ability embedding (rather than concatenating), giving each token built-in context about the weapon. This is important because abilities either enhance a weapon's strengths or shore up its weaknesses; they can only be understood within the context of the weapon they support. A nice secondary effect is that this encourages the model to learn synergistic relationships from popular, data-rich weapons and transfer them to less common, data-poor weapons (or equivalently, to observe strong signal from data-poor weapons and transfer it to noisy, data-rich ones).
- SetTransformer Layers (x N)
  The core of SplatGPT's power lies in the stack of `SetTransformerLayer` blocks. Each layer is a three-stage block designed to progressively deepen the model's understanding of the interactions within the set of abilities while preserving permutation invariance and sequence length.
  - Set Transformer: The input stream is first processed by a Set Transformer module, which uses Induced Set Attention Blocks and Set Attention Blocks followed by a Pooling by Multihead Attention operation; these internal attention mechanisms analyze all-to-all interactions between the tokens within the current set. The result is a condensed, fixed-size summary representation that captures the essence of the input as a collective. This is foundational for permutation invariance.
  - Cross-Attention: One of the key challenges of fusing Set Transformers with a GPT-like architecture is that the pooling operation in Set Transformers requires a predetermined output sequence length. This interferes with running residual connections throughout the model, which require the input and output sequences to share a fixed length.
    SplatGPT elegantly solves this by employing cross-attention. Here, the original input sequence for the `SetTransformerLayer` (before the Set Transformer module) acts as the query, while the global summary vectors produced by the Set Transformer module serve as both the key and value. In essence, each token "asks" the global summary what information is most relevant to it and updates itself based on the result. This broadcasts the rich, set-level summary back to each individual input token while maintaining the original sequence length.
    By reinjecting this global perspective, each input token becomes more aware of its role, dramatically enhancing the contextual understanding at each layer. And because attention mechanisms are inherently permutation equivariant, the later `Masked-Mean` pool can achieve full permutation invariance.
    Interpretability aside: I should note that it is difficult to understand what the later layers of the `SetTransformerLayer` stack might represent. The first layer lets the raw input tokens become aware of their relationships to all other input tokens. Perhaps the second layer lets these newly context-aware tokens account not only for their own status but also for the understanding the other tokens have of their goals. Characterizing anything past the first layer is quite challenging, and further interpretability work is needed to meaningfully characterize what the third layer (and beyond) might represent.
  - Feed-Forward Network (FFN): After the cross-attention step has enriched each token in the stream with the global set context, each token in the sequence is passed through a standard Feed-Forward Network. This is a two-layer MLP: it first expands the dimensionality of the embedding, applies the GELU (Gaussian Error Linear Unit) activation function, and then contracts the dimensionality back. This applies nonlinear transformations to each token individually, and some research suggests this is where "knowledge" resides in a model.
  The entire three-stage block is wrapped in a residual connection: the output of the block is added back to its original input. These connections not only help train deep networks; as Anthropic's research into transformer circuits demonstrates, they also allow higher-order contextual representations to flow through the model.
- Masked-Mean Pool
  Once the data flows through the full stack of `SetTransformerLayer` blocks, it's processed one last time by the Masked Mean Pooling layer. Its job is to aggregate the now highly refined and informed token embeddings into a single vector that holistically represents the entire (potentially partial) input gear set. It does this while respecting padding masks, which handle sets of abilities of varying sizes within a batch, so that padded positions do not influence the final average. This averaging is inherently permutation-invariant and provides a final summary of the set's characteristics.
- Output Head
  This final vector is fed into the Output Head, a simple linear projection that transforms the vector's dimensionality to match the total number of unique tokens in SplatGPT's vocabulary. The resulting scores (logits) are then passed through a sigmoid activation function, allowing the model to assign an independent probability to each ability token in the vocabulary and deliberately framing the problem as multilabel classification. This is essential, as it allows the model to predict high probabilities for multiple different ability tokens simultaneously.
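To tie the pieces together, below is a heavily simplified, self-contained PyTorch sketch of the forward pass just described. It is an approximation rather than the actual implementation: the Set Transformer stage is collapsed into a single PMA-style pooling over learned seed vectors (the real module also uses Induced Set Attention Blocks and Set Attention Blocks), and every dimension, head count, seed count, and class name here is an assumption.

```python
# Simplified sketch of SplatGPT's forward pass: embeddings with weapon context added,
# a stack of set-transformer-style layers, masked mean pooling, and a sigmoid head.
import torch
import torch.nn as nn

class SetTransformerLayerSketch(nn.Module):
    """One block: pooled set summary -> cross-attention broadcast -> FFN, with a residual."""
    def __init__(self, d_model: int, n_heads: int = 8, n_seeds: int = 4):
        super().__init__()
        self.seeds = nn.Parameter(torch.randn(n_seeds, d_model))   # learned pooling queries (PMA-style)
        self.pool_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, set_size, d_model); pad_mask: (batch, set_size), True where padded.
        seeds = self.seeds.expand(x.size(0), -1, -1)
        summary, _ = self.pool_attn(seeds, x, x, key_padding_mask=pad_mask)  # fixed-size set summary
        broadcast, _ = self.cross_attn(x, summary, summary)                  # each token queries the summary
        return x + self.ffn(broadcast)                                       # residual keeps sequence length

class SplatGPTSketch(nn.Module):
    def __init__(self, vocab_size: int, num_weapons: int, d_model: int = 256, n_layers: int = 3):
        super().__init__()
        self.ability_emb = nn.Embedding(vocab_size, d_model)
        self.weapon_emb = nn.Embedding(num_weapons, d_model)
        self.layers = nn.ModuleList([SetTransformerLayerSketch(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ability_ids, weapon_id, pad_mask):
        # Weapon context is added (not concatenated) to every ability-token embedding.
        x = self.ability_emb(ability_ids) + self.weapon_emb(weapon_id).unsqueeze(1)
        for layer in self.layers:
            x = layer(x, pad_mask)
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (x * keep).sum(1) / keep.sum(1).clamp(min=1.0)   # masked mean pool (permutation-invariant)
        return torch.sigmoid(self.head(pooled))                   # independent probability per ability token
```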
Together, these blocks let SplatGPT capture stacking curves, weapon synergy, and deep context in a single, permutation-invariant architecture.
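As a quick sanity check of that permutation-invariance claim, the snippet below, which builds on the `SplatGPTSketch` defined just above and uses hypothetical sizes and IDs, shuffles the non-padding tokens of an input set and confirms the predicted probabilities are unchanged.

```python
import torch

# Reordering the (non-padding) ability tokens should leave the output untouched.
model = SplatGPTSketch(vocab_size=200, num_weapons=140, d_model=64, n_layers=2).eval()
ability_ids = torch.tensor([[5, 17, 42, 0]])            # one set; last position is padding
pad_mask = torch.tensor([[False, False, False, True]])  # True marks padded positions
weapon_id = torch.tensor([7])

idx = torch.tensor([2, 0, 1, 3])                        # shuffle real tokens, keep padding last
with torch.no_grad():
    p1 = model(ability_ids, weapon_id, pad_mask)
    p2 = model(ability_ids[:, idx], weapon_id, pad_mask[:, idx])
print(torch.allclose(p1, p2, atol=1e-5))                # expected: True
```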
This deep dive into SplatGPT's architecture reveals a model fundamentally built to understand Splatoon 3 gear in its native, set-based context. By splitting the problem into Token Space and Build Space, we have engineered a model designed for set completion with considerations for weighted multisets (sets where elements can appear multiple times and have different weights).
In Part 2B, we'll walk through the dataset: how it was scraped and cleaned, and how some biases were removed or compensated for while others were intentionally introduced. We'll go through the training regimen and discuss results, including the fascinating `<NULL>` token that convinced me the model is reasoning deeply rather than memorizing. Stay tuned!