Lexicons
Every record on this PDS, every XRPC procedure it serves, every event on its firehose has a shape — a list of fields, their types, what's required, what's optional, what nested objects look like. The shape is defined by a lexicon: a JSON schema file with a few AT-Protocol-specific extensions, identified by an NSID.
This chapter covers what lexicons are and why they're shaped the way they are. The implementation that turns a lexicon file into runtime validators ships in a later session — by then you'll know exactly what those validators have to do.
A lexicon, by example
Here's a (lightly trimmed) version of app.bsky.feed.post, the schema that defines what a Bluesky post is:
{
  "lexicon": 1,
  "id": "app.bsky.feed.post",
  "defs": {
    "main": {
      "type": "record",
      "key": "tid",
      "record": {
        "type": "object",
        "required": ["text", "createdAt"],
        "properties": {
          "text": {
            "type": "string",
            "maxLength": 3000,
            "maxGraphemes": 300
          },
          "embed": {
            "type": "union",
            "refs": [
              "app.bsky.embed.images",
              "app.bsky.embed.video",
              "app.bsky.embed.external",
              "app.bsky.embed.record"
            ]
          },
          "reply": {
            "type": "ref",
            "ref": "app.bsky.feed.post#replyRef"
          },
          "createdAt": {
            "type": "string",
            "format": "datetime"
          }
        }
      }
    },
    "replyRef": {
      "type": "object",
      "required": ["root", "parent"],
      "properties": {
        "root": { "type": "ref", "ref": "com.atproto.repo.strongRef" },
        "parent": { "type": "ref", "ref": "com.atproto.repo.strongRef" }
      }
    }
  }
}
A few things to notice:
- The file's NSID is in the id field: app.bsky.feed.post. That's also what a record's $type field will say when it conforms to this schema.
- The top of the lexicon is the defs map. The main definition is what's named by the file's NSID. Auxiliary types (like replyRef above) get internal names and are referenced as <file-id>#<def-name>.
- The type: "record" def has a key field declaring how rkeys are generated ("tid" = TID-shaped, "literal:self" = always the string "self", "any" = caller-chosen, "nsid" = an NSID).
- The schema language is JSON Schema with AT-Protocol-specific types (cid-link, blob, union discriminated by $type, ref for cross-file references).
- Fields like maxGraphemes and format: datetime are constraints the validator enforces. JSON Schema's own format is purely informational; here it's normative.
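The <file-id>#<def-name> convention from the list above can be sketched as a tiny parser. This is an illustration, not this PDS's code: parseRef and ResolvedRef are hypothetical names. It assumes that a ref with no # points at the target file's main def (as com.atproto.repo.strongRef does in the example) and that a ref starting with # is local to the current file.

```typescript
interface ResolvedRef {
  nsid: string    // id of the lexicon file that holds the def
  defName: string // key inside that file's defs map
}

// Hypothetical helper: split a ref string into (file id, def name).
function parseRef(ref: string, currentNsid: string): ResolvedRef {
  const hash = ref.indexOf('#')
  // No fragment: assume the ref names the file's main definition.
  if (hash === -1) return { nsid: ref, defName: 'main' }
  // Leading '#': assume the def lives in the file doing the referencing.
  const nsid = hash === 0 ? currentNsid : ref.slice(0, hash)
  const defName = ref.slice(hash + 1)
  if (defName.length === 0) throw new Error(`empty def name in ref "${ref}"`)
  return { nsid, defName }
}

const full = parseRef('app.bsky.feed.post#replyRef', 'app.bsky.feed.like')
const bare = parseRef('com.atproto.repo.strongRef', 'app.bsky.feed.post')
const local = parseRef('#replyRef', 'app.bsky.feed.post')
```

A validator would then look up defs[defName] in the lexicon whose id is nsid.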
Why a custom schema language?
The honest answer: because plain JSON Schema is almost enough but not quite, and a small custom layer was cheaper than reimplementing the parts of JSON Schema that the AT Protocol doesn't want.
What plain JSON Schema lacks for this use case:
- CID references. A like points at a post by CID + URI. The CID is a typed value (the cid-link codec tag in CBOR, an object { $link } in JSON), and JSON Schema doesn't have a built-in for that.
- Blob references. Image attachments are { $type: "blob", ref, mimeType, size }. Same shape problem.
- Discriminated unions. AT Protocol unions are tagged: every variant carries a $type field that names its lexicon. Validators select the variant by reading $type. JSON Schema unions are untagged (oneOf/anyOf), which is solvable in pure JSON Schema but verbose.
- Graphemes and other Unicode-aware constraints. A 300-grapheme limit on text enforces "300 user-visible characters" even when the text contains complex emoji that span multiple code points. Plain JSON Schema can only constrain maxLength, which it counts in Unicode code points, and that gives the wrong answer for emoji.
- Method definitions. Lexicons describe XRPC procedures too: parameters, input bodies, output bodies, error names. JSON Schema is schema-only; AT Protocol wanted a single language for the whole API surface.
So lexicons are JSON Schema + five-or-so extensions. A validator written against one schema language can do everything the protocol needs.
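To make the grapheme point concrete, here is a small sketch that measures one family emoji four different ways. It uses the standard Intl.Segmenter API; countGraphemes is a hypothetical helper name, and a real validator may use a different segmentation library.

```typescript
// Intl.Segmenter is built into modern Node and browsers. The `any` cast only
// sidesteps older TypeScript lib typings that don't declare it.
const segmenter = new (Intl as any).Segmenter('en', { granularity: 'grapheme' })

// Hypothetical helper: count user-perceived characters.
function countGraphemes(s: string): number {
  return Array.from(segmenter.segment(s)).length
}

// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy: renders as one family emoji.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'

const utf16Units = family.length                          // 11
const codePoints = Array.from(family).length              // 7
const utf8Bytes = new TextEncoder().encode(family).length // 25
const graphemes = countGraphemes(family)                  // 1
```

Only the last number matches what a person would call the length of that string, which is why the post schema carries maxGraphemes alongside maxLength.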
NSIDs
A Namespaced Identifier is a reverse-DNS dotted name:
com.atproto.repo.createRecord, app.bsky.feed.post, dev.acme.notes.note.
The leading labels are a domain name written in reverse order (com.atproto
is atproto.com) and the final label names the definition; the namespace
conceptually belongs to whoever controls that domain. There's no central
registry of lexicons.
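A minimal sketch of pulling an NSID apart, assuming the usual reading that everything except the final label is a reversed domain name. parseNsid is a hypothetical helper, and the shape check is deliberately looser than the full NSID syntax rules.

```typescript
interface ParsedNsid {
  authority: string // domain in normal DNS order, e.g. "repo.atproto.com"
  name: string      // final label, e.g. "createRecord"
}

// Simplified shape check (an assumption of this sketch): at least three
// dot-separated labels made of letters, digits, and hyphens.
const NSID_SHAPE = /^[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+){2,}$/

function parseNsid(nsid: string): ParsedNsid {
  if (!NSID_SHAPE.test(nsid)) throw new Error(`not an NSID: ${nsid}`)
  const labels = nsid.split('.')
  const name = labels.pop() ?? ''              // last label names the definition
  const authority = labels.reverse().join('.') // remaining labels, un-reversed
  return { authority, name }
}

const method = parseNsid('com.atproto.repo.createRecord')
const record = parseNsid('dev.acme.notes.note')
```

Here method.authority is "repo.atproto.com" and record.authority is "notes.acme.dev", which is the domain whose owner gets to define names in that namespace.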
NSIDs serve two distinct roles:
- Collection names inside a repository (e.g. app.bsky.feed.post).
- XRPC procedure names on the wire (e.g. com.atproto.repo.createRecord).
The two uses are kept apart by naming convention rather than by namespace:
record NSIDs end in a noun (app.bsky.feed.post), while method NSIDs end in
a verb phrase (com.atproto.repo.createRecord, app.bsky.feed.getTimeline).
There's nothing preventing collisions, since both are just different uses
of the same kind of string, but the conventions hold up well enough that
clients don't confuse them.
📖 What if I want to invent a new lexicon? Pick an NSID under a domain you control (dev.acme.cool.thing works if you own acme.dev), write the JSON, and publish it somewhere stable so other parties can read it. A PDS that doesn't recognize your NSID will store records under it anyway (it doesn't care), but an AppView that doesn't know your schema can't display them.
The four kinds of definition
A lexicon's main def is typically one of:
- record: a record type that lives in a repo at at://<did>/<nsid>/<rkey>. Has a key to declare rkey shape and a record field (the object schema).
- query: a GET XRPC method. Has parameters (query string args), output (response body schema), and errors (named error variants).
- procedure: a POST XRPC method. Like query plus an input (request body schema, usually JSON).
- subscription: a WebSocket XRPC method, used for event streams such as the firehose. Has parameters and a message schema (the union of all possible event types).
defs other than main are auxiliary — object shapes, unions, etc. —
referenced from main (and from other lexicons) via <id>#<defname>.
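The mapping from definition kind to transport described above fits in a few lines. This is a sketch with hypothetical names (MainDefType, Transport, transportFor), not part of any real dispatcher.

```typescript
type MainDefType = 'record' | 'query' | 'procedure' | 'subscription'

type Transport =
  | { kind: 'repo' }                         // records are stored, not called
  | { kind: 'http'; method: 'GET' | 'POST' } // queries and procedures
  | { kind: 'websocket' }                    // subscriptions

// Hypothetical helper: how is a definition of this kind reached?
function transportFor(type: MainDefType): Transport {
  switch (type) {
    case 'record':
      return { kind: 'repo' }
    case 'query':
      return { kind: 'http', method: 'GET' }
    case 'procedure':
      return { kind: 'http', method: 'POST' }
    case 'subscription':
      return { kind: 'websocket' }
  }
}

const forQuery = transportFor('query')
const forProcedure = transportFor('procedure')
```

Because the switch is exhaustive over MainDefType, adding a fifth kind to the union would make the compiler flag this function until it handles the new case.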
Type vocabulary
The complete list of primitive types a lexicon can use:
| Type | What it is | Notes |
|---|---|---|
| null | always null | |
| boolean | true / false | |
| integer | signed integer | optional minimum/maximum |
| string | UTF-8 string | optional maxLength, maxGraphemes, format |
| bytes | raw bytes | optional minLength/maxLength |
| cid-link | a CID | encodes as { $link } in JSON, tag 42 in CBOR |
| blob | a blob ref | encodes as { $type: "blob", ref, mimeType, size } |
| array | typed array | has items (any schema), optional bounds |
| object | nested object | has properties and required |
| params | URL query params | only valid inside a query/procedure |
| token | a sentinel value | used to declare named values (e.g. for enums) |
| ref | reference to another def | ref: "app.bsky.feed.post#replyRef" |
| union | tagged union | refs: ["a", "b"]; discriminated by $type |
| unknown | anything | escape hatch; validators just check it's present |
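The union row is the one that most benefits from code. Below is a sketch of selecting a variant by its $type tag; selectVariant is a hypothetical name. It returns undefined for a tag that isn't in refs, so the caller can decide whether that means rejection or pass-through.

```typescript
type Tagged = { $type: string; [field: string]: unknown }

// Hypothetical helper: pick the union variant named by the value's $type.
function selectVariant(
  value: unknown,
  refs: readonly string[],
): { ref: string; value: Tagged } | undefined {
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    throw new Error('union value must be an object')
  }
  const tag = (value as Record<string, unknown>)['$type']
  if (typeof tag !== 'string') {
    throw new Error('union value is missing a string $type')
  }
  // Unknown tag: not an error here, the caller applies its own policy.
  if (refs.indexOf(tag) === -1) return undefined
  return { ref: tag, value: value as Tagged }
}

const embedRefs = ['app.bsky.embed.images', 'app.bsky.embed.external']
const known = selectVariant({ $type: 'app.bsky.embed.images', images: [] }, embedRefs)
const unknownTag = selectVariant({ $type: 'dev.acme.embed.thing' }, embedRefs)
```

Once a variant is selected, the validator would resolve ref to its definition and validate the value against that schema.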
format on a string can be datetime (ISO 8601), uri, at-uri,
did, handle, at-identifier (handle or DID), nsid, cid,
language, tid, record-key.
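Since format is normative here, a validator needs a checker per format name. The sketch below covers three of them with simplified patterns; FORMAT_CHECKS and checkFormat are hypothetical names, and the regexes are approximations of the real syntax rules, good enough to show the structure.

```typescript
// Simplified per-format checks. These are approximations, not full grammars.
const FORMAT_CHECKS: Record<string, (s: string) => boolean> = {
  // did:<method>:<method-specific-id>
  did: (s) => /^did:[a-z]+:[a-zA-Z0-9._:%-]*[a-zA-Z0-9._-]$/.test(s),
  // 13 characters of sortable base32
  tid: (s) => /^[234567abcdefghij][234567abcdefghijklmnopqrstuvwxyz]{12}$/.test(s),
  // Full date-time with an explicit timezone
  datetime: (s) =>
    /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/.test(s) &&
    !Number.isNaN(Date.parse(s)),
}

// Hypothetical helper: true when the value satisfies the named format.
function checkFormat(format: string, value: string): boolean {
  const check = FORMAT_CHECKS[format]
  // Formats this sketch doesn't implement are accepted rather than rejected.
  return check ? check(value) : true
}
```

A string schema with a format field would call checkFormat after its type and length checks pass.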
How XRPC fits in
A procedure or query definition is the spec for a single XRPC endpoint.
Sketch (from com.atproto.server.createSession):
{
  "lexicon": 1,
  "id": "com.atproto.server.createSession",
  "defs": {
    "main": {
      "type": "procedure",
      "input": {
        "encoding": "application/json",
        "schema": {
          "type": "object",
          "required": ["identifier", "password"],
          "properties": {
            "identifier": { "type": "string" },
            "password": { "type": "string" }
          }
        }
      },
      "output": {
        "encoding": "application/json",
        "schema": {
          "type": "object",
          "required": ["accessJwt", "refreshJwt", "handle", "did"],
          "properties": {
            "accessJwt": { "type": "string" },
            "refreshJwt": { "type": "string" },
            "handle": { "type": "string", "format": "handle" },
            "did": { "type": "string", "format": "did" }
          }
        }
      },
      "errors": [
        { "name": "AccountTakedown" },
        { "name": "AuthFactorTokenRequired" }
      ]
    }
  }
}
When the lexicon-driven validator is wired up (a later chapter), the XRPC dispatcher will:
- Look up the lexicon for the requested NSID.
- Validate the input body (or query string) against the input.schema.
- Call the handler with the typed input.
- Validate the handler's return value against output.schema.
- Translate thrown errors to the named variants in errors.
The handler doesn't have to think about validation; the lexicon does it on both sides.
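The last step, translating thrown errors into the lexicon's named variants, might look like the sketch below. HandlerError and toErrorResponse are hypothetical names, and the status codes are assumptions; the { error, message } body is the usual XRPC error shape.

```typescript
// Hypothetical error type a handler throws to signal a named failure.
class HandlerError extends Error {
  constructor(public readonly errorName: string, message: string) {
    super(message)
  }
}

interface XrpcErrorResponse {
  status: number
  body: { error: string; message: string }
}

// Map a thrown value onto the error names the lexicon declares.
function toErrorResponse(
  err: unknown,
  declared: readonly { name: string }[],
): XrpcErrorResponse {
  if (err instanceof HandlerError) {
    const name = err.errorName
    if (declared.some((d) => d.name === name)) {
      // Declared variant: surface it to the client by name (400 is assumed).
      return { status: 400, body: { error: name, message: err.message } }
    }
  }
  // Anything undeclared is hidden behind a generic server error.
  return {
    status: 500,
    body: { error: 'InternalServerError', message: 'Internal Server Error' },
  }
}

const declaredErrors = [{ name: 'AccountTakedown' }, { name: 'AuthFactorTokenRequired' }]
const named = toErrorResponse(new HandlerError('AccountTakedown', 'account is taken down'), declaredErrors)
const hidden = toErrorResponse(new Error('db connection lost'), declaredErrors)
```

Keeping undeclared errors generic means a handler can't accidentally leak internal details through an error name the lexicon never promised.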
Validation: lenient on read, strict on write
The convention every implementation follows:
- On write (a client sending a record or procedure input): validate strictly. Reject unknown fields, malformed types, missing required fields. Return a 400 with a clear error.
- On read (a client receiving an output, or a relay receiving a firehose event): validate leniently. Unknown fields pass through. Future protocol extensions add fields; old clients shouldn't crash on them.
This asymmetry is what lets the protocol evolve. New fields can be added to a record type without coordinating across every existing reader.
📖 The same principle in action: when the AT Protocol added the langs field to app.bsky.feed.post, every existing post became implicitly "no langs declared." Readers that didn't know about langs just skipped it. No flag-day migration.
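The write/read asymmetry comes down to a single branch in the object validator. The sketch below uses a deliberately tiny schema shape (ObjectShape) and hypothetical names throughout; it returns a list of problems instead of throwing so both modes are easy to compare.

```typescript
type Mode = 'strict' | 'lenient'

interface ObjectShape {
  required: string[]
  properties: Record<string, { type: 'string' | 'integer' | 'boolean' }>
}

const hasOwn = (o: object, k: string): boolean =>
  Object.prototype.hasOwnProperty.call(o, k)

// Hypothetical validator: returns human-readable problems, empty when valid.
function validateObject(
  shape: ObjectShape,
  value: Record<string, unknown>,
  mode: Mode,
): string[] {
  const problems: string[] = []
  for (const key of shape.required) {
    if (!hasOwn(value, key)) problems.push(`missing required field: ${key}`)
  }
  for (const key of Object.keys(value)) {
    const prop = hasOwn(shape.properties, key) ? shape.properties[key] : undefined
    if (prop === undefined) {
      // The only place the two modes differ: unknown fields.
      if (mode === 'strict') problems.push(`unknown field: ${key}`)
      continue
    }
    const v = value[key]
    const ok = prop.type === 'integer' ? Number.isInteger(v) : typeof v === prop.type
    if (!ok) problems.push(`field ${key} should be ${prop.type}`)
  }
  return problems
}

const oldPostShape: ObjectShape = {
  required: ['text'],
  properties: { text: { type: 'string' } },
}
const newerPost = { text: 'hello', langs: ['en'] }
const onWrite = validateObject(oldPostShape, newerPost, 'strict')
const onRead = validateObject(oldPostShape, newerPost, 'lenient')
```

An old reader that only knows text sees no problems with the newer post in lenient mode, while strict mode reports the unknown langs field.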
Bundled vs resolved
Two strategies for getting lexicons into a server:
- Bundled at build time. The server ships with copies of every lexicon it understands, baked into the binary. New lexicons require a redeploy. This is what this PDS will do: we vendor the com.atproto.* and app.bsky.* lexicons we serve and validate against the bundled copies.
- Resolved at runtime. Given an NSID, the server fetches the lexicon over the network (locating it via the NSID's authority domain). The server handles whatever it learns about. This is more flexible but adds a network dependency to validation, plus a caching strategy, plus a trust model for whose lexicon is authoritative.
The reference Bluesky PDS bundles. We bundle. Production self-hosters who
want to serve a new lexicon (their own dev.acme.*) add it to the bundle
and redeploy.
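A bundled catalog can be as simple as a map keyed by NSID that is filled once at startup. The sketch below is illustrative only: LexiconCatalog and its methods are hypothetical names, and how the bundled documents get into the build (JSON imports, a generated module) is left open.

```typescript
interface LexiconDoc {
  lexicon: 1
  id: string
  defs: Record<string, unknown>
}

// Hypothetical in-memory catalog over lexicons shipped with the server.
class LexiconCatalog {
  private readonly docs = new Map<string, LexiconDoc>()

  constructor(bundled: readonly LexiconDoc[]) {
    for (const doc of bundled) {
      // Two files claiming the same id is a packaging bug; fail at startup.
      if (this.docs.has(doc.id)) throw new Error(`duplicate lexicon: ${doc.id}`)
      this.docs.set(doc.id, doc)
    }
  }

  has(nsid: string): boolean {
    return this.docs.has(nsid)
  }

  // Throws for NSIDs outside the bundle: there is no network fallback.
  load(nsid: string): LexiconDoc {
    const doc = this.docs.get(nsid)
    if (!doc) throw new Error(`lexicon not bundled: ${nsid}`)
    return doc
  }
}

const catalog = new LexiconCatalog([
  { lexicon: 1, id: 'app.bsky.feed.post', defs: { main: { type: 'record' } } },
  { lexicon: 1, id: 'com.atproto.repo.strongRef', defs: { main: { type: 'object' } } },
])
```

Serving a new dev.acme.* lexicon then means adding one more entry to the array passed to the constructor and redeploying.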
Codegen vs runtime validation
For each lexicon, an implementation can either:
- Generate TypeScript types at build time so handlers get typed input and output. The trade-off: stale types if lexicons change at runtime.
- Validate at runtime using a generic schema interpreter. The trade-off: handler input/output is unknown until cast.
We pick runtime here because the docs site renders the bundled lexicons live, and codegen-based docs would require recompilation every time we edit a schema for the chapter. In production you'd typically codegen.
The implementation that lands in a later session will look roughly like:
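// Compile both validators once, when the endpoint is registered.
const lexicon = loadLexicon('com.atproto.server.createSession')
const inputValidator = compileSchema(lexicon.defs.main.input.schema)
const outputValidator = compileSchema(lexicon.defs.main.output.schema)

// In the dispatcher, per request:
const input = inputValidator(rawBodyJson)  // throws if the body is off-schema
const output = await handler({ input /* , ...rest of the handler context */ })
return outputValidator(output)             // throws if the handler's result is off-schema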
Where compileSchema turns a lexicon schema into a (value) => value
function that throws on validation failure. (We'll likely build it as a
small interpreter rather than codegenning a validator — easier to read,
and performance isn't a bottleneck at PDS scale.)
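To give a feel for the interpreter approach, here is a minimal sketch of a compileSchema that handles only string, integer, and object. It is not the implementation this series ships: the Schema type is a cut-down assumption, and it skips refs, unions, formats, maxGraphemes, and everything else in the type table. maxLength is counted in UTF-8 bytes here.

```typescript
type Schema =
  | { type: 'string'; maxLength?: number }
  | { type: 'integer'; minimum?: number; maximum?: number }
  | { type: 'object'; required?: string[]; properties: Record<string, Schema> }

type Validator = (value: unknown) => unknown

// Walk the schema once and return a closure that validates values against it.
function compileSchema(schema: Schema, path = '$'): Validator {
  const fail = (msg: string): never => {
    throw new Error(`${path}: ${msg}`)
  }
  switch (schema.type) {
    case 'string': {
      const { maxLength } = schema
      return (v) => {
        if (typeof v !== 'string') return fail('expected string')
        const bytes = new TextEncoder().encode(v).length
        if (maxLength !== undefined && bytes > maxLength) return fail('string too long')
        return v
      }
    }
    case 'integer': {
      const { minimum, maximum } = schema
      return (v) => {
        if (typeof v !== 'number' || !Number.isInteger(v)) return fail('expected integer')
        if (minimum !== undefined && v < minimum) return fail(`below minimum ${minimum}`)
        if (maximum !== undefined && v > maximum) return fail(`above maximum ${maximum}`)
        return v
      }
    }
    case 'object': {
      const required = schema.required ?? []
      const props = schema.properties
      // Compile child validators up front so each call only runs checks.
      const fields = Object.keys(props).map((key) => ({
        key,
        check: compileSchema(props[key] as Schema, `${path}.${key}`),
      }))
      return (v) => {
        if (typeof v !== 'object' || v === null || Array.isArray(v)) return fail('expected object')
        const obj = v as Record<string, unknown>
        for (const key of required) {
          if (!(key in obj)) fail(`missing required field "${key}"`)
        }
        for (const { key, check } of fields) {
          if (key in obj) check(obj[key])
        }
        return v
      }
    }
  }
}

const validateSessionInput = compileSchema({
  type: 'object',
  required: ['identifier', 'password'],
  properties: { identifier: { type: 'string' }, password: { type: 'string' } },
})
```

Calling validateSessionInput({ identifier: 'alice.test', password: 'hunter2' }) returns the value unchanged, while omitting password throws an error whose message starts with the path of the failing node.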
What's still missing
🚧 The validator now runs alongside every XRPC request via src/pds/xrpc/lexicon-bridge.ts. On each call we look up the lexicon for the NSID and validate input, params, and output against its schemas. Today's mode is observe-only: mismatches log [lexicon:input] <nsid>: <reason> but don't fail the request; handlers still own validation through their hand-rolled zod schemas. Setting LEXICON_STRICT=true in the environment flips the observer to a hard rejection, which is the next step once the log goes quiet and the stub lexicons are fully transcribed.
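The observe-only versus strict switch can be pictured as the sketch below. It is not the contents of lexicon-bridge.ts: reportMismatch is a hypothetical name, the params and output log tags are extrapolated from the input tag quoted above, and the strict flag is passed in by the caller (which would derive it from LEXICON_STRICT).

```typescript
type Phase = 'input' | 'params' | 'output'

// Hypothetical helper: log a mismatch, or reject the request in strict mode.
function reportMismatch(
  phase: Phase,
  nsid: string,
  reason: string,
  strict: boolean,
  log: (line: string) => void,
): void {
  const line = `[lexicon:${phase}] ${nsid}: ${reason}`
  // Strict mode: the mismatch becomes a failure for the caller to turn into a 400.
  if (strict) throw new Error(line)
  // Observe-only mode: record it and let the request proceed.
  log(line)
}

const seen: string[] = []
reportMismatch(
  'input',
  'com.atproto.server.createSession',
  'missing required field "password"',
  false,
  (line) => seen.push(line),
)
```

After the call above, seen holds one line in the same format as the log message described in the note, and the request would have continued on to the handler's own zod validation.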
Today: of the 36 bundled lexicons, only a subset is transcribed in full:
six app.bsky.* schemas (app.bsky.feed.post, app.bsky.actor.profile,
app.bsky.richtext.facet, and all three app.bsky.embed.*) plus a handful of
com.atproto.* defs (com.atproto.server.createAccount / createSession and
com.atproto.repo.createRecord / getRecord / strongRef). The rest are
stubs marked "TODO: full schema in a future session.", which is enough that
refs into them resolve but not enough to actually constrain anything.
Try it
pnpm tsx -e "import('./src/pds/lexicon/selfTest').then(m => m.runLexiconSelfTest())"
That loads the bundled catalog, compiles app.bsky.feed.post's schema,
and runs four cases through the validator (valid post; missing
text; over maxGraphemes; a 1-grapheme family-emoji ZWJ sequence).
It prints all self-tests passed when the runtime is healthy.
Exercises
- Pick a lexicon you've never seen (browse bluesky-social/atproto/lexicons/). Identify the main def's type, list its required fields, and decide what an invalid input to that lexicon would look like.
- Why is app.bsky.feed.post's text constrained on graphemes and not code points? Construct a 12-character string that's exactly 1 grapheme.
- The lexicon for com.atproto.repo.uploadBlob lets the client send any mime type. What stops a malicious client from uploading a gigabyte of application/octet-stream? Where does that constraint live?
- Tagged unions identify variants by $type. What happens to a record whose embed has $type: "app.bsky.embed.images.v2" (which doesn't exist yet) when the lexicon for v2 ships?
Up next
Chapter 10 — XRPC walks the HTTP dispatcher that turns
incoming requests into handler calls. A later session swaps each
handler's hand-written zod schema for a lookup into the lexicon
catalog this chapter just built.