Performance work — Johnny Castaway PS1

A labor of love by Hunter Davis. This page is the running summary of perf work on the PS1 port at v0.8.12-ps1: where the bottleneck is, what got measured, which experiments stayed in the build, and which got reverted. The full per-experiment ledger lives in the source tree; the link is at the bottom. The retrospective on how the matrix moved from the compact baseline to the current battle card — From 87 to 99.5: the post-validation performance loop — is in the Lab. If you paid for this, you were cheated. Open source and free.

On this page

The constraint
What was measured
Experiments that didn’t work
Experiments that did
Where it sits at v0.8.12-ps1
Scene Battle Card
Non-goals
Related pages
View source on GitHub

The constraint

“Performance” on a PS1 means a different shape of problem than performance on anything modern.

The MIPS R3000A core runs at 33.8688 MHz with no FPU. The GPU is fixed-function — sprites, primitives, an ordering table, no shaders. Audio is a separate processor with its own RAM. The CD is a 2x drive: 300 KB/s sustained, 150 ms cold seek. There is no memory bandwidth budget worth talking about for a 16-color screensaver port; the bandwidth budget is the CD’s, and it gets spent in seek latency, not transfer time.
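To make the seek-versus-transfer point concrete, here is a rough cost model built only from the figures quoted above. It is a sketch under stated assumptions: "300 KB/s" is read as 300 KiB/s, every read is charged one cold seek, and the 20 KB example size is borrowed from the stream-window default mentioned later on this page. The function names are hypothetical and do not come from the source tree.

```python
# Rough CD cost model using the figures quoted in the text.
CD_BYTES_PER_SEC = 300 * 1024   # assumption: "300 KB/s" interpreted as KiB/s
COLD_SEEK_MS = 150.0            # cold seek figure from the text
VBLANK_MS = 1000.0 / 60.0       # nominal 60 Hz VBlank period


def transfer_ms(num_bytes: int) -> float:
    """Pure transfer time for num_bytes at the sustained rate."""
    return num_bytes * 1000.0 / CD_BYTES_PER_SEC


def read_cost_vblanks(num_bytes: int, seeks: int = 1) -> float:
    """Total cost of one read (seek plus transfer), expressed in VBlanks."""
    total_ms = seeks * COLD_SEEK_MS + transfer_ms(num_bytes)
    return total_ms / VBLANK_MS


example_bytes = 20 * 1024  # illustrative read size
print(f"transfer only: {transfer_ms(example_bytes):.1f} ms")
print(f"seek + transfer: {read_cost_vblanks(example_bytes):.1f} VBlanks")
print(f"seek alone: {read_cost_vblanks(0):.1f} VBlanks")
```

Under these assumptions the seek costs more than twice as much as the transfer, which is why the budget is framed as a latency budget rather than a bandwidth budget.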

The frame budget at 60 Hz is about 16.7 ms. Johnny Castaway is a 1992 VGA screensaver: at the source level, foreground content changes roughly four times per second. The PS1 still has to draw a frame at 60 Hz, but it can hold the same content frame after frame for many VBlanks at a stretch. The VBlank cadence is the rendering loop’s heartbeat; the interesting timing is which VBlanks have actual work in them and which are held idle.
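The arithmetic behind that budget is small enough to write down. This sketch assumes a nominal 60 Hz refresh and treats "roughly four times per second" as exactly four; the variable names are illustrative only.

```python
# Frame-budget arithmetic from the figures in the text.
CPU_HZ = 33_868_800            # 33.8688 MHz R3000A clock
REFRESH_HZ = 60                # assumption: nominal 60 Hz
SOURCE_CHANGES_PER_SEC = 4     # assumption: "roughly four" taken as exactly four

frame_ms = 1000.0 / REFRESH_HZ
cycles_per_frame = CPU_HZ // REFRESH_HZ
vblanks_per_change = REFRESH_HZ // SOURCE_CHANGES_PER_SEC
# If one VBlank per change carries the new content, the rest are held.
held_vblanks_per_change = vblanks_per_change - 1

print(f"frame budget: {frame_ms:.2f} ms, {cycles_per_frame} CPU cycles")
print(f"VBlanks per content change: {vblanks_per_change}")
print(f"held VBlanks per content change: {held_vblanks_per_change}")
```

With these inputs each content change is followed by about fourteen held VBlanks, which is the slack the rest of this page is about spending well.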

That asymmetry is what shapes the runtime. A held VBlank is free CPU and free CD bus. The whole optimization story is about scheduling work (CD reads, RAM tile composition, dirty-row uploads) into held VBlanks before the next “real” frame arrives. When that scheduling fails, the active frame’s VBlank gets stretched and loop_vb, the measured VBlank count for the scene loop, goes up.

The frame budget for a screensaver is more forgiving than for a game. Nothing the user does requires sub-frame latency. But the project’s acceptance bar is pixel-perfect playback against host-captured reference frames, which means the runtime cannot drop frames or compress timing files to “feel faster”; it has to render every captured entry on the captured beat. Slack exists in the held intervals; it does not exist in the entries.

What was measured

The perf instrumentation lives in src/ps1_perf.c. It is gated so it adds zero cost when off.

Three signal sources:

TTY printf at scene-start and scene-end with structured JCPERF / JCPERF2 records. Levels: OFF, SUMMARY, DETAIL, DEBUG. Only the on-demand records cross the TTY surface; per-frame text is forbidden in hot paths because it perturbs timing.
ps1_perf module counters for VBlank-level metrics: loop_vb, target_vb, overrun_vb, blocking_vb, prefetch_overrun_vb, due_misses, restore_bytes, upload_bytes, dirty_rows, upload_rects, loop_reads. Each scene-end record dumps the steady-state values for that run.
Regtest harness frame timing. The headless DuckStation in scripts/run-regtest.sh boots the disc image, captures PNGs, and ingests the TTY records into per-run summary JSON files under scratch/ps1-perf-iterate/<runId>/.

Every experiment goes through the same gate: scripts/ps1-perf-iterate.sh runs the case, compares it to a baseline summary.json, and either promotes (if a key metric improved without a material regression in loop_vb / blocking_vb / prefetch_overrun_vb / scene identity) or rejects with a recorded failure reason.
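The promote-or-reject rule can be sketched in a few lines. This is not the logic of scripts/ps1-perf-iterate.sh, which is not reproduced on this page; the dictionary layout, the `tolerance` parameter, the `scene` key, and the function name are all assumptions made for illustration. Only the guarded metric names and the baseline values come from the text.

```python
from typing import Dict, Optional, Tuple

# Metrics the gate protects against material regression (named in the text).
GUARDED_METRICS = ("loop_vb", "blocking_vb", "prefetch_overrun_vb")


def gate(baseline: Dict, candidate: Dict, key_metric: str,
         tolerance: int = 0) -> Tuple[bool, Optional[str]]:
    """Return (promoted, failure_reason). Lower is better for every metric."""
    if candidate["scene"] != baseline["scene"]:
        return False, "scene identity changed"
    if candidate[key_metric] >= baseline[key_metric]:
        return False, f"{key_metric} did not improve"
    for name in GUARDED_METRICS:
        if name != key_metric and candidate[name] > baseline[name] + tolerance:
            return False, f"{name} regressed: {baseline[name]} -> {candidate[name]}"
    return True, None


# Baseline values taken from the fishing1 high-tide record on this page.
baseline = {"scene": "fishing1-high", "loop_vb": 1068, "blocking_vb": 2,
            "prefetch_overrun_vb": 2, "upload_rects": 456}
# Hypothetical candidate: fewer upload rects, but more visible blocking.
candidate = dict(baseline, upload_rects=440, blocking_vb=5)

print(gate(baseline, candidate, "upload_rects"))
```

The hypothetical candidate improves its key metric and is still rejected, which mirrors how most rows in the experiment log end.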

The full experiment log is at docs/ps1/performance-experiment-log.md. At the time of writing it contains 600+ experiment rows going back to 2026-04-25. Most of them failed.

The full scene/tide battle card is docs/ps1/performance-scene-matrix.csv and is rendered as the live, sortable, color-coded battle card at /perf/. It is not the human scene-promotion ledger at /scenes/; the two ledgers stay separate on purpose — different bars, different cadences, different failure modes.

The current compiler-flag sweep is tracked in docs/ps1/performance-o2-audit.md and its machine-readable performance-o2-audit.csv. That report is regenerated from build-ps1/compile_commands.json and build-ps1/jcreborn.map before each -O2 probe.

The current pack-time graphics preprocessing target sheet is docs/ps1/performance-preprocess-opportunities.md and its machine-readable performance-preprocess-opportunities.csv. It ranks today’s FG2/FGP3 packs for selective upload-ready or cleanup metadata work without changing the runtime baseline.

The per-pack detail analyzer scripts/analyze-fg2-preprocess-plans.py now parses both FGP2 and FGP3 temporal-residual payloads. Its VISITOR3 output splits cap-hit frames from saving-heavy frames, which keeps the next upload-ready experiment selective instead of a whole-pack conversion. The current VISITOR3 frame sheet is docs/ps1/performance-preprocess-visitor3-hotspots.csv. The current default VISITOR3 high-tide selective plan is still too large for a same-footprint append: it models 5730024 selected upload bytes saved, but the upload-ready payload plus rect metadata needs 2111224 bytes against only 970076 bytes of padded zero-tail slack. The analyzer now emits the same-footprint budgeted target too: 78 / 92 default-selected frames fit in 968904 payload+rect bytes, leave 1172 bytes of slack, and retain 4232112 modeled upload bytes saved. The analyzer now also reports whether those x-band uploads are safe to emit from foreground data alone. For VISITOR3, 0 selected x-band bytes are fully covered by current opaque draw spans, so a raw pack-emitted upload payload would have to bake restored background pixels that are dynamic at runtime. The next probe should use a different generated data shape, explicit scheduler ownership, compression plus a safe pixel source, or a deliberate layout-moving experiment.
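The footprint argument above is plain arithmetic on the byte counts quoted in that paragraph. The sketch below only restates those numbers; the constant and function names are illustrative and are not taken from the analyzer.

```python
# Byte counts quoted in the text for the VISITOR3 high-tide pack.
SLACK_BYTES = 970_076          # padded zero-tail slack
FULL_PLAN_BYTES = 2_111_224    # default plan: payload plus rect metadata
FULL_PLAN_SAVED = 5_730_024    # modeled upload bytes saved by the default plan
BUDGETED_BYTES = 968_904       # budgeted plan: payload plus rect metadata
BUDGETED_SAVED = 4_232_112     # modeled upload bytes saved by the budgeted plan


def fits(plan_bytes: int, slack_bytes: int) -> bool:
    """A same-footprint append only works if the plan fits inside the slack."""
    return plan_bytes <= slack_bytes


overshoot = FULL_PLAN_BYTES - SLACK_BYTES
leftover = SLACK_BYTES - BUDGETED_BYTES
retained = BUDGETED_SAVED / FULL_PLAN_SAVED

print(f"default plan fits: {fits(FULL_PLAN_BYTES, SLACK_BYTES)} (over by {overshoot} bytes)")
print(f"budgeted plan fits: {fits(BUDGETED_BYTES, SLACK_BYTES)} (leaves {leftover} bytes)")
print(f"modeled savings retained: {retained:.1%}")
```

The budgeted plan keeps roughly three quarters of the modeled savings while staying inside the existing slack.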

A tempting VISITOR3 shortcut was rejected: pruning visually no-op FGP3 entries reduced active payload and high-tide visible blocking, but hidden prefetch overrun regressed from 0 to 56 high and 17 low. That confirms the next VISITOR3 route needs explicit scheduler ownership or budgeted upload-ready data, not isolated entry-count pruning. The safer pack-side empty-hold recast also found 0 current VISITOR3 high/low entries whose cleanup and draw pixel counts are both zero, so there is no cadence-preserving no-op payload to erase under the current FGP3/v4 data.

The current tooling pass, run after the -O2 sweep, also records compact baseline fingerprints in every perf summary and classifies foreground read-plan candidates by observed append-start ownership, current grouped-read capacity, and visible-CD cost class. That makes stale-baseline comparisons, no-op read groups, and tight visible-cluster candidates visible before a runtime source edit.

Those foreground read-plan candidates are now rolled up into docs/ps1/performance-read-candidate-matrix.md and its machine-readable performance-read-candidate-matrix.csv. The current report has one guarded BUILDING2 candidate, no standalone-safe rows, and keeps VISITOR3 in the scheduler-owned or closed lane. Remaining read-timing candidates should not be promoted as raw hand-authored table ranges without the same kind of slack/scheduler proof. The BUILDING6 v353 181..197 / 269..285 probe is now the concrete counterexample for direct-stage clusters: the source table crossed the PS-EXE bucket, never produced a group_hit, and left active read counts unchanged, so BUILDING6 needs generated direct-stage ownership or a pack-side data-shape change rather than another local read-group row.

Experiments that didn’t work

A representative slice of rejected experiments and why each one didn’t stick. The pattern is more useful than any individual line — almost every “obvious” idea gets discarded because the PS1 runtime has counter-intuitive cost structure.

Larger stream windows. 40 KB, 56 KB, 64 KB. Larger windows reduce CD transaction count but overrun held slack more often. The current default is 20 KB after a long sweep; everything bigger lost.
Smaller stream windows. 12 KB, 14 KB, 16 KB. Smaller windows reduce per-refill overrun but starve due frames: due_misses rises and blocking_vb follows. The knee is sharp; a change of a single sector in either direction matters.
Disabling stage1 isolation. Booting with no-stage1 to test whether stage-copy overhead was a real cost. The headless harness exited 137 before JCPERF2 could record anything; the test was structurally inconclusive. Kept staging on.
Partial tail reads when a staged frame straddles the window end. Sounded right on paper. In practice, smaller tail reads multiply CD transaction count and due_misses rises faster than the byte savings help. Rejected.
Compose-before-VSync sequencing. Move the FG2 RAM composition before the VBlank wait so CPU work overlaps with previous-frame scanout, then upload after VBlank. Reduced prefetch_overrun_vb but stole held-prefetch time elsewhere; total loop_vb regressed by 12.
Held-loop no-slack wait skip. Looked like a clean one-VBlank overshoot fix. Regressed loop, blocking, and refill metrics simultaneously; the skipped wait was load-bearing.
Async stream-window refill. Naive async polling regressed blocking_vb badly. The CD subsystem has implicit ownership rules the synchronous path was respecting; the async path violated them. Rejected without a first-class CD-state ownership model.
-O3 on hot translation units. Less prepared RAM work in some scenes, but worse loop/blocking/refill timing overall. The optimization changed code shape enough that CD scheduling phase shifted unfavorably. Kept -O2.
Holiday overlap restamping. Seed holiday decoration into the clean backdrop and only restamp it when the current FG2 frame overlaps. Logically sound, but the active fishing1 frames overlap the Christmas decoration enough that this didn’t reduce dirty work. Pure no-op, rejected.
vprintf inline diagnostics. Adding a CD-read histogram inline with JCPERF regressed timing even with detail-gating. The act of having the code present changed binary shape enough to move scheduling phase. Reverted; histograms now live in post-processing.
FG2 sound-event table in the metadata prefix. Setup reads improved, but moving the table ahead of the payload shifted every payload by 36 bytes and badly worsened active CD phase. The pack layout is more sensitive to byte offsets than is comfortable.

The recurring lesson: changes that look like clean wins on paper often shift CD scheduling phase in ways that are not visible until the full scene runs. The headless gate is what catches this; experiments that regress loop_vb or blocking_vb against a baseline get rejected even when they “obviously” should have helped.
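One way a small byte shift can change CD behavior is by changing how many sectors an entry touches. The sketch below is a toy illustration of that mechanism, not a diagnosis of the 36-byte regression described above: it assumes 2048-byte data sectors, and the entry offset and length are hypothetical.

```python
SECTOR_BYTES = 2048  # assumption: 2048-byte data sectors


def sectors_touched(offset: int, length: int) -> int:
    """Number of sectors a byte range [offset, offset + length) overlaps."""
    if length <= 0:
        return 0
    first = offset // SECTOR_BYTES
    last = (offset + length - 1) // SECTOR_BYTES
    return last - first + 1


# Hypothetical entry that exactly fills two sectors when aligned.
entry_offset, entry_length = 4096, 4096
before = sectors_touched(entry_offset, entry_length)
after = sectors_touched(entry_offset + 36, entry_length)

print(f"aligned: {before} sectors, shifted by 36 bytes: {after} sectors")
```

An entry that previously cost two sector reads now costs three, with no change to its size. Repeated across a pack, that kind of effect is only visible when the whole scene runs.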

Experiments that did

A condensed list of changes that survived and are in the runtime today. They cluster into a few themes.

Foreground prefetch and stream window:

Stage1 staging buffer for the next FG2 entry, prefetched during held VBlanks.
Stream window default of 20 KB, reduced from earlier 32 KB after the post-pause-merge sweep showed it as the local minimum.
3 VBlank refill guard, raised from earlier 2/1 thresholds after smaller guards caused due-frame starvation.
Forward-extend stream window when a straddling entry is detected: preserve the resident suffix and append-read only the missing aligned tail. Replaces overlapping full-window refills.
Stage-copy fallthrough at 5 VBlanks: after a zero-VBlank stage copy from the resident window, immediately prefetch the following window if at least 5 held VBlanks remain. Converts idle held time into hidden CD work.
Tight-slack direct staging up to 8 KB for immediate payloads when the window refill would otherwise be skipped.
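The forward-extend idea in the list above can be sketched as offset arithmetic. This is an illustration under assumptions, not the runtime’s code: it assumes 2048-byte sectors, a window end that is already sector-aligned, and byte offsets measured from the start of the pack. The function and field names are hypothetical.

```python
from typing import Dict, Optional

SECTOR_BYTES = 2048  # assumption: 2048-byte data sectors


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def forward_extend(win_start: int, win_end: int,
                   entry_start: int, entry_len: int) -> Optional[Dict[str, int]]:
    """Plan a tail-only read for an entry that straddles the window end.

    Returns None when the entry does not straddle the resident window.
    """
    entry_end = entry_start + entry_len
    if not (win_start <= entry_start < win_end < entry_end):
        return None
    read_start = win_end  # assumed to be sector-aligned already
    return {
        "keep_from": entry_start,              # resident suffix to preserve
        "keep_len": win_end - entry_start,
        "read_start": read_start,              # append-read begins here
        "read_len": align_up(entry_end, SECTOR_BYTES) - read_start,
    }


# Hypothetical 20 KB window with an entry hanging off its end.
plan = forward_extend(win_start=0, win_end=20 * 1024,
                      entry_start=18_000, entry_len=6_000)
print(plan)
```

In this example the tail read is 4096 bytes, where an overlapping full-window refill would have read 20 KB again.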

Compositor:

PAL4 opaque-span compositor — FG2 PAL4 spans contain only visible pixels, so the per-pixel transparent-index branch was removable.
Tile-local PAL4 fast path — split each span by destination tile once instead of per-pixel.
Per-tile PAL4 row dirty marking — track which rows of which tiles changed, not just which tiles.
Base-diff FG2 pack format — the active path requires base-diff packs, which makes RAM tile compositing the only render path and lets grBeginFrame() / ClearOTagR() skip when nothing’s queued.
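The tile-local fast path amounts to splitting each span at tile boundaries once, so the inner copy never has to ask which tile a pixel belongs to. A minimal sketch follows, assuming a hypothetical 64-pixel tile width; the real tile geometry is not stated on this page.

```python
from typing import List, Tuple

TILE_WIDTH = 64  # hypothetical tile width in pixels


def split_span_by_tile(x: int, length: int,
                       tile_width: int = TILE_WIDTH) -> List[Tuple[int, int, int]]:
    """Split a horizontal span into (tile_index, x_in_tile, run_length) pieces."""
    pieces = []
    end = x + length
    while x < end:
        tile = x // tile_width
        tile_end = (tile + 1) * tile_width
        run = min(end, tile_end) - x
        pieces.append((tile, x - tile * tile_width, run))
        x += run
    return pieces


# A 70-pixel span starting at x=60 crosses two tile boundaries.
print(split_span_by_tile(60, 70))
```

Each piece can then be copied with a straight run inside one tile, and the per-pixel work reduces to the copy itself.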

Dirty-rect bookkeeping:

X-aware clean-rect restore — track dirty X extents per tile so RAM clean-background restore only touches the changed region.
Vertical dirty-row upload bands with an 11-row gap merge — collapses adjacent uploads into wider rectangles.
Long-hold host-deadline catch-up — a small render bookkeeping adjustment that traded seven extra speculative restore/compose calls for five fewer loop VBlanks.
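The gap merge for dirty-row upload bands can be sketched as a single pass over sorted row numbers. This version assumes that "gap" means the count of clean rows between two dirty rows and that a gap of up to 11 is merged; both the interpretation and the function name are assumptions.

```python
from typing import Iterable, List, Tuple


def merge_dirty_rows(rows: Iterable[int], max_gap: int = 11) -> List[Tuple[int, int]]:
    """Merge dirty rows into inclusive (first_row, last_row) upload bands.

    Two dirty rows join the same band when at most max_gap clean rows
    separate them.
    """
    bands: List[List[int]] = []
    for row in sorted(set(rows)):
        if bands and row - bands[-1][1] - 1 <= max_gap:
            bands[-1][1] = row
        else:
            bands.append([row, row])
    return [(first, last) for first, last in bands]


# Rows 5 and 17 are separated by 11 clean rows, so they share a band.
print(merge_dirty_rows([3, 4, 5, 17, 40]))
```

Merging trades some extra uploaded rows for fewer upload rectangles, which is the direction the upload_rects counter rewards.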

Code shape and link:

-ffunction-sections -fdata-sections plus --gc-sections for the PS1 link. The legacy ADS / TTM / FG1 / FOC runtime paths are still in the source tree but get stripped at link time.
Removal of the foreground visual telemetry hot-path body, the legacy foreground diagnostic gate, the unused foreground “ever” diagnostics, the unused ADS foreground start hook, the obsolete FGPILOT ADS dispatch, the unused foreground status accessors, and the dead foreground requested-mode state.

Diagnostic gating:

Pad / SPI diagnostics gated default-off. The pause-menu work introduced always-on JCPAD / JCSPI sampling; a strict-gate red-team pass showed the diagnostics were costing 52 VBlanks of loop time. Default-off recovered that; pad-diag / pad-debug boot tokens still enable them on demand.

The cumulative effect is visible in the current accepted baseline: fishing1 high-tide playback runs at loop_vb=1068 against target_vb=1074. The original headless perf-loop baseline was loop_vb=1426, so the FISHING1 canary is down 358 VBlanks (a 25.11% loop reduction).
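For reference, the reduction figure works out as follows. The three inputs are the values quoted in the text; the variable names are illustrative.

```python
# Canary values quoted in the text (fishing1, high tide).
original_loop_vb = 1426
current_loop_vb = 1068
target_vb = 1074

saved_vblanks = original_loop_vb - current_loop_vb
reduction_pct = 100.0 * saved_vblanks / original_loop_vb
headroom_vblanks = target_vb - current_loop_vb

print(f"saved: {saved_vblanks} VBlanks ({reduction_pct:.2f}% of the original loop)")
print(f"headroom under target: {headroom_vblanks} VBlanks")
```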

Where it sits at v0.8.12-ps1

The current accepted fishing1 high-tide run, captured in the perf log:

policy = stage1_window
buf    = 137048
hits   = 155
due_misses = 0
blocking_vb = 2
prefetch.overrun_vb = 2
loop_vb = 1068
overrun_vb = 0
target_vb = 1074
restore_bytes = 251,144
upload_bytes  = 10,646,400
dirty_rows    = 16,635
upload_rects  = 456
trip = 0   fallback = 0   frame_mismatch = 0
sound_late = 0   cd_fail = 0

That is 0.0% public over target, or 100.0% public target speed. The raw signed CSV row is -0.4% / 100.4%. Across the 126 timing-bearing battle-card rows, the public average is +0.3% over target / 99.7% target speed (0.2708% exact public over target / 99.7337% exact public target speed); the raw signed optimization matrix is -0.4963% / 100.5160%.

The latest WALKSTUF1 high scalar retained-read closure tested the remaining shared append rows after the 427..443 CD-work baseline. Some candidates were exact-flat, and the rows that saved reads paid the win back as visible-loop, target, or refill debt. The next high-side attempt should use generated deadline ownership, pack-side byte/phase reduction, or upload/restore work removal rather than another hand-authored scalar append.

The latest WALKSTUF1 low scheduler sweep tested post-prepare window refill thresholds. Conservative slack did not fire; lower thresholds fired but regressed loop, blocking, refill, and due misses. That closes the cheap prepare-then-refill branch and leaves generated frame-deadline ownership or pack/upload work reduction as the next low-side path.

The latest VISITOR3 high promotion reuses the proven low compact frame143/144 cleanup payloads and repacks frames 141/140/142/143/144 plus sound events inside the existing 277..293 setup segment. It improves high to 1063/1040, overrun 23, blocking/read time 35, and reads/due 6/6, while pack bytes/LBA/sectors and the PS-EXE bucket stay fixed. BUILDING2 low now keeps the earlier 218..229 slack-8 row and adds v739 draw-tail trimming, improving to 1339/1317.

Scene Battle Card

As of 2026-05-14, all 126 scene/tide variants have current headless perf measurements. The latest updated rows are stamped building2-low-trimtails-v739, visitor3-high-tail-pack-v629, visitor5-high-rg30-46-v496, visitor3-low-frame137-primegap-v510, walkstuf1-low-rg78-91-v474, walkstuf1-high-current-v458-refresh, building2-low-rg218-229-slack8-v626, building2-low-delta-v454, visitor5-low-compact-rg23-47-v451, walkstuf1-high-shared-dual-tail-v428, walkstuf1-low-shared-dual-tail-v428, building2-high-rg206-230-cap24-v441, building6-window-slack4-v364, johnny6-compact-fgp3-v354, visitor3-low-tail-pack-only-v338, visitor3-low-f128-resident-seg27-v302, visitor3-high-f131-resident-alias121123-v299, visitor3-low-alias-noop114117-v292, visitor3-high-f140-segment-copy-v291, visitor3-low-noop113-v249, visitor3-low-noop114117-v248, visitor3-high-f127-f130-resident-copy-v238, visitor3-drop-unused-motion-dispatch-v197, activity9-low-compact-fgp3-v174, johnny1-compact-fgp3-v173, walkstuf3-low-compact-fgp3-v171, activity9-high-compact-fgp3-v167, building6-compact-fgp3-v165, walkstuf3-high-compact-fgp3-v163, building2-low-restore-window-slack4-v160, visitor5-high-current-v401, building1-compact-fgp3-noautoprime-v157, mary3-preserve-window-slack8-v149, missing-scenes-current-v001, visitor3-tail-trim-stageguard-v127, graphics-composite-os-v111, building2-low-group365-381-v110, building2-high-group60-72-v109, building2-high-restore-minus-current-v108, visitor3-low-offscreen-exitright-v106, visitor3-high-offscreen-drawclip-v105, walkstuf1-high-primecap144-v089, visitor3-low-readgroup-prune-v088, building4-restore-minus-current-v087, visitor3-restore-minus-current-v086, visitor3-high-readgroup-prune-v084, compact-u16-inline-v083, fgp3v4-drawcompact-all-v082, activity9-dead-readgroup-prune-v082, read-group-selector-single-assign-v082, visitor3-high-group138-162-slack4-v081, walkstuf1-low-primecap160-v081, johnny2-prefetch-relief-v081, activity9-low-fgp3-cleanup-compact-v081, 
activity9-current-v081-refresh, building4-fgp3-cleanup-compact-window-v081, building2-fgp3-cleanup-compact-v081, visitor3-fgp3-cleanup-compact-v081, mary2-prefetch-relief-v081, mary2-fgp3-padded-v081, johnny2-fgp3-padded-v081, mary5-fgp3-padded-v081, activity11-fgp3-padded-v081, building5-fgp3-padded-v080, walkstuf1-fgp2-setup-prime-v080, visitor3-setup-prime-192k-v080, visitor3-high-group170-186-v080-current, activity9-lowgroup-v072c, activity9-fgp3-v072c, activity9-window-v072c, activity4-fishing4-v072c-prefetch-relief, activity1-v072c-current-refresh, activity11-12-v072c-prefetch-relief, stale-next-v072c-current-refresh, mary1-v072c-prefetch-relief, stale-layout-v072c-current-refresh, activity9-v072c-prefetch-relief, stale-pressure2-v072c-current-refresh, johnny1-v072c-prefetch-relief, stale-pressure-v072c-current-refresh, activity10-johnny3-v072-prefetch-relief, stale-zero2-v072b-current-refresh, stale-zero-v072b-current-refresh, stale-top-v072b-current-refresh, visitor5-v072-prefetch-relief, mismatch-top-v072-current-refresh, stand-family-v072-current-refresh, visitor4-v072-current-refresh, stand1-v072-current-refresh, visitor3-v072-prefetch-relief, fishing5-v065-current-ledger-overlay, compact-fgp3-v66-final-frame-hold, compact-fgp3-v64-building2-group318-330, compact-fgp3-v63-building2low-prime, and indexed8-row-local-dirty-v1; other refreshed rows include compact-fgp3-v62-fishing3low-group253-265, compact-fgp3-v61-fishing3low-group163-175, compact-fgp3-v60-visitor3high-group230-242, compact-fgp3-v59-visitor3high-group72-84, indexed8-tile-local-compose-v1, compact-fgp3-v58-activity9high-window20-table, compact-fgp3-v57-policy-table-refactor, and compact-fgp3-v49-walkstuf2-auto-prime through compact-fgp3-v29-smallprime, and the full-matrix baseline rows are stamped compact-fgp3-v2-fullmatrix. 63 of 63 scenes have at least one routed variant, and 63 scenes have both high- and low-tide variants routed. 
All 126 rows now carry active-loop timing; suzy1 needs the longer 12000-frame matrix budget because its valid scene-end lands after the default 7200-frame window. The latest matrix run is 2026-05-13T21:31:34; per-row freshness and stats version are shown on the battle card. The values below are public-capped over target / target speed (loop_vb/target_vb), with blk and due called out when nonzero. Faster-than-target rows display 0.0% / 100.0%; their raw signed values remain in docs/ps1/performance-scene-matrix.csv.
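The relationship between raw signed values and the public-capped display values can be sketched as below. The formulas are inferred from the table rows on this page rather than taken from the matrix tooling, so treat them as an assumption; the function names are hypothetical.

```python
from typing import Tuple


def raw_over_target_pct(loop_vb: int, target_vb: int) -> float:
    """Signed percentage by which the loop exceeds its target."""
    return 100.0 * (loop_vb - target_vb) / target_vb


def raw_target_speed_pct(loop_vb: int, target_vb: int) -> float:
    """Target speed as a percentage; above 100 means faster than target."""
    return 100.0 * target_vb / loop_vb


def public_values(loop_vb: int, target_vb: int) -> Tuple[float, float]:
    """Public display pair: over-target floored at 0, speed capped at 100."""
    over = max(0.0, raw_over_target_pct(loop_vb, target_vb))
    speed = min(100.0, raw_target_speed_pct(loop_vb, target_vb))
    return round(over, 1), round(speed, 1)


# loop_vb / target_vb pairs copied from rows in the table on this page.
for scene, loop_vb, target_vb in [("activity6 high", 912, 911),
                                  ("building1 high", 794, 778),
                                  ("walkstuf1 low", 1484, 1431),
                                  ("fishing1 high", 1068, 1074)]:
    print(scene, public_values(loop_vb, target_vb))
```

Rows that beat their target collapse to 0.0 / 100.0 in the public view, which is why the signed CSV remains the place to look for remaining headroom.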

The complete matrix pass is compact-fgp3-v2-fullmatrix; accepted follow-up rows now use visitor3-high-tail-pack-v629, visitor5-high-rg30-46-v496, visitor3-low-frame137-primegap-v510, walkstuf1-low-rg78-91-v474, walkstuf1-high-current-v458-refresh, building2-low-trimtails-v739, building2-low-rg218-229-slack8-v626, building2-low-delta-v454, visitor5-low-compact-rg23-47-v451, walkstuf1-high-shared-dual-tail-v428, walkstuf1-low-shared-dual-tail-v428, building2-low-rg238-250-v445, building2-high-rg206-230-cap24-v441, building6-window-slack4-v364, visitor3-high-f131-resident-alias121123-v299, visitor3-low-tail-pack-only-v338, visitor3-low-f128-resident-seg27-v302, visitor3-low-alias-noop114117-v292, visitor3-high-f140-segment-copy-v291, visitor3-low-noop113-v249, visitor3-low-noop114117-v248, visitor3-high-f127-f130-resident-copy-v238, visitor3-drop-unused-motion-dispatch-v197, johnny1-compact-fgp3-v173, walkstuf3-low-compact-fgp3-v171, activity9-high-compact-fgp3-v167, building6-compact-fgp3-v165, walkstuf3-high-compact-fgp3-v163, building2-low-restore-window-slack4-v160, visitor5-high-current-v401, building1-compact-fgp3-noautoprime-v157, mary3-preserve-window-slack8-v149, visitor3-tail-trim-stageguard-v127, graphics-composite-os-v111, building2-low-group365-381-v110, building2-high-group60-72-v109, building2-high-restore-minus-current-v108, visitor3-low-offscreen-exitright-v106, visitor3-high-offscreen-drawclip-v105, walkstuf1-compact-fgp3-v141, visitor3-low-readgroup-prune-v088, building4-restore-minus-current-v087, visitor3-restore-minus-current-v086, visitor3-high-readgroup-prune-v084, fgp3v4-drawcompact-all-v082, compact-u16-inline-v083, visitor3-fgp3-cleanup-compact-v081, walkstuf1-low-primecap160-v081, johnny2-prefetch-relief-v081, mary2-prefetch-relief-v081, mary2-fgp3-padded-v081, johnny2-fgp3-padded-v081, mary5-fgp3-padded-v081, activity11-fgp3-padded-v081, building5-fgp3-padded-v080, walkstuf1-fgp2-setup-prime-v080, visitor3-setup-prime-192k-v080, 
visitor3-high-group170-186-v080-current, activity9-lowgroup-v072c, activity9-fgp3-v072c, activity9-window-v072c, johnny6-compact-fgp3-v354, activity4-fishing4-v072c-prefetch-relief, activity1-v072c-current-refresh, activity11-12-v072c-prefetch-relief, stale-next-v072c-current-refresh, mary1-v072c-prefetch-relief, stale-layout-v072c-current-refresh, activity9-v072c-prefetch-relief, stale-pressure2-v072c-current-refresh, johnny1-v072c-prefetch-relief, stale-pressure-v072c-current-refresh, activity10-johnny3-v072-prefetch-relief, stale-zero2-v072b-current-refresh, stale-zero-v072b-current-refresh, stale-top-v072b-current-refresh, visitor5-v072-prefetch-relief, mismatch-top-v072-current-refresh, stand-family-v072-current-refresh, visitor4-v072-current-refresh, stand1-v072-current-refresh, visitor3-v072-prefetch-relief, compact-fgp3-v66-final-frame-hold, fishing5-v065-current-ledger-overlay, compact-fgp3-v64-building2-group318-330, compact-fgp3-v63-building2low-prime, and indexed8-row-local-dirty-v1; other refreshed rows include compact-fgp3-v62-fishing3low-group253-265, compact-fgp3-v61-fishing3low-group163-175, compact-fgp3-v60-visitor3high-group230-242, compact-fgp3-v59-visitor3high-group72-84, indexed8-tile-local-compose-v1, compact-fgp3-v58-activity9high-window20-table, compact-fgp3-v57-policy-table-refactor, and compact-fgp3-v49-walkstuf2-auto-prime through compact-fgp3-v29-smallprime. Older padded-fgp3-v1 / compact-fgp3-v1 rows are historical only.

Scene	High tide	Low tide
`activity1`	0.0% / 100.0% (2754/2764); blk 1	0.0% / 100.0% (2754/2765)
`activity4`	0.0% / 100.0% (1065/1066); blk 4	0.0% / 100.0% (1064/1068); blk 1
`activity5`	0.0% / 100.0% (1730/1749); blk 2	0.0% / 100.0% (1731/1749); blk 2
`activity6`	+0.1% / 99.9% (912/911)	+0.1% / 99.9% (912/911)
`activity7`	0.0% / 100.0% (593/596)	0.0% / 100.0% (594/596)
`activity8`	0.0% / 100.0% (898/904); blk 1	0.0% / 100.0% (899/904); blk 2
`activity9`	+1.0% / 99.0% (2082/2062); due 1; blk 24	+0.7% / 99.3% (2075/2061); due 1; blk 17
`activity10`	0.0% / 100.0% (1259/1259); due 1; blk 7	0.0% / 100.0% (1255/1256); due 2; blk 17
`activity11`	0.0% / 100.0% (1715/1722); blk 2	0.0% / 100.0% (1717/1722); blk 4
`activity12`	0.0% / 100.0% (1411/1412); blk 7	0.0% / 100.0% (1409/1411); due 1; blk 10
`building1`	+2.1% / 98.0% (794/778); blk 21	+1.9% / 98.1% (794/779); blk 21
`building2`	+3.0% / 97.0% (1351/1311); due 7; blk 54	+2.3% / 97.8% (1349/1319); due 17; blk 80
`building3`	0.0% / 100.0% (5460/5465)	0.0% / 100.0% (5460/5465)
`building4`	+1.0% / 99.0% (2844/2816); due 1; blk 37	+1.3% / 98.7% (2853/2816); due 1; blk 40
`building5`	0.0% / 100.0% (3343/3348); blk 5	0.0% / 100.0% (3345/3347); blk 8
`building6`	+1.0% / 99.0% (2482/2457); blk 25	+1.2% / 98.8% (2485/2456); blk 28
`building7`	0.0% / 100.0% (3132/3133); blk 9	0.0% / 100.0% (3130/3133); blk 7
`fishing1`	0.0% / 100.0% (1068/1074); blk 2	0.0% / 100.0% (1067/1074); blk 1
`fishing2`	0.0% / 100.0% (1761/1763); blk 6	0.0% / 100.0% (1759/1765); blk 3
`fishing3`	+0.6% / 99.4% (1962/1950); due 1; blk 17	+0.1% / 99.9% (1957/1955); blk 9
`fishing4`	0.0% / 100.0% (835/842); blk 2	0.0% / 100.0% (834/843)
`fishing5`	0.0% / 100.0% (885/890)	0.0% / 100.0% (885/890)
`fishing6`	0.0% / 100.0% (744/753)	0.0% / 100.0% (744/753)
`fishing7`	0.0% / 100.0% (715/725)	0.0% / 100.0% (715/725)
`fishing8`	0.0% / 100.0% (1243/1253)	0.0% / 100.0% (1243/1253)
`johnny1`	+1.4% / 98.6% (1973/1945); blk 25	+1.4% / 98.6% (1973/1945); blk 25
`johnny2`	0.0% / 100.0% (1741/1751)	0.0% / 100.0% (1741/1751)
`johnny3`	0.0% / 100.0% (1158/1161); due 1; blk 10	0.0% / 100.0% (1157/1166)
`johnny4`	0.0% / 100.0% (1204/1214)	0.0% / 100.0% (1204/1214)
`johnny5`	0.0% / 100.0% (811/820)	0.0% / 100.0% (810/820)
`johnny6`	+1.0% / 99.0% (2829/2802); blk 24	+1.0% / 99.0% (2830/2802); blk 25
`mary1`	+0.8% / 99.2% (4867/4830); due 2; blk 47	+0.4% / 99.6% (4860/4840); due 1; blk 31
`mary2`	0.0% / 100.0% (2241/2248); blk 2	0.0% / 100.0% (2242/2250); blk 2
`mary3`	+0.1% / 99.9% (2296/2294); due 13; blk 53	+0.1% / 99.9% (2297/2295); due 13; blk 51
`mary4`	0.0% / 100.0% (1968/2016); due 3; blk 28	0.0% / 100.0% (1966/2019); due 3; blk 24
`mary5`	0.0% / 100.0% (1581/1586); due 1; blk 5	0.0% / 100.0% (1581/1584); due 1; blk 6
`miscgag1`	0.0% / 100.0% (953/961)	0.0% / 100.0% (953/961)
`miscgag2`	0.0% / 100.0% (1352/1356)	0.0% / 100.0% (1352/1356)
`stand1`	0.0% / 100.0% (194/202)	0.0% / 100.0% (194/202)
`stand2`	0.0% / 100.0% (480/490)	0.0% / 100.0% (480/490)
`stand3`	0.0% / 100.0% (547/557)	0.0% / 100.0% (547/557)
`stand4`	0.0% / 100.0% (1202/1220)	0.0% / 100.0% (1203/1218); blk 3
`stand5`	0.0% / 100.0% (1442/1460)	0.0% / 100.0% (1442/1460)
`stand6`	0.0% / 100.0% (1346/1364)	0.0% / 100.0% (1346/1364)
`stand7`	0.0% / 100.0% (520/538)	0.0% / 100.0% (520/538)
`stand8`	0.0% / 100.0% (483/499); blk 2	0.0% / 100.0% (483/499); blk 2
`stand9`	0.0% / 100.0% (520/538)	0.0% / 100.0% (522/538)
`stand10`	0.0% / 100.0% (528/538)	0.0% / 100.0% (528/538)
`stand11`	0.0% / 100.0% (528/538)	0.0% / 100.0% (528/538)
`stand12`	0.0% / 100.0% (1450/1459); blk 1	0.0% / 100.0% (1450/1460)
`stand15`	0.0% / 100.0% (444/452)	0.0% / 100.0% (444/452)
`stand16`	+0.2% / 99.8% (473/472)	+0.2% / 99.8% (473/472)
`suzy1`	no active loop	no active loop
`suzy2`	no active loop	no active loop
`visitor1`	0.0% / 100.0% (672/677)	0.0% / 100.0% (672/677)
`visitor3`	+2.2% / 97.8% (1063/1040); due 6; blk 35	+2.1% / 97.9% (1062/1040); due 7; blk 42
`visitor4`	0.0% / 100.0% (424/428)	0.0% / 100.0% (424/428)
`visitor5`	+1.1% / 98.9% (1104/1092); blk 11	+2.0% / 98.0% (1112/1090); blk 12
`visitor6`	0.0% / 100.0% (2043/2047); blk 1	0.0% / 100.0% (2043/2047); blk 1
`visitor7`	0.0% / 100.0% (1619/1625)	0.0% / 100.0% (1619/1625)
`walkstuf1`	+3.4% / 96.8% (1480/1432); due 16; blk 83	+3.7% / 96.4% (1484/1431); due 12; blk 72
`walkstuf2`	0.0% / 100.0% (451/461)	0.0% / 100.0% (451/461)
`walkstuf3`	+0.9% / 99.1% (2310/2290); due 6; blk 47	+1.2% / 98.8% (2321/2293); due 5; blk 41

Detail-tier attribution for the canary currently points at render and restore pressure rather than CD stalls:

sched.wait       = 722
sched.present    = 99
sched.cd_stage   = 137
sched.cd_window  = 19
gfx.restore_bytes = 251,144
gfx.upload_bytes  = 8,643,840

The FISHING1 canary remains at the public 100.0% cap with raw signed headroom, but the full battle card still has CD-heavy scenes (visitor3, building2, building6, walkstuf1, and building4). The clean-pressure relief rows prove scene-local CD policy can recover large due-miss collapses, while the refreshed stale rows prove current-pack baselines must be cleared before ranking fixed overhead.

Next plausible wins, in priority order:

Generated read grouping or setup/data-shape work. WALKSTUF1 low/high are now the largest gaps at +53/+48 VBlanks after the latest high 444..456 same-speed CD-work reduction. BUILDING2 high/low (+38/+31) and VISITOR3 low/high (+32/+31) are the next tight rows after the VISITOR3 motion-copy, setup-segment, setup-prime, guarded second-segment, resident-copy, and low no-op residual passes; its local C read-table rows are exhausted, so the next CD-shape pass needs generated scheduler ownership, selective preprocessing, or further pack data-shape work rather than hand-authored ranges. The default selective upload-ready plan is footprint-closed as a same-layout append because 2111224 bytes of payload plus rect metadata exceed the current 970076 bytes of VISITOR3 high-pack slack. The budgeted analyzer target keeps this same-footprint lane alive with 78 selected frames, 968904 payload+rect bytes, and 4232112 modeled upload bytes saved before runtime implementation. The empty-hold no-op recast is closed because the current packs expose 0 zero-visual-work entries. The packed-draw metadata probes prove a real VISITOR3 byte-reduction signal: the v4 draw-tail trim plus VISITOR3 stage guard is now promoted, while the v7 runtime decoder shape remains rejected because it perturbs BUILDING2 and BUILDING4 canaries. A layout-neutral packed-delta retry keeps LBAs and the PS-EXE bucket fixed, but its function-scoped PAL4 span -Os trade regresses VISITOR3 high while improving low tide, so that C-side shape is closed too. An entry-origin recentering size gate also saves 0 bytes on current VISITOR3 high/low FGP3/v4 payloads, so that zero-runtime-code coordinate-shift lane is closed before emulator time.
FG2-specific present pipeline with explicit slack budgeting. Earlier present-prep experiments regressed because they stole CD prefetch slack; the next scheduler needs separate render-prep and CD-prefetch budgets.
X-aware dirty upload and rect-pressure control. The FISHING1 canary still restores about 251 KB and uploads about 8.6 MB (8,643,840 bytes in the detail-tier record above); larger scenes carry more upload pressure.
Specialized indexed8 and PAL4 compositors. The pack-format wins reduce bytes, but dense scenes still pay per-span/per-pixel runtime costs.
Generated scheduler ownership for the remaining under-99 rows. MARY3 is now green after the guarded prefetch-preserve pass, and BUILDING6 moved to the bottom of the yellow band after compact-pack promotion. The remaining hard rows are VISITOR3 low/high, WALKSTUF1 low/high, BUILDING2 high/low, VISITOR5 low/high, JOHNNY1 high/low, BUILDING4 low, and BUILDING6 high/low, where hand-authored read groups and scalar window changes have repeatedly shifted cadence instead of safely removing work. The latest WALKSTUF1 low v747/v749/v750/v751/v753/v755/v756/v757/v759/v762/v763/v766/v767/v769/v770/v771/v772/v773/v774/v775/v776/v777/v779/v780/v781/v782/v783/v784/v785/v786/v787/v788/v789/v790/v791 pass keeps the row exact-flat while shrinking frames 51, 49, 47, 61, 62, 58, 45, 37, 35, 43, 41, 57, 33, 67, 68, 69, 32, 133, 5, 141, 70, 30, 6, 71, 72, 142, 73, 131, 74, 19, 28, 138, 145, 75, and 76 in-place (879801 -> 801103 active payload), and v760 restores the bounded CD fast-poll runtime to 60/272 read time, so W1-low now has a safe no-shift payload lane but still needs a sector/read timing conversion. The BUILDING4 low v387 pass closes the local 178..202 append group and 40/48 KiB stream-window growth: reads fell, but visible blocking and loop overrun rose sharply. The newer v746 in-place frame291 shrink proves no-shift payload reduction is safe, cutting active payload 855284 -> 849109 while staying exact-flat, so that row now needs sector-changing no-shift byte reduction, generated deadline ownership, or selective preprocessing rather than larger raw fresh fills.

The author considers the current build comfortable for the validated scenes, not yet headroom-clean. The canary bottleneck is no longer raw CD stall; the matrix bottleneck is uneven per-scene payload/read shape plus render/restore pressure.
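
The separate render-prep and CD-prefetch budgets called for in the list above could look something like the sketch below. It is an illustration only: the class name, pool names, and microsecond figures are hypothetical, and the real scheduler may account for slack in entirely different units.

```python
class SlackBudget:
    """Held-VBlank slack split into two pools that never borrow from each other."""

    def __init__(self, render_prep_us: int, cd_prefetch_us: int):
        # Hypothetical pools; the key point is that they are tracked separately.
        self.remaining = {
            "render_prep": render_prep_us,
            "cd_prefetch": cd_prefetch_us,
        }

    def take(self, pool: str, cost_us: int) -> bool:
        """Charge cost_us to one pool; refuse rather than spill into the other."""
        if cost_us > self.remaining[pool]:
            return False
        self.remaining[pool] -= cost_us
        return True


budget = SlackBudget(render_prep_us=6000, cd_prefetch_us=9000)
budget.take("render_prep", 5000)   # fits in the render-prep pool
budget.take("render_prep", 2000)   # refused: only 1000 us left in that pool
budget.take("cd_prefetch", 9000)   # prefetch slack is untouched by the refusal
```

The design point is the refusal: present-prep work that does not fit is deferred instead of being paid for out of prefetch slack.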

Non-goals

A few things the perf work explicitly does not chase, with reasons:

Frame dropping. Violates pixel-perfect playback. The acceptance bar requires every captured entry to render on its captured beat.
Timing compression before throughput work. The timing-bearing matrix public average is now +0.2708% over target / 99.7337% target speed, with several worse CD-bound outliers; compressing the timing files would expose the same throughput bottleneck without fixing it.
Reintroducing FG1 / ADS / TTM runtime paths. Those are retired from the active public path. The PS1 executable links only the scene-playback runtime plus the minimal background / audio / input / CD layers it needs.
Fixed island assumptions. The runtime must randomly place the island per scene, so all optimizations must preserve scene-relative FG2 placement.
Direct framebuffer or progressive-mode experiments as first moves. Prior history says these were unstable. Exhaust stable scene playback first.
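
The two averages quoted in the timing-compression item are not simple reciprocals: 100 / 1.002708 is about 99.7299, not 99.7337. The sketch below shows one reading that produces a gap in that direction. It assumes target speed is target time over measured time and that both figures are averaged per row; the function names and the three-row sample are hypothetical, and the battle card may compute its columns differently.

```python
def target_speed_pct(overrun_pct: float) -> float:
    """Assumed definition: target time divided by measured time, as a percentage."""
    return 100.0 / (1.0 + overrun_pct / 100.0)


def matrix_averages(overruns_pct):
    """Return (mean overrun %, mean target speed %) across rows."""
    n = len(overruns_pct)
    mean_overrun = sum(overruns_pct) / n
    mean_speed = sum(target_speed_pct(o) for o in overruns_pct) / n
    return mean_overrun, mean_speed


# Hypothetical matrix: two rows on target and one CD-bound outlier.
mean_overrun, mean_speed = matrix_averages([0.0, 0.0, 10.0])
print(mean_overrun, mean_speed, target_speed_pct(mean_overrun))
```

With outliers in the mix, the per-row mean speed comes out higher than the speed implied by the mean overrun, so a healthy-looking average can still hide the CD-bound rows.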

Related pages

Performance battle card — the live timing matrix whose columns this reference describes. 126 scene/tide variants, sortable, color-coded.
From 87 to 99.5: the post-validation performance loop — the retrospective on the optimization arc, including which experiments landed and which got rejected.
v0.8.1: what the soak found that the matrix didn’t — the soak-loop war story; matrix and soak are not redundant.
The 24/7 build farm — the magazine treatment of the parallel Docker machinery that iterates the perf experiments whose output this reference describes. Same JCPERF / JCPERF2 records, but framed as methodology for keeping a 126-row matrix moving.
Hardware — what the optimizations are running against.
Build & toolchain — how the PS1 binary is produced.
Build infrastructure — the wrapper around the perf iterate script.
Audio pipeline — the SPU side, which has its own scheduling concerns.
Story-loop walks — the walk subsystem’s persistent clean buffer is part of the same pressure-accounting envelope the matrix above measures; the v0.8.0 clean-rect retry path and v0.8.1 wave-band/split-rect pressure changes are documented there.
Vision-classifier work — the validation layer that runs against perf-experiment outputs.
Devlog — perf work shows up day-by-day there.

View source on GitHub

The body cites a dozen files; this section collects them. Grouped by purpose — plan and ledgers, runtime, iterate gate, the scene matrix, the compiler-flag and preprocessing sweeps, the read-plan rollup, and the regtest runner.

docs/ps1/performance-optimization-plan.md · docs/ps1/performance-experiment-log.md — the optimization plan and the 600+ experiment ledger.
src/ps1_perf.c · src/foreground_pilot.c — runtime: the JCPERF/JCPERF2 instrumentation and the FG2 dispatcher whose per-frame budget the matrix measures.
scripts/ps1-perf-iterate.sh — the experiment gate every probe goes through (run → compare → promote-or-reject).
docs/ps1/performance-scene-matrix.csv — the full scene/tide battle card; rendered as the live sortable matrix at /perf/.
docs/ps1/performance-o2-audit.md · docs/ps1/performance-o2-audit.csv — current compiler-flag sweep, regenerated from build-ps1/compile_commands.json + build-ps1/jcreborn.map before each -O2 probe.
docs/ps1/performance-preprocess-opportunities.md · docs/ps1/performance-preprocess-opportunities.csv · scripts/analyze-fg2-preprocess-plans.py · docs/ps1/performance-preprocess-visitor3-hotspots.csv — pack-time graphics preprocessing target sheet, the FGP2/FGP3 per-pack analyzer, and the VISITOR3 cap-hit / saving-heavy frame sheet.
docs/ps1/performance-read-candidate-matrix.md · docs/ps1/performance-read-candidate-matrix.csv — foreground read-plan candidates classified by append-start ownership, grouped-read capacity, and visible-CD cost class.
scripts/run-regtest.sh — headless DuckStation runner that captures PNGs and ingests TTY records into per-run summary JSON files.
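
The run, compare, promote-or-reject shape of the experiment gate can be pictured with a small sketch. It is not the logic of scripts/ps1-perf-iterate.sh: the function name, the per-row "lower is better" metric, and the tolerance parameter are all assumptions made for illustration.

```python
def gate(baseline: dict, candidate: dict, tolerance: float = 0.0):
    """Promote a candidate only if no baseline row gets worse.

    Both dicts map a row name to a lower-is-better metric (hypothetical).
    A row missing from the candidate counts as a regression.
    """
    regressed = sorted(
        row
        for row, base in baseline.items()
        if candidate.get(row, float("inf")) > base + tolerance
    )
    verdict = "reject" if regressed else "promote"
    return verdict, regressed


# Hypothetical two-row comparison: one row improves, one regresses.
print(gate({"A": 1.0, "B": 2.0}, {"A": 0.9, "B": 2.5}))
```

Rejecting on any single regressed row is the conservative choice here; a gate that weighs improvements against regressions would need an explicit policy for the trade.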