Reference + log
Performance work
What "performance" means on a 33 MHz machine, what got measured, what got tried, and what stuck.
~18 min read · 4600 words
A labor of love by Hunter Davis. This page is the running summary of perf work on the PS1 port at v0.8.12-ps1: where the bottleneck is, what got measured, which experiments stayed in the build, and which got reverted. The full per-experiment ledger lives in the source tree; the link is at the bottom. The retrospective on how the matrix moved from the compact baseline to the current battle card — From 87 to 99.5: the post-validation performance loop — is in the Lab. If you paid for this, you were cheated. Open source and free.
The constraint
“Performance” on a PS1 means a different shape of problem than performance on anything modern.
The MIPS R3000A core runs at 33.8688 MHz with no FPU. The GPU is fixed-function — sprites, primitives, an ordering table, no shaders. Audio is a separate processor with its own RAM. The CD is a 2x drive: 300 KB/s sustained, 150 ms cold seek. There is no memory bandwidth budget worth talking about for a 16-color screensaver port; the bandwidth budget is the CD’s, and it gets spent in seek latency, not transfer time.
The frame budget at 60 Hz is 16.6 ms. Johnny Castaway is a 1992 VGA screensaver — at the source level, foreground content changes roughly four times per second. The PS1 still has to draw a frame at 60 Hz, but it can hold the same content frame after frame for many VBlanks at a stretch. The VBlank cadence is the rendering loop’s heartbeat; the interesting timing is which VBlanks have actual work in them and which are held idle.
That asymmetry is what shapes the runtime. A held VBlank is free CPU and
free CD bus. The whole optimization story is about scheduling work — CD
reads, RAM tile composition, dirty-row uploads — into held VBlanks before
the next “real” frame arrives. When that scheduling fails, the active
frame’s VBlank gets stretched and loop_vb goes up.
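The arithmetic behind that asymmetry is small enough to state directly. A minimal sketch, assuming the approximate 4 Hz source cadence from above; the helper names are illustrative, not runtime code:

```c
#include <assert.h>

/* Back-of-envelope for the held-VBlank asymmetry; the 4 Hz source
 * cadence is approximate and these helpers are illustrative. */

int vblanks_per_content_frame(int vblank_hz, int content_hz)
{
    return vblank_hz / content_hz;      /* 60 / 4 = 15 */
}

/* All but the "real" VBlank of each content frame is held: free CPU
 * and free CD bus for prefetch, composition, and dirty-row uploads. */
int held_vblanks_per_content_frame(int vblank_hz, int content_hz)
{
    return vblanks_per_content_frame(vblank_hz, content_hz) - 1;
}
```

At 60 Hz against a 4 Hz source, roughly 14 of every 15 VBlanks are held, which is the slack the scheduler spends.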
The frame budget for a screensaver is more forgiving than for a game. Nothing the user does requires sub-frame latency. But the project's acceptance bar is pixel-perfect playback against host-captured reference frames, which means the runtime cannot drop frames or compress timing files to “feel faster” — it has to render every captured entry on the captured beat. Slack exists in the held intervals; it does not exist in the entries.
What was measured
The perf instrumentation lives in
src/ps1_perf.c. It is
gated so it adds zero cost when off.
Three signal sources:
- TTY printf at scene-start and scene-end with structured JCPERF / JCPERF2 records. Levels: OFF, SUMMARY, DETAIL, DEBUG. Only the on-demand records cross the TTY surface; per-frame text is forbidden in hot paths because it perturbs timing.
- ps1_perf module counters for VBlank-level metrics: loop_vb, target_vb, overrun_vb, blocking_vb, prefetch_overrun_vb, due_misses, restore_bytes, upload_bytes, dirty_rows, upload_rects, loop_reads. Each scene-end record dumps the steady-state values for that run.
- Regtest harness frame timing. The headless DuckStation in scripts/run-regtest.sh boots the disc image, captures PNGs, and ingests the TTY records into per-run summary JSON files under scratch/ps1-perf-iterate/<runId>/.
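As a rough illustration of the zero-cost-when-off gating, here is a hypothetical counter module in the spirit of ps1_perf. The enum, struct fields, and function names are invented for this sketch, not the real API:

```c
#include <assert.h>

/* Hypothetical shape of gated perf counters; names are illustrative. */

enum perf_level { PERF_OFF, PERF_SUMMARY, PERF_DETAIL, PERF_DEBUG };

struct perf_counters {
    unsigned loop_vb, target_vb, overrun_vb, blocking_vb;
    unsigned prefetch_overrun_vb, due_misses;
    unsigned restore_bytes, upload_bytes;
    unsigned dirty_rows, upload_rects, loop_reads;
};

enum perf_level g_perf_level = PERF_OFF;

/* Even the increment is skipped when OFF, so the instrumentation adds
 * essentially nothing to hot paths in a production boot. */
void perf_count_dirty_row(struct perf_counters *c)
{
    if (g_perf_level == PERF_OFF)
        return;
    c->dirty_rows++;
}
```

The record dump at scene-end would then read the struct once, outside any hot path.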
Every experiment goes through the same gate:
scripts/ps1-perf-iterate.sh
runs the case, compares it to a baseline summary.json, and either
promotes (if a key metric improved without a material regression in
loop_vb / blocking_vb / prefetch_overrun_vb / scene identity) or
rejects with a recorded failure reason.
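The promote/reject rule can be sketched as a pure function over two run summaries. This is an illustration of the gate's shape with invented field names and a simplified key metric (loop_vb only); the real thresholds and the scene-identity check live in the harness:

```c
#include <assert.h>

/* Sketch of the promotion gate; field and function names are invented. */

struct run_summary {
    int loop_vb, blocking_vb, prefetch_overrun_vb;
    int scene_identity_ok;   /* pixel-perfect against reference frames */
};

/* Promote only if the key metric improved and no guard metric regressed. */
int promote(const struct run_summary *base, const struct run_summary *cand)
{
    if (!cand->scene_identity_ok)
        return 0;
    if (cand->blocking_vb > base->blocking_vb)
        return 0;
    if (cand->prefetch_overrun_vb > base->prefetch_overrun_vb)
        return 0;
    return cand->loop_vb < base->loop_vb;
}
```

A candidate that wins loop time but pays for it in blocking or prefetch overrun is rejected, which is exactly how most "obvious" experiments die.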
The full experiment log is at
docs/ps1/performance-experiment-log.md.
At the time of writing it contains 600+ experiment rows going back to
2026-04-25. Most of them failed.
The full scene/tide battle card is
docs/ps1/performance-scene-matrix.csv
and is rendered as the live, sortable, color-coded battle card at
/perf/. It is not the human
scene-promotion ledger at /scenes/;
the two ledgers stay separate on purpose — different bars,
different cadences, different failure modes.
The current compiler-flag sweep is tracked in
docs/ps1/performance-o2-audit.md
and its machine-readable
performance-o2-audit.csv.
That report is regenerated from build-ps1/compile_commands.json and
build-ps1/jcreborn.map before each -O2 probe.
The current pack-time graphics preprocessing target sheet is
docs/ps1/performance-preprocess-opportunities.md
and its machine-readable
performance-preprocess-opportunities.csv.
It ranks today’s FG2/FGP3 packs for selective upload-ready or cleanup metadata
work without changing the runtime baseline.
The per-pack detail analyzer
scripts/analyze-fg2-preprocess-plans.py
now parses both FGP2 and FGP3 temporal-residual payloads. Its VISITOR3 output
splits cap-hit frames from saving-heavy frames, which keeps the next
upload-ready experiment selective instead of a whole-pack conversion. The
current VISITOR3 frame sheet is
docs/ps1/performance-preprocess-visitor3-hotspots.csv.
The current default VISITOR3 high-tide selective plan is still too large for a
same-footprint append: it models 5730024 selected upload bytes saved, but the
upload-ready payload plus rect metadata needs 2111224 bytes against only
970076 bytes of padded zero-tail slack. The analyzer now emits the
same-footprint budgeted target too: 78 / 92 default-selected frames fit in
968904 payload+rect bytes, leave 1172 bytes of slack, and retain 4232112
modeled upload bytes saved. The analyzer now also reports whether those x-band
uploads are safe to emit from foreground data alone. For VISITOR3, 0
selected x-band bytes are fully covered by current opaque draw spans, so a raw
pack-emitted upload payload would have to bake restored background pixels that
are dynamic at runtime. The next probe should use a different generated data
shape, explicit scheduler ownership, compression plus a safe pixel source, or a
deliberate layout-moving experiment.
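The same-footprint arithmetic above reduces to a single fit test. A restatement using the byte figures quoted in the text; the function is just the comparison the analyzer performs:

```c
#include <assert.h>

/* Same-footprint fit test: an appended upload-ready payload (plus its
 * rect metadata) must fit inside the pack's padded zero-tail slack. */
int fits_same_footprint(long payload_plus_rect_bytes,
                        long zero_tail_slack_bytes)
{
    return payload_plus_rect_bytes <= zero_tail_slack_bytes;
}
```

The default VISITOR3 plan needs 2111224 bytes against 970076 of slack and fails; the budgeted 968904-byte target fits with 1172 bytes to spare.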
A tempting VISITOR3 shortcut was rejected: pruning visually no-op FGP3 entries
reduced active payload and high-tide visible blocking, but hidden prefetch
overrun regressed from 0 to 56 high and 17 low. That confirms the next
VISITOR3 route needs explicit scheduler ownership or budgeted upload-ready data,
not isolated entry-count pruning. The safer pack-side empty-hold recast also
found 0 current VISITOR3 high/low entries whose cleanup and draw pixel counts
are both zero, so there is no cadence-preserving no-op payload to erase under
the current FGP3/v4 data.
The current post-`-O2` tooling pass also records compact baseline
fingerprints in every perf summary and classifies foreground read-plan
candidates by observed append-start ownership, current grouped-read capacity,
and visible-CD cost class. That makes stale-baseline comparisons, no-op read
groups, and tight visible-cluster candidates visible before a runtime source
edit.
Those foreground read-plan candidates are now rolled up into
docs/ps1/performance-read-candidate-matrix.md
and its machine-readable
performance-read-candidate-matrix.csv.
The current report has one guarded BUILDING2 candidate, no standalone-safe
rows, and keeps VISITOR3 in the scheduler-owned or closed lane. Remaining read-timing candidates should not
be promoted as raw hand-authored table ranges without the same kind of
slack/scheduler proof. The BUILDING6 v353 181..197 / 269..285 probe is
now the concrete counterexample for direct-stage clusters: the source table
crossed the PS-EXE bucket, never produced a group_hit, and left active read
counts unchanged, so BUILDING6 needs generated direct-stage ownership or a
pack-side data-shape change rather than another local read-group row.
Experiments that didn’t work
A representative slice of rejected experiments and why each one didn’t stick. The pattern is more useful than any individual line — almost every “obvious” idea gets discarded because the PS1 runtime has counter-intuitive cost structure.
- Larger stream windows. 40 KB, 56 KB, 64 KB. Larger windows reduce CD transaction count but overrun held slack more often. The current default is 20 KB after a long sweep; everything bigger lost.
- Smaller stream windows. 12 KB, 14 KB, 16 KB. Smaller windows reduce per-refill overrun but starve due frames — due_misses rises and blocking_vb follows. The knee is sharp; one sector size in either direction matters.
- Disabling stage1 isolation. Booting with no-stage1 to test whether stage-copy overhead was a real cost. The headless harness exited 137 before JCPERF2 could record anything; the test was structurally inconclusive. Kept staging on.
- Partial tail reads when a staged frame straddles the window end. Sounded right on paper. In practice, smaller tail reads multiply CD transaction count and due_misses rises faster than the byte savings help. Rejected.
- Compose-before-VSync sequencing. Move the FG2 RAM composition before the VBlank wait so CPU work overlaps with previous-frame scanout, then upload after VBlank. Reduced prefetch_overrun_vb but stole held-prefetch time elsewhere; total loop_vb regressed by 12.
- Held-loop no-slack wait skip. Looked like a clean one-VBlank overshoot fix. Regressed loop, blocking, and refill metrics simultaneously; the skipped wait was load-bearing.
- Async stream-window refill. Naive async polling regressed blocking_vb badly. The CD subsystem has implicit ownership rules the synchronous path was respecting; the async path violated them. Rejected without a first-class CD-state ownership model.
- -O3 on hot translation units. Less prepared RAM work in some scenes, but worse loop/blocking/refill timing overall. The optimization changed code shape enough that CD scheduling phase shifted unfavorably. Kept -O2.
- Holiday overlap restamping. Seed holiday decoration into the clean backdrop and only restamp it when the current FG2 frame overlaps. Logically sound, but the active fishing1 frames overlap the Christmas decoration enough that this didn't reduce dirty work. Pure no-op, rejected.
- vprintf inline diagnostics. Adding a CD-read histogram inline with JCPERF regressed timing even with detail-gating. The act of having the code present changed binary shape enough to move scheduling phase. Reverted; histograms now live in post-processing.
- FG2 sound-event table in the metadata prefix. Setup reads improved, but moving the table ahead of the payload shifted every payload by 36 bytes and badly worsened active CD phase. The pack layout is more sensitive to byte offsets than is comfortable.
The recurring lesson: changes that look like clean wins on paper often
shift CD scheduling phase in ways that are not visible until the full
scene runs. The headless gate is what catches this; experiments that
regress loop_vb or blocking_vb against a baseline get rejected
even when they “obviously” should have helped.
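A toy cost model makes the stream-window knee concrete. The constants below are assumptions (2x CD throughput spread evenly across 60 VBlanks, about 5 KB per VBlank), not measurements from the build:

```c
#include <assert.h>

/* Toy model of the stream-window knee: illustrative constants only. */

#define BYTES_PER_VB 5120   /* ~300 KB/s at 2x / 60 VBlanks */

/* VBlanks a full-window refill occupies, rounded up. */
int refill_cost_vb(int window_bytes)
{
    return (window_bytes + BYTES_PER_VB - 1) / BYTES_PER_VB;
}

/* A refill overruns when its cost exceeds the held slack available.
 * Larger windows refill less often but overrun more easily; smaller
 * windows refill cheaply but starve due frames. */
int refill_overruns(int window_bytes, int held_vb)
{
    return refill_cost_vb(window_bytes) > held_vb;
}
```

Under this model a 20 KB refill costs 4 VBlanks and fits a 5-VBlank hold, while a 64 KB refill costs 13 and cannot hide; the real knee is set by measured scene timing, not this arithmetic.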
Experiments that did
A condensed list of changes that survived and are in the runtime today. They cluster into a few themes.
Foreground prefetch and stream window:
- Stage1 staging buffer for the next FG2 entry, prefetched during held VBlanks.
- Stream window default of 20 KB, reduced from the earlier 32 KB after the post-pause-merge sweep showed it as the local minimum.
- 3-VBlank refill guard, raised from the earlier 2/1 thresholds after smaller guards caused due-frame starvation.
- Forward-extend stream window when a straddling entry is detected: preserve the resident suffix and append-read only the missing aligned tail. Replaces overlapping full-window refills.
- Stage-copy fallthrough at 5 VBlanks: after a zero-VBlank stage copy from the resident window, immediately prefetch the following window if at least 5 held VBlanks remain. Converts idle held time into hidden CD work.
- Tight-slack direct staging up to 8 KB for immediate payloads when the window refill would otherwise be skipped.
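The forward-extend refill can be sketched as a sector-alignment computation. This is an illustration under assumed 2048-byte CD sectors with invented names; the real window bookkeeping is more involved:

```c
#include <assert.h>

#define SECTOR 2048   /* assumed CD-ROM data sector size */

/* Bytes to append-read so an entry ending at entry_end becomes
 * resident, given the window currently ends at win_end. The resident
 * suffix is preserved; only the missing tail is read, rounded up to
 * whole sectors. */
long tail_append_bytes(long win_end, long entry_end)
{
    long missing = entry_end - win_end;
    if (missing <= 0)
        return 0;                             /* already resident */
    return ((missing + SECTOR - 1) / SECTOR) * SECTOR;
}
```

Compared with a full overlapping refill, the append read scales with the straddle, not the window size.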
Compositor:
- PAL4 opaque-span compositor — FG2 PAL4 spans contain only visible pixels, so the per-pixel transparent-index branch was removable.
- Tile-local PAL4 fast path — split each span by destination tile once instead of per-pixel.
- Per-tile PAL4 row dirty marking — track which rows of which tiles changed, not just which tiles.
- Base-diff FG2 pack format — the active path requires base-diff packs, which makes RAM tile compositing the only render path and lets grBeginFrame() / ClearOTagR() skip when nothing's queued.
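A minimal sketch of the opaque-span idea, assuming a packed low-nibble-first PAL4 layout; the real FG2 span format is not reproduced here:

```c
#include <assert.h>
#include <stdint.h>

/* Because spans are pre-split to contain only visible pixels, every
 * 4-bit index is written unconditionally -- no per-pixel
 * transparent-index branch. Layout (low nibble first) is assumed. */
void compose_pal4_span(uint8_t *dst, const uint8_t *src, int count)
{
    for (int i = 0; i < count; i++) {
        uint8_t byte = src[i >> 1];
        dst[i] = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
    }
}
```

Removing the branch matters on a 33 MHz in-order core: the inner loop becomes a straight unpack-and-store with no mispredictable test per pixel.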
Dirty-rect bookkeeping:
- X-aware clean-rect restore — track dirty X extents per tile so RAM clean-background restore only touches the changed region.
- Vertical dirty-row upload bands with an 11-row gap merge — collapses adjacent uploads into wider rectangles.
- Long-hold host-deadline catch-up — a small render bookkeeping adjustment that traded seven extra speculative restore/compose calls for five fewer loop VBlanks.
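The band merge can be sketched as a single pass over sorted dirty rows, using the 11-row gap from the text; array shapes and names are illustrative:

```c
#include <assert.h>

#define GAP 11   /* gap threshold from the promoted experiment */

/* Merge sorted dirty row indices into [start, end] upload bands;
 * rows closer than GAP join the current band. Returns band count. */
int merge_dirty_rows(const int *rows, int n, int bands[][2])
{
    if (n == 0)
        return 0;
    int count = 0;
    bands[0][0] = bands[0][1] = rows[0];
    for (int i = 1; i < n; i++) {
        if (rows[i] - bands[count][1] <= GAP) {
            bands[count][1] = rows[i];        /* extend current band */
        } else {
            count++;
            bands[count][0] = bands[count][1] = rows[i];
        }
    }
    return count + 1;
}
```

The trade is a few redundant clean rows uploaded inside a band against far fewer, wider VRAM upload rectangles.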
Code shape and link:
- -ffunction-sections -fdata-sections plus --gc-sections for the PS1 link. The legacy ADS / TTM / FG1 / FOC runtime paths are still in the source tree but get stripped at link time.
- Removal of the foreground visual telemetry hot-path body, the legacy foreground diagnostic gate, the unused foreground “ever” diagnostics, the unused ADS foreground start hook, the obsolete FGPILOT ADS dispatch, the unused foreground status accessors, and the dead foreground requested-mode state.
Diagnostic gating:
- Pad / SPI diagnostics gated default-off. The pause-menu work introduced always-on JCPAD / JCSPI sampling; a strict-gate red-team pass showed the diagnostics were costing 52 VBlanks of loop time. Default-off recovered that; the pad-diag / pad-debug boot tokens still enable them on demand.
The cumulative effect is visible in the current accepted baseline:
fishing1 high-tide playback at loop_vb=1068 against a target of
target_vb=1074. The original headless perf-loop baseline was
loop_vb=1426, so the fishing1 canary is down 358 VBlanks
(a 25.11% loop reduction).
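That canary arithmetic checks out as stated; restated in integer math, rounded to hundredths of a percent (helper names are illustrative):

```c
#include <assert.h>

int loop_vb_saved(int before, int after)
{
    return before - after;
}

/* Reduction in hundredths of a percent, rounded, avoiding floats
 * (the real report is generated host-side, not on the console). */
int reduction_pct_x100(int before, int after)
{
    return (int)(((long)(before - after) * 10000 + before / 2) / before);
}
```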
Where it sits at v0.8.12-ps1
The current accepted fishing1 high-tide run, captured in the perf log:
```
policy = stage1_window
buf = 137048
hits = 155
due_misses = 0
blocking_vb = 2
prefetch.overrun_vb = 2
loop_vb = 1068
overrun_vb = 0
target_vb = 1074
restore_bytes = 251,144
upload_bytes = 10,646,400
dirty_rows = 16,635
upload_rects = 456
trip = 0 fallback = 0 frame_mismatch = 0
sound_late = 0 cd_fail = 0
```
That is 0.0% public over target, or 100.0% public target speed. The raw signed
CSV row is -0.4% / 100.4%. Across the 126 timing-bearing battle-card rows,
the public average is +0.3% over target / 99.7% target speed (0.2708%
exact public over target / 99.7337% exact public target speed); the raw
signed optimization matrix is -0.4963% / 100.5160%.
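The public-capping rule those numbers follow can be sketched in a few lines. This is an illustration in truncating integer tenths of a percent, not the matrix script's actual code:

```c
#include <assert.h>

/* Raw signed over-target in tenths of a percent (truncating). */
int raw_over_target_x10(int loop_vb, int target_vb)
{
    return (int)((long)(loop_vb - target_vb) * 1000 / target_vb);
}

/* Published rows floor at 0: faster-than-target displays as
 * 0.0% over / 100.0% target speed; the raw signed value stays
 * in the CSV. */
int public_over_target_x10(int loop_vb, int target_vb)
{
    int raw = raw_over_target_x10(loop_vb, target_vb);
    return raw < 0 ? 0 : raw;
}
```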
The latest WALKSTUF1 high scalar retained-read closure tested the remaining
shared append rows after the 427..443 CD-work baseline. Some candidates were
exact-flat, and the rows that saved reads paid the win back as visible-loop,
target, or refill debt. The next high-side attempt should use generated
deadline ownership, pack-side byte/phase reduction, or upload/restore work
removal rather than another hand-authored scalar append.
The latest WALKSTUF1 low scheduler sweep tested post-prepare window refill thresholds. Conservative slack did not fire; lower thresholds fired but regressed loop, blocking, refill, and due misses. That closes the cheap prepare-then-refill branch and leaves generated frame-deadline ownership or pack/upload work reduction as the next low-side path.
The latest VISITOR3 high promotion reuses the proven low compact frame143/144
cleanup payloads and repacks frames 141/140/142/143/144 plus sound events
inside the existing 277..293 setup segment. It improves high to 1063/1040,
overrun 23, blocking/read time 35, and reads/due 6/6, while pack
bytes/LBA/sectors and the PS-EXE bucket stay fixed. BUILDING2 low now keeps the
earlier 218..229 slack-8 row and adds v739 draw-tail trimming, improving to
1339/1317.
Scene Battle Card
As of 2026-05-14, all 126 scene/tide variants have current headless
perf measurements. The latest updated rows are stamped
building2-low-trimtails-v739,
visitor3-high-tail-pack-v629,
visitor5-high-rg30-46-v496,
visitor3-low-frame137-primegap-v510,
walkstuf1-low-rg78-91-v474,
walkstuf1-high-current-v458-refresh,
building2-low-rg218-229-slack8-v626,
building2-low-delta-v454,
visitor5-low-compact-rg23-47-v451,
walkstuf1-high-shared-dual-tail-v428,
walkstuf1-low-shared-dual-tail-v428,
building2-high-rg206-230-cap24-v441,
building6-window-slack4-v364,
johnny6-compact-fgp3-v354,
visitor3-low-tail-pack-only-v338,
visitor3-low-f128-resident-seg27-v302,
visitor3-high-f131-resident-alias121123-v299,
visitor3-low-alias-noop114117-v292,
visitor3-high-f140-segment-copy-v291,
visitor3-low-noop113-v249,
visitor3-low-noop114117-v248,
visitor3-high-f127-f130-resident-copy-v238,
visitor3-drop-unused-motion-dispatch-v197,
activity9-low-compact-fgp3-v174,
johnny1-compact-fgp3-v173,
walkstuf3-low-compact-fgp3-v171,
activity9-high-compact-fgp3-v167,
building6-compact-fgp3-v165,
walkstuf3-high-compact-fgp3-v163,
building2-low-restore-window-slack4-v160,
visitor5-high-current-v401,
building1-compact-fgp3-noautoprime-v157,
mary3-preserve-window-slack8-v149,
missing-scenes-current-v001,
visitor3-tail-trim-stageguard-v127,
graphics-composite-os-v111,
building2-low-group365-381-v110,
building2-high-group60-72-v109,
building2-high-restore-minus-current-v108,
visitor3-low-offscreen-exitright-v106,
visitor3-high-offscreen-drawclip-v105,
walkstuf1-high-primecap144-v089,
visitor3-low-readgroup-prune-v088,
building4-restore-minus-current-v087,
visitor3-restore-minus-current-v086,
visitor3-high-readgroup-prune-v084,
compact-u16-inline-v083,
fgp3v4-drawcompact-all-v082,
activity9-dead-readgroup-prune-v082,
read-group-selector-single-assign-v082,
visitor3-high-group138-162-slack4-v081,
walkstuf1-low-primecap160-v081,
johnny2-prefetch-relief-v081,
activity9-low-fgp3-cleanup-compact-v081,
activity9-current-v081-refresh,
building4-fgp3-cleanup-compact-window-v081,
building2-fgp3-cleanup-compact-v081,
visitor3-fgp3-cleanup-compact-v081,
mary2-prefetch-relief-v081,
mary2-fgp3-padded-v081,
johnny2-fgp3-padded-v081,
mary5-fgp3-padded-v081,
activity11-fgp3-padded-v081,
building5-fgp3-padded-v080,
walkstuf1-fgp2-setup-prime-v080,
visitor3-setup-prime-192k-v080,
visitor3-high-group170-186-v080-current,
activity9-lowgroup-v072c,
activity9-fgp3-v072c,
activity9-window-v072c,
activity4-fishing4-v072c-prefetch-relief,
activity1-v072c-current-refresh,
activity11-12-v072c-prefetch-relief,
stale-next-v072c-current-refresh,
mary1-v072c-prefetch-relief,
stale-layout-v072c-current-refresh,
activity9-v072c-prefetch-relief,
stale-pressure2-v072c-current-refresh,
johnny1-v072c-prefetch-relief,
stale-pressure-v072c-current-refresh,
activity10-johnny3-v072-prefetch-relief,
stale-zero2-v072b-current-refresh,
stale-zero-v072b-current-refresh,
stale-top-v072b-current-refresh,
visitor5-v072-prefetch-relief,
mismatch-top-v072-current-refresh,
stand-family-v072-current-refresh,
visitor4-v072-current-refresh,
stand1-v072-current-refresh,
visitor3-v072-prefetch-relief,
fishing5-v065-current-ledger-overlay,
compact-fgp3-v66-final-frame-hold,
compact-fgp3-v64-building2-group318-330,
compact-fgp3-v63-building2low-prime, and
indexed8-row-local-dirty-v1; other refreshed rows include
compact-fgp3-v62-fishing3low-group253-265,
compact-fgp3-v61-fishing3low-group163-175,
compact-fgp3-v60-visitor3high-group230-242,
compact-fgp3-v59-visitor3high-group72-84, indexed8-tile-local-compose-v1,
compact-fgp3-v58-activity9high-window20-table, compact-fgp3-v57-policy-table-refactor, and compact-fgp3-v49-walkstuf2-auto-prime through compact-fgp3-v29-smallprime, and the full-matrix baseline rows are stamped
compact-fgp3-v2-fullmatrix. 63 of 63 scenes have at least one routed
variant, and 63 scenes have both high- and low-tide variants routed. All 126
rows now carry active-loop timing; suzy1 needs the longer 12000-frame
matrix budget because its valid scene-end lands after the default 7200-frame
window. The latest matrix run is 2026-05-13T21:31:34; per-row freshness and stats version are shown on
the battle card. The values below are
public-capped over target / target speed (loop_vb/target_vb), with blk
and due called out when nonzero. Faster-than-target rows display
0.0% / 100.0%; their raw signed values remain in
docs/ps1/performance-scene-matrix.csv.
The complete matrix pass is compact-fgp3-v2-fullmatrix; accepted follow-up
rows now use visitor3-high-tail-pack-v629,
visitor5-high-rg30-46-v496,
visitor3-low-frame137-primegap-v510,
walkstuf1-low-rg78-91-v474,
walkstuf1-high-current-v458-refresh,
building2-low-trimtails-v739,
building2-low-rg218-229-slack8-v626,
building2-low-delta-v454,
visitor5-low-compact-rg23-47-v451,
walkstuf1-high-shared-dual-tail-v428,
walkstuf1-low-shared-dual-tail-v428,
building2-low-rg238-250-v445,
building2-high-rg206-230-cap24-v441,
building6-window-slack4-v364,
visitor3-high-f131-resident-alias121123-v299,
visitor3-low-tail-pack-only-v338,
visitor3-low-f128-resident-seg27-v302,
visitor3-low-alias-noop114117-v292,
visitor3-high-f140-segment-copy-v291,
visitor3-low-noop113-v249,
visitor3-low-noop114117-v248,
visitor3-high-f127-f130-resident-copy-v238,
visitor3-drop-unused-motion-dispatch-v197,
johnny1-compact-fgp3-v173,
walkstuf3-low-compact-fgp3-v171,
activity9-high-compact-fgp3-v167,
building6-compact-fgp3-v165,
walkstuf3-high-compact-fgp3-v163,
building2-low-restore-window-slack4-v160,
visitor5-high-current-v401,
building1-compact-fgp3-noautoprime-v157,
mary3-preserve-window-slack8-v149,
visitor3-tail-trim-stageguard-v127,
graphics-composite-os-v111,
building2-low-group365-381-v110,
building2-high-group60-72-v109,
building2-high-restore-minus-current-v108,
visitor3-low-offscreen-exitright-v106,
visitor3-high-offscreen-drawclip-v105,
walkstuf1-compact-fgp3-v141,
visitor3-low-readgroup-prune-v088,
building4-restore-minus-current-v087,
visitor3-restore-minus-current-v086,
visitor3-high-readgroup-prune-v084,
fgp3v4-drawcompact-all-v082,
compact-u16-inline-v083,
visitor3-fgp3-cleanup-compact-v081,
walkstuf1-low-primecap160-v081,
johnny2-prefetch-relief-v081,
mary2-prefetch-relief-v081,
mary2-fgp3-padded-v081,
johnny2-fgp3-padded-v081,
mary5-fgp3-padded-v081,
activity11-fgp3-padded-v081,
building5-fgp3-padded-v080,
walkstuf1-fgp2-setup-prime-v080,
visitor3-setup-prime-192k-v080,
visitor3-high-group170-186-v080-current,
activity9-lowgroup-v072c,
activity9-fgp3-v072c,
activity9-window-v072c,
johnny6-compact-fgp3-v354,
activity4-fishing4-v072c-prefetch-relief,
activity1-v072c-current-refresh,
activity11-12-v072c-prefetch-relief,
stale-next-v072c-current-refresh,
mary1-v072c-prefetch-relief,
stale-layout-v072c-current-refresh,
activity9-v072c-prefetch-relief,
stale-pressure2-v072c-current-refresh,
johnny1-v072c-prefetch-relief,
stale-pressure-v072c-current-refresh,
activity10-johnny3-v072-prefetch-relief,
stale-zero2-v072b-current-refresh,
stale-zero-v072b-current-refresh,
stale-top-v072b-current-refresh,
visitor5-v072-prefetch-relief,
mismatch-top-v072-current-refresh,
stand-family-v072-current-refresh,
visitor4-v072-current-refresh,
stand1-v072-current-refresh,
visitor3-v072-prefetch-relief,
compact-fgp3-v66-final-frame-hold,
fishing5-v065-current-ledger-overlay,
compact-fgp3-v64-building2-group318-330,
compact-fgp3-v63-building2low-prime, and
indexed8-row-local-dirty-v1; other refreshed rows include
compact-fgp3-v62-fishing3low-group253-265,
compact-fgp3-v61-fishing3low-group163-175,
compact-fgp3-v60-visitor3high-group230-242,
compact-fgp3-v59-visitor3high-group72-84, indexed8-tile-local-compose-v1,
compact-fgp3-v58-activity9high-window20-table, compact-fgp3-v57-policy-table-refactor, and compact-fgp3-v49-walkstuf2-auto-prime through compact-fgp3-v29-smallprime. Older padded-fgp3-v1 / compact-fgp3-v1
rows are historical only.
| Scene | High tide | Low tide |
|---|---|---|
| activity1 | 0.0% / 100.0% (2754/2764); blk 1 | 0.0% / 100.0% (2754/2765) |
| activity4 | 0.0% / 100.0% (1065/1066); blk 4 | 0.0% / 100.0% (1064/1068); blk 1 |
| activity5 | 0.0% / 100.0% (1730/1749); blk 2 | 0.0% / 100.0% (1731/1749); blk 2 |
| activity6 | +0.1% / 99.9% (912/911) | +0.1% / 99.9% (912/911) |
| activity7 | 0.0% / 100.0% (593/596) | 0.0% / 100.0% (594/596) |
| activity8 | 0.0% / 100.0% (898/904); blk 1 | 0.0% / 100.0% (899/904); blk 2 |
| activity9 | +1.0% / 99.0% (2082/2062); due 1; blk 24 | +0.7% / 99.3% (2075/2061); due 1; blk 17 |
| activity10 | 0.0% / 100.0% (1259/1259); due 1; blk 7 | 0.0% / 100.0% (1255/1256); due 2; blk 17 |
| activity11 | 0.0% / 100.0% (1715/1722); blk 2 | 0.0% / 100.0% (1717/1722); blk 4 |
| activity12 | 0.0% / 100.0% (1411/1412); blk 7 | 0.0% / 100.0% (1409/1411); due 1; blk 10 |
| building1 | +2.1% / 98.0% (794/778); blk 21 | +1.9% / 98.1% (794/779); blk 21 |
| building2 | +3.0% / 97.0% (1351/1311); due 7; blk 54 | +2.3% / 97.8% (1349/1319); due 17; blk 80 |
| building3 | 0.0% / 100.0% (5460/5465) | 0.0% / 100.0% (5460/5465) |
| building4 | +1.0% / 99.0% (2844/2816); due 1; blk 37 | +1.3% / 98.7% (2853/2816); due 1; blk 40 |
| building5 | 0.0% / 100.0% (3343/3348); blk 5 | 0.0% / 100.0% (3345/3347); blk 8 |
| building6 | +1.0% / 99.0% (2482/2457); blk 25 | +1.2% / 98.8% (2485/2456); blk 28 |
| building7 | 0.0% / 100.0% (3132/3133); blk 9 | 0.0% / 100.0% (3130/3133); blk 7 |
| fishing1 | 0.0% / 100.0% (1068/1074); blk 2 | 0.0% / 100.0% (1067/1074); blk 1 |
| fishing2 | 0.0% / 100.0% (1761/1763); blk 6 | 0.0% / 100.0% (1759/1765); blk 3 |
| fishing3 | +0.6% / 99.4% (1962/1950); due 1; blk 17 | +0.1% / 99.9% (1957/1955); blk 9 |
| fishing4 | 0.0% / 100.0% (835/842); blk 2 | 0.0% / 100.0% (834/843) |
| fishing5 | 0.0% / 100.0% (885/890) | 0.0% / 100.0% (885/890) |
| fishing6 | 0.0% / 100.0% (744/753) | 0.0% / 100.0% (744/753) |
| fishing7 | 0.0% / 100.0% (715/725) | 0.0% / 100.0% (715/725) |
| fishing8 | 0.0% / 100.0% (1243/1253) | 0.0% / 100.0% (1243/1253) |
| johnny1 | +1.4% / 98.6% (1973/1945); blk 25 | +1.4% / 98.6% (1973/1945); blk 25 |
| johnny2 | 0.0% / 100.0% (1741/1751) | 0.0% / 100.0% (1741/1751) |
| johnny3 | 0.0% / 100.0% (1158/1161); due 1; blk 10 | 0.0% / 100.0% (1157/1166) |
| johnny4 | 0.0% / 100.0% (1204/1214) | 0.0% / 100.0% (1204/1214) |
| johnny5 | 0.0% / 100.0% (811/820) | 0.0% / 100.0% (810/820) |
| johnny6 | +1.0% / 99.0% (2829/2802); blk 24 | +1.0% / 99.0% (2830/2802); blk 25 |
| mary1 | +0.8% / 99.2% (4867/4830); due 2; blk 47 | +0.4% / 99.6% (4860/4840); due 1; blk 31 |
| mary2 | 0.0% / 100.0% (2241/2248); blk 2 | 0.0% / 100.0% (2242/2250); blk 2 |
| mary3 | +0.1% / 99.9% (2296/2294); due 13; blk 53 | +0.1% / 99.9% (2297/2295); due 13; blk 51 |
| mary4 | 0.0% / 100.0% (1968/2016); due 3; blk 28 | 0.0% / 100.0% (1966/2019); due 3; blk 24 |
| mary5 | 0.0% / 100.0% (1581/1586); due 1; blk 5 | 0.0% / 100.0% (1581/1584); due 1; blk 6 |
| miscgag1 | 0.0% / 100.0% (953/961) | 0.0% / 100.0% (953/961) |
| miscgag2 | 0.0% / 100.0% (1352/1356) | 0.0% / 100.0% (1352/1356) |
| stand1 | 0.0% / 100.0% (194/202) | 0.0% / 100.0% (194/202) |
| stand2 | 0.0% / 100.0% (480/490) | 0.0% / 100.0% (480/490) |
| stand3 | 0.0% / 100.0% (547/557) | 0.0% / 100.0% (547/557) |
| stand4 | 0.0% / 100.0% (1202/1220) | 0.0% / 100.0% (1203/1218); blk 3 |
| stand5 | 0.0% / 100.0% (1442/1460) | 0.0% / 100.0% (1442/1460) |
| stand6 | 0.0% / 100.0% (1346/1364) | 0.0% / 100.0% (1346/1364) |
| stand7 | 0.0% / 100.0% (520/538) | 0.0% / 100.0% (520/538) |
| stand8 | 0.0% / 100.0% (483/499); blk 2 | 0.0% / 100.0% (483/499); blk 2 |
| stand9 | 0.0% / 100.0% (520/538) | 0.0% / 100.0% (522/538) |
| stand10 | 0.0% / 100.0% (528/538) | 0.0% / 100.0% (528/538) |
| stand11 | 0.0% / 100.0% (528/538) | 0.0% / 100.0% (528/538) |
| stand12 | 0.0% / 100.0% (1450/1459); blk 1 | 0.0% / 100.0% (1450/1460) |
| stand15 | 0.0% / 100.0% (444/452) | 0.0% / 100.0% (444/452) |
| stand16 | +0.2% / 99.8% (473/472) | +0.2% / 99.8% (473/472) |
| suzy1 | no active loop | no active loop |
| suzy2 | no active loop | no active loop |
| visitor1 | 0.0% / 100.0% (672/677) | 0.0% / 100.0% (672/677) |
| visitor3 | +2.2% / 97.8% (1063/1040); due 6; blk 35 | +2.1% / 97.9% (1062/1040); due 7; blk 42 |
| visitor4 | 0.0% / 100.0% (424/428) | 0.0% / 100.0% (424/428) |
| visitor5 | +1.1% / 98.9% (1104/1092); blk 11 | +2.0% / 98.0% (1112/1090); blk 12 |
| visitor6 | 0.0% / 100.0% (2043/2047); blk 1 | 0.0% / 100.0% (2043/2047); blk 1 |
| visitor7 | 0.0% / 100.0% (1619/1625) | 0.0% / 100.0% (1619/1625) |
| walkstuf1 | +3.4% / 96.8% (1480/1432); due 16; blk 83 | +3.7% / 96.4% (1484/1431); due 12; blk 72 |
| walkstuf2 | 0.0% / 100.0% (451/461) | 0.0% / 100.0% (451/461) |
| walkstuf3 | +0.9% / 99.1% (2310/2290); due 6; blk 47 | +1.2% / 98.8% (2321/2293); due 5; blk 41 |
Detail-tier attribution for the canary currently points at render and restore pressure rather than CD stalls:
```
sched.wait = 722
sched.present = 99
sched.cd_stage = 137
sched.cd_window = 19
gfx.restore_bytes = 251,144
gfx.upload_bytes = 8,643,840
```
The fishing1 canary remains at the public 100.0% cap with raw signed
headroom, but the full battle card still has CD-heavy scenes (visitor3,
building2, building6, walkstuf1, and building4). The clean-pressure relief rows prove scene-local
CD policy can recover large due-miss collapses, while the refreshed stale rows
prove current-pack baselines must be cleared before ranking fixed overhead.
Next plausible wins, in priority order:
- Generated read grouping or setup/data-shape work. WALKSTUF1 low/high are now the largest gaps at +53/+48 VBlanks after the latest high 444..456 same-speed CD-work reduction. BUILDING2 high/low (+38/+31) and VISITOR3 low/high (+32/+31) are the next tight rows after the VISITOR3 motion-copy, setup-segment, setup-prime, guarded second-segment, resident-copy, and low no-op residual passes; its local C read-table rows are exhausted, so the next CD-shape pass needs generated scheduler ownership, selective preprocessing, or further pack data-shape work rather than hand-authored ranges. The default selective upload-ready plan is footprint-closed as a same-layout append because 2111224 bytes of payload plus rect metadata exceed the current 970076 bytes of VISITOR3 high-pack slack. The budgeted analyzer target keeps this same-footprint lane alive with 78 selected frames, 968904 payload+rect bytes, and 4232112 modeled upload bytes saved before runtime implementation. The empty-hold no-op recast is closed because the current packs expose 0 zero-visual-work entries. The packed-draw metadata probes prove a real VISITOR3 byte-reduction signal: the v4 draw-tail trim plus VISITOR3 stage guard is now promoted, while the v7 runtime decoder shape remains rejected because it perturbs the BUILDING2 and BUILDING4 canaries. A layout-neutral packed-delta retry keeps LBAs and the PS-EXE bucket fixed, but its function-scoped PAL4 span -Os trade regresses VISITOR3 high while improving low tide, so that C-side shape is closed too. An entry-origin recentering size gate also saves 0 bytes on current VISITOR3 high/low FGP3/v4 payloads, so that zero-runtime-code coordinate-shift lane is closed before emulator time.
- FG2-specific present pipeline with explicit slack budgeting. Earlier present-prep experiments regressed because they stole CD prefetch slack; the next scheduler needs separate render-prep and CD-prefetch budgets.
- X-aware dirty upload and rect-pressure control. The fishing1 canary still restores 251 KB and uploads 8.5 MB; larger scenes carry more upload pressure.
- Specialized indexed8 and PAL4 compositors. The pack-format wins reduce bytes, but dense scenes still pay per-span/per-pixel runtime costs.
- Generated scheduler ownership for the remaining under-99 rows. MARY3 is
now green after the guarded prefetch-preserve pass, and BUILDING6 moved to
the bottom of the yellow band after compact-pack promotion. The remaining
hard rows are VISITOR3 low/high, WALKSTUF1 low/high, BUILDING2 high/low,
VISITOR5 low/high, JOHNNY1 high/low, BUILDING4 low, and BUILDING6 high/low,
where hand-authored read groups and scalar window changes have
repeatedly shifted cadence instead of safely removing work. The latest
WALKSTUF1 low v747/v749/v750/v751/v753/v755/v756/v757/v759/v762/v763/v766/v767/v769/v770/v771/v772/v773/v774/v775/v776/v777/v779/v780/v781/v782/v783/v784/v785/v786/v787/v788/v789/v790/v791 pass keeps the row exact-flat while
shrinking frames
51, 49, 47, 61, 62, 58, 45, 37, 35, 43, 41, 57, 33, 67, 68, 69, 32, 133, 5, 141, 70, 30, 6, 71, 72, 142, 73, 131, 74, 19, 28, 138, 145, 75, and 76 in-place (879801 -> 801103 active payload), and v760 restores the bounded CD fast-poll runtime to 60/272 read time, so W1-low now has a safe no-shift payload lane but still needs a sector/read timing conversion. The BUILDING4 low v387 pass closes the local 178..202 append group and 40/48 KiB stream-window growth: reads fell, but visible blocking and loop overrun rose sharply. The newer v746 in-place frame 291 shrink proves no-shift payload reduction is safe, cutting active payload 855284 -> 849109 while staying exact-flat, so that row now needs sector-changing no-shift byte reduction, generated deadline ownership, or selective preprocessing rather than larger raw fresh fills.
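The present-pipeline item above asks for separate render-prep and CD-prefetch budgets so present work can never steal prefetch slack again. A minimal sketch of that two-lane accounting — the names, tick units, and budget values are all hypothetical, not the port's actual scheduler:

```c
#include <stdint.h>

/* Hypothetical per-held-VBlank budgets, in ticks of whatever cheap
 * counter the runtime reads. Values are illustrative only. */
#define RENDER_PREP_BUDGET 9000
#define CD_PREFETCH_BUDGET 5000

typedef struct {
    int32_t render_prep;  /* ticks left for tile composition this VBlank */
    int32_t cd_prefetch;  /* ticks left for CD read scheduling */
} vblank_budget;

/* Reset both lanes at the top of every held VBlank. */
static void budget_reset(vblank_budget *b) {
    b->render_prep = RENDER_PREP_BUDGET;
    b->cd_prefetch = CD_PREFETCH_BUDGET;
}

/* Charge one lane; returns 0 when that lane is exhausted, so the
 * caller stops issuing work instead of stretching the active frame. */
static int budget_charge(int32_t *lane, int32_t cost) {
    if (*lane < cost) return 0;
    *lane -= cost;
    return 1;
}
```

The point of the split is that exhausting `render_prep` leaves `cd_prefetch` untouched: prefetch slack stays spendable even when present-prep runs long, which is exactly the failure mode the earlier experiments hit.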
The author considers the current build comfortable for the validated scenes, not yet headroom-clean. The canary bottleneck is no longer raw CD stall; the matrix bottleneck is uneven per-scene payload/read shape plus render/restore pressure.
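One concrete form of that render/restore pressure trade is choosing how aggressively to coalesce dirty rows into upload rects: fewer rects mean less per-rect overhead, but wider gaps mean more redundant bytes re-uploaded. A small illustrative helper, assuming sorted dirty-row input — the names and the band structure are hypothetical, not the runtime's actual dirty-upload code:

```c
#include <stdint.h>

/* One VRAM upload rectangle: a full-width band of dirty rows. */
typedef struct { int16_t y0, y1; } dirty_band;

/* Coalesce a sorted list of dirty row indices into bands, merging rows
 * whose gap is <= max_gap. A larger gap budget yields fewer rects
 * (less per-rect overhead) at the cost of uploading clean rows inside
 * the gaps. Returns the number of bands written to out. */
static int coalesce_rows(const int16_t *rows, int n, int16_t max_gap,
                         dirty_band *out) {
    int bands = 0;
    int i;
    if (n == 0) return 0;
    out[0].y0 = rows[0];
    out[0].y1 = rows[0];
    for (i = 1; i < n; i++) {
        if (rows[i] - out[bands].y1 <= max_gap + 1) {
            out[bands].y1 = rows[i];   /* extend the current band */
        } else {
            bands++;                   /* start a new band */
            out[bands].y0 = rows[i];
            out[bands].y1 = rows[i];
        }
    }
    return bands + 1;
}
```

Tuning `max_gap` per scene is one way the "rect-pressure control" item above could trade upload bytes against rect count.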
Non-goals
A few things the perf work explicitly does not chase, with reasons:
- Frame dropping. Violates pixel-perfect playback. The acceptance bar requires every captured entry to render on its captured beat.
- Timing compression before throughput work. The timing-bearing matrix public average is now +0.2708% over target / 99.7337% target speed, with several worse CD-bound outliers; compressing the timing files would expose the same throughput bottleneck without fixing it.
- Reintroducing FG1 / ADS / TTM runtime paths. Those are retired from the active public path. The PS1 executable links only the scene-playback runtime plus the minimal background / audio / input / CD layers it needs.
- Fixed island assumptions. The runtime must randomly place the island per scene, so all optimizations must preserve scene-relative FG2 placement.
- Direct framebuffer or progressive-mode experiments as first moves. Prior history says these were unstable. Exhaust stable scene playback first.
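The fixed-island non-goal implies a concrete coding rule: pack data stays island-relative, and the absolute screen position is resolved once per scene at submit time, so no preprocessing pass ever bakes an absolute coordinate into payload bytes. A hedged sketch of that split — the RNG, slack ranges, and names are all illustrative, not the runtime's actual API:

```c
#include <stdint.h>

/* Per-scene island origin, chosen once when the scene loads. */
typedef struct { int16_t x, y; } island_origin;

static island_origin pick_origin(uint32_t seed) {
    island_origin o;
    /* Cheap integer LCG: PS1-friendly, no FPU, no divide in the hot path. */
    seed = seed * 1103515245u + 12345u;
    o.x = (int16_t)((seed >> 16) % 96);  /* horizontal slack, pixels */
    seed = seed * 1103515245u + 12345u;
    o.y = (int16_t)((seed >> 16) % 32);  /* vertical slack, pixels */
    return o;
}

/* Every draw resolves island-relative rects at submit time, so the
 * same preprocessed pack bytes work at any island placement. */
static void resolve_rect(const island_origin *o,
                         int16_t rx, int16_t ry,
                         int16_t *sx, int16_t *sy) {
    *sx = (int16_t)(rx + o->x);
    *sy = (int16_t)(ry + o->y);
}
```

Keeping the addition at submit time is what lets size-gate experiments like the entry-origin recentering probe be evaluated as pure data transforms.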
Related pages
- Performance battle card — the live timing matrix this reference manual describes the columns of. 126 scene/tide variants, sortable, color-coded.
- From 87 to 99.5: the post-validation performance loop — the retrospective on the optimization arc, including which experiments landed and which got rejected.
- v0.8.1: what the soak found that the matrix didn’t — the soak-loop war story; matrix and soak are not redundant.
- The 24/7 build farm — the magazine treatment of the parallel Docker machinery that iterates the perf experiments this reference describes the output of. Same JCPERF / JCPERF2 records, but framed as methodology for keeping a 126-row matrix moving.
- Hardware — what the optimizations are running against.
- Build & toolchain — how the PS1 binary is produced.
- Build infrastructure — the wrapper around the perf iterate script.
- Audio pipeline — the SPU side, which has its own scheduling concerns.
- Story-loop walks — the walk subsystem’s persistent clean buffer is part of the same pressure-accounting envelope the matrix above measures; the v0.8.0 clean-rect retry path and v0.8.1 wave-band/split-rect pressure changes are documented there.
- Vision-classifier work — the validation layer that runs against perf-experiment outputs.
- Devlog — perf work shows up day-by-day there.
View source on GitHub
The body cites a dozen files; this section collects them. Grouped by purpose — plan and ledgers, runtime, iterate gate, the scene matrix, the compiler-flag and preprocessing sweeps, the read-plan rollup, and the regtest runner.
- docs/ps1/performance-optimization-plan.md · docs/ps1/performance-experiment-log.md — the optimization plan and the 600+ experiment ledger.
- src/ps1_perf.c · src/foreground_pilot.c — runtime: the JCPERF/JCPERF2 instrumentation and the FG2 dispatcher whose per-frame budget the matrix measures.
- scripts/ps1-perf-iterate.sh — the experiment gate every probe goes through (run → compare → promote-or-reject).
- docs/ps1/performance-scene-matrix.csv — the full scene/tide battle card; rendered as the live sortable matrix at /perf/.
- docs/ps1/performance-o2-audit.md · docs/ps1/performance-o2-audit.csv — current compiler-flag sweep, regenerated from build-ps1/compile_commands.json + build-ps1/jcreborn.map before each -O2 probe.
- docs/ps1/performance-preprocess-opportunities.md · docs/ps1/performance-preprocess-opportunities.csv · scripts/analyze-fg2-preprocess-plans.py · docs/ps1/performance-preprocess-visitor3-hotspots.csv — pack-time graphics preprocessing target sheet, the FGP2/FGP3 per-pack analyzer, and the VISITOR3 cap-hit / saving-heavy frame sheet.
- docs/ps1/performance-read-candidate-matrix.md · docs/ps1/performance-read-candidate-matrix.csv — foreground read-plan candidates classified by append-start ownership, grouped-read capacity, and visible-CD cost class.
- scripts/run-regtest.sh — headless DuckStation runner that captures PNGs and ingests TTY records into per-run summary JSON files.