You are working in the jc_reborn repository on the PS1 runtime and test harness.

Your mission is not to produce plausible-looking runs. It is to get 63 PS1 story scenes actually validated against the canonical Linux/reference output, with trustworthy comparison artifacts a human can review.

Primary Goal

Reach a state where all 63 target PS1 story scenes can be:

  1. booted deterministically
  2. captured reproducibly
  3. compared against the canonical reference set
  4. reviewed visually in generated HTML
  5. marked validated only when they actually match scene content, not just because they share black frames, title frames, or empty/ocean fallback frames

Do not count:

  • black-screen agreement
  • title-screen agreement
  • empty-ocean agreement
  • “close enough” when no scene content launched

The standard is validated scene content.
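
One way to enforce that standard mechanically is to triage every captured frame before it is allowed to count toward a match. A minimal sketch in C, assuming frames are available as decoded RGB24 buffers; the function name, thresholds, and the ocean heuristic are invented for illustration, not existing harness code:

    /* Hypothetical frame triage -- thresholds and the ocean heuristic are
     * invented for illustration, not existing harness code. */
    #include <stddef.h>
    #include <stdint.h>

    typedef enum { FRAME_BLACK, FRAME_OCEAN, FRAME_CONTENT } FrameClass;

    /* Classify one decoded RGB24 frame so black/empty-ocean frames can be
     * excluded from matching instead of being scored as agreement. */
    static FrameClass classify_frame(const uint8_t *rgb, size_t npixels)
    {
        uint64_t lum = 0, blue = 0, redgreen = 0;

        if (npixels == 0)
            return FRAME_BLACK;

        for (size_t i = 0; i < npixels; i++) {
            uint8_t r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
            lum      += (2u * r + 5u * g + b) / 8u;  /* cheap integer luminance */
            blue     += b;
            redgreen += (r + g) / 2u;
        }
        if (lum / npixels < 8)                        /* near-uniform dark frame */
            return FRAME_BLACK;
        if (blue / npixels > redgreen / npixels + 32) /* blue-dominated: empty ocean */
            return FRAME_OCEAN;
        return FRAME_CONTENT;
    }

Title frames are less separable by pixel statistics alone; they are better rejected by comparing against a small known set of title and black reference frames. Either way the rule is the same: a frame must be positively classified as content before it may contribute to validation.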

Source Of Truth

The baseline source of truth is already present in this repo:

  • reference root:
    • repo:/regtest-references

This contains the canonical Linux/reference outputs for the scene set. Treat that as authoritative.

Important related files:

  • reference scene list:
    • repo:/config/ps1/regtest-scenes.txt
  • single-scene runner:
    • repo:/scripts/regtest-scene.sh
  • batch compare runner:
    • repo:/scripts/compare-reference-batch.sh
  • compare renderer:
    • repo:/scripts/render-compare-timeline.py
  • single-run review renderer:
    • repo:/scripts/render-regtest-run.py

Current Expectations

There is still work to do on both:

  • the harness
  • the actual PS1 scene runs

Assume the current harness is useful but not yet fully trustworthy.

In particular, there is an active suspicion that the harness may be stopping too early relative to the real PS1 boot/title lead-in.

Working hypothesis to test:

  • scene timing windows may need to be increased significantly
  • a good first hypothesis is that effective scene timing may need roughly +35 seconds of extra boot allowance before valid scene comparison begins

Do not assume that hypothesis is true. Test it.
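
To make that test concrete rather than impressionistic, it helps to express the window as explicit numbers in one place. A minimal sketch, with invented struct and field names:

    /* Hypothetical timing budget -- the struct and field names are invented. */
    typedef struct {
        int capture_fps;      /* harness capture rate */
        int boot_grace_sec;   /* boot/title lead-in before comparison may begin */
        int scene_window_sec; /* length of the window compared against reference */
    } SceneTiming;

    /* First frame index eligible for reference comparison. */
    static int compare_start_frame(const SceneTiming *t)
    {
        return t->capture_fps * t->boot_grace_sec;
    }

    /* Total frames the harness must capture to cover the whole window. */
    static int required_frames(const SceneTiming *t)
    {
        return t->capture_fps * (t->boot_grace_sec + t->scene_window_sec);
    }

At a 60 fps capture rate, the hypothesized +35 seconds moves the comparison start by 60 * 35 = 2100 frames, and the total capture budget must grow by the same amount or the tail of the scene window is silently truncated. A sweep over boot_grace_sec values (current, +15 s, +35 s, +50 s) on one scene with known content should confirm or kill the hypothesis quickly.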

Comparison Requirements

For every serious run, produce artifacts that a human can inspect.

At minimum, each scene run should leave:

  1. result.json
  2. review.html
  3. raw captured frames
  4. if reference-compared, compare.json
  5. if reference-compared, compare.html

The human must be able to open the HTML and confirm whether the run contains real scene content.
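
That invariant is cheap to enforce mechanically before a run is allowed to count. A minimal sketch, assuming a flat per-scene output directory; the layout and function name are assumptions, not the harness's confirmed structure:

    /* Hypothetical artifact check -- the directory layout is an assumption. */
    #include <stdio.h>
    #include <sys/stat.h>

    /* Return 0 when every required artifact exists under run_dir, -1 otherwise.
     * "frames" is the raw captured frame directory. */
    static int check_run_artifacts(const char *run_dir, int reference_compared)
    {
        static const char *required[] = {
            "result.json", "review.html", "frames",  /* always */
            "compare.json", "compare.html"           /* reference-compared only */
        };
        int n = reference_compared ? 5 : 3;
        char path[1024];
        struct stat st;

        for (int i = 0; i < n; i++) {
            snprintf(path, sizeof path, "%s/%s", run_dir, required[i]);
            if (stat(path, &st) != 0) {
                fprintf(stderr, "missing artifact: %s\n", path);
                return -1;
            }
        }
        return 0;
    }

Wiring a check like this into the batch runner turns a missing compare.html into a hard failure instead of a silently unreviewable run.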

How To Compare Correctly

Single PS1 run review

Use:

  • repo:/scripts/regtest-scene.sh

This now emits:

  • result.json
  • review.html

The review page is for raw PS1 inspection only. It is not proof of correctness by itself.

PS1 vs reference comparison

Use:

  • repo:/scripts/compare-reference-batch.sh

This should be the main path for actual validation sweeps against the canonical reference set.

Expected comparison artifacts per scene:

  • compare.json
  • compare.html

If compare alignment fails, that is not a pass. It means the harness or the run still needs work.
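
For context on what alignment should mean here: it is a search for a temporal offset at which captured and reference frames genuinely agree, and the search must be allowed to answer "no acceptable offset exists". A minimal sketch, assuming a per-frame difference score is computed elsewhere; frame_diff, the window model, and the threshold are stand-ins:

    /* Hypothetical offset search -- frame_diff(), the window model, and the
     * error threshold are stand-ins, not the harness's real interface. */
    #include <stddef.h>

    #define ALIGN_FAIL (-1L)

    /* Mean per-frame difference between a run frame and a reference frame,
     * assumed to be computed elsewhere (e.g. mean absolute pixel error). */
    extern double frame_diff(size_t run_frame, size_t ref_frame);

    /* Find the run-to-reference offset with the lowest mean difference.
     * Returns ALIGN_FAIL when even the best offset exceeds max_err,
     * instead of fabricating confidence from the least-bad match. */
    static long best_offset(size_t n_run, size_t n_ref, size_t window,
                            double max_err)
    {
        long best = ALIGN_FAIL;
        double best_err = max_err;

        if (window == 0 || window > n_ref || window > n_run)
            return ALIGN_FAIL;

        for (size_t off = 0; off + window <= n_run; off++) {
            double err = 0.0;
            for (size_t i = 0; i < window; i++)
                err += frame_diff(off + i, i);
            err /= (double)window;
            if (err < best_err) {
                best_err = err;
                best = (long)off;
            }
        }
        return best;
    }

The important property is the failure path: a best offset whose error still exceeds max_err is reported as alignment failure, not as a pass. Combined with the frame triage sketched earlier, only frames classified as content should contribute to err, so shared black or title frames cannot manufacture a low error.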

Validation rule

A scene is only validated when:

  • the PS1 run clearly reaches the intended scene
  • the scene content aligns against the reference in a defensible way
  • the generated HTML artifacts support that claim
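
Spelled out as a predicate, the rule is a strict conjunction over independent evidence. A minimal sketch with invented names; each flag is assumed to be established by actual artifacts, not guessed:

    /* Hypothetical verdict -- names are invented; each flag must come from
     * real evidence (frames, alignment, HTML review), never from defaults. */
    typedef struct {
        int reached_scene;     /* the run clearly reached the intended scene */
        int aligned_to_ref;    /* content aligned defensibly against reference */
        int artifacts_support; /* the generated HTML artifacts support the claim */
    } SceneEvidence;

    /* Strict conjunction: any single failure keeps the scene unvalidated. */
    static int scene_validated(const SceneEvidence *e)
    {
        return e->reached_scene && e->aligned_to_ref && e->artifacts_support;
    }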

Current Harness Concerns To Investigate

You should actively test and improve these:

  1. Boot lead-in may be under-budgeted.
    • Current scene windows may begin too early.
    • Test longer frame budgets and later alignment windows.
  2. Some current comparisons may overweight title/black/ocean contamination.
    • The harness must reject invalid anchors instead of fabricating confidence.
  3. HTML output must always exist for human review.
    • Keep that invariant.
  4. The “best frame” or “state hash” is not enough.
    • Validation must be scene-content driven.
  5. If the compare path falls back, make that obvious (one way is sketched below).
    • Fallback review pages are useful, but they are not validated compare pages.
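
For point 5, the cheapest way to make fallback obvious is to stamp it into the artifacts themselves instead of relying on the reviewer to notice. A minimal sketch; the JSON field is invented, not the harness's real schema:

    /* Hypothetical status stamp -- the field name is illustrative only. */
    #include <stdio.h>

    /* Write a compare.json whose top-level status makes a fallback run
     * impossible to mistake for a validated comparison; the same flag
     * should also drive a visible banner in compare.html. */
    static int write_compare_status(const char *path, int aligned)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "{ \"compare_status\": \"%s\" }\n",
                aligned ? "aligned" : "fallback");
        return fclose(f);
    }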

Current PS1 Runtime Debug Context

The active deep bug work has been concentrated on ACTIVITY 1.

Current strongest runtime read:

  • the live seam is likely in ps1PilotLoadPackIndex(...) in:
    • repo:/cdrom_ps1.c
  • it appears stack-layout-sensitive
  • the exact local ordering of these matters:
    • cdPath
    • cdfile
  • tiny caller-frame changes around CdSearchFile(...) move the runtime between stable bad branches

This means:

  • do not assume a clean logic bug
  • suspect UB, an overwrite, or stack-sensitive corruption around the pack-file lookup path (an invented illustration of this failure shape follows below)
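
An invented illustration of that failure shape. This is NOT the real code in repo:/cdrom_ps1.c; it only shows why the declaration order of cdPath and cdfile could matter: an out-of-bounds write into one local corrupts whichever neighbor the compiler placed after it, so reordering locals or perturbing the caller's frame merely selects a different stable bad branch.

    /* Invented illustration -- NOT the real ps1PilotLoadPackIndex(). */
    #include <stdio.h>
    #include <string.h>

    typedef struct { int pos; int size; char name[16]; } CdlFILE; /* stand-in */

    static int load_pack_index(const char *pack)
    {
        char    cdPath[16];   /* too small for some pack names...      */
        CdlFILE cdfile;       /* ...and possibly adjacent on the stack */

        memset(&cdfile, 0, sizeof cdfile);

        /* Unbounded format: if "\\" + pack + ";1" exceeds 16 bytes, the
         * overflow lands in whatever the compiler placed after cdPath --
         * often cdfile. Which field is clobbered depends on declaration
         * order, padding, and the caller's frame, which matches the
         * "stable bad branches" symptom. */
        sprintf(cdPath, "\\%s;1", pack);

        /* cdfile may already be corrupted before it is ever handed to a
         * CdSearchFile()-style lookup. */
        printf("pos=%d size=%d\n", cdfile.pos, cdfile.size);
        return 0;
    }

If this is the class of bug, guard bytes around the suspect locals, or an AddressSanitizer build wherever a host-side build of this code path exists, will localize the stray write far faster than diffing branch behavior.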

That said, do not get trapped in one scene forever if the better immediate win is harness confidence or a different scene that is closer to validation.

Suggested Work Order

  1. Verify the harness outputs are always reviewable.
    • review.html for single runs
    • compare.html for reference comparisons
  2. Audit timing assumptions.
    • Test whether boot grace, scene-entry windows, or total run length need to move later
    • Specifically test the “add roughly 35 seconds” hypothesis
  3. Run targeted scene comparisons against the reference set.
    • Prefer one scene at a time when debugging
    • Prefer broader sweeps when ranking “closest to validated”
  4. Identify the easiest real win.
    • Choose scenes that already show actual content and are closest to aligned reference output
  5. Only then, and only as needed, return to deeper runtime surgery.

Logging Requirements

Keep a concise progress log in:

  • repo:/docs/ps1/research/HARNESS_WORKLOG_2026-03-28.md

For each significant step, record:

  • what was changed
  • what run was executed
  • where the result artifacts are
  • what conclusion is justified
  • what the next target is

Do not write vague summaries. Record the actual result and the actual conclusion.
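
A minimal entry shape that keeps the log honest; every field below is a placeholder to fill in, not a recorded result:

    ## <time> - <short step name>

    - changed: <the exact change, e.g. one specific timing value in the harness>
    - run: <script invoked and scene(s) targeted>
    - artifacts: <paths to result.json / review.html / compare.json / compare.html>
    - conclusion: <only what those artifacts actually justify>
    - next: <the next concrete target>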

Standards

Be skeptical of false positives.

If a scene:

  • never launches
  • only shows water
  • only shows black
  • only shows title
  • or fails to align cleanly against reference

then it is not validated.

The end goal is:

  • 63 scenes
  • all human-reviewable
  • all compared against canonical reference
  • all validated on real scene content