[codex] HTML API: Add set_inner_html to the processor#69

Draft
sirreal wants to merge 8 commits into trunk from set-inner-html

sirreal commented Jun 14, 2026

What changed

Adds WP_HTML_Processor::set_inner_html() for non-atomic tag openers. The method replaces the target element's raw inner HTML only when the replacement can be parsed without changing the tree outside the target; otherwise it returns false and leaves the source unchanged.

Also adds focused PHPUnit coverage and a deterministic standalone fuzzer for the outside-tree invariant, including BODY/HTML attribute-hoisting cases and the safe template/foreign-content exceptions.

Validation

  • WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group html-api tests/phpunit/tests/html-api/wpHtmlProcessorSetInnerHtml.php
  • WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group html-api
  • php -l tools/html-api-fuzz/set-inner-html.php
  • ./vendor/bin/phpcs --standard=phpcs.xml.dist tools/html-api-fuzz/set-inner-html.php
  • php tools/html-api-fuzz/set-inner-html.php --iterations 0 --output-dir /tmp/set-inner-html-fuzz-corpus --stop-on-failure
  • php tools/html-api-fuzz/set-inner-html.php --iterations 100 --output-dir /tmp/set-inner-html-fuzz-smoke --stop-on-failure
  • php tools/html-api-fuzz/set-inner-html.php --iterations 5000 --output-dir /tmp/set-inner-html-fuzz-5000 --stop-on-failure
  • php tools/html-api-fuzz/set-inner-html.php --iterations 50000 --output-dir /tmp/set-inner-html-fuzz-50000 --stop-on-failure
  • php tools/html-api-fuzz/set-inner-html.php --iterations 50000 --start-seed 50001 --output-dir /tmp/set-inner-html-fuzz-50001-100000 --stop-on-failure
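
The seeded runs above only reproduce if each seed maps to exactly one test case. A minimal Python sketch of that property follows. The fragment vocabulary and both function names are hypothetical and are not taken from tools/html-api-fuzz/set-inner-html.php; only the idea of a contiguous seed range comes from the `--start-seed` and `--iterations` flags.

```python
import random

# Hypothetical vocabulary of risky fragments (an assumption for illustration).
FRAGMENTS = ["text", "<b>", "</b>", "<p>", "</p>", "<table>",
             "<body x>", "<template>", "</template>", "<svg>", "</div>"]

def generate_case(seed, max_parts=6):
    """Build one replacement string that depends only on `seed`."""
    rng = random.Random(seed)  # private generator, no shared global state
    count = rng.randint(1, max_parts)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))

def run(start_seed, iterations):
    """Return (seed, case) pairs for a contiguous seed range."""
    return [(seed, generate_case(seed))
            for seed in range(start_seed, start_seed + iterations)]
```

With this shape, a failure reported for one seed can be replayed by running that seed alone.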

sirreal commented Jun 14, 2026

set_inner_html() algorithm notes

The current implementation is intentionally conservative: it validates the proposed replacement by parsing in full document/fragment context rather than trying to reason about the replacement string in isolation.

High-level flow:

  1. Reject unsupported call sites: virtual tokens, non-matched states, tag closers, atomic/non-closer elements, integration-node tokens, or tokens without a source bookmark.
  2. Flush pending lexical updates so validation runs against the same source that will receive the replacement.
  3. Locate the target opener by source span and find the raw inner-HTML byte range by reparsing until the target element is popped/closed. This handles explicit closers, implicit closers, EOF virtual pops, and special full-document BODY/HTML end behavior.
  4. Build candidate source by splicing the proposed inner HTML into the original source.
  5. Reparse the original and candidate in the same public parsing mode and compute an outside-tree signature. The signature records tokens outside the target, including token type/name/namespace, closer state, breadcrumbs, and serialized token. Tokens inside the target are skipped for comparison, but still processed because they can affect parser state and where the target closes.
  6. Compare active formatting element state at target entry/exit to catch parser-state leaks such as reconstruction outside the target.
  7. Track non-visitable parser events for BODY/HTML start/end tags that are consumed without normal visitable stack events. Attribute-bearing <body ...> / <html ...> tokens inside the target/replacement range are rejected because they may hoist attributes onto the real body/html element rather than remain target-local. Safe cases such as template content and foreign-content HTML-looking tags are allowed.
  8. Queue one raw lexical replacement only if the original and candidate outside signatures match and no unsupported/parser-error condition is encountered. Otherwise return false and leave the source unchanged.
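
Steps 3, 4, 5, and 8 can be sketched with a toy model in Python. This is not the WordPress implementation: the regex tokenizer, the "pop through the nearest match" closing rule, and the names `scan` and `try_set_inner_html` are assumptions made for illustration. The toy ignores attributes, void elements, implicit closers, EOF pops, and active formatting reconstruction (step 6), so it accepts some inputs the real validation would reject.

```python
import re

# Toy tokenizer: start tags, end tags, and text runs. Attributes are ignored.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+")

def scan(html, target_start):
    """Walk `html` once; return (outside_signature, inner_span) or None.

    `target_start` is the offset of the target's opening tag.
    """
    stack = []            # names of currently open elements
    signature = []        # tokens seen outside the target
    target_depth = None   # stack depth while the target is open
    inner_start = inner_end = None
    for m in TOKEN.finditer(html):
        closer, name = m.group(1), m.group(2)
        inside = target_depth is not None
        if name is None:  # text run
            if not inside:
                signature.append(("#text", tuple(stack), m.group(0)))
            continue
        name = name.lower()
        if not closer:
            stack.append(name)
            if m.start() == target_start:
                target_depth, inner_start = len(stack), m.end()
                signature.append(("target", tuple(stack)))
            elif not inside:
                signature.append(("open", tuple(stack)))
            continue
        if name in stack:  # pop through the nearest matching open element
            while stack.pop() != name:
                pass
        if inside and len(stack) < target_depth:
            # This closer ended the target; it counts as an outside token.
            target_depth, inner_end = None, m.start()
            inside = False
        if not inside:
            signature.append(("close", name, tuple(stack)))
    if inner_start is None or inner_end is None:
        return None  # target missing or never closed (unsupported in this toy)
    return signature, (inner_start, inner_end)

def try_set_inner_html(html, target_start, replacement):
    """Return the updated source, or None when the outside signature changes."""
    original = scan(html, target_start)
    if original is None:
        return None
    signature, (start, end) = original
    candidate = html[:start] + replacement + html[end:]
    checked = scan(candidate, target_start)
    if checked is None or checked[0] != signature:
        return None
    return candidate
```

For the source `<div><p>old</p><span>after</span></div>` with the `<p>` opener at offset 5, a replacement of `new <b>text</b>` is accepted, while `x</p><i>leak` is rejected because the `<i>` element and its text appear in the outside signature.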

In other words: the safety property is enforced by full in-context reparsing plus strict outside-token comparison, with extra bookkeeping for parser effects that are not visible as normal visited tokens.

Performance notes / possible optimizations

The current version prioritizes correctness and simplicity, but it does extra work. It can run up to three parses per call: one of the original to find the target end, a second of the original for the outside signature, and one of the candidate for the candidate outside signature.

Potential follow-ups:

  1. Merge original passes. The end-finding pass and original outside-signature pass could likely be combined into one original parse that finds inner_end, records the outside signature, and detects original-side BODY/HTML hoist events.
  2. Stop at target close with a complete parser-state signature. Instead of parsing through EOF, compare original/candidate state immediately after the target closes. This could avoid reparsing an unchanged suffix, but only if the state signature is complete: open elements, active formatting elements, insertion mode/template mode, namespace/integration state, form/head/frameset state, and hoist events. This is the highest-risk optimization because omitting one state field could make the check unsound.
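  3. Fast path for text-only replacements. If the replacement contains no markup introducers, common cases could skip candidate reparsing after inner_end is known, because plain text cannot introduce closers, active formatting changes, or body/html hoists. This holds only where the target's insertion mode keeps text in place; targets such as TABLE or TR foster-parent non-whitespace text and would still need full validation.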
  4. Cheap pre-scan for obvious rejects. A lexical scan for high-risk constructs like target closers, <body, <html, nested non-nestable tags, or unclosed active formatting elements could reject many invalid replacements before constructing/parsing the full candidate. This would be an optimization only; the full parser validation should remain authoritative.
  5. Cache target metadata. If repeated attempts are made at the same processor position, the computed original target end/signature metadata could be reused.
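
Follow-ups 3 and 4 amount to a lexical triage step ahead of validation. A minimal Python sketch, under stated assumptions: the name `triage`, the route labels, and the hint patterns are hypothetical. The sketch never rejects on a hint, because the same bytes can be safe inside TEMPLATE and full parser validation stays authoritative. The text-only route also assumes the caller has already established that the target keeps character data in place.

```python
import re

# Lexical hints only. A match does not prove the replacement is unsafe.
RISK_HINTS = {
    "closer": re.compile(r"</[a-z]", re.IGNORECASE),
    "body-or-html": re.compile(r"<(?:body|html)[\s/>]", re.IGNORECASE),
}

def triage(replacement):
    """Return (route, hints) for a proposed inner-HTML replacement."""
    if "<" not in replacement:
        # No tag, comment, or doctype can start without "<".
        return "text-only", []
    hints = [name for name, pattern in RISK_HINTS.items()
             if pattern.search(replacement)]
    return "full-validation", hints
```

Character references such as `&amp;` stay on the text-only route because they decode to text and cannot open or close elements.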

The safest near-term optimization is merging the original passes. The largest potential win is stopping at target close with a complete parser-state comparison, but that requires careful proof that the state snapshot fully determines parsing of the unchanged suffix.
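
To make the "complete parser-state signature" concrete, here is a hedged Python sketch of a snapshot value compared at target close. The field list follows the state named in follow-up 2; the class name, field names, and types are hypothetical, and nothing here shows that this set of fields is actually complete.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParserStateSnapshot:
    """State captured immediately after the target element closes."""
    open_elements: tuple[str, ...]
    active_formatting: tuple[str, ...]
    insertion_mode: str
    template_modes: tuple[str, ...]
    namespace: str
    in_integration_point: bool
    form_pointer_set: bool
    head_pointer_set: bool
    frameset_ok: bool
    hoist_events: tuple[str, ...]

def suffix_parse_is_equivalent(original, candidate):
    """True when both snapshots agree on every recorded field.

    Sound only if the recorded fields fully determine how the unchanged
    suffix is parsed, which this sketch does not prove.
    """
    return original == candidate
```

Using a frozen dataclass means a newly added field takes part in the comparison automatically, which reduces the chance of silently omitting one.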

sirreal commented Jun 21, 2026

Design notes from set_inner_html() discussion

Core goal

set_inner_html() should accept raw inner HTML only when the parsed result does not affect anything outside the target element.

The invariant is:

  • Accepted replacement: target inner bytes change; outside tree remains identical.
  • Rejected replacement: source remains unchanged; processor state is not poisoned.
  • Validation must use the HTML parser/tree builder, not string matching, because many effects are context-sensitive.

Main approach

The strongest general approach is:

  1. Splice the proposed inner HTML into the original source.
  2. Parse both original and candidate in the same parsing mode.
  3. Find the same target element by source span.
  4. Compare a signature of everything outside the target.
  5. Reject if the target disappears, the parser reaches unsupported state, active formatting state leaks, or the outside signature changes.

This avoids bespoke checks for most dangerous input, such as explicit closers, nested anchors, unclosed formatting elements, and table repairs. If those change anything outside the target, the outside-tree comparison catches them.

BODY / HTML hoisting

The special case is BODY and HTML start tags with attributes. These may be consumed by the parser without becoming normal child elements, while their attributes can hoist to the real outer BODY or HTML element. A normal next_token() tree walk may not see them even though they have semantic side effects outside the target.

The simplified rule we converged on:

Reject BODY/HTML start tags with attributes unless they are inside TEMPLATE.

This still needs parser awareness. It should not be a raw string scan.

Template exception

TEMPLATE is the important safe exception. Inside template contents, BODY/HTML-looking tags do not hoist to the outer document, so this should be allowed:

<template><body add-class>New</template>

A raw "contains <body with attrs" check would over-reject this.
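
A small Python sketch of why TEMPLATE depth matters. It uses the standard library `html.parser`, which is a tokenizer and not an HTML5 tree builder, so it only stands in for "parser awareness" and is not how the PR implements the check. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class HoistHazardScanner(HTMLParser):
    """Flag BODY/HTML start tags with attributes outside TEMPLATE."""

    def __init__(self):
        super().__init__()
        self.template_depth = 0
        self.hazards = []

    def handle_starttag(self, tag, attrs):
        # `tag` arrives lowercased; `attrs` is a list of (name, value) pairs.
        if tag == "template":
            self.template_depth += 1
        elif tag in ("body", "html") and attrs and self.template_depth == 0:
            self.hazards.append(tag)

    def handle_endtag(self, tag):
        if tag == "template" and self.template_depth > 0:
            self.template_depth -= 1

def has_hoist_hazard(replacement):
    scanner = HoistHazardScanner()
    scanner.feed(replacement)
    scanner.close()
    return bool(scanner.hazards)
```

The template example above yields no hazard in this sketch, whereas a plain substring test for `<body` with attributes would flag it.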

Foreign content

We corrected the foreign-content reasoning. BODY in SVG/MathML is a breakout tag, so cases like these can reach the in-body rules and hoist attributes:

<svg><body s>
<math><body m>

A foreign html element by itself can sometimes remain foreign, but the simpler conservative direction is: outside TEMPLATE, reject BODY/HTML with attributes when parser handling says they are document-level or breakout-reprocessed hazards.

Why raw scan + normalize is not enough

A standalone scan/normalize pass is not viable because replacement safety depends on the current parser context and surrounding tree. The same bytes can behave differently inside TEMPLATE, SVG/MathML, active formatting contexts, table insertion modes, full-document BODY/HTML, and fragment vs full parser modes.

So validation needs to parse the candidate in the actual context created by splicing it into the original source.

record_nonvisitable_token_event()

The current helper matters because it exposes parser events that are otherwise invisible to next_token().

It currently does two jobs:

  1. Detect ignored/non-visitable BODY/HTML start tags with attributes that may hoist.
  2. Record non-visitable </body> / </html> end tags so full-document BODY/HTML inner spans can be located accurately.

The name and abstraction are probably too ad hoc, but the information is needed.

Better direction

A cleaner design would expose "non-tree parser tokens" internally during validation, rather than having a narrowly named event recorder.

That would let normal tree events drive outside-tree comparison while ignored/reprocessed BODY/HTML start tags with attributes outside TEMPLATE cause rejection, and explicit non-visitable </body> / </html> tokens provide replacement end offsets.
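
One possible shape for such an internal stream, sketched in Python. Every name here is hypothetical, including the choice to keep the first closer offset per tag name; the sketch only illustrates the two jobs described above, with the tree builder assumed to supply the `in_template` flag.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class NonTreeToken:
    """A BODY/HTML token the parser consumed without a visitable tree event."""
    name: str             # "body" or "html"
    is_closer: bool
    has_attributes: bool
    in_template: bool     # assumed to be reported by the tree builder
    offset: int           # offset of the token in the source

@dataclass
class ValidationLog:
    rejected: bool = False
    end_offsets: dict = field(default_factory=dict)

    def observe(self, token):
        if token.is_closer:
            # Job 2: remember where </body> or </html> appeared.
            self.end_offsets.setdefault(token.name, token.offset)
        elif token.has_attributes and not token.in_template:
            # Job 1: an attribute-bearing start tag may hoist attributes.
            self.rejected = True
```

Normal tree events would continue to feed the outside-tree comparison separately; this log only covers tokens that never become tree children.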

Not viable directions

  • Raw regex/string scanning for BODY/HTML: over-rejects templates and can misread context.
  • Standalone fragment normalization: loses context and misses leaks that only appear when inserted into the current tree.
  • Relying only on public visitable tokens: misses invisible parser side effects like attribute hoisting.
  • Exposing ignored BODY/HTML as normal public tokens: risks confusing public APIs because those tokens are not actual tree children.

Best summary: use full parser-context comparison for general safety, plus an internal way to observe non-tree parser tokens for the small class of invisible side effects.
