Adds WP_HTML_Processor::set_inner_html() for non-atomic tag openers. The method replaces the target element's raw inner HTML only when the replacement can be parsed without changing the tree outside the target; otherwise it returns false and leaves the source unchanged.
Also adds focused PHPUnit coverage and a deterministic standalone fuzzer for the outside-tree invariant, including BODY/HTML attribute-hoisting cases and safe template/foreign-content exceptions.
The current implementation is intentionally conservative: it validates the proposed replacement by parsing in full document/fragment context rather than trying to reason about the replacement string in isolation.
High-level flow:
Reject unsupported call sites: virtual tokens, non-matched states, tag closers, atomic/non-closer elements, integration-node tokens, or tokens without a source bookmark.
Flush pending lexical updates so validation runs against the same source that will receive the replacement.
Locate the target opener by source span and find the raw inner-HTML byte range by reparsing until the target element is popped/closed. This handles explicit closers, implicit closers, EOF virtual pops, and special full-document BODY/HTML end behavior.
Build candidate source by splicing the proposed inner HTML into the original source.
Reparse the original and candidate in the same public parsing mode and compute an outside-tree signature. The signature records tokens outside the target, including token type/name/namespace, closer state, breadcrumbs, and serialized token. Tokens inside the target are skipped for comparison, but still processed because they can affect parser state and where the target closes.
Compare active formatting element state at target entry/exit to catch parser-state leaks such as reconstruction outside the target.
Track non-visitable parser events for BODY/HTML start/end tags that are consumed without normal visitable stack events. Attribute-bearing <body ...> / <html ...> tokens inside the target/replacement range are rejected because they may hoist attributes onto the real body/html element rather than remain target-local. Safe cases such as template content and foreign-content HTML-looking tags are allowed.
Queue one raw lexical replacement only if the original and candidate outside signatures match and no unsupported/parser-error condition is encountered. Otherwise return false and leave the source unchanged.
In other words: the safety property is enforced by full in-context reparsing plus strict outside-token comparison, with extra bookkeeping for parser effects that are not visible as normal visited tokens.
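The splice step in the flow above can be sketched in a few lines. This is an illustration only: `splice_inner_html()` and its parameters are hypothetical names, and the byte offsets are assumed to come from the end-finding reparse rather than being computed here.

```php
<?php
/**
 * Hypothetical helper: build the candidate source by replacing the target's
 * raw inner-HTML byte range with the proposed inner HTML.
 *
 * $inner_start and $inner_end are byte offsets into $source (end exclusive).
 * They are assumed to have been located by the parser, not by string search.
 */
function splice_inner_html( string $source, int $inner_start, int $inner_end, string $new_inner ): string {
	if ( $inner_start < 0 || $inner_end < $inner_start || $inner_end > strlen( $source ) ) {
		throw new InvalidArgumentException( 'Inner HTML span is out of bounds.' );
	}

	// Everything before and after the span is copied through byte-for-byte.
	return substr_replace( $source, $new_inner, $inner_start, $inner_end - $inner_start );
}

$source    = '<div><p>Old</p></div><footer>Kept</footer>';
$start     = strlen( '<div>' );
$end       = $start + strlen( '<p>Old</p>' );
$candidate = splice_inner_html( $source, $start, $end, '<em>New</em>' );

echo $candidate, "\n"; // <div><em>New</em></div><footer>Kept</footer>
```

The splice itself is purely lexical. Whether the candidate is acceptable is decided only afterwards, by reparsing it and comparing outside signatures.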
Performance notes / possible optimizations
The current version prioritizes correctness and simplicity over speed. A single call can run up to three full parses: one over the original to find the target end, a second over the original to compute the outside signature, and one over the candidate to compute the candidate outside signature.
Potential follow-ups:
Merge original passes. The end-finding pass and original outside-signature pass could likely be combined into one original parse that finds inner_end, records the outside signature, and detects original-side BODY/HTML hoist events.
Stop at target close with a complete parser-state signature. Instead of parsing through EOF, compare original/candidate state immediately after the target closes. This could avoid reparsing an unchanged suffix, but only if the state signature is complete: open elements, active formatting elements, insertion mode/template mode, namespace/integration state, form/head/frameset state, and hoist events. This is the highest-risk optimization because omitting one state field could make the check unsound.
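Fast path for text-only replacements. If the replacement contains no markup introducers, common cases could skip candidate reparsing after inner_end is known, because plain text cannot introduce closers or body/html hoists. The fast path would still need a context check: where the parser does not accept text in place, such as directly inside TABLE (non-whitespace text is foster-parented out of the table), plain text can still change the outside tree.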
Cheap pre-scan for obvious rejects. A lexical scan for high-risk constructs like target closers, <body, <html, nested non-nestable tags, or unclosed active formatting elements could reject many invalid replacements before constructing/parsing the full candidate. This would be an optimization only; the full parser validation should remain authoritative.
Cache target metadata. If repeated attempts are made at the same processor position, the computed original target end/signature metadata could be reused.
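As a rough sketch of the text-only fast path listed above, the lexical half of the check could look like this. The function name is hypothetical, and treating `<` as the only markup introducer is an assumption of this sketch (`&` is allowed because character references decode to text). The check says nothing about parser context, so a caller would still need to confirm that the target accepts text in place.

```php
<?php
/**
 * Hypothetical pre-check: true when the replacement contains no "<" and so
 * cannot start a tag, closer, comment, or doctype.
 *
 * Assumption: this is only the lexical half of a fast path. It does not
 * inspect the insertion mode or the target element.
 */
function is_text_only_replacement( string $new_inner ): bool {
	return false === strpos( $new_inner, '<' );
}

var_dump( is_text_only_replacement( 'Fish &amp; chips' ) ); // bool(true)
var_dump( is_text_only_replacement( 'a < b' ) );            // bool(false)
var_dump( is_text_only_replacement( '</div><aside>' ) );    // bool(false)
```

The second input is harmless text, but the sketch deliberately reports false so that anything containing `<` falls through to full parser validation.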
The safest near-term optimization is merging the original passes. The largest potential win is stopping at target close with a complete parser-state comparison, but that requires careful proof that the state snapshot fully determines parsing of the unchanged suffix.
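One way to guard against an incomplete snapshot is to make a missing field a hard error instead of a silent omission. The sketch below assumes a flat array of parser state. The field names are hypothetical, derived from the list above, and are not the processor's real properties.

```php
<?php
// Hypothetical field names, taken from the prose list above.
const REQUIRED_STATE_FIELDS = array(
	'open_elements',
	'active_formatting_elements',
	'insertion_mode',
	'template_insertion_modes',
	'namespace_and_integration',
	'form_pointer',
	'head_pointer',
	'frameset_ok',
	'hoist_events',
);

/**
 * Serialize a parser-state snapshot in a fixed field order.
 * Throws if any required field is absent, so an incomplete snapshot
 * can never compare equal by accident.
 */
function parser_state_signature( array $state ): string {
	$missing = array_diff( REQUIRED_STATE_FIELDS, array_keys( $state ) );
	if ( count( $missing ) > 0 ) {
		throw new LogicException( 'Incomplete parser state: ' . implode( ', ', $missing ) );
	}

	$ordered = array();
	foreach ( REQUIRED_STATE_FIELDS as $field ) {
		$ordered[ $field ] = $state[ $field ];
	}

	return json_encode( $ordered, JSON_THROW_ON_ERROR );
}

$after_original = array(
	'open_elements'              => array( 'HTML', 'BODY' ),
	'active_formatting_elements' => array(),
	'insertion_mode'             => 'in-body',
	'template_insertion_modes'   => array(),
	'namespace_and_integration'  => 'html',
	'form_pointer'               => null,
	'head_pointer'               => 'HEAD',
	'frameset_ok'                => false,
	'hoist_events'               => array(),
);

// Same state, except an unclosed B leaked out of the target.
$after_candidate = array_merge( $after_original, array( 'active_formatting_elements' => array( 'B' ) ) );

var_dump( parser_state_signature( $after_original ) === parser_state_signature( $after_candidate ) ); // bool(false)
```

This only enforces that the listed fields are present. It does not prove that the list itself is sufficient, which is the part that needs the careful argument.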
set_inner_html() should accept raw inner HTML only when the parsed result does not affect anything outside the target element.
The invariant is:
Accepted replacement: target inner bytes change; outside tree remains identical.
Rejected replacement: source remains unchanged; processor state is not poisoned.
Validation must use the HTML parser/tree builder, not string matching, because many effects are context-sensitive.
Main approach
The strongest general approach is:
Splice the proposed inner HTML into the original source.
Parse both original and candidate in the same parsing mode.
Find the same target element by source span.
Compare a signature of everything outside the target.
Reject if the target disappears, the parser reaches unsupported state, active formatting state leaks, or the outside signature changes.
This avoids bespoke checks for most dangerous input, such as explicit closers, nested anchors, unclosed formatting elements, and table repairs. If those change anything outside the target, the outside-tree comparison catches them.
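To illustrate the comparison, the sketch below uses hand-written token lists standing in for parser output. `tok()` and `outside_signature()` are hypothetical helpers, and the real signature records more per token (type, namespace, serialized token) than the breadcrumbs and closer flag shown here.

```php
<?php
// Hypothetical toy token: a hand-written stand-in for one parser token.
function tok( array $breadcrumbs, bool $is_closer, bool $in_target ): array {
	return array(
		'breadcrumbs' => $breadcrumbs,
		'is_closer'   => $is_closer,
		'in_target'   => $in_target,
	);
}

/**
 * Collect one signature entry per token outside the target.
 * Tokens inside the target were still parsed upstream; they are only
 * excluded from the comparison.
 */
function outside_signature( array $tokens ): array {
	$signature = array();
	foreach ( $tokens as $token ) {
		if ( $token['in_target'] ) {
			continue;
		}
		$signature[] = implode( ' > ', $token['breadcrumbs'] ) . ( $token['is_closer'] ? ' (closer)' : '' );
	}
	return $signature;
}

// Original: <div>Old</div><footer></footer>, target is the DIV.
$original = array(
	tok( array( 'HTML', 'BODY', 'DIV' ), false, false ),
	tok( array( 'HTML', 'BODY', 'DIV', '#text' ), false, true ),
	tok( array( 'HTML', 'BODY', 'FOOTER' ), false, false ),
);

// Candidate with inner HTML "<em>New</em>": only target-local tokens differ.
$safe = array(
	tok( array( 'HTML', 'BODY', 'DIV' ), false, false ),
	tok( array( 'HTML', 'BODY', 'DIV', 'EM' ), false, true ),
	tok( array( 'HTML', 'BODY', 'DIV', 'EM', '#text' ), false, true ),
	tok( array( 'HTML', 'BODY', 'FOOTER' ), false, false ),
);

// Candidate with inner HTML "</div><aside>": the DIV closes early, ASIDE
// lands outside it, and FOOTER becomes a child of ASIDE.
$leaky = array(
	tok( array( 'HTML', 'BODY', 'DIV' ), false, false ),
	tok( array( 'HTML', 'BODY', 'ASIDE' ), false, false ),
	tok( array( 'HTML', 'BODY', 'ASIDE', 'FOOTER' ), false, false ),
);

var_dump( outside_signature( $original ) === outside_signature( $safe ) );  // bool(true)
var_dump( outside_signature( $original ) === outside_signature( $leaky ) ); // bool(false)
```

No rule about `</div>` or ASIDE appears anywhere in the sketch. The leak is rejected only because the outside signature changed.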
BODY / HTML hoisting
The special case is BODY and HTML start tags with attributes. These may be consumed by the parser without becoming normal child elements, while their attributes can hoist to the real outer BODY or HTML element. A normal next_token() tree walk may not see them even though they have semantic side effects outside the target.
The simplified rule we converged on:
Reject BODY/HTML start tags with attributes unless they are inside TEMPLATE.
This still needs parser awareness. It should not be a raw string scan.
Template exception
TEMPLATE is the important safe exception. Inside template contents, BODY/HTML-looking tags do not hoist to the outer document, so this should be allowed:
<template><body add-class>New</template>
A raw "contains <body with attrs" check would over-reject this.
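A small sketch of why the raw check over-rejects: the pattern and function name below are hypothetical, chosen only to show that a lexical scan gives the same answer for a template-contained BODY tag and a hoisting one.

```php
<?php
/**
 * Hypothetical naive scan: true when the string contains something that
 * looks like a BODY start tag with at least one attribute.
 */
function naive_has_body_with_attributes( string $html ): bool {
	return 1 === preg_match( '~<body\s+[^>\s/]~i', $html );
}

// Inside TEMPLATE contents: does not hoist, so it should be allowed.
$inside_template = '<template><body class="x">New</template>';

// Directly in the target: may hoist onto the real BODY, so it should be rejected.
$direct = '<p>Text</p><body class="x">';

var_dump( naive_has_body_with_attributes( $inside_template ) ); // bool(true)
var_dump( naive_has_body_with_attributes( $direct ) );          // bool(true)
```

Both inputs match, so the scan cannot tell the safe case from the hazardous one. Telling them apart requires knowing the parser context at that point.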
Foreign content
We corrected the foreign-content reasoning. BODY in SVG/MathML is a breakout tag, so cases like these can reach the in-body rules and hoist attributes:
<svg><body s>
<math><body m>
A foreign html element by itself can sometimes remain foreign, but the simpler conservative direction is: outside TEMPLATE, reject BODY/HTML with attributes when parser handling says they are document-level or breakout-reprocessed hazards.
Why raw scan + normalize is not enough
A standalone scan/normalize pass is not viable because replacement safety depends on the current parser context and surrounding tree. The same bytes can behave differently inside TEMPLATE, SVG/MathML, active formatting contexts, table insertion modes, full-document BODY/HTML, and fragment vs full parser modes.
So validation needs to parse the candidate in the actual context created by splicing it into the original source.
record_nonvisitable_token_event()
The current helper matters because it exposes parser events that are otherwise invisible to next_token().
It currently does two jobs:
Detect ignored/non-visitable BODY/HTML start tags with attributes that may hoist.
Record non-visitable </body> / </html> end tags so full-document BODY/HTML inner spans can be located accurately.
The name and abstraction are probably too ad hoc, but the information is needed.
Better direction
A cleaner design would expose "non-tree parser tokens" internally during validation, rather than having a narrowly named event recorder.
That would let normal tree events drive outside-tree comparison while ignored/reprocessed BODY/HTML start tags with attributes outside TEMPLATE cause rejection, and explicit non-visitable </body> / </html> tokens provide replacement end offsets.
Not viable directions
Raw regex/string scanning for BODY/HTML: over-rejects templates and can misread context.
Standalone fragment normalization: loses context and misses leaks that only appear when inserted into the current tree.
Relying only on public visitable tokens: misses invisible parser side effects like attribute hoisting.
Exposing ignored BODY/HTML as normal public tokens: risks confusing public APIs because those tokens are not actual tree children.
Best summary: use full parser-context comparison for general safety, plus an internal way to observe non-tree parser tokens for the small class of invisible side effects.
Validation