PDF: fix garbled/misplaced text — string escapes, CID encoding, substitute baseline#586
Merged
Conversation
The literal-string parser appended the raw character after a backslash for non-octal escapes, so `\n \r \t \b \f` yielded the letters 'n' 'r' 't' 'b' 'f' instead of their control bytes (ISO 32000-1 7.3.4.2, Table 3). Any glyph whose byte code had one of those low bytes (common in 2-byte CFF CID codes) decoded to the wrong code, producing .notdef boxes and garbled selectable text. Translate the full escape set and handle `\`-before-EOL line continuation. Covered by new PdfObjectParser string tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
An embedded Type0 /Encoding CMap stream is now parsed (cidchar/cidrange -> Font::cid_encoding) so Font::codes() maps char codes to CIDs through it and uses the CMap's authoritative codespace. This fixes producers that mix a 1-byte code (e.g. a space) among 2-byte CIDs: without the CMap the space fell through identity to the wrong glyph and advance. CMap gains cid_for_code / add_cid_range / has_cid_map; the parser gains read_cidchar / read_cidrange. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A non-embedded font renders in a local system font whose hhea/OS-2 metrics we don't control, so under line-height:1 the browser placed the glyph baseline at that font's ascent rather than the ascent_em we use to compute `top` — dropping e.g. the 120pt Times title well below the page bottom. Route each substitute through a generated @font-face ('odr-sN', src: local(...) of the family stack) carrying ascent-override/descent-override (summing to 1em) and line-gap-override:0, so the baseline is positioned from our metric. Faces are deduped by (family, style, ascent); the family stack is kept as a fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 156258fde6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Thread a Logger into GraphicsOperatorParser (mirroring CMapParser) and emit ODR_DEBUG for unrecognized operators rather than writing to std::cerr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
37df8a7 to
716995d
Compare
An embedded Type0 /Encoding CMap that inherits a base via `usecmap` (e.g. `/Identity-H usecmap`) and then declares only a partial local codespacerange would have its partial codespace treated as authoritative by Font::codes(), so inherited 2-byte codes fell through code_width() to the 1-byte fallback and were mis-split. We don't resolve the inherited base CMap, so flag the `usecmap` operator at parse time and exclude an inherited codespace from has_codespace(). Callers then fall back to the ToUnicode codespace / fixed 2-byte width (the pre-CID-encoding behavior). Local cidchar/cidrange overrides are still applied; unmapped codes fall through to identity, correct for an Identity-H base. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the paired `const char *position, const char *end` cursor with a single `string_view` that shrinks as codes are consumed. Bundling the two pointers removes the mismatched-position/end footgun and lets `settle()` use `size()` directly instead of pointer subtraction. `begin()`/`end()` now pass the view (and an empty view at its tail) rather than reaching into `data()`, dropping both suspicious-stringview-data-usage suppressions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three related fixes to the in-house PDF engine (
DecoderEngine::odr), surfaced byfontfile3_opentype.pdf(and its encrypted twin), where lots of glyphs rendered as.notdefboxes, some text was missing/garbled, and the large "PDFClown" title sat well below the page bottom.1. Decode reverse-solidus escapes in string literals
The literal-string parser appended the raw character after a backslash for non-octal escapes, so
\n \r \t \b \fyielded the letters instead of their control bytes (ISO 32000-1 7.3.4.2, Table 3). Any glyph whose 2-byte CFF CID code had one of those low bytes decoded to the wrong code →.notdefboxes and garbled selectable text. Now translates the full escape set +\-before-EOL line continuation. NewPdfObjectParsertests cover it. (notdef boxes 173 → 0.)2. Parse composite
/EncodingCMap for CID mappingAn embedded Type0
/EncodingCMap stream is now parsed (cidchar/cidrange→Font::cid_encoding), soFont::codes()maps char codes to CIDs through it and uses the CMap's authoritative codespace. Fixes producers that mix a 1-byte code (e.g. a space) among 2-byte CIDs — previously the space fell through identity to the wrong glyph and advance.3. Pin substitute-font baseline with
@font-facemetric overridesA non-embedded font renders in a local system font whose
hhea/OS-2metrics we don't control, so underline-height:1the browser placed the baseline at that font's ascent rather than theascent_emused to computetop— dropping e.g. the 120pt Times title below the page bottom. Each substitute now routes through a generated@font-face('odr-sN',src: local(...)) carryingascent-override/descent-override(summing to 1em) andline-gap-override:0, so the baseline is positioned from our metric. Faces deduped by (family, style, ascent); the family stack is kept as a fallback.4. Log unknown graphics operators via
LoggerGraphicsOperatorParsernow takes aLogger(mirroringCMapParser) and emitsODR_DEBUGfor unrecognized operators instead of writing tostd::cerr.All 228 PDF unit tests pass. Reference-output snapshot regeneration is left to the maintainer's update-ref workflow.
🤖 Generated with Claude Code