Skip to content

PDF: fix garbled/misplaced text — string escapes, CID encoding, substitute baseline#586

Merged
andiwand merged 8 commits into
mainfrom
pdf-string-escape-cid-fixes
Jul 5, 2026
Merged

PDF: fix garbled/misplaced text — string escapes, CID encoding, substitute baseline#586
andiwand merged 8 commits into
mainfrom
pdf-string-escape-cid-fixes

Conversation

@andiwand

@andiwand andiwand commented Jul 4, 2026

Copy link
Copy Markdown
Member

Three related fixes to the in-house PDF engine (DecoderEngine::odr), surfaced by fontfile3_opentype.pdf (and its encrypted twin), where lots of glyphs rendered as .notdef boxes, some text was missing/garbled, and the large "PDFClown" title sat well below the page bottom.

1. Decode reverse-solidus escapes in string literals

The literal-string parser appended the raw character after a backslash for non-octal escapes, so \n \r \t \b \f yielded the letters instead of their control bytes (ISO 32000-1 7.3.4.2, Table 3). Any glyph whose 2-byte CFF CID code had one of those low bytes decoded to the wrong code → .notdef boxes and garbled selectable text. Now translates the full escape set + \-before-EOL line continuation. New PdfObjectParser tests cover it. (notdef boxes 173 → 0.)

2. Parse composite /Encoding CMap for CID mapping

An embedded Type0 /Encoding CMap stream is now parsed (cidchar/cidrangeFont::cid_encoding), so Font::codes() maps char codes to CIDs through it and uses the CMap's authoritative codespace. Fixes producers that mix a 1-byte code (e.g. a space) among 2-byte CIDs — previously the space fell through identity to the wrong glyph and advance.

3. Pin substitute-font baseline with @font-face metric overrides

A non-embedded font renders in a local system font whose hhea/OS-2 metrics we don't control, so under line-height:1 the browser placed the baseline at that font's ascent rather than the ascent_em used to compute top — dropping e.g. the 120pt Times title below the page bottom. Each substitute now routes through a generated @font-face ('odr-sN', src: local(...)) carrying ascent-override/descent-override (summing to 1em) and line-gap-override:0, so the baseline is positioned from our metric. Faces deduped by (family, style, ascent); the family stack is kept as a fallback.

4. Log unknown graphics operators via Logger

GraphicsOperatorParser now takes a Logger (mirroring CMapParser) and emits ODR_DEBUG for unrecognized operators instead of writing to std::cerr.

All 228 PDF unit tests pass. Reference-output snapshot regeneration is left to the maintainer's update-ref workflow.

🤖 Generated with Claude Code

andiwand and others added 3 commits July 4, 2026 23:31
The literal-string parser appended the raw character after a backslash for
non-octal escapes, so `\n \r \t \b \f` yielded the letters 'n' 'r' 't' 'b'
'f' instead of their control bytes (ISO 32000-1 7.3.4.2, Table 3). Any glyph
whose byte code had one of those low bytes (common in 2-byte CFF CID codes)
decoded to the wrong code, producing .notdef boxes and garbled selectable
text. Translate the full escape set and handle `\`-before-EOL line
continuation. Covered by new PdfObjectParser string tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
An embedded Type0 /Encoding CMap stream is now parsed (cidchar/cidrange ->
Font::cid_encoding) so Font::codes() maps char codes to CIDs through it and
uses the CMap's authoritative codespace. This fixes producers that mix a
1-byte code (e.g. a space) among 2-byte CIDs: without the CMap the space fell
through identity to the wrong glyph and advance. CMap gains cid_for_code /
add_cid_range / has_cid_map; the parser gains read_cidchar / read_cidrange.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A non-embedded font renders in a local system font whose hhea/OS-2 metrics we
don't control, so under line-height:1 the browser placed the glyph baseline at
that font's ascent rather than the ascent_em we use to compute `top` — dropping
e.g. the 120pt Times title well below the page bottom. Route each substitute
through a generated @font-face ('odr-sN', src: local(...) of the family stack)
carrying ascent-override/descent-override (summing to 1em) and
line-gap-override:0, so the baseline is positioned from our metric. Faces are
deduped by (family, style, ascent); the family stack is kept as a fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 156258fde6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/odr/internal/pdf/pdf_cmap_parser.cpp
Thread a Logger into GraphicsOperatorParser (mirroring CMapParser) and
emit ODR_DEBUG for unrecognized operators rather than writing to
std::cerr.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand force-pushed the pdf-string-escape-cid-fixes branch from 37df8a7 to 716995d Compare July 4, 2026 21:42
andiwand and others added 4 commits July 4, 2026 23:51
An embedded Type0 /Encoding CMap that inherits a base via `usecmap`
(e.g. `/Identity-H usecmap`) and then declares only a partial local
codespacerange would have its partial codespace treated as
authoritative by Font::codes(), so inherited 2-byte codes fell through
code_width() to the 1-byte fallback and were mis-split.

We don't resolve the inherited base CMap, so flag the `usecmap` operator
at parse time and exclude an inherited codespace from has_codespace().
Callers then fall back to the ToUnicode codespace / fixed 2-byte width
(the pre-CID-encoding behavior). Local cidchar/cidrange overrides are
still applied; unmapped codes fall through to identity, correct for an
Identity-H base.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the paired `const char *position, const char *end` cursor with a
single `string_view` that shrinks as codes are consumed. Bundling the two
pointers removes the mismatched-position/end footgun and lets `settle()`
use `size()` directly instead of pointer subtraction. `begin()`/`end()` now
pass the view (and an empty view at its tail) rather than reaching into
`data()`, dropping both suspicious-stringview-data-usage suppressions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand merged commit ea21756 into main Jul 5, 2026
11 checks passed
@andiwand andiwand deleted the pdf-string-escape-cid-fixes branch July 5, 2026 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant