PDF: fix garbled/misplaced text — string escapes, CID encoding, substitute baseline by andiwand · Pull Request #586 · opendocument-app/OpenDocument.core

andiwand · 2026-07-04T21:36:24Z

Three related fixes to the in-house PDF engine (DecoderEngine::odr), surfaced by fontfile3_opentype.pdf (and its encrypted twin), where lots of glyphs rendered as .notdef boxes, some text was missing/garbled, and the large "PDFClown" title sat well below the page bottom.

1. Decode reverse-solidus escapes in string literals

The literal-string parser appended the raw character after a backslash for non-octal escapes, so \n \r \t \b \f yielded the letters instead of their control bytes (ISO 32000-1 7.3.4.2, Table 3). Any glyph whose 2-byte CFF CID code had one of those low bytes decoded to the wrong code → .notdef boxes and garbled selectable text. Now translates the full escape set + \-before-EOL line continuation. New PdfObjectParser tests cover it. (notdef boxes 173 → 0.)

2. Parse composite `/Encoding` CMap for CID mapping

An embedded Type0 /Encoding CMap stream is now parsed (cidchar/cidrange → Font::cid_encoding), so Font::codes() maps char codes to CIDs through it and uses the CMap's authoritative codespace. Fixes producers that mix a 1-byte code (e.g. a space) among 2-byte CIDs — previously the space fell through identity to the wrong glyph and advance.

3. Pin substitute-font baseline with `@font-face` metric overrides

A non-embedded font renders in a local system font whose hhea/OS-2 metrics we don't control, so under line-height:1 the browser placed the baseline at that font's ascent rather than the ascent_em used to compute top — dropping e.g. the 120pt Times title below the page bottom. Each substitute now routes through a generated @font-face ('odr-sN', src: local(...)) carrying ascent-override/descent-override (summing to 1em) and line-gap-override:0, so the baseline is positioned from our metric. Faces deduped by (family, style, ascent); the family stack is kept as a fallback.

4. Log unknown graphics operators via `Logger`

GraphicsOperatorParser now takes a Logger (mirroring CMapParser) and emits ODR_DEBUG for unrecognized operators instead of writing to std::cerr.

All 228 PDF unit tests pass. Reference-output snapshot regeneration is left to the maintainer's update-ref workflow.

🤖 Generated with Claude Code

The literal-string parser appended the raw character after a backslash for non-octal escapes, so `\n \r \t \b \f` yielded the letters 'n' 'r' 't' 'b' 'f' instead of their control bytes (ISO 32000-1 7.3.4.2, Table 3). Any glyph whose byte code had one of those low bytes (common in 2-byte CFF CID codes) decoded to the wrong code, producing .notdef boxes and garbled selectable text. Translate the full escape set and handle `\`-before-EOL line continuation. Covered by new PdfObjectParser string tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

An embedded Type0 /Encoding CMap stream is now parsed (cidchar/cidrange -> Font::cid_encoding) so Font::codes() maps char codes to CIDs through it and uses the CMap's authoritative codespace. This fixes producers that mix a 1-byte code (e.g. a space) among 2-byte CIDs: without the CMap the space fell through identity to the wrong glyph and advance. CMap gains cid_for_code / add_cid_range / has_cid_map; the parser gains read_cidchar / read_cidrange. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@font-face

A non-embedded font renders in a local system font whose hhea/OS-2 metrics we don't control, so under line-height:1 the browser placed the glyph baseline at that font's ascent rather than the ascent_em we use to compute `top` — dropping e.g. the 120pt Times title well below the page bottom. Route each substitute through a generated @font-face ('odr-sN', src: local(...) of the family stack) carrying ascent-override/descent-override (summing to 1em) and line-gap-override:0, so the baseline is positioned from our metric. Faces are deduped by (family, style, ascent); the family stack is kept as a fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 156258fde6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Thread a Logger into GraphicsOperatorParser (mirroring CMapParser) and emit ODR_DEBUG for unrecognized operators rather than writing to std::cerr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

An embedded Type0 /Encoding CMap that inherits a base via `usecmap` (e.g. `/Identity-H usecmap`) and then declares only a partial local codespacerange would have its partial codespace treated as authoritative by Font::codes(), so inherited 2-byte codes fell through code_width() to the 1-byte fallback and were mis-split. We don't resolve the inherited base CMap, so flag the `usecmap` operator at parse time and exclude an inherited codespace from has_codespace(). Callers then fall back to the ToUnicode codespace / fixed 2-byte width (the pre-CID-encoding behavior). Local cidchar/cidrange overrides are still applied; unmapped codes fall through to identity, correct for an Identity-H base. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replace the paired `const char *position, const char *end` cursor with a single `string_view` that shrinks as codes are consumed. Bundling the two pointers removes the mismatched-position/end footgun and lets `settle()` use `size()` directly instead of pointer subtraction. `begin()`/`end()` now pass the view (and an empty view at its tail) rather than reaching into `data()`, dropping both suspicious-stringview-data-usage suppressions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

andiwand and others added 3 commits July 4, 2026 23:31

chatgpt-codex-connector Bot reviewed Jul 4, 2026

View reviewed changes

Comment thread src/odr/internal/pdf/pdf_cmap_parser.cpp

PDF: log unknown graphics operators via Logger instead of std::cerr

716995d

Thread a Logger into GraphicsOperatorParser (mirroring CMapParser) and emit ODR_DEBUG for unrecognized operators rather than writing to std::cerr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

andiwand force-pushed the pdf-string-escape-cid-fixes branch from 37df8a7 to 716995d Compare July 4, 2026 21:42

andiwand and others added 4 commits July 4, 2026 23:51

update refs

495eb2b

PDF: use doc-comment style for CidRange field comments

2cbe092

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

andiwand merged commit ea21756 into main Jul 5, 2026
11 checks passed

andiwand deleted the pdf-string-escape-cid-fixes branch July 5, 2026 07:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PDF: fix garbled/misplaced text — string escapes, CID encoding, substitute baseline#586

PDF: fix garbled/misplaced text — string escapes, CID encoding, substitute baseline#586
andiwand merged 8 commits into
mainfrom
pdf-string-escape-cid-fixes

andiwand commented Jul 4, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

andiwand commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Decode reverse-solidus escapes in string literals

2. Parse composite /Encoding CMap for CID mapping

3. Pin substitute-font baseline with @font-face metric overrides

4. Log unknown graphics operators via Logger

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

andiwand commented Jul 4, 2026 •

edited

Loading

2. Parse composite `/Encoding` CMap for CID mapping

3. Pin substitute-font baseline with `@font-face` metric overrides

4. Log unknown graphics operators via `Logger`