
nested_type_combinations.test: DuckDB segfaults on list<enum> with arrow_lossless_conversion=true #1

@rustyconover


Summary

table_in_out/echo/nested_type_combinations.test segfaults the C++ test harness mid-run now that the dict-batch fixes have landed in vgi-rpc-java (880a5e4 / bdccadc / 5cd91f0). Bisected to a single query at lines 495-501 of the test file:

SET arrow_lossless_conversion = true;
CREATE TYPE mood AS ENUM ('happy','sad','neutral');

SELECT l FROM example.echo((
    SELECT ['happy'::mood, 'sad'::mood, 'neutral'::mood] AS l
));

Previously this test silently returned 0 rows (because the worker's wire stream was being rejected as malformed and read as EOS). The dict-batch fix makes the worker emit data the C++ side actually consumes — and the consumption then crashes on this case.

Root cause

Traced via wire-byte capture and a pyarrow.ipc.open_stream diff against the Python reference worker.

With arrow_lossless_conversion = true, DuckDB sends list<enum> to the worker as:

list<sparse_union<varchar: dictionary<values=string, indices=uint8>=24,
                  uint1: uint8=33>>

— a sparse-union-tagged element type where each value carries a type id (24 = the dict-encoded varchar, 33 = a bit). This preserves enum-value-vs-NULL identity losslessly across the Arrow boundary.

| Worker | Wire schema emitted by worker | Element values |
| --- | --- | --- |
| Python worker (passes) | list<dictionary<values=string, indices=uint8>> | [['happy', 'sad', 'neutral']] |
| Java worker (segfaults) | list<sparse_union<varchar: dict<...>=24, uint1: uint8=33>> | [[7, 0, 0]] |

Python's worker collapses the sparse-union back to plain dict-encoded values before emit. Java's EchoFunction TransferPair-passes the input vectors through unchanged, so the sparse union survives to the output — but the bind-time output schema declared list<dict<...>> (no union), so the wire shape contradicts the declared schema.

DuckDB then reads the response, sees list<sparse_union<...>> where it expected list<dict<...>>, and segfaults inside its Arrow → DuckDB converter when it tries to decode the unexpected sparse-union child.
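The mismatch described above can be modelled without Arrow at all. The following is a minimal sketch, not code from vgi or DuckDB: nested tuples stand in for Arrow types, and `render` and `first_mismatch` are hypothetical names. It shows the kind of declared-versus-actual check that would turn this case into a readable error rather than a crash.

```python
# Toy model: nested tuples stand in for Arrow types. This is NOT the Arrow API.
DICT = ("dict", "string", "uint8")
UNION = ("sparse_union", (("varchar", 24, DICT), ("uint1", 33, ("uint8",))))


def render(t):
    """Format a toy type in the notation used in this issue."""
    kind = t[0]
    if kind == "list":
        return f"list<{render(t[1])}>"
    if kind == "dict":
        return f"dictionary<values={t[1]}, indices={t[2]}>"
    if kind == "sparse_union":
        inner = ", ".join(f"{name}: {render(child)}={type_id}"
                          for name, type_id, child in t[1])
        return f"sparse_union<{inner}>"
    return kind  # primitive such as ("uint8",)


def first_mismatch(declared, actual, path):
    """Return a description of the first divergence, or None if the types agree."""
    if declared[0] != actual[0]:
        return f"{path}: declared {render(declared)}, got {render(actual)}"
    if declared[0] == "list":
        return first_mismatch(declared[1], actual[1], path + ".item")
    if declared != actual:
        return f"{path}: declared {render(declared)}, got {render(actual)}"
    return None


declared = ("list", DICT)   # bind-time output schema
actual = ("list", UNION)    # what the Java worker put on the wire
print(first_mismatch(declared, actual, "l"))
```

Running such a check on the response schema before decoding would report the divergence at `l.item`. Whether the guard belongs in the worker, the extension, or both is a design decision this sketch does not settle.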

Reproducer

require-env VGI_TEST_WORKER
require vgi
require httpfs

statement ok
SET arrow_lossless_conversion = true;

statement ok
ATTACH 'example' AS example (TYPE vgi, LOCATION '${VGI_TEST_WORKER}');

statement ok
CREATE TYPE mood AS ENUM ('happy','sad','neutral');

query I
SELECT l FROM example.echo((SELECT ['happy'::mood, 'sad'::mood, 'neutral'::mood] AS l));
----
[happy, sad, neutral]

Where to fix

Fixture-side handling of lossless-tagged inputs in EchoFunction (and any other table-in-out (TIO) function that passes its input through). The flow needs to:

  1. Detect when an input vector's wire-format Field carries a sparse_union whose children are the lossless tagging shape (one dict-encoded variant + one tag-only variant).
  2. Re-collapse it to the declared output schema's type (dict-encoded) before emit.

Probably belongs in a shared helper in vgi-core/src/main/java/farm/query/vgi/internal/ so other TIO fixtures that handle dict-encoded nested data can reuse it. Will need test coverage for at least: enum-in-list, enum-in-struct, enum-in-map, list-of-enum-in-struct.
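As a rough illustration of the two steps and the four coverage shapes, here is a type-level sketch over a toy model. Everything in it is an assumption made for illustration: nested tuples stand in for Arrow fields, the tag-only variant is modelled as a plain `uint8`, and `is_lossless_tag_union` and `collapse` are hypothetical names, not existing vgi-core helpers. A real helper would also have to rewrite the value buffers, which this sketch does not attempt.

```python
# Toy model: nested tuples stand in for Arrow types. This is NOT the Arrow API.
DICT = ("dict", "string", "uint8")
TAGGED = ("sparse_union", (("varchar", 24, DICT), ("uint1", 33, ("uint8",))))


def is_lossless_tag_union(t):
    """Step 1: a sparse union with one dict-encoded and one tag-only variant."""
    if t[0] != "sparse_union" or len(t[1]) != 2:
        return False
    kinds = sorted(child[0] for _, _, child in t[1])
    return kinds == ["dict", "uint8"]


def collapse(t):
    """Step 2: replace every tagging union with its dict-encoded variant."""
    if is_lossless_tag_union(t):
        return next(child for _, _, child in t[1] if child[0] == "dict")
    kind = t[0]
    if kind == "list":
        return ("list", collapse(t[1]))
    if kind == "struct":
        return ("struct", tuple((name, collapse(child)) for name, child in t[1]))
    if kind == "map":
        return ("map", collapse(t[1]), collapse(t[2]))
    return t  # primitives, plain dicts, and unrelated unions pass through


# The four shapes called out above, as (wire type, declared type) pairs.
cases = {
    "enum-in-list": (("list", TAGGED), ("list", DICT)),
    "enum-in-struct": (("struct", (("m", TAGGED),)), ("struct", (("m", DICT),))),
    "enum-in-map": (("map", ("string",), TAGGED), ("map", ("string",), DICT)),
    "list-of-enum-in-struct": (
        ("struct", (("ms", ("list", TAGGED)),)),
        ("struct", (("ms", ("list", DICT)),)),
    ),
}
for name, (wire, declared) in cases.items():
    assert collapse(wire) == declared, name
```

Matching on the exact two-variant shape, rather than on any sparse union, keeps genuine user-level unions intact. A stricter variant could also check the type ids (24 and 33 in the capture above) if those turn out to be stable.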

Workaround

CLAUDE.md documents excluding this test from integration runs:

find ~/Development/vgi/test/sql/integration -name '*.test' \
  -not -path '*/writable/*' -not -path '*/simple_writable/*' \
  -not -name 'nested_type_combinations.test' \
  | sort > /tmp/intest.txt

Related

  • vgi-rpc-java commits 880a5e4, bdccadc, 5cd91f0: the dict-encoded round-trip fixes that unblocked filter_pushdown/enums.test and table_in_out/echo/all_types.test, and in doing so exposed this one.
  • Originally one of the 17 failures listed in CLAUDE.md state-of-play.

🤖 Generated with Claude Code
