Summary
table_in_out/echo/nested_type_combinations.test segfaults the C++ test harness mid-run after the dict-batch fixes landed in vgi-rpc-java (880a5e4 / bdccadc / 5cd91f0). Bisected to a single query at lines 495-501 of the test file:
SET arrow_lossless_conversion = true;
CREATE TYPE mood AS ENUM ('happy','sad','neutral');
SELECT l FROM example.echo((
SELECT ['happy'::mood, 'sad'::mood, 'neutral'::mood] AS l
));
Previously this test silently returned 0 rows (because the worker's wire stream was being rejected as malformed and read as EOS). The dict-batch fix makes the worker emit data the C++ side actually consumes — and the consumption then crashes on this case.
Root cause
Traced via wire-byte capture and a pyarrow.ipc.open_stream diff against the Python reference worker.
With arrow_lossless_conversion = true, DuckDB sends list<enum> to the worker as:
list<sparse_union<varchar: dictionary<values=string, indices=uint8>=24, uint1: uint8=33>>
This is a sparse-union-tagged element type in which each value carries a type id (24 = the dict-encoded varchar, 33 = a bit). The tagging preserves enum-value-vs-NULL identity losslessly across the Arrow boundary.
| Worker | Wire schema emitted by worker | Element values |
|---|---|---|
| Python worker (passes) | list<dictionary<values=string, indices=uint8>> | [['happy', 'sad', 'neutral']] |
| Java worker (segfaults) | list<sparse_union<varchar: dict<...>=24, uint1: uint8=33>> | [[7, 0, 0]] |
Python's worker collapses the sparse-union back to plain dict-encoded values before emit. Java's EchoFunction TransferPair-passes the input vectors through unchanged, so the sparse union survives to the output — but the bind-time output schema declared list<dict<...>> (no union), so the wire shape contradicts the declared schema.
DuckDB then reads the response, sees list<sparse_union<...>> where it expected list<dict<...>>, and segfaults inside its Arrow → DuckDB converter when it tries to decode the unexpected sparse-union child.
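The schema diff described above can be modelled as a recursive comparison of two type trees. This is only a sketch: `ArrowType` and `first_mismatch` are hypothetical stand-ins written for illustration, not pyarrow or Arrow Java APIs, and the list child name `item` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArrowType:
    """Toy stand-in for an Arrow type: a kind plus named children (hypothetical)."""
    kind: str               # e.g. "list", "dictionary", "sparse_union"
    children: tuple = ()    # tuple of (name, ArrowType) pairs


def first_mismatch(declared: ArrowType, actual: ArrowType, path: str = "l") -> Optional[str]:
    """Describe the first point where the two type trees differ, or return None."""
    if declared.kind != actual.kind:
        return f"{path}: declared {declared.kind}, wire has {actual.kind}"
    if len(declared.children) != len(actual.children):
        return (f"{path}: declared {len(declared.children)} children, "
                f"wire has {len(actual.children)}")
    for (d_name, d_type), (_a_name, a_type) in zip(declared.children, actual.children):
        hit = first_mismatch(d_type, a_type, f"{path}.{d_name}")
        if hit is not None:
            return hit
    return None


# dictionary<values=string, indices=uint8>, modelled as a leaf
DICT = ArrowType("dictionary")
# sparse_union<varchar: dict<...>=24, uint1: uint8=33>
UNION = ArrowType("sparse_union", (("varchar", DICT), ("uint1", ArrowType("uint8"))))

declared = ArrowType("list", (("item", DICT),))      # bind-time output schema
python_wire = ArrowType("list", (("item", DICT),))   # shape the Python worker emits
java_wire = ArrowType("list", (("item", UNION),))    # shape the Java worker emits

print(first_mismatch(declared, python_wire))
print(first_mismatch(declared, java_wire))
```

Run against the two shapes in the table, the comparison reports no difference for the Python worker and flags the list element for the Java worker, which is the child DuckDB fails to decode.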
Reproducer
require-env VGI_TEST_WORKER
require vgi
require httpfs
statement ok
SET arrow_lossless_conversion = true;
statement ok
ATTACH 'example' AS example (TYPE vgi, LOCATION '${VGI_TEST_WORKER}');
statement ok
CREATE TYPE mood AS ENUM ('happy','sad','neutral');
query I
SELECT l FROM example.echo((SELECT ['happy'::mood, 'sad'::mood, 'neutral'::mood] AS l));
----
[happy, sad, neutral]
Where to fix
Fixture-side handling of lossless-tagged inputs in EchoFunction (and any other table-in-out function, TIO for short, that passes its input through). The flow needs to:
- Detect when an input vector's wire-format Field carries a sparse_union whose children are the lossless tagging shape (one dict-encoded variant + one tag-only variant).
- Re-collapse it to the declared output schema's type (dict-encoded) before emit.
Probably belongs in a shared helper in vgi-core/src/main/java/farm/query/vgi/internal/ so other TIO fixtures that handle dict-encoded nested data can reuse it. Will need test coverage for at least: enum-in-list, enum-in-struct, enum-in-map, list-of-enum-in-struct.
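The detect-and-collapse steps listed above could look roughly like the following. It is a language-neutral model in plain Python, not the Java helper itself: `SparseUnionColumn`, `is_lossless_tagging_shape`, and `collapse` are hypothetical names, the dict child is shown already decoded to strings, and mapping tag-only slots to NULL is an assumption that would need checking against DuckDB's lossless conversion.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

DICT_TYPE_ID = 24   # type id of the dict-encoded varchar child, per the issue
TAG_TYPE_ID = 33    # type id of the tag-only uint1 child, per the issue


@dataclass
class SparseUnionColumn:
    """Toy sparse union: one type id per slot, every child as long as the union."""
    type_ids: List[int]
    children: Dict[int, list]   # type id -> child values, e.g. {24: [...], 33: [...]}


def is_lossless_tagging_shape(col: SparseUnionColumn) -> bool:
    """Step 1: exactly one dict-encoded child and one tag-only child."""
    return set(col.children) == {DICT_TYPE_ID, TAG_TYPE_ID}


def collapse(col: SparseUnionColumn) -> List[Optional[str]]:
    """Step 2: re-collapse the union to values of the declared dict-encoded type."""
    if not is_lossless_tagging_shape(col):
        raise ValueError("not a lossless-tagged sparse union; refusing to collapse")
    dict_child = col.children[DICT_TYPE_ID]
    out: List[Optional[str]] = []
    for slot, type_id in enumerate(col.type_ids):
        if type_id == DICT_TYPE_ID:
            # In a sparse union the child index equals the union index.
            out.append(dict_child[slot])
        elif type_id == TAG_TYPE_ID:
            # ASSUMPTION: a tag-only slot collapses to NULL in the output.
            out.append(None)
        else:
            raise ValueError(f"unexpected type id {type_id} at slot {slot}")
    return out


col = SparseUnionColumn(
    type_ids=[24, 24, 24],
    children={24: ["happy", "sad", "neutral"], 33: [0, 0, 0]},
)
print(collapse(col))
```

Refusing to collapse anything that is not the exact two-child shape keeps genuine user-defined unions untouched, which matters if the helper is shared across fixtures.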
Workaround
CLAUDE.md documents excluding this test from integration runs:
find ~/Development/vgi/test/sql/integration -name '*.test' \
-not -path '*/writable/*' -not -path '*/simple_writable/*' \
-not -name 'nested_type_combinations.test' \
| sort > /tmp/intest.txt
Related
- vgi-rpc-java commits 880a5e4, bdccadc, 5cd91f0: the dict-encoded round-trip fixes that unblocked filter_pushdown/enums.test and table_in_out/echo/all_types.test and, in doing so, exposed this one.
- Originally one of the 17 failures listed in CLAUDE.md state-of-play.
🤖 Generated with Claude Code