Rewrite the parquet input adapter manager #704
Conversation
Force-pushed from a40d7a1 to bc4b134
Force-pushed from bc4b134 to 5122477
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
…iguity

The introduction of namespace csp::adapters::arrow (for the new ColumnDispatcher/RecordBatchRowProcessor classes) creates ambiguity when writer-side headers use unqualified arrow:: inside namespace csp::adapters::parquet: the compiler finds the sibling csp::adapters::arrow namespace before the global ::arrow namespace. Also forward-declares ColumnDispatcher and RecordBatchRowProcessor in ParquetInputAdapterManager.h (moving the full includes to the .cpp), and adds direct includes for csp/core/Exception.h and arrow/table.h that were previously provided transitively through the now-deleted reader headers.

Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
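The lookup problem described in the commit message can be reproduced in isolation. This is a minimal sketch with stand-in `which()` functions (not part of the real codebase): unqualified lookup searches the enclosing namespaces outward, so inside `csp::adapters::parquet` the name `arrow` finds the sibling `csp::adapters::arrow` before the global `::arrow`, which is why the explicit `::arrow::` qualification is needed.

```cpp
namespace arrow { inline int which() { return 0; } }  // stand-in for the global ::arrow

namespace csp::adapters::arrow { inline int which() { return 1; } }  // new sibling namespace

namespace csp::adapters::parquet
{
// Unqualified "arrow::" resolves to the sibling csp::adapters::arrow,
// because unqualified lookup reaches csp::adapters before the global scope.
inline int unqualified() { return arrow::which(); }

// Leading "::" forces lookup to start at the global namespace.
inline int qualified() { return ::arrow::which(); }
}
```
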
Force-pushed from 86d9f43 to 2ac4688
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
    virtual std::shared_ptr<arrow::DataType> getDataType() = 0;
    virtual std::shared_ptr<arrow::ArrayBuilder> getBuilder() = 0;
    virtual std::shared_ptr<::arrow::DataType> getDataType() = 0;
All the :: additions are kind of noisy here
    ColumnAdapterReference m_valueCountColumn;
    std::unique_ptr<ParquetReader> m_reader;
    std::string m_basketName;
    std::string m_basketSymbolColumn;
This member doesn't seem to be used at all, it's just written to in `setupDictBaskets`
    return getOrCreateStructColumnAdapter( m_simInputAdapters, type, symbol, dictFieldMap, pushMode );
    }
    CSP_THROW( RuntimeException, "Reached unreachable code" );
    properties.get<std::string>( "field_map" );
Why is this line here? Not added in your PR but seems like it was a mistake to begin with
    for( auto && record : m_dictBasketReaders )
    {
        auto numValues = record.getValueCount();
        const char * phase = dispatch ? "dispatch" : "skip";
This string doesn't need to be created per loop iteration, can move to function level scope
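The suggested hoisting looks like this. A minimal sketch with an invented `processRecords` stand-in for the surrounding function: since `phase` depends only on `dispatch`, it is loop-invariant and can be computed once at function scope.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the loop in the snippet above; the real code
// iterates m_dictBasketReaders, here plain ints suffice to show the hoist.
std::string processRecords( const std::vector<int> & records, bool dispatch )
{
    const char * phase = dispatch ? "dispatch" : "skip";  // hoisted: loop-invariant
    std::string log;
    for( auto record : records )
        log += std::string( phase ) + ":" + std::to_string( record ) + ";";
    return log;
}
```
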
    []( const ::arrow::HalfFloatArray & arr, int64_t i ) -> double {
        return ::arrow::util::Float16::FromBits( arr.Value( i ) ).ToDouble();
    } );
    case ::arrow::Type::STRING:
The lambdas for STRING, LARGE_STRING, BINARY, LARGE_BINARY, FIXED_SIZE_BINARY are all exactly the same except the arg type, can you just define it once and templatize it? Might just be able to declare the arr arg as auto too and put all of them as a single case
    }

    // Pull first non-empty batch
    for( ;; )
Isn't this just the same logic as fetchNextBatch directly below? Can probably just pass a default constructed entry to it and not need to duplicate the logic
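A sketch of the de-duplication the comment proposes, with invented stand-in types (`Batch`, `SourceEntry`, `fetchNextBatch` here are illustrative, not the real API): if `fetchNextBatch` already loops until it finds a non-empty batch or the stream ends, then "pull first non-empty batch" is just a call on a freshly initialized entry, with no separate `for( ;; )` loop.

```cpp
#include <cstddef>
#include <vector>

struct Batch { int rows; };  // stand-in for an arrow::RecordBatch

struct SourceEntry
{
    std::vector<Batch> batches;            // pending batches from the stream
    std::size_t next = 0;                  // index of the next batch to pull
    const Batch * currentBatch = nullptr;
};

// Advances to the next non-empty batch; returns false when exhausted.
// Calling this on a default-initialized entry pulls the first batch too,
// so the startup loop in the snippet above becomes redundant.
bool fetchNextBatch( SourceEntry & entry )
{
    while( entry.next < entry.batches.size() )
    {
        const Batch & b = entry.batches[entry.next++];
        if( b.rows > 0 )
        {
            entry.currentBatch = &b;
            return true;
        }
    }
    return false;
}
```
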
    void RecordBatchRowProcessor::rebindSource( SourceEntry & entry )
    {
        for( size_t i = 0; i < entry.dispatchers.size(); ++i )
            entry.dispatchers[i] -> bindColumn( entry.currentBatch -> column( entry.colIndices[i] ).get() );
Compiler may do this anyway, but you can get `const auto & cols = entry.currentBatch->columns()` before the loop and then just index into it within the loop and avoid some indirections
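The hoisted version might look like the following. A sketch with minimal stand-in types (`Array`, `Batch`, `Dispatcher`, `SourceEntry` are invented here; the real `arrow::RecordBatch::columns()` likewise returns a const reference to the column vector, so this fetches it once instead of calling `column()` per iteration.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Array {};  // stand-in for arrow::Array

struct Batch      // stand-in for arrow::RecordBatch
{
    std::vector<std::shared_ptr<Array>> cols;
    const std::vector<std::shared_ptr<Array>> & columns() const { return cols; }
};

struct Dispatcher
{
    const Array * bound = nullptr;
    void bindColumn( const Array * a ) { bound = a; }
};

struct SourceEntry
{
    std::shared_ptr<Batch> currentBatch;
    std::vector<Dispatcher *> dispatchers;
    std::vector<std::size_t> colIndices;
};

void rebindSource( SourceEntry & entry )
{
    // One call to columns(), then plain vector indexing inside the loop.
    const auto & cols = entry.currentBatch -> columns();
    for( std::size_t i = 0; i < entry.dispatchers.size(); ++i )
        entry.dispatchers[i] -> bindColumn( cols[entry.colIndices[i]].get() );
}
```
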
    properties.tryGet( "time_shift", m_time_shift );

    CSP_TRUE_OR_THROW_RUNTIME( m_timeColumn != "", "Time column can't be empty" );
    CSP_TRUE_OR_THROW_RUNTIME( m_defaultTimezone == "UTC",
Why even accept tz as an argument then?
    raise TypeError("CSP Cannot load binary arrows derived from pyarrow versions less than 4.0.1")
    wrapped = self._filenames_gen
    self._filenames_gen = lambda starttime, endtime: self._arrow_c_data_interface(wrapped, starttime, endtime)
    self._table_gen = self._filenames_gen  # Alias: memory-table path uses _table_gen
What's this about? I don't see why we need the two of these
    void doReadNextValue( int64_t row, void * optionalOut ) override
    {
        auto & out = *static_cast<std::optional<ValueT> *>( optionalOut );
        auto & typed = static_cast<const ArrowArrayT &>( *this -> m_column );
Can replace 196-200 with `if( !doExtract( row, out ) ) out.reset();`
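The suggested shape, sketched with an invented `ReaderStub` (the real class and its `doExtract` signature may differ): `doExtract` reports success and writes the value, so the null/failure path collapses to a one-line reset of the optional.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the reader in the snippet above.
template<typename ValueT>
struct ReaderStub
{
    bool   valid;
    ValueT value;

    // Returns true and fills `out` on success, false when the cell is null.
    bool doExtract( int64_t /*row*/, std::optional<ValueT> & out ) const
    {
        if( !valid )
            return false;
        out = value;
        return true;
    }

    void doReadNextValue( int64_t row, void * optionalOut ) const
    {
        auto & out = *static_cast<std::optional<ValueT> *>( optionalOut );
        if( !doExtract( row, out ) )
            out.reset();
    }
};
```
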
Rewrite the parquet input adapter for RecordBatch-based streaming
Replaces the old `ParquetReader`/`ParquetReaderColumnAdapter` class hierarchy with a new three-layer architecture that operates on Arrow `RecordBatch` data directly. The new design is simpler (fewer virtual calls, no per-type reader subclasses) and supports reading from parquet files, Arrow IPC streams, and in-memory Arrow Tables through a unified `RecordBatchStreamSource` interface.

Motivation
The old implementation had:

- a deep class hierarchy (`FileReaderWrapper` → `ParquetFileReaderWrapper`, per-type `ReaderColumnAdapter` subclasses) that was hard to extend
- no way to consume a `RecordBatchReader` from external sources

The new implementation:

- introduces a `RecordBatchStreamSource` interface that cleanly separates file management from row processing

Architecture
- `RecordBatchStreamSource` (new interface) abstracts file iteration with two implementations:
  - `NativeParquetStreamSource` — C++ opens parquet files directly with leaf-level column projection
  - `PyRecordBatchStreamSource` — Python yields `RecordBatchReader` objects via the Arrow C Stream Interface (IPC, memory tables)
- `RecordBatchRowProcessor` (new) binds to N `RecordBatchReader*` (one per split-column file) and provides `readRowAndAdvance()` / `skipRow()` / `dispatchRow()`. Validates split-column alignment at runtime.
- `ColumnDispatcher` (new) is a type-erased wrapper combining `FieldReader` + value storage + `ValueDispatcher`, one per subscribed column.
- `ParquetInputAdapterManager` is rewritten to orchestrate via the above layers. It no longer touches Arrow arrays directly.

What's removed
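To make the layering concrete, here is a minimal sketch of what the source interface could look like. Everything beyond the names quoted in the description (`RecordBatchStreamSource`, `NativeParquetStreamSource`, `PyRecordBatchStreamSource`) is invented for illustration, including `RecordBatchStub`, `nextBatch()`, and the in-memory test source; the real interface may differ.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct RecordBatchStub { int numRows; };  // stand-in for arrow::RecordBatch

// Abstracts "give me the next batch" regardless of where batches come from;
// NativeParquetStreamSource and PyRecordBatchStreamSource would be the two
// concrete implementations described above.
class RecordBatchStreamSource
{
public:
    virtual ~RecordBatchStreamSource() = default;
    virtual std::shared_ptr<RecordBatchStub> nextBatch() = 0;  // nullptr = exhausted
};

// Tiny in-memory implementation used here only to exercise the interface.
class VectorStreamSource : public RecordBatchStreamSource
{
public:
    explicit VectorStreamSource( std::vector<int> rowCounts ) : m_rows( std::move( rowCounts ) ) {}

    std::shared_ptr<RecordBatchStub> nextBatch() override
    {
        if( m_next >= m_rows.size() )
            return nullptr;
        return std::make_shared<RecordBatchStub>( RecordBatchStub{ m_rows[m_next++] } );
    }

private:
    std::vector<int> m_rows;
    std::size_t m_next = 0;
};
```
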
- `ParquetReader`/`ParquetReaderColumnAdapter` (~2500 lines) — the old per-type reader class hierarchy
- `FileReaderWrapper`/`ParquetFileReaderWrapper`/`ArrowIPCFileReaderWrapper` — old file abstractions
- `DialectGenericListReaderInterface` — unused reader interface
- the `m_rbSources` member in `DictBasketReaderRecord` (declared/cleared but never populated)

Bug fixes
- Fixed `countLeafColumns()` to correctly expand struct fields into all their leaf indices.
- With `allow_missing_columns=True`, the adapter now throws a clear `RuntimeError` instead of segfaulting.

Performance
Benchmarked on Linux (Python 3.13, Arrow 23.0.1, GCC 14.3). Two suites:

- `bench_large` measures full-file consumption (1M–5M ticks, all rows read)
- `bench_comprehensive` measures partial reads (86K ticks from larger files, realistic time-windowed workloads)

Full-file: 3 faster, 0 slower, 20 unchanged. Partial-file: 22 faster, 0 slower, 13 unchanged (±3% threshold).
Key wins come from RecordBatch-level column projection (skips unrequested columns entirely), the `InlineReader` zero-overhead hot loop (verified by assembly to match hand-written typed access), and struct bulk-read eliminating per-field virtual dispatch.

API compatibility
The public Python API (`ParquetReader.subscribe`, `subscribe_all`, `subscribe_dict_basket`) is unchanged. All 128 existing + new tests pass (covering all Arrow types, null handling, struct projection, split columns, dict baskets, multi-file, IPC, and in-memory tables).