Rewrite the parquet input adapter manager #704
Conversation
Force-pushed from a40d7a1 to bc4b134
Force-pushed from bc4b134 to 5122477
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
…iguity

The introduction of namespace csp::adapters::arrow (for the new ColumnDispatcher/RecordBatchRowProcessor classes) creates ambiguity when writer-side headers use unqualified arrow:: inside namespace csp::adapters::parquet: the compiler finds the sibling csp::adapters::arrow namespace before the global ::arrow namespace. Also forward-declares ColumnDispatcher and RecordBatchRowProcessor in ParquetInputAdapterManager.h (moving the full includes to the .cpp), and adds direct includes for csp/core/Exception.h and arrow/table.h that were previously provided transitively through the now-deleted reader headers.

Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
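The lookup problem described in the commit message can be reproduced in isolation. This is a minimal sketch with stand-in `which()` functions (not part of the real codebase): unqualified lookup searches the enclosing namespaces outward, so inside `csp::adapters::parquet` the name `arrow` finds the sibling `csp::adapters::arrow` before the global `::arrow`, which is why the explicit `::arrow::` qualification is needed.

```cpp
namespace arrow { inline int which() { return 0; } }  // stand-in for the global ::arrow

namespace csp::adapters::arrow { inline int which() { return 1; } }  // new sibling namespace

namespace csp::adapters::parquet
{
// Unqualified "arrow::" resolves to the sibling csp::adapters::arrow,
// because unqualified lookup reaches csp::adapters before the global scope.
inline int unqualified() { return arrow::which(); }

// Leading "::" forces lookup to start at the global namespace.
inline int qualified() { return ::arrow::which(); }
}
```
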
Force-pushed from 86d9f43 to 2ac4688
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
    virtual std::shared_ptr<arrow::DataType> getDataType() = 0;
    virtual std::shared_ptr<arrow::ArrayBuilder> getBuilder() = 0;
    virtual std::shared_ptr<::arrow::DataType> getDataType() = 0;
All the :: additions are kind of noisy here
    ColumnAdapterReference m_valueCountColumn;
    std::unique_ptr<ParquetReader> m_reader;
    std::string m_basketName;
    std::string m_basketSymbolColumn;
This member doesn't seem to be used at all, it's just written to in `setupDictBaskets`
    return getOrCreateStructColumnAdapter( m_simInputAdapters, type, symbol, dictFieldMap, pushMode );
    }
    CSP_THROW( RuntimeException, "Reached unreachable code" );
    properties.get<std::string>( "field_map" );
Why is this line here? Not added in your PR but seems like it was a mistake to begin with
    for( auto && record : m_dictBasketReaders )
    {
        auto numValues = record.getValueCount();
        const char * phase = dispatch ? "dispatch" : "skip";
This string doesn't need to be created per loop iteration, can move to function level scope
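The suggested hoisting looks like this. A minimal sketch with an invented `processRecords` stand-in for the surrounding function: since `phase` depends only on `dispatch`, it is loop-invariant and can be computed once at function scope.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the loop in the snippet above; the real code
// iterates m_dictBasketReaders, here plain ints suffice to show the hoist.
std::string processRecords( const std::vector<int> & records, bool dispatch )
{
    const char * phase = dispatch ? "dispatch" : "skip";  // hoisted: loop-invariant
    std::string log;
    for( auto record : records )
        log += std::string( phase ) + ":" + std::to_string( record ) + ";";
    return log;
}
```
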
    []( const ::arrow::HalfFloatArray & arr, int64_t i ) -> double {
        return ::arrow::util::Float16::FromBits( arr.Value( i ) ).ToDouble();
    } );
    case ::arrow::Type::STRING:
The lambdas for STRING, LARGE_STRING, BINARY, LARGE_BINARY, FIXED_SIZE_BINARY are all exactly the same except the arg type, can you just define it once and templatize it? Might just be able to declare the arr arg as auto too and put all of them as a single case
    }

    // Pull first non-empty batch
    for( ;; )
Isn't this just the same logic as fetchNextBatch directly below? Can probably just pass a default constructed entry to it and not need to duplicate the logic
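A sketch of the de-duplication the comment proposes, with invented stand-in types (`Batch`, `SourceEntry`, `fetchNextBatch` here are illustrative, not the real API): if `fetchNextBatch` already loops until it finds a non-empty batch or the stream ends, then "pull first non-empty batch" is just a call on a freshly initialized entry, with no separate `for( ;; )` loop.

```cpp
#include <cstddef>
#include <vector>

struct Batch { int rows; };  // stand-in for an arrow::RecordBatch

struct SourceEntry
{
    std::vector<Batch> batches;            // pending batches from the stream
    std::size_t next = 0;                  // index of the next batch to pull
    const Batch * currentBatch = nullptr;
};

// Advances to the next non-empty batch; returns false when exhausted.
// Calling this on a default-initialized entry pulls the first batch too,
// so the startup loop in the snippet above becomes redundant.
bool fetchNextBatch( SourceEntry & entry )
{
    while( entry.next < entry.batches.size() )
    {
        const Batch & b = entry.batches[entry.next++];
        if( b.rows > 0 )
        {
            entry.currentBatch = &b;
            return true;
        }
    }
    return false;
}
```
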
    void RecordBatchRowProcessor::rebindSource( SourceEntry & entry )
    {
        for( size_t i = 0; i < entry.dispatchers.size(); ++i )
            entry.dispatchers[i] -> bindColumn( entry.currentBatch -> column( entry.colIndices[i] ).get() );
Compiler may do this anyway, but you can get `const auto & cols = entry.currentBatch->columns()` before the loop and then just index into it within the loop and avoid some indirections
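The hoisted version might look like the following. A sketch with minimal stand-in types (`Array`, `Batch`, `Dispatcher`, `SourceEntry` are invented here; the real `arrow::RecordBatch::columns()` likewise returns a const reference to the column vector, so this fetches it once instead of calling `column()` per iteration.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Array {};  // stand-in for arrow::Array

struct Batch      // stand-in for arrow::RecordBatch
{
    std::vector<std::shared_ptr<Array>> cols;
    const std::vector<std::shared_ptr<Array>> & columns() const { return cols; }
};

struct Dispatcher
{
    const Array * bound = nullptr;
    void bindColumn( const Array * a ) { bound = a; }
};

struct SourceEntry
{
    std::shared_ptr<Batch> currentBatch;
    std::vector<Dispatcher *> dispatchers;
    std::vector<std::size_t> colIndices;
};

void rebindSource( SourceEntry & entry )
{
    // One call to columns(), then plain vector indexing inside the loop.
    const auto & cols = entry.currentBatch -> columns();
    for( std::size_t i = 0; i < entry.dispatchers.size(); ++i )
        entry.dispatchers[i] -> bindColumn( cols[entry.colIndices[i]].get() );
}
```
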
    properties.tryGet( "time_shift", m_time_shift );

    CSP_TRUE_OR_THROW_RUNTIME( m_timeColumn != "", "Time column can't be empty" );
    CSP_TRUE_OR_THROW_RUNTIME( m_defaultTimezone == "UTC",
Why even accept tz as an argument then?
    raise TypeError("CSP Cannot load binary arrows derived from pyarrow versions less than 4.0.1")
    wrapped = self._filenames_gen
    self._filenames_gen = lambda starttime, endtime: self._arrow_c_data_interface(wrapped, starttime, endtime)
    self._table_gen = self._filenames_gen  # Alias: memory-table path uses _table_gen
What's this about? I don't see why we need the two of these
    void doReadNextValue( int64_t row, void * optionalOut ) override
    {
        auto & out = *static_cast<std::optional<ValueT> *>( optionalOut );
        auto & typed = static_cast<const ArrowArrayT &>( *this -> m_column );
Can replace 196-200 with `if( !doExtract( row, out ) ) out.reset();`
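The suggested shape, sketched with an invented `ReaderStub` (the real class and its `doExtract` signature may differ): `doExtract` reports success and writes the value, so the null/failure path collapses to a one-line reset of the optional.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the reader in the snippet above.
template<typename ValueT>
struct ReaderStub
{
    bool   valid;
    ValueT value;

    // Returns true and fills `out` on success, false when the cell is null.
    bool doExtract( int64_t /*row*/, std::optional<ValueT> & out ) const
    {
        if( !valid )
            return false;
        out = value;
        return true;
    }

    void doReadNextValue( int64_t row, void * optionalOut ) const
    {
        auto & out = *static_cast<std::optional<ValueT> *>( optionalOut );
        if( !doExtract( row, out ) )
            out.reset();
    }
};
```
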
Rewrite the parquet input adapter for RecordBatch-based streaming
Replaces the old `ParquetReader`/`ParquetReaderColumnAdapter` class hierarchy with a new three-layer architecture that operates on Arrow `RecordBatch` data directly. The new design is simpler (fewer virtual calls, no per-type reader subclasses) and supports reading from parquet files, Arrow IPC streams, and in-memory Arrow Tables through a unified `RecordBatchStreamSource` interface.

Motivation
The old implementation had:

- a deep class hierarchy (`FileReaderWrapper` → `ParquetFileReaderWrapper`, per-type `ReaderColumnAdapter` subclasses) that was hard to extend
- no way to consume a `RecordBatchReader` from external sources

The new implementation:

- introduces a `RecordBatchStreamSource` interface that cleanly separates file management from row processing

Architecture
- `RecordBatchStreamSource` (new interface) abstracts file iteration with two implementations:
  - `NativeParquetStreamSource` — C++ opens parquet files directly with leaf-level column projection
  - `PyRecordBatchStreamSource` — Python yields `RecordBatchReader` objects via the Arrow C Stream Interface (IPC, memory tables)
- `RecordBatchRowProcessor` (new) binds to N `RecordBatchReader*` (one per split-column file) and provides `readRowAndAdvance()` / `skipRow()` / `dispatchRow()`. Validates split-column alignment at runtime.
- `ColumnDispatcher` (new) is a type-erased wrapper combining `FieldReader` + value storage + `ValueDispatcher`, one per subscribed column.
- `ParquetInputAdapterManager` is rewritten to orchestrate via the above layers. It no longer touches Arrow arrays directly.

What's removed
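To make the layering concrete, here is a minimal sketch of what the source interface could look like. Everything beyond the names quoted in the description (`RecordBatchStreamSource`, `NativeParquetStreamSource`, `PyRecordBatchStreamSource`) is invented for illustration, including `RecordBatchStub`, `nextBatch()`, and the in-memory test source; the real interface may differ.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct RecordBatchStub { int numRows; };  // stand-in for arrow::RecordBatch

// Abstracts "give me the next batch" regardless of where batches come from;
// NativeParquetStreamSource and PyRecordBatchStreamSource would be the two
// concrete implementations described above.
class RecordBatchStreamSource
{
public:
    virtual ~RecordBatchStreamSource() = default;
    virtual std::shared_ptr<RecordBatchStub> nextBatch() = 0;  // nullptr = exhausted
};

// Tiny in-memory implementation used here only to exercise the interface.
class VectorStreamSource : public RecordBatchStreamSource
{
public:
    explicit VectorStreamSource( std::vector<int> rowCounts ) : m_rows( std::move( rowCounts ) ) {}

    std::shared_ptr<RecordBatchStub> nextBatch() override
    {
        if( m_next >= m_rows.size() )
            return nullptr;
        return std::make_shared<RecordBatchStub>( RecordBatchStub{ m_rows[m_next++] } );
    }

private:
    std::vector<int> m_rows;
    std::size_t m_next = 0;
};
```
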
- `ParquetReader`/`ParquetReaderColumnAdapter` (~2500 lines) — the old per-type reader class hierarchy
- `FileReaderWrapper`/`ParquetFileReaderWrapper`/`ArrowIPCFileReaderWrapper` — old file abstractions
- `DialectGenericListReaderInterface` — unused reader interface
- the `m_rbSources` member in `DictBasketReaderRecord` (declared/cleared but never populated)

Bug fixes
- Fixed `countLeafColumns()` to correctly expand struct fields into all their leaf indices.
- With `allow_missing_columns=True`, the adapter now throws a clear `RuntimeError` instead of segfaulting.

Performance
Benchmarked on Linux (Python 3.13, Arrow 23.0.1, GCC 14.3). Two suites:

- `bench_large` measures full-file consumption (1M–5M ticks, all rows read)
- `bench_comprehensive` measures partial reads (86K ticks from larger files, realistic time-windowed workloads)

Full-file: 3 faster, 0 slower, 20 unchanged. Partial-file: 22 faster, 0 slower, 13 unchanged (±3% threshold).
Key wins come from RecordBatch-level column projection (skips unrequested columns entirely), the `InlineReader` zero-overhead hot loop (verified by assembly to match hand-written typed access), and struct bulk-read eliminating per-field virtual dispatch.

API compatibility
The public Python API (`ParquetReader.subscribe`, `subscribe_all`, `subscribe_dict_basket`) is unchanged. All 128 existing + new tests pass (covering all Arrow types, null handling, struct projection, split columns, dict baskets, multi-file, IPC, and in-memory tables).