Add Microsoft.Extensions.DataRetrieval abstractions and pipeline by luisquintanilla · Pull Request #7508 · dotnet/extensions

luisquintanilla · 2026-05-05T17:02:07Z

PR: Add Microsoft.Extensions.DataRetrieval packages

Resolves #7507

What this PR adds

Two new packages providing retrieval pipeline abstractions for RAG (Retrieval-Augmented Generation) applications:

`Microsoft.Extensions.DataRetrieval.Abstractions`

Thin abstraction layer defining the data types and processor contracts:

- `RetrievalQuery`: query with variant expansion and metadata
- `RetrievalChunk`: scored content chunk from vector search
- `RetrievalResults`: result container with pipeline metadata
- `RetrievalQueryProcessor`: abstract base for pre-search processors
- `RetrievalResultProcessor`: abstract base for post-search processors
- `IReranker`: interface for re-ranking strategies
- `IRetriever`: data-source-agnostic retrieval contract for DI and testability

`Microsoft.Extensions.DataRetrieval`

Pipeline implementation:

- `RetrievalPipeline`: orchestrates query processing → vector search → result processing
- `VectorStoreRetriever<TKey, TRecord>`: `IRetriever` implementation binding pipeline + collection
- `RetrievalPipelineExtensions.AsRetriever()`: convenience extension for pipeline → retriever conversion
- `RetrievalPipelineOptions`: configuration (`ActivitySource` name)
- Built-in: Reciprocal Rank Fusion for multi-query deduplication
- Built-in: Tree-traversal search paradigm for hierarchical indices
- OpenTelemetry tracing via `System.Diagnostics.ActivitySource`
- Structured logging via `ILoggerFactory`
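
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal, language-neutral sketch in Python of how several per-variant rankings can be merged and deduplicated. It is not the package's implementation: the function name, the use of plain chunk ids, and the smoothing constant `k=60` (the value commonly used in the RRF literature) are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk ids into one deduplicated ranking.

    Each appearance of a chunk contributes 1 / (k + rank) to its fused score,
    so a chunk returned by several query variants rises toward the top.
    `k` is an assumed smoothing constant, not a value taken from the PR.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep output stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Two hypothetical query variants returning overlapping chunk ids.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Chunks `b` and `c` appear in both rankings, so they end up ahead of `a` even though `a` was ranked first by one variant.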

Design

These packages are the read-side counterpart to Microsoft.Extensions.DataIngestion (the write-side). Together with Microsoft.Extensions.AI and Microsoft.Extensions.VectorData, they provide a complete composable RAG stack.

The pipeline follows the same "zero-cost when empty" philosophy as DataIngestion: with no processors registered, ProcessAsync performs a raw vector search. Each processor adds exactly one capability.
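
The "zero-cost when empty" idea can be sketched in a few lines of Python. This is only an illustration of the staging described above, under the assumption that processors are simple async callables; the class name, the `process` signature, and the fake search function are hypothetical and do not mirror the C# API.

```python
import asyncio


class RetrievalPipelineSketch:
    """Hypothetical pipeline: query processors -> search -> result processors."""

    def __init__(self, search, query_processors=(), result_processors=()):
        self._search = search
        self._query_processors = list(query_processors)
        self._result_processors = list(result_processors)

    async def process(self, query, top_k=5):
        # With no processors registered both loops are skipped, so the call
        # reduces to a single raw search.
        for processor in self._query_processors:
            query = await processor(query)
        results = await self._search(query, top_k)
        for processor in self._result_processors:
            results = await processor(results, query)
        return results


async def fake_search(query, top_k):
    # Stand-in for a vector search: returns (content, score) pairs.
    hits = [("doc about " + query, 0.9), ("other doc", 0.4)]
    return hits[:top_k]


async def lowercase_query(query):
    return query.lower()


async def drop_low_scores(results, query):
    return [hit for hit in results if hit[1] >= 0.5]


raw = asyncio.run(RetrievalPipelineSketch(fake_search).process("RAG"))
processed = asyncio.run(
    RetrievalPipelineSketch(fake_search, [lowercase_query], [drop_low_scores]).process("RAG")
)
```

The empty pipeline returns exactly what the search returns, while each registered processor changes one thing: the query text in one case, the result filtering in the other.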

Dependencies

- `Microsoft.Extensions.VectorData.Abstractions` (for `VectorStoreCollection<TKey, TRecord>`)
- `Microsoft.Extensions.Logging.Abstractions`
- `Microsoft.Extensions.Options`

No dependency on Microsoft.Extensions.AI in the abstractions package — processor implementations bring their own AI client dependencies.

Testing

- Unit tests covering pipeline orchestration, RRF deduplication, and tree traversal
- Processor contract tests (mock implementations)
- Integration verified against Qdrant, Azure AI Search, and in-memory vector stores via the advanced-rag reference application

API diff summary

+ Microsoft.Extensions.DataRetrieval.Abstractions (new package)
  + RetrievalQuery
  + RetrievalChunk
  + RetrievalResults
  + RetrievalQueryProcessor
  + RetrievalResultProcessor
  + IReranker
  + IRetriever

+ Microsoft.Extensions.DataRetrieval (new package)
  + RetrievalPipeline
  + RetrievalPipelineOptions
  + VectorStoreRetriever<TKey, TRecord>
  + RetrievalPipelineExtensions (AsRetriever)

Checklist

- XML documentation on all public types
- Exception documentation (`<exception>`) on throwing methods
- ActivitySource telemetry
- Structured logging
- README for both packages
- Follows existing DataIngestion packaging pattern
- Audited against dotnet/runtime Framework Design Guidelines


Abstractions (Microsoft.Extensions.DataIngestion.Abstractions):
- RetrievalQuery: query data type with variants + metadata
- RetrievalChunk: single chunk with content, score, record data
- RetrievalResults: result collection with pipeline metadata
- RetrievalQueryProcessor: abstract base for pre-search processing
- RetrievalResultProcessor: abstract base for post-search processing
- ISearchReranker: interface for re-ranking strategies

Implementation (Microsoft.Extensions.DataIngestion):
- RetrievalPipeline: orchestrator (query processors → vector search → result processors)
- RetrievalPipelineOptions: ActivitySource configuration
- Source-generated log methods for pipeline tracing

Mirrors IngestionPipeline design: empty processor list = raw vector search. Builds clean: 0 warnings, 0 errors across all target frameworks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Align retrieval processor method names with ingestion-side convention. IngestionChunkProcessor<T> and IngestionDocumentProcessor both use ProcessAsync, so the retrieval abstractions should follow suit.
- RetrievalQueryProcessor.ProcessQueryAsync → ProcessAsync
- RetrievalResultProcessor.ProcessResultsAsync → ProcessAsync
- RetrievalPipeline: update calls to use new names

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Phase 2 now checks query.Metadata['search_paradigm'] == 'TreeTraversal':
- Wider search (topK * 3) to capture all tree levels
- Groups results by Level metadata (set by TreeIndexProcessor)
- Prioritizes leaf chunks with branch/root summaries for context
- Falls back to existing flat search when metadata not present

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
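
The selection logic in this commit message can be approximated with the Python sketch below. It is a guess at the shape of the behavior, not the actual code: the level names (`leaf`, `branch`, `root`), the dictionary-based chunks, and the "leaves first, then summaries" ordering are assumptions; only the `search_paradigm` check, the `topK * 3` widening, the grouping by level, and the flat fallback come from the message itself.

```python
def select_chunks(query_metadata, search, top_k):
    """Hypothetical tree-traversal selection with a flat-search fallback."""
    if query_metadata.get("search_paradigm") != "TreeTraversal":
        # Metadata absent or different: plain flat search.
        return search(top_k)

    # Widen the search so chunks from every tree level can be captured.
    candidates = search(top_k * 3)

    # Group candidates by their level, preserving score order within a level.
    by_level = {}
    for chunk in candidates:
        by_level.setdefault(chunk["level"], []).append(chunk)

    # Assumed prioritization: leaf chunks first, then branch and root
    # summaries to provide surrounding context.
    ordered = (
        by_level.get("leaf", [])
        + by_level.get("branch", [])
        + by_level.get("root", [])
    )
    return ordered[:top_k]


# Toy index, already sorted by descending score.
INDEX = [
    {"id": "r", "level": "root", "score": 0.95},
    {"id": "b1", "level": "branch", "score": 0.90},
    {"id": "l1", "level": "leaf", "score": 0.80},
    {"id": "l2", "level": "leaf", "score": 0.70},
    {"id": "l3", "level": "leaf", "score": 0.30},
    {"id": "b2", "level": "branch", "score": 0.20},
]


def toy_search(limit):
    return INDEX[:limit]


flat_ids = [c["id"] for c in select_chunks({}, toy_search, 4)]
tree_ids = [
    c["id"]
    for c in select_chunks({"search_paradigm": "TreeTraversal"}, toy_search, 4)
]
```

In this toy index the flat search returns the high-scoring summaries first, while the tree-traversal path returns the three leaves followed by one branch summary.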

These files contained hardcoded c:\Dev\extensions paths for local package resolution. They are template scaffolding not related to retrieval abstractions and should not be on this branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add Retrieval Types section documenting RetrievalQuery, RetrievalChunk, RetrievalResults, RetrievalQueryProcessor, RetrievalResultProcessor, and ISearchReranker. Restructure package description to cover both ingestion and retrieval.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Move retrieval types from DataIngestion to new DataRetrieval packages:
- Microsoft.Extensions.DataRetrieval.Abstractions: RetrievalQuery, RetrievalChunk, RetrievalResults, RetrievalQueryProcessor, RetrievalResultProcessor, IReranker (renamed from ISearchReranker)
- Microsoft.Extensions.DataRetrieval: RetrievalPipeline, RetrievalPipelineOptions, retrieval Log methods

Namespace changes from Microsoft.Extensions.DataIngestion to Microsoft.Extensions.DataRetrieval. Retrieval types had zero dependencies on ingestion types, so the split is clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Added <exception> documentation for ArgumentNullException and ArgumentException thrown by RetrieveAsync, per dotnet/runtime adding-api-guidelines.md requirements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Introduces IRetriever in DataRetrieval.Abstractions as a simple, vector-store-agnostic contract for retrieval pipelines. BoundRetriever<TKey, TRecord> in DataRetrieval adapts a RetrievalPipeline + VectorStoreCollection into IRetriever, enabling DI registration and testability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Better communicates what the type does from a consumer perspective: it retrieves from a vector store. Follows noun-noun compound pattern (StreamReader, ChannelWriter) and distinguishes from future IRetriever implementations (WebSearchRetriever, DatabaseRetriever, etc.).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Aligns with IngestionPipeline.ProcessAsync: both pipelines 'process' inputs through their stages. Reserves 'RetrieveAsync' exclusively for IRetriever, creating clear vocabulary separation:
- Pipeline.ProcessAsync: engine that processes queries through stages
- IRetriever.RetrieveAsync: endpoint that retrieves results

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds a convenience extension method that wraps a RetrievalPipeline into a VectorStoreRetriever<TKey, TRecord> implementing IRetriever. This improves discoverability and enables pipeline.AsRetriever(...) as a natural terminal operation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
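
The relationship between the pipeline, the bound retriever, and `AsRetriever` can be pictured with the following Python sketch. It only illustrates the adapter pattern the commits describe; every name and signature here is hypothetical (including the assumption that the pipeline receives the collection on each call), and the real C# types may differ.

```python
import asyncio
from typing import Protocol


class Retriever(Protocol):
    """Stand-in for a data-source-agnostic retrieval contract."""

    async def retrieve(self, query: str, top_k: int = 5) -> list: ...


class PipelineSketch:
    """Stand-in pipeline: 'processes' a query against a given collection."""

    async def process(self, collection, query, top_k):
        # Naive substring match in place of a real vector search.
        return [doc for doc in collection if query.lower() in doc.lower()][:top_k]


class VectorStoreRetrieverSketch:
    """Binds a pipeline to one collection so callers only see retrieve()."""

    def __init__(self, pipeline, collection):
        self._pipeline = pipeline
        self._collection = collection

    async def retrieve(self, query, top_k=5):
        return await self._pipeline.process(self._collection, query, top_k)


def as_retriever(pipeline, collection):
    # Convenience wrapper, analogous in spirit to pipeline.AsRetriever(...).
    return VectorStoreRetrieverSketch(pipeline, collection)


class FakeRetriever:
    """Test double: consumers depending on Retriever need no vector store."""

    async def retrieve(self, query, top_k=5):
        return ["canned answer"]


async def answer(retriever: Retriever, question: str) -> list:
    return await retriever.retrieve(question, top_k=1)


real = asyncio.run(answer(as_retriever(PipelineSketch(), ["RAG basics", "Other"]), "rag"))
fake = asyncio.run(answer(FakeRetriever(), "rag"))
```

Because `answer` depends only on the retriever contract, swapping the bound retriever for a fake requires no change to the consumer, which is the testability benefit the commit messages point to.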

CZEMacLeod · 2026-05-06T00:35:05Z

DataRetrieval is a fairly generic name/activity that may or may not indicate that this is AI-related (RAG).
As it sits directly under the very generic Microsoft.Extensions.* package list, it would be nice if these sorts of packages were better grouped or named.
Other areas/features that the name ME.DataRetrieval might match:

- agnostic archived/cold data retrieval (e.g. from blob storage, Azure Backup, or another source)
- database or NoSQL storage systems
- (corrupt) filesystem recovery tools

luisquintanilla and others added 12 commits April 7, 2026 17:13

Add READMEs for DataRetrieval packages

b95b5cf


Add exception XML docs to RetrieveAsync per FDG audit

d69e997


github-actions Bot added the area-ai-templates Microsoft.Extensions.AI.Templates label May 5, 2026

dotnet-policy-service Bot assigned luisquintanilla May 5, 2026

luisquintanilla wants to merge 12 commits into dotnet:main from luisquintanilla:feature/retrieval-abstractions
