Add Microsoft.Extensions.DataRetrieval abstractions and pipeline #7508

Draft
luisquintanilla wants to merge 12 commits into dotnet:main from luisquintanilla:feature/retrieval-abstractions

@luisquintanilla (Contributor) commented May 5, 2026

PR: Add Microsoft.Extensions.DataRetrieval packages

Resolves #7507

What this PR adds

Two new packages providing retrieval pipeline abstractions for RAG (Retrieval-Augmented Generation) applications:

Microsoft.Extensions.DataRetrieval.Abstractions

Thin abstraction layer defining the data types and processor contracts:

  • RetrievalQuery — query with variant expansion and metadata
  • RetrievalChunk — scored content chunk from vector search
  • RetrievalResults — result container with pipeline metadata
  • RetrievalQueryProcessor — abstract base for pre-search processors
  • RetrievalResultProcessor — abstract base for post-search processors
  • IReranker — interface for re-ranking strategies
  • IRetriever — data-source-agnostic retrieval contract for DI and testability

Microsoft.Extensions.DataRetrieval

Pipeline implementation:

  • RetrievalPipeline — orchestrates query processing → vector search → result processing
  • VectorStoreRetriever<TKey, TRecord>: IRetriever implementation binding pipeline + collection
  • RetrievalPipelineExtensions.AsRetriever() — convenience extension for pipeline → retriever conversion
  • RetrievalPipelineOptions — configuration (ActivitySource name)
  • Built-in: Reciprocal Rank Fusion for multi-query deduplication
  • Built-in: Tree-traversal search paradigm for hierarchical indices
  • OpenTelemetry tracing via System.Diagnostics.ActivitySource
  • Structured logging via ILoggerFactory
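
For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the commonly used formulation: each chunk receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed, so a chunk returned by several query variants is deduplicated into one entry with a higher fused score. This is an illustration only, not the implementation in this PR. `RrfSketch`, `Fuse`, the use of string chunk IDs, and the constant k = 60 are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not a type from this PR.
public static class RrfSketch
{
    // Fuses several ranked lists of chunk IDs (one list per query variant)
    // into a single deduplicated ranking.
    // Each occurrence contributes 1 / (k + rank), with rank starting at 1.
    public static IReadOnlyList<KeyValuePair<string, double>> Fuse(
        IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
    {
        var scores = new Dictionary<string, double>();

        foreach (IReadOnlyList<string> list in rankedLists)
        {
            for (int i = 0; i < list.Count; i++)
            {
                // TryGetValue leaves 'current' at 0.0 for IDs not seen yet.
                scores.TryGetValue(list[i], out double current);
                scores[list[i]] = current + 1.0 / (k + i + 1);
            }
        }

        // Highest fused score first; ties broken by ID so the order is deterministic.
        return scores
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal)
            .ToList();
    }
}
```

With the lists ["a", "b", "c"] and ["b", "d"], chunk "b" appears in both and therefore ranks first, even though it was only second in the first list.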

Design

These packages are the read-side counterpart to Microsoft.Extensions.DataIngestion (the write-side). Together with Microsoft.Extensions.AI and Microsoft.Extensions.VectorData, they provide a complete composable RAG stack.

The pipeline follows the same "zero-cost when empty" philosophy as DataIngestion: with no processors registered, ProcessAsync performs a raw vector search. Each processor adds exactly one capability.
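
The "zero-cost when empty" behaviour can be pictured with a small stand-in. This is a rough sketch under stated assumptions, not the RetrievalPipeline API from this PR: `PipelineSketch`, its delegate-based processor lists, and the string-based query and result types are hypothetical simplifications.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for illustration; not the RetrievalPipeline from this PR.
public sealed class PipelineSketch
{
    // Pre-search stages: each may rewrite the query text.
    public List<Func<string, Task<string>>> QueryProcessors { get; } = new();

    // Post-search stages: each may filter or reorder the hits.
    public List<Func<IReadOnlyList<string>, Task<IReadOnlyList<string>>>> ResultProcessors { get; } = new();

    // 'search' stands in for the vector store call.
    public async Task<IReadOnlyList<string>> ProcessAsync(
        string query, Func<string, Task<IReadOnlyList<string>>> search)
    {
        // With no query processors registered, this loop does nothing.
        foreach (var processor in QueryProcessors)
        {
            query = await processor(query);
        }

        IReadOnlyList<string> results = await search(query);

        // With no result processors registered, the raw hits are returned as-is.
        foreach (var processor in ResultProcessors)
        {
            results = await processor(results);
        }

        return results;
    }
}
```

In this sketch an empty pipeline reduces to a single call to `search`, and registering one processor changes exactly one stage of the flow.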

Dependencies

  • Microsoft.Extensions.VectorData.Abstractions (for VectorStoreCollection<TKey, TRecord>)
  • Microsoft.Extensions.Logging.Abstractions
  • Microsoft.Extensions.Options

No dependency on Microsoft.Extensions.AI in the abstractions package — processor implementations bring their own AI client dependencies.

Testing

  • Unit tests covering pipeline orchestration, RRF deduplication, tree traversal
  • Processor contract tests (mock implementations)
  • Integration verified against Qdrant, Azure AI Search, and in-memory vector stores via the advanced-rag reference application

API diff summary

+ Microsoft.Extensions.DataRetrieval.Abstractions (new package)
  + RetrievalQuery
  + RetrievalChunk
  + RetrievalResults
  + RetrievalQueryProcessor
  + RetrievalResultProcessor
  + IReranker
  + IRetriever

+ Microsoft.Extensions.DataRetrieval (new package)
  + RetrievalPipeline
  + RetrievalPipelineOptions
  + VectorStoreRetriever<TKey, TRecord>
  + RetrievalPipelineExtensions (AsRetriever)

Checklist

  • XML documentation on all public types
  • Exception documentation (<exception>) on throwing methods
  • ActivitySource telemetry
  • Structured logging
  • README for both packages
  • Follows existing DataIngestion packaging pattern
  • Audited against dotnet/runtime Framework Design Guidelines

luisquintanilla and others added 12 commits April 7, 2026 17:13
Abstractions (Microsoft.Extensions.DataIngestion.Abstractions):
- RetrievalQuery: query data type with variants + metadata
- RetrievalChunk: single chunk with content, score, record data
- RetrievalResults: result collection with pipeline metadata
- RetrievalQueryProcessor: abstract base for pre-search processing
- RetrievalResultProcessor: abstract base for post-search processing
- ISearchReranker: interface for re-ranking strategies

Implementation (Microsoft.Extensions.DataIngestion):
- RetrievalPipeline: orchestrator (query processors → vector search → result processors)
- RetrievalPipelineOptions: ActivitySource configuration
- Source-generated log methods for pipeline tracing

Mirrors IngestionPipeline design: empty processor list = raw vector search.
Builds clean: 0 warnings, 0 errors across all target frameworks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Align retrieval processor method names with ingestion-side convention.
IngestionChunkProcessor<T> and IngestionDocumentProcessor both use
ProcessAsync — the retrieval abstractions should follow suit.

- RetrievalQueryProcessor.ProcessQueryAsync → ProcessAsync
- RetrievalResultProcessor.ProcessResultsAsync → ProcessAsync
- RetrievalPipeline: update calls to use new names

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 2 now checks query.Metadata['search_paradigm'] == 'TreeTraversal':
- Wider search (topK * 3) to capture all tree levels
- Groups results by Level metadata (set by TreeIndexProcessor)
- Prioritizes leaf chunks with branch/root summaries for context
- Falls back to existing flat search when metadata not present

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These files contained hardcoded c:\Dev\extensions paths for local
package resolution. They are template scaffolding not related to
retrieval abstractions and should not be on this branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add Retrieval Types section documenting RetrievalQuery, RetrievalChunk,
RetrievalResults, RetrievalQueryProcessor, RetrievalResultProcessor,
and ISearchReranker. Restructure package description to cover both
ingestion and retrieval.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move retrieval types from DataIngestion to new DataRetrieval packages:
- Microsoft.Extensions.DataRetrieval.Abstractions: RetrievalQuery,
  RetrievalChunk, RetrievalResults, RetrievalQueryProcessor,
  RetrievalResultProcessor, IReranker (renamed from ISearchReranker)
- Microsoft.Extensions.DataRetrieval: RetrievalPipeline,
  RetrievalPipelineOptions, retrieval Log methods

Namespace changes from Microsoft.Extensions.DataIngestion to
Microsoft.Extensions.DataRetrieval. Retrieval types had zero
dependencies on ingestion types — clean split.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added <exception> documentation for ArgumentNullException and
ArgumentException thrown by RetrieveAsync, per dotnet/runtime
adding-api-guidelines.md requirements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduces IRetriever in DataRetrieval.Abstractions as a simple,
vector-store-agnostic contract for retrieval pipelines. BoundRetriever<TKey, TRecord>
in DataRetrieval adapts a RetrievalPipeline + VectorStoreCollection into IRetriever,
enabling DI registration and testability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Better communicates what the type does from a consumer perspective:
it retrieves from a vector store. Follows noun-noun compound pattern
(StreamReader, ChannelWriter) and distinguishes from future IRetriever
implementations (WebSearchRetriever, DatabaseRetriever, etc).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Aligns with IngestionPipeline.ProcessAsync — both pipelines 'process'
inputs through their stages. Reserves 'RetrieveAsync' exclusively for
IRetriever, creating clear vocabulary separation:

  - Pipeline.ProcessAsync: engine that processes queries through stages
  - IRetriever.RetrieveAsync: endpoint that retrieves results

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a convenience extension method that wraps a RetrievalPipeline
into a VectorStoreRetriever<TKey, TRecord> implementing IRetriever.
This improves discoverability and enables pipeline.AsRetriever(...)
as a natural terminal operation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions bot added the area-ai-templates (Microsoft.Extensions.AI.Templates) label on May 5, 2026
@CZEMacLeod

DataRetrieval is a fairly generic name/activity, which may or may not indicate that this is AI-related (RAG).
As it sits directly under the very generic Microsoft.Extensions.* package list, it would be nice if these sorts of packages were better grouped/named.
Other areas/features that might match ME.DataRetrieval:

  • agnostic archived/'cold' data retrieval (e.g. from blob storage, Azure Backup, or another source)
  • database or nosql storage systems
  • (corrupt) filesystem recovery tools

Development

Successfully merging this pull request may close these issues.

Proposal: Add Retrieval Pipeline Abstractions (Microsoft.Extensions.DataRetrieval)

3 participants