Add Microsoft.Extensions.DataRetrieval abstractions and pipeline #7508
Draft
luisquintanilla wants to merge 12 commits into dotnet:main from
Abstractions (Microsoft.Extensions.DataIngestion.Abstractions):
- RetrievalQuery: query data type with variants + metadata
- RetrievalChunk: single chunk with content, score, record data
- RetrievalResults: result collection with pipeline metadata
- RetrievalQueryProcessor: abstract base for pre-search processing
- RetrievalResultProcessor: abstract base for post-search processing
- ISearchReranker: interface for re-ranking strategies

Implementation (Microsoft.Extensions.DataIngestion):
- RetrievalPipeline: orchestrator (query processors → vector search → result processors)
- RetrievalPipelineOptions: ActivitySource configuration
- Source-generated log methods for pipeline tracing

Mirrors IngestionPipeline design: empty processor list = raw vector search.
Builds clean: 0 warnings, 0 errors across all target frameworks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Align retrieval processor method names with ingestion-side convention. IngestionChunkProcessor<T> and IngestionDocumentProcessor both use ProcessAsync, so the retrieval abstractions should follow suit.
- RetrievalQueryProcessor.ProcessQueryAsync → ProcessAsync
- RetrievalResultProcessor.ProcessResultsAsync → ProcessAsync
- RetrievalPipeline: update calls to use new names

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Phase 2 now checks query.Metadata['search_paradigm'] == 'TreeTraversal':
- Wider search (topK * 3) to capture all tree levels
- Groups results by Level metadata (set by TreeIndexProcessor)
- Prioritizes leaf chunks with branch/root summaries for context
- Falls back to existing flat search when metadata not present

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

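The branching described in this commit could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: TreeChunk and TreeSearchSketch are hypothetical names, the vector search is a synchronous delegate for brevity, Level 0 is assumed to mean "leaf", a missing Level is treated as a leaf, and "prioritizes leaf chunks" is interpreted as "leaves first, then higher levels fill the remaining slots".

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk shape; the PR's RetrievalChunk may differ.
public sealed record TreeChunk(string Content, double Score, int? Level);

public static class TreeSearchSketch
{
    public static IReadOnlyList<TreeChunk> Search(
        IReadOnlyDictionary<string, object?> queryMetadata,
        int topK,
        Func<int, IReadOnlyList<TreeChunk>> vectorSearch) // stand-in for the vector store call
    {
        bool treeTraversal =
            queryMetadata.TryGetValue("search_paradigm", out var paradigm)
            && paradigm is string s
            && s == "TreeTraversal";

        // Fallback: plain flat search when the metadata is absent or different.
        if (!treeTraversal)
        {
            return vectorSearch(topK);
        }

        // Wider search so that leaves, branches, and roots can all appear.
        var candidates = vectorSearch(topK * 3);

        // Group by level (assumption: 0 = leaf), emit lower levels first,
        // and order each group by descending score.
        return candidates
            .GroupBy(c => c.Level ?? 0)
            .OrderBy(g => g.Key)
            .SelectMany(g => g.OrderByDescending(c => c.Score))
            .Take(topK)
            .ToList();
    }
}
```

With this interpretation, summaries only appear when fewer than topK leaves are found; a real implementation might instead reserve slots for branch or root summaries.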
These files contained hardcoded c:\Dev\extensions paths for local package resolution. They are template scaffolding not related to retrieval abstractions and should not be on this branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add Retrieval Types section documenting RetrievalQuery, RetrievalChunk, RetrievalResults, RetrievalQueryProcessor, RetrievalResultProcessor, and ISearchReranker. Restructure package description to cover both ingestion and retrieval.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Move retrieval types from DataIngestion to new DataRetrieval packages:
- Microsoft.Extensions.DataRetrieval.Abstractions: RetrievalQuery, RetrievalChunk, RetrievalResults, RetrievalQueryProcessor, RetrievalResultProcessor, IReranker (renamed from ISearchReranker)
- Microsoft.Extensions.DataRetrieval: RetrievalPipeline, RetrievalPipelineOptions, retrieval Log methods

Namespace changes from Microsoft.Extensions.DataIngestion to Microsoft.Extensions.DataRetrieval. Retrieval types had zero dependencies on ingestion types, so the split is clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added <exception> documentation for ArgumentNullException and ArgumentException thrown by RetrieveAsync, per dotnet/runtime adding-api-guidelines.md requirements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Introduces IRetriever in DataRetrieval.Abstractions as a simple, vector-store-agnostic contract for retrieval pipelines. BoundRetriever<TKey, TRecord> in DataRetrieval adapts a RetrievalPipeline + VectorStoreCollection into IRetriever, enabling DI registration and testability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Better communicates what the type does from a consumer perspective: it retrieves from a vector store. Follows noun-noun compound pattern (StreamReader, ChannelWriter) and distinguishes from future IRetriever implementations (WebSearchRetriever, DatabaseRetriever, etc).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Aligns with IngestionPipeline.ProcessAsync: both pipelines 'process' inputs through their stages. Reserves 'RetrieveAsync' exclusively for IRetriever, creating clear vocabulary separation:
- Pipeline.ProcessAsync: engine that processes queries through stages
- IRetriever.RetrieveAsync: endpoint that retrieves results

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

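The "engine that processes queries through stages" idea can be sketched as below. This is a minimal illustration rather than the PR's implementation: every Sketch* type is a hypothetical stand-in, the signatures are assumptions, and the vector store is replaced by a delegate so the block stays self-contained. It also shows why an empty processor list degenerates to a raw vector search.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins for RetrievalQuery / RetrievalChunk.
public sealed record SketchQuery(string Text);
public sealed record SketchChunk(string Content, double Score);

// Pre-search stage (assumed shape of RetrievalQueryProcessor).
public abstract class SketchQueryProcessor
{
    public abstract Task<SketchQuery> ProcessAsync(SketchQuery query, CancellationToken ct = default);
}

// Post-search stage (assumed shape of RetrievalResultProcessor).
public abstract class SketchResultProcessor
{
    public abstract Task<IReadOnlyList<SketchChunk>> ProcessAsync(
        SketchQuery query, IReadOnlyList<SketchChunk> chunks, CancellationToken ct = default);
}

public sealed class SketchPipeline
{
    private readonly Func<SketchQuery, int, CancellationToken, Task<IReadOnlyList<SketchChunk>>> _search;
    private readonly IReadOnlyList<SketchQueryProcessor> _queryProcessors;
    private readonly IReadOnlyList<SketchResultProcessor> _resultProcessors;

    public SketchPipeline(
        Func<SketchQuery, int, CancellationToken, Task<IReadOnlyList<SketchChunk>>> search,
        IEnumerable<SketchQueryProcessor>? queryProcessors = null,
        IEnumerable<SketchResultProcessor>? resultProcessors = null)
    {
        _search = search ?? throw new ArgumentNullException(nameof(search));
        _queryProcessors = (queryProcessors ?? Enumerable.Empty<SketchQueryProcessor>()).ToList();
        _resultProcessors = (resultProcessors ?? Enumerable.Empty<SketchResultProcessor>()).ToList();
    }

    public async Task<IReadOnlyList<SketchChunk>> ProcessAsync(
        SketchQuery query, int topK, CancellationToken ct = default)
    {
        if (query is null) throw new ArgumentNullException(nameof(query));

        // Stage 1: each query processor may rewrite the query.
        foreach (var processor in _queryProcessors)
        {
            query = await processor.ProcessAsync(query, ct);
        }

        // Stage 2: the search itself (a vector search in the real pipeline).
        var chunks = await _search(query, topK, ct);

        // Stage 3: each result processor may filter or reorder the chunks.
        foreach (var processor in _resultProcessors)
        {
            chunks = await processor.ProcessAsync(query, chunks, ct);
        }

        return chunks;
    }
}
```

With both processor lists empty, the two loops are skipped and ProcessAsync returns exactly what the search delegate produced.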
Adds a convenience extension method that wraps a RetrievalPipeline into a VectorStoreRetriever<TKey, TRecord> implementing IRetriever. This improves discoverability and enables pipeline.AsRetriever(...) as a natural terminal operation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

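The adapter-plus-extension pattern this commit describes might look like the following sketch. All names and signatures here are hypothetical stand-ins (the real IRetriever, RetrievalPipeline, and VectorStoreRetriever<TKey, TRecord> are not shown in this page), and it assumes the pipeline takes the collection per call while the retriever binds it once.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical data-source-agnostic contract (stand-in for IRetriever).
public interface ISketchRetriever
{
    Task<IReadOnlyList<string>> RetrieveAsync(string query, int topK, CancellationToken ct = default);
}

// Hypothetical pipeline that needs a collection on every call.
public sealed class SketchPipeline<TCollection> where TCollection : class
{
    private readonly Func<TCollection, string, int, CancellationToken, Task<IReadOnlyList<string>>> _run;

    public SketchPipeline(Func<TCollection, string, int, CancellationToken, Task<IReadOnlyList<string>>> run)
        => _run = run ?? throw new ArgumentNullException(nameof(run));

    public Task<IReadOnlyList<string>> ProcessAsync(
        TCollection collection, string query, int topK, CancellationToken ct = default)
        => _run(collection, query, topK, ct);
}

// Adapter: binds pipeline + collection so callers only see the retriever contract.
public sealed class SketchBoundRetriever<TCollection> : ISketchRetriever where TCollection : class
{
    private readonly SketchPipeline<TCollection> _pipeline;
    private readonly TCollection _collection;

    public SketchBoundRetriever(SketchPipeline<TCollection> pipeline, TCollection collection)
    {
        _pipeline = pipeline ?? throw new ArgumentNullException(nameof(pipeline));
        _collection = collection ?? throw new ArgumentNullException(nameof(collection));
    }

    public Task<IReadOnlyList<string>> RetrieveAsync(string query, int topK, CancellationToken ct = default)
        => _pipeline.ProcessAsync(_collection, query, topK, ct);
}

public static class SketchPipelineExtensions
{
    // Terminal operation: pipeline.AsRetriever(collection).
    public static ISketchRetriever AsRetriever<TCollection>(
        this SketchPipeline<TCollection> pipeline, TCollection collection)
        where TCollection : class
        => new SketchBoundRetriever<TCollection>(pipeline, collection);
}
```

Consumers can then depend only on the retriever interface, which is what makes DI registration and test doubles straightforward.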
PR: Add Microsoft.Extensions.DataRetrieval packages
Resolves #7507
What this PR adds
Two new packages providing retrieval pipeline abstractions for RAG (Retrieval-Augmented Generation) applications:
Microsoft.Extensions.DataRetrieval.Abstractions
Thin abstraction layer defining the data types and processor contracts:
- RetrievalQuery: query with variant expansion and metadata
- RetrievalChunk: scored content chunk from vector search
- RetrievalResults: result container with pipeline metadata
- RetrievalQueryProcessor: abstract base for pre-search processors
- RetrievalResultProcessor: abstract base for post-search processors
- IReranker: interface for re-ranking strategies
- IRetriever: data-source-agnostic retrieval contract for DI and testability

Microsoft.Extensions.DataRetrieval
Pipeline implementation:
- RetrievalPipeline: orchestrates query processing → vector search → result processing
- VectorStoreRetriever<TKey, TRecord>: IRetriever implementation binding pipeline + collection
- RetrievalPipelineExtensions.AsRetriever(): convenience extension for pipeline → retriever conversion
- RetrievalPipelineOptions: configuration (ActivitySource name)
- System.Diagnostics.ActivitySource
- ILoggerFactory

Design
These packages are the read-side counterpart to Microsoft.Extensions.DataIngestion (the write-side). Together with Microsoft.Extensions.AI and Microsoft.Extensions.VectorData, they provide a complete composable RAG stack.

The pipeline follows the same "zero-cost when empty" philosophy as DataIngestion: with no processors registered, ProcessAsync performs a raw vector search. Each processor adds exactly one capability.

Dependencies
- Microsoft.Extensions.VectorData.Abstractions (for VectorStoreCollection<TKey, TRecord>)
- Microsoft.Extensions.Logging.Abstractions
- Microsoft.Extensions.Options

No dependency on Microsoft.Extensions.AI in the abstractions package; processor implementations bring their own AI client dependencies.

Testing
API diff summary
Checklist
- <exception>) on throwing methods
- ActivitySource telemetry