This test run verified the Dataset Reimplementation feature (PR #2341), which re-enables the "Test Suites" sidebar link, adds new dataset run endpoints, and replaces old chat-API-based execution with scheduled trigger infrastructure. The core routing, API endpoints, and backend infrastructure are working correctly. However, several UI bugs were identified: the run detail page incorrectly displays progress and invocation data when all agent executions fail, the cross-product display groups by dataset item instead of showing per-invocation rows, and unauthenticated users can access protected UI routes.
✅ Passed (20)
Test Case
Summary
Timestamp
Screenshot
ROUTE-1
Verified Test Suites link appears under Monitor section in sidebar with Database icon. Clicking it navigates to /default/projects/activities-planner/datasets showing dataset listing.
1:54
ROUTE-2
Created run config 'Test Run E2E' with 1 agent, auto-trigger fired creating 3 invocations (3 items x 1 agent), run appeared in runs list
6:16
ROUTE-3
Run detail page loads at correct URL pattern, displays run name 'Test Run E2E', shows 'Run in progress' indicator with spinner, progress bar and test cases table with 3 items.
14:35
ROUTE-5
All filter mechanisms work correctly: search input filters by text, Show Filters expands filter panel, Agent and Output Status dropdowns work, Clear Filters resets all.
28:07
ROUTE-6
GET dataset-runs/by-dataset endpoint returns 200 with run objects containing id, datasetId, status, totalItems, completedItems, failedItems.
32:09
ROUTE-7
GET dataset-runs/{runId}/items returns 200 with invocation objects containing id, agentId, datasetRunId, datasetItemId, status, attemptNumber, conversationId.
32:50
ROUTE-8
Trigger endpoint created 3 invocations (3 dataset items x 1 agent). API response shows totalItems=3, status=completed.
11:07
ROUTE-9
Created evaluator and run config with evaluator attached. Run detail page shows View Evaluation Job button. Clicking opens new tab with correct URL format.
48:55
ROUTE-10
Clicking 'Back to test suite' button on run detail page navigates to dataset detail page with the Runs tab active.
19:46
EDGE-1
Created run config on empty dataset (0 items). API confirmed totalItems:0. Run detail page correctly shows Test Cases (0) with No items found message.
51:12
EDGE-4
API response shows a run with all items failed (failedItems=3, completedItems=0) reporting status=completed, confirming that deriveRunStatus returns completed when pending+running=0.
39:44
EDGE-6
Timestamps display in local timezone format using browser's locale settings via Intl.DateTimeFormat.
14:44
EDGE-7
API run status metadata has only totalItems/completedItems/failedItems fields. Cancelled invocations are counted under failedItems.
40:24
EDGE-8
Created run config via API without triggering. Config exists in system but does not appear in runs list. Confirms partial failure handling.
44:37
ADV-1
POST to trigger with non-existent runConfigId returns HTTP 404 with body {"res":{},"status":404}.
33:22
ADV-3
Clicked Create Run with empty Name field, validation message 'Name is required' appeared preventing submission.
7:04
ADV-4
Rapidly triple-clicked Create Run button. Only 1 run config was created. Button disabled after first click prevents duplicates.
12:37
ADV-5
Navigated to non-existent runId. Page shows error state with Error title and HTTP 404: Not Found message.
35:33
ADV-6
Created dataset item with script tag content. Verified HTML/script content rendered as plain text in JSON code view. No script execution detected.
37:29
LOGIC-1
POST /evals/run-dataset-items returns HTTP 404 Not Found. The old route has been successfully removed.
❌ Failed (5)
Test Case
Summary
Timestamp
Screenshot
ROUTE-4
Run detail Test Cases table shows Agent '-', a stuck 'Processing...' spinner, and a count of (0) when all invocations have failed.
25:55
EDGE-2
API shows totalItems:0 but UI incorrectly shows 4 items with Agent dash and Processing status stuck.
54:28
EDGE-3
API confirms 8 invocations (4 items x 2 agents) but UI only shows 4 rows grouped by dataset item.
59:59
EDGE-5
Auto-refresh never stops. UI stuck showing 'Run in progress' even though API reports run completed with all items failed.
14:59
ADV-2
API correctly returns 401 for unauthenticated requests, but UI loads datasets page without redirecting to login.
1:04:21
ROUTE-4: Dataset run detail shows test cases table with correct data – Failed
Where: Run detail page at /{tenantId}/projects/{projectId}/datasets/{datasetId}/runs/{runId}
Steps to reproduce:
Navigate to a run detail page where all agent invocations have failed
Observe the Test Cases table
What failed: The table displays incorrect information: (1) Agent column shows '-' instead of the agent ID, (2) Output column shows 'Processing...' with spinner for items where the API confirms status='failed', (3) Test Cases section header shows count '(0)' despite 3 rows being displayed in the table.
Code analysis: The UI derives display state from the conversations array on each item. When an invocation fails before creating a conversation, the conversations array is empty. The code at lines 394-434 renders a placeholder row with Agent='-' and "Processing..." whenever no conversations exist, regardless of the actual invocation status.
const conversations = item.conversations || [];
if (conversations.length === 0) {
  // No conversations yet - show placeholder row with loading state if run is in progress
  return (
    <TableRow key={item.id}>
      <TableCell>{/* ... */}</TableCell>
      <TableCell>
        <span className="text-sm text-muted-foreground">-</span>
      </TableCell>
      <TableCell>
        {conversationProgress.isRunning ? (
          <span className="flex items-center gap-2 text-sm text-muted-foreground">
            <Loader2 className="h-3 w-3 animate-spin" />
            Processing...
          </span>
        ) : (
          <span className="text-sm text-muted-foreground">No output</span>
        )}
      </TableCell>
      {/* ... */}
    </TableRow>
  );
}
<CardTitle>
  Test Cases (
  {filteredItems.reduce((acc, item) => acc + (item.conversations?.length || 0), 0)}{' '}
  {/* This counts conversations, not invocations - shows 0 when all fail */}
  )
</CardTitle>
Why this is likely a bug: The UI should display invocation status (from scheduled trigger invocations) rather than relying solely on conversations. When invocations fail, no conversation is created, but the UI should still show the failure status and agent ID from the invocation data.
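One possible direction, sketched below, assumes the page can read the invocation records that the items endpoint already returns (id, agentId, datasetItemId, status, conversationId, as verified in ROUTE-7); the Invocation and RowView shapes and the buildRows helper are illustrative, not code from the PR.

// Hypothetical sketch: build one display row per invocation instead of per
// conversation, so a failed invocation still surfaces its agent and a Failed status.
type Invocation = {
  id: string;
  agentId: string;
  datasetItemId: string;
  status: 'pending' | 'running' | 'completed' | 'failed';
  conversationId?: string;
};

type RowView = { key: string; agent: string; output: string };

function buildRows(invocations: Invocation[]): RowView[] {
  return invocations.map((inv) => ({
    key: inv.id,
    agent: inv.agentId, // available even when no conversation was created
    output:
      inv.status === 'failed'
        ? 'Failed'
        : inv.conversationId
          ? 'View conversation' // real output rendering happens when a conversation exists
          : 'Processing...',
  }));
}

// The Test Cases header count would then be invocations.length rather than the number
// of conversations, so it no longer reads "(0)" when every invocation fails.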
Introduced by this PR: Yes – this PR modified the relevant code. The run detail page was part of the dataset reimplementation.
Timestamp: 25:55
EDGE-2: Run config with no agent relations produces zero invocations – Failed
Where: Run detail page for a run config created with no agents selected
Steps to reproduce:
Create a run config with no agents selected on a populated dataset (4 items)
Navigate to the run detail page
What failed: API correctly shows totalItems:0 for the run. However, the UI run detail page incorrectly shows 4 items with Agent column as dash, all showing "Processing..." with "Run in progress" status stuck. The UI displays dataset items even when there are no invocations.
Code analysis: The backend at datasetRuns.ts lines 213-217 fetches all dataset items via listDatasetItems(db) regardless of how many invocations exist. The UI displays these items, creating a disconnect between what the API reports (0 invocations) and what the UI shows (4 dataset items with placeholder rows).
Why this is likely a bug: When no agents are selected, the run has zero invocations (totalItems=0). The UI should show "No test cases" or display based on actual invocation count from the API's status metadata, not based on the number of dataset items.
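A minimal sketch of that check, assuming the status metadata fields the API already reports (totalItems, completedItems, failedItems); the helper name is illustrative.

// Hypothetical sketch: decide whether to render an empty state from the run's
// invocation metadata rather than from the dataset's item list.
type RunStatusMeta = { totalItems: number; completedItems: number; failedItems: number };

function shouldShowEmptyState(meta: RunStatusMeta): boolean {
  // totalItems counts invocations; a run created with no agents has zero, so the table
  // should show "No test cases" even when the underlying dataset has items.
  return meta.totalItems === 0;
}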
Introduced by this PR: Yes – this PR modified the relevant code in both the API endpoint and UI page.
Timestamp: 54:28
EDGE-3: Run with multiple agents and items creates correct cross-product of invocations – Failed
Where: Run detail page for a run with 4 items x 2 agents
Steps to reproduce:
Create a run config selecting 2 agents on a dataset with 4 items
Trigger the run and navigate to the run detail page
What failed: The API correctly reports totalItems=8 (4 items x 2 agents = 8 invocations). However, the UI run detail page only displays 4 rows (grouped by dataset item) instead of the expected 8 rows (one per agent-item combination). The progress bar shows '0 of 4 completed' instead of '0 of 8'.
Code analysis: The UI iterates over filteredItems (dataset items) and then maps over item.conversations. Since no conversations were created (all failed), each dataset item shows a single placeholder row. The UI structure is designed to show one row per conversation, not one row per invocation.
{filteredItems.flatMap((item) => {
  // ...
  const conversations = item.conversations || [];
  if (conversations.length === 0) {
    // Shows ONE placeholder row per dataset item, not per invocation
    return (
      <TableRow key={item.id}>

return conversations.map((conversation) => (
  <TableRow key={`${item.id}-${conversation.conversationId}`}>
    {/* This correctly shows one row per conversation when they exist */}
  </TableRow>
));
Why this is likely a bug: The UI should display one row per scheduled trigger invocation (from the API's invocations data), not per dataset item. When multiple agents are selected, the cross-product creates N x M invocations, and the UI should reflect this.
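A sketch of that expansion, under the same assumption that invocation records with datasetItemId and agentId are available to the page; the types and helper below are illustrative.

// Hypothetical sketch: expand each dataset item into one row per invocation so the
// items x agents cross-product is visible (8 rows for 4 items x 2 agents).
type Invocation = { id: string; agentId: string; datasetItemId: string };
type Row = { itemId: string; invocationId: string; agentId: string };

function rowsForItems(itemIds: string[], invocations: Invocation[]): Row[] {
  return itemIds.flatMap((itemId) =>
    invocations
      .filter((inv) => inv.datasetItemId === itemId)
      .map((inv) => ({ itemId, invocationId: inv.id, agentId: inv.agentId }))
  );
}

// The progress denominator would likewise be invocations.length (8), not the dataset
// item count (4).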
Introduced by this PR: Yes – this PR introduced the dataset run detail page as part of the reimplementation.
Timestamp: 59:59
EDGE-5: Auto-refresh stops when run completes – Failed
Where: Run detail page during and after run completion
Steps to reproduce:
Navigate to a run detail page where all invocations have failed
Observe the auto-refresh behavior and UI state
What failed: Auto-refresh does NOT stop when the run completes. The API reports status=completed with 3 failed items, but the UI remains stuck showing 'Run in progress' with '0 of 3 completed'. The polling continues indefinitely via repeated requests every ~3 seconds.
Code analysis: The isRunInProgress flag at lines 116-117 depends on conversationProgress.isRunning, and conversation progress is calculated from the conversations that have been created. When all invocations fail, no conversations are created, so completed is always 0 while total stays above 0 (the dataset item count), making isRunning perpetually true.
const conversationProgress = useMemo(() => {
  if (!run?.items) return { total: 0, completed: 0, isRunning: false };
  const total = run.items.length;
  const completed = run.items.filter(
    (item) => item.conversations && item.conversations.length > 0
  ).length;
  return { total, completed, isRunning: completed < total && total > 0 };
}, [run]);

// Overall progress - run is complete only when both conversations AND evaluations are done
const isRunInProgress =
  conversationProgress.isRunning || (evaluationProgress?.isRunning ?? false);

useEffect(() => {
  if (!isRunInProgress) return;
  const interval = setInterval(() => {
    loadRun(false); // Don't show loading state for refresh
  }, 3000); // Refresh every 3 seconds
  return () => clearInterval(interval);
}, [isRunInProgress, loadRun]);
Why this is likely a bug: The UI should use the API's reported status (from deriveRunStatus which returns 'completed' when pending+running=0) to determine if the run is complete, not rely on conversation count. This causes infinite polling and a permanently stuck "in progress" state for any run where invocations fail.
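A sketch of a status-driven check, assuming the page can read the run's status field returned by the by-dataset endpoint (verified in ROUTE-6); the helper is illustrative.

// Hypothetical sketch: gate polling on the API-derived run status (completed when
// pending + running === 0) instead of on how many conversations exist.
type RunStatus = 'pending' | 'running' | 'completed' | 'failed';

function shouldKeepPolling(runStatus: RunStatus | undefined, evaluationsRunning: boolean): boolean {
  // Failed invocations never create conversations, so a conversation count can stay
  // below the total forever; the reported run status does not have that problem.
  const runFinished = runStatus === 'completed' || runStatus === 'failed';
  return !runFinished || evaluationsRunning;
}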
Introduced by this PR: Yes – this PR introduced the run detail page with auto-refresh functionality.
Timestamp: 14:59
ADV-2: Accessing dataset routes without authentication returns 401/403 – Failed
Where: UI datasets page at /default/projects/activities-planner/datasets
Steps to reproduce:
Clear all cookies (unauthenticated state)
Navigate directly to the datasets page URL
What failed: The API (port 3002) correctly returns 401 Unauthorized for unauthenticated requests. However, the UI (port 3000) does NOT redirect unauthenticated users to a login page. After clearing all cookies, navigating to the datasets page renders the full page with data visible.
Code analysis: The agents-manage-ui app does not have a Next.js middleware.ts file for route protection. The tenant layout ([tenantId]/layout.tsx) renders content without checking authentication status. Authentication is handled client-side via the AuthClientProvider context, but there's no server-side redirect for unauthenticated users.
// API requests include bypass secret for server-side calls
const headers: HeadersInit = {
  'Content-Type': 'application/json',
  ...(isServer && process.env.INKEEP_AGENTS_MANAGE_API_BYPASS_SECRET
    ? {
        Authorization: `Bearer ${process.env.INKEEP_AGENTS_MANAGE_API_BYPASS_SECRET}`,
      }
    : {}),
};
Why this is likely a bug: Protected routes should redirect unauthenticated users to the login page. The current implementation allows direct access to UI pages that display protected data because server-side rendering uses a bypass secret, but then renders the page to an unauthenticated user.
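A minimal middleware.ts sketch of server-side protection; the session cookie name, login path, and matcher pattern are assumptions about this app, not code from the repository.

// middleware.ts - hypothetical sketch of server-side route protection
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

export function middleware(request: NextRequest) {
  // Assumption: an authenticated session is represented by a cookie named 'session'.
  const hasSession = request.cookies.has('session');
  if (!hasSession) {
    const loginUrl = new URL('/login', request.url); // assumed login route
    loginUrl.searchParams.set('next', request.nextUrl.pathname);
    return NextResponse.redirect(loginUrl);
  }
  return NextResponse.next();
}

// Keep static assets and auth endpoints public; protect everything else.
export const config = {
  matcher: ['/((?!_next|login|api/auth).*)'],
};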
Introduced by this PR: No – pre-existing bug (authentication code not changed in this PR). However, this PR re-enabled the datasets routes which expose this issue.
Testing verified the Dataset (Test Suite) reimplementation in PR #2341. The core functionality works well: sidebar navigation, dataset CRUD operations, tab switching, item creation, run config creation, run progress tracking, auto-refresh, filtering, XSS prevention, and error handling all passed. One validation bug was confirmed in the run config form where submitting without agents selected does not show an error.
✅ Passed (31)
Test Case
Summary
Timestamp
Screenshot
ROUTE-1
Verified sidebar has Monitor section with Test Suites link positioned between Traces and Evaluations
3:38
ROUTE-2
Datasets page shows empty state with 'No test suites yet.' heading, description text, and 'Create test suite' link
5:12
ROUTE-3
Created dataset 'Playwright Test Suite' via the create form
6:14
ROUTE-4
Verified default tab is Items, clicking Runs tab shows runs content with URL ?tab=runs
7:54
ROUTE-5
Created a dataset item with role 'user' and content 'What is the weather in San Francisco?'
9:21
ROUTE-6
Successfully created run config 'Test Run Alpha' with Activities Planner agent selected
20:03
ROUTE-7
Run detail page shows progress tracking with 'Run in progress' banner, progress bar, and test cases table
20:29
ROUTE-8
Observed auto-refresh on run detail page with timestamp progressing from 'just now' to '2m ago'
22:38
ROUTE-9
Verified search filter, Show/Hide Filters toggle, Output Status filter, and Clear Filters button
25:54
ROUTE-10
DatasetItemViewDialog opened showing full Input messages with role and content
27:12
ROUTE-12
View Evaluation Job button appears on run detail page when evaluators are attached
42:08
ROUTE-13
Run detail page shows dual progress tracking for Test cases and Evaluations
42:08
ROUTE-14
Runs list shows 'Test Run Alpha' with relative creation timestamp and chevron icon
20:04
ROUTE-15
Run At column shows local timezone format, Created shows relative timestamp with clock icon
20:32
ROUTE-16
Runs tab empty state showing 'No runs yet' text and 'Add first run' button
48:32
ROUTE-17
Back to test suite button navigates to dataset page with Runs tab selected
32:32
ROUTE-18
Run config form showed 'Loading agents...' and 'Loading evaluators...' during data load
48:53
EDGE-1
Triggered run on empty dataset, graceful handling with 'No items found' message
52:58
EDGE-3
Validation error 'Name is required' displayed when submitting empty name
49:39
EDGE-4
Run detail page shows pending items correctly with 'Processing...' spinner and 'Pending...' text
20:30
EDGE-6
Created Run B and Run C in quick succession, both appear as separate entries
60:04
EDGE-7
Tab state persists via URL query parameter ?tab=runs
10:18
EDGE-8
Complex message content formats all display correctly in run detail table
60:43
EDGE-9
Search for non-matching term shows 'No test cases match the current filters' message
33:09
EDGE-10
Long input text truncated at ~100 chars with ellipsis, dialog shows full content
60:46
EDGE-11
Runs list shows skeleton loading placeholders during data fetch
65:32
ADV-1
XSS payload rendered as plain escaped text, no script execution
68:50
ADV-2
Non-existent run ID shows Error card with HTTP 404 Not Found
69:57
ADV-2_69-57.png
ADV-3
Invalid tab query parameter falls back gracefully, tab switching works normally
10:56
ADV-4
Dev mode auto-authenticates, no redirect to login page
0:00
ADV-5
Rapid double-click on Create Run button prevented duplicate creation
19:26
❌ Failed (1)
Test Case
Summary
Timestamp
Screenshot
EDGE-2
Form submitted successfully with 0 agents selected - expected validation error but got success
50:26
EDGE-2: Run config form with no agents selected validation – Failed
Where: Dataset run config creation form dialog
Steps to reproduce:
Navigate to a dataset's Runs tab
Click 'Add first run' or 'New run' button
Enter a name in the Name field (e.g., 'Validation Test Run')
Do NOT select any agents from the Agents multi-selector
Click 'Create Run' button
What failed: Expected a validation error preventing form submission when no agents are selected. Instead, the form submitted successfully, creating a run with 0 agents. The success toast 'Run config created successfully' appeared.
Code analysis: Examined the form validation schema and found the root cause. The UI shows the Agents field with an isRequired indicator (asterisk), but the Zod validation schema does not enforce a minimum of one agent.
Why this is likely a bug: The UI displays an asterisk (isRequired) on the Agents label indicating it's a required field, but the Zod schema only uses .default([]) without .min(1, ...). This creates a mismatch where users see a required indicator but can submit without selecting any agents. The fix is to change line 6 to: agentIds: z.array(z.string()).min(1, 'At least one agent is required').
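A sketch of the suggested schema change, assuming a Zod object schema backs the run config form; the surrounding field names are illustrative.

import { z } from 'zod';

// Hypothetical sketch of the run config form schema with the suggested change applied.
const runConfigSchema = z.object({
  name: z.string().min(1, 'Name is required'),
  // Before: agentIds: z.array(z.string()).default([]) - allows submitting with no agents.
  agentIds: z.array(z.string()).min(1, 'At least one agent is required'),
});

type RunConfigInput = z.infer<typeof runConfigSchema>;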
Introduced by this PR: Yes – this PR modified the relevant code. This PR re-enabled the dataset run configs routes and modified the dataset run config actions. While the validation file itself may not be new, the feature re-enablement means this validation gap is now exposed to users.
The run validated core feedback, branch, and dataset flows that were executable in this environment. One user-facing defect was confirmed through code inspection: invalid feedback query parameters are forwarded without bounds sanitization, which can trigger a hard load error instead of graceful coercion.
✅ Passed (10)
Test Case
Summary
Timestamp
Screenshot
ROUTE-1
Feedback page loaded at /default/projects/default/feedback without runtime crash and displayed a valid empty state.
0:00
ROUTE-2
Created positive message-scoped feedback via localhost API fallback and verified the positive row with messageId renders in Feedback UI.
10:45
ROUTE-5
UI delete removed the feedback row and repeat delete via API returned not found, confirming non-false-success behavior after deletion.
10:45
ROUTE-8
Clean branch merge API returned success and no conflicts.
14:07
ROUTE-10
Non-main branch deletion succeeded and protected main-branch deletion was correctly rejected.
14:07
ROUTE-11
Created a dataset run config with an agent relation and verified automatic run creation in UI; API trigger endpoint returned 202 with datasetRunId.
38:01
ROUTE-12
Run detail showed consistent status and counters, and dataset-runs items API returned 200 with matching datasetRunId, status, and attempt fields for all items.
38:16
EDGE-1
Branches page rendered a valid empty state with 'No branches' messaging and no broken table artifacts.
14:07
ADV-2
Rapid repeated clicks on merge and delete confirmations produced a single effective mutation per confirmation; the UI prevented duplicate destructive requests and the flow ended with a single branch deletion outcome.
25:53
ADV-3
Unauthorized feedback create and branch merge mutation calls were both denied with 401 responses, confirming mutation boundaries were enforced.
42:23
❌ Failed (1)
Test Case
Summary
EDGE-2
Invalid feedback query parameters rendered a failed-load state instead of being safely coerced to valid pagination bounds.
Steps to reproduce: Open the feedback page with out-of-range pagination params (for example ?page=999999&limit=100000).
What failed: The page passes unbounded numeric query values directly to the API, receives a validation error for oversized limit, and falls into the full-page error state instead of coercing inputs to safe bounds.
Code analysis: The page parser only checks that the query values parse as finite numbers, not that they fall within bounds; the API route validates query params with a strict pagination schema, so oversized values are rejected and the error bubbles up to the full-page error UI.
Why this is likely a bug: The UI path explicitly intends query-param handling for feedback pagination, but out-of-range values are not sanitized before strict API validation, producing a user-visible load failure.
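A sketch of bounds coercion before the API call; the default values and MAX_LIMIT are assumptions, not constants from the codebase.

// Hypothetical sketch: clamp pagination query params to safe bounds instead of
// forwarding raw values to the strict API schema.
const MAX_LIMIT = 100; // assumed upper bound accepted by the API

function coercePagination(pageParam: string | null, limitParam: string | null) {
  const parsedPage = Number.parseInt(pageParam ?? '1', 10);
  const parsedLimit = Number.parseInt(limitParam ?? '20', 10);

  const page = Number.isFinite(parsedPage) ? Math.max(1, parsedPage) : 1;
  const limit = Number.isFinite(parsedLimit) ? Math.min(Math.max(1, parsedLimit), MAX_LIMIT) : 20;

  return { page, limit };
}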
Introduced by this PR: Yes - this PR modified the relevant code.