Skip to content

in_exec: run collector in threaded mode#11782

Open
edsiper wants to merge 2 commits intomasterfrom
http-deadlock-update
Open

in_exec: run collector in threaded mode#11782
edsiper wants to merge 2 commits intomasterfrom
http-deadlock-update

Conversation

@edsiper
Copy link
Copy Markdown
Member

@edsiper edsiper commented May 7, 2026

Fixes an internal HTTP server self-request deadlock by running the exec input collector on Fluent Bit's existing threaded input path. This keeps blocking popen()/read work out of the engine event loop while leaving the monitoring HTTP server on its original event loop.

Adds an integration regression test where exec runs curl against /api/v1/metrics/prometheus and verifies the internal HTTP server remains responsive.

Validation

  • Confirmed the regression test fails on the original baseline with an HTTP read
    timeout.
  • cmake --build build -j8
  • tests/integration/run_tests.py --quiet scenarios/internal_http_server/tests/ test_internal_http_server_exec_deadlock_001.py::test_http_server_responsive_afte r_exec_self_request
  • tests/integration/run_tests.py --valgrind --valgrind-strict --quiet scenarios/internal_http_server/tests/ test_internal_http_server_exec_deadlock_001.py::test_http_server_responsive_afte r_exec_self_request
  • tests/integration/run_tests.py --quiet scenarios/internal_http_server
  • ctest --test-dir build -R flb-it-http_server --output-on-failure
  • GITHUB_EVENT_NAME=pull_request GITHUB_BASE_REF=master tests/ integration/.venv/bin/python .github/scripts/commit_prefix_check.py

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a deadlock that caused the internal HTTP server to become unresponsive when executing metric collection that includes self-referential requests.
  • Tests

    • Added integration test scenario to verify the HTTP server remains responsive after exec inputs make self-targeted metric requests.

edsiper added 2 commits May 6, 2026 21:41
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

The exec input plugin is updated to run in threaded mode (FLB_INPUT_THREADED), and a regression test validates that the internal HTTP server remains responsive when the threaded exec input makes periodic self-requests to scrape metrics, ensuring no deadlock occurs.

Changes

Exec Plugin Threaded Mode Deadlock Fix

Layer / File(s) Summary
Plugin Configuration
plugins/in_exec/in_exec.c
The in_exec_plugin struct gains .flags = FLB_INPUT_THREADED to enable threaded input mode.
Test Configuration
tests/integration/scenarios/internal_http_server/config/internal_http_server_exec_deadlock.yaml
YAML defines a service with internal HTTP server and a pipeline where exec input runs curl every second to scrape /api/v1/metrics/prometheus, outputting to null.
Integration Test
tests/integration/scenarios/internal_http_server/tests/test_internal_http_server_exec_deadlock_001.py
Test starts the service, verifies /api/v1/uptime returns 200 with uptime data, sleeps, then polls the endpoint using wait_for_condition to confirm server responsiveness with a 10-second timeout.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

docs-required

Suggested reviewers

  • fujimotos
  • koleini

Poem

🐰 A flag is flipped, threads now align,
The exec plugin runs in threaded time,
Self-requests no more shall deadlock bind,
The HTTP server stays responsive and kind! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly describes the main change: enabling threaded mode for the in_exec input plugin to resolve an event-loop deadlock issue.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch http-deadlock-update

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4482c18480

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/in_exec/in_exec.c
struct flb_input_plugin in_exec_plugin = {
.name = "exec",
.description = "Exec Input",
.flags = FLB_INPUT_THREADED,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve conditional routing for exec inputs

Making every exec instance threaded disables an existing routing feature for this plugin: flb_input_log_append() explicitly skips conditional/per-record routes when flb_input_is_threaded(ins) is true (src/flb_input_log.c:1021-1025) and falls back to normal routing. Any exec input configured with routes.logs conditions or per-record routing will now send records to the default matched outputs instead of the conditional destinations, whereas it was non-threaded before this line. Please either keep exec non-threaded for configurations that use conditional routes or add threaded-input support for that routing path.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/integration/scenarios/internal_http_server/tests/test_internal_http_server_exec_deadlock_001.py (1)

49-59: ⚡ Quick win

Prefer condition-based waiting over fixed sleep.

time.sleep(2) adds nondeterministic delay and can still be too short/long depending on CI load. Fold this into wait_for_condition and let polling drive readiness.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/integration/scenarios/internal_http_server/tests/test_internal_http_server_exec_deadlock_001.py`
around lines 49 - 59, Remove the fixed sleep and fold that readiness check into
the existing polling: delete the time.sleep(2) call and ensure
wait_for_condition (used on service.service.wait_for_condition) performs the
polling by calling service.request("/api/v1/uptime") until the lambda returns a
non-None response that meets response["status_code"] == 200 and contains
"uptime_sec" in response["body"]. Keep or adjust the existing timeout/interval
parameters as needed but do not reintroduce any fixed sleeps.
tests/integration/scenarios/internal_http_server/config/internal_http_server_exec_deadlock.yaml (1)

13-14: ⚡ Quick win

Bound the curl execution time to reduce test flakiness.

Consider adding --connect-timeout and --max-time so a stalled request doesn’t leave long-running child processes during regression runs.

Proposed tweak
-      command: curl -s http://127.0.0.1:${FLUENT_BIT_HTTP_MONITORING_PORT}/api/v1/metrics/prometheus
+      command: curl -s --connect-timeout 1 --max-time 2 http://127.0.0.1:${FLUENT_BIT_HTTP_MONITORING_PORT}/api/v1/metrics/prometheus
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/integration/scenarios/internal_http_server/config/internal_http_server_exec_deadlock.yaml`
around lines 13 - 14, The curl command in the YAML test config (the command
field that uses ${FLUENT_BIT_HTTP_MONITORING_PORT}) can hang and cause flaky
tests; update that command to include connection and overall timeouts (for
example add --connect-timeout 2 and --max-time 5) so stalled requests fail fast
instead of leaving long-running child processes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/integration/scenarios/internal_http_server/tests/test_internal_http_server_exec_deadlock_001.py`:
- Around line 49-62: The test currently only checks HTTP responsiveness but
never proves the exec self-request path ran; update the test around the
wait_for_condition block to record the pre-exec value of the exec counter/metric
(e.g., the "exec.0" input/exec counter exposed by the service), trigger or allow
the exec self-request to run as already done, then after the wait assert that
the "exec.0" metric/counter increased (compare post-run value > pre-run value)
so the test fails if no exec activity occurred; use existing helpers like
service.request(...) or service.service.wait_for_condition to fetch the metric
and reference the "exec.0" metric name in your assertions.

---

Nitpick comments:
In
`@tests/integration/scenarios/internal_http_server/config/internal_http_server_exec_deadlock.yaml`:
- Around line 13-14: The curl command in the YAML test config (the command field
that uses ${FLUENT_BIT_HTTP_MONITORING_PORT}) can hang and cause flaky tests;
update that command to include connection and overall timeouts (for example add
--connect-timeout 2 and --max-time 5) so stalled requests fail fast instead of
leaving long-running child processes.

In
`@tests/integration/scenarios/internal_http_server/tests/test_internal_http_server_exec_deadlock_001.py`:
- Around line 49-59: Remove the fixed sleep and fold that readiness check into
the existing polling: delete the time.sleep(2) call and ensure
wait_for_condition (used on service.service.wait_for_condition) performs the
polling by calling service.request("/api/v1/uptime") until the lambda returns a
non-None response that meets response["status_code"] == 200 and contains
"uptime_sec" in response["body"]. Keep or adjust the existing timeout/interval
parameters as needed but do not reintroduce any fixed sleeps.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 093ab459-b522-4650-bc0c-60f48e29817d

📥 Commits

Reviewing files that changed from the base of the PR and between 54a9ebc and 4482c18.

📒 Files selected for processing (3)
  • plugins/in_exec/in_exec.c
  • tests/integration/scenarios/internal_http_server/config/internal_http_server_exec_deadlock.yaml
  • tests/integration/scenarios/internal_http_server/tests/test_internal_http_server_exec_deadlock_001.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants