Skip to content

fix(config): graceful failure fallback to default#98

Open
duncanista wants to merge 3 commits intomainfrom
jordan.gonzalez/config/graceful-failure-per-field
Open

fix(config): graceful failure fallback to default#98
duncanista wants to merge 3 commits intomainfrom
jordan.gonzalez/config/graceful-failure-per-field

Conversation

@duncanista
Copy link
Contributor

What does this PR do?

Adds per-field graceful deserialization to the datadog-agent-config crate. Previously, if a single config field had the wrong type (e.g., DD_FLUSH_TIMEOUT=abc or flush_timeout: [1,2,3] in YAML), the entire config
source would fail and fall back to Config::default() — losing every valid field alongside the bad one. Now, only the invalid field falls back to its default while all other correctly-typed fields are preserved.

The fix introduces one generic deserializer function, deserialize_with_default<T: Deserialize + Default>, which catches serde errors at the field level and returns T::default(). This is applied via
#[serde(deserialize_with)] to all previously-unprotected fields across EnvConfig, YamlConfig, and their nested structs. Several existing deserializers (deserialize_optional_duration_from_seconds,
deserialize_optional_duration_from_microseconds, deserialize_trace_propagation_style, deserialize_key_value_pair_array_to_hashmap, deserialize_service_mapping, deserialize_apm_replace_rules) were also hardened to
catch errors instead of propagating them.

Motivation

In production, a single misconfigured environment variable or YAML field should not wipe out the entire agent configuration. The agent should log a warning for the bad field and continue with defaults for that
field only, while respecting all other valid configuration.

Additional Notes

  • No structural changes to EnvConfig, YamlConfig, nested structs, Config, merge_config, or ConfigSource::load. The change is purely additive #[serde(deserialize_with)] annotations and error-catching in existing
    deserializer functions.
  • All existing deserializers remain in use — no dead code introduced.
  • The test_parse_duration_from_microseconds test was updated: negative and float values for microsecond durations now gracefully return None instead of erroring, consistent with the graceful-failure goal.

Describe how to test/QA your changes

Two comprehensive regression tests were added:

  • test_all_yaml_fields_wrong_type_fallback_to_default — Creates a YAML file where every field (including deeply nested OTLP config) is set to the wrong type ([1, 2, 3]). Asserts load() succeeds and config ==
    Config::default().
  • test_all_env_fields_wrong_type_fallback_to_default — Sets every non-string env var to an invalid value ("not_a_number", "not_a_bool", "not_json{{", etc.). Asserts load() succeeds and every non-string field
    equals its default.

Both tests serve as canaries: when adding a new field, adding it to the test YAML/env with a wrong type will immediately fail if the field lacks a graceful deserializer.

All 78 tests pass (cargo test -p datadog-agent-config --lib).

@duncanista duncanista requested review from a team as code owners March 12, 2026 18:53
@duncanista duncanista requested review from apiarian-datadog, Copilot and shreyamalpani and removed request for a team March 12, 2026 18:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the resilience of datadog-agent-config config loading by making deserialization failures non-fatal at the field level, so that a single mistyped YAML/env value no longer causes the entire source to fall back to Config::default().

Changes:

  • Introduces a generic deserialize_with_default<T: Deserialize + Default> helper and applies it broadly via #[serde(deserialize_with)] on YAML/env config fields.
  • Hardens several existing deserializers to catch and ignore serde/type errors instead of propagating them.
  • Adds “canary” regression tests ensuring YAML/env loads succeed even when many fields have wrong types.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
crates/datadog-agent-config/yaml.rs Applies per-field graceful deserialization across YAML structs and adds a comprehensive wrong-type YAML regression test.
crates/datadog-agent-config/env.rs Applies graceful deserialization for select env fields and adds a comprehensive wrong-type env regression test.
crates/datadog-agent-config/mod.rs Adds deserialize_with_default and hardens multiple deserializers to ignore type/serde errors.
crates/datadog-agent-config/service_mapping.rs Makes service mapping deserialization ignore type errors (returns empty map).
crates/datadog-agent-config/apm_replace_rule.rs Makes APM replace rules deserialization ignore type errors (returns None).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@duncanista duncanista changed the title fix(config): graceful failure fix(config): graceful failure fallback to default Mar 12, 2026
Copy link
Contributor

@apiarian-datadog apiarian-datadog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some notes and questions

let json_string = match deserializer.deserialize_any(StringOrReplaceRulesVisitor) {
Ok(s) => s,
Err(e) => {
tracing::error!("Failed to deserialize APM replace rules: {e}, ignoring");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're okay with an error rather than a debug level here? i thought we generally prefer debug level unless we're about to take the whole app down.

/// Note: string-typed fields (DD_SITE, DD_API_KEY, etc.) can't fail from env
/// because all env values are strings. This test covers all non-string types.
///
/// When adding a new non-string field to EnvConfig, add an entry here with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a programmatic way we can ensure that there's no fields that are missing from the test?

// No string env vars were set, so string fields stay at default.
// All non-string env vars were set to invalid values, so they also stay at default.
// The entire config should equal the default.
assert_eq!(config, Config::default());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be a bit of a more convincing test if we have some non-default values. perhaps we can put something into one of the string configs, to make sure that at least those get set correctly and don't end up being reset to the default.

match T::deserialize(deserializer) {
Ok(value) => Ok(value),
Err(e) => {
warn!("Failed to deserialize field: {}, using default", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this sometimes a warn and sometimes an error?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants