fix(config): graceful failure fallback to default#98
fix(config): graceful failure fallback to default#98duncanista wants to merge 3 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves the resilience of datadog-agent-config config loading by making deserialization failures non-fatal at the field level, so that a single mistyped YAML/env value no longer causes the entire source to fall back to Config::default().
Changes:
- Introduces a generic
deserialize_with_default<T: Deserialize + Default>helper and applies it broadly via#[serde(deserialize_with)]on YAML/env config fields. - Hardens several existing deserializers to catch and ignore serde/type errors instead of propagating them.
- Adds “canary” regression tests ensuring YAML/env loads succeed even when many fields have wrong types.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
crates/datadog-agent-config/yaml.rs |
Applies per-field graceful deserialization across YAML structs and adds a comprehensive wrong-type YAML regression test. |
crates/datadog-agent-config/env.rs |
Applies graceful deserialization for select env fields and adds a comprehensive wrong-type env regression test. |
crates/datadog-agent-config/mod.rs |
Adds deserialize_with_default and hardens multiple deserializers to ignore type/serde errors. |
crates/datadog-agent-config/service_mapping.rs |
Makes service mapping deserialization ignore type errors (returns empty map). |
crates/datadog-agent-config/apm_replace_rule.rs |
Makes APM replace rules deserialization ignore type errors (returns None). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
apiarian-datadog
left a comment
There was a problem hiding this comment.
some notes and questions
| let json_string = match deserializer.deserialize_any(StringOrReplaceRulesVisitor) { | ||
| Ok(s) => s, | ||
| Err(e) => { | ||
| tracing::error!("Failed to deserialize APM replace rules: {e}, ignoring"); |
There was a problem hiding this comment.
we're okay with an error rather than a debug level here? i thought we generally prefer debug level unless we're about to take the whole app down.
| /// Note: string-typed fields (DD_SITE, DD_API_KEY, etc.) can't fail from env | ||
| /// because all env values are strings. This test covers all non-string types. | ||
| /// | ||
| /// When adding a new non-string field to EnvConfig, add an entry here with |
There was a problem hiding this comment.
is there a programmatic way we can ensure that there's no fields that are missing from the test?
| // No string env vars were set, so string fields stay at default. | ||
| // All non-string env vars were set to invalid values, so they also stay at default. | ||
| // The entire config should equal the default. | ||
| assert_eq!(config, Config::default()); |
There was a problem hiding this comment.
this might be a bit of a more convincing test if we have some non-default values. perhaps we can put something into one of the string configs, to make sure that at least those get set correctly and don't end up being reset to the default.
| match T::deserialize(deserializer) { | ||
| Ok(value) => Ok(value), | ||
| Err(e) => { | ||
| warn!("Failed to deserialize field: {}, using default", e); |
There was a problem hiding this comment.
why is this sometimes a warn and sometimes an error?
What does this PR do?
Adds per-field graceful deserialization to the datadog-agent-config crate. Previously, if a single config field had the wrong type (e.g., DD_FLUSH_TIMEOUT=abc or flush_timeout: [1,2,3] in YAML), the entire config
source would fail and fall back to Config::default() — losing every valid field alongside the bad one. Now, only the invalid field falls back to its default while all other correctly-typed fields are preserved.
The fix introduces one generic deserializer function, deserialize_with_default<T: Deserialize + Default>, which catches serde errors at the field level and returns T::default(). This is applied via
#[serde(deserialize_with)] to all previously-unprotected fields across EnvConfig, YamlConfig, and their nested structs. Several existing deserializers (deserialize_optional_duration_from_seconds,
deserialize_optional_duration_from_microseconds, deserialize_trace_propagation_style, deserialize_key_value_pair_array_to_hashmap, deserialize_service_mapping, deserialize_apm_replace_rules) were also hardened to
catch errors instead of propagating them.
Motivation
In production, a single misconfigured environment variable or YAML field should not wipe out the entire agent configuration. The agent should log a warning for the bad field and continue with defaults for that
field only, while respecting all other valid configuration.
Additional Notes
deserializer functions.
Describe how to test/QA your changes
Two comprehensive regression tests were added:
Config::default().
equals its default.
Both tests serve as canaries: when adding a new field, adding it to the test YAML/env with a wrong type will immediately fail if the field lacks a graceful deserializer.
All 78 tests pass (cargo test -p datadog-agent-config --lib).