Problem Statement
As one of the breaking changes for 3.0, we [removed the DSL format](#3367). I think that it's worth revisiting this and potentially adding a new simplified JSON format.
Current State
Why did we remove it?
Originally we directly copied the underlying OpenSearch DSL format from the internal search request. This includes the hits, as well as lots of extra metadata:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 12,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "fruits",
        "_id": "d88d69445faebeaaa9c0683d",
        "_score": 1.0,
        "_source": {
          "name": "strawberry",
          "color": "red",
          "approx_size_in": 2,
          "has_pit": false
        }
      },
      {
        "_index": "fruits",
        "_id": "8bec12157951f7e9003a13b4",
        "_score": 1.0,
        "_source": {
          "name": "apple",
          "color": "red",
          "approx_size_in": 4,
          "has_pit": false
        }
      }
    ]
  }
}
The problem is that this doesn't work for more sophisticated queries. For anything that requires some level of postprocessing (e.g. spath extraction), it's not obvious what the result should look like, and the metadata may no longer make sense (e.g. should the time taken include the spath extraction time?). It also isn't clear how to express this format for join queries, which span multiple indices and thus invalidate the _shards metrics. And even for the queries that do work, directly returning DSL without going through the normal result serialization system punches a hole in our query execution that's hard to work around.
Why should we bring it back?
The primary JSON interface we ship today is JDBC, which orients the records around a list of columns.
{
  "schema": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "approx_size_in",
      "type": "bigint"
    },
    {
      "name": "has_pit",
      "type": "boolean"
    },
    {
      "name": "color",
      "type": "string"
    }
  ],
  "datarows": [
    [
      "strawberry",
      2,
      false,
      "red"
    ],
    [
      "apple",
      4,
      false,
      "red"
    ]
  ],
  "total": 2,
  "size": 2
}
The problem is that this format isn't convenient for standard JSON parsing flows: you need one pass to load the columns and another to correlate them with the data rows. Without custom logic, it's not easy to do any analysis of the results, and with lots of columns it's hard to tell which value belongs to which column. That custom-logic requirement breaks tools like jq. It's inconvenient enough that I use a custom script to work around it:
$ echo '{"query":"source=fruits | head 2"}' | xh post localhost:9200/_plugins/_ppl | unjdbc
{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"}
{"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}
Only then can I natively pass query output to these tools.
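For illustration, the core of that conversion is just zipping the schema's column names against each data row. A minimal Python sketch (a stand-in for my unjdbc script, not the script itself):

```python
import json


def jdbc_to_ndjson(response: dict) -> str:
    """Convert a JDBC-style response body into newline-delimited JSON objects."""
    names = [column["name"] for column in response["schema"]]
    return "\n".join(
        # Pair each value with its column name; sort_keys gives stable output.
        json.dumps(dict(zip(names, row)), sort_keys=True)
        for row in response["datarows"]
    )
```

This is exactly the per-record correlation step that every JDBC-format consumer has to reimplement today.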
Long-term Goals
- Simplify JSON result parsing for users.
- Enable easier integration with other tools that process OpenSearch DSL records.
- Provide an easier migration path for 2.x customers reliant on this format.
Proposal
I propose re-introducing at least one json format. I think there are three formats worth considering:
- json_array: an ordinary JSON array compatible with standard parsers.
[{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"},
 {"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}]
- ndjson: an NDJSON stream, which works better with streaming parsers like jq. It's also theoretically easier to stream over transport if we'd like to optimize latency here in the future (but if we need that, we should invest in shipping a binary format).
{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"}
{"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}
- dsl: a DSL-style response with hits and _source nesting, which grants compatibility to 2.x clients that don't want to migrate formats. We don't need to reimplement all of the metadata in one batch; I think it's safe to assume that "most" users are just pulling out the _source records.
{
  "took": 1, // <-- (Optional) Implemented by timing the query if we detect this format.
  "hits": {
    "hits": [
      {
        "_index": "fruits", // <-- Metadata optional; we can include only `_source` if it's complex.
        "_id": "d88d69445faebeaaa9c0683d",
        "_source": {
          "name": "strawberry",
          "color": "red",
          "approx_size_in": 2,
          "has_pit": false
        }
      },
      {
        "_index": "fruits", // <-- Leave it off for queries that don't have a clear index value (e.g. joins, aggregations)
        "_id": "8bec12157951f7e9003a13b4",
        "_source": {
          "name": "apple",
          "color": "red",
          "approx_size_in": 4,
          "has_pit": false
        }
      }
    ]
  }
}
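To make the three candidates concrete, here is a rough Python sketch that renders the same row dictionaries into each proposed format. The function names are illustrative only; this is not the plugin's actual serialization code:

```python
import json


def to_json_array(rows: list) -> str:
    """json_array: one JSON array of plain objects."""
    return json.dumps(rows)


def to_ndjson(rows: list) -> str:
    """ndjson: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def to_dsl(rows: list) -> str:
    """dsl: minimal hits/_source nesting; metadata like _index, _id, and
    took is optional and can be layered on incrementally, as proposed above."""
    return json.dumps({"hits": {"hits": [{"_source": row} for row in rows]}})
```

Note that all three consume the same row data, which is why they share no real cross-dependency and can ship independently.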
Approach
The actual formats can be implemented one-by-one; I don't think there's any cross-dependency (we might share object serialization if we need to do column traversal, but that's an implementation detail).
To give maximum flexibility to users and enable seamless migration, we can define a json format that defaults to any of the others via a config like plugins.query.default_json_format, which takes any of the above values. This aliases json to be equivalent to ?format=ndjson, ?format=dsl, ... Users migrating 2.x clusters can set this config to dsl:
PUT _cluster/settings
{
"persistent": { "plugins.query.default_json_format": "dsl" }
}
I think the nicest default for most end users will be json_array.
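The aliasing itself could be a single lookup at format-resolution time. A sketch under this proposal's assumptions (the setting name comes from above; resolve_format is a hypothetical helper, not existing plugin code):

```python
KNOWN_FORMATS = {"jdbc", "json_array", "ndjson", "dsl"}


def resolve_format(requested: str, settings: dict) -> str:
    """Map the ?format= parameter to a concrete formatter name."""
    if requested == "json":
        # "json" is an alias whose target is controlled by cluster settings;
        # json_array is the proposed default for fresh clusters.
        return settings.get("plugins.query.default_json_format", "json_array")
    if requested not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {requested}")
    return requested
```

A migrated 2.x cluster that sets the config to dsl gets the old shape back without touching any client code.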
Alternatives
- Do nothing: Require users to continue implementing their own field mappers.
- We can freely implement or not-implement any of these formats. ndjson is probably the best candidate for exclusion; I selfishly would like it, as I use it regularly as part of my development, and if we have the infrastructure to implement the other two then it seems easy to add on.
- For the config: it introduces the scenario of users breaking themselves by changing it, so it may be safer to just alias json with a documentation note like "This is an alias to dsl provided for backward compatibility."
  - The downside is that a reasonable user would try ?format=json and expect good results; there's a world where it becomes tribal knowledge to say "Oh, you don't want json, you want json_array."
- We could also pick a better json default and have dsl be separate, but this requires user code to seamlessly handle using dsl on 3.x and json on 2.x.
Implementation Discussion
Not particularly interesting: we can register a new formatter that takes a QueryResponse and does arbitrary processing. The existing JDBC formatters already give a pretty good template for this.
Metadata can be added to dsl incrementally as needed; an initial implementation might only have hits and _source.
"took" would also be a nice piece of metadata to put on the existing JDBC format; it could be implemented by putting a simple timer around query execution. This would help users distinguish backend from frontend latency at a glance.