From 3a555db6ca93e51dee1b2a4a18ab4bb45b543200 Mon Sep 17 00:00:00 2001
From: Lukas Kalbertodt
Date: Thu, 20 Mar 2025 10:36:37 +0100
Subject: [PATCH 1/6] Discuss: various topics

---
 docs/common/acl.md            | 26 ++++++++++++
 docs/common/index.md          | 57 +++++++++++++++++++++++++
 docs/common/types.md          | 53 +++++++++++++++++++++++
 docs/event/acl.md             | 12 ++++++
 docs/event/index.md           | 28 +++++++++++++
 docs/important-differences.md | 79 +++++++++++++++++++++++++++++++++++
 docs/index.md                 | 49 ++++++++++++++++++++++
 docs/open-questions.md        | 19 +++++++++
 8 files changed, 323 insertions(+)
 create mode 100644 docs/common/acl.md
 create mode 100644 docs/common/index.md
 create mode 100644 docs/common/types.md
 create mode 100644 docs/event/acl.md
 create mode 100644 docs/event/index.md
 create mode 100644 docs/important-differences.md
 create mode 100644 docs/index.md
 create mode 100644 docs/open-questions.md

diff --git a/docs/common/acl.md b/docs/common/acl.md
new file mode 100644
index 0000000..cb61343
--- /dev/null
+++ b/docs/common/acl.md
@@ -0,0 +1,26 @@
+---
+sidebar_position: 1
+---
+
+# ACL (access control list)
+
+ACLs control access to Opencast entities.
+An ACL is simply a list of `role` + `action` pairs.
+An entry gives users that have that particular `role` the permission to perform the specified `action` on the entity.
+Both `role` and `action` have the [type `Label`](./types).(1?)
+
+There are two special actions recognized by Opencast.
+Other actions can be used for custom purposes by external applications.
+- `read`: generally, gives read access to an entity
+- `write`: generally, gives write access to an entity (changing or deleting it)
+
+*Impl note*: `read` and `write` roles should likely be stored in a way that allows for fast filtering, e.g. in a `read_roles` DB column that has a DB index.
+
+
+---
+
+:::danger[Open questions]
+
+- (1?) Is it fine to restrict roles and actions like that? Or can we restrict it even more?
+
+:::
diff --git a/docs/common/index.md b/docs/common/index.md
new file mode 100644
index 0000000..ff7d50f
--- /dev/null
+++ b/docs/common/index.md
@@ -0,0 +1,57 @@
+---
+sidebar_position: 3
+---
+
+# Common specifications
+
+## Data storage
+
+The single source of truth for everything is the database (DB) plus files on the file system¹ referenced by the DB.
+Every piece of information is only stored in one place in the DB.
+
+Only a handful of files are stored on the file system:
+- Binary and/or large files like video, audio, images, ...
+- Files that need to be delivered in a specific format anyway (VTT subtitles, ...)
+
+:::info[Differences from current OC model]
+In particular, textual metadata, ACLs, cutting information or anything like that is _not_ stored on the file system!
+:::
+
+The database never references files by absolute path or URL.
+At most, it stores a path relative to the configured `storage.dir`, but potentially in an even more implicit way.
+
+
+(¹) File system = local file system, or NFS, or S3 storage or potentially others.
+
+### Derived data storage
+
+For different purposes, it might be useful to store the same data again in a different form.
+For example, using a search index for full text search.
+(Note however: whenever possible and useful, use DB indices built into the DBMS.)
+
+These derived data sources can be slightly behind the DB (e.g. due to indexing times), which is acceptable.
+However, it is crucially important that data only ever flows from the DB into other data stores, _never_ the other way around.
+Deleting all derived data stores must never result in data loss as they can always be regenerated from the DB.
+Rebuilding derived data stores must always yield the same result, regardless of what the derived store previously contained.
+Opencast should do its best to keep the derived data stores in sync in a timely manner.
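The one-way data flow described for derived data stores can be sketched roughly as follows. This is a minimal illustration in Python, not Opencast code: `events_db`, `rebuild_search_index` and the dict-based "index" are hypothetical stand-ins for the real DB and search service. The point is only that the rebuild reads nothing but DB rows, so the result does not depend on what the derived store contained before.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for rows of the authoritative events table.
EventRow = Dict[str, str]


def rebuild_search_index(events_db: List[EventRow]) -> Dict[str, List[str]]:
    """Build a word -> sorted event IDs index purely from DB rows.

    The function takes no previous index as input, so data can only
    flow from the DB into the derived store, never the other way.
    """
    index: Dict[str, Set[str]] = {}
    for row in events_db:
        for word in row["title"].lower().split():
            index.setdefault(word, set()).add(row["id"])
    return {word: sorted(ids) for word, ids in sorted(index.items())}


events_db = [
    {"id": "e1", "title": "Banana basics"},
    {"id": "e2", "title": "Advanced banana"},
]

# Whatever the derived store held before (here: stale data) is discarded.
stale_index = {"apple": ["e9"]}
fresh_index = rebuild_search_index(events_db)
```

Because the rebuild is a pure function of the DB content, deleting the derived store loses nothing, and running the rebuild twice gives the same index.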
+
+
+## Promised properties
+
+This data model promises certain properties about certain fields/data, for example: "there is a non-empty title", "this is an array of strings" or "the duration matches the duration of all tracks".
+
+- It's Opencast's responsibility to ensure these properties. Whenever an entity is added or changed, these properties need to be maintained, usually by rejecting the change request (e.g. 4xx response in API).
+- If an entity does not have these properties, this should be considered a bug in Opencast and should be fixed ASAP.
+  - We should never find ourselves in the situation where external apps (e.g. LMS plugins) need to work around a broken property of Opencast.
+- The same goes for legacy events, which might be broken in the new model. They cannot be kept as is, they need to be changed/migrated to exhibit these properties.
+- The implementation should try, wherever possible, to make broken events impossible to represent. As a simple example, the title field in the DB should be `non null`.
+
+
+## Well defined API response
+
+While not technically part of the data model, the possible responses of APIs should be well defined and documented.
+The documentation should automatically be derived from the code in order to keep it up to date (which otherwise will absolutely fail).
+The implementation details need to be figured out, but the idea is that the same "code" (e.g. a Java `record` definition with attributes) that leads to the serialized API response is also used as source for the documentation.
+
+Users of the API should never need to look at an actual response to know what to expect.
+An actual response is always something *specific* and does not communicate what fields are optional and what possible values to expect for each field.
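One way to picture "the same code drives both the response and its documentation" is the sketch below. It uses a Python dataclass purely for brevity; in Opencast itself this would more likely be a Java `record`. The type `EventResponse`, its fields, and the helpers `describe` and `serialize` are all hypothetical and only illustrate the idea, not an actual Opencast API.

```python
import dataclasses
import json
import typing
from dataclasses import dataclass
from typing import Optional


@dataclass
class EventResponse:
    """Hypothetical API response type; field names are illustrative only."""
    id: str
    title: str
    duration_ms: int
    description: Optional[str] = None


# Maps Python annotations to JSON type names used in the documentation.
_JSON_TYPES = {str: "string", int: "number", bool: "boolean"}


def describe(cls) -> dict:
    """Derive a simple field description from the dataclass definition."""
    hints = typing.get_type_hints(cls)
    doc = {}
    for field in dataclasses.fields(cls):
        hint = hints[field.name]
        args = typing.get_args(hint)
        optional = type(None) in args
        base = next(a for a in args if a is not type(None)) if optional else hint
        doc[field.name] = {"type": _JSON_TYPES[base], "required": not optional}
    return doc


def serialize(event: EventResponse) -> str:
    """Serialize the very same dataclass that `describe` documents."""
    return json.dumps(dataclasses.asdict(event))


event = EventResponse(id="e1", title="Banana", duration_ms=90_000)
schema = describe(EventResponse)
payload = json.loads(serialize(event))
```

Since `describe` and `serialize` both read the one `EventResponse` definition, adding or renaming a field changes the documentation and the response together, which is the property the section asks for.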
diff --git a/docs/common/types.md b/docs/common/types.md
new file mode 100644
index 0000000..58cd367
--- /dev/null
+++ b/docs/common/types.md
@@ -0,0 +1,53 @@
+---
+sidebar_position: 2
+---
+
+# Common types
+
+These are types used throughout the rest of this specification and defined here once to avoid repetition.
+
+- `string`: a valid UTF-8 string. While being processed in code, it might be in a different encoding temporarily, but in the public interface of Opencast, these are always valid UTF-8.
+- `NonBlankString`: A string that is not "blank", meaning it is not empty and does not consist only of [Unicode `White_Space`](https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt).
+- `NonBlankAsciiString`: A `NonBlankString` that is also restricted to only using ASCII characters.
+- `Label`: a `NonBlankAsciiString` that only consists of letters, numbers or `-._~!*:@,;`. This means a label is URL-safe except for use in the domain part.(2?)
+- `ID`: a `Label` that cannot be changed after being created.
+- `Username`: TODO define rules for usernames
+- `LangCode`: specifies a language and optionally a region, e.g. `en` or `en-US`. Based on the [IETF BCP 47 language tag specification](https://www.rfc-editor.org/info/rfc5646): a two letter language code, optionally followed by a hyphen and a two letter region tag.
+- `int8`, `int16`, `int32`, `int64`: signed integers of specific bit width.
+- `uint8`, `uint16`, `uint32`, `uint64`: unsigned integers of specific bit width.(1?)
+- `Milliseconds`: a `uint64` representing a duration or a video timestamp in milliseconds (ms). Impl note: whenever possible, in code, this should be a custom type and not just `int`.
+- `DateTime`: a date + time with timezone, i.e. a specific moment in a specific timezone.
+- `Timestamp`: a specific moment in time, without time zone (e.g. always stored as UTC).
+
+Generally, this basically uses TypeScript syntax:
+
+- `T?`: denotes an optional type, i.e.
`bool?` means the field could be either `true`, `false` or undefined. All fields without `?` are _required_ / `non null`.
+- `T[]`: array of type `T`.
+- `[T, U, ...]`: a tuple of values.
+- `"foo" | "bar"`: one of the listed constant values.
+
+## JSON serialization
+
+For most types, the JSON serialization is the obvious one, but there are some minor but important details.
+- `bool` as `bool`
+- `string` and all "string with extra requirements" (e.g. `Label`, `ID`, `NonBlankAsciiString`) as string
+- Integers as number.
+  - Note on 64 bit integers: In JavaScript, there is only one `number` type, which is a 64 bit floating point number (`double`, `f64`).
+    Those can only exactly represent integers up to 2^53.
+    While JSON is closely related to JS, the format itself is allowed to exceed `f64` precision and may in fact encode arbitrary precision numbers.
+    Opencast should serialize a 64 bit integer as exact integer into JSON and *not* rounded like an `f64`.
+    Rounding might happen in the frontend, but the API should emit the exact integer value.
+- Arrays as arrays
+- Tuples as arrays
+- `Map` is serialized as object
+- `DateTime`: as ISO 8601-compatible formatted string. The ISO standard actually allows a number of different formats by omitting parts of the string. Opencast shall format all date times as either `YYYY-MM-DDTHH:mm:ss.sssZ` or `YYYY-MM-DDTHH:mm:ssZ`, i.e. only the sub-second part is optional. The parts of this format string are best described in [the ECMAScript specification](https://tc39.es/ecma262/multipage/numbers-and-dates.html#sec-date-time-string-format) (which, again, is a subset of ISO 8601). Only thing of note: `Z` could either be literal `Z` or a timezone offset like `+02`.
+- `Timestamp`: like `DateTime` but always in UTC, so always ending with literal `Z`.
+
+---
+
+:::danger[Open questions]
+
+- (1?) Java famously has no/bad support for unsigned integers.
Decide how to deal with that: do we just give up one bit or do we require proper unsigned usage via `Integer.*Unsigned` methods? Either way: these values must never be negative!
+- (2?) Maybe disallow more of these special characters?
+
+:::
diff --git a/docs/event/acl.md b/docs/event/acl.md
new file mode 100644
index 0000000..2800609
--- /dev/null
+++ b/docs/event/acl.md
@@ -0,0 +1,12 @@
+---
+sidebar_position: 3
+---
+
+# ACL
+
+See [the common ACL specifications](../common/acl).
+
+- `read`: allows a user to read all metadata, the ACL and all non-internal assets (their metadata and the asset files themselves).
+- `write`: allows a user to change any editable metadata, change the ACL, change anything about assets (delete, change, add). TODO: what about internal assets?
+
+TODO: specify how `listed` works.
diff --git a/docs/event/index.md b/docs/event/index.md
new file mode 100644
index 0000000..7c8fd30
--- /dev/null
+++ b/docs/event/index.md
@@ -0,0 +1,28 @@
+---
+sidebar_position: 4
+---
+
+# Event
+
+An event(1?) is the core entity of Opencast, representing a video content.
+An event consists of:
+- [Metadata](./metadata)
+- [ACL](./acl)
+- [Assets](./assets)
+
+As described [here](../common#data-storage), almost all of this data is stored in the DB.
+Only the actual asset files are stored on the file system (the metadata about assets is still stored in the DB).
+
+In terms of API response, it might look like this:
+
+
+---
+
+:::danger[Open questions]
+
+- (1?) Potentially very controversial: rename "event"/"episode" to "video"?
+  - Intuitively, most people call it "video"
+  - "Event" is a very generic term and can mean many other things, "episode" implies being part of a series.
+  - Yes, there can be two _video files_, but we already have a name for that: video stream. So I don't see a confusion risk here. I don't see any problems with calling a thing a video even if it contains two video streams.
+  - New name in API would make clear that the data model has changed.
+:::
diff --git a/docs/important-differences.md b/docs/important-differences.md
new file mode 100644
index 0000000..3dc2047
--- /dev/null
+++ b/docs/important-differences.md
@@ -0,0 +1,79 @@
+---
+sidebar_position: 2
+---
+
+# Important differences from the current model
+
+This page lists a number of major ways in which this specification differs from the Opencast status quo.
+
+
+## No snapshot system anymore
+
+The old system of creating snapshots and using hardlinks on the file system is no more.
+Whether and how we want to version parts of an entity's data is still an open question (see [Open Questions](./open-questions)).
+
+
+## No publications
+
+There is no "engage", "external API", OAI-PMH or any other internal _publication_ anymore.
+There might still be a place for external publications in the sense of interacting with another system like YouTube.
+These would require some async data synchronization and stuff.
+But hardly anyone is using that, so while reading this specification just think: there are no publications at all.
+The term does not exist anymore.
+
+Instead, the DB, file system and all APIs have the same view of the world.
+If an event with title "Banana" exists in Opencast, then it exists _everywhere_, i.e. in the DB, on the file system, and in all APIs¹.
+
+This also includes modifications and deletions.
+There is no staging area for changes anymore: all metadata and ACL changes to Opencast entities (event, series, ...) are instantly reflected in all APIs¹.
+Changing metadata and ACLs does not require running a workflow anymore.
+APIs for modifying this data promise that once they return 2xx, the change has been finalized to the database (the single source of truth).
+
+A small number of Opencast users might like the two-stage metadata changing.
+_If_ it is really desired, this "feature" can be implemented on top of the core Opencast, e.g. in the Admin UI (but disabled by default).
+
+(¹) A small delay to update the search index is fine.
+
+### Long running operations
+
+Of course, there are some modifications or operations that cannot be done immediately, e.g. encoding a video or generating subtitles.
+APIs starting these operations are _async_, i.e. they return 2xx to just state the operation has been started, but don't wait for the operation to finish.
+But even with these operations, there is still only one view of the world.
+For example, say a subtitle generation for an event was started: until the moment that operation finishes, the event has its previous subtitles (e.g. none) and that's reflected in all APIs.
+
+An event is visible in APIs immediately after ingesting.
+Of course, while the video is not encoded yet, there are no URLs to video tracks yet.
+The API should represent that fact in a way that makes it easy for external apps to check if a video is still processing.
+
+Sometimes, long running operations need to be run on metadata changes, e.g. to generate thumbnails with metadata in them (aside: this is usually not a great idea).
+This can still be done, with the difference that the DB/API immediately reflects the changed metadata, while the thumbnail needs to catch up.
+Again: the DB is the single source of truth.
+Everything derived from it (e.g. search index, thumbnails, ...) needs to catch up.
+
+As an aside, we should treat fewer operations as "long running" and thus offer synchronous APIs for them.
+Cutting subtitles, generating thumbnails in different sizes, and more are things that can be easily done in tens of milliseconds.
+
+## Storage format & API format
+
+### Independence
+
+How Opencast stores data should be independent of how Opencast exposes data in its API.
+Just because the API format is JSON, does not mean that Opencast should store everything as JSON in the DB or on the file system.
+
+Further, the structure of classes in Opencast code or the format in the search service should also not leak into the API.
+The structure of the API response should be selected purely based on good API design and not on internals.
+Not leaking internals makes it easier to change these internals without breaking the API.
+(The rewrite of the search service from Solr to ElasticSearch demonstrates how badly this can fail: the very widely used search API changed a lot.)
+
+The implementation should do everything to ensure this separation.
+For example, by having separate `record` definitions which are *only* used for API serialization.
+This also makes it a lot harder to accidentally change the API.
+
+### Unified response for all entities
+
+An event in the API should always be represented with the same JSON response, regardless of whether it was fetched by ID, or returned from a full text search, or as the entry of a series.
+Previously, this differed depending on whether it was loaded from the search index or the database or elsewhere.
+
+Ideally, there shouldn't be a separate `search` endpoint anyway, but rather have the search feature be part of the external event API.
+As an API user, I don't care what indices or data structures Opencast uses to give me the data.
+And now that we use ElasticSearch/OpenSearch, there is no reason why any node should be unable to perform that search.
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..64fa049
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,49 @@
+---
+sidebar_position: 1
+title: Introduction
+---
+
+# Opencast Data Model
+
+This document specifies the _future_ data model of Opencast.
+The data model describes everything that is stored, what types and requirements certain data has, how it is represented in the API, how data can be changed, and more.
+
+:::warning
+This specification does *not* describe the current state of Opencast!
+Also, it is a work in progress and is currently being developed and discussed in the community.
+:::
+
+Readers familiar with Opencast should ignore their prior knowledge while reading this, and treat this as a specification for a completely new software.
+Do not read any existing OC behavior into this specification unless it is explicitly mentioned.
+Also read the special [Important Differences](./important-differences) page, which explains where this data model differs in significant ways from the current Opencast.
+
+
+## Goals
+
+There are multiple reasons we are proposing this new data model:
+- Improve robustness of Opencast by having a stricter and well defined data model. Be clear about what is allowed and what isn't, and catch invalid data as early as possible.
+- Simplify development of external applications: currently, the API responses are grossly underspecified and it is unclear what properties apps can expect from Opencast (e.g. do I need to deal with duration = -1?).
+- Improve robustness by clearly specifying the source of truth for data and reducing the number of places/APIs that store/return data.
+- Enable immediate modification of metadata (e.g. changing a video's title) without running a workflow.
+- Improve performance by changing how data is stored.
+
+The goal behind this very specification is to allow for easy discussion in the community, and eventually to have a written specification.
+
+This specification is written mainly as if it were talking to API users, i.e. developers of external apps who want to integrate with Opencast.
+I think this is a useful choice to define the "public interface" of Opencast.
+The document does contain quite a few implementation notes, too, which just define how things should be handled inside Opencast.
+
+## Contributing to this specification
+
+Discussing every single detail in the community beforehand is not viable and not necessary.
+Instead, the idea is that there is one main person working on this spec, writing most of the text, therefore proposing parts of the model.
+These proposals are discussed in regular meetings and on GitHub.
+See [the `opencast/data-model`](https://github.com/opencast/data-model) repository, and in particular the pull requests and discussions tabs.
+
+## Backwards compatibility and breaking changes
+
+It is very clear that we need to be able to migrate existing data to the new model.
+We also don't want to change every single piece without good reason, in order to keep the overall change manageable.
+The new model was designed with that in mind.
+That said, this document (especially its initial version) does contain incompatibilities and breaking changes, and does not yet consider every single use case.
+I expect these use cases to be discussed during the community review of this.
diff --git a/docs/open-questions.md b/docs/open-questions.md
new file mode 100644
index 0000000..ad4c347
--- /dev/null
+++ b/docs/open-questions.md
@@ -0,0 +1,19 @@
+# Open questions
+
+- Should all data be versioned?
+  - It adds complexity, but having access to old data is nice.
+  - Storage wise, keeping old metadata does not cost much.
+  - Via the `internal` asset system, we can already kind of version assets.
+  - Get rid of the current asset manager/snapshot system to avoid hardlinks.
+- Use more compact ID encoding by default? Hex encoded UUID4 is wasteful and makes for long URLs. (36 -> 21 chars... maybe not worth it)
+- What do we generally think about size limitations for various fields?
+  - Abuse protection: this is just to prevent abuse, DOS, slow downs and stuff like that. Limit `description` to 2^16 bytes, limit `title`, `license`, ... to 1024 bytes. I think these limits make sense and should prevent OC suffering from bad payloads.
+  - Semantic limits: for example, for `license`, we could say "it should just be an identifier for a license, so limit to 64 bytes". This is a lot more tricky as one has to really think of the intended use case and runs the risk of making use cases impossible.
+
+
+## TODO
+
+- Metadata can be changed when a workflow is running or an event is scheduled
+  - Mhhh small problem: some workflows might depend on metadata, e.g. when creating images with metadata in them. So maybe workflows can declare dependencies on metadata?
+  - So maybe we cannot do this now, this feature we can still add in a second step. When we rework the workflow system 😈
+- Explain how snapshots are removed

From 69698943283c2cf909d1c13393ae892f4b2ce4f3 Mon Sep 17 00:00:00 2001
From: Lukas Kalbertodt
Date: Thu, 22 May 2025 14:05:41 +0200
Subject: [PATCH 2/6] Add note about exchange format in APIs

---
 docs/common/index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/common/index.md b/docs/common/index.md
index ff7d50f..cc7a634 100644
--- a/docs/common/index.md
+++ b/docs/common/index.md
@@ -15,6 +15,7 @@ Only a handful of files are stored on the file system:
 
 :::info[Differences from current OC model]
 In particular, textual metadata, ACLs, cutting information or anything like that is _not_ stored on the file system!
+(Some APIs might still accept or produce this information in non-JSON exchange formats.)
 :::
 
 The database never references files by absolute path or URL.

From d2a6f6d808638bc9f5d086d95221e5cccdbc09b9 Mon Sep 17 00:00:00 2001
From: Lukas Kalbertodt
Date: Thu, 22 May 2025 14:06:02 +0200
Subject: [PATCH 3/6] Rewrite section "well defined API responses" to be more clear

---
 docs/common/index.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/common/index.md b/docs/common/index.md
index cc7a634..d2c2c0f 100644
--- a/docs/common/index.md
+++ b/docs/common/index.md
@@ -50,9 +50,8 @@ This data model promises certain properties about certain fields/data, for examp
 
 ## Well defined API response
 
-While not technically part of the data model, the possible responses of APIs should be well defined and documented.
-The documentation should automatically be derived from the code in order to keep it up to date (which otherwise will absolutely fail).
-The implementation details need to be figured out, but the idea is that the same "code" (e.g. a Java `record` definition with attributes) that leads to the serialized API response is also used as source for the documentation.
-
-Users of the API should never need to look at an actual response to know what to expect.
-An actual response is always something *specific* and does not communicate what fields are optional and what possible values to expect for each field.
+Opencast's API should have a well defined/typed response that is derived from code in a [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) fashion.
+Specifically, the API documentation for e.g. `GET /event/{id}` needs to specify what kind of JSON object will be returned by the API.
+This could be done via a [JSON Schema](https://json-schema.org/) or via GraphQL or other means.
+Someone interested in using the API should know _exactly_ what response to expect, without sending a single test request to the API.
+It is important that the response specification is generated from the same code that is used for the actual API response serialization, to ensure they are always in sync.

From 240e6510c271bb1465937b42f14deebcdf6f245c Mon Sep 17 00:00:00 2001
From: Lukas Kalbertodt
Date: Thu, 22 May 2025 14:11:47 +0200
Subject: [PATCH 4/6] Change description of "event" from "video" to "multimedia"

---
 docs/event/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/event/index.md b/docs/event/index.md
index 7c8fd30..5396307 100644
--- a/docs/event/index.md
+++ b/docs/event/index.md
@@ -4,7 +4,7 @@ sidebar_position: 4
 
 # Event
 
-An event(1?) is the core entity of Opencast, representing a video content.
+An event(1?) is the core entity of Opencast, representing a multimedia content.
 An event consists of:
 - [Metadata](./metadata)
 - [ACL](./acl)

From 1f8d24516b3eb7a684f2dd744eea61056ec2a92e Mon Sep 17 00:00:00 2001
From: Lukas Kalbertodt
Date: Thu, 22 May 2025 14:49:00 +0200
Subject: [PATCH 5/6] Remove unfinished API response example

Adding this right now might be more confusing than helpful. I will add
more examples in the future.

---
 docs/event/index.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/docs/event/index.md b/docs/event/index.md
index 5396307..a354b7d 100644
--- a/docs/event/index.md
+++ b/docs/event/index.md
@@ -13,8 +13,6 @@ An event consists of:
 As described [here](../common#data-storage), almost all of this data is stored in the DB.
 Only the actual asset files are stored on the file system (the metadata about assets is still stored in the DB).
 
-In terms of API response, it might look like this:
-
 
 ---
 

From e23974f346e165ad7fd727f150439fa6740b205b Mon Sep 17 00:00:00 2001
From: Lukas Kalbertodt
Date: Thu, 22 May 2025 14:49:29 +0200
Subject: [PATCH 6/6] Remove open question about new ID generation

We already have more than enough to figure out, let's just not do that.

---
 docs/open-questions.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/open-questions.md b/docs/open-questions.md
index ad4c347..7c83123 100644
--- a/docs/open-questions.md
+++ b/docs/open-questions.md
@@ -5,7 +5,6 @@
   - Storage wise, keeping old metadata does not cost much.
   - Via the `internal` asset system, we can already kind of version assets.
   - Get rid of the current asset manager/snapshot system to avoid hardlinks.
-- Use more compact ID encoding by default? Hex encoded UUID4 is wasteful and makes for long URLs. (36 -> 21 chars... maybe not worth it)
 - What do we generally think about size limitations for various fields?
   - Abuse protection: this is just to prevent abuse, DOS, slow downs and stuff like that. Limit `description` to 2^16 bytes, limit `title`, `license`, ... to 1024 bytes.
I think these limits make sense and should prevent OC suffering from bad payloads.
   - Semantic limits: for example, for `license`, we could say "it should just be an identifier for a license, so limit to 64 bytes". This is a lot more tricky as one has to really think of the intended use case and runs the risk of making use cases impossible.