diff --git a/docs/common/acl.md b/docs/common/acl.md new file mode 100644 index 0000000..cb61343 --- /dev/null +++ b/docs/common/acl.md @@ -0,0 +1,26 @@ +--- +sidebar_position: 1 +--- + +# ACL (access control list) + +ACLs control access to Opencast entities. +An ACL is simply a list of `role` + `action` pairs. +An entry gives users that have that particular `role` the permission to perform the specified `action` on the entity. +Both `role` and `action` have the [type `Label`](./types).(1?) + +There are two special actions recognized by Opencast. +Other actions can be used for custom purposes by external applications. +- `read`: generally, gives read access to an entity +- `write`: generally, gives write access to an entity (changing or deleting it) + +*Impl note*: `read` and `write` roles should likely be stored in a way that allows for fast filtering, e.g. in a `read_roles` DB column that has a DB index. + + +--- + +:::danger[Open questions] + +- (1?) Is it fine to restrict roles and actions like that? Or can we restrict them even more? + +::: diff --git a/docs/common/index.md b/docs/common/index.md new file mode 100644 index 0000000..d2c2c0f --- /dev/null +++ b/docs/common/index.md @@ -0,0 +1,57 @@ +--- +sidebar_position: 3 +--- + +# Common specifications + +## Data storage + +The single source of truth for everything is the database (DB) plus files on the file system¹ referenced by the DB. +Every piece of information is only stored in one place in the DB. + +Only a handful of files are stored on the file system: +- Binary and/or large files like video, audio, images, ... +- Files that need to be delivered in a specific format anyway (VTT subtitles, ...) + +:::info[Differences from current OC model] +In particular, textual metadata, ACLs, cutting information or anything like that is _not_ stored on the file system! +(Some APIs might still accept or produce this information in non-JSON exchange formats.) 
+::: + +The database never references files by absolute path or URL. +At most, it stores a path relative to the configured `storage.dir`, but potentially in an even more implicit way. + + +(¹) File system = local file system, or NFS, or S3 storage or potentially others. + +### Derived data storage + +For different purposes, it might be useful to store the same data again in a different form. +For example, using a search index for full text search. +(Note however: whenever possible and useful, use DB indices built into the DBMS.) + +These derived data sources can be slightly behind the DB (e.g. due to indexing times), which is acceptable. +However, it is crucially important that data only ever flows from the DB into other data stores, _never_ the other way around. +Deleting all derived data stores must never result in data loss as they can always be regenerated from the DB. +Rebuilding derived data stores must always produce the same result, regardless of what the derived store previously contained. +Opencast should do its best to keep the derived data stores in sync in a timely manner. + + +## Promised properties + +This data model promises certain properties about certain fields/data, for example: "there is a non-empty title", "this is an array of strings" or "the duration matches the duration of all tracks". + +- It's Opencast's responsibility to ensure these properties. Whenever an entity is added or changed, these properties need to be maintained, usually by rejecting the change request (e.g. 4xx response in API). +- If an entity does not have these properties, this should be considered a bug in Opencast and should be fixed ASAP. + - We should never find ourselves in a situation where external apps (e.g. LMS plugins) need to work around a broken property of Opencast. +- The same goes for legacy events, which might be broken in the new model. They cannot be kept as is; they need to be changed/migrated to exhibit these properties. 
+- The implementation should try, wherever possible, to make broken events impossible to represent. As a simple example, the title field in the DB should be `non null`. + + +## Well defined API response + +Opencast's API should have a well defined/typed response that is derived from code in a [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) fashion. +Specifically, the API documentation for e.g. `GET /event/{id}` needs to specify what kind of JSON object will be returned by the API. +This could be done via a [JSON Schema](https://json-schema.org/) or via GraphQL or other means. +Someone interested in using the API should know _exactly_ what response to expect, without sending a single test request to the API. +It is important that the response specification is generated from the same code that is used for the actual API response serialization, to ensure they are always in sync. diff --git a/docs/common/types.md b/docs/common/types.md new file mode 100644 index 0000000..58cd367 --- /dev/null +++ b/docs/common/types.md @@ -0,0 +1,53 @@ +--- +sidebar_position: 2 +--- + +# Common types + +These are types used throughout the rest of this specification and defined here once to avoid repetition. + +- `string`: a valid UTF-8 string. While being processed in code, it might be in a different encoding temporarily, but in the public interface of Opencast, these are always valid UTF-8. +- `NonBlankString`: A string that is not "blank", meaning it is not empty and does not consist only of [Unicode `White_Space`](https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt). +- `NonBlankAsciiString`: A `NonBlankString` that is also restricted to only using ASCII characters. +- `Label`: a `NonBlankAsciiString` that only consists of letters, numbers or `-._~!*:@,;`. This means a label is URL-safe except for use in the domain part.(2?) +- `ID`: a `Label` that cannot be changed after being created. 
+- `Username`: TODO define rules for usernames +- `LangCode`: specifies a language and optionally a region, e.g. `en` or `en-US`. Based on the [IETF BCP 47 language tag specification](https://www.rfc-editor.org/info/rfc5646): a two letter language code, optionally followed by a hyphen and a two letter region tag. +- `int8`, `int16`, `int32`, `int64`: signed integers of specific bit width. +- `uint8`, `uint16`, `uint32`, `uint64`: unsigned integers of specific bit width.(1?) +- `Milliseconds`: a `uint64` representing a duration or a video timestamp in milliseconds (ms). Impl note: whenever possible, in code, this should be a custom type and not just `int`. +- `DateTime`: a date + time with timezone, i.e. a specific moment in a specific timezone. +- `Timestamp`: a specific moment in time, without time zone (e.g. always stored as UTC). + +The notation generally follows TypeScript syntax: + +- `T?`: denotes an optional type, i.e. `bool?` means the field could be either `true`, `false` or undefined. All fields without `?` are _required_ / `non null`. +- `T[]`: array of type `T`. +- `[T, U, ...]`: a tuple of values. +- `"foo" | "bar"`: one of the listed constant values. + +## JSON serialization + +For most types, the JSON serialization is the obvious one, but there are some minor but important details. +- `bool` as `bool` +- `string` and all "string with extra requirements" (e.g. `Label`, `ID`, `NonBlankAsciiString`) as string +- Integers as number. + - Note on 64 bit integers: In JavaScript, there is only one `number` type, which is a 64 bit floating point number (`double`, `f64`). + Those can only exactly represent integers up to 2^53. + While JSON is closely related to JS, the format itself is allowed to exceed `f64` precision and may in fact encode arbitrary precision numbers. + Opencast should serialize a 64 bit integer as an exact integer into JSON and *not* rounded like an `f64`. + Rounding might happen in the frontend, but the API should emit the exact integer value. 
+- Arrays as arrays +- Tuples as arrays +- `Map` is serialized as object +- `DateTime`: as ISO 8601-compatible formatted string. The ISO standard actually allows a number of different formats by omitting parts of the string. Opencast shall format all date times as either `YYYY-MM-DDTHH:mm:ss.sssZ` or `YYYY-MM-DDTHH:mm:ssZ`, i.e. only the sub-second part is optional. The parts of this format string are best described in [the ECMAScript specification](https://tc39.es/ecma262/multipage/numbers-and-dates.html#sec-date-time-string-format) (which, again, is a subset of ISO 8601). Only thing of note: `Z` could either be literal `Z` or a timezone offset like `+02:00`. +- `Timestamp`: like `DateTime` but always in UTC, so always ending with literal `Z`. + +--- + +:::danger[Open questions] + +- (1?) Java famously has no/bad support for unsigned integers. Decide how to deal with that: do we just give up one bit or do we require proper unsigned usage via `Integer.*Unsigned` methods? Either way: these values must never be negative! +- (2?) Maybe disallow more of these special characters? + +::: diff --git a/docs/event/acl.md b/docs/event/acl.md new file mode 100644 index 0000000..2800609 --- /dev/null +++ b/docs/event/acl.md @@ -0,0 +1,12 @@ +--- +sidebar_position: 3 +--- + +# ACL + +See [the common ACL specifications](../common/acl). + +- `read`: allows a user to read all metadata, the ACL and all non-internal assets (their metadata and the asset files themselves). +- `write`: allows a user to change any editable metadata, change the ACL, change anything about assets (delete, change, add). TODO: what about internal assets? + +TODO: specify how `listed` works. diff --git a/docs/event/index.md b/docs/event/index.md new file mode 100644 index 0000000..a354b7d --- /dev/null +++ b/docs/event/index.md @@ -0,0 +1,26 @@ +--- +sidebar_position: 4 +--- + +# Event + +An event(1?) is the core entity of Opencast, representing a piece of multimedia content. 
+An event consists of: +- [Metadata](./metadata) +- [ACL](./acl) +- [Assets](./assets) + +As described [here](../common#data-storage), almost all of this data is stored in the DB. +Only the actual asset files are stored on the file system (the metadata about assets is still stored in the DB). + + +--- + +:::danger[Open questions] + +- (1?) Potentially very controversial: rename "event"/"episode" to "video"? + - Intuitively, most people call it "video" + - "Event" is a very generic term and can mean many other things, "episode" implies being part of a series. + - Yes, there can be two _video files_, but we already have a name for that: video stream. So I don't see a confusion risk here. I don't see any problems with calling a thing a video even if it contains two video streams. + - New name in API would make clear that data model has changed. +::: diff --git a/docs/important-differences.md b/docs/important-differences.md new file mode 100644 index 0000000..3dc2047 --- /dev/null +++ b/docs/important-differences.md @@ -0,0 +1,79 @@ +--- +sidebar_position: 2 +--- + +# Important differences from the current model + +This page describes a number of major ways in which this specification differs from the Opencast status quo. + + +## No snapshot system anymore + +The old system of creating snapshots and using hardlinks on the file system is no more. +Whether and how we want to version parts of an entity's data is still an open question (see [Open Questions](./open-questions)). + + +## No publications + +There is no "engage", "external API", OAI-PMH or any other internal _publication_ anymore. +There might still be a place for external publications in the sense of interacting with another system like YouTube. +These would require some kind of async data synchronization. +But hardly anyone is using that, so while reading this specification just think: there are no publications at all. +The term does not exist anymore. 
+ +Instead, the DB, file system and all APIs have the same view of the world. +If an event with title "Banana" exists in Opencast, then it exists _everywhere_, i.e. in the DB, on the file system, and in all APIs¹. + +This also includes modifications and deletions. +There is no staging area for changes anymore: all metadata and ACL changes to Opencast entities (event, series, ...) are instantly reflected in all APIs¹. +Changing metadata and ACLs does not require running a workflow anymore. +APIs for modifying this data promise that once they return 2xx, the change has been committed to the database (the single source of truth). + +A small number of Opencast users might like the two-stage metadata changing. +_If_ it is really desired, this "feature" can be implemented on top of the core Opencast, e.g. in the Admin UI (but disabled by default). + +(¹) A small delay to update the search index is fine. + +### Long running operations + +Of course, there are some modifications or operations that cannot be done immediately, e.g. encoding a video or generating subtitles. +APIs starting these operations are _async_, i.e. they return 2xx merely to indicate that the operation has been started, but don't wait for the operation to finish. +But even with these operations, there is still only one view of the world. +For example, say a subtitle generation for an event was started: until the moment that operation finishes, the event has its previous subtitles (e.g. none) and that's reflected in all APIs. + +An event is visible in APIs immediately after ingesting. +Of course, while the video is not encoded yet, there are no URLs to video tracks yet. +The API should represent that fact in a way that makes it easy for external apps to check if a video is still processing. + +Sometimes, long running operations need to be run on metadata changes, e.g. to generate thumbnails with metadata in them (aside: this is usually not a great idea). 
+This can still be done, with the difference that the DB/API immediately reflects the changed metadata, while the thumbnail needs to catch up. +Again: the DB is the single source of truth. +Everything derived from it (e.g. search index, thumbnails, ...) needs to catch up. + +As an aside, we should treat fewer operations as "long running" and thus offer synchronous APIs for them. +Cutting subtitles, generating thumbnails in different sizes, and more are things that can be easily done in tens of milliseconds. + +## Storage format & API format + +### Independence + +How Opencast stores data should be independent of how Opencast exposes data in its API. +Just because the API format is JSON does not mean that Opencast should store everything as JSON in the DB or on the file system. + +Further, the structure of classes in Opencast code or the format in the search service should also not leak into the API. +The structure of the API response should be selected purely based on good API design and not on internals. +Not leaking internals makes it easier to change these internals without breaking the API. +(The rewrite of the search service from Solr to ElasticSearch demonstrates how badly this can fail: the very widely used search API changed a lot.) + +The implementation should do everything to ensure this separation. +For example, by having separate `record` definitions which are *only* used for API serialization. +This also makes it a lot harder to accidentally change the API. + +### Unified response for all entities + +An event in the API should always be represented with the same JSON response, regardless of whether it was fetched by ID, or returned from a full text search, or as an entry of a series. +Previously, this differed depending on whether it was loaded from the search index or the database or elsewhere. + +Ideally, there shouldn't be a separate `search` endpoint anyway, but rather have the search feature be part of the external event API. 
+As an API user, I don't care what indices or data structures Opencast uses to give me the data. +And now that we use ElasticSearch/OpenSearch, there is no reason why any node couldn't perform that search. diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..64fa049 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,49 @@ +--- +sidebar_position: 1 +title: Introduction +--- + +# Opencast Data Model + +This document specifies the _future_ data model of Opencast. +The data model describes everything that is stored, what types and requirements certain data has, how it is represented in the API, how data can be changed, and more. + +:::warning +This specification does *not* describe the current state of Opencast! +Also, it is a work in progress and is currently being developed and discussed in the community. +::: + +Readers familiar with Opencast should ignore their prior knowledge while reading this, and treat this as a specification for a completely new piece of software. +Do not read any existing OC behavior into this specification if it isn't explicitly mentioned. +Also read the special [Important Differences](./important-differences) page, which explains where this data model differs in significant ways from the current Opencast. + + +## Goals + +There are multiple reasons we are proposing this new data model: +- Improve robustness of Opencast by having a stricter and well defined data model. Be clear about what is allowed and what isn't, and catch invalid data as early as possible. +- Simplify development of external applications: currently, the API responses are grossly underspecified and it is unclear what properties apps can expect from Opencast (e.g. do I need to deal with duration = -1?). +- Improve robustness by clearly specifying the source of truth for data and reducing the number of places/APIs that store/return data. +- Enable immediate modification of metadata (e.g. changing a video's title) without running a workflow. 
+- Improve performance by changing how data is stored. + +The goal behind this very specification is to allow for easy discussion in the community, and eventually to have a written specification. + +This specification is written mainly as if it was talking to API users, i.e. developers of external apps who want to integrate with Opencast. +I think this is a useful choice to define the "public interface" of Opencast. +The document does contain quite a few implementation notes, too, which just define how things should be handled inside Opencast. + +## Contributing to this specification + +Discussing every single detail in the community beforehand is not viable and not necessary. +Instead, the idea is that there is one main person working on this spec, writing most of the text, therefore proposing parts of the model. +These proposals are discussed in regular meetings and on GitHub. +See [the `opencast/data-model`](https://github.com/opencast/data-model) repository, and in particular the pull requests and discussions tabs. + +## Backwards compatibility and breaking changes + +It is very clear that we need to be able to migrate existing data to the new model. +We also don't want to change every single piece without good reason, in order to keep the overall change manageable. +The new model was designed with that in mind. +That said, this document (especially its initial version) does contain incompatibilities and breaking changes, and does not yet consider every single use case. +I expect these use cases to be discussed during the community review of this. diff --git a/docs/open-questions.md b/docs/open-questions.md new file mode 100644 index 0000000..7c83123 --- /dev/null +++ b/docs/open-questions.md @@ -0,0 +1,18 @@ +# Open questions + +- Should all data be versioned? + - It adds complexity, but having access to old data is nice. + - Storage-wise, keeping old metadata does not cost much. + - Via the `internal` asset system, we can already kind of version assets. 
+ - Get rid of the current asset manager/snapshot system to avoid hardlinks. +- What do we generally think about size limitations for various fields? + - Abuse protection: this is just to prevent abuse, DoS, slowdowns and the like. Limit `description` to 2^16 bytes, limit `title`, `license`, ... to 1024 bytes. I think these limits make sense and should prevent OC suffering from bad payloads. + - Semantic limits: for example, for `license`, we could say "it should just be an identifier for a license, so limit to 64 bytes". This is a lot more tricky as one has to really think of the intended use case and runs the risk of making use cases impossible. + + +## TODO + +- Metadata can be changed when a workflow is running or an event is scheduled + - Mhhh small problem: some workflows might depend on metadata, e.g. when creating images with metadata in them. So maybe workflows can declare dependencies on metadata? + - So maybe we cannot do this now; we can still add this feature in a second step, when we rework the workflow system 😈 +- Explain how snapshots are removed