diff --git a/docs/lakehouse/catalogs/maxcompute-catalog.md b/docs/lakehouse/catalogs/maxcompute-catalog.md index ae73869d6fd49..e611f96f12f0e 100644 --- a/docs/lakehouse/catalogs/maxcompute-catalog.md +++ b/docs/lakehouse/catalogs/maxcompute-catalog.md @@ -11,9 +11,9 @@ ## Applicable Scenarios | Scenario | Description | -| ---- | ------------------------------------------------------ | +| ---- | ---- | | Data Integration | Read MaxCompute data and write to Doris internal tables. | -| Data Write-back | Using INSERT command to write data into MaxCompute Table. (Supported since version 4.1.0) | +| Data Write-back | Using INSERT command to write data into MaxCompute tables. (Supported since version 4.1.0) | ## Usage Notes @@ -30,52 +30,82 @@ ```sql CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'type' = 'max_compute', + {McAuthProperties}, {McRequiredProperties}, {McOptionalProperties}, {CommonProperties} ); ``` +* `{McAuthProperties}` + + These properties control how Doris authenticates to MaxCompute for both query and write operations. + + > Version note: `mc.auth.type`, `mc.ram_role_arn`, and `mc.ecs_ram_role` are supported since **4.0.4**. + + | Property Name | Default Value | Description | Required | Supported Doris Version | + | --- | --- | --- | --- | --- | + | `mc.auth.type` | `ak_sk` | Authentication type. Supported values: `ak_sk`, `ram_role_arn`, and `ecs_ram_role`. | No | 4.0.4 (inclusive) and later | + | `mc.access_key` | None | Alibaba Cloud AccessKey. | Required when `mc.auth.type` is `ak_sk` (default) or `ram_role_arn`. | | + | `mc.secret_key` | None | Alibaba Cloud SecretKey. | Required when `mc.auth.type` is `ak_sk` (default) or `ram_role_arn`. | | + | `mc.ram_role_arn` | None | Alibaba Cloud RAM Role ARN. | Required when `mc.auth.type` is `ram_role_arn`. | 4.0.4 (inclusive) and later | + | `mc.ecs_ram_role` | None | Alibaba Cloud ECS RAM Role name attached to the instance. | Required when `mc.auth.type` is `ecs_ram_role`. 
| 4.0.4 (inclusive) and later | + + Supported values for `mc.auth.type`: + + | Value | Description | + | --- | --- | + | `ak_sk` | Use Alibaba Cloud AccessKey and SecretKey directly. | + | `ram_role_arn` | Use `mc.access_key` and `mc.secret_key` as source credentials to call STS `AssumeRole`, then access MaxCompute with the returned temporary credentials. | + | `ecs_ram_role` | Obtain temporary credentials from the ECS metadata service. Ensure the Doris FE and BE nodes that access MaxCompute can use the role specified by `mc.ecs_ram_role`. | + + Validation rules: + + 1. If `mc.auth.type` is omitted, Doris uses `ak_sk`. + 2. When `mc.auth.type` is `ram_role_arn`, you must configure `mc.access_key`, `mc.secret_key`, and `mc.ram_role_arn`. + 3. When `mc.auth.type` is `ecs_ram_role`, you must configure `mc.ecs_ram_role`. + 4. When `mc.access_key` and `mc.secret_key` are used, they must be configured together. + + For SQL examples of different authentication types, see [Basic Example](#basic-example). + * `{McRequiredProperties}` - | Property Name | Description | Supported Doris Version | - | ------------------ | ------------------------------------------------------------------------------------------------------------------ | ------------ | - | `mc.default.project` | The name of the MaxCompute project to access. You can create and manage projects in the [MaxCompute Project List](https://maxcompute.console.aliyun.com/cn-beijing/project-list). | | - | `mc.access_key` | AccessKey. You can create and manage it in the [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak). | | - | `mc.secret_key` | SecretKey. You can create and manage it in the [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak). | | - | `mc.region` | The region where MaxCompute is activated. You can find the corresponding Region from the Endpoint. | Before 2.1.7 (exclusive) | - | `mc.endpoint` | The region where MaxCompute is activated. 
Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | + | Property Name | Description | Supported Doris Version | + | --- | --- | --- | + | `mc.default.project` | The name of the MaxCompute project to access. You can create and manage projects in the [MaxCompute Project List](https://maxcompute.console.aliyun.com/cn-beijing/project-list). | | + | `mc.region` | The region where MaxCompute is activated. You can find the corresponding Region from the Endpoint. | Before 2.1.7 (exclusive) | + | `mc.endpoint` | The region where MaxCompute is activated. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | * `{McOptionalProperties}` - | Property Name | Default Value | Description | Supported Doris Version | - | -------------------------- | ------------- | -------------------------------------------------------------------------- | ------------ | - | `mc.tunnel_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | - | `mc.odps_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | - | `mc.quota` | `pay-as-you-go` | Quota name. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | - | `mc.split_strategy` | `byte_size` | Sets the split partitioning method. Can be set to partition by byte size `byte_size` or by row count `row_count`. | 2.1.7 (inclusive) and later | - | `mc.split_byte_size` | `268435456` | The file size each split reads, in bytes. Default is 256MB. Only effective when `"mc.split_strategy" = "byte_size"`. | 2.1.7 (inclusive) and later | - | `mc.split_row_count` | `1048576` | Number of rows each split reads. Only effective when `"mc.split_strategy" = "row_count"`. 
| 2.1.7 (inclusive) and later | - | `mc.split_cross_partition` | `false` | Whether the generated splits cross partitions. | 2.1.8 (inclusive) and later | - | `mc.connect_timeout` | `10s` | Connection timeout for MaxCompute. | 2.1.8 (inclusive) and later | - | `mc.read_timeout` | `120s` | Read timeout for MaxCompute. | 2.1.8 (inclusive) and later | - | `mc.retry_count` | `4` | Number of retries after timeout. | 2.1.8 (inclusive) and later | - | `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. Doris loses precision (9 -> 6) when syncing these two types. Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later | - | `mc.account_format` | `name` | The account systems of Alibaba Cloud International and China sites are inconsistent. For international site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later | - | `mc.enable.namespace.schema` | `false` | Whether to support MaxCompute schema hierarchy. See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later | - | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later | + | Property Name | Default Value | Description | Supported Doris Version | + | --- | --- | --- | --- | + | `mc.tunnel_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | + | `mc.odps_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | + | `mc.quota` | `pay-as-you-go` | Quota name. 
Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | + | `mc.split_strategy` | `byte_size` | Sets the Split partitioning method. Can be set to partition by byte size `byte_size` or by row count `row_count`. | 2.1.7 (inclusive) and later | + | `mc.split_byte_size` | `268435456` | The file size each Split reads, in bytes. Default is 256 MB. Only effective when `"mc.split_strategy" = "byte_size"`. | 2.1.7 (inclusive) and later | + | `mc.split_row_count` | `1048576` | Number of rows each Split reads. Only effective when `"mc.split_strategy" = "row_count"`. | 2.1.7 (inclusive) and later | + | `mc.split_cross_partition` | `false` | Whether the generated Splits cross partitions. | 2.1.8 (inclusive) and later | + | `mc.connect_timeout` | `10s` | Connection timeout for MaxCompute. | 2.1.8 (inclusive) and later | + | `mc.read_timeout` | `120s` | Read timeout for MaxCompute. | 2.1.8 (inclusive) and later | + | `mc.retry_count` | `4` | Number of retries after timeout. | 2.1.8 (inclusive) and later | + | `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. Doris loses precision (9 -> 6) when syncing these two types. Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later | + | `mc.account_format` | `name` | The account systems of Alibaba Cloud International and China sites are inconsistent. For international site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later | + | `mc.enable.namespace.schema` | `false` | Whether to support MaxCompute Schema hierarchy. 
See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later | + | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later | - - `mc.max_field_size_bytes` + - `mc.max_field_size_bytes` - MaxCompute allows a maximum of 8 MB per field by default. If the data being written contains large string or binary fields, the write may fail. + MaxCompute allows a maximum of 8 MB per field by default. If the data being written contains large string or binary fields, the write may fail. - To adjust this limit, first execute the following command in the MaxCompute console SQL editor: + To adjust this limit, first execute the following command in the MaxCompute console SQL editor: - `setproject odps.sql.cfile2.field.maxsize=262144;` + `setproject odps.sql.cfile2.field.maxsize=262144;` - This adjusts the maximum bytes for a single field. The unit is KB and the maximum value is 262144. + This adjusts the maximum bytes for a single field. The unit is KB and the maximum value is 262144. - Then set `mc.max_field_size_bytes` to 262144 in the Doris catalog properties (this value must not exceed the MaxCompute setting). + Then set `mc.max_field_size_bytes` to the corresponding byte value in the Doris Catalog properties (this value must not exceed the MaxCompute setting). * `{CommonProperties}` @@ -93,7 +123,7 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers ## Hierarchy Mapping -- When `mc.enable.namespace.schema` is false +- When `mc.enable.namespace.schema` is `false` | Doris | MaxCompute | | -------- | ---------- | @@ -101,7 +131,7 @@ Only the public cloud version of MaxCompute is supported. 
For private cloud vers | Database | Project | | Table | Table | -- When `mc.enable.namespace.schema` is true +- When `mc.enable.namespace.schema` is `true` | Doris | MaxCompute | | -------- | ---------- | @@ -121,7 +151,7 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers | bigint | bigint | | | float | float | | | double | double | | -| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= scale <= 18 | +| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= S <= 18 | | char(N) | char(N) | | | varchar(N) | varchar(N) | | | string | string | | @@ -136,17 +166,45 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers ## Basic Example +Default `ak_sk` authentication: + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ram_role_arn` authentication (4.0.4+): + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.auth.type' = 'ram_role_arn', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.ram_role_arn' = 'acs:ram:::role/', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ecs_ram_role` authentication (4.0.4+): + ```sql CREATE CATALOG mc_catalog PROPERTIES ( 'type' = 'max_compute', 'mc.default.project' = 'project', - 'mc.access_key' = 'sk', - 'mc.secret_key' = 'ak', - 'mc.endpoint' = 'http://service.cn-beijing-vpc.MaxCompute.aliyun-inc.com/api' + 'mc.auth.type' = 'ecs_ram_role', + 'mc.ecs_ram_role' = '', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' ); ``` -If using a version before 2.1.7 (exclusive), please use the following statement. 
(It is recommended to upgrade to 2.1.8 or later) +If using a version before 2.1.7 (exclusive), please use the following statement (it is recommended to upgrade to 2.1.8 or later): ```sql CREATE CATALOG mc_catalog PROPERTIES ( @@ -180,16 +238,16 @@ CREATE CATALOG mc_catalog PROPERTIES ( ### Basic Query ```sql --- 1. switch to catalog, use database and query +-- 1. Switch to Catalog, use database and query SWITCH mc_ctl; -USE mc_ctl; +USE mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 2. use mc database directly +-- 2. Use mc database directly USE mc_ctl.mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 3. use full qualified name to query +-- 3. Use full qualified name to query SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10; ``` @@ -276,7 +334,7 @@ Starting from version 4.1.0, Doris supports creating and dropping MaxCompute dat - Does not support creating clustered tables, transactional tables, Delta Tables, or external tables. ::: -> This feature is only available when the `mc.enable.namespace.schema` property is set to `true`. +This feature is only available when the `mc.enable.namespace.schema` property is set to `true`. ### Creating and Dropping Databases @@ -377,11 +435,11 @@ By default, MaxCompute Catalog generates endpoints based on `mc.region` and `mc. 
The generated format is as follows: | `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | -| ------------------- | -------------------------------------------------------- | ----------------------------------------------- | -| false | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | -| true | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | +| --- | --- | --- | +| `false` | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | +| `true` | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | -Users can also specify `mc.odps_endpoint` and `mc.tunnel_endpoint` individually to customize the service address, which is suitable for some privately deployed MaxCompute environments. +Users can also specify `mc.odps_endpoint` and `mc.tunnel_endpoint` individually to customize the service address, which is suitable for privately deployed MaxCompute environments. For configuring MaxCompute Endpoint and Tunnel Endpoint, please refer to [Endpoints for Different Regions and Network Connection Methods](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints). @@ -399,7 +457,7 @@ Note: - It is recommended to write to specified partitions whenever possible, e.g. `INSERT INTO mc_tbl PARTITION(ds='20250201')`. When no partition is specified, due to limitations of the MaxCompute Storage API, data for each partition must be written sequentially. As a result, the execution plan will sort data based on the partition columns, which can consume significant memory resources when the data volume is large and may cause the write to fail. -- When writing without specifying a partition, do not set `set enable_strict_consistency_dml=false`. 
This forcefully removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute. +- When writing without specifying a partition, do not set `SET enable_strict_consistency_dml = false`. This forcefully removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute. - Do not add a `LIMIT` clause. When a `LIMIT` clause is added, Doris will use only a single thread for writing to guarantee the write count. This can be used for small-scale testing, but if the `LIMIT` value is large, write performance will be poor. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/maxcompute-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/maxcompute-catalog.md index 0ba3b8d1161b4..d86e2b4c12be1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/maxcompute-catalog.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/maxcompute-catalog.md @@ -6,20 +6,20 @@ } --- -[MaxCompute](https://help.aliyun.com/zh/maxcompute/) 是阿里云上的企业级 SaaS(Software as a Service)模式云数据仓库。通过 MaxCompute 提供的开放存储 SDK,Doris 可以获取 MaxCompute 的表信息并进行查询和写入操作。 +[MaxCompute](https://help.aliyun.com/zh/maxcompute/) 是阿里云上的企业级 SaaS(Software as a Service)模式云数据仓库。通过 MaxCompute 提供的开放存储 SDK,Doris 可以获取 MaxCompute 的表信息,并进行查询和写入操作。 ## 适用场景 -| 场景 | 说明 | -| ---- | ------------------------------------------------------ | +| 场景 | 说明 | +| ---- | ---- | | 数据集成 | 读取 MaxCompute 数据并写入到 Doris 内表。 | -| 数据写回 | 通过 INSERT 命令将数据写入 MaxCompute 表。(4.1.0 版本支持) | +| 数据写回 | 通过 INSERT 命令将数据写入 MaxCompute 表。(4.1.0 版本支持) | ## 使用须知 -1. 自 2.1.7 版本开始,MaxCompute Catalog 基于 [开放存储 SDK](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 开发,在这之前,基于 Tunnel API 进行开发。 +1. 自 2.1.7 版本开始,MaxCompute Catalog 基于[开放存储 SDK](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 开发,在这之前,基于 Tunnel API 进行开发。 -2. 
开放存储 SDK 的使用有一定的限制,请参照该 [文档](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 中 `使用限制` 的章节。 +2. 开放存储 SDK 的使用有一定的限制,请参照该[文档](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1)中「使用限制」的章节。 3. 在 Doris 3.1.3 版本之前,MaxCompute 中的 Project 相当于 Doris 中的 Database。3.1.3 版本中,可以通过 `mc.enable.namespace.schema` 参数引入 MaxCompute 的 schema 层级。 @@ -30,52 +30,82 @@ ```sql CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'type' = 'max_compute', + {McAuthProperties}, {McRequiredProperties}, {McOptionalProperties}, {CommonProperties} ); ``` +* `{McAuthProperties}` + + 这些属性用于控制 Doris 通过 Catalog 访问 MaxCompute 时的认证方式,适用于查询和写入场景。 + + > 版本说明:`mc.auth.type`、`mc.ram_role_arn` 和 `mc.ecs_ram_role` 自 **4.0.4** 起支持。 + + | 属性名 | 默认值 | 说明 | 是否必填 | 支持的 Doris 版本 | + | --- | --- | --- | --- | --- | + | `mc.auth.type` | `ak_sk` | 认证类型。支持 `ak_sk`、`ram_role_arn` 和 `ecs_ram_role`。 | 否 | 4.0.4(含)之后 | + | `mc.access_key` | 无 | 阿里云 AccessKey。 | 当 `mc.auth.type` 为 `ak_sk`(默认)或 `ram_role_arn` 时必填。 | | + | `mc.secret_key` | 无 | 阿里云 SecretKey。 | 当 `mc.auth.type` 为 `ak_sk`(默认)或 `ram_role_arn` 时必填。 | | + | `mc.ram_role_arn` | 无 | 阿里云 RAM Role ARN。 | 当 `mc.auth.type` 为 `ram_role_arn` 时必填。 | 4.0.4(含)之后 | + | `mc.ecs_ram_role` | 无 | ECS 实例绑定的 RAM Role 名称。 | 当 `mc.auth.type` 为 `ecs_ram_role` 时必填。 | 4.0.4(含)之后 | + + `mc.auth.type` 可选值: + + | 取值 | 说明 | + | --- | --- | + | `ak_sk` | 使用阿里云 AccessKey / SecretKey 直接访问 MaxCompute。 | + | `ram_role_arn` | 使用 `mc.access_key` 和 `mc.secret_key` 作为源凭证调用 STS `AssumeRole`,再使用返回的临时凭证访问 MaxCompute。 | + | `ecs_ram_role` | 通过 ECS Metadata Service 获取临时凭证。请确保访问 MaxCompute 的 Doris FE 和 BE 节点都可以使用 `mc.ecs_ram_role` 指定的角色。 | + + 生效规则: + + 1. 未配置 `mc.auth.type` 时,默认使用 `ak_sk`。 + 2. 当 `mc.auth.type` 为 `ram_role_arn` 时,必须同时配置 `mc.access_key`、`mc.secret_key` 和 `mc.ram_role_arn`。 + 3. 当 `mc.auth.type` 为 `ecs_ram_role` 时,必须配置 `mc.ecs_ram_role`。 + 4. 
使用 `mc.access_key` 和 `mc.secret_key` 时,这两个参数必须成对配置。 + + 不同认证方式的 SQL 示例请参见下方的[基础示例](#基础示例)。 + * `{McRequiredProperties}` - | 属性名 | 说明 | 支持的 Doris 版本 | - | ------------------ | ------------------------------------------------------------------------------------------------------------------ | ------------ | - | `mc.default.project` | 想要访问的 MaxCompute 项目名称。可以在 [MaxCompute 项目列表](https://maxcompute.console.aliyun.com/cn-beijing/project-list) 中创建和管理。 | | - | `mc.access_key` | AccessKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。 | | - | `mc.secret_key` | SecretKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。 | | - | `mc.region` | MaxCompute 开通的地域。可以从 Endpoint 中找到对应的 Region | 2.1.7(不含)之前 | - | `mc.endpoint` | MaxCompute 开通的地域。请参照下文的如何获取 Endpoint 和 Quota 来配置。 | 2.1.7(含)之后 | + | 属性名 | 说明 | 支持的 Doris 版本 | + | --- | --- | --- | + | `mc.default.project` | 想要访问的 MaxCompute 项目名称。可以在 [MaxCompute 项目列表](https://maxcompute.console.aliyun.com/cn-beijing/project-list)中创建和管理。 | | + | `mc.region` | MaxCompute 开通的地域。可以从 Endpoint 中找到对应的 Region。 | 2.1.7(不含)之前 | + | `mc.endpoint` | MaxCompute 开通的地域。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | * `{McOptionalProperties}` - | 属性名 | 默认值 | 说明 | 支持的 Doris 版本 | - | -------------------------- | ------------- | -------------------------------------------------------------------------- | ------------ | - | `mc.tunnel_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | - | `mc.odps_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | - | `mc.quota` | `pay-as-you-go` | Quota 名称。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | - | `mc.split_strategy` | `byte_size` | 设置 split 的划分方式,可设置为按照字节大小划分 `byte_size` 和按照数据行数划分 `row_count` | 2.1.7(含)之后 | - | `mc.split_byte_size` | `268435456` | 每个 split 读取的文件大小,单位为字节,默认为 256MB,当且仅当 `"mc.split_strategy" = "byte_size"` 时生效 | 2.1.7(含)之后 | - | `mc.split_row_count` | `1048576` | 每个 split 读多少行,当且仅当 `"mc.split_strategy" = "row_count"` 时生效 | 2.1.7(含)之后 | - | `mc.split_cross_partition` | 
`false` | 生成的 split 是否跨分区。 | 2.1.8(含)之后 | - | `mc.connect_timeout` | `10s` | 连接 MaxCompute 的超时时间。 | 2.1.8(含)之后 | - | `mc.read_timeout` | `120s` | 读取 MaxCompute 的超时时间。 | 2.1.8(含)之后 | - | `mc.retry_count` | `4` | 超时后的重试次数。 | 2.1.8(含)之后 | - | `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 | - | `mc.account_format` | `name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 | - | `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3(含)之后 | - | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 | + | 属性名 | 默认值 | 说明 | 支持的 Doris 版本 | + | --- | --- | --- | --- | + | `mc.tunnel_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | + | `mc.odps_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | + | `mc.quota` | `pay-as-you-go` | Quota 名称。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | + | `mc.split_strategy` | `byte_size` | 设置 Split 的划分方式,可设置为按照字节大小划分 `byte_size` 和按照数据行数划分 `row_count`。 | 2.1.7(含)之后 | + | `mc.split_byte_size` | `268435456` | 每个 Split 读取的文件大小,单位为字节,默认为 256 MB。当且仅当 `"mc.split_strategy" = "byte_size"` 时生效。 | 2.1.7(含)之后 | + | `mc.split_row_count` | `1048576` | 每个 Split 读多少行。当且仅当 `"mc.split_strategy" = "row_count"` 时生效。 | 2.1.7(含)之后 | + | `mc.split_cross_partition` | `false` | 生成的 Split 是否跨分区。 | 2.1.8(含)之后 | + | `mc.connect_timeout` | `10s` | 连接 MaxCompute 的超时时间。 | 2.1.8(含)之后 | + | `mc.read_timeout` | `120s` | 读取 MaxCompute 的超时时间。 | 2.1.8(含)之后 | + | `mc.retry_count` | `4` | 超时后的重试次数。 | 2.1.8(含)之后 | + | `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 | + | `mc.account_format` | 
`name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 | + | `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 Schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations 。 | 3.1.3(含)之后 | + | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 | - - `mc.max_field_size_bytes` + - `mc.max_field_size_bytes` - MaxCompute 默认允许单个字段最大为 8MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。 + MaxCompute 默认允许单个字段最大为 8 MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。 - 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行: + 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行: - `setproject odps.sql.cfile2.field.maxsize=262144;` + `setproject odps.sql.cfile2.field.maxsize=262144;` - 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。 + 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。 - 然后在 Doris 的 catalog 属性中设置 `mc.max_field_size_bytes` 为 262144(该值不能大于 MaxCompute 的设置值)。 + 然后在 Doris 的 Catalog 属性中设置 `mc.max_field_size_bytes` 为对应的字节值(该值不能大于 MaxCompute 的设置值)。 * `{CommonProperties}` @@ -93,7 +123,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( ## 层级映射 -- `mc.enable.namespace.schema` 为 false +- `mc.enable.namespace.schema` 为 `false` | Doris | MaxCompute | | -------- | ---------- | @@ -101,7 +131,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | Database | Project | | Table | Table | -- `mc.enable.namespace.schema` 为 true +- `mc.enable.namespace.schema` 为 `true` | Doris | MaxCompute | | -------- | ---------- | @@ -113,7 +143,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | MaxCompute Type | Doris Type | Comment | | ---------------- | ------------- | ---------------------------------------------------------------------------- | -| bolean | boolean | | +| boolean | boolean | | | tiny | tinyint | | | tinyint | tinyint | | | smallint | smallint | | @@ -121,12 +151,12 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | bigint | 
bigint | | | float | float | | | double | double | | -| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= scale <= 18 | +| decimal(P, S) | decimal(P, S) | 1 <= P <= 38,0 <= S <= 18 | | char(N) | char(N) | | | varchar(N) | varchar(N) | | | string | string | | | date | date | | -| datetime | datetime(3) | 固定映射到精度 3。可以通过 `SET [GLOBAL] time_zone = 'Asia/Shanghai'` 来指定时区。 | +| datetime | datetime(3) | 固定映射到精度 3。可以通过 `SET [GLOBAL] time_zone = 'Asia/Shanghai'` 来指定时区。 | | timestamp_ntz | datetime(6) | MaxCompute 的 `timestamp_ntz` 精度为 9,Doris 的 DATETIME 最大精度只有 6,故读取数据时会将多的部分直接截断。 | | timestamp | datetime(6) | 自 2.1.9/3.0.5 支持。MaxCompute 的 `timestamp` 精度为 9,Doris 的 DATETIME 最大精度只有 6,故读取数据时会将多的部分直接截断。 | | array | array | | @@ -136,17 +166,45 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( ## 基础示例 +默认 `ak_sk` 认证: + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ram_role_arn` 认证(4.0.4+): + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.auth.type' = 'ram_role_arn', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.ram_role_arn' = 'acs:ram:::role/', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ecs_ram_role` 认证(4.0.4+): + ```sql CREATE CATALOG mc_catalog PROPERTIES ( 'type' = 'max_compute', 'mc.default.project' = 'project', - 'mc.access_key' = 'sk', - 'mc.secret_key' = 'ak', - 'mc.endpoint' = 'http://service.cn-beijing-vpc.MaxCompute.aliyun-inc.com/api' + 'mc.auth.type' = 'ecs_ram_role', + 'mc.ecs_ram_role' = '', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' ); ``` -如使用 2.1.7(不含)之前的版本,请使用如下语句。(建议升级到 2.1.8 后使用) +如使用 2.1.7(不含)之前的版本,请使用如下语句(建议升级到 2.1.8 后使用): ```sql CREATE CATALOG mc_catalog 
PROPERTIES ( @@ -180,22 +238,22 @@ CREATE CATALOG mc_catalog PROPERTIES ( ### 基础查询 ```sql --- 1. switch to catalog, use database and query +-- 1. 切换到 Catalog,使用 Database 并查询 SWITCH mc_ctl; -USE mc_ctl; +USE mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 2. use mc database directly +-- 2. 直接使用 Database USE mc_ctl.mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 3. use full qualified name to query +-- 3. 使用全限定名查询 SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10; ``` ### 查询优化 -- LIMIT 查询优化 (自 4.1.0 起) +- LIMIT 查询优化(自 4.1.0 起) 该参数仅适用于需要频繁使用 `LIMIT 1` 来检测数据是否存在的场景。 @@ -268,7 +326,7 @@ CREATE TABLE mc_tbl AS SELECT * FROM other_table; ## 库表管理 -自 4.1.0 版本,Doris 支持创建和删除 MaxCompute 的库表。 +自 4.1.0 版本起,Doris 支持创建和删除 MaxCompute 的库表。 :::note - 该功能为实验功能,自 4.1.0 版本开始支持。 @@ -276,7 +334,7 @@ CREATE TABLE mc_tbl AS SELECT * FROM other_table; - 不支持创建聚簇表、事务表、Delta Table 和外部表。 ::: -> 该功能仅在 `mc.enable.namespace.schema` 属性为 `true` 时可用。 +该功能仅在 `mc.enable.namespace.schema` 属性为 `true` 时可用。 ### 创建和删除库 @@ -300,7 +358,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; ``` :::caution -对于 MaxCompute Database,删除后,会同时删除其下的所有表。 +对于 MaxCompute Database,删除后会同时删除其下的所有表。 ::: ### 创建和删除表 @@ -350,7 +408,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; ## 常见问题 -### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 之后) +### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 及之后) 1. 
如果使用数据传输服务独享资源组 @@ -364,7 +422,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; 使用 VPC 访问的用户,需要根据「各地域 Endpoint 对照表(阿里云 VPC 网络连接方式)」表中的「VPC 网络 Endpoint」列来配置 `mc.endpoint`。使用公网访问的用户,可以选择「各地域 Endpoint 对照表(阿里云经典网络连接方式)」表中的「经典网络 Endpoint」列、或者选择「各地域 Endpoint 对照表(外网连接方式)」表中的「外网 Endpoint」列来配置 `mc.endpoint`。 -### 自定义服务地址 (适用于 Doris 2.1.7 之前) +### 自定义服务地址(适用于 Doris 2.1.7 之前) 在 Doris 2.1.7 之前的版本中,使用 Tunnel SDK 与 MaxCompute 交互,因此需要使用以下两个 endpoint 属性: @@ -376,18 +434,18 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; 生成后的格式如下: -| `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | -| ------------------- | -------------------------------------------------------- | ----------------------------------------------- | -| false | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | -| true | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | +| `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | +| --- | --- | --- | +| `false` | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | +| `true` | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | -用户也可以单独指定 `mc.odps_endpoint` 和 `mc.tunnel_endpoint` 来自定义服务地址,适用于一些私有部署的 MaxCompute 环境。 +用户也可以单独指定 `mc.odps_endpoint` 和 `mc.tunnel_endpoint` 来自定义服务地址,适用于私有部署的 MaxCompute 环境。 -MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网络连接方式下的 Endpoint](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints)。 +MaxCompute Endpoint 和 Tunnel Endpoint 的配置,请参见[各地域及不同网络连接方式下的 Endpoint](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints)。 ### 资源使用控制 -用户可以通过调整 `parallel_pipeline_task_num`、`num_scanner_threads` 这两个 Session Variable 来调整[表级别请求并发数量](https://help.aliyun.com/zh/maxcompute/user-guide/data-transfer-service-quota-manage?spm=a2c4g.11186623.help-menu-search-27797.d_2),以控制数据传输服务中的资源消耗。其对应的并发数量等于 
`max(parallel_pipeline_task_num * be num * num_scanner_threads)`。 +用户可以通过调整 `parallel_pipeline_task_num` 和 `num_scanner_threads` 这两个 Session Variable 来调整[表级别请求并发数量](https://help.aliyun.com/zh/maxcompute/user-guide/data-transfer-service-quota-manage?spm=a2c4g.11186623.help-menu-search-27797.d_2),以控制数据传输服务中的资源消耗。其对应的并发数量等于 `max(parallel_pipeline_task_num * be num * num_scanner_threads)`。 需要注意: @@ -397,12 +455,12 @@ MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网 ### 写入最佳实践 -- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,所以在执行计划中会基于 Partition 字段进行排序,当数据量较大时,对内存资源消耗较大,可能导致写入失败。 +- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,因此执行计划中会基于分区字段进行排序。当数据量较大时,对内存资源消耗较大,可能导致写入失败。 -- 当不指定分区写入时,不要设置 `set enable_strict_consistency_dml=false`。该方法会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。 +- 当不指定分区写入时,不要设置 `SET enable_strict_consistency_dml = false`。该方法会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。 -- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。可以用于小数据量测试,如果 `LIMIT` 数量较大,写入性能不佳。 +- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。该方式可以用于小数据量测试,但如果 `LIMIT` 数量较大,写入性能不佳。 ### 写入报错:`Data invalid: ODPS-0020041:StringOutOfMaxLength` -参考 `mc.max_field_size_bytes` 的说明。 \ No newline at end of file +参考 `mc.max_field_size_bytes` 的说明。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md index 0ba3b8d1161b4..d86e2b4c12be1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md @@ -6,20 +6,20 @@ } --- -[MaxCompute](https://help.aliyun.com/zh/maxcompute/) 是阿里云上的企业级 SaaS(Software as a Service)模式云数据仓库。通过 MaxCompute 提供的开放存储 SDK,Doris 可以获取 MaxCompute 
的表信息并进行查询和写入操作。 +[MaxCompute](https://help.aliyun.com/zh/maxcompute/) 是阿里云上的企业级 SaaS(Software as a Service)模式云数据仓库。通过 MaxCompute 提供的开放存储 SDK,Doris 可以获取 MaxCompute 的表信息,并进行查询和写入操作。 ## 适用场景 -| 场景 | 说明 | -| ---- | ------------------------------------------------------ | +| 场景 | 说明 | +| ---- | ---- | | 数据集成 | 读取 MaxCompute 数据并写入到 Doris 内表。 | -| 数据写回 | 通过 INSERT 命令将数据写入 MaxCompute 表。(4.1.0 版本支持) | +| 数据写回 | 通过 INSERT 命令将数据写入 MaxCompute 表。(4.1.0 版本支持) | ## 使用须知 -1. 自 2.1.7 版本开始,MaxCompute Catalog 基于 [开放存储 SDK](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 开发,在这之前,基于 Tunnel API 进行开发。 +1. 自 2.1.7 版本开始,MaxCompute Catalog 基于[开放存储 SDK](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 开发,在这之前,基于 Tunnel API 进行开发。 -2. 开放存储 SDK 的使用有一定的限制,请参照该 [文档](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 中 `使用限制` 的章节。 +2. 开放存储 SDK 的使用有一定的限制,请参照该[文档](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1)中「使用限制」的章节。 3. 在 Doris 3.1.3 版本之前,MaxCompute 中的 Project 相当于 Doris 中的 Database。3.1.3 版本中,可以通过 `mc.enable.namespace.schema` 参数引入 MaxCompute 的 schema 层级。 @@ -30,52 +30,82 @@ ```sql CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'type' = 'max_compute', + {McAuthProperties}, {McRequiredProperties}, {McOptionalProperties}, {CommonProperties} ); ``` +* `{McAuthProperties}` + + 这些属性用于控制 Doris 通过 Catalog 访问 MaxCompute 时的认证方式,适用于查询和写入场景。 + + > 版本说明:`mc.auth.type`、`mc.ram_role_arn` 和 `mc.ecs_ram_role` 自 **4.0.4** 起支持。 + + | 属性名 | 默认值 | 说明 | 是否必填 | 支持的 Doris 版本 | + | --- | --- | --- | --- | --- | + | `mc.auth.type` | `ak_sk` | 认证类型。支持 `ak_sk`、`ram_role_arn` 和 `ecs_ram_role`。 | 否 | 4.0.4(含)之后 | + | `mc.access_key` | 无 | 阿里云 AccessKey。 | 当 `mc.auth.type` 为 `ak_sk`(默认)或 `ram_role_arn` 时必填。 | | + | `mc.secret_key` | 无 | 阿里云 SecretKey。 | 当 `mc.auth.type` 为 `ak_sk`(默认)或 `ram_role_arn` 时必填。 | | + | `mc.ram_role_arn` | 无 | 阿里云 RAM Role ARN。 | 当 `mc.auth.type` 为 `ram_role_arn` 时必填。 | 4.0.4(含)之后 | + | `mc.ecs_ram_role` | 无 | ECS 实例绑定的 RAM Role 名称。 | 当 
`mc.auth.type` 为 `ecs_ram_role` 时必填。 | 4.0.4(含)之后 | + + `mc.auth.type` 可选值: + + | 取值 | 说明 | + | --- | --- | + | `ak_sk` | 使用阿里云 AccessKey / SecretKey 直接访问 MaxCompute。 | + | `ram_role_arn` | 使用 `mc.access_key` 和 `mc.secret_key` 作为源凭证调用 STS `AssumeRole`,再使用返回的临时凭证访问 MaxCompute。 | + | `ecs_ram_role` | 通过 ECS Metadata Service 获取临时凭证。请确保访问 MaxCompute 的 Doris FE 和 BE 节点都可以使用 `mc.ecs_ram_role` 指定的角色。 | + + 生效规则: + + 1. 未配置 `mc.auth.type` 时,默认使用 `ak_sk`。 + 2. 当 `mc.auth.type` 为 `ram_role_arn` 时,必须同时配置 `mc.access_key`、`mc.secret_key` 和 `mc.ram_role_arn`。 + 3. 当 `mc.auth.type` 为 `ecs_ram_role` 时,必须配置 `mc.ecs_ram_role`。 + 4. 使用 `mc.access_key` 和 `mc.secret_key` 时,这两个参数必须成对配置。 + + 不同认证方式的 SQL 示例请参见下方的[基础示例](#基础示例)。 + * `{McRequiredProperties}` - | 属性名 | 说明 | 支持的 Doris 版本 | - | ------------------ | ------------------------------------------------------------------------------------------------------------------ | ------------ | - | `mc.default.project` | 想要访问的 MaxCompute 项目名称。可以在 [MaxCompute 项目列表](https://maxcompute.console.aliyun.com/cn-beijing/project-list) 中创建和管理。 | | - | `mc.access_key` | AccessKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。 | | - | `mc.secret_key` | SecretKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。 | | - | `mc.region` | MaxCompute 开通的地域。可以从 Endpoint 中找到对应的 Region | 2.1.7(不含)之前 | - | `mc.endpoint` | MaxCompute 开通的地域。请参照下文的如何获取 Endpoint 和 Quota 来配置。 | 2.1.7(含)之后 | + | 属性名 | 说明 | 支持的 Doris 版本 | + | --- | --- | --- | + | `mc.default.project` | 想要访问的 MaxCompute 项目名称。可以在 [MaxCompute 项目列表](https://maxcompute.console.aliyun.com/cn-beijing/project-list)中创建和管理。 | | + | `mc.region` | MaxCompute 开通的地域。可以从 Endpoint 中找到对应的 Region。 | 2.1.7(不含)之前 | + | `mc.endpoint` | MaxCompute 开通的地域。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | * `{McOptionalProperties}` - | 属性名 | 默认值 | 说明 | 支持的 Doris 版本 | - | -------------------------- | ------------- | -------------------------------------------------------------------------- | ------------ 
| - | `mc.tunnel_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | - | `mc.odps_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | - | `mc.quota` | `pay-as-you-go` | Quota 名称。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | - | `mc.split_strategy` | `byte_size` | 设置 split 的划分方式,可设置为按照字节大小划分 `byte_size` 和按照数据行数划分 `row_count` | 2.1.7(含)之后 | - | `mc.split_byte_size` | `268435456` | 每个 split 读取的文件大小,单位为字节,默认为 256MB,当且仅当 `"mc.split_strategy" = "byte_size"` 时生效 | 2.1.7(含)之后 | - | `mc.split_row_count` | `1048576` | 每个 split 读多少行,当且仅当 `"mc.split_strategy" = "row_count"` 时生效 | 2.1.7(含)之后 | - | `mc.split_cross_partition` | `false` | 生成的 split 是否跨分区。 | 2.1.8(含)之后 | - | `mc.connect_timeout` | `10s` | 连接 MaxCompute 的超时时间。 | 2.1.8(含)之后 | - | `mc.read_timeout` | `120s` | 读取 MaxCompute 的超时时间。 | 2.1.8(含)之后 | - | `mc.retry_count` | `4` | 超时后的重试次数。 | 2.1.8(含)之后 | - | `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 | - | `mc.account_format` | `name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 | - | `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3(含)之后 | - | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 | + | 属性名 | 默认值 | 说明 | 支持的 Doris 版本 | + | --- | --- | --- | --- | + | `mc.tunnel_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | + | `mc.odps_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | + | `mc.quota` | `pay-as-you-go` | Quota 名称。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | + | `mc.split_strategy` | `byte_size` | 设置 Split 的划分方式,可设置为按照字节大小划分 `byte_size` 和按照数据行数划分 `row_count`。 | 2.1.7(含)之后 | + | `mc.split_byte_size` | `268435456` | 每个 Split 读取的文件大小,单位为字节,默认为 256 MB。当且仅当 `"mc.split_strategy" = 
"byte_size"` 时生效。 | 2.1.7(含)之后 | + | `mc.split_row_count` | `1048576` | 每个 Split 读多少行。当且仅当 `"mc.split_strategy" = "row_count"` 时生效。 | 2.1.7(含)之后 | + | `mc.split_cross_partition` | `false` | 生成的 Split 是否跨分区。 | 2.1.8(含)之后 | + | `mc.connect_timeout` | `10s` | 连接 MaxCompute 的超时时间。 | 2.1.8(含)之后 | + | `mc.read_timeout` | `120s` | 读取 MaxCompute 的超时时间。 | 2.1.8(含)之后 | + | `mc.retry_count` | `4` | 超时后的重试次数。 | 2.1.8(含)之后 | + | `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 | + | `mc.account_format` | `name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 | + | `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 Schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations 。 | 3.1.3(含)之后 | + | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 | - - `mc.max_field_size_bytes` + - `mc.max_field_size_bytes` - MaxCompute 默认允许单个字段最大为 8MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。 + MaxCompute 默认允许单个字段最大为 8 MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。 - 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行: + 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行: - `setproject odps.sql.cfile2.field.maxsize=262144;` + `setproject odps.sql.cfile2.field.maxsize=262144;` - 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。 + 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。 - 然后在 Doris 的 catalog 属性中设置 `mc.max_field_size_bytes` 为 262144(该值不能大于 MaxCompute 的设置值)。 + 然后在 Doris 的 Catalog 属性中设置 `mc.max_field_size_bytes` 为对应的字节值(该值不能大于 MaxCompute 的设置值)。 * `{CommonProperties}` @@ -93,7 +123,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( ## 层级映射 -- `mc.enable.namespace.schema` 为 false +- `mc.enable.namespace.schema` 为 `false` | Doris | MaxCompute | | -------- | ---------- | @@ -101,7 +131,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | 
Database | Project | | Table | Table | -- `mc.enable.namespace.schema` 为 true +- `mc.enable.namespace.schema` 为 `true` | Doris | MaxCompute | | -------- | ---------- | @@ -113,7 +143,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | MaxCompute Type | Doris Type | Comment | | ---------------- | ------------- | ---------------------------------------------------------------------------- | -| bolean | boolean | | +| boolean | boolean | | | tiny | tinyint | | | tinyint | tinyint | | | smallint | smallint | | @@ -121,12 +151,12 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | bigint | bigint | | | float | float | | | double | double | | -| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= scale <= 18 | +| decimal(P, S) | decimal(P, S) | 1 <= P <= 38,0 <= S <= 18 | | char(N) | char(N) | | | varchar(N) | varchar(N) | | | string | string | | | date | date | | -| datetime | datetime(3) | 固定映射到精度 3。可以通过 `SET [GLOBAL] time_zone = 'Asia/Shanghai'` 来指定时区。 | +| datetime | datetime(3) | 固定映射到精度 3。可以通过 `SET [GLOBAL] time_zone = 'Asia/Shanghai'` 来指定时区。 | | timestamp_ntz | datetime(6) | MaxCompute 的 `timestamp_ntz` 精度为 9,Doris 的 DATETIME 最大精度只有 6,故读取数据时会将多的部分直接截断。 | | timestamp | datetime(6) | 自 2.1.9/3.0.5 支持。MaxCompute 的 `timestamp` 精度为 9,Doris 的 DATETIME 最大精度只有 6,故读取数据时会将多的部分直接截断。 | | array | array | | @@ -136,17 +166,45 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( ## 基础示例 +默认 `ak_sk` 认证: + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ram_role_arn` 认证(4.0.4+): + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.auth.type' = 'ram_role_arn', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.ram_role_arn' = 'acs:ram:::role/', + 'mc.endpoint' 
= 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ecs_ram_role` 认证(4.0.4+): + ```sql CREATE CATALOG mc_catalog PROPERTIES ( 'type' = 'max_compute', 'mc.default.project' = 'project', - 'mc.access_key' = 'sk', - 'mc.secret_key' = 'ak', - 'mc.endpoint' = 'http://service.cn-beijing-vpc.MaxCompute.aliyun-inc.com/api' + 'mc.auth.type' = 'ecs_ram_role', + 'mc.ecs_ram_role' = '', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' ); ``` -如使用 2.1.7(不含)之前的版本,请使用如下语句。(建议升级到 2.1.8 后使用) +如使用 2.1.7(不含)之前的版本,请使用如下语句(建议升级到 2.1.8 后使用): ```sql CREATE CATALOG mc_catalog PROPERTIES ( @@ -180,22 +238,22 @@ CREATE CATALOG mc_catalog PROPERTIES ( ### 基础查询 ```sql --- 1. switch to catalog, use database and query +-- 1. 切换到 Catalog,使用 Database 并查询 SWITCH mc_ctl; -USE mc_ctl; +USE mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 2. use mc database directly +-- 2. 直接使用 Database USE mc_ctl.mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 3. use full qualified name to query +-- 3. 使用全限定名查询 SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10; ``` ### 查询优化 -- LIMIT 查询优化 (自 4.1.0 起) +- LIMIT 查询优化(自 4.1.0 起) 该参数仅适用于需要频繁使用 `LIMIT 1` 来检测数据是否存在的场景。 @@ -268,7 +326,7 @@ CREATE TABLE mc_tbl AS SELECT * FROM other_table; ## 库表管理 -自 4.1.0 版本,Doris 支持创建和删除 MaxCompute 的库表。 +自 4.1.0 版本起,Doris 支持创建和删除 MaxCompute 的库表。 :::note - 该功能为实验功能,自 4.1.0 版本开始支持。 @@ -276,7 +334,7 @@ CREATE TABLE mc_tbl AS SELECT * FROM other_table; - 不支持创建聚簇表、事务表、Delta Table 和外部表。 ::: -> 该功能仅在 `mc.enable.namespace.schema` 属性为 `true` 时可用。 +该功能仅在 `mc.enable.namespace.schema` 属性为 `true` 时可用。 ### 创建和删除库 @@ -300,7 +358,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; ``` :::caution -对于 MaxCompute Database,删除后,会同时删除其下的所有表。 +对于 MaxCompute Database,删除后会同时删除其下的所有表。 ::: ### 创建和删除表 @@ -350,7 +408,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; ## 常见问题 -### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 之后) +### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 及之后) 1. 
如果使用数据传输服务独享资源组 @@ -364,7 +422,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; 使用 VPC 访问的用户,需要根据「各地域 Endpoint 对照表(阿里云 VPC 网络连接方式)」表中的「VPC 网络 Endpoint」列来配置 `mc.endpoint`。使用公网访问的用户,可以选择「各地域 Endpoint 对照表(阿里云经典网络连接方式)」表中的「经典网络 Endpoint」列、或者选择「各地域 Endpoint 对照表(外网连接方式)」表中的「外网 Endpoint」列来配置 `mc.endpoint`。 -### 自定义服务地址 (适用于 Doris 2.1.7 之前) +### 自定义服务地址(适用于 Doris 2.1.7 之前) 在 Doris 2.1.7 之前的版本中,使用 Tunnel SDK 与 MaxCompute 交互,因此需要使用以下两个 endpoint 属性: @@ -376,18 +434,18 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; 生成后的格式如下: -| `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | -| ------------------- | -------------------------------------------------------- | ----------------------------------------------- | -| false | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | -| true | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | +| `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | +| --- | --- | --- | +| `false` | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | +| `true` | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | -用户也可以单独指定 `mc.odps_endpoint` 和 `mc.tunnel_endpoint` 来自定义服务地址,适用于一些私有部署的 MaxCompute 环境。 +用户也可以单独指定 `mc.odps_endpoint` 和 `mc.tunnel_endpoint` 来自定义服务地址,适用于私有部署的 MaxCompute 环境。 -MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网络连接方式下的 Endpoint](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints)。 +MaxCompute Endpoint 和 Tunnel Endpoint 的配置,请参见[各地域及不同网络连接方式下的 Endpoint](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints)。 ### 资源使用控制 -用户可以通过调整 `parallel_pipeline_task_num`、`num_scanner_threads` 这两个 Session Variable 来调整[表级别请求并发数量](https://help.aliyun.com/zh/maxcompute/user-guide/data-transfer-service-quota-manage?spm=a2c4g.11186623.help-menu-search-27797.d_2),以控制数据传输服务中的资源消耗。其对应的并发数量等于 
`max(parallel_pipeline_task_num * be num * num_scanner_threads)`。 +用户可以通过调整 `parallel_pipeline_task_num` 和 `num_scanner_threads` 这两个 Session Variable 来调整[表级别请求并发数量](https://help.aliyun.com/zh/maxcompute/user-guide/data-transfer-service-quota-manage?spm=a2c4g.11186623.help-menu-search-27797.d_2),以控制数据传输服务中的资源消耗。其对应的并发数量等于 `max(parallel_pipeline_task_num * be num * num_scanner_threads)`。 需要注意: @@ -397,12 +455,12 @@ MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网 ### 写入最佳实践 -- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,所以在执行计划中会基于 Partition 字段进行排序,当数据量较大时,对内存资源消耗较大,可能导致写入失败。 +- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,因此执行计划中会基于分区字段进行排序。当数据量较大时,对内存资源消耗较大,可能导致写入失败。 -- 当不指定分区写入时,不要设置 `set enable_strict_consistency_dml=false`。该方法会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。 +- 当不指定分区写入时,不要设置 `SET enable_strict_consistency_dml = false`。该方法会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。 -- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。可以用于小数据量测试,如果 `LIMIT` 数量较大,写入性能不佳。 +- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。该方式可以用于小数据量测试,但如果 `LIMIT` 数量较大,写入性能不佳。 ### 写入报错:`Data invalid: ODPS-0020041:StringOutOfMaxLength` -参考 `mc.max_field_size_bytes` 的说明。 \ No newline at end of file +参考 `mc.max_field_size_bytes` 的说明。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md index 0ba3b8d1161b4..d86e2b4c12be1 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md @@ -6,20 +6,20 @@ } --- -[MaxCompute](https://help.aliyun.com/zh/maxcompute/) 是阿里云上的企业级 SaaS(Software as a Service)模式云数据仓库。通过 MaxCompute 提供的开放存储 SDK,Doris 可以获取 MaxCompute 
的表信息并进行查询和写入操作。 +[MaxCompute](https://help.aliyun.com/zh/maxcompute/) 是阿里云上的企业级 SaaS(Software as a Service)模式云数据仓库。通过 MaxCompute 提供的开放存储 SDK,Doris 可以获取 MaxCompute 的表信息,并进行查询和写入操作。 ## 适用场景 -| 场景 | 说明 | -| ---- | ------------------------------------------------------ | +| 场景 | 说明 | +| ---- | ---- | | 数据集成 | 读取 MaxCompute 数据并写入到 Doris 内表。 | -| 数据写回 | 通过 INSERT 命令将数据写入 MaxCompute 表。(4.1.0 版本支持) | +| 数据写回 | 通过 INSERT 命令将数据写入 MaxCompute 表。(4.1.0 版本支持) | ## 使用须知 -1. 自 2.1.7 版本开始,MaxCompute Catalog 基于 [开放存储 SDK](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 开发,在这之前,基于 Tunnel API 进行开发。 +1. 自 2.1.7 版本开始,MaxCompute Catalog 基于[开放存储 SDK](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 开发,在这之前,基于 Tunnel API 进行开发。 -2. 开放存储 SDK 的使用有一定的限制,请参照该 [文档](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1) 中 `使用限制` 的章节。 +2. 开放存储 SDK 的使用有一定的限制,请参照该[文档](https://help.aliyun.com/zh/maxcompute/user-guide/overview-1)中「使用限制」的章节。 3. 在 Doris 3.1.3 版本之前,MaxCompute 中的 Project 相当于 Doris 中的 Database。3.1.3 版本中,可以通过 `mc.enable.namespace.schema` 参数引入 MaxCompute 的 schema 层级。 @@ -30,52 +30,82 @@ ```sql CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'type' = 'max_compute', + {McAuthProperties}, {McRequiredProperties}, {McOptionalProperties}, {CommonProperties} ); ``` +* `{McAuthProperties}` + + 这些属性用于控制 Doris 通过 Catalog 访问 MaxCompute 时的认证方式,适用于查询和写入场景。 + + > 版本说明:`mc.auth.type`、`mc.ram_role_arn` 和 `mc.ecs_ram_role` 自 **4.0.4** 起支持。 + + | 属性名 | 默认值 | 说明 | 是否必填 | 支持的 Doris 版本 | + | --- | --- | --- | --- | --- | + | `mc.auth.type` | `ak_sk` | 认证类型。支持 `ak_sk`、`ram_role_arn` 和 `ecs_ram_role`。 | 否 | 4.0.4(含)之后 | + | `mc.access_key` | 无 | 阿里云 AccessKey。 | 当 `mc.auth.type` 为 `ak_sk`(默认)或 `ram_role_arn` 时必填。 | | + | `mc.secret_key` | 无 | 阿里云 SecretKey。 | 当 `mc.auth.type` 为 `ak_sk`(默认)或 `ram_role_arn` 时必填。 | | + | `mc.ram_role_arn` | 无 | 阿里云 RAM Role ARN。 | 当 `mc.auth.type` 为 `ram_role_arn` 时必填。 | 4.0.4(含)之后 | + | `mc.ecs_ram_role` | 无 | ECS 实例绑定的 RAM Role 名称。 | 当 
`mc.auth.type` 为 `ecs_ram_role` 时必填。 | 4.0.4(含)之后 | + + `mc.auth.type` 可选值: + + | 取值 | 说明 | + | --- | --- | + | `ak_sk` | 使用阿里云 AccessKey / SecretKey 直接访问 MaxCompute。 | + | `ram_role_arn` | 使用 `mc.access_key` 和 `mc.secret_key` 作为源凭证调用 STS `AssumeRole`,再使用返回的临时凭证访问 MaxCompute。 | + | `ecs_ram_role` | 通过 ECS Metadata Service 获取临时凭证。请确保访问 MaxCompute 的 Doris FE 和 BE 节点都可以使用 `mc.ecs_ram_role` 指定的角色。 | + + 生效规则: + + 1. 未配置 `mc.auth.type` 时,默认使用 `ak_sk`。 + 2. 当 `mc.auth.type` 为 `ram_role_arn` 时,必须同时配置 `mc.access_key`、`mc.secret_key` 和 `mc.ram_role_arn`。 + 3. 当 `mc.auth.type` 为 `ecs_ram_role` 时,必须配置 `mc.ecs_ram_role`。 + 4. 使用 `mc.access_key` 和 `mc.secret_key` 时,这两个参数必须成对配置。 + + 不同认证方式的 SQL 示例请参见下方的[基础示例](#基础示例)。 + * `{McRequiredProperties}` - | 属性名 | 说明 | 支持的 Doris 版本 | - | ------------------ | ------------------------------------------------------------------------------------------------------------------ | ------------ | - | `mc.default.project` | 想要访问的 MaxCompute 项目名称。可以在 [MaxCompute 项目列表](https://maxcompute.console.aliyun.com/cn-beijing/project-list) 中创建和管理。 | | - | `mc.access_key` | AccessKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。 | | - | `mc.secret_key` | SecretKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。 | | - | `mc.region` | MaxCompute 开通的地域。可以从 Endpoint 中找到对应的 Region | 2.1.7(不含)之前 | - | `mc.endpoint` | MaxCompute 开通的地域。请参照下文的如何获取 Endpoint 和 Quota 来配置。 | 2.1.7(含)之后 | + | 属性名 | 说明 | 支持的 Doris 版本 | + | --- | --- | --- | + | `mc.default.project` | 想要访问的 MaxCompute 项目名称。可以在 [MaxCompute 项目列表](https://maxcompute.console.aliyun.com/cn-beijing/project-list)中创建和管理。 | | + | `mc.region` | MaxCompute 开通的地域。可以从 Endpoint 中找到对应的 Region。 | 2.1.7(不含)之前 | + | `mc.endpoint` | MaxCompute 开通的地域。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | * `{McOptionalProperties}` - | 属性名 | 默认值 | 说明 | 支持的 Doris 版本 | - | -------------------------- | ------------- | -------------------------------------------------------------------------- | ------------ 
| - | `mc.tunnel_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | - | `mc.odps_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | - | `mc.quota` | `pay-as-you-go` | Quota 名称。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | - | `mc.split_strategy` | `byte_size` | 设置 split 的划分方式,可设置为按照字节大小划分 `byte_size` 和按照数据行数划分 `row_count` | 2.1.7(含)之后 | - | `mc.split_byte_size` | `268435456` | 每个 split 读取的文件大小,单位为字节,默认为 256MB,当且仅当 `"mc.split_strategy" = "byte_size"` 时生效 | 2.1.7(含)之后 | - | `mc.split_row_count` | `1048576` | 每个 split 读多少行,当且仅当 `"mc.split_strategy" = "row_count"` 时生效 | 2.1.7(含)之后 | - | `mc.split_cross_partition` | `false` | 生成的 split 是否跨分区。 | 2.1.8(含)之后 | - | `mc.connect_timeout` | `10s` | 连接 MaxCompute 的超时时间。 | 2.1.8(含)之后 | - | `mc.read_timeout` | `120s` | 读取 MaxCompute 的超时时间。 | 2.1.8(含)之后 | - | `mc.retry_count` | `4` | 超时后的重试次数。 | 2.1.8(含)之后 | - | `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 | - | `mc.account_format` | `name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 | - | `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3(含)之后 | - | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 | + | 属性名 | 默认值 | 说明 | 支持的 Doris 版本 | + | --- | --- | --- | --- | + | `mc.tunnel_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | + | `mc.odps_endpoint` | 无 | 参考附录中的「自定义服务地址」。 | 2.1.7(不含)之前 | + | `mc.quota` | `pay-as-you-go` | Quota 名称。请参照下文的「如何获取 Endpoint 和 Quota」来配置。 | 2.1.7(含)之后 | + | `mc.split_strategy` | `byte_size` | 设置 Split 的划分方式,可设置为按照字节大小划分 `byte_size` 和按照数据行数划分 `row_count`。 | 2.1.7(含)之后 | + | `mc.split_byte_size` | `268435456` | 每个 Split 读取的文件大小,单位为字节,默认为 256 MB。当且仅当 `"mc.split_strategy" = 
"byte_size"` 时生效。 | 2.1.7(含)之后 | + | `mc.split_row_count` | `1048576` | 每个 Split 读多少行。当且仅当 `"mc.split_strategy" = "row_count"` 时生效。 | 2.1.7(含)之后 | + | `mc.split_cross_partition` | `false` | 生成的 Split 是否跨分区。 | 2.1.8(含)之后 | + | `mc.connect_timeout` | `10s` | 连接 MaxCompute 的超时时间。 | 2.1.8(含)之后 | + | `mc.read_timeout` | `120s` | 读取 MaxCompute 的超时时间。 | 2.1.8(含)之后 | + | `mc.retry_count` | `4` | 超时后的重试次数。 | 2.1.8(含)之后 | + | `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 | + | `mc.account_format` | `name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 | + | `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 Schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations 。 | 3.1.3(含)之后 | + | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 | - - `mc.max_field_size_bytes` + - `mc.max_field_size_bytes` - MaxCompute 默认允许单个字段最大为 8MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。 + MaxCompute 默认允许单个字段最大为 8 MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。 - 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行: + 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行: - `setproject odps.sql.cfile2.field.maxsize=262144;` + `setproject odps.sql.cfile2.field.maxsize=262144;` - 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。 + 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。 - 然后在 Doris 的 catalog 属性中设置 `mc.max_field_size_bytes` 为 262144(该值不能大于 MaxCompute 的设置值)。 + 然后在 Doris 的 Catalog 属性中设置 `mc.max_field_size_bytes` 为对应的字节值(该值不能大于 MaxCompute 的设置值)。 * `{CommonProperties}` @@ -93,7 +123,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( ## 层级映射 -- `mc.enable.namespace.schema` 为 false +- `mc.enable.namespace.schema` 为 `false` | Doris | MaxCompute | | -------- | ---------- | @@ -101,7 +131,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | 
Database | Project | | Table | Table | -- `mc.enable.namespace.schema` 为 true +- `mc.enable.namespace.schema` 为 `true` | Doris | MaxCompute | | -------- | ---------- | @@ -113,7 +143,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | MaxCompute Type | Doris Type | Comment | | ---------------- | ------------- | ---------------------------------------------------------------------------- | -| bolean | boolean | | +| boolean | boolean | | | tiny | tinyint | | | tinyint | tinyint | | | smallint | smallint | | @@ -121,12 +151,12 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( | bigint | bigint | | | float | float | | | double | double | | -| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= scale <= 18 | +| decimal(P, S) | decimal(P, S) | 1 <= P <= 38,0 <= S <= 18 | | char(N) | char(N) | | | varchar(N) | varchar(N) | | | string | string | | | date | date | | -| datetime | datetime(3) | 固定映射到精度 3。可以通过 `SET [GLOBAL] time_zone = 'Asia/Shanghai'` 来指定时区。 | +| datetime | datetime(3) | 固定映射到精度 3。可以通过 `SET [GLOBAL] time_zone = 'Asia/Shanghai'` 来指定时区。 | | timestamp_ntz | datetime(6) | MaxCompute 的 `timestamp_ntz` 精度为 9,Doris 的 DATETIME 最大精度只有 6,故读取数据时会将多的部分直接截断。 | | timestamp | datetime(6) | 自 2.1.9/3.0.5 支持。MaxCompute 的 `timestamp` 精度为 9,Doris 的 DATETIME 最大精度只有 6,故读取数据时会将多的部分直接截断。 | | array | array | | @@ -136,17 +166,45 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( ## 基础示例 +默认 `ak_sk` 认证: + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ram_role_arn` 认证(4.0.4+): + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.auth.type' = 'ram_role_arn', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.ram_role_arn' = 'acs:ram:::role/', + 'mc.endpoint' 
= 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ecs_ram_role` 认证(4.0.4+): + ```sql CREATE CATALOG mc_catalog PROPERTIES ( 'type' = 'max_compute', 'mc.default.project' = 'project', - 'mc.access_key' = 'sk', - 'mc.secret_key' = 'ak', - 'mc.endpoint' = 'http://service.cn-beijing-vpc.MaxCompute.aliyun-inc.com/api' + 'mc.auth.type' = 'ecs_ram_role', + 'mc.ecs_ram_role' = '', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' ); ``` -如使用 2.1.7(不含)之前的版本,请使用如下语句。(建议升级到 2.1.8 后使用) +如使用 2.1.7(不含)之前的版本,请使用如下语句(建议升级到 2.1.8 后使用): ```sql CREATE CATALOG mc_catalog PROPERTIES ( @@ -180,22 +238,22 @@ CREATE CATALOG mc_catalog PROPERTIES ( ### 基础查询 ```sql --- 1. switch to catalog, use database and query +-- 1. 切换到 Catalog,使用 Database 并查询 SWITCH mc_ctl; -USE mc_ctl; +USE mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 2. use mc database directly +-- 2. 直接使用 Database USE mc_ctl.mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 3. use full qualified name to query +-- 3. 使用全限定名查询 SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10; ``` ### 查询优化 -- LIMIT 查询优化 (自 4.1.0 起) +- LIMIT 查询优化(自 4.1.0 起) 该参数仅适用于需要频繁使用 `LIMIT 1` 来检测数据是否存在的场景。 @@ -268,7 +326,7 @@ CREATE TABLE mc_tbl AS SELECT * FROM other_table; ## 库表管理 -自 4.1.0 版本,Doris 支持创建和删除 MaxCompute 的库表。 +自 4.1.0 版本起,Doris 支持创建和删除 MaxCompute 的库表。 :::note - 该功能为实验功能,自 4.1.0 版本开始支持。 @@ -276,7 +334,7 @@ CREATE TABLE mc_tbl AS SELECT * FROM other_table; - 不支持创建聚簇表、事务表、Delta Table 和外部表。 ::: -> 该功能仅在 `mc.enable.namespace.schema` 属性为 `true` 时可用。 +该功能仅在 `mc.enable.namespace.schema` 属性为 `true` 时可用。 ### 创建和删除库 @@ -300,7 +358,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; ``` :::caution -对于 MaxCompute Database,删除后,会同时删除其下的所有表。 +对于 MaxCompute Database,删除后会同时删除其下的所有表。 ::: ### 创建和删除表 @@ -350,7 +408,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; ## 常见问题 -### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 之后) +### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 及之后) 1. 
如果使用数据传输服务独享资源组 @@ -364,7 +422,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; 使用 VPC 访问的用户,需要根据「各地域 Endpoint 对照表(阿里云 VPC 网络连接方式)」表中的「VPC 网络 Endpoint」列来配置 `mc.endpoint`。使用公网访问的用户,可以选择「各地域 Endpoint 对照表(阿里云经典网络连接方式)」表中的「经典网络 Endpoint」列、或者选择「各地域 Endpoint 对照表(外网连接方式)」表中的「外网 Endpoint」列来配置 `mc.endpoint`。 -### 自定义服务地址 (适用于 Doris 2.1.7 之前) +### 自定义服务地址(适用于 Doris 2.1.7 之前) 在 Doris 2.1.7 之前的版本中,使用 Tunnel SDK 与 MaxCompute 交互,因此需要使用以下两个 endpoint 属性: @@ -376,18 +434,18 @@ DROP DATABASE [IF EXISTS] mc.mc_schema; 生成后的格式如下: -| `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | -| ------------------- | -------------------------------------------------------- | ----------------------------------------------- | -| false | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | -| true | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | +| `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | +| --- | --- | --- | +| `false` | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | +| `true` | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | -用户也可以单独指定 `mc.odps_endpoint` 和 `mc.tunnel_endpoint` 来自定义服务地址,适用于一些私有部署的 MaxCompute 环境。 +用户也可以单独指定 `mc.odps_endpoint` 和 `mc.tunnel_endpoint` 来自定义服务地址,适用于私有部署的 MaxCompute 环境。 -MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网络连接方式下的 Endpoint](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints)。 +MaxCompute Endpoint 和 Tunnel Endpoint 的配置,请参见[各地域及不同网络连接方式下的 Endpoint](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints)。 ### 资源使用控制 -用户可以通过调整 `parallel_pipeline_task_num`、`num_scanner_threads` 这两个 Session Variable 来调整[表级别请求并发数量](https://help.aliyun.com/zh/maxcompute/user-guide/data-transfer-service-quota-manage?spm=a2c4g.11186623.help-menu-search-27797.d_2),以控制数据传输服务中的资源消耗。其对应的并发数量等于 
`max(parallel_pipeline_task_num * be num * num_scanner_threads)`。 +用户可以通过调整 `parallel_pipeline_task_num` 和 `num_scanner_threads` 这两个 Session Variable 来调整[表级别请求并发数量](https://help.aliyun.com/zh/maxcompute/user-guide/data-transfer-service-quota-manage?spm=a2c4g.11186623.help-menu-search-27797.d_2),以控制数据传输服务中的资源消耗。其对应的并发数量等于 `max(parallel_pipeline_task_num * be num * num_scanner_threads)`。 需要注意: @@ -397,12 +455,12 @@ MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网 ### 写入最佳实践 -- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,所以在执行计划中会基于 Partition 字段进行排序,当数据量较大时,对内存资源消耗较大,可能导致写入失败。 +- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,因此执行计划中会基于分区字段进行排序。当数据量较大时,对内存资源消耗较大,可能导致写入失败。 -- 当不指定分区写入时,不要设置 `set enable_strict_consistency_dml=false`。该方法会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。 +- 当不指定分区写入时,不要设置 `SET enable_strict_consistency_dml = false`。该方法会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。 -- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。可以用于小数据量测试,如果 `LIMIT` 数量较大,写入性能不佳。 +- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。该方式可以用于小数据量测试,但如果 `LIMIT` 数量较大,写入性能不佳。 ### 写入报错:`Data invalid: ODPS-0020041:StringOutOfMaxLength` -参考 `mc.max_field_size_bytes` 的说明。 \ No newline at end of file +参考 `mc.max_field_size_bytes` 的说明。 diff --git a/versioned_docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md b/versioned_docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md index ae73869d6fd49..e611f96f12f0e 100644 --- a/versioned_docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md +++ b/versioned_docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md @@ -11,9 +11,9 @@ ## Applicable Scenarios | Scenario | Description | -| ---- | ------------------------------------------------------ | +| ---- | ---- | | Data Integration | Read MaxCompute data and write to Doris internal tables. 
|
-| Data Write-back | Using INSERT command to write data into MaxCompute Table. (Supported since version 4.1.0) |
+| Data Write-back | Using INSERT command to write data into MaxCompute tables. (Supported since version 4.1.0) |
 ## Usage Notes
@@ -30,52 +30,82 @@
 ```sql
 CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
 'type' = 'max_compute',
+ {McAuthProperties},
 {McRequiredProperties},
 {McOptionalProperties},
 {CommonProperties}
 );
 ```
+* `{McAuthProperties}`
+
+ These properties control how Doris authenticates to MaxCompute for both query and write operations.
+
+ > Version note: `mc.auth.type`, `mc.ram_role_arn`, and `mc.ecs_ram_role` are supported since **4.0.4**.
+
+ | Property Name | Default Value | Description | Required | Supported Doris Version |
+ | --- | --- | --- | --- | --- |
+ | `mc.auth.type` | `ak_sk` | Authentication type. Supported values: `ak_sk`, `ram_role_arn`, and `ecs_ram_role`. | No | 4.0.4 (inclusive) and later |
+ | `mc.access_key` | None | Alibaba Cloud AccessKey. | Required when `mc.auth.type` is `ak_sk` (default) or `ram_role_arn`. | |
+ | `mc.secret_key` | None | Alibaba Cloud SecretKey. | Required when `mc.auth.type` is `ak_sk` (default) or `ram_role_arn`. | |
+ | `mc.ram_role_arn` | None | Alibaba Cloud RAM Role ARN. | Required when `mc.auth.type` is `ram_role_arn`. | 4.0.4 (inclusive) and later |
+ | `mc.ecs_ram_role` | None | Alibaba Cloud ECS RAM Role name attached to the instance. | Required when `mc.auth.type` is `ecs_ram_role`. | 4.0.4 (inclusive) and later |
+
+ Supported values for `mc.auth.type`:
+
+ | Value | Description |
+ | --- | --- |
+ | `ak_sk` | Use Alibaba Cloud AccessKey and SecretKey directly. |
+ | `ram_role_arn` | Use `mc.access_key` and `mc.secret_key` as source credentials to call STS `AssumeRole`, then access MaxCompute with the returned temporary credentials. |
+ | `ecs_ram_role` | Obtain temporary credentials from the ECS metadata service.
Ensure the Doris FE and BE nodes that access MaxCompute can use the role specified by `mc.ecs_ram_role`. | + + Validation rules: + + 1. If `mc.auth.type` is omitted, Doris uses `ak_sk`. + 2. When `mc.auth.type` is `ram_role_arn`, you must configure `mc.access_key`, `mc.secret_key`, and `mc.ram_role_arn`. + 3. When `mc.auth.type` is `ecs_ram_role`, you must configure `mc.ecs_ram_role`. + 4. When `mc.access_key` and `mc.secret_key` are used, they must be configured together. + + For SQL examples of different authentication types, see [Basic Example](#basic-example). + * `{McRequiredProperties}` - | Property Name | Description | Supported Doris Version | - | ------------------ | ------------------------------------------------------------------------------------------------------------------ | ------------ | - | `mc.default.project` | The name of the MaxCompute project to access. You can create and manage projects in the [MaxCompute Project List](https://maxcompute.console.aliyun.com/cn-beijing/project-list). | | - | `mc.access_key` | AccessKey. You can create and manage it in the [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak). | | - | `mc.secret_key` | SecretKey. You can create and manage it in the [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak). | | - | `mc.region` | The region where MaxCompute is activated. You can find the corresponding Region from the Endpoint. | Before 2.1.7 (exclusive) | - | `mc.endpoint` | The region where MaxCompute is activated. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | + | Property Name | Description | Supported Doris Version | + | --- | --- | --- | + | `mc.default.project` | The name of the MaxCompute project to access. You can create and manage projects in the [MaxCompute Project List](https://maxcompute.console.aliyun.com/cn-beijing/project-list). | | + | `mc.region` | The region where MaxCompute is activated. 
You can find the corresponding Region from the Endpoint. | Before 2.1.7 (exclusive) | + | `mc.endpoint` | The region where MaxCompute is activated. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | * `{McOptionalProperties}` - | Property Name | Default Value | Description | Supported Doris Version | - | -------------------------- | ------------- | -------------------------------------------------------------------------- | ------------ | - | `mc.tunnel_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | - | `mc.odps_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | - | `mc.quota` | `pay-as-you-go` | Quota name. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | - | `mc.split_strategy` | `byte_size` | Sets the split partitioning method. Can be set to partition by byte size `byte_size` or by row count `row_count`. | 2.1.7 (inclusive) and later | - | `mc.split_byte_size` | `268435456` | The file size each split reads, in bytes. Default is 256MB. Only effective when `"mc.split_strategy" = "byte_size"`. | 2.1.7 (inclusive) and later | - | `mc.split_row_count` | `1048576` | Number of rows each split reads. Only effective when `"mc.split_strategy" = "row_count"`. | 2.1.7 (inclusive) and later | - | `mc.split_cross_partition` | `false` | Whether the generated splits cross partitions. | 2.1.8 (inclusive) and later | - | `mc.connect_timeout` | `10s` | Connection timeout for MaxCompute. | 2.1.8 (inclusive) and later | - | `mc.read_timeout` | `120s` | Read timeout for MaxCompute. | 2.1.8 (inclusive) and later | - | `mc.retry_count` | `4` | Number of retries after timeout. | 2.1.8 (inclusive) and later | - | `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. 
Doris loses precision (9 -> 6) when syncing these two types. Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later | - | `mc.account_format` | `name` | The account systems of Alibaba Cloud International and China sites are inconsistent. For international site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later | - | `mc.enable.namespace.schema` | `false` | Whether to support MaxCompute schema hierarchy. See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later | - | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later | + | Property Name | Default Value | Description | Supported Doris Version | + | --- | --- | --- | --- | + | `mc.tunnel_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | + | `mc.odps_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | + | `mc.quota` | `pay-as-you-go` | Quota name. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | + | `mc.split_strategy` | `byte_size` | Sets the Split partitioning method. Can be set to partition by byte size `byte_size` or by row count `row_count`. | 2.1.7 (inclusive) and later | + | `mc.split_byte_size` | `268435456` | The file size each Split reads, in bytes. Default is 256 MB. Only effective when `"mc.split_strategy" = "byte_size"`. | 2.1.7 (inclusive) and later | + | `mc.split_row_count` | `1048576` | Number of rows each Split reads. 
Only effective when `"mc.split_strategy" = "row_count"`. | 2.1.7 (inclusive) and later | + | `mc.split_cross_partition` | `false` | Whether the generated Splits cross partitions. | 2.1.8 (inclusive) and later | + | `mc.connect_timeout` | `10s` | Connection timeout for MaxCompute. | 2.1.8 (inclusive) and later | + | `mc.read_timeout` | `120s` | Read timeout for MaxCompute. | 2.1.8 (inclusive) and later | + | `mc.retry_count` | `4` | Number of retries after timeout. | 2.1.8 (inclusive) and later | + | `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. Doris loses precision (9 -> 6) when syncing these two types. Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later | + | `mc.account_format` | `name` | The account systems of Alibaba Cloud International and China sites are inconsistent. For international site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later | + | `mc.enable.namespace.schema` | `false` | Whether to support MaxCompute Schema hierarchy. See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later | + | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later | - - `mc.max_field_size_bytes` + - `mc.max_field_size_bytes` - MaxCompute allows a maximum of 8 MB per field by default. If the data being written contains large string or binary fields, the write may fail. + MaxCompute allows a maximum of 8 MB per field by default. 
If the data being written contains large string or binary fields, the write may fail. - To adjust this limit, first execute the following command in the MaxCompute console SQL editor: + To adjust this limit, first execute the following command in the MaxCompute console SQL editor: - `setproject odps.sql.cfile2.field.maxsize=262144;` + `setproject odps.sql.cfile2.field.maxsize=262144;` - This adjusts the maximum bytes for a single field. The unit is KB and the maximum value is 262144. + This adjusts the maximum bytes for a single field. The unit is KB and the maximum value is 262144. - Then set `mc.max_field_size_bytes` to 262144 in the Doris catalog properties (this value must not exceed the MaxCompute setting). + Then set `mc.max_field_size_bytes` to the corresponding byte value in the Doris Catalog properties (this value must not exceed the MaxCompute setting). * `{CommonProperties}` @@ -93,7 +123,7 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers ## Hierarchy Mapping -- When `mc.enable.namespace.schema` is false +- When `mc.enable.namespace.schema` is `false` | Doris | MaxCompute | | -------- | ---------- | @@ -101,7 +131,7 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers | Database | Project | | Table | Table | -- When `mc.enable.namespace.schema` is true +- When `mc.enable.namespace.schema` is `true` | Doris | MaxCompute | | -------- | ---------- | @@ -121,7 +151,7 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers | bigint | bigint | | | float | float | | | double | double | | -| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= scale <= 18 | +| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= S <= 18 | | char(N) | char(N) | | | varchar(N) | varchar(N) | | | string | string | | @@ -136,17 +166,45 @@ Only the public cloud version of MaxCompute is supported. 
For private cloud vers ## Basic Example +Default `ak_sk` authentication: + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ram_role_arn` authentication (4.0.4+): + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.auth.type' = 'ram_role_arn', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.ram_role_arn' = 'acs:ram:::role/', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ecs_ram_role` authentication (4.0.4+): + ```sql CREATE CATALOG mc_catalog PROPERTIES ( 'type' = 'max_compute', 'mc.default.project' = 'project', - 'mc.access_key' = 'sk', - 'mc.secret_key' = 'ak', - 'mc.endpoint' = 'http://service.cn-beijing-vpc.MaxCompute.aliyun-inc.com/api' + 'mc.auth.type' = 'ecs_ram_role', + 'mc.ecs_ram_role' = '', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' ); ``` -If using a version before 2.1.7 (exclusive), please use the following statement. (It is recommended to upgrade to 2.1.8 or later) +If using a version before 2.1.7 (exclusive), please use the following statement (it is recommended to upgrade to 2.1.8 or later): ```sql CREATE CATALOG mc_catalog PROPERTIES ( @@ -180,16 +238,16 @@ CREATE CATALOG mc_catalog PROPERTIES ( ### Basic Query ```sql --- 1. switch to catalog, use database and query +-- 1. Switch to Catalog, use database and query SWITCH mc_ctl; -USE mc_ctl; +USE mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 2. use mc database directly +-- 2. Use mc database directly USE mc_ctl.mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 3. use full qualified name to query +-- 3. 
Use fully qualified name to query SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10; ``` @@ -276,7 +334,7 @@ Starting from version 4.1.0, Doris supports creating and dropping MaxCompute dat - Does not support creating clustered tables, transactional tables, Delta Tables, or external tables. ::: -> This feature is only available when the `mc.enable.namespace.schema` property is set to `true`. +This feature is only available when the `mc.enable.namespace.schema` property is set to `true`. ### Creating and Dropping Databases @@ -377,11 +435,11 @@ By default, MaxCompute Catalog generates endpoints based on `mc.region` and `mc. The generated format is as follows: | `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | -| ------------------- | -------------------------------------------------------- | ----------------------------------------------- | -| false | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | -| true | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | +| --- | --- | --- | +| `false` | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | +| `true` | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | -Users can also specify `mc.odps_endpoint` and `mc.tunnel_endpoint` individually to customize the service address, which is suitable for some privately deployed MaxCompute environments. +Users can also specify `mc.odps_endpoint` and `mc.tunnel_endpoint` individually to customize the service address, which is suitable for privately deployed MaxCompute environments. For configuring MaxCompute Endpoint and Tunnel Endpoint, please refer to [Endpoints for Different Regions and Network Connection Methods](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints).
@@ -399,7 +457,7 @@ Note: - It is recommended to write to specified partitions whenever possible, e.g. `INSERT INTO mc_tbl PARTITION(ds='20250201')`. When no partition is specified, due to limitations of the MaxCompute Storage API, data for each partition must be written sequentially. As a result, the execution plan will sort data based on the partition columns, which can consume significant memory resources when the data volume is large and may cause the write to fail. -- When writing without specifying a partition, do not set `set enable_strict_consistency_dml=false`. This forcefully removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute. +- When writing without specifying a partition, do not set `SET enable_strict_consistency_dml = false`. This forcefully removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute. - Do not add a `LIMIT` clause. When a `LIMIT` clause is added, Doris will use only a single thread for writing to guarantee the write count. This can be used for small-scale testing, but if the `LIMIT` value is large, write performance will be poor. diff --git a/versioned_docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md b/versioned_docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md index ae73869d6fd49..e611f96f12f0e 100644 --- a/versioned_docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md +++ b/versioned_docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md @@ -11,9 +11,9 @@ ## Applicable Scenarios | Scenario | Description | -| ---- | ------------------------------------------------------ | +| ---- | ---- | | Data Integration | Read MaxCompute data and write to Doris internal tables. | -| Data Write-back | Using INSERT command to write data into MaxCompute Table. (Supported since version 4.1.0) | +| Data Write-back | Using INSERT command to write data into MaxCompute tables. 
(Supported since version 4.1.0) | ## Usage Notes @@ -30,52 +30,82 @@ ```sql CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'type' = 'max_compute', + {McAuthProperties}, {McRequiredProperties}, {McOptionalProperties}, {CommonProperties} ); ``` +* `{McAuthProperties}` + + These properties control how Doris authenticates to MaxCompute for both query and write operations. + + > Version note: `mc.auth.type`, `mc.ram_role_arn`, and `mc.ecs_ram_role` are supported since **4.0.4**. + + | Property Name | Default Value | Description | Required | Supported Doris Version | + | --- | --- | --- | --- | --- | + | `mc.auth.type` | `ak_sk` | Authentication type. Supported values: `ak_sk`, `ram_role_arn`, and `ecs_ram_role`. | No | 4.0.4 (inclusive) and later | + | `mc.access_key` | None | Alibaba Cloud AccessKey. | Required when `mc.auth.type` is `ak_sk` (default) or `ram_role_arn`. | | + | `mc.secret_key` | None | Alibaba Cloud SecretKey. | Required when `mc.auth.type` is `ak_sk` (default) or `ram_role_arn`. | | + | `mc.ram_role_arn` | None | Alibaba Cloud RAM Role ARN. | Required when `mc.auth.type` is `ram_role_arn`. | 4.0.4 (inclusive) and later | + | `mc.ecs_ram_role` | None | Alibaba Cloud ECS RAM Role name attached to the instance. | Required when `mc.auth.type` is `ecs_ram_role`. | 4.0.4 (inclusive) and later | + + Supported values for `mc.auth.type`: + + | Value | Description | + | --- | --- | + | `ak_sk` | Use Alibaba Cloud AccessKey and SecretKey directly. | + | `ram_role_arn` | Use `mc.access_key` and `mc.secret_key` as source credentials to call STS `AssumeRole`, then access MaxCompute with the returned temporary credentials. | + | `ecs_ram_role` | Obtain temporary credentials from the ECS metadata service. Ensure the Doris FE and BE nodes that access MaxCompute can use the role specified by `mc.ecs_ram_role`. | + + Validation rules: + + 1. If `mc.auth.type` is omitted, Doris uses `ak_sk`. + 2. 
When `mc.auth.type` is `ram_role_arn`, you must configure `mc.access_key`, `mc.secret_key`, and `mc.ram_role_arn`. + 3. When `mc.auth.type` is `ecs_ram_role`, you must configure `mc.ecs_ram_role`. + 4. When `mc.access_key` and `mc.secret_key` are used, they must be configured together. + + For SQL examples of different authentication types, see [Basic Example](#basic-example). + * `{McRequiredProperties}` - | Property Name | Description | Supported Doris Version | - | ------------------ | ------------------------------------------------------------------------------------------------------------------ | ------------ | - | `mc.default.project` | The name of the MaxCompute project to access. You can create and manage projects in the [MaxCompute Project List](https://maxcompute.console.aliyun.com/cn-beijing/project-list). | | - | `mc.access_key` | AccessKey. You can create and manage it in the [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak). | | - | `mc.secret_key` | SecretKey. You can create and manage it in the [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak). | | - | `mc.region` | The region where MaxCompute is activated. You can find the corresponding Region from the Endpoint. | Before 2.1.7 (exclusive) | - | `mc.endpoint` | The region where MaxCompute is activated. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | + | Property Name | Description | Supported Doris Version | + | --- | --- | --- | + | `mc.default.project` | The name of the MaxCompute project to access. You can create and manage projects in the [MaxCompute Project List](https://maxcompute.console.aliyun.com/cn-beijing/project-list). | | + | `mc.region` | The region where MaxCompute is activated. You can find the corresponding Region from the Endpoint. | Before 2.1.7 (exclusive) | + | `mc.endpoint` | The region where MaxCompute is activated. 
Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | * `{McOptionalProperties}` - | Property Name | Default Value | Description | Supported Doris Version | - | -------------------------- | ------------- | -------------------------------------------------------------------------- | ------------ | - | `mc.tunnel_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | - | `mc.odps_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | - | `mc.quota` | `pay-as-you-go` | Quota name. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | - | `mc.split_strategy` | `byte_size` | Sets the split partitioning method. Can be set to partition by byte size `byte_size` or by row count `row_count`. | 2.1.7 (inclusive) and later | - | `mc.split_byte_size` | `268435456` | The file size each split reads, in bytes. Default is 256MB. Only effective when `"mc.split_strategy" = "byte_size"`. | 2.1.7 (inclusive) and later | - | `mc.split_row_count` | `1048576` | Number of rows each split reads. Only effective when `"mc.split_strategy" = "row_count"`. | 2.1.7 (inclusive) and later | - | `mc.split_cross_partition` | `false` | Whether the generated splits cross partitions. | 2.1.8 (inclusive) and later | - | `mc.connect_timeout` | `10s` | Connection timeout for MaxCompute. | 2.1.8 (inclusive) and later | - | `mc.read_timeout` | `120s` | Read timeout for MaxCompute. | 2.1.8 (inclusive) and later | - | `mc.retry_count` | `4` | Number of retries after timeout. | 2.1.8 (inclusive) and later | - | `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. Doris loses precision (9 -> 6) when syncing these two types. 
Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later | - | `mc.account_format` | `name` | The account systems of Alibaba Cloud International and China sites are inconsistent. For international site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later | - | `mc.enable.namespace.schema` | `false` | Whether to support MaxCompute schema hierarchy. See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later | - | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later | + | Property Name | Default Value | Description | Supported Doris Version | + | --- | --- | --- | --- | + | `mc.tunnel_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | + | `mc.odps_endpoint` | None | Refer to "Custom Service Address" in the appendix. | Before 2.1.7 (exclusive) | + | `mc.quota` | `pay-as-you-go` | Quota name. Please refer to the "How to Obtain Endpoint and Quota" section below for configuration. | 2.1.7 (inclusive) and later | + | `mc.split_strategy` | `byte_size` | Sets the Split partitioning method. Can be set to partition by byte size `byte_size` or by row count `row_count`. | 2.1.7 (inclusive) and later | + | `mc.split_byte_size` | `268435456` | The file size each Split reads, in bytes. Default is 256 MB. Only effective when `"mc.split_strategy" = "byte_size"`. | 2.1.7 (inclusive) and later | + | `mc.split_row_count` | `1048576` | Number of rows each Split reads. Only effective when `"mc.split_strategy" = "row_count"`. 
| 2.1.7 (inclusive) and later | + | `mc.split_cross_partition` | `false` | Whether the generated Splits cross partitions. | 2.1.8 (inclusive) and later | + | `mc.connect_timeout` | `10s` | Connection timeout for MaxCompute. | 2.1.8 (inclusive) and later | + | `mc.read_timeout` | `120s` | Read timeout for MaxCompute. | 2.1.8 (inclusive) and later | + | `mc.retry_count` | `4` | Number of retries after timeout. | 2.1.8 (inclusive) and later | + | `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. Doris loses precision (9 -> 6) when syncing these two types. Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later | + | `mc.account_format` | `name` | The account systems of Alibaba Cloud International and China sites are inconsistent. For international site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later | + | `mc.enable.namespace.schema` | `false` | Whether to support MaxCompute Schema hierarchy. See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later | + | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later | - - `mc.max_field_size_bytes` + - `mc.max_field_size_bytes` - MaxCompute allows a maximum of 8 MB per field by default. If the data being written contains large string or binary fields, the write may fail. + MaxCompute allows a maximum of 8 MB per field by default. If the data being written contains large string or binary fields, the write may fail. 
- To adjust this limit, first execute the following command in the MaxCompute console SQL editor: + To adjust this limit, first execute the following command in the MaxCompute console SQL editor: - `setproject odps.sql.cfile2.field.maxsize=262144;` + `setproject odps.sql.cfile2.field.maxsize=262144;` - This adjusts the maximum bytes for a single field. The unit is KB and the maximum value is 262144. + This adjusts the maximum bytes for a single field. The unit is KB and the maximum value is 262144. - Then set `mc.max_field_size_bytes` to 262144 in the Doris catalog properties (this value must not exceed the MaxCompute setting). + Then set `mc.max_field_size_bytes` to the corresponding byte value in the Doris Catalog properties (this value must not exceed the MaxCompute setting). * `{CommonProperties}` @@ -93,7 +123,7 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers ## Hierarchy Mapping -- When `mc.enable.namespace.schema` is false +- When `mc.enable.namespace.schema` is `false` | Doris | MaxCompute | | -------- | ---------- | @@ -101,7 +131,7 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers | Database | Project | | Table | Table | -- When `mc.enable.namespace.schema` is true +- When `mc.enable.namespace.schema` is `true` | Doris | MaxCompute | | -------- | ---------- | @@ -121,7 +151,7 @@ Only the public cloud version of MaxCompute is supported. For private cloud vers | bigint | bigint | | | float | float | | | double | double | | -| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= scale <= 18 | +| decimal(P, S) | decimal(P, S) | 1 <= P <= 38, 0 <= S <= 18 | | char(N) | char(N) | | | varchar(N) | varchar(N) | | | string | string | | @@ -136,17 +166,45 @@ Only the public cloud version of MaxCompute is supported. 
For private cloud vers ## Basic Example +Default `ak_sk` authentication: + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ram_role_arn` authentication (4.0.4+): + +```sql +CREATE CATALOG mc_catalog PROPERTIES ( + 'type' = 'max_compute', + 'mc.default.project' = 'project', + 'mc.auth.type' = 'ram_role_arn', + 'mc.access_key' = 'AKxxxxx', + 'mc.secret_key' = 'SKxxxxx', + 'mc.ram_role_arn' = 'acs:ram:::role/', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' +); +``` + +`ecs_ram_role` authentication (4.0.4+): + ```sql CREATE CATALOG mc_catalog PROPERTIES ( 'type' = 'max_compute', 'mc.default.project' = 'project', - 'mc.access_key' = 'sk', - 'mc.secret_key' = 'ak', - 'mc.endpoint' = 'http://service.cn-beijing-vpc.MaxCompute.aliyun-inc.com/api' + 'mc.auth.type' = 'ecs_ram_role', + 'mc.ecs_ram_role' = '', + 'mc.endpoint' = 'http://service.cn-beijing-vpc.maxcompute.aliyun-inc.com/api' ); ``` -If using a version before 2.1.7 (exclusive), please use the following statement. (It is recommended to upgrade to 2.1.8 or later) +If using a version before 2.1.7 (exclusive), please use the following statement (it is recommended to upgrade to 2.1.8 or later): ```sql CREATE CATALOG mc_catalog PROPERTIES ( @@ -180,16 +238,16 @@ CREATE CATALOG mc_catalog PROPERTIES ( ### Basic Query ```sql --- 1. switch to catalog, use database and query +-- 1. Switch to Catalog, use database and query SWITCH mc_ctl; -USE mc_ctl; +USE mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 2. use mc database directly +-- 2. Use mc database directly USE mc_ctl.mc_db; SELECT * FROM mc_tbl LIMIT 10; --- 3. use full qualified name to query +-- 3. 
Use fully qualified name to query SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10; ``` @@ -276,7 +334,7 @@ Starting from version 4.1.0, Doris supports creating and dropping MaxCompute dat - Does not support creating clustered tables, transactional tables, Delta Tables, or external tables. ::: -> This feature is only available when the `mc.enable.namespace.schema` property is set to `true`. +This feature is only available when the `mc.enable.namespace.schema` property is set to `true`. ### Creating and Dropping Databases @@ -377,11 +435,11 @@ By default, MaxCompute Catalog generates endpoints based on `mc.region` and `mc. The generated format is as follows: | `mc.public_access` | `mc.odps_endpoint` | `mc.tunnel_endpoint` | -| ------------------- | -------------------------------------------------------- | ----------------------------------------------- | -| false | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | -| true | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | +| --- | --- | --- | +| `false` | `http://service.{mc.region}.maxcompute.aliyun-inc.com/api` | `http://dt.{mc.region}.maxcompute.aliyun-inc.com` | +| `true` | `http://service.{mc.region}.maxcompute.aliyun.com/api` | `http://dt.{mc.region}.maxcompute.aliyun.com` | -Users can also specify `mc.odps_endpoint` and `mc.tunnel_endpoint` individually to customize the service address, which is suitable for privately deployed MaxCompute environments. +Users can also specify `mc.odps_endpoint` and `mc.tunnel_endpoint` individually to customize the service address, which is suitable for privately deployed MaxCompute environments. For configuring MaxCompute Endpoint and Tunnel Endpoint, please refer to [Endpoints for Different Regions and Network Connection Methods](https://help.aliyun.com/zh/maxcompute/user-guide/endpoints).
@@ -399,7 +457,7 @@ Note: - It is recommended to write to specified partitions whenever possible, e.g. `INSERT INTO mc_tbl PARTITION(ds='20250201')`. When no partition is specified, due to limitations of the MaxCompute Storage API, data for each partition must be written sequentially. As a result, the execution plan will sort data based on the partition columns, which can consume significant memory resources when the data volume is large and may cause the write to fail. -- When writing without specifying a partition, do not set `set enable_strict_consistency_dml=false`. This forcefully removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute. +- When writing without specifying a partition, do not set `SET enable_strict_consistency_dml = false`. This forcefully removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute. - Do not add a `LIMIT` clause. When a `LIMIT` clause is added, Doris will use only a single thread for writing to guarantee the write count. This can be used for small-scale testing, but if the `LIMIT` value is large, write performance will be poor.
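
The write recommendations above can be sketched roughly as follows. This is a minimal illustration rather than a tested script: `mc_tbl` and the `ds='20250201'` partition come from the text above, while the catalog and database names `mc_ctl.mc_db`, the source table `internal.db1.src_tbl`, and the columns `id` and `name` are hypothetical placeholders. It also assumes that, with a statically specified partition, the `SELECT` list omits the partition column.

```sql
-- Preferred: write to one explicitly specified partition, which avoids
-- the sort on partition columns described above.
-- mc_ctl.mc_db, internal.db1.src_tbl, id and name are hypothetical names.
INSERT INTO mc_ctl.mc_db.mc_tbl PARTITION(ds='20250201')
SELECT id, name
FROM internal.db1.src_tbl
WHERE ds = '20250201';

-- When writing without a partition clause, keep strict consistency
-- enabled so the sort node stays in the execution plan.
SET enable_strict_consistency_dml = true;
INSERT INTO mc_ctl.mc_db.mc_tbl
SELECT id, name, ds
FROM internal.db1.src_tbl;
```

Neither statement uses `LIMIT`, in line with the last recommendation. For large data volumes, splitting the load into one statement per partition is likely to be gentler on memory than a single unpartitioned write.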