From c89c482391e8d953d26f7bc92a07ec36edda4edb Mon Sep 17 00:00:00 2001 From: Socrates Date: Tue, 17 Mar 2026 16:08:32 +0800 Subject: [PATCH 1/4] docs(faq): add HMS DNS resolution diagnostic and Pulse toolset --- docs/faq/lakehouse-faq.md | 72 ++++++++++++++++-- .../current/faq/lakehouse-faq.md | 62 ++++++++++++++++ .../version-3.x/faq/lakehouse-faq.md | 64 ++++++++++++++++ .../version-4.x/faq/lakehouse-faq.md | 64 ++++++++++++++++ .../version-3.x/faq/lakehouse-faq.md | 73 +++++++++++++++++-- .../version-4.x/faq/lakehouse-faq.md | 72 ++++++++++++++++-- 6 files changed, 391 insertions(+), 16 deletions(-) diff --git a/docs/faq/lakehouse-faq.md b/docs/faq/lakehouse-faq.md index a5103977fdef3..1b79ba09c434f 100644 --- a/docs/faq/lakehouse-faq.md +++ b/docs/faq/lakehouse-faq.md @@ -245,13 +245,41 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the alter privilege. Therefore, the alter privilege needs to be added for this user on Ranger. -13. When querying ORC files, if an error like - `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00` - occurs. +13. When querying ORC files, if an error like + `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00` + occurs. - First check the `time_zone` setting of the current session. It is recommended to use a region-based timezone name such as `Asia/Shanghai`. + First check the `time_zone` setting of the current session. It is recommended to use a region-based timezone name such as `Asia/Shanghai`. - If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. 
In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. + If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. + +14. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** + + **Root Cause Analysis:** + This issue is usually not caused by slow execution of the HMS RPC itself. Instead, the most common root cause is **incorrect DNS configuration on the Doris FE node**. + During the initialization phase of the Hive Metastore Client, hostname resolution is triggered. If the configured DNS server is unreachable or unresponsive, it causes a DNS resolution timeout (typically 10 seconds) every time a new HMS client connection is established, which severely slows down metadata fetching. + + **Typical Symptoms:** + - **Normal Network Connectivity:** The HMS port is reachable, but metadata access in Doris remains extremely slow. + - **Consistent Delay:** The delay consistently hits a fixed timeout threshold (e.g., 10 seconds). + - **Workarounds Fail:** Simply increasing the HMS client timeout parameter in the Catalog properties only masks the error but does not eliminate the fixed 10-second delay on each connection. 
+
+    **Troubleshooting Steps:**
+    Run the following commands on the Doris FE node to verify the DNS and hostname resolution:
+
+    ```bash
+    # Check current DNS server configuration
+    cat /etc/resolv.conf
+    # Test if the DNS server is reachable and measure resolution latency
+    ping <dns_server_ip>
+    dig @<dns_server_ip> example.com
+    dig @<dns_server_ip> -x <hms_host_ip>
+    ```
+
+    **Solutions (Choose One):**
+    1. **Fix DNS Configuration (Recommended):** Correct the `nameserver` entries in `/etc/resolv.conf` on the Doris FE node to ensure the DNS service is reachable and responds quickly. If DNS is not required in your local network environment, consider commenting out the invalid nameservers.
+    2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node.
+    3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property.
 
 14. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported`
 
@@ -386,3 +414,37 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
     `./parquet-tools meta /path/to/file.parquet`
 
 4. For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)
+
+## Diagnostic Tools
+
+### Pulse
+
+[Pulse](https://github.com/CalvinKirs/Pulse) is a lightweight connectivity testing toolkit designed to diagnose infrastructure dependencies in data lake environments. It includes several specialized tools to help users quickly pinpoint environment-related issues in external table access.
+
+Pulse consists of the following key toolsets:
+
+1. **HMS Diagnostic Tool (`hms-tools`)**:
+    * Designed specifically for troubleshooting Hive Metastore (HMS) issues.
+    * Supports health checks, ping tests, object metadata retrieval, and configuration diagnostics.
+    * **Performance Benchmarking**: Features a `bench` mode to measure the response distribution and latency of HMS, helping determine if the bottleneck is at the metadata layer.
+
+2. **Kerberos Diagnostic Tool (`kerberos-tools`)**:
+    * Used to validate `krb5.conf` configurations in environments with Kerberos authentication.
+    * Supports testing KDC reachability, inspecting keytab files, and performing login tests to ensure the security layer is not blocking the connection.
+
+3. **Object Storage Diagnostic Tools (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**:
+    * Diagnostic tools for major cloud storage services (AWS S3, Google GCS, Azure Blob Storage).
+    * Used for troubleshooting common external table access issues such as "Access Denied" or "Bucket Not Found".
+    * Supports validating credential sources and STS identities, and performing bucket-level operation tests.
+
+**Example Commands (e.g., HMS):**
+
+```bash
+# Test basic HMS connectivity and latency details using hms-tools
+java -jar hms-tools.jar ping --uris thrift://<hms_host>:<hms_port> --count 3 --verbose
+
+# Benchmark actual metadata RPC response distribution using hms-tools
+java -jar hms-tools.jar bench --uris thrift://<hms_host>:<hms_port> --rpc get_all_databases --iterations 10
+```
+
+When metadata access is slow or external table connectivity fails, it is recommended to use the corresponding Pulse tool based on the issue type (e.g., authentication failure, slow metadata, or storage reachability) for investigation. If the `connect` phase is extremely fast but there are significant and consistent delays during the overall initialization, please refer to the FAQ above to check the DNS and hostname resolution settings on the FE node.
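As a quick complement to the FAQ entry above on slow DNS resolution, the resolver path can also be timed directly on the FE node. This is a minimal sketch: `localhost` is only a stand-in target so the command runs anywhere; substitute the HMS hostname from your `hive.metastore.uris`.

```shell
# Time a lookup through the system resolver (the same path the FE's JVM uses).
# On a healthy node this returns in milliseconds; a wall time near 10 seconds
# points to an unreachable nameserver in /etc/resolv.conf.
# "localhost" is a placeholder target; substitute your HMS hostname.
time getent hosts localhost
```

A fast result here, combined with slow catalog metadata access, shifts suspicion away from DNS and toward the HMS service itself.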
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
index c2679e6dcd6c1..6d1b67935f7ee 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
@@ -301,6 +301,34 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 
     该参数自 2.1.10 和 3.0.6 版本支持。
 
+15. 查询 Hive Catalog 表时,优化阶段极慢并伴随 `nereids cost too much time` 报错,且每次访问 HMS 的耗时稳定在 10 秒左右。
+
+    **问题分析:**
+    这类问题通常并非 HMS 服务本身的 RPC 执行慢引起,而是由于 **Doris FE 所在机器的 DNS 配置异常** 导致。
+    在 Hive Metastore Client 初始化阶段,底层会触发 hostname 解析。如果系统配置了无效的 DNS Server 或 DNS 服务不可达,会导致每次新建 HMS Client 时发生解析超时(通常为 10 秒),从而严重拖慢元数据获取速度。
+
+    **典型现象:**
+    - **基础网络正常**:HMS 端口端到端连通正常,但 Doris 获取元数据依然极慢。
+    - **规律性延迟**:耗时常常稳定在一个固定的超时时间(如 10 秒)。
+    - **规避无效**:单纯调大 Catalog 属性中的 HMS Client Timeout 只能暂时规避报错,但无法消除每次建立连接时的固定延迟。
+
+    **排查步骤:**
+    在 Doris FE 节点上执行以下命令,检查 DNS 和主机名解析是否正常:
+
+    ```bash
+    # 查看当前配置的 DNS server
+    cat /etc/resolv.conf
+    # 测试 DNS server 是否可达及解析耗时
+    ping <dns_server_ip>
+    dig @<dns_server_ip> example.com
+    dig @<dns_server_ip> -x <hms_host_ip>
+    ```
+
+    **解决方案(任选其一即可):**
+    1. **修复 DNS 配置(推荐)**:修正 Doris FE 节点上 `/etc/resolv.conf` 中的 `nameserver` 配置,确保域名解析服务正常且快速响应。如果局域网内无需 DNS 且无公网访问需求,可注释掉无效的 nameserver。
+    2. **配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。
+    3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。
+
 ## HDFS
 
 1. 访问 HDFS 3.x 时报错:`java.lang.VerifyError: xxx`
@@ -417,3 +445,37 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
     `./parquet-tools meta /path/to/file.parquet`
 
 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)
+
+## 诊断工具
+
+### Pulse
+
+[Pulse](https://github.com/CalvinKirs/Pulse) 是一个轻量级的连通性测试工具集,专为诊断数据湖环境中的基础设施依赖问题而设计。它包含了多个针对性工具,可以帮助用户快速定位外部表访问中的环境问题。
+
+Pulse 主要包含以下工具集:
+
+1. **HMS 诊断工具 (`hms-tools`)**:
+    - 专门用于排查 Hive Metastore (HMS) 相关问题。
+    - 支持健康检查、Ping 测试、元数据对象检索以及配置诊断。
+    - **性能压测**:提供 `bench` 模式,用于测量 HMS 的性能分布和响应延迟,帮助判断瓶颈是否在元数据层。
+
+2. **Kerberos 诊断工具 (`kerberos-tools`)**:
+    - 用于在使用 Kerberos 认证的环境中验证 `krb5.conf` 配置。
+    - 支持测试 KDC 可达性、检查 Keytab 文件以及执行登录测试,确保认证层不会阻断连接。
+
+3. **对象存储诊断工具 (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**:
+    - 针对主流云存储(AWS S3, Google GCS, Azure Blob)的诊断工具。
+    - 用于排查“权限拒绝(Access Denied)”或“存储桶不存在(Bucket Not Found)”等常见的外部表数据访问问题。
+    - 支持验证凭据来源、STS 身份以及执行 Bucket 级别的操作测试。
+
+**常用命令示例(以 HMS 为例):**
+
+```bash
+# 使用 hms-tools 测试 HMS 基础连通性与耗时细节
+java -jar hms-tools.jar ping --uris thrift://<hms_host>:<hms_port> --count 3 --verbose
+
+# 使用 hms-tools 压测实际元数据 RPC 的延时分布
+java -jar hms-tools.jar bench --uris thrift://<hms_host>:<hms_port> --rpc get_all_databases --iterations 10
+```
+
+当遇到元数据访问慢或外部表连接异常时,推荐根据问题类型(如认证失败、元数据慢或存储无法访问)选用对应的 Pulse 工具进行辅助定位。如果发现 `connect` 阶段极快,但整体初始化阶段存在明显且固定的延迟,请优先参考上文 FAQ 检查 FE 节点的 DNS 及 hostname 解析配置。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md
index c2679e6dcd6c1..f51992f009529 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md
@@ -300,6 +300,34 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 ```
 
     该参数自 2.1.10 和 3.0.6 版本支持。
+
+15. **查询 Hive Catalog 表时,优化阶段极慢并伴随 `nereids cost too much time` 报错,且每次访问 HMS 的耗时稳定在 10 秒左右。**
+
+    **问题分析:**
+    这类问题通常并非 HMS 服务本身的 RPC 执行慢引起,而是由于 **Doris FE 所在机器的 DNS 配置异常** 导致。
+    在 Hive Metastore Client 初始化阶段,底层会触发 hostname 解析。如果系统配置了无效的 DNS Server 或 DNS 服务不可达,会导致每次新建 HMS Client 时发生解析超时(通常为 10 秒),从而严重拖慢元数据获取速度。
+
+    **典型现象:**
+    - **基础网络正常**:HMS 端口端到端连通正常,但 Doris 获取元数据依然极慢。
+    - **规律性延迟**:耗时常常稳定在一个固定的超时时间(如 10 秒)。
+    - **规避无效**:单纯调大 Catalog 属性中的 HMS Client Timeout 只能暂时规避报错,但无法消除每次建立连接时的固定延迟。
+
+    **排查步骤:**
+    在 Doris FE 节点上执行以下命令,检查 DNS 和主机名解析是否正常:
+
+    ```bash
+    # 查看当前配置的 DNS server
+    cat /etc/resolv.conf
+    # 测试 DNS server 是否可达及解析耗时
+    ping <dns_server_ip>
+    dig @<dns_server_ip> example.com
+    dig @<dns_server_ip> -x <hms_host_ip>
+    ```
+
+    **解决方案(任选其一即可):**
+    1. **修复 DNS 配置(推荐)**:修正 Doris FE 节点上 `/etc/resolv.conf` 中的 `nameserver` 配置,确保域名解析服务正常且快速响应。如果局域网内无需 DNS 且无公网访问需求,可注释掉无效的 nameserver。
+    2. **配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。
+    3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。
 
 ## HDFS
 
@@ -417,3 +445,37 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
     `./parquet-tools meta /path/to/file.parquet`
 
 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)
+
+## 诊断工具
+
+### Pulse
+
+[Pulse](https://github.com/CalvinKirs/Pulse) 是一个轻量级的连通性测试工具集,专为诊断数据湖环境中的基础设施依赖问题而设计。它包含了多个针对性工具,可以帮助用户快速定位外部表访问中的环境问题。
+
+Pulse 主要包含以下工具集:
+
+1. **HMS 诊断工具 (`hms-tools`)**:
+    - 专门用于排查 Hive Metastore (HMS) 相关问题。
+    - 支持健康检查、Ping 测试、元数据对象检索以及配置诊断。
+    - **性能压测**:提供 `bench` 模式,用于测量 HMS 的性能分布和响应延迟,帮助判断瓶颈是否在元数据层。
+
+2. **Kerberos 诊断工具 (`kerberos-tools`)**:
+    - 用于在使用 Kerberos 认证的环境中验证 `krb5.conf` 配置。
+    - 支持测试 KDC 可达性、检查 Keytab 文件以及执行登录测试,确保认证层不会阻断连接。
+
+3. 
**对象存储诊断工具 (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**:
+    - 针对主流云存储(AWS S3, Google GCS, Azure Blob)的诊断工具。
+    - 用于排查“权限拒绝(Access Denied)”或“存储桶不存在(Bucket Not Found)”等常见的外部表数据访问问题。
+    - 支持验证凭据来源、STS 身份以及执行 Bucket 级别的操作测试。
+
+**常用命令示例(以 HMS 为例):**
+
+```bash
+# 使用 hms-tools 测试 HMS 基础连通性与耗时细节
+java -jar hms-tools.jar ping --uris thrift://<hms_host>:<hms_port> --count 3 --verbose
+
+# 使用 hms-tools 压测实际元数据 RPC 的延时分布
+java -jar hms-tools.jar bench --uris thrift://<hms_host>:<hms_port> --rpc get_all_databases --iterations 10
+```
+
+当遇到元数据访问慢或外部表连接异常时,推荐根据问题类型(如认证失败、元数据慢或存储无法访问)选用对应的 Pulse 工具进行辅助定位。如果发现 `connect` 阶段极快,但整体初始化阶段存在明显且固定的延迟,请优先参考上文 FAQ 检查 FE 节点的 DNS 及 hostname 解析配置。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md
index c2679e6dcd6c1..f51992f009529 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md
@@ -300,6 +300,34 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 ```
 
     该参数自 2.1.10 和 3.0.6 版本支持。
+
+15. **查询 Hive Catalog 表时,优化阶段极慢并伴随 `nereids cost too much time` 报错,且每次访问 HMS 的耗时稳定在 10 秒左右。**
+
+    **问题分析:**
+    这类问题通常并非 HMS 服务本身的 RPC 执行慢引起,而是由于 **Doris FE 所在机器的 DNS 配置异常** 导致。
+    在 Hive Metastore Client 初始化阶段,底层会触发 hostname 解析。如果系统配置了无效的 DNS Server 或 DNS 服务不可达,会导致每次新建 HMS Client 时发生解析超时(通常为 10 秒),从而严重拖慢元数据获取速度。
+
+    **典型现象:**
+    - **基础网络正常**:HMS 端口端到端连通正常,但 Doris 获取元数据依然极慢。
+    - **规律性延迟**:耗时常常稳定在一个固定的超时时间(如 10 秒)。
+    - **规避无效**:单纯调大 Catalog 属性中的 HMS Client Timeout 只能暂时规避报错,但无法消除每次建立连接时的固定延迟。
+
+    **排查步骤:**
+    在 Doris FE 节点上执行以下命令,检查 DNS 和主机名解析是否正常:
+
+    ```bash
+    # 查看当前配置的 DNS server
+    cat /etc/resolv.conf
+    # 测试 DNS server 是否可达及解析耗时
+    ping <dns_server_ip>
+    dig @<dns_server_ip> example.com
+    dig @<dns_server_ip> -x <hms_host_ip>
+    ```
+
+    **解决方案(任选其一即可):**
+    1. **修复 DNS 配置(推荐)**:修正 Doris FE 节点上 `/etc/resolv.conf` 中的 `nameserver` 配置,确保域名解析服务正常且快速响应。如果局域网内无需 DNS 且无公网访问需求,可注释掉无效的 nameserver。
+    2. **配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。
+    3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。
 
 ## HDFS
 
@@ -417,3 +445,37 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
     `./parquet-tools meta /path/to/file.parquet`
 
 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)
+
+## 诊断工具
+
+### Pulse
+
+[Pulse](https://github.com/CalvinKirs/Pulse) 是一个轻量级的连通性测试工具集,专为诊断数据湖环境中的基础设施依赖问题而设计。它包含了多个针对性工具,可以帮助用户快速定位外部表访问中的环境问题。
+
+Pulse 主要包含以下工具集:
+
+1. **HMS 诊断工具 (`hms-tools`)**:
+    - 专门用于排查 Hive Metastore (HMS) 相关问题。
+    - 支持健康检查、Ping 测试、元数据对象检索以及配置诊断。
+    - **性能压测**:提供 `bench` 模式,用于测量 HMS 的性能分布和响应延迟,帮助判断瓶颈是否在元数据层。
+
+2. **Kerberos 诊断工具 (`kerberos-tools`)**:
+    - 用于在使用 Kerberos 认证的环境中验证 `krb5.conf` 配置。
+    - 支持测试 KDC 可达性、检查 Keytab 文件以及执行登录测试,确保认证层不会阻断连接。
+
+3. 
**对象存储诊断工具 (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**:
+    - 针对主流云存储(AWS S3, Google GCS, Azure Blob)的诊断工具。
+    - 用于排查“权限拒绝(Access Denied)”或“存储桶不存在(Bucket Not Found)”等常见的外部表数据访问问题。
+    - 支持验证凭据来源、STS 身份以及执行 Bucket 级别的操作测试。
+
+**常用命令示例(以 HMS 为例):**
+
+```bash
+# 使用 hms-tools 测试 HMS 基础连通性与耗时细节
+java -jar hms-tools.jar ping --uris thrift://<hms_host>:<hms_port> --count 3 --verbose
+
+# 使用 hms-tools 压测实际元数据 RPC 的延时分布
+java -jar hms-tools.jar bench --uris thrift://<hms_host>:<hms_port> --rpc get_all_databases --iterations 10
+```
+
+当遇到元数据访问慢或外部表连接异常时,推荐根据问题类型(如认证失败、元数据慢或存储无法访问)选用对应的 Pulse 工具进行辅助定位。如果发现 `connect` 阶段极快,但整体初始化阶段存在明显且固定的延迟,请优先参考上文 FAQ 检查 FE 节点的 DNS 及 hostname 解析配置。
diff --git a/versioned_docs/version-3.x/faq/lakehouse-faq.md b/versioned_docs/version-3.x/faq/lakehouse-faq.md
index a5103977fdef3..bc9d77756dc09 100644
--- a/versioned_docs/version-3.x/faq/lakehouse-faq.md
+++ b/versioned_docs/version-3.x/faq/lakehouse-faq.md
@@ -245,13 +245,41 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 
     Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the alter privilege. Therefore, the alter privilege needs to be added for this user on Ranger.
 
-13. When querying ORC files, if an error like
-    `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00`
-    occurs.
+13. When querying ORC files, if an error like
+    `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00`
+    occurs.
 
-    First check the `time_zone` setting of the current session. It is recommended to use a region-based timezone name such as `Asia/Shanghai`.
+    First check the `time_zone` setting of the current session. It is recommended to use a region-based timezone name such as `Asia/Shanghai`.
 
-    If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. 
During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. + If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. + +14. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** + + **Root Cause Analysis:** + This issue is usually not caused by slow execution of the HMS RPC itself. Instead, the most common root cause is **incorrect DNS configuration on the Doris FE node**. + During the initialization phase of the Hive Metastore Client, hostname resolution is triggered. If the configured DNS server is unreachable or unresponsive, it causes a DNS resolution timeout (typically 10 seconds) every time a new HMS client connection is established, which severely slows down metadata fetching. + + **Typical Symptoms:** + - **Normal Network Connectivity:** The HMS port is reachable, but metadata access in Doris remains extremely slow. + - **Consistent Delay:** The delay consistently hits a fixed timeout threshold (e.g., 10 seconds). + - **Workarounds Fail:** Simply increasing the HMS client timeout parameter in the Catalog properties only masks the error but does not eliminate the fixed 10-second delay on each connection. 
+
+    **Troubleshooting Steps:**
+    Run the following commands on the Doris FE node to verify the DNS and hostname resolution:
+
+    ```bash
+    # Check current DNS server configuration
+    cat /etc/resolv.conf
+    # Test if the DNS server is reachable and measure resolution latency
+    ping <dns_server_ip>
+    dig @<dns_server_ip> example.com
+    dig @<dns_server_ip> -x <hms_host_ip>
+    ```
+
+    **Solutions (Choose One):**
+    1. **Fix DNS Configuration (Recommended):** Correct the `nameserver` entries in `/etc/resolv.conf` on the Doris FE node to ensure the DNS service is reachable and responds quickly. If DNS is not required in your local network environment, consider commenting out the invalid nameservers.
+    2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node.
+    3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property.
 
 14. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported`
 
@@ -276,7 +304,6 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
     This parameter is supported since versions 2.1.10 and 3.0.6.
 
 ## HDFS
-
 1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`, in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2 or upgrade Doris to versions after 1.2.2.
 
 2. Using Hedged Read to optimize slow HDFS reads. In some cases, high load on HDFS may lead to longer read times for data replicas on a specific HDFS, thereby slowing down overall query efficiency. The HDFS Client provides the Hedged Read feature. This feature initiates another read thread to read the same data if a read request exceeds a certain threshold without returning, and the result returned first is used.
@@ -386,3 +413,37 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- `./parquet-tools meta /path/to/file.parquet` 4. For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli) + +## Diagnostic Tools + +### Pulse + +[Pulse](https://github.com/CalvinKirs/Pulse) is a lightweight connectivity testing toolkit designed to diagnose infrastructure dependencies in data lake environments. It includes several specialized tools to help users quickly pinpoint environment-related issues in external table access. + +Pulse consists of the following key toolsets: + +1. **HMS Diagnostic Tool (`hms-tools`)**: + * Designed specifically for troubleshooting Hive Metastore (HMS) issues. + * Supports health checks, ping tests, object metadata retrieval, and configuration diagnostics. + * **Performance Benchmarking**: Features a `bench` mode to measure the response distribution and latency of HMS, helping determine if the bottleneck is at the metadata layer. + +2. **Kerberos Diagnostic Tool (`kerberos-tools`)**: + * Used to validate `krb5.conf` configurations in environments with Kerberos authentication. + * Supports testing KDC reachability, inspecting keytab files, and performing login tests to ensure the security layer is not blocking the connection. + +3. **Object Storage Diagnostic Tools (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**: + * Diagnostic tools for major cloud storage services (AWS S3, Google GCS, Azure Blob Storage). + * Used for troubleshooting common external table access issues such as "Access Denied" or "Bucket Not Found". + * Supports validating credential sources and STS identities, and performing bucket-level operation tests. 
+
+**Example Commands (e.g., HMS):**
+
+```bash
+# Test basic HMS connectivity and latency details using hms-tools
+java -jar hms-tools.jar ping --uris thrift://<hms_host>:<hms_port> --count 3 --verbose
+
+# Benchmark actual metadata RPC response distribution using hms-tools
+java -jar hms-tools.jar bench --uris thrift://<hms_host>:<hms_port> --rpc get_all_databases --iterations 10
+```
+
+When metadata access is slow or external table connectivity fails, it is recommended to use the corresponding Pulse tool based on the issue type (e.g., authentication failure, slow metadata, or storage reachability) for investigation. If the `connect` phase is extremely fast but there are significant and consistent delays during the overall initialization, please refer to the FAQ above to check the DNS and hostname resolution settings on the FE node.
diff --git a/versioned_docs/version-4.x/faq/lakehouse-faq.md b/versioned_docs/version-4.x/faq/lakehouse-faq.md
index a5103977fdef3..1b79ba09c434f 100644
--- a/versioned_docs/version-4.x/faq/lakehouse-faq.md
+++ b/versioned_docs/version-4.x/faq/lakehouse-faq.md
@@ -245,13 +245,41 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 
     Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the alter privilege. Therefore, the alter privilege needs to be added for this user on Ranger.
 
-13. When querying ORC files, if an error like
-    `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00`
-    occurs.
+13. When querying ORC files, if an error like
+    `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00`
+    occurs.
 
-    First check the `time_zone` setting of the current session. It is recommended to use a region-based timezone name such as `Asia/Shanghai`.
+    First check the `time_zone` setting of the current session. It is recommended to use a region-based timezone name such as `Asia/Shanghai`.
- If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. + If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. + +14. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** + + **Root Cause Analysis:** + This issue is usually not caused by slow execution of the HMS RPC itself. Instead, the most common root cause is **incorrect DNS configuration on the Doris FE node**. + During the initialization phase of the Hive Metastore Client, hostname resolution is triggered. If the configured DNS server is unreachable or unresponsive, it causes a DNS resolution timeout (typically 10 seconds) every time a new HMS client connection is established, which severely slows down metadata fetching. + + **Typical Symptoms:** + - **Normal Network Connectivity:** The HMS port is reachable, but metadata access in Doris remains extremely slow. + - **Consistent Delay:** The delay consistently hits a fixed timeout threshold (e.g., 10 seconds). + - **Workarounds Fail:** Simply increasing the HMS client timeout parameter in the Catalog properties only masks the error but does not eliminate the fixed 10-second delay on each connection. 
+
+    **Troubleshooting Steps:**
+    Run the following commands on the Doris FE node to verify the DNS and hostname resolution:
+
+    ```bash
+    # Check current DNS server configuration
+    cat /etc/resolv.conf
+    # Test if the DNS server is reachable and measure resolution latency
+    ping <dns_server_ip>
+    dig @<dns_server_ip> example.com
+    dig @<dns_server_ip> -x <hms_host_ip>
+    ```
+
+    **Solutions (Choose One):**
+    1. **Fix DNS Configuration (Recommended):** Correct the `nameserver` entries in `/etc/resolv.conf` on the Doris FE node to ensure the DNS service is reachable and responds quickly. If DNS is not required in your local network environment, consider commenting out the invalid nameservers.
+    2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node.
+    3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property.
 
 14. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported`
 
@@ -386,3 +414,37 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
     `./parquet-tools meta /path/to/file.parquet`
 
 4. For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)
+
+## Diagnostic Tools
+
+### Pulse
+
+[Pulse](https://github.com/CalvinKirs/Pulse) is a lightweight connectivity testing toolkit designed to diagnose infrastructure dependencies in data lake environments. It includes several specialized tools to help users quickly pinpoint environment-related issues in external table access.
+
+Pulse consists of the following key toolsets:
+
+1. **HMS Diagnostic Tool (`hms-tools`)**:
+    * Designed specifically for troubleshooting Hive Metastore (HMS) issues.
+    * Supports health checks, ping tests, object metadata retrieval, and configuration diagnostics.
+    * **Performance Benchmarking**: Features a `bench` mode to measure the response distribution and latency of HMS, helping determine if the bottleneck is at the metadata layer.
+
+2. **Kerberos Diagnostic Tool (`kerberos-tools`)**:
+    * Used to validate `krb5.conf` configurations in environments with Kerberos authentication.
+    * Supports testing KDC reachability, inspecting keytab files, and performing login tests to ensure the security layer is not blocking the connection.
+
+3. **Object Storage Diagnostic Tools (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**:
+    * Diagnostic tools for major cloud storage services (AWS S3, Google GCS, Azure Blob Storage).
+    * Used for troubleshooting common external table access issues such as "Access Denied" or "Bucket Not Found".
+    * Supports validating credential sources and STS identities, and performing bucket-level operation tests.
+
+**Example Commands (e.g., HMS):**
+
+```bash
+# Test basic HMS connectivity and latency details using hms-tools
+java -jar hms-tools.jar ping --uris thrift://<hms_host>:<hms_port> --count 3 --verbose
+
+# Benchmark actual metadata RPC response distribution using hms-tools
+java -jar hms-tools.jar bench --uris thrift://<hms_host>:<hms_port> --rpc get_all_databases --iterations 10
+```
+
+When metadata access is slow or external table connectivity fails, it is recommended to use the corresponding Pulse tool based on the issue type (e.g., authentication failure, slow metadata, or storage reachability) for investigation. If the `connect` phase is extremely fast but there are significant and consistent delays during the overall initialization, please refer to the FAQ above to check the DNS and hostname resolution settings on the FE node.
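For the static-hosts workaround mentioned in the FAQ above, the entry on the FE node would look like the following. Both the IP and the hostname here are hypothetical placeholders, not values from this repository:

```
# /etc/hosts on the Doris FE node -- hypothetical values for illustration
192.0.2.10   hms-host.example.com
```

`192.0.2.0/24` is a documentation-reserved address range; replace both fields with the real address and hostname of your HMS node.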
From c4d5d925e68a6c69296b235ff383bac68393c7e6 Mon Sep 17 00:00:00 2001 From: Socrates Date: Fri, 20 Mar 2026 17:18:11 +0800 Subject: [PATCH 2/4] docs(faq): add HMS connection pool idle timeout and optimizer timeout issues --- docs/faq/lakehouse-faq.md | 25 +++++++++++++++++ .../current/faq/lakehouse-faq.md | 25 +++++++++++++++++ .../version-3.x/faq/lakehouse-faq.md | 25 +++++++++++++++++ .../version-4.x/faq/lakehouse-faq.md | 26 ++++++++++++++++++ .../version-3.x/faq/lakehouse-faq.md | 27 +++++++++++++++++++ .../version-4.x/faq/lakehouse-faq.md | 27 +++++++++++++++++++ 6 files changed, 155 insertions(+) diff --git a/docs/faq/lakehouse-faq.md b/docs/faq/lakehouse-faq.md index 1b79ba09c434f..eb666c5e66c4e 100644 --- a/docs/faq/lakehouse-faq.md +++ b/docs/faq/lakehouse-faq.md @@ -303,6 +303,31 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- This parameter is supported since versions 2.1.10 and 3.0.6. +15. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.** + + **Problem Description:** + This usually happens after the Catalog has been idle for a while. When an HMS RPC is initiated, if a stale connection from the pool is reused, the request will hang for the duration of the Socket Timeout (default 10s). Due to the Hive Client's internal retry mechanism, this can result in cumulative waits of 20-30 seconds if multiple retries occur. This causes the query planning phase to be extremely slow, often triggering the Doris FE optimizer timeout error `nereids cost too much time`. Once the connection is purged and rebuilt, performance returns to normal. + + **Root Cause Analysis:** + Doris maintains a Client Pool for each HMS Catalog to reuse connections. 
In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), idle TCP connections are often "silently" reclaimed by network devices after an `idle timeout`. Since these devices typically do not send FIN/RST packets to notify the endpoints, Doris still believes the connection is valid. Reusing such a "zombie connection" requires waiting for a full Socket Timeout before the failure is detected and a retry is triggered.
+
+    **Troubleshooting Steps:**
+    - Verify if there are firewalls, NAT gateways, or Load Balancers between Doris FE and HMS.
+    - Use the **Pulse (hms-tools)** diagnostic tool. If the tool shows fast network connectivity but stable delays that are multiples of 10s when executing RPCs after a long idle period, it confirms that idle connections are being silently reclaimed.
+
+    **Solution:**
+    Configure the connection lifetime in your Catalog properties to be slightly shorter than the network device's idle timeout. We recommend using Hive's native socket lifetime property:
+
+    ```sql
+    CREATE CATALOG hive_catalog PROPERTIES (
+        "type" = "hms",
+        "hive.metastore.uris" = "thrift://<hms_host>:<hms_port>",
+        -- Set a value shorter than your network's idle timeout (e.g., 300s)
+        "hive.metastore.client.socket.lifetime" = "300s"
+    );
+    ```
+    When set, the HMS Client will check the connection age before sending an RPC. If it exceeds the `lifetime`, it proactively reconnects, avoiding long hangs and optimizer timeouts caused by stale connections.
+
 ## HDFS
 
 1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`: in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2 or upgrade Doris to versions after 1.2.2.
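The lifetime check that `hive.metastore.client.socket.lifetime` enables can be illustrated in miniature. The following is a hedged sketch of an age-limited connection pool, not Doris's actual pool code; class and parameter names are invented for illustration:

```python
import time

class PooledConnection:
    """Illustrative stand-in for a pooled HMS client connection."""
    def __init__(self):
        self.created_at = time.monotonic()

class AgeLimitedPool:
    """Sketch of the lifetime check described above: before reuse, a
    connection older than `lifetime_s` is discarded and rebuilt, so a
    "zombie" connection that a NAT gateway or firewall silently dropped
    is never handed back to a caller."""

    def __init__(self, lifetime_s: float):
        self.lifetime_s = lifetime_s
        self._idle = []

    def acquire(self) -> PooledConnection:
        while self._idle:
            conn = self._idle.pop()
            if time.monotonic() - conn.created_at < self.lifetime_s:
                return conn  # young enough: reuse it
            # past the lifetime: assume the network path may have reclaimed it
        return PooledConnection()  # reconnect proactively instead of hanging

    def release(self, conn: PooledConnection) -> None:
        self._idle.append(conn)

pool = AgeLimitedPool(lifetime_s=300.0)
conn = pool.acquire()
pool.release(conn)
print(pool.acquire() is conn)   # prints: True  (fresh connection is reused)

stale = PooledConnection()
stale.created_at -= 400.0       # simulate 400s of idle time, past the 300s lifetime
pool.release(stale)
print(pool.acquire() is stale)  # prints: False (stale one is silently replaced)
```

The key design point mirrors the FAQ item: detecting a dead connection by age before sending costs nothing, whereas detecting it by I/O failure costs a full socket timeout per retry.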
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
index 6d1b67935f7ee..f2caed18d1099 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
@@ -329,6 +329,31 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 2. **Configure static hosts mapping**: Add the IP-to-hostname mapping of the HMS nodes to `/etc/hosts` on the FE node.
 3. **Standardize the Catalog configuration**: When creating the Catalog, prefer the correct hostname over a bare IP for the `hive.metastore.uris` property.
 
+15. **When querying Hive Catalog tables, a query occasionally hangs for a long time, or directly reports the optimizer timeout error `nereids cost too much time`, but the next query works again immediately.**
+
+    **Problem Description:**
+    This usually occurs on the first query after the Catalog has been idle for a while. The request hangs when initiating an HMS RPC; because the Hive Client retries internally, reusing a dead long-lived connection waits out the Socket Timeout (10 seconds by default), and retries can accumulate to 20-30 seconds of waiting. This makes the query planning phase extremely slow and can even directly trigger the Doris FE optimizer timeout error `nereids cost too much time`. Once the connection is evicted and rebuilt, subsequent queries return to normal immediately.
+
+    **Root Cause Analysis:**
+    Doris maintains a Client Pool for each HMS Catalog to reuse connections. In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), intermediate network devices often apply an `idle timeout` to idle connections. When a connection stays idle beyond the threshold, the device silently drops the connection state, usually without sending FIN/RST packets to notify either end. Doris still considers the connection usable; the next time this "zombie connection" is reused, the link is already unreachable, so a full Socket Timeout must elapse before the failure is detected and a retry is triggered.
+
+    **Troubleshooting Suggestions:**
+    - Check whether traffic between Doris FE and HMS passes through firewalls, cloud-vendor NAT gateways, or load balancers.
+    - Use the **Pulse (hms-tools)** tool mentioned below. If probing shows very fast network connectivity, but the first RPC after a long idle period shows stable delays in multiples of 10s, it can be concluded that the long-lived connections are being silently reclaimed by intermediate devices.
+
+    **Solution:**
+    Use the Hive Client's native lifetime management: configure `hive.metastore.client.socket.lifetime` in the Catalog properties so that it is slightly shorter than the idle timeout of the intermediate network devices (e.g., 300 seconds):
+
+    ```sql
+    CREATE CATALOG hive_catalog PROPERTIES (
+        "type" = "hms",
+        "hive.metastore.uris" = "thrift://<hms_host>:<hms_port>",
+        -- Set a value shorter than the intermediate device's idle timeout, e.g., 300s
+        "hive.metastore.client.socket.lifetime" = "300s"
+    );
+    ```
+    Once configured, the HMS Client checks the connection age before executing an RPC. If the age exceeds the `lifetime`, the client proactively re-establishes the connection, avoiding long hangs or optimizer timeouts caused by reusing stale connections.
+
 ## HDFS
 
 1.
Error when accessing HDFS 3.x: `java.lang.VerifyError: xxx`
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md
index f51992f009529..d57bf28a23342 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md
@@ -331,6 +331,31 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 3. **Standardize the Catalog configuration**: When creating the Catalog, prefer the correct hostname over a bare IP for the `hive.metastore.uris` property.
 >>>>>>> 0cf7ea7ca780 (docs(faq): add HMS DNS resolution diagnostic and Pulse toolset)
 
+15. **When querying Hive Catalog tables, a query occasionally hangs for a long time, or directly reports the optimizer timeout error `nereids cost too much time`, but the next query works again immediately.**
+
+    **Problem Description:**
+    This usually occurs on the first query after the Catalog has been idle for a while. The request hangs when initiating an HMS RPC; because the Hive Client retries internally, reusing a dead long-lived connection waits out the Socket Timeout (10 seconds by default), and retries can accumulate to 20-30 seconds of waiting. This makes the query planning phase extremely slow and can even directly trigger the Doris FE optimizer timeout error `nereids cost too much time`. Once the connection is evicted and rebuilt, subsequent queries return to normal immediately.
+
+    **Root Cause Analysis:**
+    Doris maintains a Client Pool for each HMS Catalog to reuse connections. In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), intermediate network devices often apply an `idle timeout` to idle connections. When a connection stays idle beyond the threshold, the device silently drops the connection state, usually without sending FIN/RST packets to notify either end. Doris still considers the connection usable; the next time this "zombie connection" is reused, the link is already unreachable, so a full Socket Timeout must elapse before the failure is detected and a retry is triggered.
+
+    **Troubleshooting Suggestions:**
+    - Check whether traffic between Doris FE and HMS passes through firewalls, cloud-vendor NAT gateways, or load balancers.
+    - Use the **Pulse (hms-tools)** tool mentioned below. If probing shows very fast network connectivity, but the first RPC after a long idle period shows stable delays in multiples of 10s, it can be concluded that the long-lived connections are being silently reclaimed by intermediate devices.
+
+    **Solution:**
+    Use the Hive Client's native lifetime management: configure `hive.metastore.client.socket.lifetime` in the Catalog properties so that it is slightly shorter than the idle timeout of the intermediate network devices (e.g., 300 seconds):
+
+    ```sql
+    CREATE CATALOG hive_catalog PROPERTIES (
+        "type" = "hms",
+        "hive.metastore.uris" = "thrift://<hms_host>:<hms_port>",
+        -- Set a value shorter than the intermediate device's idle timeout, e.g., 300s
+        "hive.metastore.client.socket.lifetime" = "300s"
+    );
+    ```
+    Once configured, the HMS Client checks the connection age before executing an RPC. If the age exceeds the `lifetime`, the client proactively re-establishes the connection, avoiding long hangs or optimizer timeouts caused by reusing stale connections.
+
 ## HDFS
 
 1.
Error when accessing HDFS 3.x: `java.lang.VerifyError: xxx`
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md
index f51992f009529..a6742edabfbc9 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md
@@ -331,6 +331,32 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 3. **Standardize the Catalog configuration**: When creating the Catalog, prefer the correct hostname over a bare IP for the `hive.metastore.uris` property.
 >>>>>>> 0cf7ea7ca780 (docs(faq): add HMS DNS resolution diagnostic and Pulse toolset)
 
+15. **When querying Hive Catalog tables, a query occasionally hangs for about 10 seconds (the default socket timeout) before continuing, or reports a transport exception, but the next query works again immediately.**
+
+    **Problem Description:**
+    This usually occurs on the first query after the Catalog has been idle for a while. The first HMS RPC of the request hangs, and after about 10 seconds (the default `hive.metastore.client.socket.timeout`) it throws a timeout or an exception, or succeeds after a retry. Once the connection is evicted and rebuilt, subsequent queries return to normal immediately.
+
+    **Root Cause Analysis:**
+    Doris maintains a client pool for each HMS Catalog to reuse connections. In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), intermediate network devices often apply an `idle timeout` to idle connections.
+    When a connection stays idle beyond the threshold, the device silently drops the connection state, usually without notifying either end (no FIN/RST is sent). Doris still considers the connection usable; the next time this "zombie connection" is reused, the link is already unreachable, so a full socket timeout must elapse before the failure is detected.
+
+    **Troubleshooting Suggestions:**
+    - Check whether traffic between Doris FE and HMS passes through firewalls, cloud-vendor NAT gateways, or load balancers.
+    - Use the **Pulse (hms-tools)** tool mentioned below. If probing shows very fast network connectivity, but the first RPC after a long idle period times out consistently, it can be concluded that the long-lived connections are being silently reclaimed by intermediate devices.
+
+    **Solution:**
+    Use the Hive Client's native lifetime management: configure `hive.metastore.client.socket.lifetime` in the Catalog properties so that it is slightly shorter than the idle timeout of the intermediate network devices (e.g., 300 seconds):
+
+    ```sql
+    CREATE CATALOG hive_catalog PROPERTIES (
+        "type" = "hms",
+        "hive.metastore.uris" = "thrift://<hms_host>:<hms_port>",
+        -- Set a value shorter than the intermediate device's idle timeout, e.g., 300s
+        "hive.metastore.client.socket.lifetime" = "300s"
+    );
+    ```
+    Once configured, the HMS Client checks the connection age before executing an RPC. If the age exceeds the `lifetime`, the client proactively re-establishes the connection, avoiding the 10-second stalls caused by reusing stale connections.
+
 ## HDFS
 
 1.
Error when accessing HDFS 3.x: `java.lang.VerifyError: xxx`
diff --git a/versioned_docs/version-3.x/faq/lakehouse-faq.md b/versioned_docs/version-3.x/faq/lakehouse-faq.md
index bc9d77756dc09..145f352765bd2 100644
--- a/versioned_docs/version-3.x/faq/lakehouse-faq.md
+++ b/versioned_docs/version-3.x/faq/lakehouse-faq.md
@@ -281,6 +281,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node.
 3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property.
 
 14. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported`
 
    When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris:
@@ -302,6 +303,32 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
    ```
    This parameter is supported since versions 2.1.10 and 3.0.6.
 
+15. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.**
+
+    **Problem Description:**
+    This usually happens after the Catalog has been idle for a while. When an HMS RPC is initiated, if a stale connection from the pool is reused, the request will hang for the duration of the Socket Timeout (default 10s). Due to the Hive Client's internal retry mechanism, this can result in cumulative waits of 20-30 seconds if multiple retries occur.
This causes the query planning phase to be extremely slow, often triggering the Doris FE optimizer timeout error `nereids cost too much time`. Once the connection is purged and rebuilt, performance returns to normal.
+
+    **Root Cause Analysis:**
+    Doris maintains a Client Pool for each HMS Catalog to reuse connections. In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), idle TCP connections are "silently" reclaimed by network devices after an `idle timeout`. Since these devices typically do not send FIN/RST packets to notify the endpoints, Doris still believes the connection is valid. Reusing such a "zombie connection" requires waiting for a full Socket Timeout before the failure is detected and a retry is triggered.
+
+    **Troubleshooting Steps:**
+    - Verify if there are firewalls, NAT gateways, or Load Balancers between Doris FE and HMS.
+    - Use the **Pulse (hms-tools)** diagnostic tool. If the tool shows fast network connectivity but stable delays that are multiples of 10s when executing RPCs after a long idle period, it confirms that idle connections are being silently reclaimed.
+
+    **Solution:**
+    Configure the connection lifetime in your Catalog properties to be slightly shorter than the network device's idle timeout. We recommend using Hive's native socket lifetime property:
+
+    ```sql
+    CREATE CATALOG hive_catalog PROPERTIES (
+        "type" = "hms",
+        "hive.metastore.uris" = "thrift://<hms_host>:<hms_port>",
+        -- Set a value shorter than your network's idle timeout (e.g., 300s)
+        "hive.metastore.client.socket.lifetime" = "300s"
+    );
+    ```
+    When set, the HMS Client will check the connection age before sending an RPC. If it exceeds the `lifetime`, it proactively reconnects, avoiding long hangs and optimizer timeouts caused by stale connections.
 
 ## HDFS
 
 1.
When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`: in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2 or upgrade Doris to versions after 1.2.2.
diff --git a/versioned_docs/version-4.x/faq/lakehouse-faq.md b/versioned_docs/version-4.x/faq/lakehouse-faq.md
index 1b79ba09c434f..d18e17cb2cbda 100644
--- a/versioned_docs/version-4.x/faq/lakehouse-faq.md
+++ b/versioned_docs/version-4.x/faq/lakehouse-faq.md
@@ -281,6 +281,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node.
 3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property.
 
 14. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported`
 
    When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris:
@@ -302,6 +303,32 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
    ```
    This parameter is supported since versions 2.1.10 and 3.0.6.
 
+15. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.**
+
+    **Problem Description:**
+    This usually happens after the Catalog has been idle for a while. When an HMS RPC is initiated, if a stale connection from the pool is reused, the request will hang for the duration of the Socket Timeout (default 10s).
Due to the Hive Client's internal retry mechanism, this can result in cumulative waits of 20-30 seconds if multiple retries occur. This causes the query planning phase to be extremely slow, often triggering the Doris FE optimizer timeout error `nereids cost too much time`. Once the connection is purged and rebuilt, performance returns to normal.
+
+    **Root Cause Analysis:**
+    Doris maintains a Client Pool for each HMS Catalog to reuse connections. In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), idle TCP connections are "silently" reclaimed by network devices after an `idle timeout`. Since these devices typically do not send FIN/RST packets to notify the endpoints, Doris still believes the connection is valid. Reusing such a "zombie connection" requires waiting for a full Socket Timeout before the failure is detected and a retry is triggered.
+
+    **Troubleshooting Steps:**
+    - Verify if there are firewalls, NAT gateways, or Load Balancers between Doris FE and HMS.
+    - Use the **Pulse (hms-tools)** diagnostic tool. If the tool shows fast network connectivity but stable delays that are multiples of 10s when executing RPCs after a long idle period, it confirms that idle connections are being silently reclaimed.
+
+    **Solution:**
+    Configure the connection lifetime in your Catalog properties to be slightly shorter than the network device's idle timeout. We recommend using Hive's native socket lifetime property:
+
+    ```sql
+    CREATE CATALOG hive_catalog PROPERTIES (
+        "type" = "hms",
+        "hive.metastore.uris" = "thrift://<hms_host>:<hms_port>",
+        -- Set a value shorter than your network's idle timeout (e.g., 300s)
+        "hive.metastore.client.socket.lifetime" = "300s"
+    );
+    ```
+    When set, the HMS Client will check the connection age before sending an RPC. If it exceeds the `lifetime`, it proactively reconnects, avoiding long hangs and optimizer timeouts caused by stale connections.
 ## HDFS

From 10c4e3cb3e4233ef88bde8bc9365ff9a8cc50916 Mon Sep 17 00:00:00 2001
From: "Mingyu Chen (Rayner)"
Date: Mon, 30 Mar 2026 14:20:50 -0700
Subject: [PATCH 3/4] opt

---
 docs/faq/lakehouse-faq.md            | 176 ++++++------
 .../current/faq/lakehouse-faq.md     | 238 ++++++++--------
 .../version-3.x/faq/lakehouse-faq.md | 244 ++++++++---------
 .../version-4.x/faq/lakehouse-faq.md | 253 +++++++++---------
 .../version-3.x/faq/lakehouse-faq.md | 183 +++++++------
 .../version-4.x/faq/lakehouse-faq.md | 182 +++++++------
 6 files changed, 652 insertions(+), 624 deletions(-)

diff --git a/docs/faq/lakehouse-faq.md b/docs/faq/lakehouse-faq.md
index eb666c5e66c4e..23295d5e54b21 100644
--- a/docs/faq/lakehouse-faq.md
+++ b/docs/faq/lakehouse-faq.md
@@ -2,34 +2,34 @@
 {
     "title": "Data Lakehouse FAQ",
     "language": "en",
-    "description": "This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps:"
+    "description": "Apache Doris Data Lakehouse FAQ and troubleshooting guide, covering certificate issues, Kerberos authentication, JDBC Catalog, Hive Catalog, HDFS, DLF Catalog, and diagnostic tools."
 }
 ---

 ## Certificate Issues

 1. When querying, an error `curl 77: Problem with the SSL CA cert.` occurs. This indicates that the current system certificate is too old and needs to be updated locally.

-   - You can download the latest CA certificate from `https://curl.se/docs/caextract.html`.
-   - Place the downloaded `cacert-xxx.pem` into the `/etc/ssl/certs/` directory, for example: `sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`.
+    - You can download the latest CA certificate from `https://curl.se/docs/caextract.html`.
+    - Place the downloaded `cacert-xxx.pem` into the `/etc/ssl/certs/` directory, for example: `sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`.

 2.
When querying, an error occurs: `ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`. -``` -yum install -y ca-certificates -ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt -``` + ``` + yum install -y ca-certificates + ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt + ``` ## Kerberos 1. When connecting to a Hive Metastore authenticated with Kerberos, an error `GSS initiate failed` is encountered. - This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps: + This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps: 1. In versions prior to 1.2.1, the libhdfs3 library that Doris depends on did not enable gsasl. Please update to versions 1.2.2 and later. 2. Ensure that correct keytab and principal are set for each component and verify that the keytab file exists on all FE and BE nodes. - - `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`: Used for Hadoop hdfs access, fill in the corresponding values for hdfs. - - `hive.metastore.kerberos.principal`: Used for hive metastore. + 1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`: Used for Hadoop HDFS access, fill in the corresponding values for HDFS. + 2. `hive.metastore.kerberos.principal`: Used for Hive Metastore. 3. Try replacing the IP in the principal with a domain name (do not use the default `_HOST` placeholder). 4. Ensure that the `/etc/krb5.conf` file exists on all FE and BE nodes. @@ -37,54 +37,54 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 2. When connecting to a Hive database through the Hive Catalog, an error occurs: `RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`. 
If the error occurs during the query when there are no issues with `show databases` and `show tables`, follow these two steps: - - Place core-site.xml and hdfs-site.xml in the fe/conf and be/conf directories. - - Execute Kerberos kinit on the BE node, restart BE, and then proceed with the query. - - When encountering the error `GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)` while querying a table configured with Kerberos, restarting FE and BE nodes usually resolves the issue. - + - Place `core-site.xml` and `hdfs-site.xml` in the `fe/conf` and `be/conf` directories. + - Execute Kerberos `kinit` on the BE node, restart BE, and then proceed with the query. + +3. When encountering the error `GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)` while querying a table configured with Kerberos, restarting FE and BE nodes usually resolves the issue. + - Before restarting all nodes, configure `-Djavax.security.auth.useSubjectCredsOnly=false` in the JAVA_OPTS parameter in `"${DORIS_HOME}/be/conf/be.conf"` to obtain JAAS credentials information through the underlying mechanism rather than the application. - Refer to [JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html) for solutions to common JAAS errors. - To resolve the error `Unable to obtain password from user` when configuring Kerberos in the Catalog: - +4. To resolve the error `Unable to obtain password from user` when configuring Kerberos in the Catalog: + - Ensure the principal used is listed in klist by checking with `klist -kt your.keytab`. - - Verify the catalog configuration for any missing settings such as `yarn.resourcemanager.principal`. + - Verify the Catalog configuration for any missing settings such as `yarn.resourcemanager.principal`. 
- If the above checks are fine, it may be due to the JDK version installed by the system's package manager not supporting certain encryption algorithms. Consider installing JDK manually and setting the `JAVA_HOME` environment variable. - Kerberos typically uses AES-256 for encryption. For Oracle JDK, JCE must be installed. Some distributions of OpenJDK automatically provide unlimited strength JCE, eliminating the need for separate installation. - JCE versions correspond to JDK versions; download the appropriate JCE zip package and extract it to the `$JAVA_HOME/jre/lib/security` directory based on the JDK version: - - JDK6: [JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) - - JDK7: [JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) - - JDK8: [JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) - - When encountering the error `java.security.InvalidKeyException: Illegal key size` while accessing HDFS with KMS, upgrade the JDK version to >= Java 8 u162 or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files. - - If configuring Kerberos in the Catalog results in the error `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`, place the `core-site.xml` file in the `"${DORIS_HOME}/be/conf"` directory. - + - JDK6: [JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) + - JDK7: [JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) + - JDK8: [JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) + +5. When encountering the error `java.security.InvalidKeyException: Illegal key size` while accessing HDFS with KMS, upgrade the JDK version to >= Java 8 u162, or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files. + +6. 
If configuring Kerberos in the Catalog results in the error `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`, place the `core-site.xml` file in the `"${DORIS_HOME}/be/conf"` directory.
+
+    If accessing HDFS results in the error `No common protection layer between client and server`, ensure that the `hadoop.rpc.protection` properties on the client and server are consistent.
+
+    ```xml
+    <configuration>
+        <property>
+            <name>hadoop.security.authentication</name>
+            <value>kerberos</value>
+        </property>
+    </configuration>
+    ```
+
+7. When using Broker Load with Kerberos configured and encountering the error `Cannot locate default realm.`:
+
+    Add the configuration item `-Djava.security.krb5.conf=/your-path` to the `JAVA_OPTS` in the `start_broker.sh` script for Broker Load.

-3. When using Kerberos configuration in the Catalog, the `hadoop.username` property cannot be used simultaneously.
+8. When using Kerberos configuration in the Catalog, the `hadoop.username` property cannot be used simultaneously.

-4. Accessing Kerberos with JDK 17
+9. Accessing Kerberos with JDK 17

-   When running Doris with JDK 17 and accessing Kerberos services, you may encounter issues accessing due to the use of deprecated encryption algorithms. You need to add the `allow_weak_crypto=true` property in krb5.conf or upgrade the encryption algorithm in Kerberos.
+    When running Doris with JDK 17 and accessing Kerberos services, you may encounter issues due to the use of deprecated encryption algorithms. You need to add the `allow_weak_crypto=true` property in `krb5.conf`, or upgrade the encryption algorithm in Kerberos.

 For more details, refer to:

@@ -92,52 +92,52 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-

 1. Error connecting to SQLServer via JDBC Catalog: `unable to find valid certification path to requested target`

-   Add the `trustServerCertificate=true` option in the `jdbc_url`.
+ Add the `trustServerCertificate=true` option in the `jdbc_url`. 2. Connecting to MySQL database via JDBC Catalog results in Chinese character garbling or incorrect Chinese character query conditions - Add `useUnicode=true&characterEncoding=utf-8` in the `jdbc_url`. + Add `useUnicode=true&characterEncoding=utf-8` in the `jdbc_url`. - > Note: Starting from version 1.2.3, when connecting to MySQL database via JDBC Catalog, these parameters will be automatically added. + > Note: Starting from version 1.2.3, when connecting to MySQL database via JDBC Catalog, these parameters will be automatically added. 3. Error connecting to MySQL database via JDBC Catalog: `Establishing SSL connection without server's identity verification is not recommended` - Add `useSSL=true` in the `jdbc_url`. + Add `useSSL=true` in the `jdbc_url`. -4. When synchronizing MySQL data to Doris using JDBC Catalog, date data synchronization error occurs. Verify if the MySQL version matches the MySQL driver package, for example, MySQL 8 and above require the driver com.mysql.cj.jdbc.Driver. +4. When synchronizing MySQL data to Doris using JDBC Catalog, date data synchronization error occurs. Verify if the MySQL version matches the MySQL driver package, for example, MySQL 8 and above require the driver `com.mysql.cj.jdbc.Driver`. 5. When a single field is too large, a Java memory OOM occurs on the BE side during a query. - When Jdbc Scanner reads data through JDBC, the session variable `batch_size` determines the number of rows processed in the JVM per batch. If a single field is too large, it may cause `field_size * batch_size` (approximate value, considering JVM static memory and data copy overhead) to exceed the JVM memory limit, resulting in OOM. + When JDBC Scanner reads data through JDBC, the Session Variable `batch_size` determines the number of rows processed in the JVM per batch. 
If a single field is too large, it may cause `field_size * batch_size` (approximate value, considering JVM static memory and data copy overhead) to exceed the JVM memory limit, resulting in OOM. - Solutions: + Solutions: - - Reduce the `batch_size` value by executing `set batch_size = 512;`. The default value is 4064. - - Increase the BE JVM memory by modifying the `-Xmx` parameter in `JAVA_OPTS`. For example: `-Xmx8g`. + - Reduce the `batch_size` value by executing `set batch_size = 512;`. The default value is 4064. + - Increase the BE JVM memory by modifying the `-Xmx` parameter in `JAVA_OPTS`. For example: `-Xmx8g`. ## Hive Catalog 1. Accessing Iceberg or Hive table through Hive Catalog reports an error: `failed to get schema` or `Storage schema reading not supported` You can try the following methods: - - * Put the `iceberg` runtime-related jar package in the lib/ directory of Hive. - - * Configure in `hive-site.xml`: - + + - Put the `iceberg` runtime-related jar package in the `lib/` directory of Hive. + + - Configure in `hive-site.xml`: + ``` metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader ``` - + After the configuration is completed, you need to restart the Hive Metastore. - - * Add `"get_schema_from_table" = "true"` in the Catalog properties - + + - Add `"get_schema_from_table" = "true"` in the Catalog properties. + This parameter is supported since versions 2.1.10 and 3.0.6. 2. 
Error connecting to Hive Catalog: `Caused by: java.lang.NullPointerException` - If the fe.log contains the following stack trace: + If the fe.log contains the following stack trace: ``` Caused by: java.lang.NullPointerException @@ -148,17 +148,17 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181] ``` - Try adding `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` in the `create catalog` statement to resolve. + Try adding `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` in the `CREATE CATALOG` statement to resolve. 3. If after creating Hive Catalog, `show tables` works fine but querying results in `java.net.UnknownHostException: xxxxx` - Add the following in the CATALOG's PROPERTIES: + Add the following in the Catalog's PROPERTIES: ``` 'fs.defaultFS' = 'hdfs://' ``` -4. Tables in orc format in Hive 1.x may encounter system column names in the underlying orc file schema as `_col0`, `_col1`, `_col2`, etc. In this case, add `hive.version` as 1.x.x in the catalog configuration to map with the column names in the hive table. +4. Tables in ORC format in Hive 1.x may encounter system column names in the underlying ORC file Schema as `_col0`, `_col1`, `_col2`, etc. In this case, add `hive.version` as 1.x.x in the Catalog configuration to map with the column names in the Hive table. ```sql CREATE CATALOG hive PROPERTIES ( @@ -199,15 +199,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- } ``` -8. When querying a Hive external table, if you encounter the error `java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found`, search for `hadoop-lzo-*.jar` in the Hadoop environment, place it in the `"${DORIS_HOME}/fe/lib/"` directory, and restart the FE. 
Starting from version 2.0.2, you can place this file in the `custom_lib/` directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the lib directory being replaced.
+8. When querying a Hive external table, if you encounter the error `java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found`, search for `hadoop-lzo-*.jar` in the Hadoop environment, place it in the `"${DORIS_HOME}/fe/lib/"` directory, and restart the FE. Starting from version 2.0.2, you can place this file in the `custom_lib/` directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the `lib` directory being replaced.

-9. When creating a Hive table specifying the serde as `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`, and encountering the error `storage schema reading not supported` when accessing the table, add the following configuration to the hive-site.xml file and restart the HMS service:
+9. When creating a Hive table specifying the SerDe as `org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe`, and encountering the error `storage schema reading not supported` when accessing the table, add the following configuration to the `hive-site.xml` file and restart the HMS service:

+    ```xml
+    <property>
+        <name>metastore.storage.schema.reader.impl</name>
+        <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
+    </property>
+    ```

 10. Error: `java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty`.
The complete error message in the FE log is as follows: @@ -222,7 +222,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty ``` - Try updating the CA certificate on the FE node using `update-ca-trust (CentOS/RockyLinux)`, and then restart the FE process. + Try updating the CA certificate on the FE node using `update-ca-trust` (CentOS/RockyLinux), and then restart the FE process. 11. BE error: `java.lang.InternalError`. If you see an error similar to the following in `be.INFO`: @@ -243,7 +243,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 12. When inserting data into Hive, an error occurred as `HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`. - Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the alter privilege. Therefore, the alter privilege needs to be added for this user on Ranger. + Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the ALTER privilege. Therefore, the ALTER privilege needs to be added for this user on Ranger. 13. When querying ORC files, if an error like `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00` @@ -253,18 +253,21 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. -14. 
**When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** +15. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** **Root Cause Analysis:** + This issue is usually not caused by slow execution of the HMS RPC itself. Instead, the most common root cause is **incorrect DNS configuration on the Doris FE node**. During the initialization phase of the Hive Metastore Client, hostname resolution is triggered. If the configured DNS server is unreachable or unresponsive, it causes a DNS resolution timeout (typically 10 seconds) every time a new HMS client connection is established, which severely slows down metadata fetching. **Typical Symptoms:** + - **Normal Network Connectivity:** The HMS port is reachable, but metadata access in Doris remains extremely slow. - **Consistent Delay:** The delay consistently hits a fixed timeout threshold (e.g., 10 seconds). - **Workarounds Fail:** Simply increasing the HMS client timeout parameter in the Catalog properties only masks the error but does not eliminate the fixed 10-second delay on each connection. **Troubleshooting Steps:** + Run the following commands on the Doris FE node to verify the DNS and hostname resolution: ```bash @@ -281,7 +284,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node. 3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property. -14. 
When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` +16. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris: @@ -303,19 +306,23 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- This parameter is supported since versions 2.1.10 and 3.0.6. -15. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.** +17. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.** **Problem Description:** + This usually happens after the Catalog has been idle for a while. When an HMS RPC is initiated, if a stale connection from the pool is reused, the request will hang for the duration of the Socket Timeout (default 10s). Due to the Hive Client's internal retry mechanism, this can result in cumulative waits of 20-30 seconds if multiple retries occur. This causes the query planning phase to be extremely slow, often triggering the Doris FE optimizer timeout error `nereids cost too much time`. Once the connection is purged and rebuilt, performance returns to normal. **Root Cause Analysis:** + Doris maintains a Client Pool for each HMS Catalog to reuse connections. 
In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), idle TCP connections are often "silently" reclaimed by network devices after an `idle timeout`. Since these devices typically do not send FIN/RST packets to notify the endpoints, Doris still believes the connection is valid. Reusing such a "zombie connection" requires waiting for a full Socket Timeout before the failure is detected and a retry is triggered. **Troubleshooting Steps:** + - Verify if there are firewalls, NAT gateways, or Load Balancers between Doris FE and HMS. - Use the **Pulse (hms-tools)** diagnostic tool. If the tool shows fast network connectivity but stable delays that are multiples of 10s when executing RPCs after a long idle period, it confirms that idle connections are being silently reclaimed. **Solution:** + Configure the connection lifetime in your Catalog properties to be slightly shorter than the network device's idle timeout. We recommend using Hive's native socket lifetime property: ```sql @@ -326,11 +333,12 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- "hive.metastore.client.socket.lifetime" = "300s" ); ``` + When set, the HMS Client will check the connection age before sending an RPC. If it exceeds the `lifetime`, it proactively reconnects, avoiding long hangs and optimizer timeouts caused by stale connections. ## HDFS -1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`, in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2 or upgrade Doris to versions after 1.2.2. +1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`, in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2, or upgrade Doris to versions after 1.2.2. 2. Using Hedged Read to optimize slow HDFS reads. 
In some cases, high load on HDFS may lead to longer read times for data replicas on a specific HDFS, thereby slowing down overall query efficiency. The HDFS Client provides the Hedged Read feature. This feature initiates another read thread to read the same data if a read request exceeds a certain threshold without returning, and the result returned first is used. @@ -338,12 +346,12 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- You can enable this feature by: - ``` - create catalog regression properties ( - 'type'='hms', + ```sql + CREATE CATALOG regression PROPERTIES ( + 'type' = 'hms', 'hive.metastore.uris' = 'thrift://172.21.16.47:7004', 'dfs.client.hedged.read.threadpool.size' = '128', - 'dfs.client.hedged.read.threshold.millis' = "500" + 'dfs.client.hedged.read.threshold.millis' = '500' ); ``` @@ -355,13 +363,13 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- `TotalHedgedRead`: Number of times Hedged Read was initiated. - `HedgedReadWins`: Number of successful Hedged Reads (times when the request was initiated and returned faster than the original request) + `HedgedReadWins`: Number of successful Hedged Reads (times when the request was initiated and returned faster than the original request). Note that these values are cumulative for a single HDFS Client, not for a single query. The same HDFS Client can be reused by multiple queries. 3. `Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider` - In the start scripts of FE and BE, the environment variable `HADOOP_CONF_DIR` is added to the CLASSPATH. If `HADOOP_CONF_DIR` is set incorrectly, such as pointing to a non-existent or incorrect path, it may load the wrong xxx-site.xml file, resulting in reading incorrect information. + In the startup scripts of FE and BE, the environment variable `HADOOP_CONF_DIR` is added to the CLASSPATH. 
If `HADOOP_CONF_DIR` is set incorrectly, such as pointing to a non-existent or incorrect path, it may load the wrong `xxx-site.xml` file, resulting in reading incorrect information. Check if `HADOOP_CONF_DIR` is configured correctly or remove this environment variable. @@ -369,7 +377,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Possible solutions include: - Use `hdfs fsck file -files -blocks -locations` to check if the file is healthy. - - Check connectivity with datanodes using `telnet`. + - Check connectivity with DataNodes using `telnet`. The following error may be printed in the error log: @@ -383,7 +391,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- At the same time, please check whether the parameter is true in the `hdfs-site.xml` file placed under `fe/conf` and `be/conf`. - - Check datanode logs. + - Check DataNode logs. If you encounter the following error: @@ -391,11 +399,11 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX. Perhaps the client is running an older version of Hadoop which does not support SASL data transfer protection ``` - it means that the current hdfs has enabled encrypted transmission, but the client has not, causing the error. + it means that the current HDFS has enabled encrypted transmission, but the client has not, causing the error. Use any of the following solutions: - Copy `hdfs-site.xml` and `core-site.xml` to `fe/conf` and `be/conf`. (Recommended) - - In `hdfs-site.xml`, find the corresponding configuration `dfs.data.transfer.protection` and set this parameter in the catalog. + - In `hdfs-site.xml`, find the corresponding configuration `dfs.data.transfer.protection` and set this parameter in the Catalog. 5. 
When querying a Hive Catalog table, an error occurs: `RPC response has a length of xxx exceeds maximum data length` @@ -432,13 +440,13 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- When querying Parquet files, due to potential differences in the format of Parquet files generated by different systems, such as the number of RowGroups, index values, etc., sometimes it is necessary to check the metadata of Parquet files for issue identification or performance analysis. Here is a tool provided to help users analyze Parquet files more conveniently: - 1. Download and unzip [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz) - 2. Download the Parquet file to be analyzed to your local machine, assuming the path is `/path/to/file.parquet` + 1. Download and unzip [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz). + 2. Download the Parquet file to be analyzed to your local machine, assuming the path is `/path/to/file.parquet`. 3. Use the following command to analyze the metadata of the Parquet file: `./parquet-tools meta /path/to/file.parquet` - 4. For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli) + 4. For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli). 
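Before running the CLI from the steps above, a quick integrity check can save time: the Parquet format requires every valid file to begin and end with the 4-byte magic `PAR1`, so a truncated or mis-downloaded file can be detected with plain shell tools. A minimal sketch — the sample file below is synthetic, created only to illustrate the check; point the variable at your real `/path/to/file.parquet` instead:

```shell
# Hypothetical stand-in for /path/to/file.parquet, fabricated only for illustration.
sample=$(mktemp)
printf 'PAR1 fake column chunks and footer PAR1' > "$sample"

# Every valid Parquet file starts and ends with the 4-byte magic "PAR1".
head_magic=$(head -c 4 "$sample")
tail_magic=$(tail -c 4 "$sample")

if [ "$head_magic" = "PAR1" ] && [ "$tail_magic" = "PAR1" ]; then
    echo "magic OK - safe to run ./parquet-tools meta on this file"
else
    echo "not a valid Parquet file (truncated download or wrong format?)"
fi

rm -f "$sample"
```

If the trailing magic is missing, the file was most likely truncated during download, and `./parquet-tools meta` would fail on it anyway.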
## Diagnostic Tools diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md index f2caed18d1099..c4a94de62e060 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md @@ -2,91 +2,91 @@ { "title": "常见数据湖问题", "language": "zh-CN", - "description": "通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查:" + "description": "Apache Doris 数据湖(Lakehouse)常见问题排查指南,涵盖证书、Kerberos 认证、JDBC Catalog、Hive Catalog、HDFS、DLF Catalog 等场景的报错解决方案与诊断工具使用说明。" } --- ## 证书问题 1. 查询时报错 `curl 77: Problem with the SSL CA cert.`。说明当前系统证书过旧,需要更新本地证书。 - - 可以从 `https://curl.se/docs/caextract.html` 下载最新的 CA 证书。 - - 将下载后的 cacert-xxx.pem 放到`/etc/ssl/certs/`目录,例如:`sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`。 + - 可以从 `https://curl.se/docs/caextract.html` 下载最新的 CA 证书。 + - 将下载后的 cacert-xxx.pem 放到 `/etc/ssl/certs/` 目录,例如:`sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`。 -2. 查询时报错:`ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`. +2. 查询时报错:`ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`。 -``` -yum install -y ca-certificates -ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt -``` + ``` + yum install -y ca-certificates + ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt + ``` ## Kerberos 1. 连接 Kerberos 认证的 Hive Metastore 报错:`GSS initiate failed` - 通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查: + 通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查: 1. 1.2.1 之前的版本中,Doris 依赖的 libhdfs3 库没有开启 gsasl。请更新至 1.2.2 之后的版本。 2. 
确认对各个组件,设置了正确的 keytab 和 principal,并确认 keytab 文件存在于所有 FE、BE 节点上。 - 1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`:用于 Hadoop hdfs 访问,填写 hdfs 对应的值。 - 2. `hive.metastore.kerberos.principal`:用于 hive metastore。 + 1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`:用于 Hadoop HDFS 访问,填写 HDFS 对应的值。 + 2. `hive.metastore.kerberos.principal`:用于 Hive Metastore。 - 3. 尝试将 principal 中的 ip 换成域名(不要使用默认的 `_HOST` 占位符) + 3. 尝试将 principal 中的 IP 换成域名(不要使用默认的 `_HOST` 占位符)。 4. 确认 `/etc/krb5.conf` 文件存在于所有 FE、BE 节点上。 2. 通过 Hive Catalog 连接 Hive 数据库报错:`RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`. - 如果在 `show databases` 和 `show tables` 都是没问题的情况下,查询的时候出现上面的错误,我们需要进行下面两个操作: - - fe/conf、be/conf 目录下需放置 core-site.xml 和 hdfs-site.xml - - BE 节点执行 Kerberos 的 kinit 然后重启 BE,然后再去执行查询即可。 + 如果在 `show databases` 和 `show tables` 都是没问题的情况下,查询的时候出现上面的错误,需要进行下面两个操作: + - `fe/conf`、`be/conf` 目录下需放置 `core-site.xml` 和 `hdfs-site.xml`。 + - BE 节点执行 Kerberos 的 `kinit` 然后重启 BE,然后再去执行查询即可。 3. 查询配置了 Kerberos 的外表,遇到该报错:`GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)`,一般重启 FE 和 BE 能够解决该问题。 - - 重启所有节点前可在`"${DORIS_HOME}/be/conf/be.conf"`中的 JAVA_OPTS 参数里配置`-Djavax.security.auth.useSubjectCredsOnly=false`,通过底层机制去获取 JAAS credentials 信息,而不是应用程序。 - - 在[JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html)中可获取更多常见 JAAS 报错的解决方法。 + - 重启所有节点前可在 `"${DORIS_HOME}/be/conf/be.conf"` 中的 JAVA_OPTS 参数里配置 `-Djavax.security.auth.useSubjectCredsOnly=false`,通过底层机制去获取 JAAS credentials 信息,而不是应用程序。 + - 在 [JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html) 中可获取更多常见 JAAS 报错的解决方法。 -4. 在 Catalog 中配置 Kerberos 时,报错`Unable to obtain password from user`的解决方法: +4. 
在 Catalog 中配置 Kerberos 时,报错 `Unable to obtain password from user` 的解决方法: - - 用到的 principal 必须在 klist 中存在,使用`klist -kt your.keytab`检查。 - - 检查 catalog 配置是否正确,比如漏配`yarn.resourcemanager.principal`。 - - 若上述检查没问题,则当前系统 yum 或者其他包管理软件安装的 JDK 版本存在不支持的加密算法,建议自行安装 JDK 并设置`JAVA_HOME`环境变量。 + - 用到的 principal 必须在 klist 中存在,使用 `klist -kt your.keytab` 检查。 + - 检查 Catalog 配置是否正确,比如漏配 `yarn.resourcemanager.principal`。 + - 若上述检查没问题,则当前系统 yum 或者其他包管理软件安装的 JDK 版本存在不支持的加密算法,建议自行安装 JDK 并设置 `JAVA_HOME` 环境变量。 - Kerberos 默认使用 AES-256 来进行加密。如果使用 Oracle JDK,则必须安装 JCE。如果是 OpenJDK,OpenJDK 的某些发行版会自动提供无限强度的 JCE,因此不需要安装 JCE。 - - JCE 与 JDK 版本是对应的,需要根据 JDK 的版本来选择 JCE 版本,下载 JCE 的 zip 包并解压到`$JAVA_HOME/jre/lib/security`目录下: - - JDK6:[JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) - - JDK7:[JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) - - JDK8:[JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) + - JCE 与 JDK 版本是对应的,需要根据 JDK 的版本来选择 JCE 版本,下载 JCE 的 zip 包并解压到 `$JAVA_HOME/jre/lib/security` 目录下: + - JDK6:[JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) + - JDK7:[JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) + - JDK8:[JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) 5. 使用 KMS 访问 HDFS 时报错:`java.security.InvalidKeyException: Illegal key size` - 升级 JDK 版本到 >= Java 8 u162 的版本。或者下载安装 JDK 相应的 JCE Unlimited Strength Jurisdiction Policy Files。 + 升级 JDK 版本到 >= Java 8 u162 的版本,或者下载安装 JDK 相应的 JCE Unlimited Strength Jurisdiction Policy Files。 -6. 在 Catalog 中配置 Kerberos 时,如果报错`SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`,那么需要将`core-site.xml`文件放到`"${DORIS_HOME}/be/conf"`目录下。 +6. 在 Catalog 中配置 Kerberos 时,如果报错 `SIMPLE authentication is not enabled. 
Available:[TOKEN, KERBEROS]`,那么需要将 `core-site.xml` 文件放到 `"${DORIS_HOME}/be/conf"` 目录下。

-    如果访问 HDFS 报错`No common protection layer between client and server`,检查客户端和服务端的`hadoop.rpc.protection`属性,使他们保持一致。
+    如果访问 HDFS 报错 `No common protection layer between client and server`,检查客户端和服务端的 `hadoop.rpc.protection` 属性,使它们保持一致。

-    ```
-    <?xml version="1.0" encoding="UTF-8"?>
-    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
-    <configuration>
-        <property>
-            <name>hadoop.security.authentication</name>
-            <value>kerberos</value>
-        </property>
-    </configuration>
+    ```xml
+    <?xml version="1.0" encoding="UTF-8"?>
+    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+    <configuration>
+        <property>
+            <name>hadoop.security.authentication</name>
+            <value>kerberos</value>
+        </property>
+    </configuration>
+    ```

-7. 在使用 Broker Load 时,配置了 Kerberos,如果报错`Cannot locate default realm.`。
+7. 在使用 Broker Load 时,配置了 Kerberos,如果报错 `Cannot locate default realm.`。

-    将 `-Djava.security.krb5.conf=/your-path` 配置项添加到 Broker Load 启动脚本的 `start_broker.sh` 的 `JAVA_OPTS`里。
+    将 `-Djava.security.krb5.conf=/your-path` 配置项添加到 Broker Load 启动脚本的 `start_broker.sh` 的 `JAVA_OPTS` 里。

-8. 当在 Catalog 里使用 Kerberos 配置时,不能同时使用`hadoop.username`属性。
+8. 当在 Catalog 里使用 Kerberos 配置时,不能同时使用 `hadoop.username` 属性。

9. 使用 JDK 17 访问 Kerberos

-    如果使用 JDK 17 运行 Doris 并访问 Kerberos 服务,可能会出现因使用已废弃的加密算法而导致无法访问的现象。需要在 krb5.conf 中添加 `allow_weak_crypto=true` 属性。或升级 Kerberos 的加密算法。
+    如果使用 JDK 17 运行 Doris 并访问 Kerberos 服务,可能会出现因使用已废弃的加密算法而导致无法访问的现象。需要在 `krb5.conf` 中添加 `allow_weak_crypto=true` 属性,或升级 Kerberos 的加密算法。

    详情参阅:

@@ -94,52 +94,51 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-

1. 通过 JDBC Catalog 连接 SQLServer 报错:`unable to find valid certification path to requested target`

-    请在 `jdbc_url` 中添加 `trustServerCertificate=true` 选项。
+    请在 `jdbc_url` 中添加 `trustServerCertificate=true` 选项。

2. 通过 JDBC Catalog 连接 MySQL 数据库,中文字符乱码,或中文字符条件查询不正确

-    请在 `jdbc_url` 中添加 `useUnicode=true&characterEncoding=utf-8`
+    请在 `jdbc_url` 中添加 `useUnicode=true&characterEncoding=utf-8`。

-    > 注:1.2.3 版本后,使用 JDBC Catalog 连接 MySQL 数据库,会自动添加这些参数。
+    > 注:1.2.3 版本后,使用 JDBC Catalog 连接 MySQL 数据库,会自动添加这些参数。

3.
通过 JDBC Catalog 连接 MySQL 数据库报错:`Establishing SSL connection without server's identity verification is not recommended` - 请在 `jdbc_url` 中添加 `useSSL=true` + 请在 `jdbc_url` 中添加 `useSSL=true`。 -4. 使用 JDBC Catalog 将 MySQL 数据同步到 Doris 中,日期数据同步错误。需要校验下 MySQL 的版本是否与 MySQL 的驱动包是否对应,比如 MySQL8 以上需要使用驱动 com.mysql.cj.jdbc.Driver。 - +4. 使用 JDBC Catalog 将 MySQL 数据同步到 Doris 中,日期数据同步错误。需要校验 MySQL 的版本与 MySQL 的驱动包是否对应,比如 MySQL 8 以上需要使用驱动 `com.mysql.cj.jdbc.Driver`。 5. 单个字段过大,查询时 BE 侧 Java 内存 OOM - Jdbc Scanner 在通过 jdbc 读取时,由 session variable `batch_size` 决定每批次数据在 JVM 中处理的数量,如果单个字段过大,导致 `字段大小 * batch_size`(近似值,由于 JVM 中 static 以及数据 copy 占用) 超过 JVM 内存限制,就会出现 OOM。 + JDBC Scanner 在通过 JDBC 读取时,由 Session Variable `batch_size` 决定每批次数据在 JVM 中处理的数量,如果单个字段过大,导致 `字段大小 * batch_size`(近似值,由于 JVM 中 static 以及数据 copy 占用)超过 JVM 内存限制,就会出现 OOM。 - 解决方法: + 解决方法: - - 减小 `batch_size` 的值,可以通过 `set batch_size = 512;` 来调整,默认值为 4064。 - - 增大 BE 的 JVM 内存,通过修改 `JAVA_OPTS` 参数中的 `-Xmx` 来调整 JVM 最大堆内存大小。例如:`"-Xmx8g`。 + - 减小 `batch_size` 的值,可以通过 `set batch_size = 512;` 来调整,默认值为 4064。 + - 增大 BE 的 JVM 内存,通过修改 `JAVA_OPTS` 参数中的 `-Xmx` 来调整 JVM 最大堆内存大小。例如:`-Xmx8g`。 ## Hive Catalog 1. 通过 Hive Catalog 访问 Iceberg 或 Hive 表报错:`failed to get schema` 或 `Storage schema reading not supported` - 可以尝试以下方法: + 可以尝试以下方法: - * 在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。 + - 在 Hive 的 `lib/` 目录放上 `iceberg` 运行时有关的 jar 包。 - * 在 `hive-site.xml` 配置: + - 在 `hive-site.xml` 配置: - ``` - metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - ``` + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` - 配置完成后需要重启 Hive Metastore。 + 配置完成后需要重启 Hive Metastore。 - * 在 Catalog 属性中添加 `"get_schema_from_table" = "true"` + - 在 Catalog 属性中添加 `"get_schema_from_table" = "true"`。 - 该参数自 2.1.10 和 3.0.6 版本支持。 + 该参数自 2.1.10 和 3.0.6 版本支持。 2. 
连接 Hive Catalog 报错:`Caused by: java.lang.NullPointerException` - 如 fe.log 中有如下堆栈: + 如 fe.log 中有如下堆栈: ``` Caused by: java.lang.NullPointerException @@ -150,17 +149,17 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181] ``` - 可以尝试在 `create catalog` 语句中添加 `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` 解决。 + 可以尝试在 `CREATE CATALOG` 语句中添加 `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` 解决。 -3. 如果创建 Hive Catalog 后能正常`show tables`,但查询时报`java.net.UnknownHostException: xxxxx` +3. 如果创建 Hive Catalog 后能正常 `show tables`,但查询时报 `java.net.UnknownHostException: xxxxx` - 可以在 CATALOG 的 PROPERTIES 中添加 + 可以在 Catalog 的 PROPERTIES 中添加: ``` 'fs.defaultFS' = 'hdfs://' ``` -4. Hive 1.x 的 orc 格式的表可能会遇到底层 orc 文件 schema 中列名为 `_col0`,`_col1`,`_col2`... 这类系统列名,此时需要在 catalog 配置中添加 `hive.version` 为 1.x.x,这样就会使用 hive 表中的列名进行映射。 +4. Hive 1.x 的 ORC 格式的表可能会遇到底层 ORC 文件 Schema 中列名为 `_col0`、`_col1`、`_col2`... 这类系统列名,此时需要在 Catalog 配置中添加 `hive.version` 为 1.x.x,这样就会使用 Hive 表中的列名进行映射。 ```sql CREATE CATALOG hive PROPERTIES ( @@ -168,7 +167,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ); ``` -5. 使用 Catalog 查询表数据时发现与 Hive Metastore 相关的报错:`Invalid method name`,需要设置`hive.version`参数。 +5. 使用 Catalog 查询表数据时发现与 Hive Metastore 相关的报错:`Invalid method name`,需要设置 `hive.version` 参数。 ```sql CREATE CATALOG hive PROPERTIES ( @@ -178,15 +177,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 6. 
查询 ORC 格式的表,FE 报错 `Could not obtain block` 或 `Caused by: java.lang.NoSuchFieldError: types` - 对于 ORC 文件,在默认情况下,FE 会访问 HDFS 获取文件信息,进行文件切分。部分情况下,FE 可能无法访问到 HDFS。可以通过添加以下参数解决: + 对于 ORC 文件,在默认情况下,FE 会访问 HDFS 获取文件信息,进行文件切分。部分情况下,FE 可能无法访问到 HDFS。可以通过添加以下参数解决: - `"hive.exec.orc.split.strategy" = "BI"` + `"hive.exec.orc.split.strategy" = "BI"` - 其他选项:HYBRID(默认),ETL。 + 其他选项:HYBRID(默认)、ETL。 -7. 在 hive 上可以查到 hudi 表分区字段的值,但是在 doris 查不到。 +7. 在 Hive 上可以查到 Hudi 表分区字段的值,但是在 Doris 查不到。 - doris 和 hive 目前查询 hudi 的方式不一样,doris 需要在 hudi 表结构的 avsc 文件里添加上分区字段,如果没加,就会导致 doris 查询 partition_val 为空(即使设置了 hoodie.datasource.hive_sync.partition_fields=partition_val 也不可以) + Doris 和 Hive 目前查询 Hudi 的方式不一样,Doris 需要在 Hudi 表结构的 avsc 文件里添加上分区字段,如果没加,就会导致 Doris 查询 `partition_val` 为空(即使设置了 `hoodie.datasource.hive_sync.partition_fields=partition_val` 也不可以)。 ``` { @@ -215,26 +214,26 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- } ``` -8. 查询 hive 外表,遇到该报错:`java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found` +8. 查询 Hive 外表,遇到该报错:`java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found` - 去 hadoop 环境搜索`hadoop-lzo-*.jar`放在`"${DORIS_HOME}/fe/lib/"`目录下并重启 fe。 + 去 Hadoop 环境搜索 `hadoop-lzo-*.jar` 放在 `"${DORIS_HOME}/fe/lib/"` 目录下并重启 FE。 - 从 2.0.2 版本起,可以将这个文件放置在 FE 的 `custom_lib/` 目录下(如不存在,手动创建即可),以防止升级集群时因为 lib 目录被替换而导致文件丢失。 + 从 2.0.2 版本起,可以将这个文件放置在 FE 的 `custom_lib/` 目录下(如不存在,手动创建即可),以防止升级集群时因为 `lib` 目录被替换而导致文件丢失。 -9. 创建 hive 表指定 serde 为 `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`,访问表时报错:`storage schema reading not supported` +9. 
创建 Hive 表指定 serde 为 `org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe`,访问表时报错:`storage schema reading not supported`

-    在 hive-site.xml 文件中增加以下配置,并重启 hms 服务:
+    在 `hive-site.xml` 文件中增加以下配置,并重启 HMS 服务:

-    ```
-    <property>
-        <name>metastore.storage.schema.reader.impl</name>
-        <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
-    </property>
-    ```
+    ```xml
+    <property>
+        <name>metastore.storage.schema.reader.impl</name>
+        <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
+    </property>
+    ```

-10. 报错:java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
+10. 报错:`java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty`

-    FE日志中完整报错信息如下:
+    FE 日志中完整报错信息如下:

    ```
    org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path failed. path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
@@ -246,7 +245,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
    Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    ```

-    尝试更新FE节点CA证书,使用 `update-ca-trust(CentOS/RockyLinux)`,然后重启FE进程即可。
+    尝试更新 FE 节点 CA 证书,使用 `update-ca-trust`(CentOS/RockyLinux),然后重启 FE 进程即可。

11. BE 报错:`java.lang.InternalError`

@@ -267,17 +266,17 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
    是因为 Doris 自带的 libz.a 和系统环境中的 libz.so 冲突了。

-    为了解决这个问题,需要先执行 `export LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH` 然后重启 BE 进程。
+    为了解决这个问题,需要先执行 `export LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH`,然后重启 BE 进程。

-12.
在插入 hive 数据的时候报错:`HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`。 +12. 在插入 Hive 数据的时候报错:`HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`。 - 因为插入数据之后,需要更新对应的统计信息,这个更新的操作需要 alter 权限,所以要在 ranger 上给该用户新增 alter 权限。 + 因为插入数据之后,需要更新对应的统计信息,这个更新的操作需要 ALTER 权限,所以要在 Ranger 上给该用户新增 ALTER 权限。 13. 在查询 ORC 文件时,如果出现报错类似 `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00` 首先检查当前 `session` 下 `time_zone` 的时区设置是多少,推荐使用类似 `Asia/Shanghai` 的写法。 - 如果 `session` 时区已经是 `Asia/Shanghai`,且查询仍然报错,说明生成 ORC 文件时的时区是 `+08:00`, 导致在读取时解析 `footer` 时需要用到 `+08:00` 时区,可以尝试在 `/usr/share/zoneinfo/` 目录下面软链到相同时区上。 + 如果 `session` 时区已经是 `Asia/Shanghai`,且查询仍然报错,说明生成 ORC 文件时的时区是 `+08:00`,导致在读取时解析 `footer` 时需要用到 `+08:00` 时区,可以尝试在 `/usr/share/zoneinfo/` 目录下面软链到相同时区上。 14. 查询使用 JSON SerDe(如 `org.openx.data.jsonserde.JsonSerDe`)的 Hive 表时,报错:`failed to get schema` 或 `Storage schema reading not supported` @@ -329,19 +328,23 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 2. **配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。 3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。 -15. **查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。** +16. 
**查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。** **问题描述:** + 这种情况通常发生在 Catalog 空闲一段时间后的首次查询。表现为请求在发起 HMS RPC 时卡住,由于 Hive Client 内部存在重试机制,复用到失效的长连接时会等待 Socket Timeout(默认 10 秒),重试可能导致累积等待时间长达 20-30 秒。这会导致查询规划阶段极慢,甚至直接触发 Doris FE 的优化器超时报错 `nereids cost too much time`。一旦该连接被剔除并重建,后续查询会立即恢复正常。 **问题分析:** + Doris 为每个 HMS Catalog 维护了一个 Client Pool 以复用连接。在复杂的网络环境中(如跨 VPC、经过防火墙或 NAT 网关),中间网络设备往往会对空闲连接设置 `idle timeout`。当连接空闲时间超过阈值时,网络设备会静默丢弃连接状态,且通常不会发送 FIN/RST 包通知两端。Doris 侧仍认为连接可用,下次复用该“僵尸连接”时,由于链路已不可达,必须等待完整的 Socket Timeout 才能感知到失效并触发重试。 **排查建议:** + - 确认 Doris FE 与 HMS 之间是否经过了防火墙、云厂商 NAT 网关或负载均衡器。 - 使用下文提到的 **Pulse (hms-tools)** 工具。如果探测显示网络连通极快,但长时间放置后首次执行 RPC 出现稳定且倍数于 10s 的延迟,则基本可判定为长连接被中间设备静默回收。 **解决方案:** + 利用 Hive Client 原生的生命周期管理能力,在 Catalog 属性中配置 `hive.metastore.client.socket.lifetime`,使其略短于中间网络设备的空闲超时时间(例如设为 300 秒): ```sql @@ -352,29 +355,30 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- "hive.metastore.client.socket.lifetime" = "300s" ); ``` + 配置后,HMS Client 在执行 RPC 前会检查连接年龄。如果已超过 `lifetime`,它会主动重新建立连接,从而规避因复用失效连接导致的长时间卡顿或优化器超时。 ## HDFS 1. 访问 HDFS 3.x 时报错:`java.lang.VerifyError: xxx` - 1.2.1 之前的版本中,Doris 依赖的 Hadoop 版本为 2.8。需更新至 2.10.2。或更新 Doris 至 1.2.2 之后的版本。 + 1.2.1 之前的版本中,Doris 依赖的 Hadoop 版本为 2.8,需更新至 2.10.2,或更新 Doris 至 1.2.2 之后的版本。 -2. 使用 Hedged Read 优化 HDFS 读取慢的问题。 +2. 
使用 Hedged Read 优化 HDFS 读取慢的问题 在某些情况下,HDFS 的负载较高可能导致读取某个 HDFS 上的数据副本的时间较长,从而拖慢整体的查询效率。HDFS Client 提供了 Hedged Read 功能。 - 该功能可以在一个读请求超过一定阈值未返回时,启动另一个读线程读取同一份数据,哪个先返回就是用哪个结果。 + 该功能可以在一个读请求超过一定阈值未返回时,启动另一个读线程读取同一份数据,哪个先返回就使用哪个结果。 注意:该功能可能会增加 HDFS 集群的负载,请酌情使用。 可以通过以下方式开启这个功能: - ``` - create catalog regression properties ( - 'type'='hms', + ```sql + CREATE CATALOG regression PROPERTIES ( + 'type' = 'hms', 'hive.metastore.uris' = 'thrift://172.21.16.47:7004', 'dfs.client.hedged.read.threadpool.size' = '128', - 'dfs.client.hedged.read.threshold.millis' = "500" + 'dfs.client.hedged.read.threshold.millis' = '500' ); ``` @@ -384,15 +388,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 开启后,可以在 Query Profile 中看到相关参数: - `TotalHedgedRead`: 发起 Hedged Read 的次数。 + `TotalHedgedRead`:发起 Hedged Read 的次数。 - `HedgedReadWins`:Hedged Read 成功的次数(发起并且比原请求更快返回的次数) + `HedgedReadWins`:Hedged Read 成功的次数(发起并且比原请求更快返回的次数)。 注意,这里的值是单个 HDFS Client 的累计值,而不是单个查询的数值。同一个 HDFS Client 会被多个查询复用。 3. 
`Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider` - 在 FE 和 BE 的 start 脚本中,会将环境变量 `HADOOP_CONF_DIR` 加入 CLASSPATH。如果 `HADOOP_CONF_DIR` 设置错误,比如指向了不存在的路径或错误路径,则可能加载到错误的 xxx-site.xml 文件,从而读取到错误的信息。 + 在 FE 和 BE 的启动脚本中,会将环境变量 `HADOOP_CONF_DIR` 加入 CLASSPATH。如果 `HADOOP_CONF_DIR` 设置错误,比如指向了不存在的路径或错误路径,则可能加载到错误的 `xxx-site.xml` 文件,从而读取到错误的信息。 需检查 `HADOOP_CONF_DIR` 是否配置正确,或将这个环境变量删除。 @@ -400,7 +404,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 可能的处理方式有: - 通过 `hdfs fsck file -files -blocks -locations` 来查看具体该文件是否健康。 - - 通过 `telnet` 来检查与 datanode 的连通性。 + - 通过 `telnet` 来检查与 DataNode 的连通性。 在错误日志中可能会打印如下错误: @@ -414,7 +418,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 同时,请检查 `fe/conf` 和 `be/conf` 下放置的 `hdfs-site.xml` 文件中,该参数是否为 true。 - - 查看 datanode 日志。 + - 查看 DataNode 日志。 如果出现以下错误: @@ -422,11 +426,11 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX. Perhaps the client is running an older version of Hadoop which does not support SASL data transfer protection ``` - 则为当前 hdfs 开启了加密传输方式,而客户端未开启导致的错误。 + 则为当前 HDFS 开启了加密传输方式,而客户端未开启导致的错误。 使用下面的任意一种解决方案即可: - - 拷贝 `hdfs-site.xml` 以及 `core-site.xml` 到 `fe/conf` 和 `be/conf` 目录。(推荐) - - 在 `hdfs-site.xml` 找到相应的配置 `dfs.data.transfer.protection`,并且在 catalog 里面设置该参数。 + - 拷贝 `hdfs-site.xml` 以及 `core-site.xml` 到 `fe/conf` 和 `be/conf` 目录。(推荐) + - 在 `hdfs-site.xml` 找到相应的配置 `dfs.data.transfer.protection`,并且在 Catalog 里面设置该参数。 5. 查询 Hive Catalog 表时报错:`RPC response has a length of xxx exceeds maximum data length` @@ -447,29 +451,29 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ## DLF Catalog -1. 使用 DLF Catalog 时,BE 读在取 JindoFS 数据出现`Invalid address`,需要在`/ets/hosts`中添加日志中出现的域名到 IP 的映射。 +1. 
使用 DLF Catalog 时,BE 读取 JindoFS 数据出现 `Invalid address`,需要在 `/etc/hosts` 中添加日志中出现的域名到 IP 的映射。 -2. 读取数据无权限时,使用`hadoop.username`属性指定有权限的用户。 +2. 读取数据无权限时,使用 `hadoop.username` 属性指定有权限的用户。 -3. DLF Catalog 中的元数据和 DLF 保持一致。当使用 DLF 管理元数据时,Hive 新导入的分区,可能未被 DLF 同步,导致出现 DLF 和 Hive 元数据不一致的情况,对此,需要先保证 Hive 元数据被 DLF 完全同步。 +3. DLF Catalog 中的元数据和 DLF 保持一致。当使用 DLF 管理元数据时,Hive 新导入的分区可能未被 DLF 同步,导致出现 DLF 和 Hive 元数据不一致的情况。对此,需要先保证 Hive 元数据被 DLF 完全同步。 ## 其他问题 1. Binary 类型映射到 Doris 后,查询乱码 - Doris 原生不支持 Binary 类型,所以各类数据湖或数据库中的 Binary 类型映射到 Doris 中,通常使用 String 类型进行映射。String 类型只能展示可打印字符。如果需要查询 Binary 的内容,可以使用 `TO_BASE64()` 函数转换为 Base64 编码后,在进行下一步处理。 + Doris 原生不支持 Binary 类型,所以各类数据湖或数据库中的 Binary 类型映射到 Doris 中,通常使用 String 类型进行映射。String 类型只能展示可打印字符。如果需要查询 Binary 的内容,可以使用 `TO_BASE64()` 函数转换为 Base64 编码后,再进行下一步处理。 2. 分析 Parquet 文件 - 在查询 Parquet 文件时,由于不同系统生成的 Parquet 文件格式可能有所差异,比如 RowGroup 的数量,索引的值等,有时需要检查 Parquet 文件的元数据进行问题定位或性能分析。这里提供一个工具帮助用户更方便的分析 Parquet 文件: + 在查询 Parquet 文件时,由于不同系统生成的 Parquet 文件格式可能有所差异,比如 RowGroup 的数量、索引的值等,有时需要检查 Parquet 文件的元数据进行问题定位或性能分析。这里提供一个工具帮助用户更方便地分析 Parquet 文件: - 1. 下载并解压 [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz) - 2. 将需要分析的 Parquet 文件下载到本地,假设路径为 `/path/to/file.parquet` + 1. 下载并解压 [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz)。 + 2. 将需要分析的 Parquet 文件下载到本地,假设路径为 `/path/to/file.parquet`。 3. 使用如下命令分析 Parquet 文件元信息: `./parquet-tools meta /path/to/file.parquet` - 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli) + 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)。 ## 诊断工具 @@ -489,7 +493,7 @@ Pulse 主要包含以下工具集: - 支持测试 KDC 可达性、检查 Keytab 文件以及执行登录测试,确保认证层不会阻断连接。 3. 
**对象存储诊断工具 (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**: - - 针对主流云存储(AWS S3, Google GCS, Azure Blob)的诊断工具。 + - 针对主流云存储(AWS S3、Google GCS、Azure Blob)的诊断工具。 - 用于排查“权限拒绝(Access Denied)”或“存储桶不存在(Bucket Not Found)”等常见的外部表数据访问问题。 - 支持验证凭据来源、STS 身份以及执行 Bucket 级别的操作测试。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md index d57bf28a23342..c4a94de62e060 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md @@ -2,91 +2,91 @@ { "title": "常见数据湖问题", "language": "zh-CN", - "description": "通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查:" + "description": "Apache Doris 数据湖(Lakehouse)常见问题排查指南,涵盖证书、Kerberos 认证、JDBC Catalog、Hive Catalog、HDFS、DLF Catalog 等场景的报错解决方案与诊断工具使用说明。" } --- ## 证书问题 1. 查询时报错 `curl 77: Problem with the SSL CA cert.`。说明当前系统证书过旧,需要更新本地证书。 - - 可以从 `https://curl.se/docs/caextract.html` 下载最新的 CA 证书。 - - 将下载后的 cacert-xxx.pem 放到`/etc/ssl/certs/`目录,例如:`sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`。 + - 可以从 `https://curl.se/docs/caextract.html` 下载最新的 CA 证书。 + - 将下载后的 cacert-xxx.pem 放到 `/etc/ssl/certs/` 目录,例如:`sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`。 -2. 查询时报错:`ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`. +2. 
查询时报错:`ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`。

-```
-yum install -y ca-certificates
-ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
-```
+   ```
+   yum install -y ca-certificates
+   ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
+   ```

## Kerberos

1. 连接 Kerberos 认证的 Hive Metastore 报错:`GSS initiate failed`

-    通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查:
+   通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查:

   1. 1.2.1 之前的版本中,Doris 依赖的 libhdfs3 库没有开启 gsasl。请更新至 1.2.2 之后的版本。

   2. 确认对各个组件,设置了正确的 keytab 和 principal,并确认 keytab 文件存在于所有 FE、BE 节点上。

-       1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`:用于 Hadoop hdfs 访问,填写 hdfs 对应的值。
-       2. `hive.metastore.kerberos.principal`:用于 hive metastore。
+      1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`:用于 Hadoop HDFS 访问,填写 HDFS 对应的值。
+      2. `hive.metastore.kerberos.principal`:用于 Hive Metastore。

-    3. 尝试将 principal 中的 ip 换成域名(不要使用默认的 `_HOST` 占位符)
+   3. 尝试将 principal 中的 IP 换成域名(不要使用默认的 `_HOST` 占位符)。

   4. 确认 `/etc/krb5.conf` 文件存在于所有 FE、BE 节点上。

2. 通过 Hive Catalog 连接 Hive 数据库报错:`RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`.

-    如果在 `show databases` 和 `show tables` 都是没问题的情况下,查询的时候出现上面的错误,我们需要进行下面两个操作:
-    - fe/conf、be/conf 目录下需放置 core-site.xml 和 hdfs-site.xml
-    - BE 节点执行 Kerberos 的 kinit 然后重启 BE,然后再去执行查询即可。
+   如果在 `show databases` 和 `show tables` 都是没问题的情况下,查询的时候出现上面的错误,需要进行下面两个操作:
+   - `fe/conf`、`be/conf` 目录下需放置 `core-site.xml` 和 `hdfs-site.xml`。
+   - BE 节点执行 Kerberos 的 `kinit` 并重启 BE,再执行查询即可。

3. 
查询配置了 Kerberos 的外表,遇到该报错:`GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)`,一般重启 FE 和 BE 能够解决该问题。 - - 重启所有节点前可在`"${DORIS_HOME}/be/conf/be.conf"`中的 JAVA_OPTS 参数里配置`-Djavax.security.auth.useSubjectCredsOnly=false`,通过底层机制去获取 JAAS credentials 信息,而不是应用程序。 - - 在[JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html)中可获取更多常见 JAAS 报错的解决方法。 + - 重启所有节点前可在 `"${DORIS_HOME}/be/conf/be.conf"` 中的 JAVA_OPTS 参数里配置 `-Djavax.security.auth.useSubjectCredsOnly=false`,通过底层机制去获取 JAAS credentials 信息,而不是应用程序。 + - 在 [JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html) 中可获取更多常见 JAAS 报错的解决方法。 -4. 在 Catalog 中配置 Kerberos 时,报错`Unable to obtain password from user`的解决方法: +4. 在 Catalog 中配置 Kerberos 时,报错 `Unable to obtain password from user` 的解决方法: - - 用到的 principal 必须在 klist 中存在,使用`klist -kt your.keytab`检查。 - - 检查 catalog 配置是否正确,比如漏配`yarn.resourcemanager.principal`。 - - 若上述检查没问题,则当前系统 yum 或者其他包管理软件安装的 JDK 版本存在不支持的加密算法,建议自行安装 JDK 并设置`JAVA_HOME`环境变量。 + - 用到的 principal 必须在 klist 中存在,使用 `klist -kt your.keytab` 检查。 + - 检查 Catalog 配置是否正确,比如漏配 `yarn.resourcemanager.principal`。 + - 若上述检查没问题,则当前系统 yum 或者其他包管理软件安装的 JDK 版本存在不支持的加密算法,建议自行安装 JDK 并设置 `JAVA_HOME` 环境变量。 - Kerberos 默认使用 AES-256 来进行加密。如果使用 Oracle JDK,则必须安装 JCE。如果是 OpenJDK,OpenJDK 的某些发行版会自动提供无限强度的 JCE,因此不需要安装 JCE。 - - JCE 与 JDK 版本是对应的,需要根据 JDK 的版本来选择 JCE 版本,下载 JCE 的 zip 包并解压到`$JAVA_HOME/jre/lib/security`目录下: - - JDK6:[JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) - - JDK7:[JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) - - JDK8:[JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) + - JCE 与 JDK 版本是对应的,需要根据 JDK 的版本来选择 JCE 版本,下载 JCE 的 zip 包并解压到 `$JAVA_HOME/jre/lib/security` 目录下: + - 
JDK6:[JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) + - JDK7:[JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) + - JDK8:[JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) 5. 使用 KMS 访问 HDFS 时报错:`java.security.InvalidKeyException: Illegal key size` - 升级 JDK 版本到 >= Java 8 u162 的版本。或者下载安装 JDK 相应的 JCE Unlimited Strength Jurisdiction Policy Files。 + 升级 JDK 版本到 >= Java 8 u162 的版本,或者下载安装 JDK 相应的 JCE Unlimited Strength Jurisdiction Policy Files。 -6. 在 Catalog 中配置 Kerberos 时,如果报错`SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`,那么需要将`core-site.xml`文件放到`"${DORIS_HOME}/be/conf"`目录下。 +6. 在 Catalog 中配置 Kerberos 时,如果报错 `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`,那么需要将 `core-site.xml` 文件放到 `"${DORIS_HOME}/be/conf"` 目录下。 - 如果访问 HDFS 报错`No common protection layer between client and server`,检查客户端和服务端的`hadoop.rpc.protection`属性,使他们保持一致。 + 如果访问 HDFS 报错 `No common protection layer between client and server`,检查客户端和服务端的 `hadoop.rpc.protection` 属性,使它们保持一致。 - ``` - - - - - - - hadoop.security.authentication - kerberos - - - + ```xml + + + + + + + hadoop.security.authentication + kerberos + + + ``` -7. 在使用 Broker Load 时,配置了 Kerberos,如果报错`Cannot locate default realm.`。 +7. 在使用 Broker Load 时,配置了 Kerberos,如果报错 `Cannot locate default realm.`。 - 将 `-Djava.security.krb5.conf=/your-path` 配置项添加到 Broker Load 启动脚本的 `start_broker.sh` 的 `JAVA_OPTS`里。 + 将 `-Djava.security.krb5.conf=/your-path` 配置项添加到 Broker Load 启动脚本的 `start_broker.sh` 的 `JAVA_OPTS` 里。 -8. 当在 Catalog 里使用 Kerberos 配置时,不能同时使用`hadoop.username`属性。 +8. 当在 Catalog 里使用 Kerberos 配置时,不能同时使用 `hadoop.username` 属性。 9. 
使用 JDK 17 访问 Kerberos - 如果使用 JDK 17 运行 Doris 并访问 Kerberos 服务,可能会出现因使用已废弃的加密算法而导致无法访问的现象。需要在 krb5.conf 中添加 `allow_weak_crypto=true` 属性。或升级 Kerberos 的加密算法。 + 如果使用 JDK 17 运行 Doris 并访问 Kerberos 服务,可能会出现因使用已废弃的加密算法而导致无法访问的现象。需要在 `krb5.conf` 中添加 `allow_weak_crypto=true` 属性,或升级 Kerberos 的加密算法。 详情参阅: @@ -94,52 +94,51 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 1. 通过 JDBC Catalog 连接 SQLServer 报错:`unable to find valid certification path to requested target` - 请在 `jdbc_url` 中添加 `trustServerCertificate=true` 选项。 + 请在 `jdbc_url` 中添加 `trustServerCertificate=true` 选项。 2. 通过 JDBC Catalog 连接 MySQL 数据库,中文字符乱码,或中文字符条件查询不正确 - 请在 `jdbc_url` 中添加 `useUnicode=true&characterEncoding=utf-8` + 请在 `jdbc_url` 中添加 `useUnicode=true&characterEncoding=utf-8`。 - > 注:1.2.3 版本后,使用 JDBC Catalog 连接 MySQL 数据库,会自动添加这些参数。 + > 注:1.2.3 版本后,使用 JDBC Catalog 连接 MySQL 数据库,会自动添加这些参数。 3. 通过 JDBC Catalog 连接 MySQL 数据库报错:`Establishing SSL connection without server's identity verification is not recommended` - 请在 `jdbc_url` 中添加 `useSSL=true` + 请在 `jdbc_url` 中添加 `useSSL=true`。 -4. 使用 JDBC Catalog 将 MySQL 数据同步到 Doris 中,日期数据同步错误。需要校验下 MySQL 的版本是否与 MySQL 的驱动包是否对应,比如 MySQL8 以上需要使用驱动 com.mysql.cj.jdbc.Driver。 - +4. 使用 JDBC Catalog 将 MySQL 数据同步到 Doris 中,日期数据同步错误。需要校验 MySQL 的版本与 MySQL 的驱动包是否对应,比如 MySQL 8 以上需要使用驱动 `com.mysql.cj.jdbc.Driver`。 5. 
单个字段过大,查询时 BE 侧 Java 内存 OOM - Jdbc Scanner 在通过 jdbc 读取时,由 session variable `batch_size` 决定每批次数据在 JVM 中处理的数量,如果单个字段过大,导致 `字段大小 * batch_size`(近似值,由于 JVM 中 static 以及数据 copy 占用) 超过 JVM 内存限制,就会出现 OOM。 + JDBC Scanner 在通过 JDBC 读取时,由 Session Variable `batch_size` 决定每批次数据在 JVM 中处理的数量,如果单个字段过大,导致 `字段大小 * batch_size`(近似值,由于 JVM 中 static 以及数据 copy 占用)超过 JVM 内存限制,就会出现 OOM。 - 解决方法: + 解决方法: - - 减小 `batch_size` 的值,可以通过 `set batch_size = 512;` 来调整,默认值为 4064。 - - 增大 BE 的 JVM 内存,通过修改 `JAVA_OPTS` 参数中的 `-Xmx` 来调整 JVM 最大堆内存大小。例如:`"-Xmx8g`。 + - 减小 `batch_size` 的值,可以通过 `set batch_size = 512;` 来调整,默认值为 4064。 + - 增大 BE 的 JVM 内存,通过修改 `JAVA_OPTS` 参数中的 `-Xmx` 来调整 JVM 最大堆内存大小。例如:`-Xmx8g`。 ## Hive Catalog 1. 通过 Hive Catalog 访问 Iceberg 或 Hive 表报错:`failed to get schema` 或 `Storage schema reading not supported` - 可以尝试以下方法: + 可以尝试以下方法: - * 在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。 + - 在 Hive 的 `lib/` 目录放上 `iceberg` 运行时有关的 jar 包。 - * 在 `hive-site.xml` 配置: + - 在 `hive-site.xml` 配置: - ``` - metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - ``` + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` - 配置完成后需要重启 Hive Metastore。 + 配置完成后需要重启 Hive Metastore。 - * 在 Catalog 属性中添加 `"get_schema_from_table" = "true"` + - 在 Catalog 属性中添加 `"get_schema_from_table" = "true"`。 - 该参数自 2.1.10 和 3.0.6 版本支持。 + 该参数自 2.1.10 和 3.0.6 版本支持。 2. 连接 Hive Catalog 报错:`Caused by: java.lang.NullPointerException` - 如 fe.log 中有如下堆栈: + 如 fe.log 中有如下堆栈: ``` Caused by: java.lang.NullPointerException @@ -150,17 +149,17 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181] ``` - 可以尝试在 `create catalog` 语句中添加 `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` 解决。 + 可以尝试在 `CREATE CATALOG` 语句中添加 `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` 解决。 -3. 
如果创建 Hive Catalog 后能正常`show tables`,但查询时报`java.net.UnknownHostException: xxxxx` +3. 如果创建 Hive Catalog 后能正常 `show tables`,但查询时报 `java.net.UnknownHostException: xxxxx` - 可以在 CATALOG 的 PROPERTIES 中添加 + 可以在 Catalog 的 PROPERTIES 中添加: ``` 'fs.defaultFS' = 'hdfs://' ``` -4. Hive 1.x 的 orc 格式的表可能会遇到底层 orc 文件 schema 中列名为 `_col0`,`_col1`,`_col2`... 这类系统列名,此时需要在 catalog 配置中添加 `hive.version` 为 1.x.x,这样就会使用 hive 表中的列名进行映射。 +4. Hive 1.x 的 ORC 格式的表可能会遇到底层 ORC 文件 Schema 中列名为 `_col0`、`_col1`、`_col2`... 这类系统列名,此时需要在 Catalog 配置中添加 `hive.version` 为 1.x.x,这样就会使用 Hive 表中的列名进行映射。 ```sql CREATE CATALOG hive PROPERTIES ( @@ -168,7 +167,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ); ``` -5. 使用 Catalog 查询表数据时发现与 Hive Metastore 相关的报错:`Invalid method name`,需要设置`hive.version`参数。 +5. 使用 Catalog 查询表数据时发现与 Hive Metastore 相关的报错:`Invalid method name`,需要设置 `hive.version` 参数。 ```sql CREATE CATALOG hive PROPERTIES ( @@ -178,15 +177,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 6. 查询 ORC 格式的表,FE 报错 `Could not obtain block` 或 `Caused by: java.lang.NoSuchFieldError: types` - 对于 ORC 文件,在默认情况下,FE 会访问 HDFS 获取文件信息,进行文件切分。部分情况下,FE 可能无法访问到 HDFS。可以通过添加以下参数解决: + 对于 ORC 文件,在默认情况下,FE 会访问 HDFS 获取文件信息,进行文件切分。部分情况下,FE 可能无法访问到 HDFS。可以通过添加以下参数解决: - `"hive.exec.orc.split.strategy" = "BI"` + `"hive.exec.orc.split.strategy" = "BI"` - 其他选项:HYBRID(默认),ETL。 + 其他选项:HYBRID(默认)、ETL。 -7. 在 hive 上可以查到 hudi 表分区字段的值,但是在 doris 查不到。 +7. 在 Hive 上可以查到 Hudi 表分区字段的值,但是在 Doris 查不到。 - doris 和 hive 目前查询 hudi 的方式不一样,doris 需要在 hudi 表结构的 avsc 文件里添加上分区字段,如果没加,就会导致 doris 查询 partition_val 为空(即使设置了 hoodie.datasource.hive_sync.partition_fields=partition_val 也不可以) + Doris 和 Hive 目前查询 Hudi 的方式不一样,Doris 需要在 Hudi 表结构的 avsc 文件里添加上分区字段,如果没加,就会导致 Doris 查询 `partition_val` 为空(即使设置了 `hoodie.datasource.hive_sync.partition_fields=partition_val` 也不可以)。 ``` { @@ -215,26 +214,26 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- } ``` -8. 
查询 hive 外表,遇到该报错:`java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found` +8. 查询 Hive 外表,遇到该报错:`java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found` - 去 hadoop 环境搜索`hadoop-lzo-*.jar`放在`"${DORIS_HOME}/fe/lib/"`目录下并重启 fe。 + 去 Hadoop 环境搜索 `hadoop-lzo-*.jar` 放在 `"${DORIS_HOME}/fe/lib/"` 目录下并重启 FE。 - 从 2.0.2 版本起,可以将这个文件放置在 FE 的 `custom_lib/` 目录下(如不存在,手动创建即可),以防止升级集群时因为 lib 目录被替换而导致文件丢失。 + 从 2.0.2 版本起,可以将这个文件放置在 FE 的 `custom_lib/` 目录下(如不存在,手动创建即可),以防止升级集群时因为 `lib` 目录被替换而导致文件丢失。 -9. 创建 hive 表指定 serde 为 `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`,访问表时报错:`storage schema reading not supported` +9. 创建 Hive 表指定 serde 为 `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`,访问表时报错:`storage schema reading not supported` - 在 hive-site.xml 文件中增加以下配置,并重启 hms 服务: + 在 `hive-site.xml` 文件中增加以下配置,并重启 HMS 服务: - ``` - - metastore.storage.schema.reader.impl - org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - - ``` + ```xml + + metastore.storage.schema.reader.impl + org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + + ``` -10. 报错:java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty +10. 报错:`java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty` - FE日志中完整报错信息如下: + FE 日志中完整报错信息如下: ``` org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path failed. 
path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty @@ -246,7 +245,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty ``` - 尝试更新FE节点CA证书,使用 `update-ca-trust(CentOS/RockyLinux)`,然后重启FE进程即可。 + 尝试更新 FE 节点 CA 证书,使用 `update-ca-trust`(CentOS/RockyLinux),然后重启 FE 进程即可。 11. BE 报错:`java.lang.InternalError` @@ -267,19 +266,18 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 是因为 Doris 自带的 libz.a 和系统环境中的 libz.so 冲突了。 - 为了解决这个问题,需要先执行 `export LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH` 然后重启 BE 进程。 + 为了解决这个问题,需要先执行 `export LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH`,然后重启 BE 进程。 -12. 在插入 hive 数据的时候报错:`HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`。 +12. 在插入 Hive 数据的时候报错:`HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`。 - 因为插入数据之后,需要更新对应的统计信息,这个更新的操作需要 alter 权限,所以要在 ranger 上给该用户新增 alter 权限。 + 因为插入数据之后,需要更新对应的统计信息,这个更新的操作需要 ALTER 权限,所以要在 Ranger 上给该用户新增 ALTER 权限。 13. 在查询 ORC 文件时,如果出现报错类似 `Orc row reader nextBatch failed. 
reason = Can't open /usr/share/zoneinfo/+08:00` 首先检查当前 `session` 下 `time_zone` 的时区设置是多少,推荐使用类似 `Asia/Shanghai` 的写法。 - 如果 `session` 时区已经是 `Asia/Shanghai`,且查询仍然报错,说明生成 ORC 文件时的时区是 `+08:00`, 导致在读取时解析 `footer` 时需要用到 `+08:00` 时区,可以尝试在 `/usr/share/zoneinfo/` 目录下面软链到相同时区上。 + 如果 `session` 时区已经是 `Asia/Shanghai`,且查询仍然报错,说明生成 ORC 文件时的时区是 `+08:00`,导致在读取时解析 `footer` 时需要用到 `+08:00` 时区,可以尝试在 `/usr/share/zoneinfo/` 目录下面软链到相同时区上。 -<<<<<<< HEAD 14. 查询使用 JSON SerDe(如 `org.openx.data.jsonserde.JsonSerDe`)的 Hive 表时,报错:`failed to get schema` 或 `Storage schema reading not supported` 当 Hive 表使用 JSON 格式存储(ROW FORMAT SERDE 为 `org.openx.data.jsonserde.JsonSerDe`)时,Hive Metastore 可能无法通过默认方式读取表的 Schema 信息,导致 Doris 查询时报错: @@ -301,8 +299,8 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ``` 该参数自 2.1.10 和 3.0.6 版本支持。 -======= -14. **查询 Hive Catalog 表时,优化阶段极慢并伴随 `nereids cost too much time` 报错,且每次访问 HMS 的耗时稳定在 10 秒左右。** + +15. 查询 Hive Catalog 表时,优化阶段极慢并伴随 `nereids cost too much time` 报错,且每次访问 HMS 的耗时稳定在 10 秒左右。 **问题分析:** 这类问题通常并非 HMS 服务本身的 RPC 执行慢引起,而是由于 **Doris FE 所在机器的 DNS 配置异常** 导致。 @@ -329,21 +327,24 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 1. **修复 DNS 配置(推荐)**:修正 Doris FE 节点上 `/etc/resolv.conf` 中的 `nameserver` 配置,确保域名解析服务正常且快速响应。如果局域网内无需 DNS 且无公网访问需求,可注释掉无效的 nameserver。 2. **配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。 3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。 ->>>>>>> 0cf7ea7ca780 (docs(faq): add HMS DNS resolution diagnostic and Pulse toolset) -15. **查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。** +16. 
查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。

    **问题描述:**
+
    这种情况通常发生在 Catalog 空闲一段时间后的首次查询。表现为请求在发起 HMS RPC 时卡住,由于 Hive Client 内部存在重试机制,复用到失效的长连接时会等待 Socket Timeout(默认 10 秒),重试可能导致累积等待时间长达 20-30 秒。这会导致查询规划阶段极慢,甚至直接触发 Doris FE 的优化器超时报错 `nereids cost too much time`。一旦该连接被剔除并重建,后续查询会立即恢复正常。

    **问题分析:**
+
    Doris 为每个 HMS Catalog 维护了一个 Client Pool 以复用连接。在复杂的网络环境中(如跨 VPC、经过防火墙或 NAT 网关),中间网络设备往往会对空闲连接设置 `idle timeout`。当连接空闲时间超过阈值时,网络设备会静默丢弃连接状态,且通常不会发送 FIN/RST 包通知两端。Doris 侧仍认为连接可用,下次复用该“僵尸连接”时,由于链路已不可达,必须等待完整的 Socket Timeout 才能感知到失效并触发重试。

    **排查建议:**
+
    - 确认 Doris FE 与 HMS 之间是否经过了防火墙、云厂商 NAT 网关或负载均衡器。
    - 使用下文提到的 **Pulse (hms-tools)** 工具。如果探测显示网络连通极快,但长时间放置后首次执行 RPC 出现稳定且为 10s 整数倍的延迟,则基本可判定为长连接被中间设备静默回收。

    **解决方案:**
+
    利用 Hive Client 原生的生命周期管理能力,在 Catalog 属性中配置 `hive.metastore.client.socket.lifetime`,使其略短于中间网络设备的空闲超时时间(例如设为 300 秒):

    ```sql
@@ -354,29 +355,30 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
        "hive.metastore.client.socket.lifetime" = "300s"
    );
    ```
+
    配置后,HMS Client 在执行 RPC 前会检查连接年龄。如果已超过 `lifetime`,它会主动重新建立连接,从而规避因复用失效连接导致的长时间卡顿或优化器超时。

## HDFS

1. 访问 HDFS 3.x 时报错:`java.lang.VerifyError: xxx`

-    1.2.1 之前的版本中,Doris 依赖的 Hadoop 版本为 2.8。需更新至 2.10.2。或更新 Doris 至 1.2.2 之后的版本。
+    1.2.1 之前的版本中,Doris 依赖的 Hadoop 版本为 2.8,需更新至 2.10.2,或更新 Doris 至 1.2.2 之后的版本。

-2. 使用 Hedged Read 优化 HDFS 读取慢的问题。
+2. 
使用 Hedged Read 优化 HDFS 读取慢的问题 在某些情况下,HDFS 的负载较高可能导致读取某个 HDFS 上的数据副本的时间较长,从而拖慢整体的查询效率。HDFS Client 提供了 Hedged Read 功能。 - 该功能可以在一个读请求超过一定阈值未返回时,启动另一个读线程读取同一份数据,哪个先返回就是用哪个结果。 + 该功能可以在一个读请求超过一定阈值未返回时,启动另一个读线程读取同一份数据,哪个先返回就使用哪个结果。 注意:该功能可能会增加 HDFS 集群的负载,请酌情使用。 可以通过以下方式开启这个功能: - ``` - create catalog regression properties ( - 'type'='hms', + ```sql + CREATE CATALOG regression PROPERTIES ( + 'type' = 'hms', 'hive.metastore.uris' = 'thrift://172.21.16.47:7004', 'dfs.client.hedged.read.threadpool.size' = '128', - 'dfs.client.hedged.read.threshold.millis' = "500" + 'dfs.client.hedged.read.threshold.millis' = '500' ); ``` @@ -386,15 +388,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 开启后,可以在 Query Profile 中看到相关参数: - `TotalHedgedRead`: 发起 Hedged Read 的次数。 + `TotalHedgedRead`:发起 Hedged Read 的次数。 - `HedgedReadWins`:Hedged Read 成功的次数(发起并且比原请求更快返回的次数) + `HedgedReadWins`:Hedged Read 成功的次数(发起并且比原请求更快返回的次数)。 注意,这里的值是单个 HDFS Client 的累计值,而不是单个查询的数值。同一个 HDFS Client 会被多个查询复用。 3. 
`Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider` - 在 FE 和 BE 的 start 脚本中,会将环境变量 `HADOOP_CONF_DIR` 加入 CLASSPATH。如果 `HADOOP_CONF_DIR` 设置错误,比如指向了不存在的路径或错误路径,则可能加载到错误的 xxx-site.xml 文件,从而读取到错误的信息。 + 在 FE 和 BE 的启动脚本中,会将环境变量 `HADOOP_CONF_DIR` 加入 CLASSPATH。如果 `HADOOP_CONF_DIR` 设置错误,比如指向了不存在的路径或错误路径,则可能加载到错误的 `xxx-site.xml` 文件,从而读取到错误的信息。 需检查 `HADOOP_CONF_DIR` 是否配置正确,或将这个环境变量删除。 @@ -402,7 +404,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 可能的处理方式有: - 通过 `hdfs fsck file -files -blocks -locations` 来查看具体该文件是否健康。 - - 通过 `telnet` 来检查与 datanode 的连通性。 + - 通过 `telnet` 来检查与 DataNode 的连通性。 在错误日志中可能会打印如下错误: @@ -416,7 +418,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 同时,请检查 `fe/conf` 和 `be/conf` 下放置的 `hdfs-site.xml` 文件中,该参数是否为 true。 - - 查看 datanode 日志。 + - 查看 DataNode 日志。 如果出现以下错误: @@ -424,11 +426,11 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX. Perhaps the client is running an older version of Hadoop which does not support SASL data transfer protection ``` - 则为当前 hdfs 开启了加密传输方式,而客户端未开启导致的错误。 + 则为当前 HDFS 开启了加密传输方式,而客户端未开启导致的错误。 使用下面的任意一种解决方案即可: - - 拷贝 `hdfs-site.xml` 以及 `core-site.xml` 到 `fe/conf` 和 `be/conf` 目录。(推荐) - - 在 `hdfs-site.xml` 找到相应的配置 `dfs.data.transfer.protection`,并且在 catalog 里面设置该参数。 + - 拷贝 `hdfs-site.xml` 以及 `core-site.xml` 到 `fe/conf` 和 `be/conf` 目录。(推荐) + - 在 `hdfs-site.xml` 找到相应的配置 `dfs.data.transfer.protection`,并且在 Catalog 里面设置该参数。 5. 查询 Hive Catalog 表时报错:`RPC response has a length of xxx exceeds maximum data length` @@ -449,29 +451,29 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ## DLF Catalog -1. 使用 DLF Catalog 时,BE 读在取 JindoFS 数据出现`Invalid address`,需要在`/ets/hosts`中添加日志中出现的域名到 IP 的映射。 +1. 
使用 DLF Catalog 时,BE 读取 JindoFS 数据出现 `Invalid address`,需要在 `/etc/hosts` 中添加日志中出现的域名到 IP 的映射。 -2. 读取数据无权限时,使用`hadoop.username`属性指定有权限的用户。 +2. 读取数据无权限时,使用 `hadoop.username` 属性指定有权限的用户。 -3. DLF Catalog 中的元数据和 DLF 保持一致。当使用 DLF 管理元数据时,Hive 新导入的分区,可能未被 DLF 同步,导致出现 DLF 和 Hive 元数据不一致的情况,对此,需要先保证 Hive 元数据被 DLF 完全同步。 +3. DLF Catalog 中的元数据和 DLF 保持一致。当使用 DLF 管理元数据时,Hive 新导入的分区可能未被 DLF 同步,导致出现 DLF 和 Hive 元数据不一致的情况。对此,需要先保证 Hive 元数据被 DLF 完全同步。 ## 其他问题 1. Binary 类型映射到 Doris 后,查询乱码 - Doris 原生不支持 Binary 类型,所以各类数据湖或数据库中的 Binary 类型映射到 Doris 中,通常使用 String 类型进行映射。String 类型只能展示可打印字符。如果需要查询 Binary 的内容,可以使用 `TO_BASE64()` 函数转换为 Base64 编码后,在进行下一步处理。 + Doris 原生不支持 Binary 类型,所以各类数据湖或数据库中的 Binary 类型映射到 Doris 中,通常使用 String 类型进行映射。String 类型只能展示可打印字符。如果需要查询 Binary 的内容,可以使用 `TO_BASE64()` 函数转换为 Base64 编码后,再进行下一步处理。 2. 分析 Parquet 文件 - 在查询 Parquet 文件时,由于不同系统生成的 Parquet 文件格式可能有所差异,比如 RowGroup 的数量,索引的值等,有时需要检查 Parquet 文件的元数据进行问题定位或性能分析。这里提供一个工具帮助用户更方便的分析 Parquet 文件: + 在查询 Parquet 文件时,由于不同系统生成的 Parquet 文件格式可能有所差异,比如 RowGroup 的数量、索引的值等,有时需要检查 Parquet 文件的元数据进行问题定位或性能分析。这里提供一个工具帮助用户更方便地分析 Parquet 文件: - 1. 下载并解压 [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz) - 2. 将需要分析的 Parquet 文件下载到本地,假设路径为 `/path/to/file.parquet` + 1. 下载并解压 [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz)。 + 2. 将需要分析的 Parquet 文件下载到本地,假设路径为 `/path/to/file.parquet`。 3. 使用如下命令分析 Parquet 文件元信息: `./parquet-tools meta /path/to/file.parquet` - 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli) + 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)。 ## 诊断工具 @@ -491,7 +493,7 @@ Pulse 主要包含以下工具集: - 支持测试 KDC 可达性、检查 Keytab 文件以及执行登录测试,确保认证层不会阻断连接。 3. 
**对象存储诊断工具 (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**: - - 针对主流云存储(AWS S3, Google GCS, Azure Blob)的诊断工具。 + - 针对主流云存储(AWS S3、Google GCS、Azure Blob)的诊断工具。 - 用于排查“权限拒绝(Access Denied)”或“存储桶不存在(Bucket Not Found)”等常见的外部表数据访问问题。 - 支持验证凭据来源、STS 身份以及执行 Bucket 级别的操作测试。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md index a6742edabfbc9..c4a94de62e060 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md @@ -2,91 +2,91 @@ { "title": "常见数据湖问题", "language": "zh-CN", - "description": "通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查:" + "description": "Apache Doris 数据湖(Lakehouse)常见问题排查指南,涵盖证书、Kerberos 认证、JDBC Catalog、Hive Catalog、HDFS、DLF Catalog 等场景的报错解决方案与诊断工具使用说明。" } --- ## 证书问题 1. 查询时报错 `curl 77: Problem with the SSL CA cert.`。说明当前系统证书过旧,需要更新本地证书。 - - 可以从 `https://curl.se/docs/caextract.html` 下载最新的 CA 证书。 - - 将下载后的 cacert-xxx.pem 放到`/etc/ssl/certs/`目录,例如:`sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`。 + - 可以从 `https://curl.se/docs/caextract.html` 下载最新的 CA 证书。 + - 将下载后的 cacert-xxx.pem 放到 `/etc/ssl/certs/` 目录,例如:`sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`。 -2. 查询时报错:`ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`. +2. 
查询时报错:`ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`。

-```
-yum install -y ca-certificates
-ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
-```
+   ```
+   yum install -y ca-certificates
+   ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
+   ```

## Kerberos

1. 连接 Kerberos 认证的 Hive Metastore 报错:`GSS initiate failed`

-    通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查:
+   通常是因为 Kerberos 认证信息填写不正确导致的,可以通过以下步骤排查:

   1. 1.2.1 之前的版本中,Doris 依赖的 libhdfs3 库没有开启 gsasl。请更新至 1.2.2 之后的版本。

   2. 确认对各个组件,设置了正确的 keytab 和 principal,并确认 keytab 文件存在于所有 FE、BE 节点上。

-       1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`:用于 Hadoop hdfs 访问,填写 hdfs 对应的值。
-       2. `hive.metastore.kerberos.principal`:用于 hive metastore。
+      1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`:用于 Hadoop HDFS 访问,填写 HDFS 对应的值。
+      2. `hive.metastore.kerberos.principal`:用于 Hive Metastore。

-    3. 尝试将 principal 中的 ip 换成域名(不要使用默认的 `_HOST` 占位符)
+   3. 尝试将 principal 中的 IP 换成域名(不要使用默认的 `_HOST` 占位符)。

   4. 确认 `/etc/krb5.conf` 文件存在于所有 FE、BE 节点上。

2. 通过 Hive Catalog 连接 Hive 数据库报错:`RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`.

-    如果在 `show databases` 和 `show tables` 都是没问题的情况下,查询的时候出现上面的错误,我们需要进行下面两个操作:
-    - fe/conf、be/conf 目录下需放置 core-site.xml 和 hdfs-site.xml
-    - BE 节点执行 Kerberos 的 kinit 然后重启 BE,然后再去执行查询即可。
+   如果在 `show databases` 和 `show tables` 都是没问题的情况下,查询的时候出现上面的错误,需要进行下面两个操作:
+   - `fe/conf`、`be/conf` 目录下需放置 `core-site.xml` 和 `hdfs-site.xml`。
+   - BE 节点执行 Kerberos 的 `kinit` 并重启 BE,再执行查询即可。

3. 
查询配置了 Kerberos 的外表,遇到该报错:`GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)`,一般重启 FE 和 BE 能够解决该问题。 - - 重启所有节点前可在`"${DORIS_HOME}/be/conf/be.conf"`中的 JAVA_OPTS 参数里配置`-Djavax.security.auth.useSubjectCredsOnly=false`,通过底层机制去获取 JAAS credentials 信息,而不是应用程序。 - - 在[JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html)中可获取更多常见 JAAS 报错的解决方法。 + - 重启所有节点前可在 `"${DORIS_HOME}/be/conf/be.conf"` 中的 JAVA_OPTS 参数里配置 `-Djavax.security.auth.useSubjectCredsOnly=false`,通过底层机制去获取 JAAS credentials 信息,而不是应用程序。 + - 在 [JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html) 中可获取更多常见 JAAS 报错的解决方法。 -4. 在 Catalog 中配置 Kerberos 时,报错`Unable to obtain password from user`的解决方法: +4. 在 Catalog 中配置 Kerberos 时,报错 `Unable to obtain password from user` 的解决方法: - - 用到的 principal 必须在 klist 中存在,使用`klist -kt your.keytab`检查。 - - 检查 catalog 配置是否正确,比如漏配`yarn.resourcemanager.principal`。 - - 若上述检查没问题,则当前系统 yum 或者其他包管理软件安装的 JDK 版本存在不支持的加密算法,建议自行安装 JDK 并设置`JAVA_HOME`环境变量。 + - 用到的 principal 必须在 klist 中存在,使用 `klist -kt your.keytab` 检查。 + - 检查 Catalog 配置是否正确,比如漏配 `yarn.resourcemanager.principal`。 + - 若上述检查没问题,则当前系统 yum 或者其他包管理软件安装的 JDK 版本存在不支持的加密算法,建议自行安装 JDK 并设置 `JAVA_HOME` 环境变量。 - Kerberos 默认使用 AES-256 来进行加密。如果使用 Oracle JDK,则必须安装 JCE。如果是 OpenJDK,OpenJDK 的某些发行版会自动提供无限强度的 JCE,因此不需要安装 JCE。 - - JCE 与 JDK 版本是对应的,需要根据 JDK 的版本来选择 JCE 版本,下载 JCE 的 zip 包并解压到`$JAVA_HOME/jre/lib/security`目录下: - - JDK6:[JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) - - JDK7:[JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) - - JDK8:[JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) + - JCE 与 JDK 版本是对应的,需要根据 JDK 的版本来选择 JCE 版本,下载 JCE 的 zip 包并解压到 `$JAVA_HOME/jre/lib/security` 目录下: + - 
JDK6:[JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) + - JDK7:[JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) + - JDK8:[JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) 5. 使用 KMS 访问 HDFS 时报错:`java.security.InvalidKeyException: Illegal key size` - 升级 JDK 版本到 >= Java 8 u162 的版本。或者下载安装 JDK 相应的 JCE Unlimited Strength Jurisdiction Policy Files。 + 升级 JDK 版本到 >= Java 8 u162 的版本,或者下载安装 JDK 相应的 JCE Unlimited Strength Jurisdiction Policy Files。 -6. 在 Catalog 中配置 Kerberos 时,如果报错`SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`,那么需要将`core-site.xml`文件放到`"${DORIS_HOME}/be/conf"`目录下。 +6. 在 Catalog 中配置 Kerberos 时,如果报错 `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`,那么需要将 `core-site.xml` 文件放到 `"${DORIS_HOME}/be/conf"` 目录下。 - 如果访问 HDFS 报错`No common protection layer between client and server`,检查客户端和服务端的`hadoop.rpc.protection`属性,使他们保持一致。 + 如果访问 HDFS 报错 `No common protection layer between client and server`,检查客户端和服务端的 `hadoop.rpc.protection` 属性,使它们保持一致。 - ``` - - - - - - - hadoop.security.authentication - kerberos - - - + ```xml + + + + + + + hadoop.security.authentication + kerberos + + + ``` -7. 在使用 Broker Load 时,配置了 Kerberos,如果报错`Cannot locate default realm.`。 +7. 在使用 Broker Load 时,配置了 Kerberos,如果报错 `Cannot locate default realm.`。 - 将 `-Djava.security.krb5.conf=/your-path` 配置项添加到 Broker Load 启动脚本的 `start_broker.sh` 的 `JAVA_OPTS`里。 + 将 `-Djava.security.krb5.conf=/your-path` 配置项添加到 Broker Load 启动脚本的 `start_broker.sh` 的 `JAVA_OPTS` 里。 -8. 当在 Catalog 里使用 Kerberos 配置时,不能同时使用`hadoop.username`属性。 +8. 当在 Catalog 里使用 Kerberos 配置时,不能同时使用 `hadoop.username` 属性。 9. 
使用 JDK 17 访问 Kerberos - 如果使用 JDK 17 运行 Doris 并访问 Kerberos 服务,可能会出现因使用已废弃的加密算法而导致无法访问的现象。需要在 krb5.conf 中添加 `allow_weak_crypto=true` 属性。或升级 Kerberos 的加密算法。 + 如果使用 JDK 17 运行 Doris 并访问 Kerberos 服务,可能会出现因使用已废弃的加密算法而导致无法访问的现象。需要在 `krb5.conf` 中添加 `allow_weak_crypto=true` 属性,或升级 Kerberos 的加密算法。 详情参阅: @@ -94,52 +94,51 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 1. 通过 JDBC Catalog 连接 SQLServer 报错:`unable to find valid certification path to requested target` - 请在 `jdbc_url` 中添加 `trustServerCertificate=true` 选项。 + 请在 `jdbc_url` 中添加 `trustServerCertificate=true` 选项。 2. 通过 JDBC Catalog 连接 MySQL 数据库,中文字符乱码,或中文字符条件查询不正确 - 请在 `jdbc_url` 中添加 `useUnicode=true&characterEncoding=utf-8` + 请在 `jdbc_url` 中添加 `useUnicode=true&characterEncoding=utf-8`。 - > 注:1.2.3 版本后,使用 JDBC Catalog 连接 MySQL 数据库,会自动添加这些参数。 + > 注:1.2.3 版本后,使用 JDBC Catalog 连接 MySQL 数据库,会自动添加这些参数。 3. 通过 JDBC Catalog 连接 MySQL 数据库报错:`Establishing SSL connection without server's identity verification is not recommended` - 请在 `jdbc_url` 中添加 `useSSL=true` + 请在 `jdbc_url` 中添加 `useSSL=true`。 -4. 使用 JDBC Catalog 将 MySQL 数据同步到 Doris 中,日期数据同步错误。需要校验下 MySQL 的版本是否与 MySQL 的驱动包是否对应,比如 MySQL8 以上需要使用驱动 com.mysql.cj.jdbc.Driver。 - +4. 使用 JDBC Catalog 将 MySQL 数据同步到 Doris 中,日期数据同步错误。需要校验 MySQL 的版本与 MySQL 的驱动包是否对应,比如 MySQL 8 以上需要使用驱动 `com.mysql.cj.jdbc.Driver`。 5. 
单个字段过大,查询时 BE 侧 Java 内存 OOM - Jdbc Scanner 在通过 jdbc 读取时,由 session variable `batch_size` 决定每批次数据在 JVM 中处理的数量,如果单个字段过大,导致 `字段大小 * batch_size`(近似值,由于 JVM 中 static 以及数据 copy 占用) 超过 JVM 内存限制,就会出现 OOM。 + JDBC Scanner 在通过 JDBC 读取时,由 Session Variable `batch_size` 决定每批次数据在 JVM 中处理的数量,如果单个字段过大,导致 `字段大小 * batch_size`(近似值,由于 JVM 中 static 以及数据 copy 占用)超过 JVM 内存限制,就会出现 OOM。 - 解决方法: + 解决方法: - - 减小 `batch_size` 的值,可以通过 `set batch_size = 512;` 来调整,默认值为 4064。 - - 增大 BE 的 JVM 内存,通过修改 `JAVA_OPTS` 参数中的 `-Xmx` 来调整 JVM 最大堆内存大小。例如:`"-Xmx8g`。 + - 减小 `batch_size` 的值,可以通过 `set batch_size = 512;` 来调整,默认值为 4064。 + - 增大 BE 的 JVM 内存,通过修改 `JAVA_OPTS` 参数中的 `-Xmx` 来调整 JVM 最大堆内存大小。例如:`-Xmx8g`。 ## Hive Catalog 1. 通过 Hive Catalog 访问 Iceberg 或 Hive 表报错:`failed to get schema` 或 `Storage schema reading not supported` - 可以尝试以下方法: + 可以尝试以下方法: - * 在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。 + - 在 Hive 的 `lib/` 目录放上 `iceberg` 运行时有关的 jar 包。 - * 在 `hive-site.xml` 配置: + - 在 `hive-site.xml` 配置: - ``` - metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - ``` + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` - 配置完成后需要重启 Hive Metastore。 + 配置完成后需要重启 Hive Metastore。 - * 在 Catalog 属性中添加 `"get_schema_from_table" = "true"` + - 在 Catalog 属性中添加 `"get_schema_from_table" = "true"`。 - 该参数自 2.1.10 和 3.0.6 版本支持。 + 该参数自 2.1.10 和 3.0.6 版本支持。 2. 连接 Hive Catalog 报错:`Caused by: java.lang.NullPointerException` - 如 fe.log 中有如下堆栈: + 如 fe.log 中有如下堆栈: ``` Caused by: java.lang.NullPointerException @@ -150,17 +149,17 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181] ``` - 可以尝试在 `create catalog` 语句中添加 `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` 解决。 + 可以尝试在 `CREATE CATALOG` 语句中添加 `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` 解决。 -3. 
如果创建 Hive Catalog 后能正常`show tables`,但查询时报`java.net.UnknownHostException: xxxxx` +3. 如果创建 Hive Catalog 后能正常 `show tables`,但查询时报 `java.net.UnknownHostException: xxxxx` - 可以在 CATALOG 的 PROPERTIES 中添加 + 可以在 Catalog 的 PROPERTIES 中添加: ``` 'fs.defaultFS' = 'hdfs://' ``` -4. Hive 1.x 的 orc 格式的表可能会遇到底层 orc 文件 schema 中列名为 `_col0`,`_col1`,`_col2`... 这类系统列名,此时需要在 catalog 配置中添加 `hive.version` 为 1.x.x,这样就会使用 hive 表中的列名进行映射。 +4. Hive 1.x 的 ORC 格式的表可能会遇到底层 ORC 文件 Schema 中列名为 `_col0`、`_col1`、`_col2`... 这类系统列名,此时需要在 Catalog 配置中添加 `hive.version` 为 1.x.x,这样就会使用 Hive 表中的列名进行映射。 ```sql CREATE CATALOG hive PROPERTIES ( @@ -168,7 +167,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ); ``` -5. 使用 Catalog 查询表数据时发现与 Hive Metastore 相关的报错:`Invalid method name`,需要设置`hive.version`参数。 +5. 使用 Catalog 查询表数据时发现与 Hive Metastore 相关的报错:`Invalid method name`,需要设置 `hive.version` 参数。 ```sql CREATE CATALOG hive PROPERTIES ( @@ -178,15 +177,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 6. 查询 ORC 格式的表,FE 报错 `Could not obtain block` 或 `Caused by: java.lang.NoSuchFieldError: types` - 对于 ORC 文件,在默认情况下,FE 会访问 HDFS 获取文件信息,进行文件切分。部分情况下,FE 可能无法访问到 HDFS。可以通过添加以下参数解决: + 对于 ORC 文件,在默认情况下,FE 会访问 HDFS 获取文件信息,进行文件切分。部分情况下,FE 可能无法访问到 HDFS。可以通过添加以下参数解决: - `"hive.exec.orc.split.strategy" = "BI"` + `"hive.exec.orc.split.strategy" = "BI"` - 其他选项:HYBRID(默认),ETL。 + 其他选项:HYBRID(默认)、ETL。 -7. 在 hive 上可以查到 hudi 表分区字段的值,但是在 doris 查不到。 +7. 在 Hive 上可以查到 Hudi 表分区字段的值,但是在 Doris 查不到。 - doris 和 hive 目前查询 hudi 的方式不一样,doris 需要在 hudi 表结构的 avsc 文件里添加上分区字段,如果没加,就会导致 doris 查询 partition_val 为空(即使设置了 hoodie.datasource.hive_sync.partition_fields=partition_val 也不可以) + Doris 和 Hive 目前查询 Hudi 的方式不一样,Doris 需要在 Hudi 表结构的 avsc 文件里添加上分区字段,如果没加,就会导致 Doris 查询 `partition_val` 为空(即使设置了 `hoodie.datasource.hive_sync.partition_fields=partition_val` 也不可以)。 ``` { @@ -215,26 +214,26 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- } ``` -8. 
查询 hive 外表,遇到该报错:`java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found` +8. 查询 Hive 外表,遇到该报错:`java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found` - 去 hadoop 环境搜索`hadoop-lzo-*.jar`放在`"${DORIS_HOME}/fe/lib/"`目录下并重启 fe。 + 去 Hadoop 环境搜索 `hadoop-lzo-*.jar` 放在 `"${DORIS_HOME}/fe/lib/"` 目录下并重启 FE。 - 从 2.0.2 版本起,可以将这个文件放置在 FE 的 `custom_lib/` 目录下(如不存在,手动创建即可),以防止升级集群时因为 lib 目录被替换而导致文件丢失。 + 从 2.0.2 版本起,可以将这个文件放置在 FE 的 `custom_lib/` 目录下(如不存在,手动创建即可),以防止升级集群时因为 `lib` 目录被替换而导致文件丢失。 -9. 创建 hive 表指定 serde 为 `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`,访问表时报错:`storage schema reading not supported` +9. 创建 Hive 表指定 serde 为 `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`,访问表时报错:`storage schema reading not supported` - 在 hive-site.xml 文件中增加以下配置,并重启 hms 服务: + 在 `hive-site.xml` 文件中增加以下配置,并重启 HMS 服务: - ``` - - metastore.storage.schema.reader.impl - org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - - ``` + ```xml + + metastore.storage.schema.reader.impl + org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + + ``` -10. 报错:java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty +10. 报错:`java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty` - FE日志中完整报错信息如下: + FE 日志中完整报错信息如下: ``` org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path failed. 
path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty @@ -246,7 +245,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty ``` - 尝试更新FE节点CA证书,使用 `update-ca-trust(CentOS/RockyLinux)`,然后重启FE进程即可。 + 尝试更新 FE 节点 CA 证书,使用 `update-ca-trust`(CentOS/RockyLinux),然后重启 FE 进程即可。 11. BE 报错:`java.lang.InternalError` @@ -267,19 +266,18 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 是因为 Doris 自带的 libz.a 和系统环境中的 libz.so 冲突了。 - 为了解决这个问题,需要先执行 `export LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH` 然后重启 BE 进程。 + 为了解决这个问题,需要先执行 `export LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH`,然后重启 BE 进程。 -12. 在插入 hive 数据的时候报错:`HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`。 +12. 在插入 Hive 数据的时候报错:`HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`。 - 因为插入数据之后,需要更新对应的统计信息,这个更新的操作需要 alter 权限,所以要在 ranger 上给该用户新增 alter 权限。 + 因为插入数据之后,需要更新对应的统计信息,这个更新的操作需要 ALTER 权限,所以要在 Ranger 上给该用户新增 ALTER 权限。 13. 在查询 ORC 文件时,如果出现报错类似 `Orc row reader nextBatch failed. 
reason = Can't open /usr/share/zoneinfo/+08:00` 首先检查当前 `session` 下 `time_zone` 的时区设置是多少,推荐使用类似 `Asia/Shanghai` 的写法。 - 如果 `session` 时区已经是 `Asia/Shanghai`,且查询仍然报错,说明生成 ORC 文件时的时区是 `+08:00`, 导致在读取时解析 `footer` 时需要用到 `+08:00` 时区,可以尝试在 `/usr/share/zoneinfo/` 目录下面软链到相同时区上。 + 如果 `session` 时区已经是 `Asia/Shanghai`,且查询仍然报错,说明生成 ORC 文件时的时区是 `+08:00`,导致在读取时解析 `footer` 时需要用到 `+08:00` 时区,可以尝试在 `/usr/share/zoneinfo/` 目录下面软链到相同时区上。 -<<<<<<< HEAD 14. 查询使用 JSON SerDe(如 `org.openx.data.jsonserde.JsonSerDe`)的 Hive 表时,报错:`failed to get schema` 或 `Storage schema reading not supported` 当 Hive 表使用 JSON 格式存储(ROW FORMAT SERDE 为 `org.openx.data.jsonserde.JsonSerDe`)时,Hive Metastore 可能无法通过默认方式读取表的 Schema 信息,导致 Doris 查询时报错: @@ -301,8 +299,8 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ``` 该参数自 2.1.10 和 3.0.6 版本支持。 -======= -14. **查询 Hive Catalog 表时,优化阶段极慢并伴随 `nereids cost too much time` 报错,且每次访问 HMS 的耗时稳定在 10 秒左右。** + +15. 查询 Hive Catalog 表时,优化阶段极慢并伴随 `nereids cost too much time` 报错,且每次访问 HMS 的耗时稳定在 10 秒左右。 **问题分析:** 这类问题通常并非 HMS 服务本身的 RPC 执行慢引起,而是由于 **Doris FE 所在机器的 DNS 配置异常** 导致。 @@ -329,22 +327,24 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 1. **修复 DNS 配置(推荐)**:修正 Doris FE 节点上 `/etc/resolv.conf` 中的 `nameserver` 配置,确保域名解析服务正常且快速响应。如果局域网内无需 DNS 且无公网访问需求,可注释掉无效的 nameserver。 2. **配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。 3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。 ->>>>>>> 0cf7ea7ca780 (docs(faq): add HMS DNS resolution diagnostic and Pulse toolset) -15. **查询 Hive Catalog 表时,偶尔出现查询卡住约 10 秒(默认 socket 超时时间)后才继续执行,或者报错传输异常,但紧接着再次查询又恢复正常。** +16. 
**查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。** **问题描述:** - 这种情况通常发生在 Catalog 空闲一段时间后的首次查询。表现为请求首次发起 HMS RPC 时卡住,等待约 10 秒后(对应默认的 `hive.metastore.client.socket.timeout`)抛出超时或异常,或者在重试后成功。一旦该连接被剔除并重建,后续查询会立即恢复正常。 + + 这种情况通常发生在 Catalog 空闲一段时间后的首次查询。表现为请求在发起 HMS RPC 时卡住,由于 Hive Client 内部存在重试机制,复用到失效的长连接时会等待 Socket Timeout(默认 10 秒),重试可能导致累积等待时间长达 20-30 秒。这会导致查询规划阶段极慢,甚至直接触发 Doris FE 的优化器超时报错 `nereids cost too much time`。一旦该连接被剔除并重建,后续查询会立即恢复正常。 **问题分析:** - Doris 为每个 HMS Catalog 维护了一个 client pool 以复用连接。在复杂的网络环境中(如跨 VPC、经过防火墙或 NAT 网关),中间网络设备往往会对空闲连接设置 `idle timeout`。 - 当连接空闲时间超过阈值时,网络设备会静默丢弃连接状态,且通常不会通知应用两端(不发送 FIN/RST)。Doris 侧仍认为连接可用,下次复用该“僵尸连接”时,由于链路已不可达,必须等待完整的 socket timeout 才能感知到失效。 + + Doris 为每个 HMS Catalog 维护了一个 Client Pool 以复用连接。在复杂的网络环境中(如跨 VPC、经过防火墙或 NAT 网关),中间网络设备往往会对空闲连接设置 `idle timeout`。当连接空闲时间超过阈值时,网络设备会静默丢弃连接状态,且通常不会发送 FIN/RST 包通知两端。Doris 侧仍认为连接可用,下次复用该“僵尸连接”时,由于链路已不可达,必须等待完整的 Socket Timeout 才能感知到失效并触发重试。 **排查建议:** + - 确认 Doris FE 与 HMS 之间是否经过了防火墙、云厂商 NAT 网关或负载均衡器。 - - 使用下文提到的 **Pulse (hms-tools)** 工具。如果探测显示网络连通极快,但长时间放置后首次执行 RPC 出现稳定超时,则基本可判定为长连接被中间设备静默回收。 + - 使用下文提到的 **Pulse (hms-tools)** 工具。如果探测显示网络连通极快,但长时间放置后首次执行 RPC 出现稳定且倍数于 10s 的延迟,则基本可判定为长连接被中间设备静默回收。 **解决方案:** + 利用 Hive Client 原生的生命周期管理能力,在 Catalog 属性中配置 `hive.metastore.client.socket.lifetime`,使其略短于中间网络设备的空闲超时时间(例如设为 300 秒): ```sql @@ -355,29 +355,30 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- "hive.metastore.client.socket.lifetime" = "300s" ); ``` - 配置后, HMS Client 在执行 RPC 前会检查连接年龄。如果已超过 `lifetime`,它会主动重新建立连接,从而规避因复用失效连接导致的 10 秒卡顿。 + + 配置后,HMS Client 在执行 RPC 前会检查连接年龄。如果已超过 `lifetime`,它会主动重新建立连接,从而规避因复用失效连接导致的长时间卡顿或优化器超时。 ## HDFS 1. 访问 HDFS 3.x 时报错:`java.lang.VerifyError: xxx` - 1.2.1 之前的版本中,Doris 依赖的 Hadoop 版本为 2.8。需更新至 2.10.2。或更新 Doris 至 1.2.2 之后的版本。 + 1.2.1 之前的版本中,Doris 依赖的 Hadoop 版本为 2.8,需更新至 2.10.2,或更新 Doris 至 1.2.2 之后的版本。 -2. 使用 Hedged Read 优化 HDFS 读取慢的问题。 +2. 
使用 Hedged Read 优化 HDFS 读取慢的问题 在某些情况下,HDFS 的负载较高可能导致读取某个 HDFS 上的数据副本的时间较长,从而拖慢整体的查询效率。HDFS Client 提供了 Hedged Read 功能。 - 该功能可以在一个读请求超过一定阈值未返回时,启动另一个读线程读取同一份数据,哪个先返回就是用哪个结果。 + 该功能可以在一个读请求超过一定阈值未返回时,启动另一个读线程读取同一份数据,哪个先返回就使用哪个结果。 注意:该功能可能会增加 HDFS 集群的负载,请酌情使用。 可以通过以下方式开启这个功能: - ``` - create catalog regression properties ( - 'type'='hms', + ```sql + CREATE CATALOG regression PROPERTIES ( + 'type' = 'hms', 'hive.metastore.uris' = 'thrift://172.21.16.47:7004', 'dfs.client.hedged.read.threadpool.size' = '128', - 'dfs.client.hedged.read.threshold.millis' = "500" + 'dfs.client.hedged.read.threshold.millis' = '500' ); ``` @@ -387,15 +388,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 开启后,可以在 Query Profile 中看到相关参数: - `TotalHedgedRead`: 发起 Hedged Read 的次数。 + `TotalHedgedRead`:发起 Hedged Read 的次数。 - `HedgedReadWins`:Hedged Read 成功的次数(发起并且比原请求更快返回的次数) + `HedgedReadWins`:Hedged Read 成功的次数(发起并且比原请求更快返回的次数)。 注意,这里的值是单个 HDFS Client 的累计值,而不是单个查询的数值。同一个 HDFS Client 会被多个查询复用。 3. 
`Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider` - 在 FE 和 BE 的 start 脚本中,会将环境变量 `HADOOP_CONF_DIR` 加入 CLASSPATH。如果 `HADOOP_CONF_DIR` 设置错误,比如指向了不存在的路径或错误路径,则可能加载到错误的 xxx-site.xml 文件,从而读取到错误的信息。 + 在 FE 和 BE 的启动脚本中,会将环境变量 `HADOOP_CONF_DIR` 加入 CLASSPATH。如果 `HADOOP_CONF_DIR` 设置错误,比如指向了不存在的路径或错误路径,则可能加载到错误的 `xxx-site.xml` 文件,从而读取到错误的信息。 需检查 `HADOOP_CONF_DIR` 是否配置正确,或将这个环境变量删除。 @@ -403,7 +404,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 可能的处理方式有: - 通过 `hdfs fsck file -files -blocks -locations` 来查看具体该文件是否健康。 - - 通过 `telnet` 来检查与 datanode 的连通性。 + - 通过 `telnet` 来检查与 DataNode 的连通性。 在错误日志中可能会打印如下错误: @@ -417,7 +418,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 同时,请检查 `fe/conf` 和 `be/conf` 下放置的 `hdfs-site.xml` 文件中,该参数是否为 true。 - - 查看 datanode 日志。 + - 查看 DataNode 日志。 如果出现以下错误: @@ -425,11 +426,11 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX. Perhaps the client is running an older version of Hadoop which does not support SASL data transfer protection ``` - 则为当前 hdfs 开启了加密传输方式,而客户端未开启导致的错误。 + 则为当前 HDFS 开启了加密传输方式,而客户端未开启导致的错误。 使用下面的任意一种解决方案即可: - - 拷贝 `hdfs-site.xml` 以及 `core-site.xml` 到 `fe/conf` 和 `be/conf` 目录。(推荐) - - 在 `hdfs-site.xml` 找到相应的配置 `dfs.data.transfer.protection`,并且在 catalog 里面设置该参数。 + - 拷贝 `hdfs-site.xml` 以及 `core-site.xml` 到 `fe/conf` 和 `be/conf` 目录。(推荐) + - 在 `hdfs-site.xml` 找到相应的配置 `dfs.data.transfer.protection`,并且在 Catalog 里面设置该参数。 5. 查询 Hive Catalog 表时报错:`RPC response has a length of xxx exceeds maximum data length` @@ -450,29 +451,29 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ## DLF Catalog -1. 使用 DLF Catalog 时,BE 读在取 JindoFS 数据出现`Invalid address`,需要在`/ets/hosts`中添加日志中出现的域名到 IP 的映射。 +1. 
使用 DLF Catalog 时,BE 读取 JindoFS 数据出现 `Invalid address`,需要在 `/etc/hosts` 中添加日志中出现的域名到 IP 的映射。 -2. 读取数据无权限时,使用`hadoop.username`属性指定有权限的用户。 +2. 读取数据无权限时,使用 `hadoop.username` 属性指定有权限的用户。 -3. DLF Catalog 中的元数据和 DLF 保持一致。当使用 DLF 管理元数据时,Hive 新导入的分区,可能未被 DLF 同步,导致出现 DLF 和 Hive 元数据不一致的情况,对此,需要先保证 Hive 元数据被 DLF 完全同步。 +3. DLF Catalog 中的元数据和 DLF 保持一致。当使用 DLF 管理元数据时,Hive 新导入的分区可能未被 DLF 同步,导致出现 DLF 和 Hive 元数据不一致的情况。对此,需要先保证 Hive 元数据被 DLF 完全同步。 ## 其他问题 1. Binary 类型映射到 Doris 后,查询乱码 - Doris 原生不支持 Binary 类型,所以各类数据湖或数据库中的 Binary 类型映射到 Doris 中,通常使用 String 类型进行映射。String 类型只能展示可打印字符。如果需要查询 Binary 的内容,可以使用 `TO_BASE64()` 函数转换为 Base64 编码后,在进行下一步处理。 + Doris 原生不支持 Binary 类型,所以各类数据湖或数据库中的 Binary 类型映射到 Doris 中,通常使用 String 类型进行映射。String 类型只能展示可打印字符。如果需要查询 Binary 的内容,可以使用 `TO_BASE64()` 函数转换为 Base64 编码后,再进行下一步处理。 2. 分析 Parquet 文件 - 在查询 Parquet 文件时,由于不同系统生成的 Parquet 文件格式可能有所差异,比如 RowGroup 的数量,索引的值等,有时需要检查 Parquet 文件的元数据进行问题定位或性能分析。这里提供一个工具帮助用户更方便的分析 Parquet 文件: + 在查询 Parquet 文件时,由于不同系统生成的 Parquet 文件格式可能有所差异,比如 RowGroup 的数量、索引的值等,有时需要检查 Parquet 文件的元数据进行问题定位或性能分析。这里提供一个工具帮助用户更方便地分析 Parquet 文件: - 1. 下载并解压 [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz) - 2. 将需要分析的 Parquet 文件下载到本地,假设路径为 `/path/to/file.parquet` + 1. 下载并解压 [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz)。 + 2. 将需要分析的 Parquet 文件下载到本地,假设路径为 `/path/to/file.parquet`。 3. 使用如下命令分析 Parquet 文件元信息: `./parquet-tools meta /path/to/file.parquet` - 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli) + 4. 更多功能,可参阅 [Apache Parquet Cli 文档](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)。 ## 诊断工具 @@ -492,7 +493,7 @@ Pulse 主要包含以下工具集: - 支持测试 KDC 可达性、检查 Keytab 文件以及执行登录测试,确保认证层不会阻断连接。 3. 
**对象存储诊断工具 (`s3-tools`, `gcs-tools`, `azure-blob-cpp`)**: - - 针对主流云存储(AWS S3, Google GCS, Azure Blob)的诊断工具。 + - 针对主流云存储(AWS S3、Google GCS、Azure Blob)的诊断工具。 - 用于排查“权限拒绝(Access Denied)”或“存储桶不存在(Bucket Not Found)”等常见的外部表数据访问问题。 - 支持验证凭据来源、STS 身份以及执行 Bucket 级别的操作测试。 diff --git a/versioned_docs/version-3.x/faq/lakehouse-faq.md b/versioned_docs/version-3.x/faq/lakehouse-faq.md index 145f352765bd2..23295d5e54b21 100644 --- a/versioned_docs/version-3.x/faq/lakehouse-faq.md +++ b/versioned_docs/version-3.x/faq/lakehouse-faq.md @@ -2,34 +2,34 @@ { "title": "Data Lakehouse FAQ", "language": "en", - "description": "This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps:" + "description": "Apache Doris Data Lakehouse FAQ and troubleshooting guide, covering certificate issues, Kerberos authentication, JDBC Catalog, Hive Catalog, HDFS, DLF Catalog, and diagnostic tools." } --- ## Certificate Issues 1. When querying, an error `curl 77: Problem with the SSL CA cert.` occurs. This indicates that the current system certificate is too old and needs to be updated locally. - - You can download the latest CA certificate from `https://curl.se/docs/caextract.html`. - - Place the downloaded `cacert-xxx.pem` into the `/etc/ssl/certs/` directory, for example: `sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`. + - You can download the latest CA certificate from `https://curl.se/docs/caextract.html`. + - Place the downloaded `cacert-xxx.pem` into the `/etc/ssl/certs/` directory, for example: `sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`. 2. When querying, an error occurs: `ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`. 
-``` -yum install -y ca-certificates -ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt -``` + ``` + yum install -y ca-certificates + ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt + ``` ## Kerberos 1. When connecting to a Hive Metastore authenticated with Kerberos, an error `GSS initiate failed` is encountered. - This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps: + This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps: 1. In versions prior to 1.2.1, the libhdfs3 library that Doris depends on did not enable gsasl. Please update to versions 1.2.2 and later. 2. Ensure that correct keytab and principal are set for each component and verify that the keytab file exists on all FE and BE nodes. - - `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`: Used for Hadoop hdfs access, fill in the corresponding values for hdfs. - - `hive.metastore.kerberos.principal`: Used for hive metastore. + 1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`: Used for Hadoop HDFS access, fill in the corresponding values for HDFS. + 2. `hive.metastore.kerberos.principal`: Used for Hive Metastore. 3. Try replacing the IP in the principal with a domain name (do not use the default `_HOST` placeholder). 4. Ensure that the `/etc/krb5.conf` file exists on all FE and BE nodes. @@ -37,54 +37,54 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 2. When connecting to a Hive database through the Hive Catalog, an error occurs: `RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`. If the error occurs during the query when there are no issues with `show databases` and `show tables`, follow these two steps: - - Place core-site.xml and hdfs-site.xml in the fe/conf and be/conf directories. 
- - Execute Kerberos kinit on the BE node, restart BE, and then proceed with the query. - - When encountering the error `GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)` while querying a table configured with Kerberos, restarting FE and BE nodes usually resolves the issue. - + - Place `core-site.xml` and `hdfs-site.xml` in the `fe/conf` and `be/conf` directories. + - Execute Kerberos `kinit` on the BE node, restart BE, and then proceed with the query. + +3. When encountering the error `GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)` while querying a table configured with Kerberos, restarting FE and BE nodes usually resolves the issue. + - Before restarting all nodes, configure `-Djavax.security.auth.useSubjectCredsOnly=false` in the JAVA_OPTS parameter in `"${DORIS_HOME}/be/conf/be.conf"` to obtain JAAS credentials information through the underlying mechanism rather than the application. - Refer to [JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html) for solutions to common JAAS errors. - To resolve the error `Unable to obtain password from user` when configuring Kerberos in the Catalog: - +4. To resolve the error `Unable to obtain password from user` when configuring Kerberos in the Catalog: + - Ensure the principal used is listed in klist by checking with `klist -kt your.keytab`. - - Verify the catalog configuration for any missing settings such as `yarn.resourcemanager.principal`. + - Verify the Catalog configuration for any missing settings such as `yarn.resourcemanager.principal`. - If the above checks are fine, it may be due to the JDK version installed by the system's package manager not supporting certain encryption algorithms. Consider installing JDK manually and setting the `JAVA_HOME` environment variable. - Kerberos typically uses AES-256 for encryption. 
For Oracle JDK, JCE must be installed. Some distributions of OpenJDK automatically provide unlimited strength JCE, eliminating the need for separate installation. - JCE versions correspond to JDK versions; download the appropriate JCE zip package and extract it to the `$JAVA_HOME/jre/lib/security` directory based on the JDK version: - - JDK6: [JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) - - JDK7: [JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) - - JDK8: [JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) - - When encountering the error `java.security.InvalidKeyException: Illegal key size` while accessing HDFS with KMS, upgrade the JDK version to >= Java 8 u162 or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files. - - If configuring Kerberos in the Catalog results in the error `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`, place the `core-site.xml` file in the `"${DORIS_HOME}/be/conf"` directory. - + - JDK6: [JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) + - JDK7: [JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) + - JDK8: [JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) + +5. When encountering the error `java.security.InvalidKeyException: Illegal key size` while accessing HDFS with KMS, upgrade the JDK version to >= Java 8 u162, or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files. + +6. If configuring Kerberos in the Catalog results in the error `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`, place the `core-site.xml` file in the `"${DORIS_HOME}/be/conf"` directory. 
+ If accessing HDFS results in the error `No common protection layer between client and server`, ensure that the `hadoop.rpc.protection` properties on the client and server are consistent. - - ``` + + ```xml - + - + hadoop.security.authentication kerberos - + ``` - - When using Broker Load with Kerberos configured and encountering the error `Cannot locate default realm.`: - + +7. When using Broker Load with Kerberos configured and encountering the error `Cannot locate default realm.`: + Add the configuration item `-Djava.security.krb5.conf=/your-path` to the `JAVA_OPTS` in the `start_broker.sh` script for Broker Load. -3. When using Kerberos configuration in the Catalog, the `hadoop.username` property cannot be used simultaneously. +8. When using Kerberos configuration in the Catalog, the `hadoop.username` property cannot be used simultaneously. -4. Accessing Kerberos with JDK 17 +9. Accessing Kerberos with JDK 17 - When running Doris with JDK 17 and accessing Kerberos services, you may encounter issues accessing due to the use of deprecated encryption algorithms. You need to add the `allow_weak_crypto=true` property in krb5.conf or upgrade the encryption algorithm in Kerberos. + When running Doris with JDK 17 and accessing Kerberos services, you may encounter issues due to the use of deprecated encryption algorithms. You need to add the `allow_weak_crypto=true` property in `krb5.conf`, or upgrade the encryption algorithm in Kerberos. For more details, refer to: @@ -92,52 +92,52 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 1. Error connecting to SQLServer via JDBC Catalog: `unable to find valid certification path to requested target` - Add the `trustServerCertificate=true` option in the `jdbc_url`. + Add the `trustServerCertificate=true` option in the `jdbc_url`. 2. 
Connecting to MySQL database via JDBC Catalog results in Chinese character garbling or incorrect Chinese character query conditions - Add `useUnicode=true&characterEncoding=utf-8` in the `jdbc_url`. + Add `useUnicode=true&characterEncoding=utf-8` in the `jdbc_url`. - > Note: Starting from version 1.2.3, when connecting to MySQL database via JDBC Catalog, these parameters will be automatically added. + > Note: Starting from version 1.2.3, when connecting to MySQL database via JDBC Catalog, these parameters will be automatically added. 3. Error connecting to MySQL database via JDBC Catalog: `Establishing SSL connection without server's identity verification is not recommended` - Add `useSSL=true` in the `jdbc_url`. + Add `useSSL=true` in the `jdbc_url`. -4. When synchronizing MySQL data to Doris using JDBC Catalog, date data synchronization error occurs. Verify if the MySQL version matches the MySQL driver package, for example, MySQL 8 and above require the driver com.mysql.cj.jdbc.Driver. +4. When synchronizing MySQL data to Doris using JDBC Catalog, date data synchronization error occurs. Verify if the MySQL version matches the MySQL driver package, for example, MySQL 8 and above require the driver `com.mysql.cj.jdbc.Driver`. 5. When a single field is too large, a Java memory OOM occurs on the BE side during a query. - When Jdbc Scanner reads data through JDBC, the session variable `batch_size` determines the number of rows processed in the JVM per batch. If a single field is too large, it may cause `field_size * batch_size` (approximate value, considering JVM static memory and data copy overhead) to exceed the JVM memory limit, resulting in OOM. + When JDBC Scanner reads data through JDBC, the Session Variable `batch_size` determines the number of rows processed in the JVM per batch. 
If a single field is too large, it may cause `field_size * batch_size` (approximate value, considering JVM static memory and data copy overhead) to exceed the JVM memory limit, resulting in OOM. - Solutions: + Solutions: - - Reduce the `batch_size` value by executing `set batch_size = 512;`. The default value is 4064. - - Increase the BE JVM memory by modifying the `-Xmx` parameter in `JAVA_OPTS`. For example: `-Xmx8g`. + - Reduce the `batch_size` value by executing `set batch_size = 512;`. The default value is 4064. + - Increase the BE JVM memory by modifying the `-Xmx` parameter in `JAVA_OPTS`. For example: `-Xmx8g`. ## Hive Catalog 1. Accessing Iceberg or Hive table through Hive Catalog reports an error: `failed to get schema` or `Storage schema reading not supported` You can try the following methods: - - * Put the `iceberg` runtime-related jar package in the lib/ directory of Hive. - - * Configure in `hive-site.xml`: - + + - Put the `iceberg` runtime-related jar package in the `lib/` directory of Hive. + + - Configure in `hive-site.xml`: + ``` metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader ``` - + After the configuration is completed, you need to restart the Hive Metastore. - - * Add `"get_schema_from_table" = "true"` in the Catalog properties - + + - Add `"get_schema_from_table" = "true"` in the Catalog properties. + This parameter is supported since versions 2.1.10 and 3.0.6. 2. 
Error connecting to Hive Catalog: `Caused by: java.lang.NullPointerException` - If the fe.log contains the following stack trace: + If the fe.log contains the following stack trace: ``` Caused by: java.lang.NullPointerException @@ -148,17 +148,17 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181] ``` - Try adding `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` in the `create catalog` statement to resolve. + Try adding `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` in the `CREATE CATALOG` statement to resolve. 3. If after creating Hive Catalog, `show tables` works fine but querying results in `java.net.UnknownHostException: xxxxx` - Add the following in the CATALOG's PROPERTIES: + Add the following in the Catalog's PROPERTIES: ``` 'fs.defaultFS' = 'hdfs://' ``` -4. Tables in orc format in Hive 1.x may encounter system column names in the underlying orc file schema as `_col0`, `_col1`, `_col2`, etc. In this case, add `hive.version` as 1.x.x in the catalog configuration to map with the column names in the hive table. +4. Tables in ORC format in Hive 1.x may encounter system column names in the underlying ORC file Schema as `_col0`, `_col1`, `_col2`, etc. In this case, add `hive.version` as 1.x.x in the Catalog configuration to map with the column names in the Hive table. ```sql CREATE CATALOG hive PROPERTIES ( @@ -199,15 +199,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- } ``` -8. When querying a Hive external table, if you encounter the error `java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found`, search for `hadoop-lzo-*.jar` in the Hadoop environment, place it in the `"${DORIS_HOME}/fe/lib/"` directory, and restart the FE. 
Starting from version 2.0.2, you can place this file in the `custom_lib/` directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the lib directory being replaced. +8. When querying a Hive external table, if you encounter the error `java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found`, search for `hadoop-lzo-*.jar` in the Hadoop environment, place it in the `"${DORIS_HOME}/fe/lib/"` directory, and restart the FE. Starting from version 2.0.2, you can place this file in the `custom_lib/` directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the `lib` directory being replaced. -9. When creating a Hive table specifying the serde as `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`, and encountering the error `storage schema reading not supported` when accessing the table, add the following configuration to the hive-site.xml file and restart the HMS service: +9. When creating a Hive table specifying the serde as `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`, and encountering the error `storage schema reading not supported` when accessing the table, add the following configuration to the `hive-site.xml` file and restart the HMS service: - ``` + ```xml - metastore.storage.schema.reader.impl - org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - + metastore.storage.schema.reader.impl + org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` 10. Error: `java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty`. 
The complete error message in the FE log is as follows: @@ -222,7 +222,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty ``` - Try updating the CA certificate on the FE node using `update-ca-trust (CentOS/RockyLinux)`, and then restart the FE process. + Try updating the CA certificate on the FE node using `update-ca-trust` (CentOS/RockyLinux), and then restart the FE process. 11. BE error: `java.lang.InternalError`. If you see an error similar to the following in `be.INFO`: @@ -243,7 +243,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 12. When inserting data into Hive, an error occurred as `HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`. - Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the alter privilege. Therefore, the alter privilege needs to be added for this user on Ranger. + Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the ALTER privilege. Therefore, the ALTER privilege needs to be added for this user on Ranger. 13. When querying ORC files, if an error like `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00` @@ -253,18 +253,21 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. -14. 
**When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).**
+14. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).**
 
     **Root Cause Analysis:**
+
     This issue is usually not caused by slow execution of the HMS RPC itself. Instead, the most common root cause is **incorrect DNS configuration on the Doris FE node**.
     During the initialization phase of the Hive Metastore Client, hostname resolution is triggered. If the configured DNS server is unreachable or unresponsive, it causes a DNS resolution timeout (typically 10 seconds) every time a new HMS client connection is established, which severely slows down metadata fetching.
 
     **Typical Symptoms:**
+
     - **Normal Network Connectivity:** The HMS port is reachable, but metadata access in Doris remains extremely slow.
     - **Consistent Delay:** The delay consistently hits a fixed timeout threshold (e.g., 10 seconds).
     - **Workarounds Fail:** Simply increasing the HMS client timeout parameter in the Catalog properties only masks the error but does not eliminate the fixed 10-second delay on each connection.
 
     **Troubleshooting Steps:**
+
     Run the following commands on the Doris FE node to verify the DNS and hostname resolution:
 
     ```bash
@@ -281,8 +284,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
     2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node.
     3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property.
 
-<<<<<<< HEAD
-14. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported`
+15. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported`
 
     When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris:
 
@@ -303,20 +305,24 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 ```
 This parameter is supported since versions 2.1.10 and 3.0.6.
-=======
-15. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.**
+
+16. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.**
 
     **Problem Description:**
+
     This usually happens after the Catalog has been idle for a while. When an HMS RPC is initiated, if a stale connection from the pool is reused, the request will hang for the duration of the Socket Timeout (default 10s). Due to the Hive Client's internal retry mechanism, this can result in cumulative waits of 20-30 seconds if multiple retries occur. This causes the query planning phase to be extremely slow, often triggering the Doris FE optimizer timeout error `nereids cost too much time`. Once the connection is purged and rebuilt, performance returns to normal.
 
     **Root Cause Analysis:**
-    Doris maintains a Client Pool for each HMS Catalog to reuse connections. 
In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), idle TCP connections are "silently" reclaimed by network devices after an `idle timeout`. Since these devices typically do not send FIN/RST packets to notify the endpoints, Doris still believes the connection is valid. Reusing such a "zombie connection" requires waiting for a full Socket Timeout before the failure is detected and a retry is triggered. + + Doris maintains a Client Pool for each HMS Catalog to reuse connections. In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), idle TCP connections are often "silently" reclaimed by network devices after an `idle timeout`. Since these devices typically do not send FIN/RST packets to notify the endpoints, Doris still believes the connection is valid. Reusing such a "zombie connection" requires waiting for a full Socket Timeout before the failure is detected and a retry is triggered. **Troubleshooting Steps:** + - Verify if there are firewalls, NAT gateways, or Load Balancers between Doris FE and HMS. - Use the **Pulse (hms-tools)** diagnostic tool. If the tool shows fast network connectivity but stable delays that are multiples of 10s when executing RPCs after a long idle period, it confirms that idle connections are being silently reclaimed. **Solution:** + Configure the connection lifetime in your Catalog properties to be slightly shorter than the network device's idle timeout. We recommend using Hive's native socket lifetime property: ```sql @@ -327,11 +333,12 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- "hive.metastore.client.socket.lifetime" = "300s" ); ``` + When set, the HMS Client will check the connection age before sending an RPC. If it exceeds the `lifetime`, it proactively reconnects, avoiding long hangs and optimizer timeouts caused by stale connections. 
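The lifetime check described in the solution above can be sketched in a few lines. This is a minimal illustration of the idea only, not actual Doris or Hive client code; the `PooledClient` class and its method names are invented for this sketch:

```python
import time

class PooledClient:
    """Minimal sketch of an age-based connection check, illustrating what
    a socket-lifetime property does (not Doris/Hive source code)."""

    def __init__(self, lifetime_s):
        self.lifetime_s = lifetime_s
        self._connect()

    def _connect(self):
        # In a real client this would open a fresh thrift socket to HMS;
        # here we only record when the "connection" was created.
        self.created_at = time.monotonic()

    def ensure_fresh(self):
        """Call before each RPC: reconnect if the connection is too old."""
        age = time.monotonic() - self.created_at
        if age > self.lifetime_s:
            # The connection may have been silently reclaimed by a middlebox;
            # reconnect proactively instead of risking a socket-timeout hang.
            self._connect()
            return True   # reconnected
        return False      # reused the existing connection
```

Calling `ensure_fresh()` before every RPC bounds the age of any reused connection to `lifetime_s`, so a network device with, say, a 10-minute idle timeout never gets a chance to silently reclaim a connection the client still considers valid.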
->>>>>>> 30861289d66a (docs(faq): add HMS connection pool idle timeout and optimizer timeout issues) ## HDFS -1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`, in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2 or upgrade Doris to versions after 1.2.2. + +1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`, in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2, or upgrade Doris to versions after 1.2.2. 2. Using Hedged Read to optimize slow HDFS reads. In some cases, high load on HDFS may lead to longer read times for data replicas on a specific HDFS, thereby slowing down overall query efficiency. The HDFS Client provides the Hedged Read feature. This feature initiates another read thread to read the same data if a read request exceeds a certain threshold without returning, and the result returned first is used. @@ -339,12 +346,12 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- You can enable this feature by: - ``` - create catalog regression properties ( - 'type'='hms', + ```sql + CREATE CATALOG regression PROPERTIES ( + 'type' = 'hms', 'hive.metastore.uris' = 'thrift://172.21.16.47:7004', 'dfs.client.hedged.read.threadpool.size' = '128', - 'dfs.client.hedged.read.threshold.millis' = "500" + 'dfs.client.hedged.read.threshold.millis' = '500' ); ``` @@ -356,13 +363,13 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- `TotalHedgedRead`: Number of times Hedged Read was initiated. - `HedgedReadWins`: Number of successful Hedged Reads (times when the request was initiated and returned faster than the original request) + `HedgedReadWins`: Number of successful Hedged Reads (times when the request was initiated and returned faster than the original request). Note that these values are cumulative for a single HDFS Client, not for a single query. 
The same HDFS Client can be reused by multiple queries. 3. `Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider` - In the start scripts of FE and BE, the environment variable `HADOOP_CONF_DIR` is added to the CLASSPATH. If `HADOOP_CONF_DIR` is set incorrectly, such as pointing to a non-existent or incorrect path, it may load the wrong xxx-site.xml file, resulting in reading incorrect information. + In the startup scripts of FE and BE, the environment variable `HADOOP_CONF_DIR` is added to the CLASSPATH. If `HADOOP_CONF_DIR` is set incorrectly, such as pointing to a non-existent or incorrect path, it may load the wrong `xxx-site.xml` file, resulting in reading incorrect information. Check if `HADOOP_CONF_DIR` is configured correctly or remove this environment variable. @@ -370,7 +377,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Possible solutions include: - Use `hdfs fsck file -files -blocks -locations` to check if the file is healthy. - - Check connectivity with datanodes using `telnet`. + - Check connectivity with DataNodes using `telnet`. The following error may be printed in the error log: @@ -384,7 +391,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- At the same time, please check whether the parameter is true in the `hdfs-site.xml` file placed under `fe/conf` and `be/conf`. - - Check datanode logs. + - Check DataNode logs. If you encounter the following error: @@ -392,11 +399,11 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX. 
Perhaps the client is running an older version of Hadoop which does not support SASL data transfer protection ``` - it means that the current hdfs has enabled encrypted transmission, but the client has not, causing the error. + it means that the current HDFS has enabled encrypted transmission, but the client has not, causing the error. Use any of the following solutions: - Copy `hdfs-site.xml` and `core-site.xml` to `fe/conf` and `be/conf`. (Recommended) - - In `hdfs-site.xml`, find the corresponding configuration `dfs.data.transfer.protection` and set this parameter in the catalog. + - In `hdfs-site.xml`, find the corresponding configuration `dfs.data.transfer.protection` and set this parameter in the Catalog. 5. When querying a Hive Catalog table, an error occurs: `RPC response has a length of xxx exceeds maximum data length` @@ -433,13 +440,13 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- When querying Parquet files, due to potential differences in the format of Parquet files generated by different systems, such as the number of RowGroups, index values, etc., sometimes it is necessary to check the metadata of Parquet files for issue identification or performance analysis. Here is a tool provided to help users analyze Parquet files more conveniently: - 1. Download and unzip [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz) - 2. Download the Parquet file to be analyzed to your local machine, assuming the path is `/path/to/file.parquet` + 1. Download and unzip [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz). + 2. Download the Parquet file to be analyzed to your local machine, assuming the path is `/path/to/file.parquet`. 3. Use the following command to analyze the metadata of the Parquet file: `./parquet-tools meta /path/to/file.parquet` - 4. 
For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli) + 4. For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli). ## Diagnostic Tools diff --git a/versioned_docs/version-4.x/faq/lakehouse-faq.md b/versioned_docs/version-4.x/faq/lakehouse-faq.md index d18e17cb2cbda..23295d5e54b21 100644 --- a/versioned_docs/version-4.x/faq/lakehouse-faq.md +++ b/versioned_docs/version-4.x/faq/lakehouse-faq.md @@ -2,34 +2,34 @@ { "title": "Data Lakehouse FAQ", "language": "en", - "description": "This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps:" + "description": "Apache Doris Data Lakehouse FAQ and troubleshooting guide, covering certificate issues, Kerberos authentication, JDBC Catalog, Hive Catalog, HDFS, DLF Catalog, and diagnostic tools." } --- ## Certificate Issues 1. When querying, an error `curl 77: Problem with the SSL CA cert.` occurs. This indicates that the current system certificate is too old and needs to be updated locally. - - You can download the latest CA certificate from `https://curl.se/docs/caextract.html`. - - Place the downloaded `cacert-xxx.pem` into the `/etc/ssl/certs/` directory, for example: `sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`. + - You can download the latest CA certificate from `https://curl.se/docs/caextract.html`. + - Place the downloaded `cacert-xxx.pem` into the `/etc/ssl/certs/` directory, for example: `sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`. 2. When querying, an error occurs: `ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`. 
-``` -yum install -y ca-certificates -ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt -``` + ``` + yum install -y ca-certificates + ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt + ``` ## Kerberos 1. When connecting to a Hive Metastore authenticated with Kerberos, an error `GSS initiate failed` is encountered. - This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps: + This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps: 1. In versions prior to 1.2.1, the libhdfs3 library that Doris depends on did not enable gsasl. Please update to versions 1.2.2 and later. 2. Ensure that correct keytab and principal are set for each component and verify that the keytab file exists on all FE and BE nodes. - - `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`: Used for Hadoop hdfs access, fill in the corresponding values for hdfs. - - `hive.metastore.kerberos.principal`: Used for hive metastore. + 1. `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`: Used for Hadoop HDFS access, fill in the corresponding values for HDFS. + 2. `hive.metastore.kerberos.principal`: Used for Hive Metastore. 3. Try replacing the IP in the principal with a domain name (do not use the default `_HOST` placeholder). 4. Ensure that the `/etc/krb5.conf` file exists on all FE and BE nodes. @@ -37,54 +37,54 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 2. When connecting to a Hive database through the Hive Catalog, an error occurs: `RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`. If the error occurs during the query when there are no issues with `show databases` and `show tables`, follow these two steps: - - Place core-site.xml and hdfs-site.xml in the fe/conf and be/conf directories. 
- - Execute Kerberos kinit on the BE node, restart BE, and then proceed with the query. - - When encountering the error `GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)` while querying a table configured with Kerberos, restarting FE and BE nodes usually resolves the issue. - + - Place `core-site.xml` and `hdfs-site.xml` in the `fe/conf` and `be/conf` directories. + - Execute Kerberos `kinit` on the BE node, restart BE, and then proceed with the query. + +3. When encountering the error `GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)` while querying a table configured with Kerberos, restarting FE and BE nodes usually resolves the issue. + - Before restarting all nodes, configure `-Djavax.security.auth.useSubjectCredsOnly=false` in the JAVA_OPTS parameter in `"${DORIS_HOME}/be/conf/be.conf"` to obtain JAAS credentials information through the underlying mechanism rather than the application. - Refer to [JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html) for solutions to common JAAS errors. - To resolve the error `Unable to obtain password from user` when configuring Kerberos in the Catalog: - +4. To resolve the error `Unable to obtain password from user` when configuring Kerberos in the Catalog: + - Ensure the principal used is listed in klist by checking with `klist -kt your.keytab`. - - Verify the catalog configuration for any missing settings such as `yarn.resourcemanager.principal`. + - Verify the Catalog configuration for any missing settings such as `yarn.resourcemanager.principal`. - If the above checks are fine, it may be due to the JDK version installed by the system's package manager not supporting certain encryption algorithms. Consider installing JDK manually and setting the `JAVA_HOME` environment variable. - Kerberos typically uses AES-256 for encryption. 
For Oracle JDK, JCE must be installed. Some distributions of OpenJDK automatically provide unlimited strength JCE, eliminating the need for separate installation. - JCE versions correspond to JDK versions; download the appropriate JCE zip package and extract it to the `$JAVA_HOME/jre/lib/security` directory based on the JDK version: - - JDK6: [JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) - - JDK7: [JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) - - JDK8: [JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) - - When encountering the error `java.security.InvalidKeyException: Illegal key size` while accessing HDFS with KMS, upgrade the JDK version to >= Java 8 u162 or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files. - - If configuring Kerberos in the Catalog results in the error `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`, place the `core-site.xml` file in the `"${DORIS_HOME}/be/conf"` directory. - + - JDK6: [JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html) + - JDK7: [JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html) + - JDK8: [JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html) + +5. When encountering the error `java.security.InvalidKeyException: Illegal key size` while accessing HDFS with KMS, upgrade the JDK version to >= Java 8 u162, or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files. + +6. If configuring Kerberos in the Catalog results in the error `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`, place the `core-site.xml` file in the `"${DORIS_HOME}/be/conf"` directory. 
+ If accessing HDFS results in the error `No common protection layer between client and server`, ensure that the `hadoop.rpc.protection` properties on the client and server are consistent. - - ``` + + ```xml - + - + hadoop.security.authentication kerberos - + ``` - - When using Broker Load with Kerberos configured and encountering the error `Cannot locate default realm.`: - + +7. When using Broker Load with Kerberos configured and encountering the error `Cannot locate default realm.`: + Add the configuration item `-Djava.security.krb5.conf=/your-path` to the `JAVA_OPTS` in the `start_broker.sh` script for Broker Load. -3. When using Kerberos configuration in the Catalog, the `hadoop.username` property cannot be used simultaneously. +8. When using Kerberos configuration in the Catalog, the `hadoop.username` property cannot be used simultaneously. -4. Accessing Kerberos with JDK 17 +9. Accessing Kerberos with JDK 17 - When running Doris with JDK 17 and accessing Kerberos services, you may encounter issues accessing due to the use of deprecated encryption algorithms. You need to add the `allow_weak_crypto=true` property in krb5.conf or upgrade the encryption algorithm in Kerberos. + When running Doris with JDK 17 and accessing Kerberos services, you may encounter issues due to the use of deprecated encryption algorithms. You need to add the `allow_weak_crypto=true` property in `krb5.conf`, or upgrade the encryption algorithm in Kerberos. For more details, refer to: @@ -92,52 +92,52 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 1. Error connecting to SQLServer via JDBC Catalog: `unable to find valid certification path to requested target` - Add the `trustServerCertificate=true` option in the `jdbc_url`. + Add the `trustServerCertificate=true` option in the `jdbc_url`. 2. 
Connecting to MySQL database via JDBC Catalog results in Chinese character garbling or incorrect Chinese character query conditions - Add `useUnicode=true&characterEncoding=utf-8` in the `jdbc_url`. + Add `useUnicode=true&characterEncoding=utf-8` in the `jdbc_url`. - > Note: Starting from version 1.2.3, when connecting to MySQL database via JDBC Catalog, these parameters will be automatically added. + > Note: Starting from version 1.2.3, when connecting to MySQL database via JDBC Catalog, these parameters will be automatically added. 3. Error connecting to MySQL database via JDBC Catalog: `Establishing SSL connection without server's identity verification is not recommended` - Add `useSSL=true` in the `jdbc_url`. + Add `useSSL=true` in the `jdbc_url`. -4. When synchronizing MySQL data to Doris using JDBC Catalog, date data synchronization error occurs. Verify if the MySQL version matches the MySQL driver package, for example, MySQL 8 and above require the driver com.mysql.cj.jdbc.Driver. +4. When synchronizing MySQL data to Doris using JDBC Catalog, date data synchronization error occurs. Verify if the MySQL version matches the MySQL driver package, for example, MySQL 8 and above require the driver `com.mysql.cj.jdbc.Driver`. 5. When a single field is too large, a Java memory OOM occurs on the BE side during a query. - When Jdbc Scanner reads data through JDBC, the session variable `batch_size` determines the number of rows processed in the JVM per batch. If a single field is too large, it may cause `field_size * batch_size` (approximate value, considering JVM static memory and data copy overhead) to exceed the JVM memory limit, resulting in OOM. + When JDBC Scanner reads data through JDBC, the Session Variable `batch_size` determines the number of rows processed in the JVM per batch. 
If a single field is too large, it may cause `field_size * batch_size` (approximate value, considering JVM static memory and data copy overhead) to exceed the JVM memory limit, resulting in OOM. - Solutions: + Solutions: - - Reduce the `batch_size` value by executing `set batch_size = 512;`. The default value is 4064. - - Increase the BE JVM memory by modifying the `-Xmx` parameter in `JAVA_OPTS`. For example: `-Xmx8g`. + - Reduce the `batch_size` value by executing `set batch_size = 512;`. The default value is 4064. + - Increase the BE JVM memory by modifying the `-Xmx` parameter in `JAVA_OPTS`. For example: `-Xmx8g`. ## Hive Catalog 1. Accessing Iceberg or Hive table through Hive Catalog reports an error: `failed to get schema` or `Storage schema reading not supported` You can try the following methods: - - * Put the `iceberg` runtime-related jar package in the lib/ directory of Hive. - - * Configure in `hive-site.xml`: - + + - Put the `iceberg` runtime-related jar package in the `lib/` directory of Hive. + + - Configure in `hive-site.xml`: + ``` metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader ``` - + After the configuration is completed, you need to restart the Hive Metastore. - - * Add `"get_schema_from_table" = "true"` in the Catalog properties - + + - Add `"get_schema_from_table" = "true"` in the Catalog properties. + This parameter is supported since versions 2.1.10 and 3.0.6. 2. 
Error connecting to Hive Catalog: `Caused by: java.lang.NullPointerException` - If the fe.log contains the following stack trace: + If the fe.log contains the following stack trace: ``` Caused by: java.lang.NullPointerException @@ -148,17 +148,17 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181] ``` - Try adding `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` in the `create catalog` statement to resolve. + Try adding `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` in the `CREATE CATALOG` statement to resolve. 3. If after creating Hive Catalog, `show tables` works fine but querying results in `java.net.UnknownHostException: xxxxx` - Add the following in the CATALOG's PROPERTIES: + Add the following in the Catalog's PROPERTIES: ``` 'fs.defaultFS' = 'hdfs://' ``` -4. Tables in orc format in Hive 1.x may encounter system column names in the underlying orc file schema as `_col0`, `_col1`, `_col2`, etc. In this case, add `hive.version` as 1.x.x in the catalog configuration to map with the column names in the hive table. +4. Tables in ORC format in Hive 1.x may encounter system column names in the underlying ORC file Schema as `_col0`, `_col1`, `_col2`, etc. In this case, add `hive.version` as 1.x.x in the Catalog configuration to map with the column names in the Hive table. ```sql CREATE CATALOG hive PROPERTIES ( @@ -199,15 +199,15 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- } ``` -8. When querying a Hive external table, if you encounter the error `java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found`, search for `hadoop-lzo-*.jar` in the Hadoop environment, place it in the `"${DORIS_HOME}/fe/lib/"` directory, and restart the FE. 
Starting from version 2.0.2, you can place this file in the `custom_lib/` directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the lib directory being replaced. +8. When querying a Hive external table, if you encounter the error `java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found`, search for `hadoop-lzo-*.jar` in the Hadoop environment, place it in the `"${DORIS_HOME}/fe/lib/"` directory, and restart the FE. Starting from version 2.0.2, you can place this file in the `custom_lib/` directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the `lib` directory being replaced. -9. When creating a Hive table specifying the serde as `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`, and encountering the error `storage schema reading not supported` when accessing the table, add the following configuration to the hive-site.xml file and restart the HMS service: +9. When creating a Hive table specifying the serde as `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`, and encountering the error `storage schema reading not supported` when accessing the table, add the following configuration to the `hive-site.xml` file and restart the HMS service: - ``` + ```xml - metastore.storage.schema.reader.impl - org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - + metastore.storage.schema.reader.impl + org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` 10. Error: `java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty`. 
The complete error message in the FE log is as follows: @@ -222,7 +222,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty ``` - Try updating the CA certificate on the FE node using `update-ca-trust (CentOS/RockyLinux)`, and then restart the FE process. + Try updating the CA certificate on the FE node using `update-ca-trust` (CentOS/RockyLinux), and then restart the FE process. 11. BE error: `java.lang.InternalError`. If you see an error similar to the following in `be.INFO`: @@ -243,7 +243,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 12. When inserting data into Hive, an error occurred as `HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`. - Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the alter privilege. Therefore, the alter privilege needs to be added for this user on Ranger. + Since after inserting the data, the corresponding statistical information needs to be updated, and this update operation requires the ALTER privilege. Therefore, the ALTER privilege needs to be added for this user on Ranger. 13. When querying ORC files, if an error like `Orc row reader nextBatch failed. reason = Can't open /usr/share/zoneinfo/+08:00` @@ -253,18 +253,21 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. -14. 
**When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** +15. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** **Root Cause Analysis:** + This issue is usually not caused by slow execution of the HMS RPC itself. Instead, the most common root cause is **incorrect DNS configuration on the Doris FE node**. During the initialization phase of the Hive Metastore Client, hostname resolution is triggered. If the configured DNS server is unreachable or unresponsive, it causes a DNS resolution timeout (typically 10 seconds) every time a new HMS client connection is established, which severely slows down metadata fetching. **Typical Symptoms:** + - **Normal Network Connectivity:** The HMS port is reachable, but metadata access in Doris remains extremely slow. - **Consistent Delay:** The delay consistently hits a fixed timeout threshold (e.g., 10 seconds). - **Workarounds Fail:** Simply increasing the HMS client timeout parameter in the Catalog properties only masks the error but does not eliminate the fixed 10-second delay on each connection. **Troubleshooting Steps:** + Run the following commands on the Doris FE node to verify the DNS and hostname resolution: ```bash @@ -281,8 +284,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- 2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node. 3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property. -<<<<<<< HEAD -14. 
When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` +16. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris: @@ -303,20 +305,24 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ``` This parameter is supported since versions 2.1.10 and 3.0.6. -======= -15. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.** + +17. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.** **Problem Description:** + This usually happens after the Catalog has been idle for a while. When an HMS RPC is initiated, if a stale connection from the pool is reused, the request will hang for the duration of the Socket Timeout (default 10s). Due to the Hive Client's internal retry mechanism, this can result in cumulative waits of 20-30 seconds if multiple retries occur. This causes the query planning phase to be extremely slow, often triggering the Doris FE optimizer timeout error `nereids cost too much time`. Once the connection is purged and rebuilt, performance returns to normal. **Root Cause Analysis:** - Doris maintains a Client Pool for each HMS Catalog to reuse connections. 
In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), idle TCP connections are "silently" reclaimed by network devices after an `idle timeout`. Since these devices typically do not send FIN/RST packets to notify the endpoints, Doris still believes the connection is valid. Reusing such a "zombie connection" requires waiting for a full Socket Timeout before the failure is detected and a retry is triggered. + + Doris maintains a Client Pool for each HMS Catalog to reuse connections. In complex network environments (e.g., across VPCs, through firewalls, or NAT gateways), idle TCP connections are often "silently" reclaimed by network devices after an `idle timeout`. Since these devices typically do not send FIN/RST packets to notify the endpoints, Doris still believes the connection is valid. Reusing such a "zombie connection" requires waiting for a full Socket Timeout before the failure is detected and a retry is triggered. **Troubleshooting Steps:** + - Verify if there are firewalls, NAT gateways, or Load Balancers between Doris FE and HMS. - Use the **Pulse (hms-tools)** diagnostic tool. If the tool shows fast network connectivity but stable delays that are multiples of 10s when executing RPCs after a long idle period, it confirms that idle connections are being silently reclaimed. **Solution:** + Configure the connection lifetime in your Catalog properties to be slightly shorter than the network device's idle timeout. We recommend using Hive's native socket lifetime property: ```sql @@ -327,12 +333,12 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- "hive.metastore.client.socket.lifetime" = "300s" ); ``` + When set, the HMS Client will check the connection age before sending an RPC. If it exceeds the `lifetime`, it proactively reconnects, avoiding long hangs and optimizer timeouts caused by stale connections. 
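As a complement to the troubleshooting steps above, it can help to time DNS resolution and TCP connect separately when probing the HMS endpoint: a slow `getaddrinfo()` points at resolver misconfiguration (the DNS symptom in item 15), while a slow `connect()` points at the network path itself. A minimal sketch, illustrative only; `probe_hms` is not part of Doris or Pulse, and the host and port below are placeholders:

```python
import socket
import time

def probe_hms(host, port, timeout=5.0):
    """Return (dns_ms, tcp_ms): how long name resolution and TCP
    connection setup each took, in milliseconds."""
    t0 = time.monotonic()
    # Resolve first, so the connect below measures pure TCP setup.
    addr = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4]
    dns_ms = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    with socket.create_connection(addr[:2], timeout=timeout):
        tcp_ms = (time.monotonic() - t1) * 1000
    return dns_ms, tcp_ms
```

For example, `probe_hms("hms-host.example.com", 9083)` (hypothetical host, default HMS thrift port) returning a DNS time near 10,000 ms alongside a fast TCP time matches the fixed DNS-timeout symptom, while fast results for both on a freshly probed endpoint are consistent with the idle-connection scenario described here.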
->>>>>>> 30861289d66a (docs(faq): add HMS connection pool idle timeout and optimizer timeout issues) ## HDFS -1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`, in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2 or upgrade Doris to versions after 1.2.2. +1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`, this is because, in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update the Hadoop dependency to 2.10.2, or upgrade Doris to 1.2.2 or later. 2. Using Hedged Read to optimize slow HDFS reads. In some cases, high load on HDFS may lead to longer read times for data replicas on a specific HDFS, thereby slowing down overall query efficiency. The HDFS Client provides the Hedged Read feature. This feature initiates another read thread to read the same data if a read request exceeds a certain threshold without returning, and the result returned first is used. @@ -340,12 +346,12 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- You can enable this feature by: - ``` - create catalog regression properties ( - 'type'='hms', + ```sql + CREATE CATALOG regression PROPERTIES ( + 'type' = 'hms', 'hive.metastore.uris' = 'thrift://172.21.16.47:7004', 'dfs.client.hedged.read.threadpool.size' = '128', - 'dfs.client.hedged.read.threshold.millis' = '500' ); ```
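    The hedged-read mechanism itself can be sketched in a few lines. This is a simplified illustration of the idea, not the HDFS client's implementation: if the primary read has not returned within the threshold, a second identical read is started, and whichever finishes first wins.

    ```python
    import concurrent.futures


    def hedged_read(read_fn, threshold_s, executor):
        """Illustrative sketch of hedged reads (not the HDFS client code).

        read_fn: a zero-argument callable performing one read attempt.
        threshold_s: how long to wait before hedging (cf. threshold.millis).
        executor: thread pool for read attempts (cf. threadpool.size).
        """
        # Start the primary read.
        primary = executor.submit(read_fn)
        try:
            # Fast path: the primary read returns within the threshold.
            return primary.result(timeout=threshold_s)
        except concurrent.futures.TimeoutError:
            # Slow path: launch a hedge and take whichever finishes first.
            # (A "HedgedReadWins"-style counter would tick when the hedge wins.)
            hedge = executor.submit(read_fn)
            done, _ = concurrent.futures.wait(
                [primary, hedge],
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            return next(iter(done)).result()
    ```

    In this model, `threshold_s` plays the role of `dfs.client.hedged.read.threshold.millis` and the executor's size that of `dfs.client.hedged.read.threadpool.size`: a larger pool allows more concurrent hedges, at the cost of extra load on the storage layer.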
The same HDFS Client can be reused by multiple queries. 3. `Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider` - In the start scripts of FE and BE, the environment variable `HADOOP_CONF_DIR` is added to the CLASSPATH. If `HADOOP_CONF_DIR` is set incorrectly, such as pointing to a non-existent or incorrect path, it may load the wrong xxx-site.xml file, resulting in reading incorrect information. + In the startup scripts of FE and BE, the environment variable `HADOOP_CONF_DIR` is added to the CLASSPATH. If `HADOOP_CONF_DIR` is set incorrectly, such as pointing to a non-existent or incorrect path, it may load the wrong `xxx-site.xml` file, resulting in reading incorrect information. Check if `HADOOP_CONF_DIR` is configured correctly or remove this environment variable. @@ -371,7 +377,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- Possible solutions include: - Use `hdfs fsck file -files -blocks -locations` to check if the file is healthy. - - Check connectivity with datanodes using `telnet`. + - Check connectivity with DataNodes using `telnet`. The following error may be printed in the error log: @@ -385,7 +391,7 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- At the same time, please check whether the parameter is true in the `hdfs-site.xml` file placed under `fe/conf` and `be/conf`. - - Check datanode logs. + - Check DataNode logs. If you encounter the following error: @@ -393,11 +399,11 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX. 
Perhaps the client is running an older version of Hadoop which does not support SASL data transfer protection ``` - it means that the current hdfs has enabled encrypted transmission, but the client has not, causing the error. + it means that the current HDFS has enabled encrypted transmission, but the client has not, causing the error. Use any of the following solutions: - Copy `hdfs-site.xml` and `core-site.xml` to `fe/conf` and `be/conf`. (Recommended) - - In `hdfs-site.xml`, find the corresponding configuration `dfs.data.transfer.protection` and set this parameter in the catalog. + - In `hdfs-site.xml`, find the corresponding configuration `dfs.data.transfer.protection` and set this parameter in the Catalog. 5. When querying a Hive Catalog table, an error occurs: `RPC response has a length of xxx exceeds maximum data length` @@ -434,13 +440,13 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- When querying Parquet files, due to potential differences in the format of Parquet files generated by different systems, such as the number of RowGroups, index values, etc., sometimes it is necessary to check the metadata of Parquet files for issue identification or performance analysis. Here is a tool provided to help users analyze Parquet files more conveniently: - 1. Download and unzip [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz) - 2. Download the Parquet file to be analyzed to your local machine, assuming the path is `/path/to/file.parquet` + 1. Download and unzip [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz). + 2. Download the Parquet file to be analyzed to your local machine, assuming the path is `/path/to/file.parquet`. 3. Use the following command to analyze the metadata of the Parquet file: `./parquet-tools meta /path/to/file.parquet` - 4. 
For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli) + 4. For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli). ## Diagnostic Tools From 4362a70b981c97fd893135089f53dfb38255db07 Mon Sep 17 00:00:00 2001 From: "Mingyu Chen (Rayner)" Date: Mon, 30 Mar 2026 14:26:57 -0700 Subject: [PATCH 4/4] opt --- docs/faq/lakehouse-faq.md | 48 +++++++++---------- .../current/faq/lakehouse-faq.md | 2 +- .../version-3.x/faq/lakehouse-faq.md | 2 +- .../version-4.x/faq/lakehouse-faq.md | 2 +- .../version-3.x/faq/lakehouse-faq.md | 48 +++++++++---------- .../version-4.x/faq/lakehouse-faq.md | 48 +++++++++---------- 6 files changed, 75 insertions(+), 75 deletions(-) diff --git a/docs/faq/lakehouse-faq.md b/docs/faq/lakehouse-faq.md index 23295d5e54b21..e87329d64c9f2 100644 --- a/docs/faq/lakehouse-faq.md +++ b/docs/faq/lakehouse-faq.md @@ -253,7 +253,29 @@ If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. -15. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** +14. 
When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` + + When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris: + + ``` + errCode = 2, detailMessage = failed to get schema for table xxx in db xxx. + reason: org.apache.hadoop.hive.metastore.api.MetaException: + java.lang.UnsupportedOperationException: Storage schema reading not supported + ``` + + This can be resolved by adding `"get_schema_from_table" = "true"` in the Catalog properties. This parameter instructs Doris to retrieve the schema directly from the Hive table metadata instead of relying on the underlying storage's Schema Reader. + + ```sql + CREATE CATALOG hive PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.uris' = 'thrift://x.x.x.x:9083', + 'get_schema_from_table' = 'true' + ); + ``` + + This parameter is supported since versions 2.1.10 and 3.0.6. + +15. When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds). **Root Cause Analysis:** @@ -284,29 +306,7 @@ 2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node. 3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property. -16. 
When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` - - When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris: - - ``` - errCode = 2, detailMessage = failed to get schema for table xxx in db xxx. - reason: org.apache.hadoop.hive.metastore.api.MetaException: - java.lang.UnsupportedOperationException: Storage schema reading not supported - ``` - - This can be resolved by adding `"get_schema_from_table" = "true"` in the Catalog properties. This parameter instructs Doris to retrieve the schema directly from the Hive table metadata instead of relying on the underlying storage's Schema Reader. - - ```sql - CREATE CATALOG hive PROPERTIES ( - 'type' = 'hms', - 'hive.metastore.uris' = 'thrift://x.x.x.x:9083', - 'get_schema_from_table' = 'true' - ); - ``` - - This parameter is supported since versions 2.1.10 and 3.0.6. - -17. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.** +16. Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after. **Problem Description:** diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md index c4a94de62e060..869b974744860 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md @@ -328,7 +328,7 @@ 2. 
**配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。 3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。 -16. **查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。** +16. 查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。 **问题描述:** diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md index c4a94de62e060..869b974744860 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/faq/lakehouse-faq.md @@ -328,7 +328,7 @@ 2. **配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。 3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。 -16. **查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。** +16. 查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。 **问题描述:** diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md index c4a94de62e060..869b974744860 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/faq/lakehouse-faq.md @@ -328,7 +328,7 @@ 2. **配置 Hosts 静态映射**:在 FE 节点的 `/etc/hosts` 中添加 HMS 节点的 IP 与 Hostname 映射。 3. **规范 Catalog 配置**:创建 Catalog 时,`hive.metastore.uris` 参数建议优先使用正确的 Hostname 而不是裸 IP。 -16. **查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。** +16. 
查询 Hive Catalog 表时,偶尔出现查询长时间卡住,或者直接报错优化器超时 `nereids cost too much time`,但紧接着再次查询又恢复正常。 **问题描述:** diff --git a/versioned_docs/version-3.x/faq/lakehouse-faq.md b/versioned_docs/version-3.x/faq/lakehouse-faq.md index 23295d5e54b21..e87329d64c9f2 100644 --- a/versioned_docs/version-3.x/faq/lakehouse-faq.md +++ b/versioned_docs/version-3.x/faq/lakehouse-faq.md @@ -253,7 +253,29 @@ If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. -15. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** +14. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` + + When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris: + + ``` + errCode = 2, detailMessage = failed to get schema for table xxx in db xxx. + reason: org.apache.hadoop.hive.metastore.api.MetaException: + java.lang.UnsupportedOperationException: Storage schema reading not supported + ``` + + This can be resolved by adding `"get_schema_from_table" = "true"` in the Catalog properties. This parameter instructs Doris to retrieve the schema directly from the Hive table metadata instead of relying on the underlying storage's Schema Reader. 
+ + ```sql + CREATE CATALOG hive PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.uris' = 'thrift://x.x.x.x:9083', + 'get_schema_from_table' = 'true' + ); + ``` + + This parameter is supported since versions 2.1.10 and 3.0.6. + +15. When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds). **Root Cause Analysis:** @@ -284,29 +306,7 @@ 2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node. 3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property. -16. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` - - When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris: - - ``` - errCode = 2, detailMessage = failed to get schema for table xxx in db xxx. - reason: org.apache.hadoop.hive.metastore.api.MetaException: - java.lang.UnsupportedOperationException: Storage schema reading not supported - ``` - - This can be resolved by adding `"get_schema_from_table" = "true"` in the Catalog properties. This parameter instructs Doris to retrieve the schema directly from the Hive table metadata instead of relying on the underlying storage's Schema Reader. - - ```sql - CREATE CATALOG hive PROPERTIES ( - 'type' = 'hms', - 'hive.metastore.uris' = 'thrift://x.x.x.x:9083', - 'get_schema_from_table' = 'true' - ); - ``` - - This parameter is supported since versions 2.1.10 and 3.0.6. - -17. 
**Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.** +16. Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after. **Problem Description:** diff --git a/versioned_docs/version-4.x/faq/lakehouse-faq.md b/versioned_docs/version-4.x/faq/lakehouse-faq.md index 23295d5e54b21..e87329d64c9f2 100644 --- a/versioned_docs/version-4.x/faq/lakehouse-faq.md +++ b/versioned_docs/version-4.x/faq/lakehouse-faq.md @@ -253,7 +253,29 @@ If the session timezone is already set to `Asia/Shanghai` but the query still fails, it indicates that the ORC file was generated with the timezone `+08:00`. During query execution, this timezone is required when parsing the ORC footer. In this case, you can try creating a symbolic link under the `/usr/share/zoneinfo/` directory that points `+08:00` to an equivalent timezone. -15. **When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds).** +14. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` + + When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris: + + ``` + errCode = 2, detailMessage = failed to get schema for table xxx in db xxx. 
+ reason: org.apache.hadoop.hive.metastore.api.MetaException: + java.lang.UnsupportedOperationException: Storage schema reading not supported + ``` + + This can be resolved by adding `"get_schema_from_table" = "true"` in the Catalog properties. This parameter instructs Doris to retrieve the schema directly from the Hive table metadata instead of relying on the underlying storage's Schema Reader. + + ```sql + CREATE CATALOG hive PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.uris' = 'thrift://x.x.x.x:9083', + 'get_schema_from_table' = 'true' + ); + ``` + + This parameter is supported since versions 2.1.10 and 3.0.6. + +15. When querying Hive Catalog tables, query planning is extremely slow, the `nereids cost too much time` error occurs, and each HMS access takes a consistently long time (e.g., around 10 seconds). **Root Cause Analysis:** @@ -284,29 +306,7 @@ 2. **Configure Static Hosts Mapping:** Add the IP and Hostname mapping of the HMS nodes to `/etc/hosts` on the FE node. 3. **Standardize Catalog Properties:** When creating the Catalog, it is highly recommended to use a resolvable hostname instead of a bare IP address for the `hive.metastore.uris` property. -16. When querying a Hive table that uses JSON SerDe (e.g., `org.openx.data.jsonserde.JsonSerDe`), an error occurs: `failed to get schema` or `Storage schema reading not supported` - - When a Hive table uses JSON format storage (ROW FORMAT SERDE is `org.openx.data.jsonserde.JsonSerDe`), the Hive Metastore may not be able to read the table's schema information through the default method, causing the following error when querying from Doris: - - ``` - errCode = 2, detailMessage = failed to get schema for table xxx in db xxx. - reason: org.apache.hadoop.hive.metastore.api.MetaException: - java.lang.UnsupportedOperationException: Storage schema reading not supported - ``` - - This can be resolved by adding `"get_schema_from_table" = "true"` in the Catalog properties. 
This parameter instructs Doris to retrieve the schema directly from the Hive table metadata instead of relying on the underlying storage's Schema Reader. - - ```sql - CREATE CATALOG hive PROPERTIES ( - 'type' = 'hms', - 'hive.metastore.uris' = 'thrift://x.x.x.x:9083', - 'get_schema_from_table' = 'true' - ); - ``` - - This parameter is supported since versions 2.1.10 and 3.0.6. - -17. **Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after.** +16. Queries on Hive Catalog tables occasionally experience extremely long hangs or directly report the optimizer timeout error `nereids cost too much time`, but subsequent queries work fine immediately after. **Problem Description:**