HDDS-14103. Create an option in SCM to ack/ignore missing containers #9719
sarvekshayr wants to merge 2 commits into apache:master
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerInfo.java
I am not sure about this idea. Surely, if the container is missing and all efforts have been made to ensure there are no copies that can be recovered, the correct thing to do is to remove the container from the system?
@sodonnel I agree that safely removing the container would be the best long-term solution. However, implementing that robustly is more complicated. Even if all the keys are deleted from OM, SCM won't have any DNs to send the block delete requests to, and those DNs cannot tell SCM that their replicas are empty and safe to be deleted. We therefore need a check for orphan containers in between SCM and OM that handles the cleanup. I don't think we want to allow admins to manually remove containers from the system based on their own investigation.
priyeshkaratha
Thanks @sarvekshayr for the patch. I left a few comments about the admin check and some other changes needed before this is good to go. Please have a look at those.
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerInfo.java
public void execute(ScmClient scmClient) throws IOException {
  if (list) {
    // List acknowledged containers
    ContainerListResult result = scmClient.listContainer(1, Integer.MAX_VALUE);
Fetching all containers and filtering them on the client side can be inefficient if the cluster has a large number of containers. Consider adding a server-side filter to listContainer to fetch only the containers with ackMissing=true. This would require changes to the StorageContainerLocationProtocol.
If it is difficult, I am ok with the current changes since it is used by CLI tool only.
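To make the trade-off concrete, here is a minimal, self-contained sketch of the two approaches. Everything in it is hypothetical and for illustration only: `Container`, `FakeServer`, `listAll`, and `listAckedMissing` are stand-ins, not the real `ContainerInfo`, `ScmClient`, or `StorageContainerLocationProtocol` APIs. It only shows that both approaches yield the same IDs while the server-side filter returns far fewer records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class AckFilterSketch {

  /** Hypothetical stand-in for a container record; not the real ContainerInfo. */
  public static final class Container {
    public final long id;
    public final boolean ackMissing;

    public Container(long id, boolean ackMissing) {
      this.id = id;
      this.ackMissing = ackMissing;
    }
  }

  /** Hypothetical in-memory "server" that counts how many records it returns. */
  public static final class FakeServer {
    private final List<Container> containers;
    public long recordsSent = 0;

    public FakeServer(List<Container> containers) {
      this.containers = containers;
    }

    // Current approach: return every container and let the caller filter.
    public List<Container> listAll() {
      recordsSent += containers.size();
      return new ArrayList<>(containers);
    }

    // Suggested approach: apply the filter before anything is returned.
    public List<Container> listAckedMissing() {
      List<Container> acked = containers.stream()
          .filter(c -> c.ackMissing)
          .collect(Collectors.toList());
      recordsSent += acked.size();
      return acked;
    }
  }

  /** Client-side filtering: fetch everything, then keep only acked containers. */
  public static List<Long> clientSideFilter(FakeServer server) {
    return server.listAll().stream()
        .filter(c -> c.ackMissing)
        .map(c -> c.id)
        .collect(Collectors.toList());
  }

  /** Server-side filtering: only acked containers are returned to the caller. */
  public static List<Long> serverSideFilter(FakeServer server) {
    return server.listAckedMissing().stream()
        .map(c -> c.id)
        .collect(Collectors.toList());
  }

  /** Builds a server with IDs 1..total where every 100th container is acked. */
  public static FakeServer sampleServer(int total) {
    List<Container> all = new ArrayList<>();
    for (long id = 1; id <= total; id++) {
      all.add(new Container(id, id % 100 == 0));
    }
    return new FakeServer(all);
  }

  public static void main(String[] args) {
    FakeServer a = sampleServer(1000);
    FakeServer b = sampleServer(1000);
    System.out.println("same result: " + clientSideFilter(a).equals(serverSideFilter(b)));
    System.out.println("records sent, client-side filter: " + a.recordsSent);
    System.out.println("records sent, server-side filter: " + b.recordsSent);
  }
}
```

In this toy setup the result is identical either way; the difference is only in how many records cross the (simulated) wire, which is why the current approach is tolerable for an infrequently used CLI command.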
Do we plan to display the list of acknowledged missing containers in the Recon UI?
If yes, I can add the required server-side filter. If not, since this isn’t a frequently used command, we can retain the current implementation.
cc: @errose28 @devmadhuu
What changes were proposed in this pull request?
Ozone currently has no way to clear missing containers from the system. Even if all the data is deleted from the OM, the block deletes will never leave SCM because it has no replicas to send them to.
As a short term mitigation, we added a CLI to SCM that supports "acking" missing containers by ID if the admin confirms they are not a problem, so they do not mask future issues. This would remove them from ozone admin container report output and the missing container count metric. This would need to be persisted in the ContainerInfo in SCM, and we show this property in ozone admin container info. There is also a CLI to raise containers as an issue again and to query the list of acked missing containers.
What is the link to the Apache JIRA
HDDS-14103
How was this patch tested?
Container report shows 1 MISSING container.
Acknowledge the container as MISSING as it is not an issue.
Container report removes # 1 as MISSING.
Unacknowledge the container as MISSING as it is problematic.
Container report adds # 1 as MISSING again.
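The acknowledge/unacknowledge sequence above can be modelled with a small in-memory sketch. This is not the code in the patch: `AckMissingSketch` and all of its methods are hypothetical names, and rejecting an ack for a container that is not missing is an assumption of the sketch rather than confirmed behaviour. It only illustrates the intended semantics: an acked container stays known to the system but no longer counts toward the reported MISSING total.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AckMissingSketch {
  // Container ID -> acknowledged flag, for containers with no known replicas.
  private final Map<Long, Boolean> missing = new TreeMap<>();

  /** Records a container as missing; it starts out unacknowledged. */
  public void markMissing(long id) {
    missing.putIfAbsent(id, false);
  }

  public void acknowledge(long id) {
    setAck(id, true);
  }

  public void unacknowledge(long id) {
    setAck(id, false);
  }

  private void setAck(long id, boolean ack) {
    // Assumption of this sketch: only missing containers can be (un)acked.
    if (!missing.containsKey(id)) {
      throw new IllegalArgumentException("Container " + id + " is not missing");
    }
    missing.put(id, ack);
  }

  /** Count shown in the report: missing containers that are not acked. */
  public long reportedMissingCount() {
    return missing.values().stream().filter(acked -> !acked).count();
  }

  /** IDs of missing containers that an admin has acknowledged. */
  public List<Long> listAcknowledged() {
    List<Long> ids = new ArrayList<>();
    for (Map.Entry<Long, Boolean> e : missing.entrySet()) {
      if (e.getValue()) {
        ids.add(e.getKey());
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    AckMissingSketch scm = new AckMissingSketch();
    scm.markMissing(1L);
    System.out.println(scm.reportedMissingCount()); // 1
    scm.acknowledge(1L);
    System.out.println(scm.reportedMissingCount()); // 0
    System.out.println(scm.listAcknowledged());     // [1]
    scm.unacknowledge(1L);
    System.out.println(scm.reportedMissingCount()); // 1
  }
}
```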