[#2695] feat(doc): Add docs for fileset catalog #2781

Merged
jerryshao merged 6 commits into apache:main from jerryshao:issue-2695
Apr 3, 2024

Conversation

@jerryshao
Contributor

What changes were proposed in this pull request?

This PR proposes to add docs for fileset catalog.

Why are the changes needed?

Fix: #2695

Does this PR introduce any user-facing change?

No.

How was this patch tested?

No.

@jerryshao jerryshao self-assigned this Apr 2, 2024
@jerryshao
Contributor Author

@coolderli @xloya would you please help to review? Thanks.

@coolderli
Contributor

@jerryshao Do we need to document how to use Fileset with the Spark engine? In addition, I have already tested TensorFlow and submitted an MR: tensorflow/io#1970. After https://github.com/datastrato/gravitino/pull/2779 is resolved, we can support TensorFlow. I think we can add a doc like https://help.aliyun.com/zh/hdfs/using-tensorflow-on?spm=a2c4g.11186623.0.i6. What do you think?

@jerryshao
Contributor Author

> @jerryshao Do we need to document how to use Fileset with the Spark engine? In addition, I have already tested TensorFlow and submitted an MR: tensorflow/io#1970. After #2779 is resolved, we can support TensorFlow. I think we can add a doc like https://help.aliyun.com/zh/hdfs/using-tensorflow-on?spm=a2c4g.11186623.0.i6. What do you think?

I would suggest having a separate doc about gvfs and adding the Spark- and TensorFlow-related content there.

Comment thread docs/manage-fileset-metadata-using-gravitino.md Outdated

FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
NameIdentifier[] identifiers =
    filesetCatalog.listFilesets(Namespace.ofFileset("metalake", "catalog", "schema"));
Contributor

I think there's another issue: the metalake in the Namespace seems redundant, since we already declare the name of the current metalake when creating the new GravitinoClient. It is not related to this MR, so never mind.

Contributor Author

Yeah, it is unrelated, and we are working on refactoring the client.

@coolderli
Contributor

@jerryshao Left some comments. Overall, it looks good to me.

@jerryshao
Contributor Author

@shaofengshi would you please also check the Java client part? Thanks.

shaofengshi previously approved these changes Apr 3, 2024
Contributor

@shaofengshi shaofengshi left a comment

LGTM

Comment thread docs/manage-fileset-metadata-using-gravitino.md
tabular data and others in Gravitino with a unified way.

After fileset is created, users can easily access, manage the files/directories through
Fileset's identifier, without needing to know the physical path of the managed datasets. Also, with
Contributor

Maybe "of the managed datasets" is not necessary.

Contributor Author

I feel it is still necessary: it means that the dataset is managed by Gravitino, so users don't need to know the physical path. For an unmanaged dataset, users still need to know the physical path before accessing the dataset.

Contributor

Got it.

Contributor

@qqqttt123 qqqttt123 left a comment

LGTM.

@jerryshao jerryshao merged commit f119d90 into apache:main Apr 3, 2024

Development

Successfully merging this pull request may close these issues.

[Subtask] Add document about how to use fileset type catalog

6 participants