CJK Support Added for Markdown Slug Generation #6
Merged
**CHANGES:**
- Update `sanitizeText` to handle Chinese, Japanese, and Korean characters.
- Extend regex to include Unicode ranges for CJK.
- Add test for CLI `explode` command with CJK support.
- Bump version to 1.6.0 for new feature release.
Pull Request Overview
This PR adds support for CJK characters in markdown slug generation, ensuring that filenames and anchors retain Chinese, Japanese, and Korean characters instead of stripping them.
- Updated the regex in the slug generation logic in bin/md-tree.js to include Unicode ranges for CJK characters.
- Added new tests in test/test-cli.js and test/test-cjk.md to verify that the explode command correctly handles CJK headings.
- Bumped the package version to 1.6.0 to reflect the feature addition.
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/test-cli.js | Added a new test case to verify filename and link generation with CJK text |
| test/test-cjk.md | Introduced a markdown file containing CJK headings for testing |
| package.json | Updated version to 1.6.0 as part of the new feature integration |
Comments suppressed due to low confidence (1)
test/test-cli.js:479
- [nitpick] Consider adding a clarifying comment that explains the Japanese subsection 'セクション 2.1' is intentionally nested under the parent section '章二', ensuring maintainers understand the test's intent.
indexContent.includes('[セクション 2.1](./章二.md#セクション-21)'),
CJK Support Added for Markdown Slug Generation
Summary
This PR adds support for CJK (Chinese, Japanese, Korean) characters in markdown slug generation, so that non-Latin characters are preserved in generated filenames and anchors when using the `explode` command.
Related Issues
Closes #5
Files Changed
bin/md-tree.js
Modified the `sanitizeText` method to preserve CJK characters instead of stripping them out. The regex now includes Unicode ranges for Chinese, Japanese (Hiragana and Katakana), and Korean characters.
package.json
Bumped the version from 1.5.1 to 1.6.0 to reflect the new feature addition.
test/test-cjk.md
Added a new test markdown file containing CJK characters in headings to verify the functionality works correctly.
test/test-cli.js
Added a comprehensive test case to verify that the `explode` command correctly handles CJK characters in headings, generates appropriate filenames, and creates proper links.
Code Changes
bin/md-tree.js
The regex now includes Unicode ranges:
- `\u4e00-\u9fff`: Chinese characters
- `\u3040-\u309f`: Japanese Hiragana
- `\u30a0-\u30ff`: Japanese Katakana
- `\uac00-\ud7af`: Korean Hangul
test/test-cli.js
Reason for Changes
Previously, the markdown tree parser would strip out all non-Latin characters when generating slugs for filenames and anchors. This made the tool unsuitable for documentation written in CJK languages, as headings like "章节一" would be converted to empty strings or hyphens only, resulting in non-descriptive or conflicting filenames.
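To make the before/after behaviour concrete, here is a minimal sketch. The function names `slugLatinOnly`, `slugWithCjk`, and `tidy` are hypothetical, and the exact normalization steps (lowercasing, hyphen collapsing) are assumptions; only the four Unicode ranges come from this PR. The actual `sanitizeText` in bin/md-tree.js may differ in detail.

```javascript
// Sketch only: slugLatinOnly, slugWithCjk and tidy are hypothetical names,
// not the actual sanitizeText implementation in bin/md-tree.js.

// Shared post-processing (assumed): whitespace becomes hyphens, stray hyphens are trimmed.
function tidy(text) {
  return text
    .trim()
    .replace(/\s+/g, '-')
    .replace(/-+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Old behaviour (assumed): anything outside a-z, 0-9, whitespace and '-' is dropped.
function slugLatinOnly(text) {
  return tidy(text.toLowerCase().replace(/[^a-z0-9\s-]/g, ''));
}

// New behaviour: the character class also keeps the four ranges listed in the PR.
function slugWithCjk(text) {
  const disallowed = /[^a-z0-9\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af\s-]/g;
  return tidy(text.toLowerCase().replace(disallowed, ''));
}

console.log(JSON.stringify(slugLatinOnly('章节一')));  // ""
console.log(slugWithCjk('章节一'));                     // 章节一
console.log(slugWithCjk('セクション 2.1'));             // セクション-21
console.log(slugWithCjk('Chapter 3: 한국어 Guide'));    // chapter-3-한국어-guide
```

With the Latin-only class, a heading made entirely of CJK characters collapses to an empty slug, so two such headings would map to the same filename. With the extended class, each heading keeps its own text.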
Impact of Changes
Test Plan
A comprehensive test case has been added that:
- … `explode` command on this file
The test covers Chinese characters in main headings and Japanese characters in subsections, alongside regular English headings to ensure mixed-language documents work correctly.
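The link check from the review excerpt (`indexContent.includes(...)`) can be sketched in isolation as follows. The helpers `slug` and `indexLink` are hypothetical, and the link shape is inferred from the single asserted string in test/test-cli.js, so treat this as an illustration of what the test expects and not as the project's own code.

```javascript
// Sketch only: slug() and indexLink() are hypothetical helpers, not the project's API.
function slug(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af\s-]/g, '')
    .trim()
    .replace(/\s+/g, '-');
}

// Assumed link shape: [subsection title](./<parent slug>.md#<subsection slug>)
function indexLink(parentHeading, subHeading) {
  return `[${subHeading}](./${slug(parentHeading)}.md#${slug(subHeading)})`;
}

// A mixed-language index: a Japanese subsection under a Chinese parent,
// next to a plain English section.
const indexContent = [
  indexLink('章二', 'セクション 2.1'),
  indexLink('Chapter Three', 'Overview'),
].join('\n');

console.log(indexContent.includes('[セクション 2.1](./章二.md#セクション-21)')); // true
```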
Additional Notes