
Add preliminary files with mappings from XCCS to Unihan set of Unicode. #2559

Merged
MattHeffron merged 4 commits into master from mth67--Add_unihan_XCCS_mapping_files
Apr 17, 2026

Conversation

@MattHeffron
Member

These were generated by script from the data in the unihan folder of the Unicode Character Database.
That data claims to give the mapping from Unicode (Unihan) to "Xerox" coding (2 bytes in octal).
These files have not been validated at all for correctness/completeness.
None of these files has any descriptive header.

(Unfortunately, this kind of mapping information to "Xerox" exists *only* for Unihan characters.)

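The generation step described above might be sketched roughly as follows. This is a hedged illustration in Python, not the actual script: it assumes the Xerox mappings come from kXerox entries in the Unihan data files, formatted as tab-separated `U+xxxx`, `kXerox`, `hi:lo` rows with the two bytes in octal. The sample line is synthetic, and the layout of the generated mapping files is not reproduced here.

```python
def parse_kxerox(lines):
    """Collect {Unicode code point: (hi, lo) XCCS bytes} from kXerox rows."""
    mapping = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3 and fields[1] == "kXerox":
            codepoint = int(fields[0][2:], 16)  # "U+4E00" -> 0x4E00
            # Assumption: value is two octal bytes separated by a colon
            hi, lo = (int(b, 8) for b in fields[2].split(":"))
            mapping[codepoint] = (hi, lo)
    return mapping

sample = [
    "# synthetic excerpt in Unihan tab-separated format",
    "U+4E00\tkXerox\t241:241",
]
print(parse_kxerox(sample))  # → {19968: (161, 161)}
```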
@MattHeffron MattHeffron self-assigned this Apr 10, 2026
@MattHeffron MattHeffron added the enhancement New feature or request label Apr 10, 2026
@pamoroso
Member

The new files appear to be intact and readable.

@rmkaplan
Contributor

The format seems good. The only issue is that we need a predicate that is true for the range of Unihan character set numbers, like the ones we have for Kanji and Chinese. It would be used to make sure that the glyphs for the bold and italic versions of the display fonts, if we ever got them, are not faked up.

@MattHeffron
Member Author

Unihan is the unification of Kanji, Chinese, Japanese, etc. characters. See: Han unification
But "many characters have regional variants assigned to different code points"

The XCCS Standard document's map of character sets to languages (page 34, 2-8) appears quite incomplete.
In addition, some character sets include mixes of Latin, symbol, Asian, Arabic, etc. characters, so managing glyph faking per character set might be insufficient.

@MattHeffron MattHeffron marked this pull request as ready for review April 15, 2026 19:05
Contributor

@rmkaplan rmkaplan left a comment


Why not? It turns out that the function CHINESECHARSETP was basically picking out just these character sets (except for 171, which KANJICHARSETP should cover), so I'll separately rename CHINESECHARSETP to UNIHANCHARSETP.
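For illustration, the predicate arrangement described here might look like the following. This is a sketch in Python rather than the actual Interlisp; every charset number below is a placeholder except 171, which this thread names as covered by KANJICHARSETP.

```python
# Placeholder charset numbers -- NOT the real XCCS assignments, except
# that 171 is the one this discussion assigns to KANJICHARSETP.
KANJI_CHARSETS = frozenset({171})
UNIHAN_CHARSETS = frozenset({101, 102, 103})  # hypothetical values

def KANJICHARSETP(charset):
    """True for the Kanji character sets."""
    return charset in KANJI_CHARSETS

def UNIHANCHARSETP(charset):
    """CHINESECHARSETP renamed: true for the remaining Unihan charsets."""
    return charset in UNIHAN_CHARSETS

def needs_real_glyphs(charset):
    """Hypothetical combined check: never fake bold/italic glyphs here."""
    return UNIHANCHARSETP(charset) or KANJICHARSETP(charset)
```

The design choice, as described, is that the existing Kanji predicate keeps covering charset 171, while the renamed UNIHANCHARSETP covers the rest of the sets the old CHINESECHARSETP selected.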

@MattHeffron MattHeffron merged commit 2814618 into master Apr 17, 2026
@MattHeffron MattHeffron deleted the mth67--Add_unihan_XCCS_mapping_files branch April 17, 2026 19:34

Labels

enhancement New feature or request

Projects

Status: Done


3 participants