Template:General Category (Unicode)

From Wikipedia, the free encyclopedia
General Category (Unicode Character Property)[a]
| Value | Category (Major, minor) | Basic type[b] | Character assigned[b] | Count[c] (as of 16.0) | Remarks |
|---|---|---|---|---|---|
| L, Letter; LC, Cased Letter (Lu, Ll, and Lt only)[d] | | | | | |
| Lu | Letter, uppercase | Graphic | Character | 1,858 | |
| Ll | Letter, lowercase | Graphic | Character | 2,258 | |
| Lt | Letter, titlecase | Graphic | Character | 31 | Ligatures or digraphs containing an uppercase followed by a lowercase part (e.g., Dž, Lj, Nj, and Dz) |
| Lm | Letter, modifier | Graphic | Character | 404 | A modifier letter |
| Lo | Letter, other | Graphic | Character | 136,477 | An ideograph or a letter in a unicase alphabet |
| M, Mark | | | | | |
| Mn | Mark, nonspacing | Graphic | Character | 2,020 | |
| Mc | Mark, spacing combining | Graphic | Character | 468 | |
| Me | Mark, enclosing | Graphic | Character | 13 | |
| N, Number | | | | | |
| Nd | Number, decimal digit | Graphic | Character | 760 | All these, and only these, have Numeric Type = De[e] |
| Nl | Number, letter | Graphic | Character | 236 | Numerals composed of letters or letterlike symbols (e.g., Roman numerals) |
| No | Number, other | Graphic | Character | 915 | E.g., vulgar fractions, superscript and subscript digits, vigesimal digits |
| P, Punctuation | | | | | |
| Pc | Punctuation, connector | Graphic | Character | 10 | Includes spacing underscore characters such as "_", and other spacing tie characters. Unlike other punctuation characters, these may be classified as "word" characters by regular expression libraries.[f] |
| Pd | Punctuation, dash | Graphic | Character | 27 | Includes several hyphen characters |
| Ps | Punctuation, open | Graphic | Character | 79 | Opening bracket characters |
| Pe | Punctuation, close | Graphic | Character | 77 | Closing bracket characters |
| Pi | Punctuation, initial quote | Graphic | Character | 12 | Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage |
| Pf | Punctuation, final quote | Graphic | Character | 10 | Closing quotation mark. May behave like Ps or Pe depending on usage |
| Po | Punctuation, other | Graphic | Character | 640 | |
| S, Symbol | | | | | |
| Sm | Symbol, math | Graphic | Character | 950 | Mathematical symbols (e.g., +, =, ×, ÷). Does not include parentheses and brackets, which are in categories Ps and Pe. Also does not include !, *, -, or /, which, despite frequent use as mathematical operators, are primarily considered to be "punctuation". |
| Sc | Symbol, currency | Graphic | Character | 63 | Currency symbols |
| Sk | Symbol, modifier | Graphic | Character | 125 | |
| So | Symbol, other | Graphic | Character | 7,376 | |
| Z, Separator | | | | | |
| Zs | Separator, space | Graphic | Character | 17 | Includes the space, but not TAB, CR, or LF, which are Cc |
| Zl | Separator, line | Format | Character | 1 | Only U+2028 LINE SEPARATOR (LSEP) |
| Zp | Separator, paragraph | Format | Character | 1 | Only U+2029 PARAGRAPH SEPARATOR (PSEP) |
| C, Other | | | | | |
| Cc | Other, control | Control | Character | 65 (will never change)[e] | No name,[g] <control> |
| Cf | Other, format | Format | Character | 170 | Includes the soft hyphen, joining control characters (ZWNJ and ZWJ), control characters to support bidirectional text, and language tag characters |
| Cs | Other, surrogate | Surrogate | Not (only used in UTF-16) | 2,048 (will never change)[e] | No name,[g] <surrogate> |
| Co | Other, private use | Private-use | Character (but no interpretation specified) | 137,468 total (will never change)[e] (6,400 in BMP, 131,068 in Planes 15–16) | No name,[g] <private-use> |
| Cn | Other, not assigned | Noncharacter | Not | 66 (will not change unless the range of Unicode code points is expanded)[e] | No name,[g] <noncharacter> |
| Cn | Other, not assigned | Reserved | Not | 819,467 | No name,[g] <reserved> |
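
Several remarks in the table can be spot-checked programmatically. The following is a minimal sketch using Python's standard `unicodedata` and `re` modules. The helper names (`general_category`, `major_class`, `is_cased_letter`, `has_decimal_value`) are illustrative and not part of any standard API. The Unicode version bundled with the interpreter may differ from 16.0, so the sketch only inspects long-stable characters.

```python
import re
import unicodedata

# Major class names, keyed by the first letter of the two-letter value.
MAJOR = {"L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
         "S": "Symbol", "Z": "Separator", "C": "Other"}

def general_category(ch: str) -> str:
    """Return the two-letter General Category value, e.g. 'Lu'."""
    return unicodedata.category(ch)

def major_class(ch: str) -> str:
    """Return the major class name, e.g. 'Letter' for 'Lu'."""
    return MAJOR[unicodedata.category(ch)[0]]

def is_cased_letter(ch: str) -> bool:
    """LC (Cased Letter) groups Lu, Ll, and Lt only."""
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt"}

def has_decimal_value(ch: str) -> bool:
    """True when the database records a decimal digit value for ch."""
    return unicodedata.decimal(ch, None) is not None

print("Unicode data version:", unicodedata.unidata_version)
samples = ["A", "\u01C5", "_", "-", "+", "7", "\u00B2", " ", "\t", "\u2028"]
for ch in samples:
    print(f"U+{ord(ch):04X}", general_category(ch), major_class(ch),
          "decimal" if has_decimal_value(ch) else "")

# Connector punctuation and "word" characters: Python's re module treats
# the underscore as a word character, but not other Pc characters such as
# U+203F UNDERTIE. Other regex engines may classify all of Pc as "word".
print(bool(re.fullmatch(r"\w", "_")), bool(re.fullmatch(r"\w", "\u203F")))
```

Note that `unicodedata.category` reports `Cn` for both noncharacters and reserved code points; separating the two needs an extra range check.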
  a. "Table 4-4: General Category". The Unicode Standard. Unicode Consortium. September 2024.
  b. "Table 2-3: Types of code points". The Unicode Standard. Unicode Consortium. September 2024.
  c. "DerivedGeneralCategory.txt". The Unicode Consortium. 2024-04-30.
  d. "5.7.1 General Category Values". UTR #44: Unicode Character Database. Unicode Consortium. 2024-08-27.
  e. Unicode Character Encoding Stability Policies: Property Value Stability. Stability policy: Some gc groups will never change. gc=Nd corresponds with Numeric Type=De (decimal).
  f. "Annex C: Compatibility Properties (§ word)". Unicode Regular Expressions. Version 23. Unicode Consortium. 2022-02-08. Unicode Technical Standard #18.
  g. "Table 4-9: Construction of Code Point Labels". The Unicode Standard. Unicode Consortium. September 2024. A Code Point Label may be used to identify a nameless code point, e.g. <control-hhhh>, <control-0088>. The Name remains blank, which can prevent inadvertently replacing, in documentation, a Control Name with a true Control code. Unicode also uses <not a character> for <noncharacter>.
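
The code point labels described in the note on Table 4-9 can be built mechanically from the General Category. The sketch below is one possible approach, not a reference implementation: the function names `is_noncharacter` and `code_point_label` are hypothetical, and the `<prefix-hhhh>` shape with uppercase hexadecimal digits is modelled on the `<control-0088>` example. Which code points count as reserved depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Label prefixes for nameless types that map one-to-one to a category.
_PREFIX_BY_CATEGORY = {"Cc": "control", "Cs": "surrogate", "Co": "private-use"}

def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def code_point_label(cp: int):
    """Return a label such as '<control-0088>', or None for named types."""
    gc = unicodedata.category(chr(cp))
    if gc in _PREFIX_BY_CATEGORY:
        prefix = _PREFIX_BY_CATEGORY[gc]
    elif gc == "Cn":
        # Cn covers both noncharacters and reserved code points.
        prefix = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        return None
    return f"<{prefix}-{cp:04X}>"

for cp in (0x0088, 0xD800, 0xE000, 0xFFFF, 0x0041):
    print(f"U+{cp:04X}", code_point_label(cp))
```

Counting `is_noncharacter` over the whole code space (0 to 0x10FFFF) gives the 66 noncharacters listed in the table: 32 in U+FDD0..U+FDEF plus two at the end of each of the 17 planes.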
