- Proposal: SE-0211
- Author: Tony Allevato
- Review Manager: Ben Cohen
- Status: Implemented (Swift 5.0)
- Implementation: apple/swift#15593
- Decision Notes: Acceptance, Update
- Previous Revision: 1
We propose adding a number of properties to the `Unicode.Scalar` type to support both common and advanced text processing use cases, filling in a number of gaps in Swift's text support compared to other programming languages.
Swift-evolution thread: Adding Unicode properties to UnicodeScalar/Character
The Swift `String` type, and related types like `Character` and `Unicode.Scalar`, provide very rich support for Unicode-correct operations. String comparisons are normalized, grapheme cluster boundaries are automatically detected, and string contents can be easily accessed in terms of grapheme clusters, code points, and UTF-8 and UTF-16 code units.
However, when drilling down to lower levels, like individual code points (i.e., `Unicode.Scalar` elements), the current APIs are missing a number of fundamental features available in other programming languages. `Unicode.Scalar` lacks the ability to ask whether a scalar is upper/lowercase or what its upper/lowercase mapping is, if it is a whitespace character, and so forth.
Without pulling in third-party code, users can currently import the `Darwin`/`Glibc` module and access C functions like `isspace`, but these only work with ASCII characters.
The Swift standard library uses the system's ICU libraries to implement its Unicode support. A third-party developer may expect that they could also link their application directly to the system ICU to access the functionality that they need, but this proves problematic on both Apple and Linux platforms.
On Apple operating systems, `libicucore.dylib` is built with function renaming disabled (function names lack the `_NN` version number suffix). This makes it fairly straightforward to import the C APIs and call them from Swift without worrying about which version the operating system is using.
Unfortunately, `libicucore.dylib` is considered to be private API for submissions to the App Store, so applications doing this will be rejected. Instead, users must build their own copy of ICU from source and link that into their applications. This is significant overhead.
On Linux, system ICU libraries are built with function renaming enabled (the default), so function names have the `_NN` version number suffix. Function renaming makes it more difficult to use these APIs from Swift; even though the C header files contain `#define`s that map function names like `u_foo_59` to `u_foo`, these `#define`s are not imported into Swift; only the suffixed function names are available. This means that Swift bindings would be fixed to a specific version of the library without some other intermediary layer. Again, this is significant overhead.
Therefore, this proposal not only fills in important gaps in the standard library's capabilities, but removes a significant pain point for users who may try to access that functionality through other means.
We propose adding a nested struct, `Unicode.Scalar.Properties`, which will encapsulate many of the properties that the Unicode specification defines on scalars. Supporting types, such as enums representing the values of certain properties, will also be added to the `Unicode` enum "namespace."
This proposal is restricted, by design, to add functionality to `Unicode.Scalar` only. While we believe that some of the properties described here (and others) would be valuable on `Character` as well, we have intentionally saved those for a future proposal in order to keep this one small and focused. Such a future proposal would likely depend on the design and implementation herein.
The code snippets below reflect an elided sketch of the proposed public API only. Full details can be found in the implementation pull request.
In general, the names of the properties inside the `Properties` struct are derived directly from the names of the properties as they are defined in the Unicode Standard.

    extension Unicode.Scalar {

      // NOT @_fixed_layout
      public struct Properties {
        // Remaining API is defined in the subsections below.
      }

      /// The value that encapsulates the properties exposed.
      public var properties: Properties { get }
    }

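As a minimal usage sketch, assuming the API above is available as declared (the proposal's status is Implemented in Swift 5.0), a scalar's properties are reached through its `properties` accessor. The variable names are illustrative only.

```swift
// Obtain the properties value for a single scalar and query it.
let scalar: Unicode.Scalar = "\u{E9}"     // LATIN SMALL LETTER E WITH ACUTE
let props = scalar.properties             // Unicode.Scalar.Properties

let isLetter = props.isAlphabetic
let isLower = props.isLowercase

// Properties can also be queried while walking a string's scalars.
var alphabeticCount = 0
for s in "Swift 5".unicodeScalars where s.properties.isAlphabetic {
    alphabeticCount += 1
}
// alphabeticCount is 5: the space and the digit are not Alphabetic.
```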
Each of the Boolean properties in the first block below would be implemented by calling ICU's `u_hasBinaryProperty` function. The official Unicode name of each property is indicated by the comment next to the computed properties below.
We propose supporting all of the Boolean properties that are currently available using ICU's `u_hasBinaryProperty` that correspond to properties in the Unicode Standard (but not ICU-specific properties), with the following exceptions:
- `Grapheme_Link` is omitted because it is deprecated and equivalent to canonical combining class 9.
- `Hyphen` is omitted because it is deprecated in favor of the `Line_Break` property.

    extension Unicode.Scalar.Properties {

      public var isAlphabetic: Bool { get }               // Alphabetic
      public var isASCIIHexDigit: Bool { get }            // ASCII_Hex_Digit
      public var isBidiControl: Bool { get }              // Bidi_Control
      public var isBidiMirrored: Bool { get }             // Bidi_Mirrored
      public var isDash: Bool { get }                     // Dash
      public var isDefaultIgnorableCodePoint: Bool { get } // Default_Ignorable_Code_Point
      public var isDeprecated: Bool { get }               // Deprecated
      public var isDiacritic: Bool { get }                // Diacritic
      public var isExtender: Bool { get }                 // Extender
      public var isFullCompositionExclusion: Bool { get } // Full_Composition_Exclusion
      public var isGraphemeBase: Bool { get }             // Grapheme_Base
      public var isGraphemeExtend: Bool { get }           // Grapheme_Extend
      public var isHexDigit: Bool { get }                 // Hex_Digit
      public var isIDContinue: Bool { get }               // ID_Continue
      public var isIDStart: Bool { get }                  // ID_Start
      public var isIdeographic: Bool { get }              // Ideographic
      public var isIDSBinaryOperator: Bool { get }        // IDS_Binary_Operator
      public var isIDSTrinaryOperator: Bool { get }       // IDS_Trinary_Operator
      public var isJoinControl: Bool { get }              // Join_Control
      public var isLogicalOrderException: Bool { get }    // Logical_Order_Exception
      public var isLowercase: Bool { get }                // Lowercase
      public var isMath: Bool { get }                     // Math
      public var isNoncharacterCodePoint: Bool { get }    // Noncharacter_Code_Point
      public var isQuotationMark: Bool { get }            // Quotation_Mark
      public var isRadical: Bool { get }                  // Radical
      public var isSoftDotted: Bool { get }               // Soft_Dotted
      public var isTerminalPunctuation: Bool { get }      // Terminal_Punctuation
      public var isUnifiedIdeograph: Bool { get }         // Unified_Ideograph
      public var isUppercase: Bool { get }                // Uppercase
      public var isWhitespace: Bool { get }               // White_Space
      public var isXIDContinue: Bool { get }              // XID_Continue
      public var isXIDStart: Bool { get }                 // XID_Start
      public var isCaseSensitive: Bool { get }            // Case_Sensitive
      public var isSentenceTerminal: Bool { get }         // Sentence_Terminal (S_Term)
      public var isVariationSelector: Bool { get }        // Variation_Selector
      public var isNFDInert: Bool { get }                 // NFD_Inert
      public var isNFKDInert: Bool { get }                // NFKD_Inert
      public var isNFCInert: Bool { get }                 // NFC_Inert
      public var isNFKCInert: Bool { get }                // NFKC_Inert
      public var isSegmentStarter: Bool { get }           // Segment_Starter
      public var isPatternSyntax: Bool { get }            // Pattern_Syntax
      public var isPatternWhitespace: Bool { get }        // Pattern_White_Space
      public var isCased: Bool { get }                    // Cased
      public var isCaseIgnorable: Bool { get }            // Case_Ignorable
      public var changesWhenLowercased: Bool { get }      // Changes_When_Lowercased
      public var changesWhenUppercased: Bool { get }      // Changes_When_Uppercased
      public var changesWhenTitlecased: Bool { get }      // Changes_When_Titlecased
      public var changesWhenCaseFolded: Bool { get }      // Changes_When_Casefolded
      public var changesWhenCaseMapped: Bool { get }      // Changes_When_Casemapped
      public var changesWhenNFKCCaseFolded: Bool { get }  // Changes_When_NFKC_Casefolded
      public var isEmoji: Bool { get }                    // Emoji
      public var isEmojiPresentation: Bool { get }        // Emoji_Presentation
      public var isEmojiModifier: Bool { get }            // Emoji_Modifier
      public var isEmojiModifierBase: Bool { get }        // Emoji_Modifier_Base
    }

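The Boolean properties above might be used as in the following sketch, which assumes the declarations are available as written. The helper `strippingWhitespace` is a hypothetical name introduced for illustration; it is not part of the proposal.

```swift
// Remove every scalar that has the Unicode White_Space property.
// `strippingWhitespace` is a hypothetical helper, not a standard library API.
func strippingWhitespace(_ text: String) -> String {
    var kept = String.UnicodeScalarView()
    for scalar in text.unicodeScalars where !scalar.properties.isWhitespace {
        kept.append(scalar)
    }
    return String(kept)
}

// EM SPACE (U+2003) and NO-BREAK SPACE (U+00A0) are whitespace but not ASCII,
// so an ASCII-only check such as C's isspace would not cover them.
let stripped = strippingWhitespace("a\u{2003}b\u{00A0}c d")   // "abcd"

// The changesWhen... properties read as assertions about a scalar.
let lowerA: Unicode.Scalar = "a"
let upperA: Unicode.Scalar = "A"
let lowerChanges = lowerA.properties.changesWhenUppercased    // true
let upperChanges = upperA.properties.changesWhenUppercased    // false
```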
The properties below provide full case mappings for scalars. Since a handful of mappings result in multiple scalars (e.g., "ß" uppercases to "SS"), these properties are `String`-valued, not `Unicode.Scalar`.
These properties are also common enough that they could be reasonably hoisted out of `Unicode.Scalar.Properties` and made into instance properties directly on `Unicode.Scalar`.

    extension Unicode.Scalar.Properties {

      public var lowercaseMapping: String { get }  // u_strToLower
      public var titlecaseMapping: String { get }  // u_strToTitle
      public var uppercaseMapping: String { get }  // u_strToUpper
    }

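A short sketch of why these mappings are `String`-valued, assuming the properties behave as described above. `uppercasedByScalar` is a hypothetical helper; because it maps one scalar at a time, it cannot account for context-sensitive or locale-specific casing rules.

```swift
let sharpS: Unicode.Scalar = "\u{DF}"              // "ß", LATIN SMALL LETTER SHARP S

let upper = sharpS.properties.uppercaseMapping     // "SS" (two scalars)
let title = sharpS.properties.titlecaseMapping     // "Ss"
let lower = sharpS.properties.lowercaseMapping     // "ß" (unchanged)

// Hypothetical helper: uppercase a string one scalar at a time.
func uppercasedByScalar(_ text: String) -> String {
    return text.unicodeScalars
        .map { $0.properties.uppercaseMapping }
        .joined()
}

let shouted = uppercasedByScalar("stra\u{DF}e")    // "STRASSE"
```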
We add the following properties for the purposes of identifying and categorizing scalars.
The `Canonical_Combining_Class` property is somewhat unique in that the Unicode Standard provides names for some, but not all, of the 255 valid values. We choose to represent this in Swift as a `RawRepresentable` `struct` that wraps the raw integer value, while still being `Comparable` for the purposes of implementing manual decomposition logic if necessary, and still providing the named values through `static let` constants.

    extension Unicode.Scalar.Properties {

      /// Corresponds to the `Age` Unicode property, when a code point was first
      /// defined.
      public var age: Unicode.Version? { get }

      /// Corresponds to the `Name` Unicode property.
      public var name: String? { get }

      /// Corresponds to the `Name_Alias` Unicode property.
      public var nameAlias: String? { get }

      /// Corresponds to the `General_Category` Unicode property.
      public var generalCategory: Unicode.GeneralCategory { get }

      /// Corresponds to the `Canonical_Combining_Class` Unicode property.
      public var canonicalCombiningClass: Unicode.CanonicalCombiningClass { get }
    }

    extension Unicode {

      /// Represents the version of Unicode in which a scalar was introduced.
      public typealias Version = (major: Int, minor: Int)

      /// General categories returned by
      /// `Unicode.Scalar.Properties.generalCategory`. Listed along with their
      /// two-letter code.
      public enum GeneralCategory {
        case uppercaseLetter       // Lu
        case lowercaseLetter       // Ll
        case titlecaseLetter       // Lt
        case modifierLetter        // Lm
        case otherLetter           // Lo

        case nonspacingMark        // Mn
        case spacingMark           // Mc
        case enclosingMark         // Me

        case decimalNumber         // Nd
        case letterlikeNumber      // Nl
        case otherNumber           // No

        case connectorPunctuation  // Pc
        case dashPunctuation       // Pd
        case openPunctuation       // Ps
        case closePunctuation      // Pe
        case initialPunctuation    // Pi
        case finalPunctuation      // Pf
        case otherPunctuation      // Po

        case mathSymbol            // Sm
        case currencySymbol        // Sc
        case modifierSymbol        // Sk
        case otherSymbol           // So

        case spaceSeparator        // Zs
        case lineSeparator         // Zl
        case paragraphSeparator    // Zp

        case control               // Cc
        case format                // Cf
        case surrogate             // Cs
        case privateUse            // Co
        case unassigned            // Cn
      }

      public struct CanonicalCombiningClass:
        Comparable, Hashable, RawRepresentable
      {
        public static let notReordered = CanonicalCombiningClass(rawValue: 0)
        public static let overlay = CanonicalCombiningClass(rawValue: 1)
        public static let nukta = CanonicalCombiningClass(rawValue: 7)
        public static let kanaVoicing = CanonicalCombiningClass(rawValue: 8)
        public static let virama = CanonicalCombiningClass(rawValue: 9)
        public static let attachedBelowLeft = CanonicalCombiningClass(rawValue: 200)
        public static let attachedBelow = CanonicalCombiningClass(rawValue: 202)
        public static let attachedAbove = CanonicalCombiningClass(rawValue: 214)
        public static let attachedAboveRight = CanonicalCombiningClass(rawValue: 216)
        public static let belowLeft = CanonicalCombiningClass(rawValue: 218)
        public static let below = CanonicalCombiningClass(rawValue: 220)
        public static let belowRight = CanonicalCombiningClass(rawValue: 222)
        public static let left = CanonicalCombiningClass(rawValue: 224)
        public static let right = CanonicalCombiningClass(rawValue: 226)
        public static let aboveLeft = CanonicalCombiningClass(rawValue: 228)
        public static let above = CanonicalCombiningClass(rawValue: 230)
        public static let aboveRight = CanonicalCombiningClass(rawValue: 232)
        public static let doubleBelow = CanonicalCombiningClass(rawValue: 233)
        public static let doubleAbove = CanonicalCombiningClass(rawValue: 234)
        public static let iotaSubscript = CanonicalCombiningClass(rawValue: 240)

        public let rawValue: UInt8

        public init(rawValue: UInt8)
      }
    }

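The sketch below shows how the identification properties might be queried, assuming the declarations above. It also illustrates the point of making `CanonicalCombiningClass` `Comparable`: combining marks can be ordered by class, as canonical reordering requires. All variable names are illustrative.

```swift
let capitalA: Unicode.Scalar = "A"
let acute: Unicode.Scalar = "\u{0301}"     // COMBINING ACUTE ACCENT
let cedilla: Unicode.Scalar = "\u{0327}"   // COMBINING CEDILLA

let category = capitalA.properties.generalCategory      // .uppercaseLetter
let name = capitalA.properties.name                      // "LATIN CAPITAL LETTER A"
let markCategory = acute.properties.generalCategory      // .nonspacingMark

// `age` is an optional (major, minor) tuple.
let age = capitalA.properties.age

// Combining classes wrap a raw UInt8 and compare by that value.
let acuteClass = acute.properties.canonicalCombiningClass      // raw value 230 (.above)
let cedillaClass = cedilla.properties.canonicalCombiningClass  // raw value 202 (.attachedBelow)
let cedillaSortsFirst = cedillaClass < acuteClass              // true
```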
Many Unicode scalars have associated numeric values. These are not only the common digits zero through nine, but also vulgar fractions and various other linguistic characters and ideographs that have an innate numeric value. These properties are exposed below. They can be useful for determining whether segments of text contain numbers or non-numeric data, and can also help in the design of algorithms to determine the values of such numbers.

    extension Unicode.Scalar.Properties {

      /// Corresponds to the `Numeric_Type` Unicode property.
      public var numericType: Unicode.NumericType? { get }

      /// Corresponds to the `Numeric_Value` Unicode property.
      public var numericValue: Double? { get }
    }

    extension Unicode {

      public enum NumericType {
        case decimal
        case digit
        case numeric
      }
    }

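A brief sketch of the numeric properties, assuming the declarations above. `numericSum` is a hypothetical helper that adds up the values of whichever scalars carry one; it is not a number parser.

```swift
let four: Unicode.Scalar = "4"                 // DIGIT FOUR
let circledFour: Unicode.Scalar = "\u{2463}"   // CIRCLED DIGIT FOUR
let half: Unicode.Scalar = "\u{BD}"            // VULGAR FRACTION ONE HALF
let letterX: Unicode.Scalar = "X"

let fourType = four.properties.numericType           // .decimal
let circledType = circledFour.properties.numericType // .digit
let halfType = half.properties.numericType           // .numeric
let letterType = letterX.properties.numericType      // nil

let halfValue = half.properties.numericValue         // 0.5

// Hypothetical helper: sum the numeric values of all scalars that have one.
func numericSum(of text: String) -> Double {
    return text.unicodeScalars
        .compactMap { $0.properties.numericValue }
        .reduce(0, +)
}

let total = numericSum(of: "4\u{2463}\u{BD}X")       // 4 + 4 + 0.5 = 8.5
```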
These changes are strictly additive. This proposal does not affect source compatibility.
These changes are strictly additive. This proposal does not affect the ABI of existing language features.
The `Unicode.Scalar.Properties` struct is currently defined as a resilient (non-`@_fixed_layout`) struct whose only stored property in the initial implementation is the integer value of the scalar whose properties are being retrieved. All other properties are computed properties, and new properties can be added without breaking the ABI.
In the future, we may choose to cache certain properties if data show that they are more frequently accessed than others and that there would be a meaningful performance improvement by computing them early. It is too soon to make such a determination now, however.
We considered other representations for the Boolean properties of a scalar:
- A `BooleanProperty` enum with a case for each property, and a `Unicode.Scalar.hasProperty` method used to query it. This is very close to the underlying ICU C APIs and does not bloat the `Unicode.Scalar` API, but makes the kinds of queries users would commonly make less discoverable.
- A `Unicode.Scalar.properties` property whose type conforms to `OptionSet`. This would allow us to use the underlying property enum constants as the bit-shifts for the option set values, but there are already 64 Boolean properties defined by ICU. Since the underlying integral type is part of the public API/ABI of the option set, we would not be able to change it in the future without breaking compatibility.
- A `Unicode.Scalar.properties` property whose type is a `Set<BooleanProperty>`, but we would not be able to form this collection without querying all Boolean properties upon any access (the `OptionSet` solution above suffers the same problem). This would be needlessly inefficient in almost all usage.
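To make the first alternative concrete, here is a rough sketch of what an enum-plus-query-method design could look like, layered on the properties that were adopted. `BooleanPropertySketch` and `hasPropertySketch` are hypothetical names with only three cases; neither exists in the standard library.

```swift
// HYPOTHETICAL: a cut-down version of the rejected enum-based design.
enum BooleanPropertySketch {
    case alphabetic
    case whitespace
    case uppercase
}

extension Unicode.Scalar {
    // HYPOTHETICAL query method, forwarding to the adopted properties.
    func hasPropertySketch(_ property: BooleanPropertySketch) -> Bool {
        switch property {
        case .alphabetic: return properties.isAlphabetic
        case .whitespace: return properties.isWhitespace
        case .uppercase:  return properties.isUppercase
        }
    }
}

let q: Unicode.Scalar = "Q"
let viaMethod = q.hasPropertySketch(.uppercase)   // enum-based query
let viaProperty = q.properties.isUppercase        // adopted spelling
```

Both spellings answer the same question; the adopted one surfaces each query as a named property that code completion can list.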
We feel that by putting the properties into a separate `Unicode.Scalar.Properties` struct, the large number of advanced properties does not contribute to bloat of the main `Unicode.Scalar` API, and allows us to cleanly represent not only Boolean properties but other types of properties with ease.
The names of the Boolean properties are all of the form `is<Unicode Property Name>`, with the exception of a small number of properties whose names already start with indicative verb forms and read as assertions (e.g., `changesWhenUppercased`). This leads to some technical and/or awkward property names, like `isXIDContinue`.
We considered modifying these names to make them read more naturally like other Swift APIs; for example, `extendsPrecedingScalar` instead of `isExtender`. However, since these properties are intended for advanced users who are likely already somewhat familiar with the Unicode Standard and its definitions, we decided to keep the names directly derived from the Standard, which makes them more discoverable to the intended audience.
The original version of this proposal also included a Boolean `isDefined` property that was equivalent to the `u_isdefined` function from ICU, which would evaluate to true for exactly the code points that are assigned (those with general category other than "Cn"). In Swift, however, this would be redundant and was thus dropped from the final implementation; its value would differ from `Unicode.Scalar.Properties.generalCategory != .unassigned` only for surrogate code points, which cannot be created as `Unicode.Scalar` values.
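Users who still want such a check can write it in one line. The sketch below assumes the `generalCategory` property described earlier; `isDefinedSketch` is a hypothetical name, not part of the proposal.

```swift
extension Unicode.Scalar.Properties {
    // HYPOTHETICAL stand-in for the dropped `isDefined` property.
    var isDefinedSketch: Bool {
        return generalCategory != .unassigned
    }
}

let letterA: Unicode.Scalar = "A"
// U+FFFF is a noncharacter, whose general category is Cn (unassigned).
let noncharacter = Unicode.Scalar(UInt32(0xFFFF))!

let letterIsDefined = letterA.properties.isDefinedSketch            // true
let noncharacterIsDefined = noncharacter.properties.isDefinedSketch // false
```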