Python Unicodedata Library functions normalize and decomposition



Python Unicode data:
The 'unicodedata' library in Python defines the properties of all the Unicode characters available in the Unicode Character Database. The library accesses the Unicode Character Database (UCD) to look up these character properties.
UCD file link: https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt

The two functions covered here are:
1. unicodedata.normalize(form, unistr)
2. unicodedata.decomposition(chr)

But, let us first understand the concept of Normalization in Unicode characters.

Many of the Unicode characters in use look like variants of other existing characters.
People use odd characters to express individuality, to grab attention, or both, and quite often two seemingly matching characters do not compare as equal. Thus, Unicode Normalization comes to the rescue.

Unicode Normalization is basically the decomposition and composition of characters. Some Unicode characters have the same appearance but multiple representations.
For example, "â" can be represented as the single code point "â" (U+00E2), or as the two decomposed code points "a" (U+0061) and " ̂" (U+0302), i.e. expressed as (base character + combining character). The former is called a precomposed character and the latter is called a combining character sequence (CCS).
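The two representations of "â" can be inspected directly in Python; this short sketch shows that they render the same but have different lengths and do not compare as equal:

```python
# Precomposed character: a single code point.
precomposed = "\u00E2"             # 'â'
# Combining character sequence: base 'a' + combining circumflex.
ccs = "\u0061\u0302"               # 'a' + ' ̂'

print(precomposed, ccs)            # both render as 'â'
print(len(precomposed), len(ccs))  # 1 2
print(precomposed == ccs)          # False: different code points
```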

The Unicode Standard defines two equivalences between characters:
1. Canonical Equivalence 
2. Compatibility Equivalence

Canonical Equivalence is the fundamental equivalence between characters or sequences of characters which represent the same abstract character, and which, when correctly displayed, should always have the same visual appearance and behaviour.

Example: 

Consider two characters, 'Latin Capital C with Cedilla' and 'Latin Capital C' plus 'Combining Cedilla'.
Both of these characters have the same appearance, but internally they are different characters.
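A minimal sketch of this example, comparing the two representations:

```python
# 'Latin Capital C with Cedilla' as a single code point,
# versus 'Latin Capital C' followed by 'Combining Cedilla'.
c_with_cedilla = "\u00C7"
c_plus_cedilla = "\u0043\u0327"

print(c_with_cedilla, c_plus_cedilla)    # both render as 'Ç'
print(c_with_cedilla == c_plus_cedilla)  # False: internally different
```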




Compatibility Equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviours. The visual appearances of the compatibility-equivalent forms typically constitute a subset of the expected range of visual appearances of the character (or sequence of characters) they are equivalent to.

However, these variant forms may represent a visual distinction that is significant in some textual contexts but not in others. As a result, greater care is required to determine when use of a compatibility equivalent is appropriate. If the visual distinction is stylistic, then markup or styling could be used to represent the formatting information. However, some characters with compatibility decompositions are used in mathematical notation to represent a distinction of a semantic nature; replacing the distinct character codes by formatting in such contexts may cause problems.

Some subtypes of compatibility-equivalent characters are:

Example: 
Circled Variants:
'ⓓ' and 'd' represent the same character but look different.
Font Variants:
'ℍ' and 'H' represent the same character but look different.
etc.
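These compatibility variants can be mapped back to their plain forms with compatibility decomposition (NFKD, explained below); a quick sketch:

```python
import unicodedata

# Circled variant: 'ⓓ' (U+24D3) maps to plain 'd'.
print(unicodedata.normalize("NFKD", "\u24D3"))  # d

# Font variant: 'ℍ' (U+210D) maps to plain 'H'.
print(unicodedata.normalize("NFKD", "\u210D"))  # H
```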

Unicode Normalization is our solution to both canonical and compatibility equivalence issues.
Normalization works in two directions, namely Decomposition and Composition, and each direction has two forms of conversion.

Decomposition is the method of breaking characters down into their constituent parts.
Composition is the method of merging separate characters into a single Unicode character.

For example:
Consider the Unicode character 'Latin Capital C with Cedilla', i.e. "\u00C7" : Ç.

Decomposition applied to this character breaks it down into its two separate Unicode characters, "\u0043" (Latin Capital C) and "\u0327" (Combining Cedilla).

And Composition applied to the characters "\u0043" (Latin Capital C) and "\u0327" (Combining Cedilla) forms 'Latin Capital C with Cedilla', i.e. "\u00C7" : Ç.


Each direction has two forms of conversion:
Decomposition:
1. NFD- Canonical Decomposition
2. NFKD- Compatibility Decomposition
Composition:
1. NFC- Canonical Decomposition followed by Canonical Composition
2. NFKC-Compatibility Decomposition followed by Canonical Composition

We can use the function 'unicodedata.normalize(form, unistr)' for these conversions:

unicodedata.normalize(form, unistr):
This is a function in the 'unicodedata' library in Python that returns the normal form 'form' of the Unicode string 'unistr'.
The argument 'form' can take the values: 'NFD', 'NFKD', 'NFC' and 'NFKC'.

Here, let us see how to use these forms in this function:

NFD (Canonical Decomposition):
Breaking down the Unicode Characters into their independent parts.

Example:
Latin Capital C with Cedilla : "\u00C7" - 'Ç' and
Latin Capital C plus Combining Cedilla : "\u0043\u0327" - 'Ç'.
These two characters have the same visual appearance but are internally different. When we apply NFD (Canonical Decomposition) to 'Latin Capital C with Cedilla' ("\u00C7"), it gets decomposed into 'Latin Capital C plus Combining Cedilla' ("\u0043\u0327").
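Based on the example above, a small sketch of NFD in action:

```python
import unicodedata

# 'Ç' as a single precomposed code point.
c_with_cedilla = "\u00C7"

# NFD breaks it into 'C' (U+0043) + Combining Cedilla (U+0327).
decomposed = unicodedata.normalize("NFD", c_with_cedilla)

print(decomposed == "\u0043\u0327")         # True
print([hex(ord(ch)) for ch in decomposed])  # ['0x43', '0x327']
```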




NFC (Canonical Decomposition followed by Canonical Composition):
This first decomposes characters just like NFD, but then composes them back into a single character.

Example:
Latin Capital C with Cedilla : "\u00C7" - 'Ç' and
Latin Capital C plus Combining Cedilla : "\u0043\u0327" - 'Ç'.
If we apply NFC to 'Latin Capital C plus Combining Cedilla' ("\u0043\u0327"), it first decomposes into "\u0043" and "\u0327", and then Canonical Composition merges these characters back into "\u00C7", giving us 'Latin Capital C with Cedilla'.

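A sketch of NFC applied to the combining character sequence:

```python
import unicodedata

# 'C' + Combining Cedilla as two separate code points.
c_plus_cedilla = "\u0043\u0327"

# NFC decomposes (a no-op here) and then composes back into 'Ç' (U+00C7).
composed = unicodedata.normalize("NFC", c_plus_cedilla)

print(composed == "\u00C7")  # True
print(len(composed))         # 1
```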


NFKD (Compatibility Decomposition):
This, if possible, splits fancy alternative characters down into their basic, simpler parts.

Example:
1. '½' can be broken down into '1' + '⁄' (Fraction Slash, U+2044) + '2'



2. 'ℌ' can be broken into 'H'
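A sketch covering both NFKD examples:

```python
import unicodedata

# '½' (U+00BD) decomposes to '1' + Fraction Slash (U+2044) + '2'.
half = unicodedata.normalize("NFKD", "\u00BD")
print([hex(ord(ch)) for ch in half])  # ['0x31', '0x2044', '0x32']

# 'ℌ' (U+210C, Black-Letter Capital H) decomposes to plain 'H'.
fancy_h = unicodedata.normalize("NFKD", "\u210C")
print(fancy_h)  # H
```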



NFKC (Compatibility Decomposition followed by Canonical Composition):

This again, like NFC, follows two steps: first it goes for the Compatibility Decomposition and then Canonical Composition occurs.
Example:
'Fancy H with Cedilla': "\u210B\u0327" - 'ℋ̧'
'Normal H with Cedilla': "\u1E28" - 'Ḩ'
Here, if 'Fancy H with Cedilla' is fed to the function with the 'NFKC' form, it first gets converted to 'Normal H' and 'Combining Cedilla',
which then undergo Canonical Composition to form 'Normal H with Cedilla'.
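A sketch of this two-step NFKC conversion:

```python
import unicodedata

# Script Capital H (U+210B) + Combining Cedilla (U+0327).
fancy = "\u210B\u0327"

# NFKC: compatibility decomposition ('H' + cedilla),
# then canonical composition into U+1E28.
result = unicodedata.normalize("NFKC", fancy)

print(result == "\u1E28")        # True
print(unicodedata.name(result))  # LATIN CAPITAL LETTER H WITH CEDILLA
```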



So, this was all about normalization.


Let's move on to what unicodedata.decomposition(chr) does:

unicodedata.decomposition(chr):
This is a function in the 'unicodedata' library in Python which returns the character decomposition mapping assigned to the character 'chr' given as the argument, as a string of space-separated hexadecimal code points.
If no such mapping is defined, this function returns an empty string.

Now, what is a Decomposition Mapping?
A Decomposition Mapping is a mapping from a character to a sequence of one or more characters that is canonically or compatibility equivalent to it, and that is listed in the character name list.
Each character has at most one decomposition mapping, which is either a canonical or a compatibility mapping.
Another term in this respect is 'Decomposable Character'. A Decomposable Character is a character that is equivalent to a sequence of one or more characters, according to the decomposition mappings found in the Unicode Character Database (UCD). A Decomposable Character is also referred to as a 'precomposed' or 'composite' character.
And a Decomposition refers to the sequence of one or more characters that is equivalent to a decomposable character.
So basically, the function 'unicodedata.decomposition(chr)' gives the decomposition of the character 'chr' given as its argument: if this character is decomposable, it returns the Decomposition of 'chr'; otherwise it returns an empty string.

EXAMPLE:

'Normal H with Cedilla': "\u1E28" - 'Ḩ'
Applying 'unicodedata.decomposition(chr)' to this character gives its Decomposition.
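A sketch of this example. Note that the mapping comes back as a string of hexadecimal code points, and compatibility mappings are prefixed with a tag such as &lt;font&gt;:

```python
import unicodedata

# Canonical decomposition mapping of 'Ḩ' (U+1E28): 'H' + Combining Cedilla.
print(unicodedata.decomposition("\u1E28"))  # 0048 0327

# Compatibility mapping of 'ℍ' (U+210D) carries a <font> tag.
print(unicodedata.decomposition("\u210D"))  # <font> 0048

# No mapping defined -> empty string.
print(repr(unicodedata.decomposition("a")))  # ''
```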




 
