Python stringprep Library
Stringprep (RFC 3454) describes a framework for preparing Unicode text strings in order to increase the likelihood that string input and string comparison work in ways that make sense for typical users throughout the world. Python's stringprep module exposes the tables from that RFC as lookup functions.
This article covers three of those functions:
1. stringprep.in_table_c5(code)
2. stringprep.in_table_c6(code)
3. stringprep.in_table_c7(code)
1. stringprep.in_table_c5(code)
This function returns True or False depending on whether the character passed as its argument is listed in Table C.5. The argument is a one-character string, not an integer code point.
Table C.5 contains the surrogate codes.
Surrogate characters are code points from two special ranges of Unicode values reserved for use as the leading and trailing values of paired code units in UTF-16.
Surrogates are used in pairs: in UTF-16, a high (leading) surrogate followed by a low (trailing) surrogate together encode a single code point above U+FFFF. High surrogates range from U+D800 to U+DBFF and low surrogates range from U+DC00 to U+DFFF. Because these ranges are reserved, a program that sees a 16-bit code unit in either range knows immediately that it is one half of a surrogate pair. They are called surrogates because they do not represent characters on their own, but only as a pair.
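The pairing arithmetic can be sketched in a few lines. This is a minimal illustration of the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the helper name to_surrogate_pair is hypothetical and is not part of the stringprep module.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates.

    Hypothetical helper for illustration; not part of stringprep.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)   # U+1F600 GRINNING FACE
print(f"U+{high:04X} U+{low:04X}")       # U+D83D U+DE00
```

The smallest and largest supplementary code points map to the corners of the two ranges: U+10000 gives (U+D800, U+DC00) and U+10FFFF gives (U+DBFF, U+DFFF).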
To view the table visit: https://datatracker.ietf.org/doc/html/rfc3454#appendix-C.5
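A short sketch of how in_table_c5 might be called. The sample characters are chosen here for illustration: the first and last code point of each surrogate range, plus two non-surrogates for contrast. Note that the function expects a one-character string.

```python
import stringprep

# Boundary surrogates plus two ordinary characters (illustrative choices).
samples = ["A", "\ud800", "\udbff", "\udc00", "\udfff", "\ue000"]

# Map each code point label to the lookup result.
results = {f"U+{ord(ch):04X}": stringprep.in_table_c5(ch) for ch in samples}

for label, listed in results.items():
    print(label, listed)
# U+0041 False
# U+D800 True
# U+DBFF True
# U+DC00 True
# U+DFFF True
# U+E000 False
```

U+E000 is the first private-use code point, just past the low-surrogate range, so it is not in the table.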
2. stringprep.in_table_c6(code)
This function returns True or False depending on whether the character passed as its argument is listed in Table C.6. The argument is a one-character string.
Table C.6 contains the code points that are inappropriate for plain text.
The table has five entries, U+FFF9 through U+FFFD: the three interlinear annotation characters, the object replacement character, and the replacement character.
To view the table visit: https://datatracker.ietf.org/doc/html/rfc3454#appendix-C.6
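As a quick check, the sketch below walks the code points around the table and collects the ones in_table_c6 reports as listed. The scan range U+FFF8 to U+FFFE is an illustrative choice that includes one neighbour on each side of the table.

```python
import stringprep

# Scan one code point before and after the expected U+FFF9..U+FFFD block.
listed = [
    f"U+{cp:04X}"
    for cp in range(0xFFF8, 0xFFFF)
    if stringprep.in_table_c6(chr(cp))
]

print(listed)
# ['U+FFF9', 'U+FFFA', 'U+FFFB', 'U+FFFC', 'U+FFFD']
```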
3. stringprep.in_table_c7(code)
This function returns True or False depending on whether the character passed as its argument is listed in Table C.7. The argument is a one-character string.
Table C.7 contains the code points that are inappropriate for canonical representation.
A canonical representation is one in which every object has exactly one representation (canonicalization being the process of converting a representation into its canonical form). The equality of two objects can therefore be tested by comparing their canonical forms.
Table C.7 contains the code points U+2FF0 through U+2FFB, the ideographic description characters. RFC 3454 prohibits them because they allow different sequences of characters to be rendered the same way, which works against having a single canonical form.
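The idea of comparing canonical forms can be illustrated with Unicode normalization from the standard unicodedata module. This is only an analogy chosen for this article: it uses NFC on an accented Latin letter and does not involve Table C.7 itself.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The raw strings differ even though they display identically.
raw_equal = composed == decomposed

# After normalizing both to the same canonical form, they compare equal.
canonical_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal)        # False
print(canonical_equal)  # True
```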
To view the table visit: https://datatracker.ietf.org/doc/html/rfc3454#appendix-C.7
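The sketch below first checks the boundaries of Table C.7 and then combines the three functions from this article into one scan. The helper prohibited_characters is hypothetical and written for this example; a real stringprep profile checks more tables than these three.

```python
import stringprep

# Boundary checks for Table C.7 (U+2FF0..U+2FFB).
print(stringprep.in_table_c7("\u2ff0"))  # True
print(stringprep.in_table_c7("\u2ffb"))  # True
print(stringprep.in_table_c7("\u2fef"))  # False
print(stringprep.in_table_c7("\u2ffc"))  # False

# Table names paired with their lookup functions.
CHECKS = (
    ("C.5", stringprep.in_table_c5),
    ("C.6", stringprep.in_table_c6),
    ("C.7", stringprep.in_table_c7),
)


def prohibited_characters(text):
    """Return (index, code point label, table) for each listed character.

    Hypothetical helper for illustration; not part of stringprep.
    """
    hits = []
    for index, ch in enumerate(text):
        for table, check in CHECKS:
            if check(ch):
                hits.append((index, f"U+{ord(ch):04X}", table))
    return hits


hits = prohibited_characters("ab\ufffdc\u2ff0\ud800")
print(hits)
# [(2, 'U+FFFD', 'C.6'), (4, 'U+2FF0', 'C.7'), (5, 'U+D800', 'C.5')]
```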