Python stringprep Library:
Another useful module in Python's standard library is stringprep.
Stringprep, defined in RFC 3454, is a framework for preparing Unicode text strings so that string input and string comparison work in ways that make sense for typical users throughout the world.
When we look up resources on the internet, the names that identify them are matched and compared as strings. Exactly how these comparisons should be made depends on the application domain: whether matching is case-sensitive, whether whitespace is significant or can be ignored, and so on.
RFC 3454 defines a procedure for preparing Unicode strings before they are transmitted over the wire. After going through this preparation, the strings have a certain normalized form.
RFC 3454 also defines a set of tables that can be combined into profiles. Each profile must specify which tables it uses and which other optional parts of the stringprep procedure it includes.
One example of a stringprep profile is nameprep, which is used for internationalized domain names.
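To make the idea of a profile more concrete, here is a minimal sketch of how a few tables might be combined. The function name `prepare_label` and the particular choice of tables are assumptions made for illustration. This is not the nameprep profile itself, and it uses Python's current Unicode data for NFKC rather than the Unicode 3.2 data that RFC 3454 is based on.

```python
import stringprep
import unicodedata


def prepare_label(text):
    """Hypothetical, simplified profile: map, normalize, then check for prohibited characters."""
    # Mapping step: drop characters listed in Table B.1 and case-fold the rest with Table B.2.
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in text
        if not stringprep.in_table_b1(ch)
    )
    # Normalization step: apply NFKC.
    normalized = unicodedata.normalize("NFKC", mapped)
    # Prohibition step: reject unassigned code points (A.1) and control characters (C.2.1, C.2.2).
    for ch in normalized:
        if stringprep.in_table_a1(ch) or stringprep.in_table_c21_c22(ch):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")
    return normalized


print(prepare_label("Exam\u00adPLE"))  # soft hyphen removed, letters folded: example
```

A real profile would typically check more tables (for example the other C tables and the bidirectional tables D.1 and D.2), so treat this only as an outline of the three steps.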
There are two kinds of tables: sets and mappings.
Set Tables:
For a set table, the module provides a membership test: it returns True if a character is present in the set and False otherwise.
Mapping Tables:
For a mapping table, the module provides a mapping function: when a key is passed, the associated value is returned.
The module stringprep exposes the tables from RFC 3454. Because these tables would be very large to represent as dictionaries or lists, the module uses the Unicode Character Database internally. The tables are therefore exposed as functions, not as data structures.
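As a quick way to see this, the sketch below lists the functions the module offers. The variable names `set_tables` and `mapping_tables` are chosen here for illustration only.

```python
import stringprep

# Set tables are exposed as in_table_* predicates, mapping tables as map_table_* functions.
set_tables = sorted(name for name in dir(stringprep) if name.startswith("in_table_"))
mapping_tables = sorted(name for name in dir(stringprep) if name.startswith("map_table_"))

print(set_tables)
print(mapping_tables)
```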
Let's talk about some of its functions:
1. stringprep.in_table_a1(code):
This function returns True if the character passed as the argument is in Table A.1, and False if it is not.
Table A.1 contains the list of all the unassigned code points in Unicode 3.2.
First of all, what is a code point in Unicode?
A code point is a number assigned to represent an abstract character in a system for representing text. Unicode code points are written in the form "U+1234", where "1234" is the assigned number in hexadecimal.
Example: the character "A" is assigned the code point "U+0041".
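In Python, the built-in functions `ord` and `chr` convert between a character and its code point. A small sketch (the variable names are chosen for illustration):

```python
ch = "A"
code_point = ord(ch)            # integer code point of the character
label = f"U+{code_point:04X}"   # conventional U+XXXX notation, in hexadecimal

print(code_point, label)        # 65 U+0041
print(chr(0x0041))              # A
```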
So, if the code point passed as the parameter is unassigned, it is present in Table A.1 and the function returns True. If it is an assigned Unicode code point, it is not present in Table A.1 and the function returns False.
EXAMPLE:
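A minimal sketch of how the call might look. The sample characters are chosen here for illustration: "A" is an assigned code point, while U+0221 is listed in Table A.1 as unassigned in Unicode 3.2.

```python
import stringprep

# The function takes a single character (a str of length 1), not an integer.
print(stringprep.in_table_a1("A"))       # False: U+0041 is an assigned code point
print(stringprep.in_table_a1("\u0221"))  # True: U+0221 was unassigned in Unicode 3.2
```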
2. stringprep.in_table_b1(code):
This function returns True if the character passed as the argument is present in Table B.1, and False if it is not.
Table B.1 contains the Unicode characters that are commonly mapped to nothing, that is, characters that are simply removed during preparation.
EXAMPLE:
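A short sketch, with sample characters chosen for illustration. The soft hyphen (U+00AD) and the zero width space (U+200B) are both listed in Table B.1. The variable name `cleaned` is hypothetical.

```python
import stringprep

print(stringprep.in_table_b1("\u00ad"))  # True: U+00AD SOFT HYPHEN
print(stringprep.in_table_b1("\u200b"))  # True: U+200B ZERO WIDTH SPACE
print(stringprep.in_table_b1("a"))       # False

# "Mapping to nothing" means dropping such characters from the string.
cleaned = "".join(ch for ch in "co\u00adoperate" if not stringprep.in_table_b1(ch))
print(cleaned)  # cooperate
```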
3. stringprep.map_table_b2(code):
Unlike the previous two functions, this function does not return True or False. Instead, it returns the mapped value for the code according to Table B.2, which is the mapping for case-folding used with the NFKC normalization form.
Case folding removes case distinctions by replacing upper-case and title-case characters with their lower-case equivalents.
Compatibility mappings substitute characters with their compatibility decompositions. Many compatibility mappings are foldings. Some are multigraph expansions, that is, the replacement of a multigraph (for example, the double prime) by an equivalent series of single characters (in this case, two single primes). Multigraph expansions are a subset of compatibility mappings.
This kind of folding is useful in fuzzy searches.
EXAMPLE:
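A brief sketch with sample characters chosen for illustration. Note that the returned value is a string and may be longer than one character.

```python
import stringprep

print(stringprep.map_table_b2("A"))       # a
print(stringprep.map_table_b2("\u00df"))  # ss  (U+00DF LATIN SMALL LETTER SHARP S)
print(stringprep.map_table_b2("\u2121"))  # tel (U+2121 TELEPHONE SIGN, a compatibility character)
```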
4. stringprep.map_table_b3(code):
This function from the stringprep module returns the mapped value for the code according to Table B.3, which is the mapping for case-folding used with no normalization. For most characters the result is simply the lower-case form, although a few characters map to a different sequence.
EXAMPLE: