The forgex_utf8_m module processes byte-indexed character strings as UTF-8 strings.
Type | Visibility | Attributes | Name | Initial
---|---|---|---|---
integer(kind=int8) | public | parameter | ascii_mask | 127
integer(kind=int8) | public | parameter | continuation_mask | -65
integer(kind=int8) | public | parameter | fullbit | -1
integer(kind=int8) | public | parameter | lead_2_mask | -33
integer(kind=int8) | public | parameter | lead_3_mask | -17
integer(kind=int8) | public | parameter | lead_4_mask | -9
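
The negative initial values are easier to read as 8-bit patterns. The following sketch redeclares the constants locally (so it compiles without the library) and prints each one bit by bit; the helper `show` is hypothetical and not part of the module, and a two's-complement representation of `int8` is assumed.

```fortran
program show_masks
   use, intrinsic :: iso_fortran_env, only: int8
   implicit none
   ! Values copied from the table above; redeclared here so that the
   ! sketch does not depend on the forgex library.
   integer(int8), parameter :: ascii_mask        = 127_int8
   integer(int8), parameter :: continuation_mask = -65_int8
   integer(int8), parameter :: fullbit           = -1_int8
   integer(int8), parameter :: lead_2_mask       = -33_int8
   integer(int8), parameter :: lead_3_mask       = -17_int8
   integer(int8), parameter :: lead_4_mask       = -9_int8

   call show('ascii_mask       ', ascii_mask)         ! 01111111
   call show('continuation_mask', continuation_mask)  ! 10111111
   call show('fullbit          ', fullbit)            ! 11111111
   call show('lead_2_mask      ', lead_2_mask)        ! 11011111
   call show('lead_3_mask      ', lead_3_mask)        ! 11101111
   call show('lead_4_mask      ', lead_4_mask)        ! 11110111
contains
   ! Hypothetical helper: prints the eight bits of mask, most significant first.
   subroutine show(name, mask)
      character(len=*), intent(in) :: name
      integer(int8),    intent(in) :: mask
      character(len=8) :: bits
      integer :: i
      do i = 0, 7
         ! btest numbers bits from the least significant end.
         bits(8-i:8-i) = merge('1', '0', btest(mask, i))
      end do
      print '(a, " = ", a)', name, bits
   end subroutine show
end program show_masks
```

Each lead and continuation mask has exactly one zero bit, at the position where the corresponding UTF-8 byte pattern (10xxxxxx, 110xxxxx, 1110xxxx, 11110xxx) requires a 0.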
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | chara
The char_utf8 function takes a Unicode code point as an integer and returns the corresponding character as a UTF-8 byte string.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
integer(kind=int32) | intent(in) | | | code
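
As an illustration of the code-point-to-bytes conversion described above, here is a minimal, self-contained sketch. The names `encode_utf8` and `cont` are hypothetical and this is not the library's implementation; returning an empty string for an out-of-range or surrogate code point is this sketch's own choice. It assumes the default integer has at least 32 bits and that `char` accepts values 0 to 255.

```fortran
program encode_demo
   implicit none
   character(len=:), allocatable :: s
   integer :: i

   s = encode_utf8(int(z'3042'))    ! U+3042, HIRAGANA LETTER A
   print '(3(z2.2, 1x))', (ichar(s(i:i)), i = 1, len(s))   ! E3 81 82
contains
   ! Hypothetical encoder: code point -> 1 to 4 UTF-8 bytes.
   pure function encode_utf8(code) result(s)
      integer, intent(in) :: code
      character(len=:), allocatable :: s
      if (code < 0 .or. code > 1114111) then             ! beyond U+10FFFF
         s = ''
      else if (code >= 55296 .and. code <= 57343) then   ! U+D800-U+DFFF
         s = ''
      else if (code < 128) then                          ! 0xxxxxxx
         s = char(code)
      else if (code < 2048) then                         ! 110xxxxx 10xxxxxx
         s = char(ior(192, ishft(code, -6))) // cont(code, 0)
      else if (code < 65536) then                        ! 1110xxxx + 2 trail bytes
         s = char(ior(224, ishft(code, -12))) // cont(code, -6) // cont(code, 0)
      else                                               ! 11110xxx + 3 trail bytes
         s = char(ior(240, ishft(code, -18))) // cont(code, -12) &
             // cont(code, -6) // cont(code, 0)
      end if
   end function encode_utf8

   ! Builds one continuation byte (10xxxxxx) from six bits of code.
   pure function cont(code, shift) result(c)
      integer, intent(in) :: code, shift
      character(len=1) :: c
      c = char(ior(128, iand(ishft(code, shift), 63)))
   end function cont
end program encode_demo
```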
This function counts the occurrences of a specified character (token) in a given string.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | str
character(len=1) | intent(in) | | | token
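
The counting can be pictured as a plain byte-wise scan. The sketch below uses a hypothetical function `count_char` with the same argument types as the table above; it is an assumption about the behavior, not the library's code.

```fortran
program count_demo
   implicit none
   print *, count_char('a,b,,c', ',')   ! 3
contains
   ! Hypothetical counter: compares every byte of str with token.
   pure function count_char(str, token) result(n)
      character(len=*), intent(in) :: str
      character(len=1), intent(in) :: token
      integer :: n, i
      n = 0
      do i = 1, len(str)
         if (str(i:i) == token) n = n + 1
      end do
   end function count_char
end program count_demo
```

Because token is declared as character(len=1), it holds a single byte, so a multibyte UTF-8 character cannot be passed as the token.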
This function takes a UTF-8 character as an argument and returns the integer (the "code point" in Unicode) that its UTF-8 byte sequence represents.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | chara
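
The reverse direction, from bytes to code point, can be sketched as follows. The function `decode_utf8` is hypothetical and assumes that chara holds one complete, well-formed UTF-8 character; it performs no validation, which the real function may well do differently.

```fortran
program decode_demo
   implicit none
   ! The bytes E3 81 82 encode U+3042.
   character(len=3), parameter :: a = char(227) // char(129) // char(130)
   print '(a, z4.4)', 'U+', decode_utf8(a)   ! U+3042
contains
   ! Hypothetical decoder for one well-formed UTF-8 character.
   pure function decode_utf8(chara) result(code)
      character(len=*), intent(in) :: chara
      integer :: code, b, n, i
      b = ichar(chara(1:1))
      if (b < 128) then          ! 0xxxxxxx: ASCII
         n = 1
         code = b
      else if (b < 224) then     ! 110xxxxx: keep the low 5 bits
         n = 2
         code = iand(b, 31)
      else if (b < 240) then     ! 1110xxxx: keep the low 4 bits
         n = 3
         code = iand(b, 15)
      else                       ! 11110xxx: keep the low 3 bits
         n = 4
         code = iand(b, 7)
      end if
      ! Each trail byte contributes its low 6 bits.
      do i = 2, n
         code = ior(ishft(code, 6), iand(ichar(chara(i:i)), 63))
      end do
   end function decode_utf8
end program decode_demo
```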
This function returns the index of the end of the (multibyte) character, given the string str and the current index curr.

Classes of invalid UTF-8 characters:

1. invalid lead byte
2. invalid trail byte
3. overrun
4. overlong encoding
5. incomplete multibyte sequence
6. invalid character range (U+D800-U+DFFF)
7. BOM appears in the middle
8. isolated trail byte
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | str
integer(kind=int32) | intent(in) | | | curr
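
For a well-formed string, the end index follows directly from the lead byte at curr. The sketch below shows that idea with a hypothetical function `end_index`; how the real function treats the invalid classes listed above is not specified here, so the fallback (return curr, and never go past the end of str) is an assumption of this sketch.

```fortran
program tail_demo
   implicit none
   ! 'a', then the three bytes of U+3042, then 'b'.
   character(len=*), parameter :: s = &
      'a' // char(227) // char(129) // char(130) // 'b'
   print *, end_index(s, 1), end_index(s, 2), end_index(s, 5)   ! 1 4 5
contains
   ! Hypothetical helper: index of the last byte of the character at curr.
   pure function end_index(str, curr) result(tail)
      character(len=*), intent(in) :: str
      integer, intent(in) :: curr
      integer :: tail, b
      b = ichar(str(curr:curr))
      if (iand(b, 224) == 192) then        ! 110xxxxx: 2-byte character
         tail = curr + 1
      else if (iand(b, 240) == 224) then   ! 1110xxxx: 3-byte character
         tail = curr + 2
      else if (iand(b, 248) == 240) then   ! 11110xxx: 4-byte character
         tail = curr + 3
      else                                 ! ASCII, or not a valid lead byte
         tail = curr
      end if
      tail = min(tail, len(str))           ! do not overrun the string
   end function end_index
end program tail_demo
```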
This function determines whether a given character is the first byte of a UTF-8 multibyte character. It takes a 1-byte character as input and returns a logical value indicating whether it is the first byte of a UTF-8 byte sequence.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=1) | intent(in) | | | chara
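
One common way to make this decision is to test whether the byte is a continuation byte (bit pattern 10xxxxxx). The sketch below uses a hypothetical function `is_lead`; treating ASCII bytes as first bytes as well is an assumption, not something the description above states.

```fortran
program first_byte_demo
   implicit none
   ! 'a' is ASCII, E3 is a lead byte, 81 is a continuation byte.
   print *, is_lead('a'), is_lead(char(227)), is_lead(char(129))   ! T T F
contains
   ! Hypothetical test: true unless the byte has the form 10xxxxxx.
   pure logical function is_lead(chara)
      character(len=1), intent(in) :: chara
      is_lead = iand(ichar(chara), 192) /= 128
   end function is_lead
end program first_byte_demo
```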
This function checks whether the input byte string is valid as a single UTF-8 character.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | chara
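
A validity check of this kind typically combines a structural test (lead byte, trail bytes, length) with a range test on the decoded value. The following is a minimal sketch with a hypothetical function `is_valid_char`; it follows the general UTF-8 rules and is not taken from the library, whose checks may differ in detail.

```fortran
program valid_demo
   implicit none
   print *, is_valid_char(char(227) // char(129) // char(130))  ! T: U+3042
   print *, is_valid_char(char(192) // char(175))               ! F: overlong '/'
   print *, is_valid_char(char(237) // char(160) // char(128))  ! F: U+D800
   print *, is_valid_char(char(227) // char(129))               ! F: incomplete
contains
   ! Hypothetical check: is chara exactly one well-formed UTF-8 character?
   pure logical function is_valid_char(chara) result(ok)
      character(len=*), intent(in) :: chara
      ! Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
      integer, parameter :: min_code(4) = [0, 128, 2048, 65536]
      integer :: b, n, i, code
      ok = .false.
      if (len(chara) < 1) return
      b = ichar(chara(1:1))
      if (b < 128) then
         n = 1
         code = b
      else if (iand(b, 224) == 192) then
         n = 2
         code = iand(b, 31)
      else if (iand(b, 240) == 224) then
         n = 3
         code = iand(b, 15)
      else if (iand(b, 248) == 240) then
         n = 4
         code = iand(b, 7)
      else
         return                               ! isolated trail or invalid lead byte
      end if
      if (len(chara) /= n) return             ! incomplete sequence or extra bytes
      do i = 2, n
         b = ichar(chara(i:i))
         if (iand(b, 192) /= 128) return      ! invalid trail byte
         code = ior(ishft(code, 6), iand(b, 63))
      end do
      if (code < min_code(n)) return                    ! overlong encoding
      if (code >= 55296 .and. code <= 57343) return     ! U+D800-U+DFFF
      if (code > 1114111) return                        ! beyond U+10FFFF
      ok = .true.
   end function is_valid_char
end program valid_demo
```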
This function calculates the length of a UTF-8 string, excluding trailing spaces.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | str
This function calculates the length of a UTF-8 string.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | str
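
To show how a length in characters differs from the byte length reported by the intrinsic len, the sketch below counts every byte that is not a continuation byte. The function `count_chars` is hypothetical, and the assumption that the two length functions above count characters in this way is not confirmed by the descriptions.

```fortran
program length_demo
   implicit none
   ! 'a', the three bytes of U+3042, 'b', and two trailing spaces.
   character(len=*), parameter :: s = &
      'a' // char(227) // char(129) // char(130) // 'b  '
   ! Bytes, characters, characters without trailing spaces.
   print *, len(s), count_chars(s), count_chars(trim(s))   ! 7 5 3
contains
   ! Hypothetical counter: one character per non-continuation byte.
   pure function count_chars(str) result(n)
      character(len=*), intent(in) :: str
      integer :: n, i
      n = 0
      do i = 1, len(str)
         if (iand(ichar(str(i:i)), 192) /= 128) n = n + 1
      end do
   end function count_chars
end program length_demo
```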
This function returns the index of the next character, given the string str and the current index curr. If curr already points at the last character, it returns the invalid value.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | str
integer | intent(in) | | | curr
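
A function of this shape is typically used to walk a string one character at a time. The sketch below does so with a hypothetical function `next_index` that skips continuation bytes; it returns len(str) + 1 after the last character, which is this sketch's own convention, since the actual invalid value is not given here.

```fortran
program walk_demo
   implicit none
   ! 'a', the three bytes of U+3042, 'b'.
   character(len=*), parameter :: s = &
      'a' // char(227) // char(129) // char(130) // 'b'
   integer :: i

   i = 1
   do while (i <= len(s))
      print *, 'character starts at byte', i   ! 1, then 2, then 5
      i = next_index(s, i)
   end do
contains
   ! Hypothetical helper: first byte after curr that is not a continuation byte.
   pure function next_index(str, curr) result(next)
      character(len=*), intent(in) :: str
      integer, intent(in) :: curr
      integer :: next
      next = curr + 1
      do while (next <= len(str))
         ! Tested inside the loop because Fortran does not guarantee
         ! short-circuit evaluation of .and. in the loop condition.
         if (iand(ichar(str(next:next)), 192) /= 128) exit
         next = next + 1
      end do
   end function next_index
end program walk_demo
```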
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | str
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | chara
This function takes one byte, sets its first two bits to 10, and returns the result as one byte of the continuation part.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
integer(kind=int8) | intent(in) | | | byte
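
Setting the first two bits to 10 amounts to clearing bit 6 and setting bit 7. One possible way to do this with the continuation_mask value listed in the constants table is sketched below; the function `to_continuation` is hypothetical, the constant is redeclared locally, and whether the library combines the operations in this way is an assumption.

```fortran
program continuation_demo
   use, intrinsic :: iso_fortran_env, only: int8
   implicit none
   ! Same value as in the constants table: bit pattern 10111111.
   integer(int8), parameter :: continuation_mask = -65_int8
   integer(int8) :: res

   res = to_continuation(int(z'42', int8))    ! input  01000010
   ! Widen to a default integer and keep the low 8 bits for printing.
   print '(z2.2)', iand(int(res), 255)        ! 82, i.e. 10000010
contains
   ! Hypothetical helper: force the top two bits of byte to 10.
   pure function to_continuation(byte) result(res)
      integer(int8), intent(in) :: byte
      integer(int8) :: res
      ! The mask clears bit 6; ibset then sets bit 7.
      res = ibset(iand(byte, continuation_mask), 7)
   end function to_continuation
end program continuation_demo
```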
This subroutine determines whether each byte in a given string is the first byte of a UTF-8 multibyte character. It takes a UTF-8 string and returns a logical array indicating, for each position, whether that byte is a first byte.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=length) | intent(in) | | | str
logical | intent(inout) | | allocatable | array(:)
integer(kind=int32) | intent(in) | | | length
This subroutine returns the index of the next UTF-8 character contained in str. It is used to handle strings that may not be encoded in UTF-8.
Type | Intent | Optional | Attributes | Name
---|---|---|---|---
character(len=*) | intent(in) | | | str
integer | intent(in) | | | curr
integer | intent(inout) | | | next
logical | intent(inout) | | | is_valid