New Light on Fortran String Processing: Forgex 1st Release
Author: Amasaki Shinobu (雨崎しのぶ)
Twitter: @amasaki203
Posted on: 2023-12-25 JST Updated on: 2023-12-27 JST
Abstract
Forgex is a regular expression engine written entirely in Fortran. It is a library module that is executable with just a Fortran compiler, devoid of dependencies other than a compiler and Fortran Package Manager (FPM). This library is freely available under the MIT license.
About Forgex
Forgex is a library module designed for processing regular expressions (pronounced as ‘forge x’ by me).
The distinctive features implemented in this library include:
- UTF-8 code set support
- Operator-oriented API
- FPM and CMake support
Regular Expression Processing Implemented
This section lists the regular expression patterns that are implemented in Forgex.
Metacharacters
|
: Alternation*
: Match zero or more+
: Match one or more?
: Match zero or one\
: Escape a metacharacter.
: Match any character
Character Classes
[a-z]
: Character classes[^a-z]
: Inverted character classes
Both character classes and their invertion support UTF-8
character sets, e. g. [α-ωぁ-ん]
Anchor
^
: Match at the beginning of the line$
: Match at the end of the line
Repetition
{2}
: number of repetitions{,3}
: maximum number of repetitions{5,}
: minimum number of repetitions{7, 11}
: both minimum and maximum numbers of repetitions
Shorthand with Backslash
\t
: TAB\n
: Line Feed (LF) or Carriage return (CR) and LF\r
: CR\s
: A blank character (white space, TAB, LF, CR, Form Feed, Idepgraphic Space U+3000)\S
: Other than\s
\w
:[a-zA-Z0-9_]
\W
: Other than\w
\d
:[0-9]
\D
: Other than\d
(therefore[^0-9]
)
Usage
Build
Operation has been confirmed with the following compilers:
- GNU Fortran (
gfortran
) v13.2.1 - Intel Fortran Compiler (
ifx
) 2024.0.0 20231017
To build this, the use of FPM is assumed. An alternative option is CMake.
Building with FPM
Add the following to your project’s
fpm.toml
:
[dependencies]
forgex = {git = "https://github.com/shinobuamasaki/forgex"}
(Alternative) Building with CMake
Forgex also supports building with CMake.
% wget https://github.com/ShinobuAmasaki/forgex/archive/refs/tags/v1.0.tar.gz # download forgex
% tar xvzf ./v1.0.tar.gz # decompress
% cd ./forgex-1.0 # change directory
% cmake -S . -B ./build # make 'build' directory
% cmake --build ./build # build
% sudo cmake --install ./build --prefix /usr/local # install
% cd ./build
% ctest # test
While it is possible to build using CMake, I recommend the more straightforward approach using FPM.
APIs
Declaring use forgex
at the top of your program
introduces the .in.
and .match.
operators and the regex
function. Below we will
look at how to use each of these three.
The .in.
operator
The .in.
operator returns true if the pattern
(left operand) is contained in the string (right operand).
block
character(:), allocatable :: pattern, str
pattern = 'foo(bar|baz)'
str = "foobarbaz"
print *, pattern .in. str ! T
str = "foofoo"
print *, pattern .in. str ! F
end block
The .match.
operator
The .match.
operator returns true if the
pattern and the string match exactly.
block
character(:), allocatable :: pattern, str
pattern = '\d{3}-\d{4}'
str = '100-0001'
print *, pattern .match. str ! T
str = '1234567'
print *, pattern .match. str ! F
end block
The regex
function
The regex
function takes a pattern and string
as arguments and returns the matched substring. The substring
is the leftmost and longmost match in the string.
block
character(:), allocatable :: pattern, str
pattern = '[a-z]{3}'
str = 'foobarbaz'
print *, regex(pattern, str) ! foo
end block
As an option, you can also pass an integer-type arugment
length
that will stores the byte length of the
substring.
block
character(:), allocatable :: pattern, str
integer :: length
pattern = '[a-z]{3}'
str = 'foobarbaz'
print *, regex(pattern, str, length) ! foo
print *, length ! 3
end block
The interface of the regex
function is as
follows:
function regex (pattern, str, length) result(res)
character(*), intent(in) :: pattern, str
integer, intent(inout), optional :: length
character(:), allocatable :: res
UTF-8 String Processing
UTF-8 string can be matched using regular expression
patterns just like ASCII strings. The following example
demonstrates matching Chinese characters. In this example, the
variable length
stores the byte length, and in
this case there 10 3-byte characters, so the length is 30.
block
character(:), allocatable :: pattern, str
integer :: length
pattern = "夢.{3,7}胡蝶"
str = "昔者莊周夢爲胡蝶 栩栩然胡蝶也"
print *, pattern .in. str ! T
print *, regex(pattern, str, length) ! 夢爲胡蝶 栩栩然胡蝶
print *, length ! 30 (is 3-byte * 10 characters)
end block
Internal Implementation
In the implementation of regular expression engines, there are broadly two approaches: using Deterministic Finite Automaton (DFA) and constructing a virtual machine. In Forgex, I have adopted the former approach based on DFA.
Due to this choice, the following features will NOT be implemented:
- Backreference
- Recursive patterns
Additionally, at the current stage, an on-the-fly DFA
construction algorithm has not been implemented, thereby
failing to address the issue of state number explosion.
Consequently, patterns like [a-z]{20}b
fall into
the category of challenging cases.
Conclusion
While challenges such as addressing state explosion, optimization, and incorporating parallel processing remain, I have opted for the current release as the essential features are now in place.
With the release of this library, I hope for an improvement in the convenience of your Fortran string processing.
Acknowlegements
For the algorithm of the power set construction method and syntax analysis, I referred to Russ Cox’s article and Kondo Yoshiyuki’s book.
The implementation of the priority queue was based on the code written by ue1221.
The idea of applying the .in. operator to strings was inspired by kazulagi’s one.
References
- Russ Cox “Regular Expression Matching Can Be Simple And Fast”, 2007
- 近藤嘉雪 (Kondo Yoshiyuki), “定本 Cプログラマのためのアルゴリズムとデータ構造”, 1998, SB Creative.
- ue1221/fortran-utilities
- Haruka Tomobe (kazulagi), https://github.com/kazulagi, his article in Japanese