Dream of PURE
Regex
Author: Amasaki Shinobu (雨崎しのぶ)
Twitter: @amasaki203
Posted on: 2024-07-26 JST
Abstract
In the next major release of Fortran Regular Expression
library, I aim to provide pure
operators and a
pure
procedure.
Details
One of my recent personal projects is Forgex—Fortran Regular Expression. The latest version, version 2.0, is still under development. It has implemented Lazy DFA and released its documentation. In this version, the regular expression engine utilizes modern Fortran features, particularly modules, object-oriented programming, and pointers. Within the engine, regular expressions are parsed using module variables of derived-types, and matching is achieved by constructing an abstract syntax tree (AST), an equivalent non-deterministic finite automaton (NFA), and an equivalent deterministic finite automaton (DFA).
In the next version of the project, one of the goals is to support parallelization. There are two possible approaches to parallelize within the library (although I’m not sure if the first one is feasible):
- parallelizing the construction of automata,
- parallelizing the matching process.
The first approach involves the challenge of parallelizing the process of constructing an NFA from an AST and a DFA from an NFA. I’m not sure if this is even theoretically possible. If any readers have information on this, please let me know (feel free to contact me through GitHub issue page).
The second approach is simpler. The .in.
operator and regex
procedure shift the text string
to be matched from the beginning to the right by one character
at a time, trying to match, so in theory, this process can be
parallelized by the length of the string. However, this
approach cannot be applied to the .match.
operator.
Additionally, the potential for parallelization will be limited
when the optimization by literal search[1, 2], which is planned
for future implementation, is incorporated.
I initially considered these approaches, but after
reassessing the assumption that users would benefit from
parallelization, I realized that parallelization did not need
to be achieved within the library. The .in.
,
.match.
and regex
procedures provided
by the library depend only on two arguments—a pattern and text
string—and their results are determined without side effects.
In Fortran, procedures that have the properties of pure
functions can be written with the pure
attribute.
These can be used in do concurrent
constructs,
allowing the compiler to partially optimize the code block. In
other words, by implementing thread-safe procedures, users will
be able to handle parallelization themselves.
Therefore, the goal of Forgex version 3 is to rewrite the
code, giving all procedures the pure
attribute,
and to provide pure
.in.
and
.match.
operators, as well as
pure subroutine regex
(which are
function
in v2.0). To achieve this, the following
changes will be added:
- Remove pointers and module variables from the implementation of AST, NFA, and DFA.
- Implement the construction and processing of these using arrays of derived-types with index management.
- Allow reverse transitions in DFA for literal search optimization [1, 2] which I intend to implement in a later version.
Finally, almost all of the source code will be rewritten, but since Forgex v2.0 is only about 3000 lines, I don’t expect it to be too difficult. As of July 24, about 80% of the new AST construction has been written, so if all goes well, I should be able to release version 3 within this year.
Conclusion
Here’s what I’m planning for the next version of Forgex:
- Provide
pure
operators (.in.
and.match.
) and a subroutine (regex
). - Remove pointers and module variables, implement arrays of derived-types.
- Allow for future extensions to implement literal search optimization.
References
- Regex literals optimization - Nitely’s Thoughts, Esteban C. Borsani, 2020.
- Regex engine internals as a library - Andrew Gallant’s Blog, Andrew Gallant, 2023