Fortran Regular Expression: Forgex v2.0 Released
Author: Amasaki Shinobu (雨崎しのぶ)
Twitter: @amasaki203
Posted on: 2024-07-10 JST
Abstract
Forgex v2.0 with Lazy DFA is available.
Details
I have released version 2.0 of Forgex (short for Fortran Regular Expression), and it is now available on GitHub.
Starting with this version, I have adopted a lazy-evaluated deterministic finite automaton (Lazy DFA) approach for subset construction, which has reduced the consumption of time and memory resources compared to using previous statically compiled DFA.
As a minor fix, I have added the forgex_
prefix
to the names of all internal modules. This prevents module name
collisions when introducing Forgex into a new project. The API
module name forgex
and the APIs of
.in.
and .match.
operators, and
regex
function will remain unchanged.
Thank you.
To Do
Here are some of the things I am considering for the future of this project:
- improving documentation of the internal implementation for developers (top priority),
- optimizing with a literal search method,
- supporting escape sequences for Unicode,
- Parallelizing matching.
Memo
The following are just my own thoughts and notes, so I cannot guarantee their accuracy.
Documentation
I am trying to decide where to place and how to build the documentation for my Fortran project. Should I put it in the main repository or create a separate repository for the documentation? Should I use FORD or Sphinx? FORD is very easy to use as it automatically generates the reference documentation, but I am considering Sphinx because of the wide selection of visual themes and good typesetting.
Regex Literals Optimization
The regex literals optimization is a method which avoids executing the regex engine on parts of the input character string that cannot possibly match the regex pattern (cf. [5]). I need to think in advance about which part of the engine this feature should be implemented in. Since I want to avoid changing the syntax tree construction as much as possible, would it be better to implement this naively in a relatively shallow layer, in a module close to the users and API?
Parallelization
The thread parallelization of regex engines can be
discussed in two parts: DFA construction and matching.
Parallelizing DFA construction will undoubtedly be difficult.
Care must be taken to avoid race conditions. On the other hand,
parallelizing matching is more realistic, and is possible if I
prepare as many pointer variables as there are threads, with
the upper limit being the number of starting points for
matching. In other words, the .in.
operator and
the regex
function can be parallelized, but the
.match.
operator cannot.
The above discussion pertains to static DFA construction. However, in Lazy DFA, it will be necessary to balance both of these, that is, to avoid race conditions during matching and execute DFA construction in parallel. At this point, I don’t know whether this is possible or not, so I need to learn by reading computer science papers and studying other implementations of regex engines.
Nonetheless, even if I were to implement it to this extent, it is unclear whether I could expect any significant from parallelization.
Miscellaneous
- I would like to implement escape sequences for Unicode, adopting a notation that is widely used in other engines. But which one should I use?
- I would like to have more effective testing methods and a sufficient number of test cases.]