[1/2]
The second line skip bug >>138 is caused by the chunk tampering behavior of irregex-fold/fast >>135. The tampering seems to have been introduced to compensate for the erroneous empty match detection, so a test suite rerun on >>136 would fail: the two have to be fixed simultaneously. The (bol) branch >>138 can only succeed after the start of the input if it can look at the previous character and check that it is a #\newline. But after
bol ("\n\n\n\n\n\n" 0 6) 0 -> yes
the chunk start position is advanced to 1, and the (bol) attempt at 1 doesn't have the first character available for #\newline checking.
bol ("\n\n\n\n\n\n" 1 6) 1 -> no
The only way it can succeed at 1 is if the chunk start position stays at 0, so this bug is directly caused by the chunk tampering behavior. Even if other parts of the code rely on the tampering, it needs fixing, since it is now the source of a bug.
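To make the dependence on the chunk start concrete, here is a minimal sketch of a beginning-of-line test. It is a simplified model, not the irregex source: bol?, chunk-start and init-start are names made up for illustration, and the model assumes the test succeeds either at the start of the input or when the previous character is reachable inside the chunk and is a #\newline.

```scheme
;; Simplified model of a (bol) test, NOT the irregex implementation.
;; chunk-start: first index the matcher is allowed to look at (assumed name)
;; init-start:  index where the whole search originally began (assumed name)
;; i:           position being tested
(define (bol? str chunk-start init-start i)
  (or (= i init-start)                      ; at the start of the input
      (and (> i chunk-start)                ; previous char is inside the chunk
           (char=? (string-ref str (- i 1)) #\newline))))

(define six-newlines "\n\n\n\n\n\n")

(bol? six-newlines 0 0 0) ; => #t, start of the input
(bol? six-newlines 1 0 1) ; => #f, chunk start moved to 1, index 0 out of reach
(bol? six-newlines 0 0 1) ; => #t, chunk start kept at 0, index 0 is a newline
```

The second and third calls differ only in the chunk start, which is the difference between the failing trace above and what happens when the chunk is left alone.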
The retest after success behavior is caused by the erroneous empty match detection. This can be demonstrated as a separate bug:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
$1 = "---**a---"
The same empty match is replaced twice. We add the usual verbosity:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-14 02:49:04.508860562 +0000
@@ -3234,9 +3225,12 @@
flags
(lambda (cnk init src str i end matches fail) i))))
(lambda (cnk init src str i end matches fail)
+ (simple-format #t "look-ahead ~S ~A" src i)
(if (check cnk init src str i end matches (lambda () #f))
- (next cnk init src str i end matches fail)
- (fail)))))
+ (begin (display " -> yes\n")
+ (next cnk init src str i end matches fail))
+ (begin (display " -> no\n")
+ (fail))))))
((neg-look-ahead)
(let ((check
(lp (sre-sequence (cdr sre))
and we get:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
look-ahead ("---a---" 0 7) 0 -> no
look-ahead ("---a---" 0 7) 1 -> no
look-ahead ("---a---" 0 7) 2 -> no
look-ahead ("---a---" 0 7) 3 -> yes
look-ahead ("---a---" 3 7) 3 -> yes
look-ahead ("---a---" 4 7) 4 -> no
look-ahead ("---a---" 4 7) 5 -> no
look-ahead ("---a---" 4 7) 6 -> no
look-ahead ("---a---" 4 7) 7 -> no
$1 = "---**a---"
After the first empty match at 3, the current position is only advanced to 3, not to 4 as would happen on the empty match branch, so the same empty match is found again at 3. The empty match test of irregex-fold/fast >>135 is bogus.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3824
It checks that the match end position 'j' equals the start of the search interval 'i', so it can only catch empty matches at the start of the search interval. But empty matches can occur at any later position. To detect them, the match end position 'j' must be tested against the match start position. To fix both empty matches and chunk tampering:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-14 10:32:08.239441307 +0000
@@ -3816,16 +3813,20 @@
(if (not m)
(finish i acc)
(let ((j (%irregex-match-end-index m 0))
+ (jstart (%irregex-match-start-index m 0))
(acc (kons i m acc)))
(irregex-reset-matches! matches)
(cond
((flag-set? (irregex-flags irx) ~consumer?)
(finish j acc))
- ((= j i)
+ ; ((= j i)
+ ((= j jstart)
;; skip one char forward if we match the empty string
- (lp (list str (+ j 1) end) (+ j 1) acc))
+ ; (lp (list str (+ j 1) end) (+ j 1) acc))
+ (lp src (+ j 1) acc))
(else
- (lp (list str j end) j acc))))))))))
+ ; (lp (list str j end) j acc))))))))))
+ (lp src j acc))))))))))
(define (irregex-fold irx kons . args)
(if (not (procedure? kons)) (error "irregex-fold: not a procedure" kons))
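The difference between the two empty match tests can be reproduced without irregex. The following is only a toy model of the fold loop, under stated assumptions: search-empty-before-a stands in for the matcher and reports an empty match in front of the first #\a, like (?=a); fold-starts, old-test and new-test are hypothetical names, and the loop collects match start positions instead of calling kons.

```scheme
;; Toy matcher, NOT irregex: finds the first position p >= i where the
;; next char is #\a and reports the empty match (p . p), or #f if none.
(define (search-empty-before-a str i end)
  (let lp ((p i))
    (cond ((>= p end) #f)
          ((char=? (string-ref str p) #\a) (cons p p))
          (else (lp (+ p 1))))))

;; Toy fold loop: collects the start position of every match found.
;; empty-match? receives the search start i, the match start and the
;; match end j, and decides whether to skip one char forward.
(define (fold-starts str empty-match?)
  (let ((end (string-length str)))
    (let lp ((i 0) (acc '()))
      (let ((m (and (< i end) (search-empty-before-a str i end))))
        (if (not m)
            (reverse acc)
            (let ((jstart (car m)) (j (cdr m)))
              (if (empty-match? i jstart j)
                  (lp (+ j 1) (cons jstart acc))   ; skip past the empty match
                  (lp j (cons jstart acc)))))))))  ; resume at the match end

(define (old-test i jstart j) (= j i))       ; compares with the search start
(define (new-test i jstart j) (= j jstart))  ; compares with the match start

(fold-starts "---a---" old-test) ; => (3 3), the empty match at 3 found twice
(fold-starts "---a---" new-test) ; => (3)
```

With old-test the first search starts at 0, so the empty match at 3 is not recognized as empty and the loop resumes at 3, where it is found a second time. With new-test the loop resumes at 4 right away.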