[1/2]
https://github.com/ashinn/irregex/issues/21
Replacements of positive lookbehinds only replace first match #21
A simplified test case:
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa xa xa" "-")
$1 = "x- xa xa"
scheme@(guile-user)> (irregex-replace/all "x(?=a)" "xa xa xa" "-")
$2 = "-a -a -a"
The look-behind processing is at:
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L3240
((look-behind neg-look-behind)
(let ((check
(lp (sre-sequence
(cons '(* any) (append (cdr sre) '(eos))))
n
flags
(lambda (cnk init src str i end matches fail) i))))
(lambda (cnk init src str i end matches fail)
(let* ((cnk* (wrap-end-chunker cnk src i))
(str* ((chunker-get-str cnk*) (car init)))
(i* (cdr init))
(end* ((chunker-get-end cnk*) (car init))))
(if ((if (eq? (car sre) 'look-behind) (lambda (x) x) not)
(check cnk* init (car init) str* i* end* matches
(lambda () #f)))
(next cnk init src str i end matches fail)
(fail))))))
The irregex strategy for matching a look-behind is to rescan the entire input stream from the start to the current position with the concatenation of an iterated any-matcher, the look-behind and an end-of-stream anchor. This will work but whether it's a good idea on the "text-buffer opened onto a 500MB file" from the documentation is another matter, especially with repeated matching like that of irregex-replace/all. But for reasonably sized inputs it's not a real problem, and it is also not the source of the bug. We'll add some verbosity to the end of the let* to see the locals:
(end* ((chunker-get-end cnk*) (car init)))
(unused (simple-format #t "look-behind ~A ~A ~A ~A ~A | ~A ~A ~A\n" init src str i end str* i* end*)))
The failing test:
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa xa xa" "-")
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 0 8 | xa xa xa 0 0
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 1 8 | xa xa xa 0 1
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 2 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 3 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 4 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 5 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 6 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 7 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 8 8 | xa xa xa 0 8
$1 = "x- xa xa"
As soon as the first "a" was matched and 'src' switched from (xa xa xa 0 8) to (xa xa xa 2 8), 'end*' stopped following the current index 'i' and jumped straight to 8. The problem is this jump in 'end*'. This also means that we can make a much stranger failing case:
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa aaa x" "-")
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 0 8 | xa aaa x 0 0
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 1 8 | xa aaa x 0 1
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 2 8) xa aaa x 2 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 2 8) xa aaa x 3 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 4 8) xa aaa x 4 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 5 8) xa aaa x 5 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 6 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 7 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 8 8 | xa aaa x 0 8
$3 = "x- --- x"
This happens because after the first replacement the end-of-stream anchor at 'end*' moves the look-behind check to the end of the string. This 'end*' is computed by chunker-get-end which is a vector-ref into a vector of lambdas computed by wrap-end-chunker:
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L330
(define (wrap-end-chunker cnk src i)
(make-irregex-chunker
(lambda (x) (and (not (eq? x src)) ((chunker-get-next cnk) x)))
(chunker-get-str cnk)
(chunker-get-start cnk)
(lambda (x) (if (eq? x src) i ((chunker-get-end cnk) x)))
(chunker-get-substring cnk)
(chunker-get-subchunk cnk)))
This is a wrapper around the input stream that aims to stop at the current position 'i' of the current chunk 'src'. The chunk documentation is at:
http://synthcode.com/scheme/irregex/#SECTION_3.4
This is to be matched by the regex with the iterated any-matcher, the look-behind and an end-of-stream anchor. The wrapper returns #f for a next chunk request after the current one, and overrides the current chunk's end position to 'i'. The current chunk is tested for by eq?, as specified in the documentation.
There are two important constraints on the <get-next> procedure. It must return an eq? identical object when called multiple times on the same chunk, and it must not return a chunk with an empty string (start == end).
The lambda that returns the wrong 'end*' is the fourth item. A steady return of 'i' would be needed for the current chunk, so the eq? fails when it shouldn't. Plain string inputs only ever have one chunk, constructed in irregex-basic-string-chunker, so this eq? should not fail.
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L1861
(define irregex-basic-string-chunker
(make-irregex-chunker (lambda (x) #f)
car
cadr
caddr
(lambda (src1 i src2 j)
(substring (car src1) i j))))
The 'x' and 'src' in the eq? of wrap-end-chunker are the (car init) and 'src' from the look-behind section. The debug print shows them diverging when 'end*' stops behaving. The 'init' and 'src' values are passed by an external driver to the lambda produced by the look-behind section. In the case of irregex-replace/all this driver is irregex-fold/fast.
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L3780
[2/2]
(define (irregex-fold/fast irx kons knil str . o)
(if (not (string? str)) (error "irregex-fold: not a string" str))
(let* ((irx (irregex irx))
(matches (irregex-new-matches irx))
(finish (or (and (pair? o) (car o)) (lambda (i acc) acc)))
(start (if (and (pair? o) (pair? (cdr o))) (cadr o) 0))
(end (if (and (pair? o) (pair? (cdr o)) (pair? (cddr o)))
(caddr o)
(string-length str)))
(init-src (list str start end))
(init (cons init-src start)))
(if (not (and (integer? start) (exact? start)))
(error "irregex-fold: not an exact integer" start))
(if (not (and (integer? end) (exact? end)))
(error "irregex-fold: not an exact integer" end))
(irregex-match-chunker-set! matches irregex-basic-string-chunker)
(let lp ((src init-src) (i start) (acc knil))
(if (>= i end)
(finish i acc)
(let ((m (irregex-search/matches
irx
irregex-basic-string-chunker
init
src
i
matches)))
(if (not m)
(finish i acc)
(let ((j (%irregex-match-end-index m 0))
(acc (kons i m acc)))
(irregex-reset-matches! matches)
(cond
((flag-set? (irregex-flags irx) ~consumer?)
(finish j acc))
((= j i)
;; skip one char forward if we match the empty string
(lp (list str (+ j 1) end) (+ j 1) acc))
(else
(lp (list str j end) j acc))))))))))
The 'init' value is constructed once and never modified, but 'src' is modified by both loop calls to 'lp' by advancing the start position. The current position is already advanced as 'i', so advancing the current chunk's start position in sync looks suspicious. The modified chunk will no longer count as the initial chunk because they are compared with eq?, and plain strings have a single chunk. This tampering with the current chunk's start position is from:
Fixing folds on conditional begin patterns which aren't treated as searchers
ashinn committed Nov 28, 2012
https://github.com/ashinn/irregex/commit/2949a461474e0ac30d8c72f0dc81127b19c04d0d
By contrast irregex-fold/chunked/fast doesn't tamper with the current chunk and communicates the advancing of the current position within a chunk solely through the index 'i'.
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L3825
(define (irregex-fold/chunked/fast irx kons knil cnk start . o)
(let* ((irx (irregex irx))
(matches (irregex-new-matches irx))
(finish (or (and (pair? o) (car o)) (lambda (src i acc) acc)))
(i (if (and (pair? o) (pair? (cdr o)))
(cadr o)
((chunker-get-start cnk) start)))
(init (cons start i)))
(if (not (integer? i)) (error "irregex-fold/chunked: not an integer" i))
(irregex-match-chunker-set! matches cnk)
(let lp ((start start) (i i) (acc knil))
(if (not start)
(finish start i acc)
(let ((m (irregex-search/matches irx cnk init start i matches)))
(if (not m)
(finish start i acc)
(let ((end-src (%irregex-match-end-chunk m 0))
(end-index (%irregex-match-end-index m 0)))
(if (and (eq? end-src start) (= end-index i))
(if (>= end-index ((chunker-get-end cnk) end-src ))
(let ((next ((chunker-get-next cnk) end-src)))
(lp next ((chunker-get-start cnk) next) acc))
(lp end-src (+ end-index 1) acc))
(let ((acc (kons start i m acc)))
(irregex-reset-matches! matches)
(if (flag-set? (irregex-flags irx) ~consumer?)
(finish end-src end-index acc)
(lp end-src end-index acc)))))))))))
Applying this non-tampering approach to irregex-fold/fast yields the diff:
[post size limit dropped/2]
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-12 09:41:49.754022271 +0000
+++ irregex2.scm 2020-07-12 09:45:05.067970996 +0000
@@ -3813,9 +3813,11 @@
(finish j acc))
((= j i)
;; skip one char forward if we match the empty string
- (lp (list str (+ j 1) end) (+ j 1) acc))
+ ; (lp (list str (+ j 1) end) (+ j 1) acc))
+ (lp src (+ j 1) acc))
(else
- (lp (list str j end) j acc))))))))))
+ ; (lp (list str j end) j acc))))))))))
+ (lp src j acc))))))))))
(define (irregex-fold irx kons . args)
(if (not (procedure? kons)) (error "irregex-fold: not a procedure" kons))
This fixes the failing tests because the eq? of wrap-end-chunker now works properly which makes 'end*' in look-behind follow along with 'i', moving the look-behind check to the right place.
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa xa xa" "-")
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 0 8 | xa xa xa 0 0
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 1 8 | xa xa xa 0 1
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 2 8 | xa xa xa 0 2
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 3 8 | xa xa xa 0 3
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 4 8 | xa xa xa 0 4
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 5 8 | xa xa xa 0 5
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 6 8 | xa xa xa 0 6
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 7 8 | xa xa xa 0 7
$1 = "x- x- x-"
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa aaa x" "-")
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 0 8 | xa aaa x 0 0
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 1 8 | xa aaa x 0 1
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 2 8 | xa aaa x 0 2
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 3 8 | xa aaa x 0 3
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 4 8 | xa aaa x 0 4
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 5 8 | xa aaa x 0 5
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 6 8 | xa aaa x 0 6
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 7 8 | xa aaa x 0 7
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 8 8 | xa aaa x 0 8
$2 = "x- aaa x"
scheme@(guile-user)> (irregex-replace/all "(?<=[xyz])a" "xa ya za" "-")
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 0 8 | xa ya za 0 0
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 1 8 | xa ya za 0 1
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 2 8 | xa ya za 0 2
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 3 8 | xa ya za 0 3
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 4 8 | xa ya za 0 4
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 5 8 | xa ya za 0 5
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 6 8 | xa ya za 0 6
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 7 8 | xa ya za 0 7
$3 = "x- y- z-"
scheme@(guile-user)>
However, I am not familiar with all of irregex, only the small bits and pieces I've looked at, so it's entirely possible that this breaks something else. I do not know whether or not the chunk tampering behavior by irregex-fold/fast is relied on in some other part of the code. The shinnoid would have to chime in on that. Someone familiar with chicken scheme might also rerun the test suite with this change.
https://lists.nongnu.org/archive/html/chicken-users/2020-07/msg00005.html
https://github.com/ashinn/irregex/commit/ac27338c5b490d19624c30d787c78bbfa45e1f11
fix irregex-replace/all with look-behind patterns
ashinn committed Jul 13, 2020
@@ -332,7 +333,16 @@
(lambda (x) (and (not (eq? x src)) ((chunker-get-next cnk) x)))
(chunker-get-str cnk)
(chunker-get-start cnk)
- (lambda (x) (if (eq? x src) i ((chunker-get-end cnk) x)))
+ (lambda (x)
+ ;; TODO: this is a hack workaround for the fact that we don't
+ ;; have either a notion of chunk equivalence or chunk truncation,
+ ;; until which time (neg-)look-behind in a fold won't work on
+ ;; non-basic chunks.
+ (if (or (eq? x src)
+ (and (not ((chunker-get-next cnk) x))
+ (not ((chunker-get-next cnk) src))))
+ i
+ ((chunker-get-end cnk) x)))
(chunker-get-substring cnk)
(chunker-get-subchunk cnk)))
Chunks are now equated in the fourth item of wrap-end-chunker if they are eq? or they are both lacking a next chunk. Plain strings have a single chunk via irregex-basic-string-chunker so the second part of the 'or' will always be true for plain strings. Irregex-fold/fast was not changed.
Here is a related bug. We might want to add 'start' to the start of every line, and we might have a file with some number of empty lines.
$ for k in $(seq 1 6); do echo ""; done > emptylines.txt
$ hd emptylines.txt
00000000 0a 0a 0a 0a 0a 0a |......|
00000006
We might use sed:
$ sed -e 's/^/start/' emptylines.txt
start
start
start
start
start
start
$
We might use regular expressions in python:
$ python3
[...]
>>> s = "\n" * 6
>>> s
'\n\n\n\n\n\n'
>>> import re
>>> re.sub ("(?m)^", "start", s)
'start\nstart\nstart\nstart\nstart\nstart\nstart'
>>> print (_)
start
start
start
start
start
start
start
>>>
Sed doesn't have a match at the very end of the input because it uses the text file convention for a final newline. We might also use irregex:
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all 'bol "\n\n\n\n\n\n" "start")
$1 = "start\n\nstart\nstart\nstart\nstart\nstart"
scheme@(guile-user)> (display $1)
start
start
start
start
start
startscheme@(guile-user)>
There is no match at the start of the second line, when there clearly should be. We'll add the usual verbosity to see what happens:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-13 20:15:27.347747147 +0000
@@ -3419,11 +3419,14 @@
(fail))))
((bol)
(lambda (cnk init src str i end matches fail)
+ (simple-format #t "bol ~S ~A" src i)
(if (or (and (eq? src (car init)) (eqv? i (cdr init)))
(and (> i ((chunker-get-start cnk) src))
(eqv? #\newline (string-ref str (- i 1)))))
- (next cnk init src str i end matches fail)
- (fail))))
+ (begin (display " -> yes\n")
+ (next cnk init src str i end matches fail))
+ (begin (display " -> no\n")
+ (fail)))))
((bow)
(lambda (cnk init src str i end matches fail)
(if (and (if (> i ((chunker-get-start cnk) src))
Rerunning the test:
$ guile --no-auto-compile -l irregex2.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all 'bol "\n\n\n\n\n\n" "start")
bol ("\n\n\n\n\n\n" 0 6) 0 -> yes
bol ("\n\n\n\n\n\n" 1 6) 1 -> no
bol ("\n\n\n\n\n\n" 1 6) 2 -> yes
bol ("\n\n\n\n\n\n" 2 6) 2 -> no
bol ("\n\n\n\n\n\n" 2 6) 3 -> yes
bol ("\n\n\n\n\n\n" 3 6) 3 -> no
bol ("\n\n\n\n\n\n" 3 6) 4 -> yes
bol ("\n\n\n\n\n\n" 4 6) 4 -> no
bol ("\n\n\n\n\n\n" 4 6) 5 -> yes
bol ("\n\n\n\n\n\n" 5 6) 5 -> no
bol ("\n\n\n\n\n\n" 5 6) 6 -> yes
$1 = "start\n\nstart\nstart\nstart\nstart\nstart"
scheme@(guile-user)>
There is no match at index 1, and every location from 2 to 5 is retested after success. Both of these bugs are in irregex-fold/fast which completely mishandles empty matches.
Here is the current irregex 0.9.8 behavior on searching for an empty regex, as a building block for later.
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all "" "aaa" "*")
$1 = "*a*a*a"
scheme@(guile-user)> (irregex-replace/all "" "\n\n\n" "*")
$2 = "*\n*\n*\n"
scheme@(guile-user)>
There is no match at the very end of the input. This may simply be the behavior desired by the author. If it is not desired, the loop end condition of irregex-fold/fast can be made strict.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3807
[1/2]
The second line skip bug >>138 is caused by the chunk tampering behavior of irregex-fold/fast >>135. This seems to have been introduced to compensate for the erroneous empty match detection, so a test suite rerun on >>136 would fail, as these have to be fixed simultaneously. The (bol) branch >>138 can only succeed after the start of the input if it can look at the previous character and check that it is a #\newline. But after
bol ("\n\n\n\n\n\n" 0 6) 0 -> yes
the chunk start position is advanced to 1, and the (bol) attempt at 1 doesn't have the first character available for #\newline checking.
bol ("\n\n\n\n\n\n" 1 6) 1 -> no
The only way it can succeed at 1 is if the chunk start position stays at 0, so this bug is directly caused by the chunk tampering behavior. Even if some other parts of the code rely on it, there is something to fix about it since now it is the source of a bug.
The retest after success behavior is caused by the erroneous empty match detection. This can be demonstrated as a separate bug:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
$1 = "---**a---"
The same empty match is replaced twice. We add the usual verbosity:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-14 02:49:04.508860562 +0000
@@ -3234,9 +3225,12 @@
flags
(lambda (cnk init src str i end matches fail) i))))
(lambda (cnk init src str i end matches fail)
+ (simple-format #t "look-ahead ~S ~A" src i)
(if (check cnk init src str i end matches (lambda () #f))
- (next cnk init src str i end matches fail)
- (fail)))))
+ (begin (display " -> yes\n")
+ (next cnk init src str i end matches fail))
+ (begin (display " -> no\n")
+ (fail))))))
((neg-look-ahead)
(let ((check
(lp (sre-sequence (cdr sre))
and we get:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
look-ahead ("---a---" 0 7) 0 -> no
look-ahead ("---a---" 0 7) 1 -> no
look-ahead ("---a---" 0 7) 2 -> no
look-ahead ("---a---" 0 7) 3 -> yes
look-ahead ("---a---" 3 7) 3 -> yes
look-ahead ("---a---" 4 7) 4 -> no
look-ahead ("---a---" 4 7) 5 -> no
look-ahead ("---a---" 4 7) 6 -> no
look-ahead ("---a---" 4 7) 7 -> no
$1 = "---**a---"
After the first empty match at 3, the current position is only advanced to 3, not to 4 as would happen on the empty match branch, so it is matched again. The empty match test of irregex-fold/fast >>135 is bogus.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3824
It checks that the match end position 'j' equals the start of the search interval 'i'. It can therefore only catch empty matches at the start of the search interval. But empty matches can occur at any later position. To detect them the match end position 'j' must be tested against the match start position. To fix both empty matches and chunk tampering:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-14 10:32:08.239441307 +0000
@@ -3816,16 +3813,20 @@
(if (not m)
(finish i acc)
(let ((j (%irregex-match-end-index m 0))
+ (jstart (%irregex-match-start-index m 0))
(acc (kons i m acc)))
(irregex-reset-matches! matches)
(cond
((flag-set? (irregex-flags irx) ~consumer?)
(finish j acc))
- ((= j i)
+ ; ((= j i)
+ ((= j jstart)
;; skip one char forward if we match the empty string
- (lp (list str (+ j 1) end) (+ j 1) acc))
+ ; (lp (list str (+ j 1) end) (+ j 1) acc))
+ (lp src (+ j 1) acc))
(else
- (lp (list str j end) j acc))))))))))
+ ; (lp (list str j end) j acc))))))))))
+ (lp src j acc))))))))))
(define (irregex-fold irx kons . args)
(if (not (procedure? kons)) (error "irregex-fold: not a procedure" kons))
[2/2]
A rerun of the second line skip test:
scheme@(guile-user)> (irregex-replace/all 'bol "\n\n\n\n\n\n" "start")
bol ("\n\n\n\n\n\n" 0 6) 0 -> yes
bol ("\n\n\n\n\n\n" 0 6) 1 -> yes
bol ("\n\n\n\n\n\n" 0 6) 2 -> yes
bol ("\n\n\n\n\n\n" 0 6) 3 -> yes
bol ("\n\n\n\n\n\n" 0 6) 4 -> yes
bol ("\n\n\n\n\n\n" 0 6) 5 -> yes
$2 = "start\nstart\nstart\nstart\nstart\nstart\n"
There is no match at the very end of the input for the same reason as >>139. Previously there was because irregex-search/backtrack allows looking for matches past the last character, which is used by some branches of sre->procedure as a signal to try to acquire more input.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L1955
A rerun of the double match test:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
look-ahead ("---a---" 0 7) 0 -> no
look-ahead ("---a---" 0 7) 1 -> no
look-ahead ("---a---" 0 7) 2 -> no
look-ahead ("---a---" 0 7) 3 -> yes
look-ahead ("---a---" 0 7) 4 -> no
look-ahead ("---a---" 0 7) 5 -> no
look-ahead ("---a---" 0 7) 6 -> no
look-ahead ("---a---" 0 7) 7 -> no
$1 = "---*---"
The match positions are correct but the output lost the 'a'. This happens because the kons lambda passed by irregex-replace/all to irregex-fold/fast is also incompetent at detecting empty matches.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3880
(define (irregex-replace/all irx str . o)
(if (not (string? str)) (error "irregex-replace/all: not a string" str))
(irregex-fold/fast
irx
(lambda (i m acc)
(let* ((m-start (%irregex-match-start-index m 0))
(res (if (>= i m-start)
(append (irregex-apply-match m o) acc)
(append (irregex-apply-match m o)
(cons (substring str i m-start) acc)))))
;; include the skipped char on empty matches
(if (= i (%irregex-match-end-index m 0))
(cons (substring str i (+ i 1)) res)
res)))
'()
str
(lambda (i acc)
(let ((end (string-length str)))
(string-cat-reverse (if (>= i end)
acc
(cons (substring str i end) acc)))))))
Once again the match end is compared to the search start instead of the match start. The fix:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-14 11:52:27.010199230 +0000
@@ -3888,8 +3889,9 @@
(append (irregex-apply-match m o)
(cons (substring str i m-start) acc)))))
;; include the skipped char on empty matches
- (if (= i (%irregex-match-end-index m 0))
- (cons (substring str i (+ i 1)) res)
+ (if (and (= m-start (%irregex-match-end-index m 0))
+ (< m-start (string-length str)))
+ (cons (substring str m-start (+ m-start 1)) res)
res)))
'()
str
A rerun of the double match test:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
look-ahead ("---a---" 0 7) 0 -> no
look-ahead ("---a---" 0 7) 1 -> no
look-ahead ("---a---" 0 7) 2 -> no
look-ahead ("---a---" 0 7) 3 -> yes
look-ahead ("---a---" 0 7) 4 -> no
look-ahead ("---a---" 0 7) 5 -> no
look-ahead ("---a---" 0 7) 6 -> no
look-ahead ("---a---" 0 7) 7 -> no
$1 = "---*a---"
I expect that the other kons lambdas will have the same bug. These patches allow wrap-end-chunker to work without the hack from >>137. Also, irregex-fold/chunked/fast >>135 has the bugs of irregex-fold/fast and a few more. This entire irregex thing seems to have a worrying amount of undiscovered bugs.
There are three users of irregex-fold/fast in irregex.scm: irregex-replace/all, irregex-extract and irregex-split. There is also a public wrapper: irregex-fold. Irregex-replace/all has the double match bug >>140 which has been dealt with. Irregex-extract also has a double match bug:
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3936
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-extract "(?=a)" "---a---")
$1 = ("" "")
but its lambdas are so simple that after the irregex-fold/fast fix >>140 it starts working:
$ guile --no-auto-compile -l irregex2.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-extract "(?=a)" "---a---")
$1 = ("")
The truly strange one is irregex-split.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3946
(define (irregex-split irx str . o)
(if (not (string? str)) (error "irregex-split: not a string" str))
(let ((start (if (pair? o) (car o) 0))
(end (if (and (pair? o) (pair? (cdr o))) (cadr o) (string-length str))))
(irregex-fold/fast
irx
(lambda (i m a)
(cond
((= i (%irregex-match-end-index m 0))
;; empty match, include the skipped char to rejoin in finish
(cons (string-ref str i) a))
((= i (%irregex-match-start-index m 0))
a)
(else
(cons (substring str i (%irregex-match-start-index m 0)) a))))
'()
str
(lambda (i a)
(let lp ((ls (if (= i end) a (cons (substring str i end) a)))
(res '())
(was-char? #f))
(cond
((null? ls) res)
((char? (car ls))
(lp (cdr ls)
(if (or was-char? (null? res))
(cons (string (car ls)) res)
(cons (string-append (string (car ls)) (car res))
(cdr res)))
#t))
(else (lp (cdr ls) (cons (car ls) res) #f)))))
start
end)))
It relies on the bug that it will see offset empty matches twice. On the first sighting it adds the prefix, and on the second one it adds the skipped character. The point of the chunk tampering behavior of irregex-fold/fast seems to be to use the chunk starting position as an undeclared part of the kons accumulator for irregex-split. However, with all of its cases and intricate postprocessing, irregex-split is still wrong:
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-split "-" "-a-")
$8 = ("a")
scheme@(guile-user)> (irregex-split "a" "---a")
$9 = ("---")
scheme@(guile-user)> (irregex-split "a" "---aa")
$10 = ("---")
scheme@(guile-user)> (irregex-split "a" "aaa")
$11 = ()
scheme@(guile-user)> (irregex-split 'any "aaa")
$13 = ()
Irregex-split needs a careful fix, but one that relies on the corrected irregex-fold/fast.
Is this some important library that many projects depend on?
It is the default regex library in some scheme implementations. I cannot speak to its importance and number of dependents. I only started looking at it because it is used in schemebbs and we hit upon a bug in the quotelink regex in the previous thread.