[1/2]
https://github.com/ashinn/irregex/issues/21
Replacements of positive lookbehinds only replace first match #21
A simplified test case:
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa xa xa" "-")
$1 = "x- xa xa"
scheme@(guile-user)> (irregex-replace/all "x(?=a)" "xa xa xa" "-")
$2 = "-a -a -a"
The look-behind processing is at:
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L3240
((look-behind neg-look-behind)
(let ((check
(lp (sre-sequence
(cons '(* any) (append (cdr sre) '(eos))))
n
flags
(lambda (cnk init src str i end matches fail) i))))
(lambda (cnk init src str i end matches fail)
(let* ((cnk* (wrap-end-chunker cnk src i))
(str* ((chunker-get-str cnk*) (car init)))
(i* (cdr init))
(end* ((chunker-get-end cnk*) (car init))))
(if ((if (eq? (car sre) 'look-behind) (lambda (x) x) not)
(check cnk* init (car init) str* i* end* matches
(lambda () #f)))
(next cnk init src str i end matches fail)
(fail))))))
The irregex strategy for matching a look-behind is to rescan the entire input stream from the start to the current position with the concatenation of an iterated any-matcher, the look-behind and an end-of-stream anchor. This will work but whether it's a good idea on the "text-buffer opened onto a 500MB file" from the documentation is another matter, especially with repeated matching like that of irregex-replace/all. But for reasonably sized inputs it's not a real problem, and it is also not the source of the bug. We'll add some verbosity to the end of the let* to see the locals:
(end* ((chunker-get-end cnk*) (car init)))
(unused (simple-format #t "look-behind ~A ~A ~A ~A ~A | ~A ~A ~A\n" init src str i end str* i* end*)))
The failing test:
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa xa xa" "-")
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 0 8 | xa xa xa 0 0
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 1 8 | xa xa xa 0 1
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 2 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 3 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 4 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 5 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 6 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 7 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 8 8 | xa xa xa 0 8
$1 = "x- xa xa"
As soon as the first "a" was matched and 'src' switched from (xa xa xa 0 8) to (xa xa xa 2 8), 'end*' stopped following the current index 'i' and jumped straight to 8. The problem is this jump in 'end*'. This also means that we can make a much stranger failing case:
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa aaa x" "-")
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 0 8 | xa aaa x 0 0
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 1 8 | xa aaa x 0 1
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 2 8) xa aaa x 2 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 2 8) xa aaa x 3 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 4 8) xa aaa x 4 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 5 8) xa aaa x 5 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 6 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 7 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 8 8 | xa aaa x 0 8
$3 = "x- --- x"
This happens because after the first replacement the end-of-stream anchor at 'end*' moves the look-behind check to the end of the string. This 'end*' is computed by chunker-get-end which is a vector-ref into a vector of lambdas computed by wrap-end-chunker:
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L330
(define (wrap-end-chunker cnk src i)
(make-irregex-chunker
(lambda (x) (and (not (eq? x src)) ((chunker-get-next cnk) x)))
(chunker-get-str cnk)
(chunker-get-start cnk)
(lambda (x) (if (eq? x src) i ((chunker-get-end cnk) x)))
(chunker-get-substring cnk)
(chunker-get-subchunk cnk)))
This is a wrapper around the input stream that aims to stop at the current position 'i' of the current chunk 'src'. The chunk documentation is at:
http://synthcode.com/scheme/irregex/#SECTION_3.4
This is to be matched by the regex with the iterated any-matcher, the look-behind and an end-of-stream anchor. The wrapper returns #f for a next chunk request after the current one, and overrides the current chunk's end position to 'i'. The current chunk is tested for by eq?, as specified in the documentation.
There are two important constraints on the <get-next> procedure. It must return an eq? identical object when called multiple times on the same chunk, and it must not return a chunk with an empty string (start == end).
The lambda that returns the wrong 'end*' is the fourth item. A steady return of 'i' would be needed for the current chunk, so the eq? fails when it shouldn't. Plain string inputs only ever have one chunk, constructed in irregex-basic-string-chunker, so this eq? should not fail.
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L1861
(define irregex-basic-string-chunker
(make-irregex-chunker (lambda (x) #f)
car
cadr
caddr
(lambda (src1 i src2 j)
(substring (car src1) i j))))
The 'x' and 'src' in the eq? of wrap-end-chunker are the (car init) and 'src' from the look-behind section. The debug print shows them diverging when 'end*' stops behaving. The 'init' and 'src' values are passed by an external driver to the lambda produced by the look-behind section. In the case of irregex-replace/all this driver is irregex-fold/fast.
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L3780
And the post size limit dropped. 7.902 bytes got the 413.
[2/2]
(define (irregex-fold/fast irx kons knil str . o)
(if (not (string? str)) (error "irregex-fold: not a string" str))
(let* ((irx (irregex irx))
(matches (irregex-new-matches irx))
(finish (or (and (pair? o) (car o)) (lambda (i acc) acc)))
(start (if (and (pair? o) (pair? (cdr o))) (cadr o) 0))
(end (if (and (pair? o) (pair? (cdr o)) (pair? (cddr o)))
(caddr o)
(string-length str)))
(init-src (list str start end))
(init (cons init-src start)))
(if (not (and (integer? start) (exact? start)))
(error "irregex-fold: not an exact integer" start))
(if (not (and (integer? end) (exact? end)))
(error "irregex-fold: not an exact integer" end))
(irregex-match-chunker-set! matches irregex-basic-string-chunker)
(let lp ((src init-src) (i start) (acc knil))
(if (>= i end)
(finish i acc)
(let ((m (irregex-search/matches
irx
irregex-basic-string-chunker
init
src
i
matches)))
(if (not m)
(finish i acc)
(let ((j (%irregex-match-end-index m 0))
(acc (kons i m acc)))
(irregex-reset-matches! matches)
(cond
((flag-set? (irregex-flags irx) ~consumer?)
(finish j acc))
((= j i)
;; skip one char forward if we match the empty string
(lp (list str (+ j 1) end) (+ j 1) acc))
(else
(lp (list str j end) j acc))))))))))
The 'init' value is constructed once and never modified, but 'src' is modified by both loop calls to 'lp' by advancing the start position. The current position is already advanced as 'i', so advancing the current chunk's start position in sync looks suspicious. The modified chunk will no longer count as the initial chunk because they are compared with eq?, and plain strings have a single chunk. This tampering with the current chunk's start position is from:
Fixing folds on conditional begin patterns which aren't treated as searchers
ashinn committed Nov 28, 2012
https://github.com/ashinn/irregex/commit/2949a461474e0ac30d8c72f0dc81127b19c04d0d
By contrast irregex-fold/chunked/fast doesn't tamper with the current chunk and communicates the advancing of the current position within a chunk solely through the index 'i'.
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L3825
(define (irregex-fold/chunked/fast irx kons knil cnk start . o)
(let* ((irx (irregex irx))
(matches (irregex-new-matches irx))
(finish (or (and (pair? o) (car o)) (lambda (src i acc) acc)))
(i (if (and (pair? o) (pair? (cdr o)))
(cadr o)
((chunker-get-start cnk) start)))
(init (cons start i)))
(if (not (integer? i)) (error "irregex-fold/chunked: not an integer" i))
(irregex-match-chunker-set! matches cnk)
(let lp ((start start) (i i) (acc knil))
(if (not start)
(finish start i acc)
(let ((m (irregex-search/matches irx cnk init start i matches)))
(if (not m)
(finish start i acc)
(let ((end-src (%irregex-match-end-chunk m 0))
(end-index (%irregex-match-end-index m 0)))
(if (and (eq? end-src start) (= end-index i))
(if (>= end-index ((chunker-get-end cnk) end-src ))
(let ((next ((chunker-get-next cnk) end-src)))
(lp next ((chunker-get-start cnk) next) acc))
(lp end-src (+ end-index 1) acc))
(let ((acc (kons start i m acc)))
(irregex-reset-matches! matches)
(if (flag-set? (irregex-flags irx) ~consumer?)
(finish end-src end-index acc)
(lp end-src end-index acc)))))))))))
Applying this non-tampering approach to irregex-fold/fast yields the diff:
[post size limit dropped/2]
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-12 09:41:49.754022271 +0000
+++ irregex2.scm 2020-07-12 09:45:05.067970996 +0000
@@ -3813,9 +3813,11 @@
(finish j acc))
((= j i)
;; skip one char forward if we match the empty string
- (lp (list str (+ j 1) end) (+ j 1) acc))
+ ; (lp (list str (+ j 1) end) (+ j 1) acc))
+ (lp src (+ j 1) acc))
(else
- (lp (list str j end) j acc))))))))))
+ ; (lp (list str j end) j acc))))))))))
+ (lp src j acc))))))))))
(define (irregex-fold irx kons . args)
(if (not (procedure? kons)) (error "irregex-fold: not a procedure" kons))
This fixes the failing tests because the eq? of wrap-end-chunker now works properly which makes 'end*' in look-behind follow along with 'i', moving the look-behind check to the right place.
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa xa xa" "-")
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 0 8 | xa xa xa 0 0
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 1 8 | xa xa xa 0 1
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 2 8 | xa xa xa 0 2
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 3 8 | xa xa xa 0 3
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 4 8 | xa xa xa 0 4
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 5 8 | xa xa xa 0 5
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 6 8 | xa xa xa 0 6
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 7 8 | xa xa xa 0 7
$1 = "x- x- x-"
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa aaa x" "-")
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 0 8 | xa aaa x 0 0
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 1 8 | xa aaa x 0 1
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 2 8 | xa aaa x 0 2
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 3 8 | xa aaa x 0 3
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 4 8 | xa aaa x 0 4
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 5 8 | xa aaa x 0 5
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 6 8 | xa aaa x 0 6
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 7 8 | xa aaa x 0 7
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 8 8 | xa aaa x 0 8
$2 = "x- aaa x"
scheme@(guile-user)> (irregex-replace/all "(?<=[xyz])a" "xa ya za" "-")
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 0 8 | xa ya za 0 0
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 1 8 | xa ya za 0 1
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 2 8 | xa ya za 0 2
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 3 8 | xa ya za 0 3
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 4 8 | xa ya za 0 4
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 5 8 | xa ya za 0 5
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 6 8 | xa ya za 0 6
look-behind ((xa ya za 0 8) . 0) (xa ya za 0 8) xa ya za 7 8 | xa ya za 0 7
$3 = "x- y- z-"
scheme@(guile-user)>
However, I am not familiar with all of irregex, only the small bits and pieces I've looked at, so it's entirely possible that this breaks something else. I do not know whether or not the chunk tampering behavior by irregex-fold/fast is relied on in some other part of the code. The shinnoid would have to chime in on that. Someone familiar with chicken scheme might also rerun the test suite with this change.
https://lists.nongnu.org/archive/html/chicken-users/2020-07/msg00005.html
https://github.com/ashinn/irregex/commit/ac27338c5b490d19624c30d787c78bbfa45e1f11
fix irregex-replace/all with look-behind patterns
ashinn committed Jul 13, 2020
@@ -332,7 +333,16 @@
(lambda (x) (and (not (eq? x src)) ((chunker-get-next cnk) x)))
(chunker-get-str cnk)
(chunker-get-start cnk)
- (lambda (x) (if (eq? x src) i ((chunker-get-end cnk) x)))
+ (lambda (x)
+ ;; TODO: this is a hack workaround for the fact that we don't
+ ;; have either a notion of chunk equivalence or chunk truncation,
+ ;; until which time (neg-)look-behind in a fold won't work on
+ ;; non-basic chunks.
+ (if (or (eq? x src)
+ (and (not ((chunker-get-next cnk) x))
+ (not ((chunker-get-next cnk) src))))
+ i
+ ((chunker-get-end cnk) x)))
(chunker-get-substring cnk)
(chunker-get-subchunk cnk)))
Chunks are now equated in the fourth item of wrap-end-chunker if they are eq? or they are both lacking a next chunk. Plain strings have a single chunk via irregex-basic-string-chunker so the second part of the 'or' will always be true for plain strings. Irregex-fold/fast was not changed.
Here is a related bug. We might want to add 'start' to the start of every line, and we might have a file with some number of empty lines.
$ for k in $(seq 1 6); do echo ""; done > emptylines.txt
$ hd emptylines.txt
00000000 0a 0a 0a 0a 0a 0a |......|
00000006
We might use sed:
$ sed -e 's/^/start/' emptylines.txt
start
start
start
start
start
start
$
We might use regular expressions in python:
$ python3
[...]
>>> s = "\n" * 6
>>> s
'\n\n\n\n\n\n'
>>> import re
>>> re.sub ("(?m)^", "start", s)
'start\nstart\nstart\nstart\nstart\nstart\nstart'
>>> print (_)
start
start
start
start
start
start
start
>>>
Sed doesn't have a match at the very end of the input because it uses the text file convention for a final newline. We might also use irregex:
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all 'bol "\n\n\n\n\n\n" "start")
$1 = "start\n\nstart\nstart\nstart\nstart\nstart"
scheme@(guile-user)> (display $1)
start
start
start
start
start
startscheme@(guile-user)>
There is no match at the start of the second line, when there clearly should be. We'll add the usual verbosity to see what happens:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-13 20:15:27.347747147 +0000
@@ -3419,11 +3419,14 @@
(fail))))
((bol)
(lambda (cnk init src str i end matches fail)
+ (simple-format #t "bol ~S ~A" src i)
(if (or (and (eq? src (car init)) (eqv? i (cdr init)))
(and (> i ((chunker-get-start cnk) src))
(eqv? #\newline (string-ref str (- i 1)))))
- (next cnk init src str i end matches fail)
- (fail))))
+ (begin (display " -> yes\n")
+ (next cnk init src str i end matches fail))
+ (begin (display " -> no\n")
+ (fail)))))
((bow)
(lambda (cnk init src str i end matches fail)
(if (and (if (> i ((chunker-get-start cnk) src))
Rerunning the test:
$ guile --no-auto-compile -l irregex2.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all 'bol "\n\n\n\n\n\n" "start")
bol ("\n\n\n\n\n\n" 0 6) 0 -> yes
bol ("\n\n\n\n\n\n" 1 6) 1 -> no
bol ("\n\n\n\n\n\n" 1 6) 2 -> yes
bol ("\n\n\n\n\n\n" 2 6) 2 -> no
bol ("\n\n\n\n\n\n" 2 6) 3 -> yes
bol ("\n\n\n\n\n\n" 3 6) 3 -> no
bol ("\n\n\n\n\n\n" 3 6) 4 -> yes
bol ("\n\n\n\n\n\n" 4 6) 4 -> no
bol ("\n\n\n\n\n\n" 4 6) 5 -> yes
bol ("\n\n\n\n\n\n" 5 6) 5 -> no
bol ("\n\n\n\n\n\n" 5 6) 6 -> yes
$1 = "start\n\nstart\nstart\nstart\nstart\nstart"
scheme@(guile-user)>
There is no match at index 1, and every location from 2 to 5 is retested after success. Both of these bugs are in irregex-fold/fast which completely mishandles empty matches.
Here is the current irregex 0.9.8 behavior on searching for an empty regex, as a building block for later.
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all "" "aaa" "*")
$1 = "*a*a*a"
scheme@(guile-user)> (irregex-replace/all "" "\n\n\n" "*")
$2 = "*\n*\n*\n"
scheme@(guile-user)>
There is no match at the very end of the input. This may simply be the behavior desired by the author. If it is not desired, the loop end condition of irregex-fold/fast can be made strict.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3807
[1/2]
The second line skip bug >>138 is caused by the chunk tampering behavior of irregex-fold/fast >>135. This seems to have been introduced to compensate for the erroneous empty match detection, so a test suite rerun on >>136 would fail, as these have to be fixed simultaneously. The (bol) branch >>138 can only succeed after the start of the input if it can look at the previous character and check that it is a #\newline. But after
bol ("\n\n\n\n\n\n" 0 6) 0 -> yes
the chunk start position is advanced to 1, and the (bol) attempt at 1 doesn't have the first character available for #\newline checking.
bol ("\n\n\n\n\n\n" 1 6) 1 -> no
The only way it can succeed at 1 is if the chunk start position stays at 0, so this bug is directly caused by the chunk tampering behavior. Even if some other parts of the code rely on it, there is something to fix about it since now it is the source of a bug.
The retest after success behavior is caused by the erroneous empty match detection. This can be demonstrated as a separate bug:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
$1 = "---**a---"
The same empty match is replaced twice. We add the usual verbosity:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-14 02:49:04.508860562 +0000
@@ -3234,9 +3225,12 @@
flags
(lambda (cnk init src str i end matches fail) i))))
(lambda (cnk init src str i end matches fail)
+ (simple-format #t "look-ahead ~S ~A" src i)
(if (check cnk init src str i end matches (lambda () #f))
- (next cnk init src str i end matches fail)
- (fail)))))
+ (begin (display " -> yes\n")
+ (next cnk init src str i end matches fail))
+ (begin (display " -> no\n")
+ (fail))))))
((neg-look-ahead)
(let ((check
(lp (sre-sequence (cdr sre))
and we get:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
look-ahead ("---a---" 0 7) 0 -> no
look-ahead ("---a---" 0 7) 1 -> no
look-ahead ("---a---" 0 7) 2 -> no
look-ahead ("---a---" 0 7) 3 -> yes
look-ahead ("---a---" 3 7) 3 -> yes
look-ahead ("---a---" 4 7) 4 -> no
look-ahead ("---a---" 4 7) 5 -> no
look-ahead ("---a---" 4 7) 6 -> no
look-ahead ("---a---" 4 7) 7 -> no
$1 = "---**a---"
After the first empty match at 3, the current position is only advanced to 3, not to 4 as would happen on the empty match branch, so it is matched again. The empty match test of irregex-fold/fast >>135 is bogus.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3824
It checks that the match end position 'j' equals the start of the search interval 'i'. It can therefore only catch empty matches at the start of the search interval. But empty matches can occur at any later position. To detect them the match end position 'j' must be tested against the match start position. To fix both empty matches and chunk tampering:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-14 10:32:08.239441307 +0000
@@ -3816,16 +3813,20 @@
(if (not m)
(finish i acc)
(let ((j (%irregex-match-end-index m 0))
+ (jstart (%irregex-match-start-index m 0))
(acc (kons i m acc)))
(irregex-reset-matches! matches)
(cond
((flag-set? (irregex-flags irx) ~consumer?)
(finish j acc))
- ((= j i)
+ ; ((= j i)
+ ((= j jstart)
;; skip one char forward if we match the empty string
- (lp (list str (+ j 1) end) (+ j 1) acc))
+ ; (lp (list str (+ j 1) end) (+ j 1) acc))
+ (lp src (+ j 1) acc))
(else
- (lp (list str j end) j acc))))))))))
+ ; (lp (list str j end) j acc))))))))))
+ (lp src j acc))))))))))
(define (irregex-fold irx kons . args)
(if (not (procedure? kons)) (error "irregex-fold: not a procedure" kons))
[2/2]
A rerun of the second line skip test:
scheme@(guile-user)> (irregex-replace/all 'bol "\n\n\n\n\n\n" "start")
bol ("\n\n\n\n\n\n" 0 6) 0 -> yes
bol ("\n\n\n\n\n\n" 0 6) 1 -> yes
bol ("\n\n\n\n\n\n" 0 6) 2 -> yes
bol ("\n\n\n\n\n\n" 0 6) 3 -> yes
bol ("\n\n\n\n\n\n" 0 6) 4 -> yes
bol ("\n\n\n\n\n\n" 0 6) 5 -> yes
$2 = "start\nstart\nstart\nstart\nstart\nstart\n"
There is no match at the very end of the input for the same reason as >>139. Previously there was because irregex-search/backtrack allows looking for matches past the last character, which is used by some branches of sre->procedure as a signal to try to acquire more input.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L1955
A rerun of the double match test:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
look-ahead ("---a---" 0 7) 0 -> no
look-ahead ("---a---" 0 7) 1 -> no
look-ahead ("---a---" 0 7) 2 -> no
look-ahead ("---a---" 0 7) 3 -> yes
look-ahead ("---a---" 0 7) 4 -> no
look-ahead ("---a---" 0 7) 5 -> no
look-ahead ("---a---" 0 7) 6 -> no
look-ahead ("---a---" 0 7) 7 -> no
$1 = "---*---"
The match positions are correct but the output lost the 'a'. This happens because the kons lambda passed by irregex-replace/all to irregex-fold/fast is also incompetent at detecting empty matches.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3880
(define (irregex-replace/all irx str . o)
(if (not (string? str)) (error "irregex-replace/all: not a string" str))
(irregex-fold/fast
irx
(lambda (i m acc)
(let* ((m-start (%irregex-match-start-index m 0))
(res (if (>= i m-start)
(append (irregex-apply-match m o) acc)
(append (irregex-apply-match m o)
(cons (substring str i m-start) acc)))))
;; include the skipped char on empty matches
(if (= i (%irregex-match-end-index m 0))
(cons (substring str i (+ i 1)) res)
res)))
'()
str
(lambda (i acc)
(let ((end (string-length str)))
(string-cat-reverse (if (>= i end)
acc
(cons (substring str i end) acc)))))))
Once again the match end is compared to the search start instead of the match start. The fix:
$ TZ=GMT diff -u irregex.scm irregex2.scm
--- irregex.scm 2020-07-13 20:23:49.195645124 +0000
+++ irregex2.scm 2020-07-14 11:52:27.010199230 +0000
@@ -3888,8 +3889,9 @@
(append (irregex-apply-match m o)
(cons (substring str i m-start) acc)))))
;; include the skipped char on empty matches
- (if (= i (%irregex-match-end-index m 0))
- (cons (substring str i (+ i 1)) res)
+ (if (and (= m-start (%irregex-match-end-index m 0))
+ (< m-start (string-length str)))
+ (cons (substring str m-start (+ m-start 1)) res)
res)))
'()
str
A rerun of the double match test:
scheme@(guile-user)> (irregex-replace/all "(?=a)" "---a---" "*")
look-ahead ("---a---" 0 7) 0 -> no
look-ahead ("---a---" 0 7) 1 -> no
look-ahead ("---a---" 0 7) 2 -> no
look-ahead ("---a---" 0 7) 3 -> yes
look-ahead ("---a---" 0 7) 4 -> no
look-ahead ("---a---" 0 7) 5 -> no
look-ahead ("---a---" 0 7) 6 -> no
look-ahead ("---a---" 0 7) 7 -> no
$1 = "---*a---"
I expect that the other kons lambdas will have the same bug. These patches allow wrap-end-chunker to work without the hack from >>137. Also, irregex-fold/chunked/fast >>135 has the bugs of irregex-fold/fast and a few more. This entire irregex thing seems to have a worrying amount of undiscovered bugs.
This entire irregex thing seems to have a worrying amount of undiscovered bugs.
Is this some important library that many projects depend on?
There are three users of irregex-fold/fast in irregex.scm: irregex-replace/all, irregex-extract and irregex-split. There is also a public wrapper: irregex-fold. Irregex-replace/all has the double match bug >>140 which has been dealt with. Irregex-extract also has a double match bug:
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3936
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-extract "(?=a)" "---a---")
$1 = ("" "")
but its lambdas are so simple that after the irregex-fold/fast fix >>140 it starts working:
$ guile --no-auto-compile -l irregex2.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-extract "(?=a)" "---a---")
$1 = ("")
The truly strange one is irregex-split.
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/irregex.scm#L3946
(define (irregex-split irx str . o)
(if (not (string? str)) (error "irregex-split: not a string" str))
(let ((start (if (pair? o) (car o) 0))
(end (if (and (pair? o) (pair? (cdr o))) (cadr o) (string-length str))))
(irregex-fold/fast
irx
(lambda (i m a)
(cond
((= i (%irregex-match-end-index m 0))
;; empty match, include the skipped char to rejoin in finish
(cons (string-ref str i) a))
((= i (%irregex-match-start-index m 0))
a)
(else
(cons (substring str i (%irregex-match-start-index m 0)) a))))
'()
str
(lambda (i a)
(let lp ((ls (if (= i end) a (cons (substring str i end) a)))
(res '())
(was-char? #f))
(cond
((null? ls) res)
((char? (car ls))
(lp (cdr ls)
(if (or was-char? (null? res))
(cons (string (car ls)) res)
(cons (string-append (string (car ls)) (car res))
(cdr res)))
#t))
(else (lp (cdr ls) (cons (car ls) res) #f)))))
start
end)))
It relies on the bug that it will see offset empty matches twice. On the first sighting it adds the prefix, and on the second one it adds the skipped character. The point of the chunk tampering behavior of irregex-fold/fast seems to be to use the chunk starting position as an undeclared part of the kons accumulator for irregex-split. However, with all of its cases and intricate postprocessing, irregex-split is still wrong:
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-split "-" "-a-")
$8 = ("a")
scheme@(guile-user)> (irregex-split "a" "---a")
$9 = ("---")
scheme@(guile-user)> (irregex-split "a" "---aa")
$10 = ("---")
scheme@(guile-user)> (irregex-split "a" "aaa")
$11 = ()
scheme@(guile-user)> (irregex-split 'any "aaa")
$13 = ()
Irregex-split needs a careful fix, but one that relies on the corrected irregex-fold/fast.
Is this some important library that many projects depend on?
It is the default regex library in some scheme implementations. I cannot speak to its importance and number of dependents. I only started looking at it because it is used in schemebbs and we hit upon a bug in the quotelink regex in the previous thread.
>>133,135-141,143
This is not your personal journal. Could you please stop spamming or summarize your point? The thread is on what you're doing, not an invitation to write down your every thought.
>>144
I am already following your advice. This is indeed the summary, far from containing every thought. It has the failing test case, the explanation and the fix. "Your every thought" would evidently also contain failed investigation paths that did not lead to the fix, but those have been omitted here. If you have some specific examples of "your every thought" that could be reasonably removed without detracting from the failing test case, the explanation or the fix, do not hesitate to point them out so that I can improve my future posts. I am fully aware that it is unusual to see programming posted on a programming board, but if you wish to avoid such content the majority of the threads on this board are available to you, since they lack it. You may also find technology boards rather than programming boards to be more to your taste, should you be inclined to give them a try.
>>145
Well I don't know who you're intending to bore here, but every thread you participate in dies, rightfully. Nobody is interested, nobody is interacting with you, nobody cares about your pedantic, pompous, pretentious essays. If you care so much about irregex, send a message to the maintainer, and you absolutely need an audience use github or whatever. Just stop for one moment and reconsider if a half-dead textboard really needs a fedora-tipping nerd annoying everyone with a pointless monologue.
And of course other programming boards don't have nonsense like this, because as the good old saying goes: This is not your personal blog!
nobody cares about your pedantic, pompous, pretentious essays.
You seem to care at least enough to make some fairly rude remarks. Why not leave the good man to his business instead of freaking out because someone is actually using this half-dead textboard?
>>147
I care that he posts, not what he posts. And I have been ignoring it, every day I visit this site hoping to meet new scheme/lisp enthusiasts, ready to talk about and discuss ideas (you know... a like a dialogue), but instead this guy
is all I see. So it his mindless ramblings are tolerated, I just ask for my rant to be given a pass.
>>146
But this is the personal blog thread.
>>149
But is it his personal blog?
>>149
Even if it was, he should 1. post VIP posts 2. not do exactly the same thing in over half the threads on the frontpage.
Irregex-split has a strange result on the 'bol test >>140 as well, due to the fact that irregex-fold/fast cannot match a 'bol at position 1.
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-split 'bol "\n\n\n\n\n\n")
$1 = ("\n\n" "\n" "\n" "\n" "\n")
So the chunk tampering behavior makes irregex-split >>143 misbehave as well.
every thread you participate in dies, rightfully. Nobody is interested, nobody is interacting with you
You would have to know which posts are mine to make this assertion. Either you do not possess this information, in which case you are not in a position to make the assertion and it is therefore vacuous, or you do possess this information, in which case you already know the assertion to be false while you are making it. A subset of my posts can be found in the "How can I run my own instance of this" thread, the only thread that maxed out so far, the current thread prior to the irregex session, and the "LISP Puzzles" thread. They received the average amount of interaction for a programming board, except in "LISP Puzzles" where it was a bit higher. Those three threads happen to be the top three threads by post count on this board, so your statement is contradicted by verifiable reality. Your opinion is also not shared by https://textboard.org/prog/39#t39p106 who referred to the first irregex bughunt, which was written in a style that was a lot less terse than the one used for the current bughunt. It also seems a bit strange that you are offering the interaction you claim is missing, even if you are doing it in an off-topic way unrelated to programming. I find it hard to believe that you really cannot perceive the irony of going out of your way to repeatedly raise the post count of a thread I post in while claiming it to be dead.
send a message to the maintainer
use github
Since you claim to know which posts are mine, you already know my stance on accounts, verification and enabling remote code execution, so these suggestions make no sense. See also: https://textboard.org/prog/140#t140p7
your pedantic, pompous, pretentious essays
[...]
Just stop for one moment and reconsider if a half-dead textboard really needs a fedora-tipping nerd annoying everyone with a pointless monologue.
From https://textboard.org/prog/130#t130p29
"""I do not know how long it will take you, but at some point in the future you will realize that insults like these do not communicate any information about the person being insulted, they only communicate a different kind of information about the insult employer's state of mind."""
And of course other programming boards don't have nonsense like this
You can help out by linking what you consider to be the best programming content posted reasonably recently on an anonymous programming board, so I can use it to improve my future posts. You can also easily avoid seeing my posts during this period by simply not opening "SchemeBBS [part 2]", "LISP Puzzles" and the current thread. If the irregex posts specifically elicit this reaction from you, I can help you with that as well. Here is a script that replaces the body of any post containing "irregex" with a warning.
Array.from (document.querySelectorAll ("dl > dd")).filter (e => /irregex/i.test (e.innerHTML)).forEach (e => { e.innerHTML = "<p>Warning: this post may contain trace amounts of irregex.</p>"; })
I hope this helps you have a more pleasant experience during your stay on textboard.org. The script is idempotent so feel free to run it repeatedly for additional peace of mind. Alternatively it may simply be the case that you are looking for a technology board rather than a programming board.
exactly the same thing
A moment of reflection will allow you to easily perceive the difference between a bug in the schemebbs source code that can abort the running instance, and the fix for that bug, which belong in "SchemeBBS [part 2]", and a class of bugs in one of the libraries schemebbs depends on but which do not hit schemebbs directly, and therefore belong in a second thread, such as "What are you working on?".
over half the threads on the frontpage
Exactly 2/10, soon to be 1/10. I'm sure you understand that there is a point past which a hyperbole will have the exact opposite of its intended effect.
I'm not sure where this need comes from, to post a stream of trivially refutable falsehoods. Perhaps it is the accepted standard of interaction on some other boards. Here, however, the main currency is on-topic programming content. Those who have outlandish reactions to the irregex posts can instead show up the irregex poster simply by posting better written and more focused on-topic programming content. They are more than welcome to do so, and I will enjoy reading it.
>>152
All I'm saying is fix your problem, submit the patch. That's the normal way of doing it. I don't care if you're afraid of github of whatever. Nobody needs your journals, just get to the point.
Here is the irregex-split fix consistent with the irregex-fold/fast fix >>140 and the irregex-replace/all fix >>141. The accumulator is a vector because that is ashinn's preferred structure for records throughout irregex.scm. Where an edge case could be reasonably decided either way, the existng 0.9.8 behavior on searching for an empty regex >>139 was used as guidance.
(define (irregex-split irx str . o)
(if (not (string? str)) (error "irregex-split: not a string" str))
(let ((start (if (pair? o) (car o) 0))
(end (if (and (pair? o) (pair? (cdr o))) (cadr o) (string-length str))))
(irregex-fold/fast
irx
(lambda (i m a)
(let* ((msi (%irregex-match-start-index m 0))
(mei (%irregex-match-end-index m 0))
(empty (= msi mei))
(lst (vector-ref a 0))
(pos (vector-ref a 1)))
(cond ((not empty)
(vector-set! a 0 (cons (substring str pos msi) lst))
(vector-set! a 1 mei)
(vector-set! a 2 #f))
((< pos msi)
(vector-set! a 0 (cons (substring str pos msi) lst))
(vector-set! a 1 msi)
(vector-set! a 2 #t))
((> pos start)
(vector-set! a 0 (cons "" lst))
(vector-set! a 2 #t))
(else
(vector-set! a 2 #t)))
a))
(let ((acc (make-vector 3)))
(vector-set! acc 0 '())
(vector-set! acc 1 start)
(vector-set! acc 2 #t)
acc)
str
(lambda (i a)
(let* ((lst (vector-ref a 0))
(pos (vector-ref a 1))
(empty (vector-ref a 2)))
(reverse (cond
((< pos end)
(cons (substring str pos end) lst))
(empty lst)
(else (cons "" lst))))))
start
end)))
The key is to decouple the match search position used internally by irregex-fold/fast from the split position used internally by irregex-split. The failing tests from >>143 are corrected:
$ guile --no-auto-compile -l irregex2.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-split "-" "-a-")
$1 = ("" "a" "")
scheme@(guile-user)> (irregex-split "a" "---a")
$2 = ("---" "")
scheme@(guile-user)> (irregex-split "a" "---aa")
$3 = ("---" "" "")
scheme@(guile-user)> (irregex-split "a" "aaa")
$4 = ("" "" "" "")
scheme@(guile-user)> (irregex-split 'any "aaa")
$5 = ("" "" "" "")
The 'bol test >>152 is also corrected:
scheme@(guile-user)> (irregex-split 'bol "\n\n\n\n\n\n")
$6 = ("\n" "\n" "\n" "\n" "\n" "\n")
And some more edge cases:
scheme@(guile-user)> (irregex-split "" "")
$7 = ()
scheme@(guile-user)> (irregex-split "" "aaa")
$8 = ("a" "a" "a")
scheme@(guile-user)> (irregex-split 'bos "aaa")
$9 = ("aaa")
scheme@(guile-user)> (irregex-split 'eos "aaa")
$10 = ("aaa")
scheme@(guile-user)> (irregex-split "(?<=a)" "aaa")
$11 = ("a" "a" "a")
scheme@(guile-user)> (irregex-split "(?=a)" "aaa")
$12 = ("a" "a" "a")
The three fixes can be applied as a group, and the 0.9.8 hack to wrap-end-chunker >>137 can be reverted.
just get to the point
Already answered in >>145.
journals
1. Already answered in >>145.
2. You may not be familiar with the opening post of the thread you choose to visit of your own free will:
This is the type of thread that you can write in every time you visit the site! You can do anything from use this as a personal log for your project, to sharing what you're hacking on at the moment, or to critique something some else did in the thread. I'm looking forward to hearing what you guys are up to!
>>154
You're hopeless, and appear not to even realize it. But who cares anymore? Why even bother to make you reconsider your apparent autism? Hope at least you have fun with your thing, because if you then nobody else will.
You're hopeless, and appear not to even realize it. But who cares anymore? Why even bother to make you reconsider your apparent autism?
Answered in >>152 and https://textboard.org/prog/130#t130p29
"""I do not know how long it will take you, but at some point in the future you will realize that insults like these do not communicate any information about the person being insulted, they only communicate a different kind of information about the insult employer's state of mind."""
Hope at least you have fun with your thing, because if you then nobody else will.
Answeed in >>152:
1. "the top three threads by post count"
2. "Your opinion is also not shared by https://textboard.org/prog/39#t39p106 who [...]"
3. """I find it hard to believe that you really cannot perceive the irony of going out of your way to repeatedly raise the post count of a thread I post in while claiming it to be dead."""
Another irregex aborting bug will follow shortly, I just need to reduce the size of the test case.
Meant for >>155, of course.
As pointed out at the end of >>141 irregex-fold/chunked/fast >>135 has the bugs of irregex-fold/fast and a few more.
- Because of its placement kons doesn't see every match.
- Because of its placement irregex-reset-matches! is not called on every round.
- Because of its placement the ~consumer? test can miss rounds.
- The proper order on the match branch is: kons -> reset-matches! -> ~consumer? -> empty -> else, as used in irregex-fold/fast >>135.
- The empty match test is bogus. It compares the match end index with the search start index instead of the match start index.
- The 'next' chunk is not checked for existence, so the start index of a non-existent chunk may be requested from the chunker. Depending on the chunker implementation this may abort.
The last point can be demonstrated on the rope chunker from the documentation, which is not part of the library proper but is present in the test suite.
http://synthcode.com/scheme/irregex/#SECTION_3.4
https://github.com/ashinn/irregex/blob/ac27338c5b490d19624c30d787c78bbfa45e1f11/test-irregex.scm#L122
$ cat attack.scm
(define (rope . args)
(map (lambda (x) (if (pair? x) x (list x 0 (string-length x)))) args))
(define rope-chunker
(make-irregex-chunker
(lambda (x) (and (pair? (cdr x)) (cdr x)))
caar
cadar
caddar
(lambda (src1 i src2 j)
(if (eq? src1 src2)
(substring (caar src1) i j)
(let lp ((src (cdr src1))
(res (list (substring (caar src1) i (caddar src1)))))
(if (eq? src src2)
(string-join
(reverse (cons (substring (caar src2) (cadar src2) j) res))
"")
(lp (cdr src)
(cons (substring (caar src) (cadar src) (caddar src))
res))))))))
$ guile --no-auto-compile -l irregex.scm -l attack.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-fold/chunked '(or "a\n" bol) (lambda (start i m acc) acc) '() rope-chunker (rope "a\n"))
ERROR: In procedure cadar:
In procedure cadar: Wrong type (expecting pair): #f
Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]>
So this is another attack vector for irregex. Evidently the 'next' chunk must be checked for existence.
>>152
You don't have a job do you? I can't imagine how you'd interact with other people.
Section 3.1 of the irregex documentation specifies the meaning of the index parameter of the kons lambda taken by irregex-fold:
http://synthcode.com/scheme/irregex/#SECTION_3.1
(irregex-fold <irx> <kons> <knil> <str> [<finish> <start> <end>])
This performs a fold operation over every non-overlapping place <irx> occurs in the string str.
The <kons> procedure takes the following signature:
(<kons> <from-index> <match> <seed>)
where <from-index> is the index from where we started searching (initially <start> and thereafter the end index of the last match), <match> is the resulting match-data object, and <seed> is the accumulated fold result starting with <knil>.
Equating the search start index with the last match end index is false, as the documentation notes a bit further on:
Note if an empty match is found <kons> will be called on that empty string, and to avoid an infinite loop matching will resume at the next char. It is up to the programmer to do something sensible with the skipped char in this case.
This is a poor design choice, because it effectively mandates that all nontrivial kons lambdas keep track of their own last match end index, as the fixed irregex-split >>154 does. With author powers to change the public contract the proper design choice would be to switch the meaning of the index parameter of the kons to the last match end index. Irregex-fold/fast would keep track internally of both the search start index and the last match end index, and would pass the former to searches and the latter to 'kons' and 'finish'. This would simplify empty match handling, as well as irregex-replace/all and irregex-split. But this is outside the scope of a mere bugfix as it requires a change to the public contract, and is therefore something only ashinn can do.
With the group of fixes for the fold family >>140 >>141 >>154 and the bonus aborting attack >>158, I think this should be enough material for the second irregex bughunt. The first irregex bughunt, with the leftmost longest violations from no backtracking and the bonus DoS attacks, can be found in the group of posts starting at https://textboard.org/prog/39#t39p100 as well as a few preceding posts in that thread.
You don't have a job do you?
I work in academia and have for several decades. In all those decades, however, I've never had such a devoted follower. What's your day job, Anon?
I can't imagine how you'd interact with other people.
You can easily glean the answer by comparing the two sides of the exchange: without any need to resort to insults, and using elements of reality instead of "a stream of trivially refutable falsehoods". >>152
>>160
>>159 wasn't me (though he is right). I gave up after >>155.
What happened to our resident boomer?
I don't understand what problems people have with the write-ups from >>160
I personally enjoy reading them, and his writing style.
>>163
Thanks. If you have some constructive critique or spot some programming error feel free to suggest improvements.