[1/2]
https://github.com/ashinn/irregex/issues/21
Replacements of positive lookbehinds only replace first match #21
A simplified test case:
$ guile --no-auto-compile -l irregex.scm
GNU Guile 2.2.3
[...]
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa xa xa" "-")
$1 = "x- xa xa"
scheme@(guile-user)> (irregex-replace/all "x(?=a)" "xa xa xa" "-")
$2 = "-a -a -a"
The look-behind processing is at:
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L3240
((look-behind neg-look-behind)
(let ((check
(lp (sre-sequence
(cons '(* any) (append (cdr sre) '(eos))))
n
flags
(lambda (cnk init src str i end matches fail) i))))
(lambda (cnk init src str i end matches fail)
(let* ((cnk* (wrap-end-chunker cnk src i))
(str* ((chunker-get-str cnk*) (car init)))
(i* (cdr init))
(end* ((chunker-get-end cnk*) (car init))))
(if ((if (eq? (car sre) 'look-behind) (lambda (x) x) not)
(check cnk* init (car init) str* i* end* matches
(lambda () #f)))
(next cnk init src str i end matches fail)
(fail))))))
The irregex strategy for matching a look-behind is to rescan the entire input stream from the start to the current position with the concatenation of an iterated any-matcher, the look-behind and an end-of-stream anchor. This will work but whether it's a good idea on the "text-buffer opened onto a 500MB file" from the documentation is another matter, especially with repeated matching like that of irregex-replace/all. But for reasonably sized inputs it's not a real problem, and it is also not the source of the bug. We'll add some verbosity to the end of the let* to see the locals:
(end* ((chunker-get-end cnk*) (car init)))
(unused (simple-format #t "look-behind ~A ~A ~A ~A ~A | ~A ~A ~A\n" init src str i end str* i* end*)))
The failing test:
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa xa xa" "-")
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 0 8 | xa xa xa 0 0
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 0 8) xa xa xa 1 8 | xa xa xa 0 1
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 2 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 3 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 4 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 5 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 6 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 7 8 | xa xa xa 0 8
look-behind ((xa xa xa 0 8) . 0) (xa xa xa 2 8) xa xa xa 8 8 | xa xa xa 0 8
$1 = "x- xa xa"
As soon as the first "a" was matched and 'src' switched from (xa xa xa 0 8) to (xa xa xa 2 8), 'end*' stopped following the current index 'i' and jumped straight to 8. The problem is this jump in 'end*'. This also means that we can make a much stranger failing case:
scheme@(guile-user)> (irregex-replace/all "(?<=x)a" "xa aaa x" "-")
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 0 8 | xa aaa x 0 0
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 0 8) xa aaa x 1 8 | xa aaa x 0 1
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 2 8) xa aaa x 2 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 2 8) xa aaa x 3 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 4 8) xa aaa x 4 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 5 8) xa aaa x 5 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 6 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 7 8 | xa aaa x 0 8
look-behind ((xa aaa x 0 8) . 0) (xa aaa x 6 8) xa aaa x 8 8 | xa aaa x 0 8
$3 = "x- --- x"
This happens because after the first replacement the end-of-stream anchor at 'end*' moves the look-behind check to the end of the string. This 'end*' is computed by chunker-get-end which is a vector-ref into a vector of lambdas computed by wrap-end-chunker:
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L330
(define (wrap-end-chunker cnk src i)
(make-irregex-chunker
(lambda (x) (and (not (eq? x src)) ((chunker-get-next cnk) x)))
(chunker-get-str cnk)
(chunker-get-start cnk)
(lambda (x) (if (eq? x src) i ((chunker-get-end cnk) x)))
(chunker-get-substring cnk)
(chunker-get-subchunk cnk)))
This is a wrapper around the input stream that aims to stop at the current position 'i' of the current chunk 'src'. The chunk documentation is at:
http://synthcode.com/scheme/irregex/#SECTION_3.4
This is to be matched by the regex with the iterated any-matcher, the look-behind and an end-of-stream anchor. The wrapper returns #f for a next chunk request after the current one, and overrides the current chunk's end position to 'i'. The current chunk is tested for by eq?, as specified in the documentation.
There are two important constraints on the <get-next> procedure. It must return an eq? identical object when called multiple times on the same chunk, and it must not return a chunk with an empty string (start == end).
The lambda that returns the wrong 'end*' is the fourth item. A steady return of 'i' would be needed for the current chunk, so the eq? fails when it shouldn't. Plain string inputs only ever have one chunk, constructed in irregex-basic-string-chunker, so this eq? should not fail.
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L1861
(define irregex-basic-string-chunker
(make-irregex-chunker (lambda (x) #f)
car
cadr
caddr
(lambda (src1 i src2 j)
(substring (car src1) i j))))
The 'x' and 'src' in the eq? of wrap-end-chunker are the (car init) and 'src' from the look-behind section. The debug print shows them diverging when 'end*' stops behaving. The 'init' and 'src' values are passed by an external driver to the lambda produced by the look-behind section. In the case of irregex-replace/all this driver is irregex-fold/fast.
https://github.com/ashinn/irregex/blob/353b8db8472f9b36a7b08ca21a4227d827750d93/irregex.scm#L3780