[1/3]
The final markup bug for this group of posts involves m3 of between-code? >>27. The following properties hold: the m3 scan is unconditional in between-code?, it is a linear scan potentially to the end of s2 and it is called in a loop by string->sxml-rec. As a result, the overall behavior of m3 within string->sxml is quadratic. This is a bad idea because it invites a stress test on the loop count, which can be controlled by packing spoilers tightly together. Calling append via append-element in a loop also yields quadratic behavior, which should be replaced by the usual backward consing followed by reverse, but unlike the case of m3 the counts involved are insufficient to become a problem. Here is a subset of markup procedures that can be used to exercise Bitdiddle's m3:
$ cat test-m3.scm
(define (timeit proc)
(with-timings proc
(lambda (run-time gc-time real-time)
(write (internal-time/ticks->seconds run-time))
(write-char #\space)
(write (internal-time/ticks->seconds gc-time))
(write-char #\space)
(write (internal-time/ticks->seconds real-time))
(newline))))
(define (string->sxml markup s)
(define (string->sxml-rec s res)
(let ((match (irregex-search (regex markup) s)))
(cond ((string-null? s)
res)
((not match)
(append-element res s))
(else
(let* ((start (irregex-match-start-index match))
(end (irregex-match-end-index match))
(substr (irregex-match-substring match))
(s1 (substring s 0 start))
(s2 (substring s end (string-length s))))
(if (string-null? s1)
(string->sxml-rec
s2
(append-element res ((transform markup) substr)))
(if (and (eq? (name markup) 'del) ;; exception to escape spoiler inside code
(between-code? s1 s2))
(string->sxml-rec "" (append-element res (string-append s1 substr s2)))
(string->sxml-rec
s2
(append-element res s1 ((transform markup) substr))))))))))
(string->sxml-rec s '()))
(define (append-element l . e)
(append l e))
;; edge false positive (between-code? "==code== ==code==" "==")
;; could add another pass of spoiler, but ok good-enough
(define (between-code? s1 s2)
(let ((m1 (irregex-search (irregex ".*==$|.*==[^ ]") s1)) ;opening code in s1
(m2 (irregex-search (irregex ".*[^ ]==") s1)) ;closing code in s1
(m3 (irregex-search (irregex "^==|.*?[^ ]==") s2)) ;closing code in s2
(imei irregex-match-end-index))
(if (and m1 m3 (or (not m2) (>= (imei m1) (imei m2))))
#t
#f)))
(define (transform-rule name regex transform)
(define (dispatch op)
(cond ((eq? op 'name) name)
((eq? op 'regex) regex)
((eq? op 'transform) transform)))
dispatch)
(define (transform markup) (apply markup '(transform)))
(define (regex markup) (apply markup '(regex)))
(define (name markup) (apply markup '(name)))
(define code
(transform-rule
'code
(irregex "==[^ ].*?[^ ]==|==[^ ]==")
(lambda (sub) `(code ,(substring sub 2 (- (string-length sub) 2))))))
(define del
(transform-rule
'del
(irregex "~~[^ ].*?[^ ]~~|~~[^ ]~~")
(lambda (sub) `(del ,(substring sub 2 (- (string-length sub) 2))))))
Here are some timings for input sizes much smaller than the original post size limit:
$ mit-scheme --load deps/irregex.scm --load test-m3.scm
[...]
Release 9.1.1 || Microcode 15.3 || Runtime 15.7 || SF 4.41
LIAR/x86-64 4.118 || Edwin 3.116
;Loading "deps/irregex.scm"... done
;Loading "test-m3.scm"... done
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 20 "a~~xx~~"))) 'ok))
1.31 .02 1.33
;Value: ok
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 30 "a~~xx~~"))) 'ok))
3.98 .03 4.008
;Value: ok
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 40 "a~~xx~~"))) 'ok))
9.71 .13 9.744
;Value: ok
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 50 "a~~xx~~"))) 'ok))
17.34 .2 17.538
;Value: ok
At a count of 40 spoiler processing already takes ten seconds, and much higher runtimes can be obtained by increasing the count. The solution is to never rescan a section of between-code?:s2 that has already been covered, which brings the order of growth down to linear. However, since the four types of markup bugs covered in this series are interrelated, they require changes to the same parts of lib/markup.scm, so here is a combined fix for all four:
[2/3]
--- lib/markup.scm 2020-06-18 10:33:06.355741000 +0000
+++ lib/markup-fix.scm 2020-08-04 13:36:44.240919125 +0000
@@ -94,40 +94,58 @@
(define (string->sxml markup s)
- (define (string->sxml-rec s res)
- (let ((match (irregex-search (regex markup) s)))
- (cond ((string-null? s)
- res)
- ((not match)
- (append-element res s))
- (else
- (let* ((start (irregex-match-start-index match))
- (end (irregex-match-end-index match))
+ (define (string->sxml-iter s res isdel bc-close-failed)
+ (if (string-null? s)
+ res
+ (let ((match (irregex-search (regex markup) s)))
+ (if (not match)
+ (string->sxml-iter "" (cons s res) isdel bc-close-failed)
+ (let* ((start (irregex-match-start-index match))
+ (end (irregex-match-end-index match))
(substr (irregex-match-substring match))
- (s1 (substring s 0 start))
- (s2 (substring s end (string-length s))))
- (if (string-null? s1)
- (string->sxml-rec
- s2
- (append-element res ((transform markup) substr)))
- (if (and (eq? (name markup) 'del) ;; exception to escape spoiler inside code
- (between-code? s1 s2))
- (string->sxml-rec "" (append-element res (string-append s1 substr s2)))
- (string->sxml-rec
- s2
- (append-element res s1 ((transform markup) substr))))))))))
- (string->sxml-rec s '()))
+ (s1 (substring s 0 start))
+ (s2 (substring s end (string-length s))))
+ (cond ((string-null? s1)
+ (string->sxml-iter s2 (cons ((transform markup) substr) res) isdel bc-close-failed))
+ ((and isdel (not bc-close-failed)) ;; exception to escape spoiler inside code
+ (let ((pos (between-code? s1 s2)))
+ (if (< pos 0)
+ (string->sxml-iter s2 (cons ((transform markup) substr) (cons s1 res)) isdel (= pos -2))
+ (string->sxml-iter (string-tail s2 pos) (cons (string-append s1 substr (string-head s2 pos)) res) isdel #f))))
+ (else
+ (string->sxml-iter s2 (cons ((transform markup) substr) (cons s1 res)) isdel bc-close-failed))))))))
+ (reverse (string->sxml-iter s '() (eq? (name markup) 'del) #f)))
-;; edge false positive (between-code? "==code== ==code==" "==")
-;; could add another pass of spoiler, but ok good-enough
(define (between-code? s1 s2)
- (let ((m1 (irregex-search (irregex ".*==$|.*==[^ ]") s1)) ;opening code in s1
- (m2 (irregex-search (irregex ".*[^ ]==") s1)) ;closing code in s1
- (m3 (irregex-search (irregex "^==|.*?[^ ]==") s2)) ;closing code in s2
- (imei irregex-match-end-index))
- (if (and m1 m3 (or (not m2) (>= (imei m1) (imei m2))))
- #t
- #f)))
+ (if (between-code-scan-open s1)
+ (if (and (>= (string-length s2) 2)
+ (substring=? s2 0 2 "==" 0 2))
+ 2
+ (let ((m (irregex-search between-code-irx-close s2)))
+ (if (irregex-match-data? m)
+ (irregex-match-end-index m)
+ -2)))
+ -1))
+
+(define between-code-irx-open
+ (irregex "==[^ ]"))
+
+(define between-code-irx-close
+ (irregex "[^ ]=="))
+
+(define (between-code-scan-open s)
+ (define (iter pos end irx)
+ (let ((match (irregex-search irx s pos end)))
+ (if (not match)
+ pos
+ (iter (irregex-match-end-index match)
+ end irx))))
+ (let* ((slen (string-length s))
+ (pos (iter 0 slen (regex code)))
+ (open (irregex-search between-code-irx-open s pos slen)))
+ (or (irregex-match-data? open)
+ (and (>= (- slen pos) 2)
+ (substring=? s (- slen 2) slen "==" 0 2)))))
(define (lines->sxml markup l)
(append-map (lambda (e)
@@ -152,25 +170,25 @@
(define bold
(transform-rule
'bold
- (irregex "\\*\\*[^ ].*?[^ ]\\*\\*|\\*\\*[^ ]\\*\\*")
+ (irregex "\\*\\*[^ ](?!\\*\\*).*?[^ ]\\*\\*|\\*\\*[^ ]\\*\\*")
(lambda (sub) `(b ,(substring sub 2 (- (string-length sub) 2))))))
(define italic
(transform-rule
'italic
- (irregex "__[^ ].*?[^ ]__|__[^ ]__")
+ (irregex "__[^ ](?!__).*?[^ ]__|__[^ ]__")
(lambda (sub) `(i ,(substring sub 2 (- (string-length sub) 2))))))
(define code
(transform-rule
'code
- (irregex "==[^ ].*?[^ ]==|==[^ ]==")
+ (irregex "==[^ ](?!==).*?[^ ]==|==[^ ]==")
(lambda (sub) `(code ,(substring sub 2 (- (string-length sub) 2))))))
(define del
(transform-rule
'del
- (irregex "~~[^ ].*?[^ ]~~|~~[^ ]~~")
+ (irregex "~~[^ ](?!~~).*?[^ ]~~|~~[^ ]~~")
(lambda (sub) `(del ,(substring sub 2 (- (string-length sub) 2))))))
(define quotelink
[3/3]
This fix is only for the four issues above while staying within the rules inferred from the existing code, not a complete rewrite with new rules. New rules are also possible but those are outside the scope of this fix. Here are the new timings:
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 20 "a~~xx~~"))) 'ok))
.01 0. .01
;Value: ok
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 30 "a~~xx~~"))) 'ok))
.01 0. .01
;Value: ok
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 40 "a~~xx~~"))) 'ok))
.01 0. .011
;Value: ok
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 50 "a~~xx~~"))) 'ok))
.01 0. .011
;Value: ok
1 ]=> (timeit (lambda () (string->sxml del (apply string-append (make-list 1000 "a~~xx~~"))) 'ok))
.12 0. .129
;Value: ok
The runtime for a count of 40 goes down from ten seconds to one hundredth of a second, for a speedup factor of roughly one thousand. The speedup factor increases with the loop count, and 40 is well below what fits within the original post size limit. The post-code spoilers >>27 work:
1 ]=> (string->sxml del "~~one~~ two ~~three~~ ==ab ~~cd~~ ef== gh ~~four~~ ij")
;Value 13: ((del "one") " two " (del "three") " ==ab ~~cd~~ ef==" " gh " (del "four") " ij")
The single-character content >>29 works:
1 ]=> (string->sxml del "~~M~~agneto~~H~~ydro~~D~~ynamics")
;Value 14: ((del "M") "agneto" (del "H") "ydro" (del "D") "ynamics")
And the two types of false positives >>30 are gone:
1 ]=> (string->sxml del "==code== ==code==~~spoiler~~==")
;Value 15: ("==code== ==code==" (del "spoiler") "==")
1 ]=> (string->sxml del "==one==, ~~two~~, ==three==")
;Value 16: ("==one==, " (del "two") ", ==three==")
The other two upgrades pending integration that others running their own instances may wish to apply are the performance enhancement using incremental HTML generation https://textboard.org/prog/39#t39p291 and the fix and additional checks >>21 for the string-split exploit >>20.