Now that tickets are accessible and I can actually read the issue, here's what happens. In https://bbs.jp.net/sexp/prog/39 the text of >>194 starts with "お疲れさん.", whatever that is, sent as the bytes:
0002ca50 6f 6e 74 65 6e 74 20 28 70 20 28 61 20 28 40 20 |ontent (p (a (@ |
0002ca60 28 68 72 65 66 20 22 2f 70 72 6f 67 2f 33 39 2f |(href "/prog/39/|
0002ca70 31 39 32 22 29 29 20 22 3e 3e 31 39 32 22 29 20 |192")) ">>192") |
0002ca80 28 62 72 29 20 22 e3 5c 32 30 31 5c 32 31 32 e7 |(br) ".\201\212.|
0002ca90 5c 32 32 36 b2 e3 5c 32 30 32 5c 32 31 34 e3 5c |\226..\202\214.\|
0002caa0 32 30 31 5c 32 32 35 e3 5c 32 30 32 5c 32 32 33 |201\225.\202\223|
0002cab0 2e 22 20 28 62 72 29 20 22 4e 6f 77 2c 20 49 27 |." (br) "Now, I'|
0002cac0 76 65 20 62 65 65 6e 20 6d 65 61 6e 69 6e 67 20 |ve been meaning |
The relevant bytes are:
>>> s = "e3 5c 32 30 31 5c 32 31 32 e7 5c 32 32 36 b2 e3 5c 32 30 32 5c 32 31 34 e3 5c 32 30 31 5c 32 32 35 e3 5c 32 30 32 5c 32 32 33 2e"
>>> b = bytes (int (t, base = 16) for t in s.split ())
>>> b
b'\xe3\\201\\212\xe7\\226\xb2\xe3\\202\\214\xe3\\201\\225\xe3\\202\\223.'
The original string in utf8 is:
>>> "お疲れさん.".encode ("utf8")
b'\xe3\x81\x8a\xe7\x96\xb2\xe3\x82\x8c\xe3\x81\x95\xe3\x82\x93.'
so it is obvious that we have high bytes followed by backslashed octal escapes. In the bytes of >>64 a textual backslash can be seen to be doubled.
0000deb0 6e 20 20 28 6c 65 74 2a 20 28 28 72 31 20 28 73 |n (let* ((r1 (s|
0000dec0 74 72 69 6e 67 2d 73 70 6c 69 74 20 72 61 6e 67 |tring-split rang|
0000ded0 65 20 23 5c 5c 2c 29 29 5c 6e 20 20 20 20 20 20 |e #\\,))\n |
So we just need to process the octals before the utf8 decoding:
>>> f = lambda b: bytes (int (b [4*k+1 : 4*k+4].decode ("ascii"), base=8) for k in range (len (b) // 4))
>>> g = lambda b: re.sub (rb"([\x80-\xff])((\\[0-7]{3})+)", lambda mo: mo.group (1) + f (mo.group (2)), b).decode ("utf-8")
>>> g (b)
'お疲れさん.'
Just do the equivalent of this in elisp and you can have your weeb characters. Someone might send this to the sbbs.el person.
Imagine there is a line with
>>> import re
anywhere before the g(b) call >>263, for the re.sub in g. It didn't make it through the copypasting but it was obviously there in the original because the g(b) call returned a result rather than raising a NameError.
To convert all the honeypot links on a page like
https://www.fossil-scm.org/fossil/rptview?rn=1
to ticket links:
Array.from (document.getElementsByTagName ("a")).filter (e => e.hasAttribute ("data-href") && /\/honeypot$/.test (e.getAttribute ("href"))).forEach (e => { e.setAttribute ("href", e.getAttribute ("data-href")); })
Obviously the hostiles >>262 they are so afraid of will be nice enough to refrain from reading the data-href attribute.
>>263
sbbs.el person here, your code is incomprehensible for non-pythonistas. Can anyone explain what's going on or at least write it out normally? "process the octals before" is a bit vague.
Thanks to whoever linked >>263 in the ticket.
https://fossil.textboard.org/sbbs/tktview?name=ee2e075a98
> non-pythonistas
What is a pythonista?
> your code is incomprehensible
Input: raw byte array
Output: unicode characters
1. ([\x80-\xff])((\\[0-7]{3})+)
Scan the input and identify locations where a byte over 0x80 is followed by one or more groups of "\DDD" where the Ds are octal digits.
2. Pass everything else through.
3. For each location, emit that first byte over 0x80, then loop over the "\DDD" groups.
4. For each group dump the backslash, take DDD to be an ascii string of three characters, parse that string as an integer in base 8, emit that integer as a byte.
5. After each location has been procesed decode the resulting byte array as utf-8.
*processed
sorry
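The five steps above can also be written out longhand. This is only a sketch of the same idea as the one-liners in >>263; the names `PATTERN`, `unescape_octal` and `decode_sexp_bytes` are made up here for illustration, and the input is the byte string of >>194 as shown earlier.

```python
import re

# Step 1: a byte in 0x80-0xff followed by one or more "\DDD" octal groups.
PATTERN = re.compile(rb"([\x80-\xff])((?:\\[0-7]{3})+)")

def unescape_octal(escapes: bytes) -> bytes:
    # Steps 3 and 4: walk the run of 4-byte groups, drop each backslash,
    # parse the three digits in base 8 and emit the value as one byte.
    out = bytearray()
    for k in range(0, len(escapes), 4):
        digits = escapes[k + 1 : k + 4]
        out.append(int(digits.decode("ascii"), 8))
    return bytes(out)

def decode_sexp_bytes(raw: bytes) -> str:
    # Step 2: everything the pattern does not match passes through untouched.
    fixed = PATTERN.sub(
        lambda mo: mo.group(1) + unescape_octal(mo.group(2)), raw
    )
    # Step 5: only now decode the repaired byte array as UTF-8.
    return fixed.decode("utf-8")

raw = b'\xe3\\201\\212\xe7\\226\xb2\xe3\\202\\214\xe3\\201\\225\xe3\\202\\223.'
print(decode_sexp_bytes(raw))  # お疲れさん.
```

Note that the pattern only touches escapes that directly follow a high byte, so ordinary ASCII text is left alone.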
> What is a pythonista?
A python programmer?
And thanks for the explanation, I get the original code now too, but it's still super cryptic. Shouldn't take long to translate into working elisp.
I didn't realize the complete Monapo font was so huge. It's been replaced with a lighter version that should suffice for SJIS-art.
>>263
>>269
sbbs can now render SJIS-art, though it looks weird without the right font: https://fossil.textboard.org/sbbs/info/17bd3b26618a4f16
> sbbs can now render SJIS-art
UTF-8 too, Nice!
>>263,267
Why doesn't the admin produce proper UTF-8 files? That seems much better than processing them after the fact. What even is this encoding? Seems like a bug.
>>288
That bug is called MIT Scheme.
http://web.mit.edu/scheme_v9.2/doc/mit-scheme-ref/Unicode.html
>>287
The sexp files are written in bbs.scm:post-message:
(call-with-output-file path (lambda (port) (write t port)))
This 'write' is a built-in of MIT/GNU Scheme and therefore the bug is not the admin's.
http://web.mit.edu/scheme_v9.2/doc/mit-scheme-ref/Output-Procedures.html#index-write-2117
Rest assured that if the bug had been Bitdiddle's, this would have been stated explicitly.
> Why doesn't the admin produce proper UTF-8 files?
To answer this question narrowly, the reason is that there is no built-in pair that reads/writes general scheme objects with proper utf-8 support. If you wish to submit such a pair of functions yourself, the admin will probably accept them if they pass correctness stress tests and the efficiency loss is not too great. But that is by no means a small undertaking. Patching the decoding was far easier.
The HTML files are written in actual utf-8, as far as we've seen.
>>288-290
Ah, that's unfortunate. I don't know Scheme so nope, no chance of me submitting some kind of patch.
>>263,267
I'm a bit late to the party and I don't know elisp very well but if I take the string of >>194 in the sexp file, I can get back the utf-8 representation like this:
ELISP> (string-as-multibyte (apply #'unibyte-string (mapcar 'multibyte-char-to-unibyte "ã\201\212ç\226²ã\202\214ã\201\225ã\202\223.")))
"お疲れさん."
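For comparison with >>263, a rough Python analogue of what this expression appears to do. Once the reader has already interpreted the octal escapes, every character in the string has a code point below 256, so mapping each character back to a single byte and decoding the result as UTF-8 recovers the text. This is a sketch of the idea only, not of Emacs internals; Latin-1 is assumed here as a stand-in for the char-to-unibyte mapping.

```python
# Python also reads "\201" inside a string literal as an octal escape,
# so this literal holds the same code points as the elisp string above.
s = "ã\201\212ç\226²ã\202\214ã\201\225ã\202\223."

# Latin-1 maps code points 0-255 to single bytes one-to-one.
raw = s.encode("latin-1")

print(raw.decode("utf-8"))  # お疲れさん.
```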
>>296
sbbs person here. My knowledge of encoding in Emacs is quite limited, since like most people I stick to multibyte buffers all the time. As far as I can see, this approach would also work. The only thing that annoys me is that I don't see a direct way to translate your expression into procedural code that works on buffers. That would be necessary to avoid converting the response into a string and back again, which just strains the garbage collector and slows everything down in larger threads (such as this one). If you find anything, post a note here or in the ticket linked above.