Now that tickets are accessible and I can actually read the issue, here's what happens. In https://bbs.jp.net/sexp/prog/39 the text of >>194 starts with "お疲れさん.", whatever that is, sent as the bytes:
0002ca50 6f 6e 74 65 6e 74 20 28 70 20 28 61 20 28 40 20 |ontent (p (a (@ |
0002ca60 28 68 72 65 66 20 22 2f 70 72 6f 67 2f 33 39 2f |(href "/prog/39/|
0002ca70 31 39 32 22 29 29 20 22 3e 3e 31 39 32 22 29 20 |192")) ">>192") |
0002ca80 28 62 72 29 20 22 e3 5c 32 30 31 5c 32 31 32 e7 |(br) ".\201\212.|
0002ca90 5c 32 32 36 b2 e3 5c 32 30 32 5c 32 31 34 e3 5c |\226..\202\214.\|
0002caa0 32 30 31 5c 32 32 35 e3 5c 32 30 32 5c 32 32 33 |201\225.\202\223|
0002cab0 2e 22 20 28 62 72 29 20 22 4e 6f 77 2c 20 49 27 |." (br) "Now, I'|
0002cac0 76 65 20 62 65 65 6e 20 6d 65 61 6e 69 6e 67 20 |ve been meaning |
The relevant bytes are:
>>> s = "e3 5c 32 30 31 5c 32 31 32 e7 5c 32 32 36 b2 e3 5c 32 30 32 5c 32 31 34 e3 5c 32 30 31 5c 32 32 35 e3 5c 32 30 32 5c 32 32 33 2e"
>>> b = bytes (int (t, base = 16) for t in s.split ())
>>> b
b'\xe3\\201\\212\xe7\\226\xb2\xe3\\202\\214\xe3\\201\\225\xe3\\202\\223.'
The original string in utf8 is:
>>> "お疲れさん.".encode ("utf8")
b'\xe3\x81\x8a\xe7\x96\xb2\xe3\x82\x8c\xe3\x81\x95\xe3\x82\x93.'
so the high bytes arrive partly raw and partly as backslashed octal escapes: \201\212 stands for \x81\x8a, and each run of escapes directly follows a raw high byte. In the bytes of >>64 a textual backslash can be seen to be doubled:
0000deb0 6e 20 20 28 6c 65 74 2a 20 28 28 72 31 20 28 73 |n (let* ((r1 (s|
0000dec0 74 72 69 6e 67 2d 73 70 6c 69 74 20 72 61 6e 67 |tring-split rang|
0000ded0 65 20 23 5c 5c 2c 29 29 5c 6e 20 20 20 20 20 20 |e #\\,))\n |
So we just need to process the octals before the utf8 decoding:
>>> import re
>>> f = lambda b: bytes (int (b [4*k+1 : 4*k+4].decode ("ascii"), base=8) for k in range (len (b) // 4))
>>> g = lambda b: re.sub (rb"([\x80-\xff])((\\[0-7]{3})+)", lambda mo: mo.group (1) + f (mo.group (2)), b).decode ("utf-8")
>>> g (b)
'お疲れさん.'
Just do the equivalent of this in elisp and you can have your weeb characters. Someone might send this to the sbbs.el person.
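A quick sanity check on the doubled backslash from >>64, since a textual backslash followed by digits is the obvious way this could go wrong. This is only a sketch: PATTERN is the same regex as in the one-liner, post_64 is the eight bytes "23 5c 5c 2c 29 29 5c 6e" copied from the dump, and the "tricky" input is a made-up case that assumes the server always doubles a literal backslash, which has only been observed in >>64.

```python
import re

# Same regex as in the one-liner: a high byte, then one or more \DDD groups.
PATTERN = re.compile(rb"([\x80-\xff])((\\[0-7]{3})+)")

# Bytes of >>64 from the dump: '#', doubled backslash, ',', '))', backslash, 'n'.
post_64 = bytes.fromhex("23 5c 5c 2c 29 29 5c 6e")
print(PATTERN.search(post_64))            # None, nothing to rewrite

# Hypothetical input: "é" followed by a textual backslash and the digits 201.
# Assumes the textual backslash is doubled on the wire, as in >>64.
tricky = "é".encode("utf-8") + rb"\\201"
print(PATTERN.search(tricky))             # None, the doubled backslash blocks the match

# A real escape still matches.
escaped = b"\xe3\\201\\212"
print(PATTERN.search(escaped) is not None)  # True
```

Note that this step leaves the doubled backslash and the literal backslash-n untouched. Undoing those would be a separate pass.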
Thanks to whoever linked >>263 in the ticket.
https://fossil.textboard.org/sbbs/tktview?name=ee2e075a98
non-pythonistas
What is a pythonista?
your code is incomprehensible
Input: raw byte array
Output: unicode characters
1. ([\x80-\xff])((\\[0-7]{3})+)
Scan the input and identify locations where a byte at or above 0x80 is followed by one or more groups of "\DDD" where the Ds are octal digits.
2. Pass everything else through.
3. For each location, emit that first byte at or above 0x80, then loop over the "\DDD" groups.
4. For each group drop the backslash, take DDD to be an ASCII string of three characters, parse that string as an integer in base 8, emit that integer as a byte.
5. After each location has been processed, decode the resulting byte array as UTF-8.
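The same five steps written out as named functions instead of lambdas, in case that reads better. A sketch, not a reference implementation: the names ESCAPED_RUN, unescape_octal_run and decode_sexp_text are made up here, and the sample input is the hex string from the dump of >>194.

```python
import re

# Step 1: a byte at or above 0x80 followed by one or more "\DDD" octal groups.
ESCAPED_RUN = re.compile(rb"([\x80-\xff])((?:\\[0-7]{3})+)")


def unescape_octal_run(run: bytes) -> bytes:
    # Step 4: each group is 4 bytes long, a backslash plus three octal digits.
    out = bytearray()
    for k in range(0, len(run), 4):
        digits = run[k + 1 : k + 4].decode("ascii")  # drop the backslash
        out.append(int(digits, 8))  # raises ValueError for values above \377
    return bytes(out)


def decode_sexp_text(raw: bytes) -> str:
    # Steps 2 and 3: re.sub passes unmatched bytes through unchanged and
    # replaces each match with its first high byte plus the unescaped run.
    fixed = ESCAPED_RUN.sub(
        lambda mo: mo.group(1) + unescape_octal_run(mo.group(2)), raw
    )
    # Step 5: only now is the byte array valid UTF-8.
    return fixed.decode("utf-8")


sample = bytes.fromhex(
    "e3 5c 32 30 31 5c 32 31 32 e7 5c 32 32 36 b2 e3 5c 32 30 32 "
    "5c 32 31 34 e3 5c 32 30 31 5c 32 32 35 e3 5c 32 30 32 5c 32 32 33 2e"
)
print(decode_sexp_text(sample))
```

Plain ASCII input goes through untouched, so it should be safe to run on every string in the response rather than only on ones known to contain escapes.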