Everything is Unicode, until the exploits started rolling in

9 2021-01-22 15:10

>>5,7
Using UTF-8 for everything is a bit like using lists for everything without building higher abstractions or considering alternative data structures. If you're lucky it works, but it makes mistakes far easier and is often inefficient (absent a high-level architecture). The argument made by Verisimilitudes, and accepted by at least a few others (myself included), is that a well-designed binary encoding is more correct than character streams with their serialization, parsing, coercion, etc.

Concerning written language, characters seem at first glance to be the most natural and correct representation, but if one is willing to look, one will find that there can be much more efficient encodings (depending to some extent on the morphology of the language). In English and other nearly isolating languages, encoding text as words rather than characters would be far more efficient in the common case, while character encoding could be kept for the exceptional situation of text that can't be found in a dictionary, such as the colloquial speech in Mark Twain's dialogues.
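To make the word-vs-character idea concrete, here is a rough sketch in Python. Everything in it is my own assumption for illustration: the toy DICTIONARY, the fixed 2-byte word codes, and the ESCAPE marker followed by a length-prefixed UTF-8 spelling. It is not anyone's actual format, just one way the scheme could look.

```python
# Toy word-level encoding with a character-level escape.
# DICTIONARY, the 2-byte code width and ESCAPE are all hypothetical choices.
DICTIONARY = ["the", "river", "was", "wide", "and", "slow"]
WORD_TO_CODE = {word: i for i, word in enumerate(DICTIONARY)}
ESCAPE = 0xFFFF  # assumed marker: a spelled-out word follows


def encode(words):
    """Encode a list of words; spaces between words are implied."""
    out = bytearray()
    for word in words:
        code = WORD_TO_CODE.get(word)
        if code is not None:
            # Common case: one fixed-width code per dictionary word.
            out += code.to_bytes(2, "big")
        else:
            # Exceptional case: fall back to characters.
            raw = word.encode("utf-8")  # sketch assumes len(raw) < 256
            out += ESCAPE.to_bytes(2, "big")
            out += len(raw).to_bytes(1, "big")
            out += raw
    return bytes(out)


def decode(data):
    """Inverse of encode: returns the list of words."""
    words = []
    i = 0
    while i < len(data):
        code = int.from_bytes(data[i:i + 2], "big")
        i += 2
        if code == ESCAPE:
            n = data[i]
            i += 1
            words.append(data[i:i + n].decode("utf-8"))
            i += n
        else:
            words.append(DICTIONARY[code])
    return words


plain = ["the", "river", "was", "wide"]
dialect = ["the", "river", "warn't", "wide"]  # "warn't" is not in the dictionary
```

With this toy dictionary the plain sentence is 8 bytes against 18 as space-separated UTF-8, and the dialect word costs its spelling plus 3 bytes of overhead. A real dictionary would need wider or variable-width codes, so the numbers only show the shape of the trade-off.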

>>6
BCD is an issue, yes, as is the (insane) complexity of normalization, rendering, and properties, but even beyond the implementation of Unicode there are considerable issues with the fundamental principle. Unifying the encoding of many different languages horizontally results in some languages being encoded considerably more efficiently than others, even with hacks to improve the inefficient ones. This is best seen in CJK. Han Unification fuses the Chinese, Japanese, and Korean characters, making it impossible to mix any combination of the three languages in a single file (unless you're writing Korean or Japanese without Han characters), and yet these languages still take an extra byte (or two!) more than necessary per glyph in UTF-8, as a comparison with Shift-JIS shows.
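The byte overhead is easy to check with Python's built-in codecs. The sample string is my own pick; any common kanji covered by Shift-JIS should behave the same way.

```python
# Compare per-glyph cost of the same Japanese text in UTF-8 and Shift-JIS.
text = "日本語"  # three kanji, all in the Basic Multilingual Plane

utf8_bytes = text.encode("utf-8")       # 3 bytes per kanji
sjis_bytes = text.encode("shift_jis")   # 2 bytes per kanji

# A Han character outside the BMP (U+20BB7) costs 4 bytes in UTF-8.
rare_utf8_bytes = "\U00020BB7".encode("utf-8")

print(len(utf8_bytes), len(sjis_bytes), len(rare_utf8_bytes))
```

So that's one extra byte per glyph for ordinary kanji, and two for the supplementary-plane ones.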
