
prog


Everything is Unicode, until the exploits started rolling in

20 2021-01-23 14:19

>>16

> Honestly, you do realize words come and go, right? People who complain that emojis are taking up precious space in the Unicode standard ought to be aware that exactly this would happen with a dictionary-styled encoding.

I don't have an issue with adding words to the dictionary: old words are still useful when they show up in old texts, and new words when they show up in new ones. Archaic spellings I would have an issue with, but spelling reforms are relatively rare and can often be handled by a translation layer.

> Moreover, are we really that pressed for space (memory/storage) that we need to optimize language for it?

The reason I had to buy 4 GB of RAM and upgrade my machine in the first place was how exceptionally inefficient MathJax/MathML are. My machine could play high-definition video just fine, but a hundred posts in a discussion forum using mathematical notation would force me into swap and take several minutes. My understanding is that Verisimilitudes also plans on extending the encoding to high-level typesetting information.

> Encoding root morphemes might be a better idea. But then you realize that upon mixing they transform in so many different ways (a change of vowel here, dropping or adding a consonant there, etc.) that implementing that would be 1000x more cumbersome than the whole Unicode standard.

I believe there are plans to add some degree of morpheme comprehension for non-isolating languages. Irregularities in natural languages are common, and I agree that this would be the most complex part. The difference between this and the complexity of the Unicode standard (in e.g. character properties) is that this information is to some extent necessary anyway: it is often going to exist on your machine regardless, for semantic analysis, and it makes some sorts of analysis trivial, such as spell checking and word counting. Additionally, there are problems with Unicode that the user does not need to worry about in this system, normalization in particular, so the call site would likely still be simpler.
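To illustrate why word counting and spell checking become trivial, a throwaway sketch, again assuming a hypothetical fixed-width 3-byte-per-word layout of my own invention:

```python
# With one fixed-width 3-byte code per word, some analyses fall
# straight out of the representation. (Hypothetical layout.)
WORDS = {"the": 0, "cat": 1, "sat": 2}
codes = {n.to_bytes(3, "big") for n in WORDS.values()}

# Encoded stand-in for "the cat sat the":
blob = b"".join(n.to_bytes(3, "big") for n in (0, 1, 2, 0))

def word_count(encoded):
    # No tokenization pass: the word count is the code count.
    return len(encoded) // 3

def spell_ok(encoded, valid_codes):
    # "Spell checking" reduces to dictionary membership per code.
    return all(encoded[i:i + 3] in valid_codes
               for i in range(0, len(encoded), 3))

assert word_count(blob) == 4
assert spell_ok(blob, codes)
```

Compare this with Unicode text, where counting words already requires segmentation rules and normalization before you can even compare strings.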

>>17

> Wouldn't text rendering still be expensive?

This mostly just depends on the implementation. In the most basic case I think you would drop support for exclusively vertical scripts, have no combining characters (ideally the latter would not exist in the most complex case either), and so on. You're then doing three things: based on the encoding assigned to a textblock, rendering either right to left or left to right; looking up words in the dictionary; and rendering the glyphs as a bitmap. I don't recall Verisimilitudes commenting on this, so these are just my personal thoughts; in general, the things I say on this topic are just my interpretation of his ideas.
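Again purely my interpretation, those three steps might look something like this; every name and table here is a hypothetical stand-in, not anything from the actual design:

```python
# Sketch of the three rendering steps: (1) direction from the
# textblock's subencoding, (2) dictionary lookup, (3) glyph bitmaps.
# All ids, tables, and "bitmaps" below are invented placeholders.
RTL_ENCODINGS = {7}                       # e.g. a right-to-left subencoding id
DICTIONARY = {0: "cat", 1: "sat"}         # word code -> word
BITMAPS = {"cat": "<bitmap:cat>", "sat": "<bitmap:sat>"}

def render_block(encoding_id, codes):
    # Step 1: direction is a property of the block's encoding.
    direction = "rtl" if encoding_id in RTL_ENCODINGS else "ltr"
    # Step 2: dictionary lookup, one word per code.
    words = [DICTIONARY[c] for c in codes]
    # Step 3: rasterize each word's glyph.
    glyphs = [BITMAPS[w] for w in words]
    return direction, glyphs
```

There is no bidirectional algorithm, no combining-mark composition, and no shaping pass in this basic case, which is where the cheapness would come from.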

>>18,19

> tfw you can't invent neologisms or new compound words due to them not being in the 3-byte word collection and have to revert back to alphabetic encodings (Not Safe for China)
> tfw the Communist party reallocates politically incorrect words from their positions to /dev/null and you can't talk about them anymore since they don't exist.

tfw US businesses box up and ship all of US industry to China, and then buy up all the news and media outlets along with most of the modes of communication and convince the US that it was China's fault that the US no longer has skilled labor, commodities are trash, poisonous, etc. They then manage to convince the US that any overt use of force by any non-US-proxy state is grounds for foreign intervention and loads of war-booty for the khan and his cronies directing it all.
Joking aside, part of the beauty of this system is that it treats different subencodings differently, so for Chinese characters the fallback encoding wouldn't be pinyin but simply a 16-bit encoding of Chinese characters. Avoiding “regulatory capture”, if you like, is not really the responsibility of the standard but of the standards body, and while that is perhaps an interesting discussion, it's not one I have a particular opinion on. Pinyin would likely be an entirely different subencoding altogether.
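To make the subencoding idea concrete, a sketch of a text as a sequence of tagged blocks. The tag values and the length-prefixed layout are invented for illustration, and I'm using UTF-16-BE purely as a stand-in for “a 16-bit encoding of Chinese characters”:

```python
# A text as tagged blocks, each carrying its own subencoding.
# (Tags and framing are hypothetical; the real format is unspecified.)
SUB_WORD, SUB_HAN16 = 0x01, 0x02   # invented subencoding tags

def emit(blocks):
    # Each block: 1-byte tag, 1-byte payload length, then the payload.
    out = bytearray()
    for tag, payload in blocks:
        out.append(tag)
        out.append(len(payload))
        out += payload
    return bytes(out)

# A dictionary-word block followed by a raw 16-bit Chinese-character block:
han = "漢字".encode("utf-16-be")    # stand-in for the 16-bit subencoding
blob = emit([(SUB_WORD, (42).to_bytes(3, "big")), (SUB_HAN16, han)])
```

The point is that text outside the word dictionary escapes into a character-level block rather than being romanized, so nothing forces a pinyin round trip.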


