>>35
As I see it, the idea is to keep a list of code points (like Unicode, but simplified) and then use word dictionaries to compress semi-rich-text documents. The bloat probably won't come from the system's built-in dictionaries, since those can be compressed and structured in a backward- and forward-compatible way. The additional dictionaries bundled with each document may add up to a lot, but the total storage efficiency is still to be estimated.
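
To make that concrete, here is a rough sketch of how I picture it. Everything in it is my own assumption, not anything specified in the thread: the names (BUILTIN, encode, decode), the tiny word list, and the even/odd token numbering are all made up for illustration. The built-in dictionary is treated as append-only, and built-in and per-document words get separate ID spaces so old documents keep decoding after the built-in list grows.

```python
# Hypothetical sketch only: names, word list and token layout are assumptions.

# Assumed system dictionary. Append-only, so existing IDs never change.
BUILTIN = ["the", "of", "and", "document", "dictionary"]


def encode(text, builtin=BUILTIN):
    """Return (extra, tokens) for a text of single-space-separated words.

    Built-in word i is stored as the even token 2*i.
    Per-document word j is stored as the odd token 2*j + 1.
    Keeping the two ID spaces apart means growing the built-in
    dictionary later cannot collide with per-document IDs.
    """
    builtin_index = {w: i for i, w in enumerate(builtin)}
    extra = []        # dictionary bundled with this document
    extra_index = {}  # word -> position in extra
    tokens = []
    for word in text.split(" "):
        if word in builtin_index:
            tokens.append(2 * builtin_index[word])
        else:
            if word not in extra_index:
                extra_index[word] = len(extra)
                extra.append(word)
            tokens.append(2 * extra_index[word] + 1)
    return extra, tokens


def decode(extra, tokens, builtin=BUILTIN):
    """Rebuild the text from the bundled dictionary and the token list."""
    words = []
    for t in tokens:
        source = extra if t % 2 else builtin
        words.append(source[t // 2])
    return " ".join(words)


text = "the size of the document dictionary and the bloat"
extra, tokens = encode(text)
restored = decode(extra, tokens)
```

In this toy run only "size" and "bloat" end up in the bundled dictionary, which is the part that would add up across many documents. Whether that cost is outweighed by the shorter token stream depends on the real dictionaries and the token encoding, which this sketch doesn't try to measure.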