archive provided by lainchan.jp

lainchan archive - /λ/ - 19540



File: 1477025531819.png (99.06 KB, 300x200, 3000px-Chi_uc_lc.svg.png)

No.19540

This is the thread to discuss the various manners of encoding characters, character information systems, and the other means we generally use to represent text and related information.

This is also the thread to debate the positives and negatives of each such system and its various platforms of use.

  No.19541

To start the discussion: I find that the currently dominant character encoding systems, ASCII and Unicode, are very inefficient and not particularly well suited to their use. They dominate by social power, not technical merit.

ASCII dates from the era of the telegraph, and this shows in the fact that one fourth of the 128 characters it defines are almost entirely useless on a modern system. This alone is enough to discredit ASCII as grossly inefficient. Being a seven bit code doesn't count against it, because the eight bit byte is just another arbitrary social convention based on old hardware.
Unicode is large, complicated, and full of frivolities, such as over one thousand emojis and dead Mayan glyphs. It also contains ASCII. This wanton desire to add ever more discredits Unicode. Unicode also has several encodings, making it an even further mess.

Setting aside for now that a computer system may need only some, and not all, of the characters in these repertoires to function properly, it is rather easy to say that a set encompassing the Latin, Greek, and Cyrillic alphabets, along with the Western Arabic numerals and various punctuation, would be sufficient for the majority of use in much of the world.

Such a set would fit within eight, nine, or ten bits, depending on how diacritics, the number of punctuation characters, historical characters, and other variations, such as uppercase letters, are treated.

It is likewise easy to say that the Chinese characters would fit within seventeen bits. Each language encompassing variations of these, such as Japanese and Korean, could use its own encoding and define a simple mapping.

Considering that most of the texts in the world are, presumably, written in one language and that texts containing more than one language are often prepared in a special typographical system, it seems fair to delegate the mixing of languages to a higher system, rather than requiring the character encoding to support this at the expense of efficiency and other considerations.

A system not using all characters in a language should probably define its own encoding regardless, for both efficiency and error-correcting purposes.

  No.19542

we've had this discussion many times before, and basically it boils down to this: you have three unfounded beliefs:

1) you believe that, if something is purpose-made, it will be better than the general-purpose solution. This is false. Take the example of cryptographic libraries: home-rolled crypto libraries are almost universally vulnerable to side-channel attacks. For text encoding it's different: you want to be able to read a file created on one system, in another system. That is not just a nice thing to have, it is a critical feature.

2) saving a few bits per character is important. It's really not. For instance, suppose we could use just 57 glyphs and that's all we need for a particular application. That's just 6 bits. Now, you essentially have two choices: use up all 8 bits in a byte for each character, wasting those bits anyway, or pack all your strings for optimum size. Unpacking that string will be a huge pain in the ass, and in the end you'll have to move it to bytes anyway for internal computation. If you want to save space, you could do better by just using UTF-8 and then compressing your files with LZW.

3) Because unicode is large, it is inefficient. This is false. Unicode pushes relatively unused characters towards the end of the spectrum. Your complaint about Maya writing, for instance, is practically moot, because it lies somewhere around the 0x1550 mark, which means that in most encodings (except UTF-32) its existence in the standard makes no difference to the efficiency of a string that doesn't use those characters.


Finally, the main reason you should be using unicode is that it is the *only* solution that allows you to write several different scripts in the same file without switching encodings. That's really important, because if you have to write different libraries for displaying text in different encodings for the same application, that is a *massive pain in the ass*. Rather than making anything easier or more efficient, having multiple encodings for different scripts makes everything less efficient.

And as for unicode encodings, the obvious choice is UTF-8 since it's backwards-compatible with ASCII and is very space-efficient.
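
To make the ASCII compatibility concrete, here's a minimal sketch of a UTF-8 encoder, nothing more than the standard bit layout (no handling of surrogates): any code point below 0x80 comes out as a single byte identical to its ASCII value, which is exactly why plain ASCII files are already valid UTF-8.

    #include <stdint.h>
    #include <stdio.h>

    /* Encode one Unicode code point as UTF-8.  Returns the number of bytes
       written to out (out must hold 4 bytes), or 0 if the code point is out
       of range.  A real encoder would also reject surrogates. */
    static int utf8_encode(uint32_t cp, unsigned char out[4])
    {
        if (cp < 0x80) {                                  /* ASCII: one byte, unchanged */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {                          /* two bytes */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {                        /* three bytes */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp < 0x110000) {                       /* four bytes */
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;
    }

    int main(void)
    {
        unsigned char buf[4];
        printf("U+0041 -> %d byte(s)\n", utf8_encode(0x41, buf));    /* 'A': one byte, 0x41 */
        printf("U+3042 -> %d byte(s)\n", utf8_encode(0x3042, buf));  /* hiragana 'a': three bytes */
        return 0;
    }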

So that's basically it. The reasons for it are very strong; the reasons against it are very weak.

  No.19543

>>19542
correction: mayan glyphs start at 0x15500.

  No.19547

>>19541
The actual trouble with encodings is and has always been people that write programs that don't care about encodings at all, and therefore use ASCII because it is the native OS/programming language encoding and because they don't know nor care about any other language. People use ASCII because they don't care. They don't even know they're using it, and couldn't care less. Your solution to this problem is to add even more incompatible encodings to the mix? Really?

UTF-8 not only solves the encoding problem for nearly every situation with a single encoding, it also accommodates the vast number of programs written by lazy people by virtue of being compatible with ASCII. That solves so many problems it could be considered a miracle.

Your technical reasons for not using unicode don't matter. The amount of bits the encoding uses? Utterly irrelevant since you can compress data when that actually matters. Why does it matter whether or not Unicode has a bunch of emojis? The code point range is already set in stone, might as well use it. That only makes it even more useful really. No, it's not fair at all to delegate the mixing of languages to some "higher typographical system", the resulting drop in usability for even simple text editors and programs would be unacceptable and all for what, a bunch of bits? It's already annoying enough that you need soykaf like Latex to encode mathematical formulas and such, do you actually think it's acceptable to force this kind of bullsoykaf on everyone?

In the Lisp thread it was said that the Scheme standards chose not to specify Unicode as the underlying encoding. I cannot think of a bigger lost opportunity.

  No.19550

>>19541
In the math thread there's a Slack channel where, among other things, people study Japanese. So they talk to each other in English and often use Japanese characters in their messages.

That would be pretty much impossible without Unicode.

  No.19553

>>19542
>1) you believe that, if something is purpose-made, it will be better than the general-purpose solution. This is false.
It can be false. Regardless, we're probably using different definitions for ``better''.
>Take the example of cryptographic libraries: home-rolled crypto libraries are almost universally vulnerable to side-channel attacks.
I'm so glad you mentioned this. The ``don't roll your own crypto'' advice is a great example of advice people are congratulated for not following, as every cryptographic procedure is ``home-rolled'' by someone. Those people are then generally trusted, as with OpenSSL, no matter the incompetence. It's better to be hurt by one's own ignorance and learn, than be hurt by another's and learn nothing.
>For text encoding it's different: you want to be able to read a file created on one system, in another system. That is not just a nice thing to have, it is a critical feature.
Don't worship interoperability.

>2) saving a few bits per character is important. It's really not.

I don't like the idea of the modern memory cache, but consider it: efficiency isn't a bad thing.
>For instance, suppose we could use just 57 glyphs and that's all we need for a particular application. That's just 6 bits. Now, you essentially have two choices: use up all 8 bits in a byte for each character, wasting those bits anyway, or pack all your strings for optimum size.
This depends very heavily on the program being developed. Those two bits could be used to store valuable information. As an example, if I will often be using three-character strings and also have a related byte, half of which is used for flags for each string, I could store those flags within those extra bits.
Of course, packing may also be the appropriate solution, which could make a difference after several hundreds or thousands of strings.
>Unpacking that string will be a huge pain in the ass, and in the end you'll have to move it to bytes anyway for internal computation. If you want to save space, you could do better by just using UTF-8 and then compressing your files with LZW.
The unpacking wouldn't be that bad. It could be done with masking and shifting from a single thirty-two bit register in just a few instructions.
Who said anything about files? It would be queer to use this LZW on in-memory arrays, wouldn't you write?
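
To sketch what I mean, assuming a made-up 6-bit character code purely for illustration: three characters plus a handful of flag bits fit in one 32-bit word, and each field comes back out with one shift and one mask.

    #include <stdint.h>

    /* Hypothetical layout: three 6-bit character codes in bits 0-17, with the
       remaining 14 bits free for per-string flags.  The 6-bit code itself is
       invented for the sake of the example. */
    #define CHAR_BITS 6
    #define CHAR_MASK 0x3Fu

    static uint32_t pack3(unsigned c0, unsigned c1, unsigned c2, unsigned flags)
    {
        return (uint32_t)(c0 & CHAR_MASK)
             | ((uint32_t)(c1 & CHAR_MASK) << 6)
             | ((uint32_t)(c2 & CHAR_MASK) << 12)
             | ((uint32_t)flags << 18);
    }

    static unsigned unpack_char(uint32_t word, unsigned index)  /* index 0..2 */
    {
        return (word >> (index * CHAR_BITS)) & CHAR_MASK;       /* one shift, one mask */
    }

    static unsigned unpack_flags(uint32_t word)
    {
        return word >> 18;
    }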

>3) Because unicode is large, it is inefficient. This is false.
We're probably using different definitions for ``efficient''. Unicode is inefficient in how it's created, stored, documented, and manipulated, to name several qualities.
>Unicode pushes relatively unused characters towards the end of the spectrum. Your complaint about Maya writing, for instance, is practically moot, because it lies somewhere around the 0x1550 mark, which means that in most encodings (except UTF-32) its existence in the standard makes no difference to the efficiency of a string that doesn't use those characters.
I take issue with the very idea that an encoding we're supposed to be using for our everyday language tasks itself with encompassing dead languages and pointless glyphs.

  No.19554

>>19542
>Finally, the main reason you should be using unicode is that it is the *only* solution that allows you to write several different scripts in the same file without switching encodings.
Correction: Unicode is the only encoding with in-band control and the characters to allow for such.
The fact that the modern file is nothing more than an unintelligent bag of bytes is inconsequential. Files with proper metadata could easily store the several different encodings I've proposed out-of-band.
Simply because some systems make Unicode slightly easier to deal with halfway correctly is no excuse not to think of and use better systems.
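
As a sketch of the sort of out-of-band metadata I mean (the structure and the tag values here are entirely made up for illustration), the encoding travels alongside the bytes instead of being guessed from them:

    #include <stddef.h>

    /* Hypothetical tagged text: the encoding is carried as metadata next to
       the payload rather than being inferred from the bytes themselves. */
    enum text_encoding {
        ENC_LATIN_BASIC = 1,   /* invented 8-bit Latin/Greek/Cyrillic set */
        ENC_CJK_17BIT   = 2,   /* invented 17-bit CJK set */
        ENC_UTF8        = 3
    };

    struct tagged_text {
        enum text_encoding encoding;     /* out-of-band encoding tag */
        size_t             length;       /* payload length in bytes */
        const unsigned char *bytes;      /* the encoded text itself */
    };
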
>That's really important, because if you have to write different libraries for displaying text in different encodings for the same application, that is a *massive pain in the ass*. Rather than making anything easier or more efficient, having multiple encodings for different scripts makes everything less efficient.
You'll need to elaborate on this.

>So that's basically it. The reasons for it are very strong; the reasons against it are very weak.

Your reasons amount to an appeal to the general-purpose, a dismissal of efficiency, a claim that Unicode isn't inefficient, and social push. I find none of your arguments to be particularly strong.

  No.19555

>>19547
>The actual trouble with encodings is and has always been people that write programs that don't care about encodings at all, and therefore use ASCII because it is the native OS/programming language encoding and because they don't know nor care about any other language. People use ASCII because they don't care. They don't even know they're using it, and couldn't care less. Your solution to this problem is to add even more incompatible encodings to the mix? Really?
Yes. I don't believe that one should give in to all of this garbage in computing simply because a collection of fools made an organization to push their ideas on the rest of the world.
There will never be one encoding to rule them all.

>UTF-8 not only solves the encoding problem for nearly every situation with a single encoding, it also accommodates the vast number of programs written by lazy people by virtue of being compatible with ASCII. That solves so many problems it could be considered a miracle.

You're equating not using Unicode with laziness. You're also glossing over the many flaws and vulnerabilities created by Unicode, such as the need to escape control characters and sequences of characters resembling other characters, greatly complicating blocking systems.
Another issue with ASCII and Unicode is that they both contain control characters. The very idea is a relic of the telegraph and other hardware that is merely emulated nowadays.

>Your technical reasons for not using unicode don't matter. The amount of bits the encoding uses? Utterly irrelevant since you can compress data when that actually matters.

Compression is no panacea.
>Why does it matter whether or not Unicode has a bunch of emojis? The code point range is already set in stone, might as well use it. That only makes it even more useful really.
It is drivel that will only cause issues with age.
>No, it's not fair at all to delegate the mixing of languages to some "higher typographical system", the resulting drop in usability for even simple text editors and programs would be unacceptable and all for what, a bunch of bits? It's already annoying enough that you need soykaf like Latex to encode mathematical formulas and such, do you actually think it's acceptable to force this kind of bullsoykaf on everyone?
I don't believe we should hold our modern text editors and programs as an ideal to follow. It's not as if many of them are written or designed particularly well. We should strive for better. Perhaps future systems shouldn't offer a system without these higher typographical abilities, so that they can't be bemoaned by someone still using a very basic editor.

>In the Lisp thread it was said that the Scheme standards chose not to specify Unicode as the underlying encoding. I cannot think of a bigger lost opportunity.

Scheme is older than Unicode and will probably outlive it. It would be a shame for such a language to die because it married a silly character set. The best languages generally define a custom character set to be used at the bare minimum.

>>19550
>In the math thread there's a Slack channel where, among other things, people study Japanese. So they talk to each other in English and often use Japanese characters in their messages.
Slack is already a mess of JSON. This is such a higher typographical system, although not a good one.
>That would be pretty much impossible without Unicode.
You're unimaginative.

  No.19556

>>19553
>``don't roll your own crypto'' advice is a great example of advice people are congratulated for not following, as every cryptographic procedure is ``home-rolled'' by someone.

Not true. Most good crypto libraries (OpenSSL, GPG, libsodium) are made by professionals who spend hundreds of hours making sure they've got it right. Rolling your own is not only a waste of time, it's also dangerous for a novice to try.

>Don't worship interoperability.


I wouldn't, if it weren't *the entire purpose of text encoding*

>Unicode is inefficient in how it's created, stored, documented, and manipulated, to name several qualities.


explain.

>I take issue with the very idea that an encoding we're supposed to be using for our everyday language tasks itself with encompassing dead languages and pointless glyphs.


Alright, then *don't use them*. You're going to die on this hill for no fuarrrking reason besides that you don't like big numbers.

>Files with proper metadata could easily store the several different encodings I've proposed out-of-band.


And for incomplete reads? For corrupted files? How about for in-memory strings? Should you have to store a big-ass hunk of metadata for every string in memory, to make sure it works correctly if it has multiple encodings in it?

>You'll need to elaborate on this.


C already has the problem that dealing with unicode requires its own set of libraries, because the stdlib doesn't support it by default. Now imagine if there were a hundred different encodings, all of which must be understood by a general-purpose text processing program. Do you realize how much complexity that adds to a program?

>Your reasons amount to an appeal to the general-purpose, a dismissal of efficiency, a claim that Unicode isn't inefficient, and social push. I find none of your arguments to be particularly strong.


My reasons are all *very* strong. I mean, if we wanted to boil things down further, you just think that unicode is bad because it's old. x86 is bad because it's old, the Von Neumann architecture is bad because it's old. That's all bullsoykaf; it's just an appeal to novelty.

>>19555
>It is drivel that will only cause issues with age.

any reason why it will cause issues, or are you just so sure of it that it's unimaginable that it could be otherwise?

>

  No.19559

>>19553
>Those people are then generally trusted, as with OpenSSL, no matter the incompetence.
The cryptography part of OpenSSL was written by Eric Young, a Sun Microsystems cryptographer.
>I could store those flags within those extra bits
So you save 25% of space by introducing huge data complexity? Good job.

  No.19600

>>19556
Your post seemed unfinished, but I guess you are.

>Not true. Most good crypto libraries (OpenSSL, GPG, libsodium) are made by professionals who spend hundreds of hours making sure they've got it right. Rolling your own is not only a waste of time, it's also dangerous for a novice to try.

Anyone can implement a cryptographic procedure correctly. It's important to use a variety of implementations, if only so one vulnerability in one implementation doesn't touch billions of people.

>I wouldn't, if it weren't *the entire purpose of text encoding*

The purpose of text encoding is to encode characters.

>explain.

The Unicode standard can't fit on one page. Its use of computer storage is similarly large. A Unicode manipulation program is inherently large.
As an example, try writing a Unicode program to display all of the characters and return case conversions and whatnot, without any libraries.
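
As a small illustration of the gap, here's a sketch using nothing but the C standard library: byte-wise case conversion in the C locale handles ASCII and silently passes over everything else, so even a single accented Latin letter encoded as UTF-8 comes through unchanged. Doing it properly means carrying the Unicode case-mapping tables around.

    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        /* "café" in UTF-8; 0xC3 0xA9 is the two-byte sequence for é (U+00E9). */
        unsigned char text[] = { 'c', 'a', 'f', 0xC3, 0xA9, 0 };
        for (unsigned char *p = text; *p; p++)
            *p = (unsigned char)toupper(*p);   /* C-locale toupper: a-z only */
        /* On a UTF-8 terminal this prints "CAFé" -- the é survives untouched,
           because neither of its bytes falls in 'a'..'z'. */
        printf("%s\n", (char *)text);
        return 0;
    }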

>Alright, then *don't use them*. You're going to die on this hill for no fuarrrking reason besides that you don't like big numbers.

That's not a good argument. If something exists in a standard, it will inevitably become a problem at some point for someone and require understanding.
Look at how web software can regularly be attacked by Unicode characters that aren't properly escaped.

>And for incomplete reads? For corrupted files?

Is it ever desirable for a file to be read improperly? No, the system shouldn't allow it.
>How about for in-memory strings? Should you have to store a big-ass hunk of metadata for every string in memory, to make sure it works correctly if it has multiple encodings in it?
This is very specific to the type of program being written. I've merely proposed one strategy. There's no reason every program would use the same one.

>C already has the problem that dealing with unicode requires its own set of libraries, because the stdlib doesn't support it by default. Now imagine if there were a hundred different encodings, all of which must be understood by a general-purpose text processing program. Do you realize how much complexity that adds to a program?

Any text processing program, such as Emacs, must already support the many different encodings and whatnot.
It's already a mess.

>My reasons are all *very* strong. I mean, if we wanted to boil things down further, you just think that unicode is bad because it's old. x86 is bad because it's old, the Von Neumann architecture is bad because it's old. That's all bullsoykaf; it's just an appeal to novelty.

I don't like Unicode because it's complex and makes many decisions I disagree with. I don't like the idea of a character set having multiple encodings. The x86 processor is similarly full of cruft, even containing vulnerabilities baked into the silicon.
Your arguments amount to an appeal to authority. You would rather everyone else make a decision for you. It's one thing for a programming language to have a standard, if it's a good one, but it's entirely different to have a standard I don't like shoved down my throat.
So, follow standards you like and vehemently oppose those you don't.

>any reason why it will cause issues, or are you just so sure of it that it's unimaginable that it could be otherwise?

As written earlier, if something exists in a standard, it will inevitably become a problem at some point for someone and require understanding.
Why are these Unicode code points these stupid, strange symbols? Oh, it was a fad in the 2010s to have a symbol for feces and other inanities.

>>19559
>So you save 25% of space by introducing huge data complexity? Good job.
On constrained systems, it's worthwhile. It depends on the system.

  No.19623

Is it ever acceptable for an encoding to not have the numerals ordered and adjacent? It's just something that stuck out to me in a C book. Is it the encoding's responsibility to make sure that '9' - '1' is 8, or should the programmer be using some sort of encoding-agnostic abstraction?

  No.19626

>>19600
>try writing a Unicode program to display all of the characters and return case conversions and whatnot
Why would anyone ever need a program that converts cases in all possible alphabets ever?
But hey, I cannot think of any alphabet that has both uppercase and lowercase, other than the latin alphabet, oh and greek. I don't know about russian, but I know for sure that japanese, chinese, korean, tibetan, and all the indic alphabets are single-case. Middle east languages probably are single-case too.
I still don't see a reason to make a piece of software that handles the nuances of every language ever. I don't type chinese so I don't have Cangjie input installed.
>I don't like Unicode because it's complex and makes many decisions I disagree with
Can't please everybody. If there's one thing I've learned (particularly with lainchan), it's that nothing will ever please everybody. There's always people who'll oppose it. Standardization, however, is a good thing, it really is. Didn't Common Lisp get standardized because the community so desperately needed a common, unified dialect? C did too. But in the end no standard can ever possibly cover all cases and make all people happy.
To further my example, there's some user here in lainchan who uses CL entirely because it's cross portable across every system ever, at the cost of useful features that weren't included in the standard and are now implementation-specific.

  No.19631

For whatever reason I'm designing my own encoding. Seems pretty fun so far, the current idea is pretty much ASCII but with less retarded control characters and more useful characters.

  No.19632

>>19623
>Is it ever acceptable for an encoding to not have the numerals ordered and adjacent?
I would think it would be, if there were a compelling reason. Do you have a compelling reason for why this wouldn't hold? It's useful for printing numbers.
>It's just something that stuck out to me in a C book. Is it the encoding's responsibility to make sure that '9' - '1' is 8, or should the programmer be using some sort of encoding-agnostic abstraction?
Here's a fun fact: Common Lisp also requires that numerical characters are ordered. Many languages simply pick an encoding, rather than allowing any encoding meeting certain restrictions.
I wouldn't rely on character encoding for a program.
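
For C specifically, the standard does guarantee that the characters '0' through '9' are contiguous and increasing, so c - '0' is well defined there; for anything beyond that, a lookup that assumes nothing about the character codes is the safer habit. A quick sketch of both:

    #include <string.h>

    /* Relies on the C guarantee that '0'..'9' are contiguous and increasing. */
    static int digit_value_arith(char c)
    {
        return c - '0';
    }

    /* Encoding-agnostic: look the character up in a table instead of
       assuming anything about its numeric code.  Returns -1 for non-digits. */
    static int digit_value_lookup(char c)
    {
        const char digits[] = "0123456789";
        const char *p = (c != '\0') ? strchr(digits, c) : NULL;
        return p ? (int)(p - digits) : -1;
    }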

>>19626
>Why would anyone ever need a program that converts cases in all possible alphabets ever?
That's not the point. My point is that manipulating Unicode is complicated.
>But hey, I cannot think of any alphabet that has both uppercase and lowercase, other than the latin alphabet, oh and greek. I don't know about russian, but I know for sure that japanese, chinese, korean, tibetan, and all the indic alphabets are single-case. Middle east languages probably are single-case too.
The program would need to handle that.
>I still don't see a reason to make a piece of software that handles the nuances of every language ever. I don't type chinese so I don't have Cangjie input installed.
So, we agree that it's perfectly fine for a program to only operate with, say, English?

>Can't please everybody. If there's one thing I've learned (particularly with lainchan), it's that nothing will ever please everybody. There's always people who'll oppose it. Standardization, however, is a good thing, it really is. Didn't Common Lisp get standardized because the community so desperately needed a common, unified dialect? C did too. But in the end no standard can ever possibly cover all cases and make all people happy.

This is true, but I find the ANSI Common Lisp standard to be much better than Unicode. I've not compared the two documents, but the Unicode standard may be larger.
>To further my example, there's some user here in lainchan who uses CL entirely because it's cross portable across every system ever, at the cost of useful features that weren't included in the standard and are now implementation-specific.
That's me. To clarify, I use other languages, but Common Lisp is one of my favorites. I do try to avoid anything nonstandard unless it can't be helped. I'm more yielding towards fundamental language features, like threading and networking, added by libraries that are still inherently portable across machine types, such as Bordeaux Threads.
So, I can use a library such as Bordeaux Threads on any operating system, so long as the implementation supports the set or a superset of the model it uses. Its portability is only an issue of implementation, rather than tying a program to a specific operating system or UNIX shared object.

  No.19633

>What do you wish language designers paid attention to?
>#1 answer
>Unicode support

http://softwareengineering.stackexchange.com/a/33639

  No.19641

>>19600
>The purpose of text encoding is to encode characters.
And then to be able to read those characters back. Because networks exist, characters might be encoded on one machine and decoded on another. It seems foolish to say that interoperability isn't the purpose of text encoding. Interoperability is the purpose of all data formats.