If we had another chance at a character encoding standard, what would you propose?

Attached: mesa.jpg (1000x1000, 50.29K)

Other urls found in this thread:

en.wikipedia.org/wiki/TRON_(encoding)
tronweb.super-nova.co.jp/unicoderevisited.html
hkgolden.com/
discuss.com.hk/

English only.

kys

UTF-8 is the only standard we need

An extremely grammatically simple language designed for use on a computer. Each of its 256 letters (including spaces and similar symbols) is encoded as a single byte. Because it is intended for computer use, things like line breaks, EOFs, and even advanced markup could be represented with words that fit very naturally into the language's grammar. Simpler renderers could ignore the special words and it would still be human-readable.

People would learn the language in addition to learning how to use a computer. Because of how global the Internet is, it would make sense for such a universal language to exist for computers.

Unicode + UTF-8 is about as good as it gets. Maybe some more characters should be differentiated - for example, 'I'.lower() != 'i' in a Turkish locale - but my impression is that most problems now are related to compatibility and the inherent difficulty of formalizing human notation.
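To make the Turkish example concrete, here is a minimal Python sketch (CPython 3 assumed); str.lower() applies Unicode's locale-independent default case mapping, which is exactly what goes wrong for Turkish text:

print('I'.lower())   # 'i'  - correct for English, but Turkish expects dotless 'ı'
print('İ'.lower())   # 'i̇' - 'i' followed by U+0307 COMBINING DOT ABOVE
print('ı'.upper())   # 'I'  - the dotted/dotless distinction is lost on the round trip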

Seconding this. We don't need to stick all the fucking memes and emojis into unicode, but UTF-8 is a really good format.

unicode is dickballs

the emojis aren't even the problem with unicode. it has about 3000 problems and emojis amount to about 0.1% of the problem. of course seeing them add a bunch of pointless emojis is still insufferable

UTF-8 is cancer. It's ambiguous and can encode invalid strings. No one handles normalization correctly and only a few people even understand what the problem is. Normalization isn't reversible, so there is information loss. Normalization requires massive tables of bullshit that will have to be kept up to date forever, which is why there are dozens of copies of the massive ICU.DLL on your Windows system, dozens of libicu copies on your smartphone, and even a couple of redundant copies on a Linux distro. How did they fuck up something so badly that should have been so simple?
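A minimal Python sketch of the normalization problem being described, using unicodedata from the standard library: the same visible text can be stored precomposed or as a base letter plus a combining mark, the two forms are different byte sequences, and normalizing them together throws away which form you started with.

import unicodedata

composed = "\u00e9"        # 'é' as one code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
print(composed == decomposed)                                 # False
print(composed.encode("utf-8"), decomposed.encode("utf-8"))   # b'\xc3\xa9' b'e\xcc\x81'
print(unicodedata.normalize("NFC", decomposed) == composed)   # True, but the original
# composed-vs-decomposed distinction is gone once you normalize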

ASCII did literally nothing wrong and oughtta be enuff for anybody

I'm going to assume you mean character code and not encoding. Unicode had the right idea, but they fucked up by adding everything and the kitchen sink (there is probably a kitchen sink emoji). Do what Unicode did, but limit it to only semantic characters, which excludes emoji and similar shit.


You are probably just shitposting, but there are people out there who really seem to be under the impression that English is the only language that matters.

Like everyone else, pretty much UTF-8.

So can pretty much every other encoding.
Normalization is necessary for any encoding that wishes to span a large character set. It's an unavoidable complexity.
And this is a problem why? If you really need it unnormalized, just keep it that way until you do a compare operation or whatever.
They aren't on my system. I only have two copies, a 32-bit and a 64-bit version.

There should be nothing more than extended ASCII. If you do not write in a latin alphabet you're shit.
Also just make your own encoding for non-latin alphabets. It doesn't have to be universal.

How do you write APL then faggot.

this, ascii is more than enough. let the niche moonrune languages deal with it themselves if it matters so much.

Plain ASCII is good. It even works on the smallest 8-bit computers.

What if I want to use mathematical notation, or mix my Plato with my English, in the same document or even the same sentence?
Do you really think it's less messy to have a trillion encodings?

You mean unicode?

I think it's less messy to use a markup language (or Emacs, or Word) for special cases like that. Having a small, easy-to-understand *normal*, *canonical*, *default* encoding is important if you want to accept even usernames without going completely mad. Unicode's allowance of mixing what should logically be completely separate encodings has only resulted in bullshit--unicode smilies, unicode that leaves vertical streaks down your screen, unicode that phishes you by being visually identical to something else. It's an understandable desire, like adding CJK to Unicode, and (like adding CJK to Unicode) it's a bad desire and pursuing it has only given us bad things.
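A minimal Python sketch of the phishing point: Latin 'a' (U+0061) and Cyrillic 'а' (U+0430) render almost identically but are different code points, so the strings compare unequal, and no amount of normalization merges them.

import unicodedata

latin = "paypal"
mixed = "p\u0430yp\u0430l"      # Cyrillic а substituted for both Latin a's
print(latin == mixed)            # False
print(unicodedata.normalize("NFC", latin) == unicodedata.normalize("NFC", mixed))  # still False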

Like all the other APL programmers, in a windows IDE.

Allowing only ASCII in usernames and domain names is fine enough, but why not allow unicode in the places where you would allow that markup language?

This is the state of Zig Forums.

Is that post actually equating the two? I don't see it.

If you're going to use unicode at all, at least keep it contained to the contents of actual documents, and preferably documents that must be in or contain non-English text. If you start using it for everything, including OS-level stuff, you're bound to run into problems eventually. One example someone gave on openbsd-misc was copying files from one system to another. If the filenames are plain ASCII, there are no problems, but if they're in unicode, you can't be sure what they'll end up like. Actually I've often seen files inside ZIP files get extracted with filenames full of ?????? characters all over the place. So it's not just a hypothetical, it's actually happening out there right now.
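A minimal Python sketch of how those ?????? filenames happen (the filename here is hypothetical): the bytes of a perfectly valid UTF-8 name get decoded with the wrong codec, or re-encoded into a charset that can't represent them.

name = "日本語.txt"                             # hypothetical UTF-8 filename
print(name.encode("utf-8").decode("cp437"))     # wrong codec: mojibake, but no error raised
print(name.encode("ascii", errors="replace"))   # b'???.txt' - the classic ?????? effect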

I want to see something BESIDES unicode. All I see are a shit ton of ways to encode unicode.

That wouldn't be a problem if there weren't multiple encodings floating around.
Solving that problem by never using unicode isn't actually more feasible than solving that problem by always using UTF-8.

System level shit should just use raw bytes for everything instead of a particular encoding. If someone wants to name their file with a jpeg of a pepe fuck it let them.

Filenames are part of the user interface, and they should take that into account. End users will be confronted with them. Arbitrary bytes is a bad idea, case insensitive unicode with a lot of (control) characters blacklisted might be good.
Anything that requires the ability to use arbitrary bytes in file names is probably a really bad idea in the first place, and will still have to avoid using path separators and (usually) null bytes. Newlines alone make correct shell scripting a lot harder.
Forcing the entire world to use the latin alphabet in filenames might seem like a good idea if you're an edgy imageboard poster but it doesn't fly in real life.

You want to know about real life? Its windows 7, photoshop, ubuntu server, and javascript.

What's your point? I honestly don't understand what you're trying to say.
For what it's worth, I think Windows's filename convention is closest to what I described.

Please go back to the "real world" of macs, windows, iphones, and other shit designed to restrict users.

SYSTEMD encoding.
Decoding requires the file to be piped through SYSTEMD-chard
Binary only files encrypted and compressed at creation time and linked to hardware addresses

OK here's my idea.
Null terminated characters.

I think unicode is good in general.
"In general" includes things that are designed to restrict you. It also includes things that are designed to give you freedom.
Freedom is orthogonal to unicode support.

Each character is a 64-bit number. That number stores the black and white pixels of an 8x8 pixel square. Each character code is equivalent to the 64-bit number that prints a pattern that looks like the character. An optional pretty font layer is added on top for vector rendering/whatever.
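A minimal Python sketch of what this proposal amounts to (the glyph below is a made-up 'A'): pack the 8x8 monochrome bitmap into one 64-bit integer, and unpack it again to render.

rows = [
    0b00111100,
    0b01000010,
    0b01000010,
    0b01111110,
    0b01000010,
    0b01000010,
    0b01000010,
    0b00000000,
]                                   # 8 rows of 8 pixels: a crude 'A'

glyph = 0
for row in rows:
    glyph = (glyph << 8) | row      # the character code is this 64-bit number

for i in range(8):                  # unpack and print the bitmap
    row = (glyph >> (8 * (7 - i))) & 0xFF
    print("".join("#" if row & (1 << (7 - b)) else "." for b in range(8)))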

retard

All you need is TTS+opus.

Well it did work like that IRL for quite a long time, since that's how Unix did things for decades. In fact people even tended to shy away from using whitespace or shell metacharacters in filenames (but the system didn't enforce that).

Unix went from "anything other than '/'" in the very early days to "anything other than '/' or '\0'" (probably once it was rewritten in C).
That's pretty bad for what's supposed to be a human-readable identifier. It's typical for Unix. Keeping it that simple made sense fifty years ago at Bell Labs, but not today with kernels that are already huge and complex.

This is the worst fucking thing. I have had to handle character conversion several times, in several languages, and I could not tell you anything about it. It always turns into a copy-paste job followed by trying things until it seems to work. I understand many things, but the verbiage and everything else around character encoding is bonkers to me.
fucking stop
☃☃☃☃

and then people keep fucking adding shit

literally this would be less aids than unicode. and it would still be highly compressible. running a simple compression algo is much less bad than having a unicode-supporting behemoth

If you allow arbitrary data like Unix you also allow unicode, not just the latin alphabet.

It creates security issues. Again, I don't expect any of you to understand this. When you need two systems to agree on something like the permissions of a path, the question becomes how to get them to agree on the strings being equivalent. There is only one way to go due to there being information loss in the translation, towards fully normalized string comparisons, but you can't fully normalize in a future-proof way since it requires tables that will be changed over time by the committee. One side of that normalization is some day going to be a newer version that understands newer characters and it's going to open a hole.
Linux developers encountering this stupid clusterfuck raged about it and then decided to just say fuck it and do raw, unnormalized memory comparisons of UTF-8 strings since any other option was deemed unworkable. Other operating systems decided to half-ass something and partially handle normalization (which will likely require pinning compatibility to some old version of unicode since I doubt they can ever add new normalized forms without destroying existing filesystems and servers). This causes a lot of pain for projects that have to talk between the two. SMB shares on Linux and Mac OS have a lot of issues because of this as each OS handles this differently and there are infinite bugs in the middle since the future will add them even if they aren't there today.
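A minimal Python sketch (hypothetical filename) of the disagreement being described: on a filesystem that compares raw bytes, the two forms below are two different files; on a filesystem that normalizes names, as HFS+ did, they are the same file, so the two systems can never fully agree.

import unicodedata

nfc = unicodedata.normalize("NFC", "café.txt")    # precomposed 'é'
nfd = unicodedata.normalize("NFD", "café.txt")    # 'e' plus a combining accent
print(nfc == nfd)                                  # False
print(nfc.encode("utf-8") == nfd.encode("utf-8"))  # False: different byte strings
# open(nfc, "w") and open(nfd, "w") would create two directory entries on ext4
# but only one on a normalizing filesystem.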

I don't like UTF-8 but in practice I don't believe it can be improved. The ASCII-compatibility is somewhat of a cancer (for the same reason UTF-16 is shit), but without it most software wouldn't support international text at all. Most developers are too stupid to deal with international text, so they had to be tricked.
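A minimal Python sketch of the ASCII-compatibility trick: ASCII text is already valid UTF-8 byte-for-byte, so ASCII-only software keeps working unchanged, while everything else becomes multi-byte sequences made entirely of bytes >= 0x80.

print("plain ascii".encode("utf-8"))   # b'plain ascii' - identical bytes
print("é".encode("utf-8"))             # b'\xc3\xa9'     - two bytes
print("€".encode("utf-8"))             # b'\xe2\x82\xac' - three bytes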

People didn't allow that, even though Unix did. They followed naming conventions, even to the point of avoiding whitespace and using underscore characters instead.
I think that self-control started to vanish when Win95 long filenames got popular and began to show up everywhere.

NO


Unicode is half-shit. They fucked up with CJKV.
t. Chinese person who wants this BS to end

Is there a multilingual standard that is less fucked?
Big Five + HKSCS and Shift-JIS are good enough (GB is for Niggers)

Agreed, diacritics and right-to-left are cancer and are reserved for kikes, poos and sandniggers.


English-only for system-programming level, multi-lingual for document level

THIS. Reordering everything to TRON would make the world less unbearable: en.wikipedia.org/wiki/TRON_(encoding)

tronweb.super-nova.co.jp/unicoderevisited.html

...

Get a load of this chink, all smug thinking he is more civilized for switching to left-to-right in the last hundred (or so) years.
It's a good thing your main input method is full of them, then.

Attached: 0e56f3fae7f3f8727925f6153f17e9e3.jpg (808x1080, 99.48K)

Right-to-left is Mudslime or (((Jewish))), and we are not going there
What are you, Indian? Vietnamese?

Huh? I was saying that pinyin is full of diacritics.
(你)是中国人吗? ("Are you even Chinese?") Because I'm starting to doubt it.

you draw it and let the compiler parse the .bmp

屌你老母拼乜撚音呀? (Cantonese: roughly, "fuck your mother, what the hell are you spelling in pinyin for?")

>>>/bog/
>>>/auschwitz/
>>>/gulag/
>>>/out/

Then what's the alternative? Keeping a copy of ICU in the kernel? That way at least you can use any language in filenames and the kernel doesn't need to care.

UTF-8 but without emoji, beep and other shit that exists only to cause trouble.

That's not part of what UTF-8 does. UTF-8 is a way to represent unicode code points, the meaning of those code points is defined elsewhere.

Do you know of any chinese imageboards? I feel like the general autism that posting to imageboards commands, combined with the cultural differences, would be rather amusing.

Attached: 1454640392815.gif (200x204, 39.58K)

We need an alternative to Unicode, not UTF-8.

In Cantonese: "The fuck you using diacritics for?"

No, not really (BBS are rare as well)
hkgolden.com/
discuss.com.hk/

turn cjkv into something sensible.
make all the han characters into a series of composed radicals like it should be.
turn hangul blocks into hangul letters and have a combining mark or something better
also a mark that turns a han character into its corresponding emoji

The bytes will balloon in that case. At least 5~7 bytes per character on average.

I was talking about both.
UTF-8 is a way to represent unicode code points, and it's not used for representing anything else. Of course if the Unicode allocation table is revisited, UTF-8 would also decode differently.
And I also insist that all variants of UTF-16 must vanish, too. They are cancer and the worst of both ends (UTF-8 and fixed-width UTF-32): byte-order dependent and variable length => maximum complexity, and for no real gain (only pain).
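A minimal Python sketch of why UTF-16 is the worst of both worlds: it is still variable-width (anything above U+FFFF needs a surrogate pair) and still byte-order dependent (you either pick an -le/-be variant or prepend a BOM).

s = "a\U0001D11E"                        # 'a' plus U+1D11E MUSICAL SYMBOL G CLEF
print(len(s.encode("utf-16-le")))         # 6: two bytes for 'a', four for the surrogate pair
print(s.encode("utf-16-le") == s.encode("utf-16-be"))   # False: byte order matters
print("a".encode("utf-16"))               # b'\xff\xfea\x00' - BOM prepended (little-endian machine shown)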

In hindsight it could be phrased a little bit cleaner:
1: The one and only character-to-number mapping should be the modified Unicode, which is as follows:
1.1: All emoji and all characters that are only used for purposes other than displaying themselves in 2D monochrome glyphs or modifying adjacent characters' appearance while keeping them as 2D monochrome glyphs (that is, combining characters), should be discarded and their code points shall be considered "unassigned" (these numbers are unused).
1.2: All other (valid) characters shall keep their code points as is, so that good software won't need to be altered to stay correct.
2: The one and only variable length encoding method of the one and only character-to-number mapping shall be UTF-8 (modified if possible to strip any excess complexities if they become obsolete after p.1)
3: Trying to press people to add any nonsensical characters to the new Unicode (that is, anything other than commonly used typographic symbols and real world languages) shall be a criminal offense.

The only actionable suggestion here is to drop support for characters you don't like.
Encodings other than UTF-8 are already almost exclusively used for legacy reasons or not exposed.
All other parts of your suggestion are either no-ops or inane.

Don't make it "absolute radical" because we can cut everything down to less than 2500 codepoints and ~3 character average when encoding rebuses. "Absolute radical" would go like

UTF-32.

Python fucked up.

wingdings

oh really?

Did it?
It only uses four bytes per character for strings that contain code points above U+FFFF. Strings that are purely ASCII (or even LATIN-1) take up one byte per character.
>>> import sys
>>> size = lambda s: sys.getsizeof(s * 10_000) // 10_000
>>> size('a')
1
>>> size('ß')
1
>>> size('ij')
2
>>> size('\N{fish}')
4
Unfortunately, even a single character is enough to blow up the size:
>>> sys.getsizeof('a' * 9_999 + '\N{fish}') // 10_000
4
But that seems hard to avoid if you want O(1) indexing, which is very desirable.

Daily reminder that big O does not correlate with reality in many situations. Non-constant-time access that stays in cache is much faster than constant-time access that has to go out to memory. UTF-32 is 4x bigger than UTF-8 in most situations. That's 4 GB vs 1 GB. Your constant-time access will be slower under most access patterns.

The kernel 'solved' the problem by saying that you can use UTF-8 but it will be treated like a binary string instead of unicode. It's a giant fuck you to the committee, but breaking compatibility is the correct engineering decision when given a broken standard. Microsoft even agreed unicode was fucked and uses UTF-16 as a binary string similar to what Linux does, however they do have mystery proprietary normalizations across their software like in SQL Server but maybe those are just Microsoft being Microsoft. Apple is all over the place with older versions of their OS doing the impossible task of normalization in kernel, then they started doing normalization at the application level instead, and I have no idea what the status of that mess is today but I'm glad I don't have to deal with it.
There's nothing worse than a flawed standard.

Attached: apple engineering.png (764x577, 197.24K)

It makes more sense for python to have O(1) indexing as python programmers don't understand algorithms; that better protects them from creating poorly performing code if they handle strings in insane ways.

You're an absolute retard. O(N), when N is 10,000, will be slower than O(1), regardless of the cache. Now kill yourself.

What about O(2*N) when N is 10000?

UTF-4096

this is factually incorrect, and I will also kick you in the nuts for this if you dare to say it in my face.

Most of your strings will just be ASCII or LATIN-1 and take up a single byte per character.
If you do have a 1 GB string then O(1) indexing is likely to be useful.


That's not true, and not how complexity works. For any value of N, an O(N) algorithm and an O(1) algorithm exist such that the O(N) algorithm is faster than the O(1) algorithm for that value of N (in the worst case). However, for any such pair, there also exists a value M such that the O(1) algorithm is faster than the O(N) algorithm for all N > M.


O(2*N) and O(N) are the same thing.


Many Python programmers do understand algorithms (plenty of them don't), but there's a valid point buried in there, which is that Python programmers shouldn't have to worry about the underlying implementation too much. A lot of Python's design is about reducing mental load so programmers can focus on the program instead of the programming language.
There are cases in which memory is very expensive and it's reasonable to expect the programmer to know every implementation detail of the language and functions they're working with so they can micromanage efficiency, but Python doesn't aim to cover those cases. For Python it's more sensible to build up expectations like "indexing is cheap" (O(1) for lists and strings, average O(1) for dicts), so that

value = obj[ind]
if value is None:
    return default
return value

and

if obj[ind] is None:
    return default
return obj[ind]

are about equally performant for non-weird types of obj.

...sez the tripfag. Oh the irony.

You totally misread the post or simply don't understand how it works. That is absolutely not the case. UTF32 and UTF8 DO NOT have the same properties.

...

I think you don't understand how CPython's strings work.
It picks an encoding based on the content (this is ok because they're immutable). If your string's characters fit in LATIN-1, it uses LATIN-1. If they don't all fit in LATIN-1 it uses more bytes per character. Strings always use fixed-width encodings, but not all strings use the same encoding, so the memory use tends to be ok in most cases.
See

Ascii has letters with diacritical marks for all of the other languages that matter. We're not programming in Korean you fucking gooklover.

Look at this idiom:
for i in range(len(s)):
    if s[i] == 'a':
        n += 1
Notice two things: 1. If indexing is O(1), the scan is sequential and essentially free of cache misses. 2. If indexing is O(n), then this runs in O(n^2).
So, for a one-gigabyte string, assuming 1 ns to read and compare a byte, this goes from taking one second to run to taking half a billion seconds.

stop larping anytime

That's not how Big O works, shut the fuck up

If indexing position i of the string takes i operations, then the whole loop takes 0.5*len(s)*(len(s)+1) such operations, which is O(n^2).

5-bit encodings.

Can anyone confirm this?

All english characters come first, then symbols, then nippon's moonrunes. Anything else goes into later parts of the encoding so as to fuck with people I don't like, e.g. one arabic or hebrew character needing several times the storage that one english letter would need.

What about Gook runes and Taiwan runes?

Nobody would write it that badly.
It's actually:
n = sum(1 for l in s if l == 'a')

Horse shit, it's even more sensible to not expect any specific optimization if you don't actually need it. (And if you need it, you go and check the docs)
The typical example — if you just need to go through all items, use a fucking iterator. It will always work in an optimal way unless the author of that goddamn library deliberately hates you.

What's an iterator, user? Is that a bloated way of saying map or fold?

You are both retarded if you haven't realized that ~90% are shared. The only thing that varied was that the gooks simplified some characters differently than the chinks (or that taiwan and places speaking cantonese didn't simplify at all).

We should just get rid of all characters except for the Chinese ones. Everyone will be speaking Chinese soon anyways.

ASCII, fuck foreginers.
This is American property!

But I need muh chinese cartoon games and I don't want xir to localize them.

This particular expectation is actually ingrained in the language. If you define a __getitem__ and a __len__ then an __iter__ is implicit, unless you provide it yourself.
Making indexing expensive would make only certain strings more compact, at the cost of breaking the expectations of people coming from Python 2 (where strings are bytes), or people coming from C, while also bothering people who want to go through two strings character-by-character and don't know how to use zip yet.
It has a huge cost and little gain.
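A minimal Python sketch of that implicit protocol: a class that defines only __getitem__ (plus __len__) is already iterable, because iter() falls back to calling __getitem__ with 0, 1, 2, ... until IndexError is raised.

class Chars:
    def __init__(self, text):
        self._text = text
    def __len__(self):
        return len(self._text)
    def __getitem__(self, i):
        return self._text[i]    # raises IndexError past the end, which stops iteration

print(list(Chars("abc")))       # ['a', 'b', 'c'] even though no __iter__ was defined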

See