If we had another chance at a character encoding standard, what would you propose?

Attached: mesa.jpg (1000x1000, 50.29K)

Other urls found in this thread:

en.wikipedia.org/wiki/TRON_(encoding)
tronweb.super-nova.co.jp/unicoderevisited.html
hkgolden.com/
discuss.com.hk/

English only.

kys

UTF-8 is the only standard we need

An extremely grammatically simple language designed for use on a computer. Each of its 256 letters (including spaces and similar symbols) is encoded as a single byte. Because it is intended for computer use, things like line breaks, EOFs, and even advanced markup could be represented with words that fit very naturally into the language's grammar. Simpler renderers could ignore the special words and it would still be human-readable.

People would learn the language in addition to learning how to use a computer. Because of how global the Internet is, it would make sense for such a universal language to exist for computers.

Unicode + UTF-8 is about as good as it gets. Maybe some more characters should be differentiated - for example, 'I'.lower() != 'i' in a Turkish locale - but my impression is that most problems now are related to compatibility and the inherent difficulty of formalizing human notation.
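To make the Turkish example concrete, here is a minimal Python sketch (CPython 3 assumed); str.lower() applies Unicode's locale-independent default case mapping, which is exactly what goes wrong for Turkish text:

print('I'.lower())   # 'i'  - correct for English, but Turkish expects dotless 'ı'
print('İ'.lower())   # 'i̇' - 'i' followed by U+0307 COMBINING DOT ABOVE
print('ı'.upper())   # 'I'  - the dotted/dotless distinction is lost on the round trip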

Seconding this. We don't need to stick all the fucking memes and emojis into unicode, but UTF-8 is a really good format.

unicode is dickballs

the emojis aren't even the problem with unicode. it has about 3000 problems and emojis amount to about 0.1% of the problem. of course seeing them add a bunch of pointless emojis is still insufferable

UTF-8 is cancer. It's ambiguous and can encode invalid strings. No one handles normalization correctly and only a few people even understand what the problem is. Normalization isn't reversible, so there is information loss. Normalization requires massive tables of bullshit that will have to be kept up to date forever, which is why there are dozens of copies of the massive ICU.DLL on your Windows system, dozens of libicu copies on your smartphone, and even a couple of redundant copies on a Linux distro. How did they fuck up something so badly that should have been so simple?
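A minimal Python sketch of the normalization problem being described, using unicodedata from the standard library: the same visible text can be stored precomposed or as a base letter plus a combining mark, the two forms are different byte sequences, and normalizing them together throws away which form you started with.

import unicodedata

composed = "\u00e9"        # 'é' as one code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
print(composed == decomposed)                                 # False
print(composed.encode("utf-8"), decomposed.encode("utf-8"))   # b'\xc3\xa9' b'e\xcc\x81'
print(unicodedata.normalize("NFC", decomposed) == composed)   # True, but the original
# composed-vs-decomposed distinction is gone once you normalize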

ASCII did literally nothing wrong and oughtta be enuff for anybody

I'm going to assume you mean character code and not encoding. Unicode had the right idea, but they fucked up by adding everything and the kitchen sink (there is probably a kitchen sink emoji). Do what Unicode did, but limit it to only semantic characters, which excludes emoji and similar shit.


You are probably just shitposting, but there are people out there who really seem to be under the impression that English is the only language that matters.

Like everyone else, pretty much UTF-8.

So can pretty much every other encoding.
Normalization is necessary for any encoding that wishes to span a large character set. It's an unavoidable complexity.
And this is a problem why? If you really need it unnormalized, just keep it that way until you do a compare operation or whatever.
They aren't on my system. I only have two copies, a 32-bit and a 64-bit version.

There should be nothing more than extended ASCII. If you do not write in a latin alphabet you're shit.
Also just make your own encoding for non-latin alphabets. It doesn't have to be universal.

How do you write APL then faggot.

this, ascii is more than enough. let the niche moonrune languages deal with it themselves if it matters so much.

Plain ASCII is good. It even works on the smallest 8-bit computers.

What if I want to use mathematical notation, or mix my Plato with my English, in the same document or even the same sentence?
Do you really think it's less messy to have a trillion encodings?

You mean unicode?

I think it's less messy to use a markup language (or Emacs, or Word) for special cases like that. Having a small, easy-to-understand *normal*, *canonical*, *default* encoding is important if you want to accept even usernames without going completely mad. Unicode's allowance of mixing what should logically be completely separate encodings has only resulted in bullshit--unicode smilies, unicode that leaves vertical streaks down your screen, unicode that phishes you by being visually identical to something else. It's an understandable desire, like adding CJK to Unicode, and (like adding CJK to Unicode) it's a bad desire and pursuing it has only given us bad things.
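A minimal Python sketch of the phishing point: Latin 'a' (U+0061) and Cyrillic 'а' (U+0430) render almost identically but are different code points, so the strings compare unequal, and no amount of normalization merges them.

import unicodedata

latin = "paypal"
mixed = "p\u0430yp\u0430l"      # Cyrillic а substituted for both Latin a's
print(latin == mixed)            # False
print(unicodedata.normalize("NFC", latin) == unicodedata.normalize("NFC", mixed))  # still False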

Like all the other APL programmers, in a windows IDE.

Allowing only ASCII in usernames and domain names is fine enough, but why not allow unicode in the places where you would allow that markup language?

This is the state of Zig Forums.

Is that post actually equating the two? I don't see it.

If you're going to use unicode at all, at least keep it contained to the contents of actual documents, and preferably documents that must be in or contain non-English text. If you start using it for everything, including OS-level stuff, you're bound to run into problems eventually. One example someone gave on openbsd-misc was copying files from one system to another. If the filenames are plain ASCII, there are no problems, but if they're in unicode, you can't be sure what they'll end up like. Actually I've often seen files inside ZIP files get extracted with filenames full of ?????? characters all over the place. So it's not just a hypothetical, it's actually happening out there right now.
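A minimal Python sketch of how those ?????? filenames happen (the filename here is hypothetical): the bytes of a perfectly valid UTF-8 name get decoded with the wrong codec, or re-encoded into a charset that can't represent them.

name = "日本語.txt"                             # hypothetical UTF-8 filename
print(name.encode("utf-8").decode("cp437"))     # wrong codec: mojibake, but no error raised
print(name.encode("ascii", errors="replace"))   # b'???.txt' - the classic ?????? effect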

I want to see something BESIDES unicode. All I see are a shit ton of ways to encode unicode.

That wouldn't be a problem if there weren't multiple encodings floating around.
Solving that problem by never using unicode isn't actually more feasible than solving that problem by always using UTF-8.

System level shit should just use raw bytes for everything instead of a particular encoding. If someone wants to name their file with a jpeg of a pepe fuck it let them.

Filenames are part of the user interface, and they should take that into account. End users will be confronted with them. Arbitrary bytes is a bad idea, case insensitive unicode with a lot of (control) characters blacklisted might be good.
Anything that requires the ability to use arbitrary bytes in file names is probably a really bad idea in the first place, and will still have to avoid using path separators and (usually) null bytes. Newlines alone make correct shell scripting a lot harder.
Forcing the entire world to use the latin alphabet in filenames might seem like a good idea if you're an edgy imageboard poster but it doesn't fly in real life.

You want to know about real life? Its windows 7, photoshop, ubuntu server, and javascript.

What's your point? I honestly don't understand what you're trying to say.
For what it's worth, I think Windows's filename convention is closest to what I described.

Please go back to the "real world" of macs, windows, iphones, and other shit designed to restrict users.

SYSTEMD encoding.
Decoding requires the file to be piped through SYSTEMD-chard
Binary only files encrypted and compressed at creation time and linked to hardware addresses

OK here's my idea.
Null terminated characters.

I think unicode is good in general.
"In general" includes things that are designed to restrict you. It also includes things that are designed to give you freedom.
Freedom is orthogonal to unicode support.

Each character is a 64-bit number. That number stores the black and white pixels of an 8x8 pixel square. Each character code is equivalent to the 64-bit number that prints a pattern that looks like the character. An optional pretty font layer is added on top for vector rendering/whatever.
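A minimal Python sketch of what this proposal amounts to (the glyph below is a made-up 'A'): pack the 8x8 monochrome bitmap into one 64-bit integer, and unpack it again to render.

rows = [
    0b00111100,
    0b01000010,
    0b01000010,
    0b01111110,
    0b01000010,
    0b01000010,
    0b01000010,
    0b00000000,
]                                   # 8 rows of 8 pixels: a crude 'A'

glyph = 0
for row in rows:
    glyph = (glyph << 8) | row      # the character code is this 64-bit number

for i in range(8):                  # unpack and print the bitmap
    row = (glyph >> (8 * (7 - i))) & 0xFF
    print("".join("#" if row & (1 << (7 - b)) else "." for b in range(8)))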

retard

All you need is TTS+opus.

Well it did work like that IRL for quite a long time, since that's how Unix did things for decades. In fact people even tended to shy away from using whitespace or shell metacharacters in filenames (but the system didn't enforce that).

Unix went from "anything other than '/'" in the very early days to "anything other than '/' or '\0'" (probably once it was rewritten in C).
That's pretty bad for what's supposed to be a human-readable identifier. It's typical for Unix. Keeping it that simple made sense fifty years ago at Bell Labs, but not today with kernels that are already huge and complex.

This is the worst fucking thing. I have had to handle character conversion several times, in several languages, and I could not tell you anything about it. It always turns into a copy-paste job followed by trying things until it seems to work. I understand many things, but the verbiage and everything else around character encoding is bonkers to me.
fucking stop
☃☃☃☃

and then people keep fucking adding shit

literally this would be less aids than unicode. and it would still be highly compressible. running a simple compression algo is much less bad than having a unicode-supporting behemoth

If you allow arbitrary data like Unix you also allow unicode, not just the latin alphabet.

It creates security issues. Again, I don't expect any of you to understand this. When you need two systems to agree on something like the permissions of a path, the question becomes how to get them to agree on the strings being equivalent. There is only one way to go due to there being information loss in the translation, towards fully normalized string comparisons, but you can't fully normalize in a future-proof way since it requires tables that will be changed over time by the committee. One side of that normalization is some day going to be a newer version that understands newer characters and it's going to open a hole.
Linux developers encountering this stupid clusterfuck raged about it and then decided to just say fuck it and do raw, unnormalized memory comparisons of UTF-8 strings since any other option was deemed unworkable. Other operating systems decided to half-ass something and partially handle normalization (which will likely require pinning compatibility to some old version of unicode since I doubt they can ever add new normalized forms without destroying existing filesystems and servers). This causes a lot of pain for projects that have to talk between the two. SMB shares on Linux and Mac OS have a lot of issues because of this as each OS handles this differently and there are infinite bugs in the middle since the future will add them even if they aren't there today.
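A minimal Python sketch (hypothetical filename) of the disagreement being described: on a filesystem that compares raw bytes, the two forms below are two different files; on a filesystem that normalizes names, as HFS+ did, they are the same file, so the two systems can never fully agree.

import unicodedata

nfc = unicodedata.normalize("NFC", "café.txt")    # precomposed 'é'
nfd = unicodedata.normalize("NFD", "café.txt")    # 'e' plus a combining accent
print(nfc == nfd)                                  # False
print(nfc.encode("utf-8") == nfd.encode("utf-8"))  # False: different byte strings
# open(nfc, "w") and open(nfd, "w") would create two directory entries on ext4
# but only one on a normalizing filesystem.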

I don't like UTF-8 but in practice I don't believe it can be improved. The ASCII-compatibility is somewhat of a cancer (for the same reason UTF-16 is shit), but without it most software wouldn't support international text at all. Most developers are too stupid to deal with international text, so they had to be tricked.
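A minimal Python sketch of the ASCII-compatibility trick: ASCII text is already valid UTF-8 byte-for-byte, so ASCII-only software keeps working unchanged, while everything else becomes multi-byte sequences made entirely of bytes >= 0x80.

print("plain ascii".encode("utf-8"))   # b'plain ascii' - identical bytes
print("é".encode("utf-8"))             # b'\xc3\xa9'     - two bytes
print("€".encode("utf-8"))             # b'\xe2\x82\xac' - three bytes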

People didn't allow that, even though Unix did. They followed naming conventions, even to the point of avoiding whitespace and using underscore characters instead.
I think that self-control started to vanish when Win95 long filenames got popular and began to show up everywhere.

NO


Unicode is half-shit. They fucked up with CJKV.
t. Chinese person who wants this BS to end

Is there a multilingual standard that is less fucked?
Big Five + HKSCS and Shift-JIS are good enough (GB is for Niggers)

Agreed, diacritics and right-to-left are cancer and are reserved for kikes, poos and sandniggers.


English-only for system-programming level, multi-lingual for document level

THIS. Reordering everything to TRON would make the world less unbearable: en.wikipedia.org/wiki/TRON_(encoding)

tronweb.super-nova.co.jp/unicoderevisited.html

...

Get a load of this chink, all smug thinking he is more civilized for switching to left-to-right in the last hundred (or so) years.
It's a good thing your main input method is full of them, then.

Attached: 0e56f3fae7f3f8727925f6153f17e9e3.jpg (808x1080, 99.48K)

Right-to-left is Mudslime or (((Jewish))), and we are not going there
What are you, Indian? Vietnamese?

Huh? I was saying that pinyin is full of diacritics.
(你)是中国人吗? ("Are you even Chinese?") Because I'm starting to doubt it.

you draw it and let the compiler parse the .bmp

屌你老母拼乜撚音呀? (Cantonese: roughly, "fuck your mother, what the hell are you spelling in pinyin for?")

>>>/bog/
>>>/auschwitz/
>>>/gulag/
>>>/out/

Then what's the alternative? Keeping a copy of ICU in the kernel? That way at least you can use any language in filenames and the kernel doesn't need to care.

UTF-8 but without emoji, beep and other shit that exists only to cause trouble.

That's not part of what UTF-8 does. UTF-8 is a way to represent unicode code points, the meaning of those code points is defined elsewhere.

Do you know of any chinese imageboards? I feel like the general autism that posting to imageboards commands, combined with the cultural differences, would be rather amusing.

Attached: 1454640392815.gif (200x204, 39.58K)

We need an alternative to Unicode, not UTF-8.

In Cantonese: "The fuck you using diacritics for?"

No, not really (BBS are rare as well)
hkgolden.com/
discuss.com.hk/

turn cjkv into something sensible.
make all the han characters into a series of composed radicals like it should be.
turn hangul blocks into hangul letters and have a combining mark or something better
also a mark that turns a han character into its corresponding emoji

The bytes will balloon in that case. At least 5~7 bytes per character on average.

I was talking about both.
UTF-8 is a way to represent unicode code points, and it's not used for representing anything else. Of course if the Unicode allocation table is revisited, UTF-8 would also decode differently.
And I also insist that all variants of UTF-16 must vanish, too. They are cancer and the worst of both ends (UTF-8 and fixed-width UTF-32): byte-order dependent and variable length => maximum complexity, and for no real gain (only pain).
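A minimal Python sketch of why UTF-16 is the worst of both worlds: it is still variable-width (anything above U+FFFF needs a surrogate pair) and still byte-order dependent (you either pick an -le/-be variant or prepend a BOM).

s = "a\U0001D11E"                        # 'a' plus U+1D11E MUSICAL SYMBOL G CLEF
print(len(s.encode("utf-16-le")))         # 6: two bytes for 'a', four for the surrogate pair
print(s.encode("utf-16-le") == s.encode("utf-16-be"))   # False: byte order matters
print("a".encode("utf-16"))               # b'\xff\xfea\x00' - BOM prepended (little-endian machine shown)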

In hindsight it could be phrased a little bit cleaner:
1: The one and only character-to-number mapping should be the modified Unicode, which is as follows:
1.1: All emoji and all characters that are only used for purposes other than displaying themselves in 2D monochrome glyphs or modifying adjacent characters' appearance while keeping them as 2D monochrome glyphs (that is, combining characters), should be discarded and their code points shall be considered "unassigned" (these numbers are unused).
1.2: All other (valid) characters shall keep their code points as is, so that good software won't need to be altered to stay correct.
2: The one and only variable length encoding method of the one and only character-to-number mapping shall be UTF-8 (modified if possible to strip any excess complexities if they become obsolete after p.1)
3: Trying to press people to add any nonsensical characters to the new Unicode (that is, anything other than commonly used typographic symbols and real world languages) shall be a criminal offense.

The only actionable suggestion here is to drop support for characters you don't like.
Encodings other than UTF-8 are already almost exclusively used for legacy reasons or not exposed.
All other parts of your suggestion are either no-ops or inane.

Don't make it "absolute radical" because we can cut everything down to less than 2500 codepoints and ~3 character average when encoding rebuses. "Absolute radical" would go like

UTF-32.

Python fucked up.

wingdings

oh really?

Did it?
It only uses four bytes per character for strings that contain code points above U+FFFF. Strings that are purely ASCII (or even LATIN-1) take up one byte per character.
>>> import sys
>>> size = lambda s: sys.getsizeof(s * 10_000) // 10_000
>>> size('a')
1
>>> size('ß')
1
>>> size('ij')
2
>>> size('\N{fish}')
4
Unfortunately, even a single character is enough to blow up the size:
>>> sys.getsizeof('a' * 9_999 + '\N{fish}') // 10_000
4
But that seems hard to avoid if you want O(1) indexing, which is very desirable.

Daily reminder that big O does not correlate with reality in many situations. Non-constant-time access that stays in cache is much faster than constant-time access that has to go out to memory. UTF-32 is 4x bigger than UTF-8 in most situations. That's 4 GB vs 1 GB. Your constant-time access will be slower under most access patterns.

The kernel 'solved' the problem by saying that you can use UTF-8 but it will be treated like a binary string instead of unicode. It's a giant fuck you to the committee, but breaking compatibility is the correct engineering decision when given a broken standard. Microsoft even agreed unicode was fucked and uses UTF-16 as a binary string similar to what Linux does, however they do have mystery proprietary normalizations across their software like in SQL Server but maybe those are just Microsoft being Microsoft. Apple is all over the place with older versions of their OS doing the impossible task of normalization in kernel, then they started doing normalization at the application level instead, and I have no idea what the status of that mess is today but I'm glad I don't have to deal with it.
There's nothing worse than a flawed standard.

Attached: apple engineering.png (764x577, 197.24K)

It makes more sense for python to have O(1) indexing as python programmers don't understand algorithms; that better protects them from creating poorly performing code if they handle strings in insane ways.

You're an absolute retard. O(N), when N is 10,000, will be slower than O(1), regardless of the cache. Now kill yourself.

What about O(2*N) when N is 10000?

UTF-4096

this is factually incorrect, and I will also kick you in the nuts for this if you dare to say it in my face.

Most of your strings will just be ASCII or LATIN-1 and take up a single byte per character.
If you do have a 1 GB string then O(1) indexing is likely to be useful.


That's not true, and not how complexity works. For any value of N, an O(N) algorithm and an O(1) algorithm exist such that the O(N) algorithm is faster than the O(1) algorithm for that value of N (in the worst case). However, for any such pair, there also exists a value M such that the O(1) algorithm is faster than the O(N) algorithm for all N > M.


O(2*N) and O(N) are the same thing.


Many Python programmers do understand algorithms (plenty of them don't), but there's a valid point buried in there, which is that Python programmers shouldn't have to worry about the underlying implementation too much. A lot of Python's design is about reducing mental load so programmers can focus on the program instead of the programming language.
There are cases in which memory is very expensive and it's reasonable to expect the programmer to know every implementation detail of the language and functions they're working with so they can micromanage efficiency, but Python doesn't aim to cover those cases. For Python it's more sensible to build up expectations like "indexing is cheap" (O(1) for lists and strings, average O(1) for dicts), so that

value = obj[ind]
if value is None:
    return default
return value

and

if obj[ind] is None:
    return default
return obj[ind]

are about equally performant for non-weird types of obj.

...sez the tripfag. Oh the irony.

You totally misread the post or simply don't understand how it works. That is absolutely not the case. UTF32 and UTF8 DO NOT have the same properties.

...

I think you don't understand how CPython's strings work.
It picks an encoding based on the content (this is ok because they're immutable). If your string's characters fit in LATIN-1, it uses LATIN-1. If they don't all fit in LATIN-1 it uses more bytes per character. Strings always use fixed-width encodings, but not all strings use the same encoding, so the memory use tends to be ok in most cases.
See

Ascii has letters with diacritical marks for all of the other languages that matter. We're not programming in Korean you fucking gooklover.

Look at this idiom:
for i in range(len(s)):
    if s[i] == 'a':
        n += 1
Notice two things: 1. If indexing is O(1), the scan is sequential and essentially free of cache misses. 2. If indexing is O(n), then this runs in O(n^2).
So, for a one-gigabyte string, assuming 1 ns to read and compare a byte, this goes from taking one second to run to taking half a billion seconds.

stop larping anytime

That's not how Big O works, shut the fuck up

If indexing position i of the string takes i operations, then the whole loop takes 0.5*len(s)*(len(s)+1) such operations, which is O(n^2).

5-bit encodings.

Can anyone confirm this?

All english characters come first, then symbols, then nippon's moonrunes. Anything else goes into later parts of the encoding so as to fuck with people I don't like, e.g. one arabic or hebrew character needing several times the storage that one english letter would need.

What about Gook runes and Taiwan runes?

Nobody would write it that badly.
It's actually:
n = sum(1 for l in s if l == 'a')

Horse shit, it's even more sensible to not expect any specific optimization if you don't actually need it. (And if you need it, you go and check the docs)
The typical example — if you just need to go through all items, use a fucking iterator. It will always work in an optimal way unless the author of that goddamn library deliberately hates you.

What's an iterator, user? Is that a bloated way of saying map or fold?

You are both retarded if you haven't realized that ~90% are shared. The only thing that varied was that the gooks simplified some characters differently than the chinks (or that taiwan and places speaking cantonese didn't simplify at all).

We should just get rid of all characters except for the Chinese ones. Everyone will be speaking Chinese soon anyways.

ASCII, fuck foreginers.
This is American property!

But I need muh chinese cartoon games and I don't want xir to localize them.

This particular expectation is actually ingrained in the language. If you define a __getitem__ and a __len__ then an __iter__ is implicit, unless you provide it yourself.
Making indexing expensive would make only certain strings more compact, at the cost of breaking the expectations of people coming from Python 2 (where strings are bytes), or people coming from C, while also bothering people who want to go through two strings character-by-character and don't know how to use zip yet.
It has a huge cost and little gain.
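A minimal Python sketch of that implicit protocol: a class that defines only __getitem__ (plus __len__) is already iterable, because iter() falls back to calling __getitem__ with 0, 1, 2, ... until IndexError is raised.

class Chars:
    def __init__(self, text):
        self._text = text
    def __len__(self):
        return len(self._text)
    def __getitem__(self, i):
        return self._text[i]    # raises IndexError past the end, which stops iteration

print(list(Chars("abc")))       # ['a', 'b', 'c'] even though no __iter__ was defined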

See