Characters

Quick: which character is '\u03b5'?

Let’s see, well, since it’s between 0370 and 03ff it’s clearly Coptic or Greek. The Greek lower-case characters start at 03b1 so it would be the lower case form of the fifth Greek character. That would be, let’s see, ε! Man, that Unicode stuff is just so logical!
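If you'd rather not do that hex arithmetic in your head, it is easy to check mechanically. Here is a minimal sketch in Python (used purely for illustration, it has nothing to do with neptune) that computes the offset from alpha and asks the Unicode database for the official name:

```python
import unicodedata

ch = "\u03b5"

# The Greek lower-case letters start at U+03B1 (alpha).
position = ord(ch) - 0x03B1 + 1   # 1-based position in the alphabet
name = unicodedata.name(ch)       # official Unicode character name

print(position)  # 5
print(name)      # GREEK SMALL LETTER EPSILON
```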

I don’t think there’s anything wrong with Unicode. But as soon as people have to use the character codes directly, for instance when using Unicode character constants in languages like Java and C#, there’s trouble. There are tens of thousands of Unicode code points and I can’t even remember which ASCII character is line feed and which one is carriage return. On the rare occasions where I either read or write code that uses Unicode constants or ASCII control characters, I usually have to open a browser and look up the values myself. That sucks.

The fortress language improves on things: there, instead of writing the code of a character in an identifier, you can write the name. It just seems so obvious really: the Unicode standard defines a name for all the characters so why should I have the trouble of looking up the character code?

In neptune you’re still welcome to use character codes to specify Unicode characters. Writing '\u03b5' means greek small letter epsilon just as it does in Java and C#. But inspired by fortress we’ve added another syntax for specifying Unicode characters by name: writing \<name> specifies the Unicode character called name. So instead of writing '\u03b5' you can simply name the character: '\<greek small letter epsilon>'. If you feel that 'x' is too straightforward you can specify it equivalently as '\<latin small letter x>'. This works both in character literals and text strings:

"From \<greek capital letter alpha> to \<greek capital letter omega>"

which means the same as, but is a lot easier to understand than

"From \u0391 to \u03A9"

Besides Unicode names, we also allow the ASCII control characters to be specified by their short and long names. This means that I don’t have to remember that line feed is 10, not 13; instead I can just write '\<lf>' or '\<line feed>'.
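To make the idea concrete, here is a rough sketch of how \<name> escapes could be expanded. It is written in Python, not neptune, and it is not how neptune actually implements this. The names `CONTROL_NAMES`, `resolve` and `expand` are made up for the example, and the control-character table is a small hand-written sample whose spellings are my own assumption:

```python
import re
import unicodedata

# Hypothetical table of ASCII control-character names (short and long forms).
# Only a handful of entries are listed; the spellings are assumptions.
CONTROL_NAMES = {
    "nul": "\x00", "null": "\x00",
    "ht": "\x09", "horizontal tab": "\x09",
    "lf": "\x0a", "line feed": "\x0a",
    "cr": "\x0d", "carriage return": "\x0d",
    "esc": "\x1b", "escape": "\x1b",
}

# A backslash followed by a name in angle brackets.
NAMED_ESCAPE = re.compile(r"\\<([^<>]+)>")

def resolve(name):
    # Normalise case and whitespace, then try the control names first.
    key = " ".join(name.lower().split())
    if key in CONTROL_NAMES:
        return CONTROL_NAMES[key]
    # Otherwise ask the Unicode database; raises KeyError for unknown names.
    return unicodedata.lookup(key.upper())

def expand(text):
    # Replace every named escape in text with the character it names.
    return NAMED_ESCAPE.sub(lambda m: resolve(m.group(1)), text)

print(expand(r"From \<greek capital letter alpha> to \<greek capital letter omega>"))
# From Α to Ω
```

An unknown name raises a KeyError here; a real compiler would presumably report that as a syntax error instead.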

The first time you want to write the character ε you probably still have to look it up to see that it’s called greek small letter epsilon. But there’s a lot more logic to the name than to the raw character code and there’s a better chance you’ll remember next time. And it will of course be obvious to anyone reading the code which character it is. The only problem I see is that the names tend to be very long. Fortress allows you to use shorter forms of some characters: you can for instance leave out words like “letter” in character names. If the length of the names turns out to be a problem we might add something like that at some point.
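I haven't worked out what the rules for short forms would be, but one naive possibility is easy to sketch: if a name isn't found as written, try again with the word "letter" inserted at each position. This is Python again, the function name `lookup_short` is made up, and the rule is just my guess at something workable. It is not a description of what fortress actually does:

```python
import unicodedata

def lookup_short(name):
    # Try the name as given, then with LETTER inserted at every position.
    words = name.upper().split()
    candidates = [words] + [
        words[:i] + ["LETTER"] + words[i:] for i in range(len(words) + 1)
    ]
    for candidate in candidates:
        try:
            return unicodedata.lookup(" ".join(candidate))
        except KeyError:
            continue  # not a known character name, try the next variant
    raise KeyError(name)

print(lookup_short("greek small epsilon"))  # ε
```

A scheme like this returns the first variant that happens to exist, so a real design would have to decide what to do when a short form is ambiguous.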

Either way, I think this is a pretty nifty feature. And I wouldn’t be surprised if it turns out that there are other languages that have similar mechanisms.
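Python, for one, does have something along these lines: inside a string literal, \N{name} stands for the Unicode character with that name. A quick check that the named and the numeric forms give the same string:

```python
by_name = "From \N{GREEK CAPITAL LETTER ALPHA} to \N{GREEK CAPITAL LETTER OMEGA}"
by_code = "From \u0391 to \u03A9"

print(by_name == by_code)  # True
print(by_name)             # From Α to Ω
```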
