Language Technology

Face-to-face interactions (and, to an extent, audio/video) unify the "language interface". It is in writing, with its different scripts, that languages can start to differ radically in form! Historically, the form of each script was shaped by the tools used to write its glyphs. For example, in East Asia, we see brush-based scripts. Cuneiform is based on wedges pressed into clay. Europe saw the use of quill/feather pens. With computers in the 21st century, all this diversity is smashed together into the same glowing screen interface...

The focus of this page is on the language technology of writing in the context of Information Technology (IT), including personal computers and smartphones.

Crafting the Multi-lingual Web

Whatever technique you are using to produce web content (e.g. manually writing HTML as I am here, or using a framework like React), you will want to make sure you have control over the lang HTML attribute. Learn more about this from the MDN web docs. For example, if you need to use a paragraph of Korean in a webpage, you ought to set lang="ko" in your paragraph tag.

Setting the lang attribute is better than ad-hoc solutions to make a webpage look the right way because you are specifying the semantics of some content. It is not enough to just have some words; we must tell our clients' browsers the language we're writing in! This makes it more likely that tools like screen readers and automatic translation services succeed, and that fonts render correctly (still often a problem, unfortunately).
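To make the point about semantics concrete, here is a rough sketch of how a tool might pick up the lang attribute from a page. It uses Python's standard html.parser module; the LangCollector class and the SAMPLE markup are made up for illustration, and this is not how any particular browser or screen reader is implemented.

```python
from html.parser import HTMLParser

# Hypothetical sample page: English by default, with one Korean paragraph.
SAMPLE = (
    '<html lang="en"><body>'
    "<p>Hello.</p>"
    '<p lang="ko">타자하기는 정말 재미 있어요</p>'
    "</body></html>"
)


class LangCollector(HTMLParser):
    """Collects (language, text) pairs, letting children inherit lang from parents."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one (tag, lang) entry per currently open element
        self.chunks = []  # (lang, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        inherited = self.stack[-1][1] if self.stack else None
        # An explicit lang attribute overrides whatever the parent declared.
        lang = dict(attrs).get("lang", inherited)
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Close the most recent matching element; ignore stray end tags.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            lang = self.stack[-1][1] if self.stack else None
            self.chunks.append((lang, text))


collector = LangCollector()
collector.feed(SAMPLE)
for lang, text in collector.chunks:
    print(lang, text)
```

Without the lang="ko" on the second paragraph, the Korean text would be reported as "en", which is exactly the kind of wrong guess that trips up downstream tools.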

Make text big enough to read!

Some scripts need bigger characters than others to be readable. My eyesight isn't terrible, but I still zoom in to read East Asian texts so I don't have to needlessly suffer. Using the lang attribute, as described above, you can make certain languages display with larger text, e.g. by targeting them with the CSS :lang() selector. Samples below:

沒想到,打字比手寫方便多了。

125%のサイズで読みやすいですよ。

타자하기는 정말 재미 있어요

No es necesario magnificar texto español.

Considering script-particular attributes can differentiate your hand-crafted localization work from careless automated localization. Additional areas where language-specific issues may come up include languages read right to left (e.g. Hebrew, Arabic), the presence of rare characters (writing a page on Egyptian hieroglyphics?), and whether or not suitable font faces are available to support your site design in all of your target languages.

Magnifying Glass Over Hebrew Text
Users shouldn't be expected to zoom in.

Try out your sites on multiple browsers

...and if there is any fancy feature that may break your design/UI on any common browser, reconsider if it is worth using at all. It may be better to have your site not remembered at all than remembered for breaking.

CRT Monitor
Support many browsers and machines.

Do prepare a "mobile-friendly" site

You never know who will want to read your webpages while poopin'. Making a mobile-friendly site is also a good way to get thinking about how you want to modularize the content of your site. This will make refactoring sections easier in the future if you decide to do that.


Text Input

How can a keyboard with a hundred or so keys be used to type thousands of different Chinese characters? How do I type accent marks correctly so I can pass Spanish class? These questions and more will be answered on this page.

Keyboard Basics

If you (want to) write a lot of another language, you should learn a dedicated keyboard layout for multi-lingual input. For typing Spanish/French/German/etc. I use the US International keyboard. Using a language like Russian or Arabic will require learning a new keyboard layout. You should learn the standard layouts for these languages rather than trying to learn some "phonetic" QWERTY-like variant, given the choice.

Chinese, Japanese, and Korean all have complicated scripts that must incorporate an additional software level to read in keypresses and then compose (in the case of Korean) or select (in the case of Chinese and Japanese) glyphs. I wrote a descriptive article on Korean Input Methods covering some specifics of typing Korean, which uses its own non-Latin alphabet.
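The "compose" step for Korean can be sketched with the arithmetic Unicode uses to lay out precomposed Hangul syllables. This is only a minimal illustration, assuming one lead consonant, one vowel, and an optional tail consonant per syllable; the function name is hypothetical, and a real input method also has to track keystroke state (for example, deciding whether a consonant ends the current syllable or starts the next one), which is not modeled here.

```python
# Jamo in Unicode's standard ordering: 19 leads, 21 vowels, 27 tails plus "no tail".
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [
    "", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
    "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
    "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ",
]

SYLLABLE_BASE = 0xAC00  # code point of the first precomposed syllable


def compose_syllable(lead: str, vowel: str, tail: str = "") -> str:
    """Combine individual jamo into a single precomposed Hangul syllable."""
    lead_index = LEADS.index(lead)
    vowel_index = VOWELS.index(vowel)
    tail_index = TAILS.index(tail)
    # Syllables are laid out lead-major, then vowel, then tail.
    offset = (lead_index * len(VOWELS) + vowel_index) * len(TAILS) + tail_index
    return chr(SYLLABLE_BASE + offset)


word = compose_syllable("ㅎ", "ㅏ", "ㄴ") + compose_syllable("ㄱ", "ㅡ", "ㄹ")
print(word)  # prints 한글
```

Chinese and Japanese input works differently: the software offers candidate glyphs to select from rather than building one from parts, so no such closed-form arithmetic applies.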

If you don't have to type that many special characters, making use of a compose key can be a non-intrusive way to get multilingual text powers. On Linux, I map this key to the right Windows key on my (full-sized) keyboard.

Platform Specific Notes

The more different scripts you make use of, the more complicated your setup will necessarily become. If it is not too much trouble, it is often easier to isolate the contexts in which you use particular languages (e.g. set up Korean input on one account/computer and do Arabic only from your phone). The world can handle bilingualism, but jumping between too many languages will likely have you working against your technology.

Windows

Windows has pretty good default multilingual support. Yet, in many ways it seems to be a mess, reflecting different histories of keyboard support with different languages.

Four Different Default Input Method Switching Keys

These are the default keys on MS Windows. For some languages you can modify them; for others you can't.

  • Chinese: Shift key toggles Pinyin/Chinese and underlying QWERTY
  • Japanese: without a Japanese keyboard, Alt + backtick toggles Japanese input and underlying QWERTY; there are ways to toggle Katakana/Hiragana but I don't think people do this much, since they just rely on predictive text
  • Korean: the right Alt key toggles between Korean and QWERTY input. The right Ctrl key converts Korean hangul to hanja (Chinese characters); most people rarely use this. Because the right-hand modifiers are taken up by language functions, you have to use the left-hand Ctrl and Alt for things like programming
  • Russian (and other European languages): to switch keyboard layouts, you will have to press Windows + Space; this is also how you switch between Russian and other languages (such as those listed above)

Linux

Switching between different "underlying" keyboard layouts (e.g. English, Russian) can be done using software included with most distributions. To enter complex input like Chinese/Japanese/Korean, you'll probably need to make use of fcitx, ibus or some other software package. I have not had success getting these packages to also work with changing an underlying keyboard layout; everything gets complicated and my computer crashes. So, now I only do European languages on Linux. Windows is my borg Asian language machine 🤖

Emacs

Emacs, which can be installed on all major operating systems, provides multilingual input out-of-the-box, provided you have the correct fonts installed. This can be useful for working with other languages without changing too many configs. Emacs also contains input methods for doing things like typing Hànyǔ Pīnyīn (the standard romanization of Chinese), which is not easily done with an OS's default multilingual support. Google's Noto fonts are freely available and work well with pretty much any language/script.

The *-prefix input methods are similar to using the US international keyboard (with "dead keys"). Using the spanish-prefix method, for instance, I can type pi~nata and it will give me piñata.
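The idea behind a prefix method can be sketched in a few lines: a punctuation key typed before a letter is looked up together with that letter and replaced by the accented character. The rule table below is a small illustrative subset chosen for this example, not a copy of the table Emacs actually ships, and the function name is hypothetical.

```python
# Illustrative subset of prefix rules; the real Emacs tables are larger.
PREFIX_RULES = {
    "~n": "ñ",
    "~N": "Ñ",
    "'a": "á",
    "'e": "é",
    "'i": "í",
    "'o": "ó",
    "'u": "ú",
    '"u': "ü",
}


def apply_prefix_rules(keys: str) -> str:
    """Replace each two-key prefix sequence with its accented character."""
    output = []
    i = 0
    while i < len(keys):
        pair = keys[i:i + 2]
        if pair in PREFIX_RULES:
            output.append(PREFIX_RULES[pair])
            i += 2  # consume both the prefix key and the letter
        else:
            output.append(keys[i])
            i += 1
    return "".join(output)


print(apply_prefix_rules("pi~nata"))  # prints piñata
```

Unlike a real input method, this sketch processes a finished string rather than live keystrokes, and it offers no way to type a literal prefix sequence such as ~n when you actually want one.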

Combined with a tool like pandoc, this is a great way to smash out assignments in a foreign language with very little config, especially if you're already using/familiar with Emacs for other things.