Hacker News — vinext + Cloudflare Workers

▲Digital Printing of Arabic: explaining the problem (2017) (digitalorientalist.com)

73 points by a_t48 4 days ago | 60 comments

abdullahkhalids 17 hours ago [-]

This problem is not limited to Arabic. Variants of the arabic alphabet are used by Persian (including Iranian and Dari dialects), Mazanderani, Qashqai, Luri, Gilaki, Kurdish (excluding Kurds in Turkey), Talysh, Azerbaijani (in Iran), Pamir languages, Pashto, Urdu, Balochi, Sindhi (in Pakistan), Punjabi (in Pakistan), Uzbek (in Afghanistan), Turkmen (in Afghanistan), Saraiki, Hindko, Brahui, languages spoken in Kashmir.

Whole languages are dying out because people are unable to express them properly on computers. Even the popular software that dominates among these speakers does not care to improve their experience. For example, Urdu has traditionally been written in the Nastaliq form [1], but is usually rendered everywhere in the Naskh form [2]. There is no way to change this, for example, in Android without basically rooting it and changing the system fonts.

[1] https://en.wikipedia.org/wiki/Nastaliq

[2] https://en.wikipedia.org/wiki/Naskh_(script)

helterskelter 10 hours ago [-]

> There is no way to change this, for example, in Android without basically rooting it and changing the system fonts.

I am really surprised Android won't let the user select their own system font. This is a huge accessibility problem, especially for dyslexics.

Gander5739 2 hours ago [-]

You can do it on some vendors' versions, sometimes requiring third party apps like zfont.

Conscat 14 hours ago [-]

I feel like I've never gotten a compelling explanation for why Nastaliq is hard/unavailable. I'm not an expert on abjads, but it doesn't look harder to render than Naskh (and it self-evidently is possible since the fonts exist). Does anyone here know why they make it difficult? Urdu is much less obscure than, say, Sharada or other languages with Unicode support. I think Punjabi is also often written in Nastaliq when it's not in Gurmukhi or Roman.

bradrn 8 hours ago [-]

In Naskh, each letter has only four forms (for the most part — there are a few ligatures etc. but I think ‘only four forms’ remains basically true). The choice between forms is determined almost entirely by position within a word (initial/medial/final/isolated). All the letters are aligned along the baseline and connect to each other in basically the same way.

By contrast Nastaliq is a much more complicated style. Many letters and letter combinations take on several different forms depending on which other letters surround them. Letter joins are usually diagonal, so letters earlier in a word need to be shifted above the baseline by a variable amount. Having to shift letters vertically as well as horizontally greatly complicates other aspects of the style too.

(I recall seeing a nice table some time ago showing all the various different possibilities for letter joins in Nastaliq. Unfortunately I can’t seem to find it again. Still, you might get some idea by consulting the documentation of one of the existing Nastaliq fonts, e.g. Awami Nastaliq: https://software.sil.org/awami/what-is-special/)
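To make the "four forms" point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It looks up the legacy Arabic Presentation Forms-B code points for one letter, beh (U+0628). The choice of beh and the helper name `positional_forms` are illustrative assumptions. Modern fonts pick these shapes through shaping rules rather than through these code points, so this only shows that the four positions exist; it says nothing about how a renderer works.

```python
import unicodedata

# Legacy presentation-form code points for beh (U+0628), one per position.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def positional_forms(chars):
    """Map each presentation form to (position tag, base letter)."""
    result = {}
    for ch in chars:
        # The decomposition looks like "<initial> 0628": a tag plus the base code point.
        tag, base = unicodedata.decomposition(ch).split()
        result[ch] = (tag.strip("<>"), chr(int(base, 16)))
    return result

forms = positional_forms(BEH_FORMS)
for ch, (position, base) in forms.items():
    print(f"U+{ord(ch):04X} is the {position} form of U+{ord(base):04X}")
```

Nastaliq has no comparable fixed table, which is the contrast drawn above: the shape depends on the neighbouring letters, not just on the position in the word.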

linmer 7 hours ago [-]

Yeah, but the difficulty isn't in rendering the fonts, it's for the font creator. Once the font is ready with all the combinations, rendering and using a Nastaliq font doesn't differ from rendering a Naskh one. Nastaliq fonts are available in Persian, not sure if that's true for other languages, but it's just more complexity in making the font. For using a ready font, the only thing needed is permission to change the font.

bradrn 6 hours ago [-]

Yep, that’s what I meant; thanks for clarifying the point.

(Though that said, a sibling post linked this interesting talk on limitations in OpenType itself: https://www.tiro.com/John/TypeCon2014_Hudson_DECK.pdf)

ablob 6 hours ago [-]

afaik this is a non-issue with modern text rendering engines. Modern font files include rulesets to determine the forms and shaping engines apply these rules to eventually reach the desired "shape" (i.e. order, position and which glyphs to render). For example, if you use HarfBuzz it should be able to calculate the Glyphs and offsets you need for a properly set script.

I personally spent way too much time trying to understand it, but at least according to this video (https://www.youtube.com/watch?v=VaA0v0V4RsU) it really is not that difficult if you leave out all the font-selection and emoji shenanigans.

I think at least FreeType (glyph rendering) and HarfBuzz (text shaping) make it needlessly complex through their documentation. It is extensive in describing what the parts do, but the only way to figure out what you need is by fiddling around. As soon as you want to do more complex stuff you're on your own. Especially figuring out which parts you don't need is annoying.

yorwba 7 hours ago [-]

SIL's Nastaliq font uses their own Graphite engine, which is included in Firefox but not other browsers (Demo page: https://graphite.sil.org/graphite_fontdemo ), but e.g. Noto Nastaliq Urdu also exists https://fonts.google.com/noto/specimen/Noto+Nastaliq+Urdu?pr... and does a decent job in non-Graphite engines, certainly better than Awami Nastaliq without Graphite.

So the real question is why Android doesn't make it easy to put Noto Nastaliq Urdu in the font stack.

ValdikSS 7 hours ago [-]

>compelling explanation for why Nastaliq is hard/unavailable

https://www.tiro.com/John/TypeCon2014_Hudson_DECK.pdf

abdullahkhalids 5 hours ago [-]

There are many high quality Nastaliq fonts available. You can install them on your computer and use them easily in whatever software (for example, office apps) allows you to set the font.

There are no technical reasons preventing the use of Nastaliq fonts everywhere. Only product design decisions by big tech.

mchaver 11 hours ago [-]

My guess would be line height is a challenge and Naskh already exists. Then probably because these scripts are not used often in the places that are centers of software/OS development.

mohamedkoubaa 17 hours ago [-]

I don't know why people look down their noses at Arabizi

abdullahkhalids 16 hours ago [-]

Because people don't want to abandon hundreds or thousands of years of culture for a completely solvable problem.

linmer 7 hours ago [-]

I looked at Arabizi and the numbers are really annoying for formal text etc. Finglish is better in my opinion, however it causes problems like being able to read the same text in two ways, like "dar": the a can be like 'a' in 'dad' or like 'a' in 'car'. With different pronunciations it means 'door' or 'gallows', which can be very annoying in Arabic languages that, unlike Persian, write _ُ_ِ_ٌ_ً_ٍ_ّ_. Instead of numbers it uses combinations like 'kh' for 'خ', 'gh' for 'ق' and 'غ'. In some methods they use 'aa' for the 'a' sound in 'bar' and a single 'a' for the 'a' sound in 'lad'.

vessenes 15 hours ago [-]

I don’t know either, but I am aware that in glyph based languages (and this article makes the case that Arabic has some glyph-like features), there is considerable social discussion about the equivalents, like pinyin. Detractors worry that sound-based (where sounds are based on the latin / western orthography) approaches to writing change something fundamental in people’s brains as distinct from more native versions.

In Chinese for instance, you can use a keyboard that combines radicals - parts of a character, or you can use a keyboard that combines phonemes. Those seem likely to change literally how you think in your language. There may be related concerns for Arabic.

That said, one of the complaints in the blog is that two different codepoints render to the same exact letter / phrase / word — this is not a problem unique to Arabic in Unicode, and there are known approaches: I’d expect (I’m not a Unicode expert by any means) that more work on the tech stack for rectification (I’m sure there’s a technical Unicode word for this process of matching codepoints for e.g. search and uniqueness of rendering) would likely be useful for Arabic, and relatively seamlessly flow in many places.

e28eta 13 hours ago [-]

> I’m sure there’s a technical Unicode word for this process of matching codepoints for e.g. search and uniqueness of rendering

That’d be Unicode Normalization. I don’t have an opinion on the best source for more details, so here’s a link from unicode.org https://www.unicode.org/reports/tr15/

I don’t know enough to know whether or not there are still Arabic-specific issues, either in the spec or the implementations.

The example in the article of copy/paste/search is interesting. I think it’s equally likely to be a RtL issue as a normalization bug, but I haven’t done anything significant with either topic.
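As a rough illustration of what normalization buys you, here is a small Python sketch using the standard `unicodedata` module. It compares the two spellings of alif with hamza that the article complains about (the precomposed U+0623 versus U+0627 followed by the combining U+0654). The helper name `canonically_equal` is made up for this example, and NFC is just one reasonable choice of normalization form.

```python
import unicodedata

def canonically_equal(a, b):
    """True if two strings are canonically equivalent (equal after NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u0623"      # ARABIC LETTER ALEF WITH HAMZA ABOVE
combining = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE

print(precomposed == combining)                      # raw code points differ
print(canonically_equal(precomposed, combining))     # same text after NFC
print(canonically_equal("caf\u00e9", "cafe\u0301"))  # same idea for Latin
```

So this particular pair is covered by the standard tables. Whether a given application actually normalizes before comparing or searching is a separate question.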

mchaver 11 hours ago [-]

Probably because it's a work around and not what most people want to do. Imagine someone telling you you have to type English in Cyrillic. I know if I could no longer type out Chinese characters and had to use pinyin it would feel very odd and like something was taken away.

pseingatl 16 hours ago [-]

For a while, Arabizi was wildly popular and universally used on feature phones. When mobiles became smarter, it was used less. Japanese has romaji and Mandarin has pinyin. Arabic's Arabizi would increase literacy rates and solve all these digital problems.

avadodin 13 hours ago [-]

Romanization is a separate issue to using fixed glyphs.

There was a theory in the XIX / early XX century that full literacy was impossible without the Latin script but such claims are ridiculous especially for Arabic which is an alphabetic script already. China has higher literacy rates than Vietnam, for example.

I don't think the many composition rules of Unicode are really necessary, though. Maybe as an extension for academic work or artistic compositions but not for computing.

If all we had were movable types, all of these language users would find a way to write their language that wouldn't require a Turing-complete computer on each glyph. Now the Unicode gods pander to some of these computer-hostile scripts making the users of different scripts feel slighted.

cyphar 16 hours ago [-]

The vast majority of Japanese and Mandarin speakers are also not in favour of replacing their current writing systems (which give them a link to thousands of years of their own history) in favour of simplified systems. I suspect it is the same for Arabic speakers.

throwaway27448 12 hours ago [-]

I generally agree with what you're saying, but there is rather famously a simplified form of chinese that was designed specifically to increase literacy rates.

numpad0 14 hours ago [-]

Romaji/pinyin are widely used for typing the actual written scripts. They're not seen as alternate written scripts outside of edge case scenarios (like chats in FPS).

mxchelsemaan 4 hours ago [-]

It's aesthetically revolting and allows for multiple renderings of the same word.

smitty1e 10 hours ago [-]

This seems an esoteric problem for the outsider.

But consider how cursive is dying out in (at least American) English, and how many centuries of writing will become unintelligible to the casual reader as a result.

All of these important cultural artifacts require maintenance.

RetroTechie 5 hours ago [-]

> All of these important cultural artifacts require maintenance.

This. Arabic users can complain about eg. Unicode not covering their writing in a suitable manner. And I (as a non-Arabic) can certainly see the problems described in the article.

But -going back to earlier days of computing- what stopped Arabic countries from devising a system that does that better than Unicode? (and covers other written languages like Hangul, Japanese or traditional Chinese, better than Unicode covers them)

Seems like that didn't happen? Either too few Arabic people cared, or solution(s) they came up with had shortcomings of their own & weren't implemented widely enough, or Unicode was good enough that few Arabic developers cared to go beyond that.

abdullahkhalids 5 hours ago [-]

It's likely the same problem as in Pakistan. Due to the history of colonialism/control by European powers, in these countries personal economic success is usually tied to command of English or French. So even within each of these countries, the rich, educated and those in power prefer latin script. Consequently, there was never any strong push to develop computing technology for local languages.

The other reason is that it's not technologically simple to solve all the issues highlighted in the TFA. Unicode actually does a pretty decent job of setting a uniform standard, but a lot of software has to be written on top of it to get the entire system working: (1) your software must support bidi text, (2) good fonts must be available to display the text in multiple languages (3) textual data needs to be properly stored in unicode and transmitted as is at every point in the OS (4) search engines must deal with the complications of non-breaking spaces and legacy unicode characters.

You have to kind of rewrite the entire stack from top to bottom. Preferential Arabic/Persian/Urdu speakers never had the technical skills and the political power to drive those changes in software largely written in different continents.

cenamus 9 hours ago [-]

This has pretty much already happened for the older style of German cursive, called Kurrent. Partly also because the Nazis got rid of it.

https://en.wikipedia.org/wiki/Kurrent

Tons of old documents written in it, basically impossible to decipher for anyone that only learned to write "modern" cursive or even print.

linmer 8 hours ago [-]

Not only are writing and printing hard, so are selection and moving your cursor. Because in most tools, the right and left arrow keys don't mean right and left in Arabic, Persian etc. It's reversed in RTL languages, so right arrow moves the cursor to end direction (left in LTR, right in RTL) and left arrow moves you towards start direction (right in LTR, left in RTL). So in bidirectional text, for example when the majority of the text is English and you have a short RTL phrase, you are holding right arrow and then when you reach the RTL part the cursor suddenly jumps to the start of RTL text, then it goes to left and it SEEMS like you are going backward to the start, not forward. That is until you reach the end of RTL phrase and you teleport to start of next LTR part.
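The cursor jump described here can be reproduced with a heavily simplified sketch in Python. It is not the Unicode Bidirectional Algorithm: it assumes an LTR paragraph, one column per character, no shaping, and neutral characters that simply inherit the previous strong direction. The function names are hypothetical; only `unicodedata.bidirectional` is a real standard-library call.

```python
import unicodedata

def strong_direction(ch):
    """Return 'R' or 'L' for strongly directional characters, else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):
        return "R"
    if bidi_class == "L":
        return "L"
    return None

def visual_columns(text):
    """For each logical index, give its on-screen column in an LTR paragraph."""
    directions, current = [], "L"
    for ch in text:
        # Simplification: neutrals keep the direction of the previous strong character.
        current = strong_direction(ch) or current
        directions.append(current)

    columns = [0] * len(text)
    start = 0
    while start < len(text):
        end = start
        while end < len(text) and directions[end] == directions[start]:
            end += 1
        for i in range(start, end):
            # An RTL run is laid out mirrored inside the span it occupies.
            columns[i] = i if directions[start] == "L" else start + (end - 1 - i)
        start = end
    return columns

# "abc", three Arabic letters, then "def", in logical (typing) order.
text = "abc\u0627\u0628\u062Cdef"
print(visual_columns(text))  # [0, 1, 2, 5, 4, 3, 6, 7, 8]
```

Stepping through the logical indices in order, the column goes 2, then 5, 4, 3, then 6. That is the "jump to the far side, walk backwards, teleport out" behaviour when arrow keys follow logical order.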

mohsen1 7 hours ago [-]

It's 2026 and things like kashida in CSS are not possible. Long way to go to support the Arabic script properly on the web.

And as the article says, since most of the writing is happening on computers, stuff like kashida is going to be forgotten soon.

linmer 6 hours ago [-]

Hmm. by kashida do you mean something like 'ایـــــــــــــن'?

I don't think you mean this, because I don't know how you would do it in CSS. Looks more like a problem to be solved with different types of character than styling.

You may want to explain more.
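For reference, the elongation in the quoted word is done with a plain character, U+0640 ARABIC TATWEEL, repeated between two joined letters. Here is a minimal Python sketch of that character-based approach. The helper names are made up, and the choice of where to stretch is left to the caller; real kashida justification picks the stretch points by typographic rules that this does not attempt.

```python
import unicodedata

TATWEEL = "\u0640"

def add_kashida(word, after_index, count):
    """Insert `count` tatweel characters after the letter at `after_index`."""
    return word[:after_index + 1] + TATWEEL * count + word[after_index + 1:]

def strip_kashida(text):
    """Remove all tatweel characters, e.g. before searching or comparing."""
    return text.replace(TATWEEL, "")

word = "\u0627\u06CC\u0646"  # the Persian word from the comment above, unstretched
stretched = add_kashida(word, 1, 5)

print(unicodedata.name(TATWEEL))
print(stretched)
print(strip_kashida(stretched) == word)
```

The drawback of this approach is that the stretch becomes part of the text itself, so search and comparison have to strip it again, which is presumably why a styling-level mechanism would be preferable.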

harshreality 16 hours ago [-]

(2017)

How much of this is still a problem with modern software/font stacks and harfbuzz?

ValdikSS 7 hours ago [-]

It still is a problem in general, that's why fonts now have WebAssembly in them.

This repository has a good outlook: https://github.com/harfbuzz/harfbuzz-wasm-examples

RetroTechie 5 hours ago [-]

Oh great. Embed executable <anything> into a format perceived to contain static data - as a core feature. That always worked so well!

IsTom 60 minutes ago [-]

TrueType already has a hinting virtual machine. It's not an entirely new thing.

Karliss 14 hours ago [-]

The fact that the article was able to show the correct version in regular text is a pretty good indicator that, if done correctly, those are more or less solved problems. I don't disagree that there are probably plenty of times when those mistakes are repeated and solutions are not used widely enough (more often for Arabic scripts than other languages), but even for 2017 it feels more like anecdotal examples of what can go wrong, ignoring existing technical details. Those mistakes largely come down to having someone who cares and understands the language and technology, not to a lack of solutions. There are probably plenty of interesting edge cases that might not be handled perfectly even though solutions for basic cases exist, but the article doesn't come even close to discussing those technical details, especially if its only conclusion is "computers introduced more problems, notably because of Unicode".

> The inflexibility persisted and has arguably only become more aggravated in the 20th century

What about the 21st century? Digital printing can overlap characters just fine. And modern fonts support context sensitive ligatures and glyph substitutions.

The second/third examples seem to be caused more by someone who doesn't understand the language copy-pasting stuff.

PDF -> that's just PDF being bad. Text and text search in PDFs tend to mess up even for English.

> with unicode number U+0623, but one can also type أ, which is an alif and a high hamza, represented by unicode numbers U+0627 and U+0654.

That's what Unicode normalization and locale settings are for. Same thing applies to large fraction of latin based scripts other than English, anything which has letters with diacritic marks.

> for كثيره and كثيرة will in most cases yield different results

Similar thing in almost any non English language for example cafe and café or ABC and ⒶⒷⒸ. Although at least some systems handle it reasonably. Not sure how much it is heuristics based on large data (hard to scale across software), and how much it's good application of Unicode character decomposition/normal form tables. Which Arabic letters lack appropriate Unicode decomposition (and other) tables and what are the best practices of unicode normalization/decomposition/locale handling for search (applicable for all languages) are more interesting and actionable topics.
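A sketch of what such search folding might look like, in Python with only the standard library. Everything here is an assumption for illustration: the function name, the one-entry fold table (teh marbuta to heh, which normalization alone does not unify), and the decision to strip all combining marks. Real search engines use larger, language-specific tables, and folding this aggressively will merge words that some languages consider distinct.

```python
import unicodedata

TATWEEL = "\u0640"
# Hypothetical fold table: TEH MARBUTA (U+0629) -> HEH (U+0647).
LETTER_FOLDS = str.maketrans({"\u0629": "\u0647"})

def fold_for_search(text):
    """Reduce text to a loose key so near-identical spellings compare equal."""
    # NFKD splits off diacritics and unwraps compatibility characters.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (accents, hamza above, Arabic vowel signs).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    stripped = stripped.replace(TATWEEL, "")
    return stripped.translate(LETTER_FOLDS).casefold()

with_heh = "\u0643\u062B\u064A\u0631\u0647"          # spelled with final heh
with_teh_marbuta = "\u0643\u062B\u064A\u0631\u0629"  # spelled with final teh marbuta

print(fold_for_search(with_heh) == fold_for_search(with_teh_marbuta))
print(fold_for_search("caf\u00e9"))
print(fold_for_search("\u24B6\u24B7\u24B8"))
```

The first two steps come straight from the Unicode tables. Only the last mapping is a language-specific judgment call, which matches the split suggested above between what normalization gives you for free and what needs extra data.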

> Not even the simple idea of CJK has been implemented.

Many users of CJK language would argue that CJK unification was a mistake. If different languages prefer different forms of the glyph, they should better be separate characters. Having separate Chinese and Japanese fonts because CJK unified too much just introduces additional points of failure.

ablob 6 hours ago [-]

> Many users of CJK language would argue that CJK unification was a mistake.

Luckily it's not a decision without turning back. In most relevant contexts you should know the input language and can select a Font specifically using said variations. Of course this information will not be present in plain text, but if it turns out to become an issue I'd wager, since language codes do exist, that a control code-point for language selection can be added to the specification. There's already so many special cases in Unicode that it shouldn't be a huge issue (apart from backwards-incompatibility that would lead to tofu instead of no rendered glyph).

a_t48 9 hours ago [-]

Oh hey, second chance queue, nice. I'd originally searched this up because I was curious about how Arabic worked on early computers.

yorwba 11 hours ago [-]

Many Arabic-speaking countries already have very high literacy rates. Meanwhile Somali is a related language officially written in the Latin alphabet, but Somalia has a literacy rate around 50%: https://ourworldindata.org/grapher/cross-country-literacy-ra...

So adopting Arabizi without increasing access to education can be expected to do roughly nothing for literacy, whereas with a good education system, people can learn to read and write in Arabic script just fine.

jojobas 8 hours ago [-]

Almost as if the people working on fonts, rendering systems and all that include a vanishingly low percentage of Arabic native speakers. Why is left as an exercise for the reader.

numpad0 14 hours ago [-]

It's... interesting how the author sees Han Unification as a feature, when it's just a longstanding and politically charged bug. CJK languages are mutually unintelligible, so displaying CJK texts in wrong fonts won't do anything meaningful; it won't make texts in one language readable to speakers of other languages.

yorwba 11 hours ago [-]

Indeed displaying CJK texts in wrong fonts won't do anything to change the meaning and people who can read it in one font can read it in any font. They might complain that it looks ugly because that one stroke should be slightly longer and have a different angle, but those are ultimately aesthetic preferences that don't affect readability.

Even before Unicode, it was established practice that documents mixing Chinese and Japanese would use the same encoding for both and roughly nobody would bother to pick an ugly font for the foreign-language text to make it look appropriately different.

Unicode rightly decided that the fine details of appearance are left to fonts. Otherwise you'd also need e.g. a bunch of extra codepoints so that early-20th-century handwritten letters in German can have their look accurately preserved: https://en.wikipedia.org/wiki/S%C3%BCtterlin

numpad0 9 hours ago [-]

You can't mix encodings in a single file. A file has one encoding only. It was not possible before Unicode to mix two languages in a single file, whether the languages involved were Chinese or Japanese or French (English was an exception).

Now, if a file was encoded in Unicode, and/or if it was in such document format that support inline font specification, such as HTML, then you could mix two languages without having to stick to one language by e.g. wrapping <font face=Helvetica>paragraphs and words</font> <font face=Futura>with tags</font>.

My point is, it seems that the author is not aware that each of the CJK languages is only understood within its own country, in both writing and speech, and that's somewhat peculiar.

ablob 5 hours ago [-]

You may not be able to mix encodings, but mixing languages has always been possible. If you used a French encoding you would be able to write in English, but not the other way around. I'd wager there are similar cases for cyrillic text. What Unicode gave us is its universality (heh). You don't have to carefully select an encoding able to represent the languages you wish to use anymore.
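The French/English asymmetry is easy to check with Python's built-in codecs. In this sketch, Latin-1 stands in for "a French encoding" and ASCII for "an English one"; that mapping and the helper name `can_encode` are assumptions made for the example.

```python
def can_encode(text, encoding):
    """True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

english = "cafe"
french = "caf\u00e9"
mixed = "caf\u00e9 \u0627\u0628 \u65e5\u672c"  # French, Arabic and Japanese together

print(can_encode(english, "latin-1"))  # the French encoding also covers English
print(can_encode(french, "ascii"))     # the English one does not cover French
print(can_encode(mixed, "latin-1"))    # no legacy single-language encoding fits
print(can_encode(mixed, "utf-8"))      # a Unicode encoding takes all of it
```

This matches the point about universality: before Unicode you had to find one encoding whose repertoire happened to cover every language in the file.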

Panzerschrek 13 hours ago [-]

Why do people still use this horrible-looking and hard to process alphabet? Why not switch to Latin (as some countries did) or at least reform it so that it's easier to type and to read?

cenamus 9 hours ago [-]

Because pretty much no language besides Latin actually maps nicely to the Latin script. Pretty much all languages use digraphs, diacritic symbols or completely new letters in the Latin script (even English, œ and æ, for some time at least).

Panzerschrek 8 hours ago [-]

Having a Latin alphabet with diacritics and digraphs is still better than something like Arabic. Some non-Latin alphabet with letters separate from each other (like Greek, Cyrillic, Armenian, Georgian, etc.) would be fine too.

RetroTechie 4 hours ago [-]

Why should all writing be forced into 1 way of doing things? That just sounds disrespectful of other peoples' culture.

mxchelsemaan 4 hours ago [-]

His username is a Nazi anti-tank weapon. I suspect "disrespectful of other peoples' culture" is an understatement for OP.

ah27182 11 hours ago [-]

What’s the point of insulting a script as “horrible looking”. What a silly comment, please grow up.

Panzerschrek 8 hours ago [-]

It's objectively so horrible. Letters have similar forms to each other, and the joints between them make it harder to take them apart. As I understand, the Arabic alphabet wasn't designed to be practical for daily use, only for writing sacred texts.

linmer 8 hours ago [-]

If you think Arabic is horrible because it has connected letters, so is cursive. But I like both cursive and Arabic. You can easily distinguish separate words as the words' letters are connected, and you don't have to put less space to show that some letters are making a word. It's not optimized for printing and digital fonts, I agree. But you can't say it's not useful for daily use. It's so much easier on paper.

Panzerschrek 6 hours ago [-]

> But I like both cursive and Arabic

Just a personal esthetic (unpractical) preference.

> It's not optimized for printing and digital fonts I agree. But you can't say it's not useful for daily use.

These two sentences contradict each other. Daily use means printing and reading from screen.

> It's so much easier on paper.

Only if one writes by hand, which has been impractical ever since the printing press was invented.

theobreuerweil 8 hours ago [-]

How can you describe something as “objectively” horrible-looking? Your opinion on the way that something looks is precisely that: an opinion. I’d add that neither the Arabic nor Latin alphabets were designed for anything. They both evolved organically from other previous alphabets.

Panzerschrek 6 hours ago [-]

> an opinion

You didn't read my comment properly. I have written why the Arabic alphabet is horrible. It's all cursive and letters aren't different enough from each other.

> I’d add that neither the Arabic nor Latin alphabets were designed for anything. They both evolved organically from other previous alphabets.

Many alphabets evolved from a need for daily stuff like accounting, where practicality matters. Arabs didn't have a widespread alphabet until they needed to write their sacred texts, so they invented an alphabet primarily designed for that. Other alphabetical systems were already widespread in the regions currently dominated by the Arabic alphabet and many of them look much better.

mxchelsemaan 8 hours ago [-]

>As I understand, Arabic alphabet wasn't designed to be practical for daily use and only for writing sacred texts.

You clearly don't. Arabic was the premier language for philosophy, science, and mathematics in the middle ages. Algebra, algorithms, zero, cipher, average, and so on are all etymologically Arabic. One might start to suspect bigotry from a "Panzerschrek."

>Letters have similar form to each other, joints between them makes it harder to take them apart.

Your brain is not powerful enough to pattern-match, but your incapacity is not universal.

>It's objectively so horrible.

Literal billions in the world disagree. I might equally claim that the Arabic abjad is infinitely more beautiful than the pedestrian Latin alphabet, especially when expressed in the ugly and diseased Orc-tongue that is called German. [1] Your lack of taste is not universal.

[1] https://www.youtube.com/watch?v=wS_-7B1FAbE&pp=ygUaY2hpbmVzZ...

Panzerschrek 6 hours ago [-]

> Arabic was the premier language for philosophy, science

But this was not for ordinary people (peasants) or even accountants, where practicality matters.

> Your brain is not powerful enough to pattern-match, but your incapacity is not universal

With some training it's possible to read Arabic texts. But this requires more mental load and practice compared to other alphabetical systems.

> Literal billions in the world disagree

Argumentum ad populum. There are also billions of people using even worse writing systems - non-alphabetical ones. This doesn't mean that they are as good as alphabetical systems like Latin.

The video you linked has nothing to do with practicality. It's about calligraphy, which is an art form. It may look good, but it doesn't matter for daily use when one needs to read and type a lot.

ablob 5 hours ago [-]

I wonder by which metric you measure these scripts. Clearly it can't be on pronunciation or information density. If "amount of letters" is your pick, then Latin might be "objectively" the best system - you'd just be using a very bad metric.

If you're going to unify all the world's languages into one script, then you'd better pick a good measure for that. If everyone in the world learns it, then it doesn't matter if there are 50 or even 100 different characters. You will have to capture _all_ of the nuances of the languages without blowing them out of proportion in size. Good luck with that.

mxchelsemaan 6 hours ago [-]

> But this was not for ordinary people (peasants) or even accountants, where practicality matters.

All peasant societies were illiterate, including Latin script adopters. Completely irrelevant to Arabic.

"Or even accountants" - apparently, Arabs didn't trade! It's not like Muhammad himself was, you know, a trader and an accountant...

> With some training it's possible to read Arabic texts. But this requires more mental load and practice compared to other alphabetical systems.

Perhaps your brain is too slow. I and many other bilinguals read Arabic even more quickly and efficiently than Latin-script languages. The words are terser and are read as units as opposed to the inefficient character-by-character Latin system. Speed-reading doesn't exist in Arabic because Arabic is already speedily read.

> It may look good, but it doesn't matter for daily use when one needs to read and type a lot.

You just said it was objectively horrible looking above. But consistency cannot be expected of a jumbled mind.
