Some languages are written left-to-right, others are right-to-left. Still others are written vertically, though I didn't notice the code for handling that. So today I'll describe how to determine the direction in which text should be written!

Pango uses libfribidi to implement this algorithm, which (from memory) is specified by The Unicode Consertium.


It starts by classifying each character, than their "brackets", then the "embedding levels", then the ordering. Which it then applies.


Those arrays are allocated either on the stack or the heap, depending on their length. And the initial character classification is determined via a lookup table.

Locating the bracket characters involves a couple of more sophisticated & compact lookup tables. One of which is only used when the other (and the initial type classification) indicates it is indeed an open or close bracket.

Determining embedding levels is much more involved...


Show thread

It starts by grouping equivalently-classified non-bracket characters into a linked-list of "runs". Then determine the direction of the first run that cares, fallbacks to a specified default.

Then it iterates over the runs, resolving any:
* "strong" directions by pushing to the stack
* "PDF" character types by possibling popping the stack & rearranging the runs
* "PDI" by popping the overflow or isolate stacks
* Isolate types by computing a new level & possibly pushing
* End with paragraph


Then it iterates over the runlists again to check the isolation levels & zero out erroneous connections. After which it compacts the run list, merging nodes at the same level.

Now we need to resolve "weak directions". Which involves iterating over the newly-compacted runs, getting it's adjacent text, & checking it's type to determine which it should merge direction with.

After some optional debugging it does so again for some even weaker directionality. Then what *really* doesn't care.


Show thread

Then libfiribidi iterates over the runs again to resolve the text direction of brackets & those truly neutral characters. Building a pairing node list, which it mergesorts after this loop.

It then searches those pairings for a strong direction & applies it to any brackets which needs it. It can now free those pairing nodes.

After finalizing the brackets, there's more neutral types to resolve (relatively trivial), & then based on even or odd embedding levels. Afterwhich it compacts again.


Show thread

At this point this multi-pass logic starts tidying up after itself.

It reinserts characters for setting directionality it has removed. It resets some embedding levels whilst reordering the runs. Populates the output array. And optionally outputs debugging information.

Now that libfribidi has embedding levels information, it computes an array specifying the new index for each character, starting out as an identity transformation.


Show thread

The embedding levels may then need to be refined for Arabic text for it's "joinings". Which starts with a lookup table, that's converted with a bit of logic into flags which are then applied in groups via lookup tables. Then some characters need to be replaced with their mirrored equivalent, because Arabic can apparantly be both left-to-right & right-to-left.

Then to finalize & apply the character ordering, it ignores the embedding of certain initial characters, ...


Show thread

iterates over the embedding levels reversing characters or their indices, finds the max embedding level, & iterates over them to find any inner embeddings that need to be reversed again.

This yields the output Pango uses to split line uses with the text in a unified direction. Which it returns after freeing any of it's temporary memory.

8/8 Fin. Tomorrow: character->glyph conversion via Harfbuzz

Show thread

@alcinnz I implemented limited Unicode bidi in Emacs Lisp years ago in order to get a grip on it. Little did I know how different a solution for Emacs would have to be, since its redisplay engine doesn’t work like that at all – and by the time I was finished I realized that my C was too weak to understand the Emacs redisplay engine. So it was all in vain. Well, at least I learned something about Unicode bidi. And the code is still online on the wiki…

@kensanata Actually, I'm only seeing a short blurb about your code. The link to it's source is broken! didn't help.

@alcinnz Yeah, noticed it as well and went through the pages, cleaning up some stuff and linking to the git repo I created a while ago:

@kensanata @alcinnz a version of your code, converted to Common Lisp is part of McCLIM, so it wasn't entirely wasted.

Sign in to participate in the conversation

For people who care about, support, or build Free, Libre, and Open Source Software (FLOSS).