PangoLayoutIter Bidi Support

This page is intended to facilitate my patch to pango bug #89541. As OwenTaylor had said, this issue involves some "ill-defined" work, so I use this page to explain my understanding of the underlying issues as clearly as I can. In case you think I'm wrong on some point, please feel free to correct it directly on this page.

Current implementation

The current implementation of PangoLayoutIter works well for LTR-only text. However, although it does seem to contain code which is specific for RTL/bidi text, when trying to use it for real RTL text (even very simple cases) you quickly get "assertion errors". These errors indicate that the LayoutIter encounters situations which the programmer does not expect to happen (i.e.: lower-level Pango functions such as pango_shape and pango_itemize return GlyphItems whose structure is not understood by PangoLayoutIter).

According to the documentation, a PangoLayoutIter is supposed to iterate the text in "visual order". This usually means something like with lexicographicly increasing (x,y) positions, but as I explain below, there's still some room left for interpretations. I'm not at all sure what the current implementation was trying to achieve (it's hard to follow the code's reasoning, because it does not seem to produce sensible results), but I still try to make my patches match what I do understand as far as possible.

OwenTaylor: I don't consider this to be ambiguous. There is a well-defined mapping between characters in the text and positions on the screen. This is done by, for each cluster in the text, dividing it equally between the characters in the clusters. If the run direction is LTR, the characters advance from left to right within the cluster, if the run direction is RTL, the characters advance from right to left within the cluster. See, for example, the code in pango_glyph_string_index_to_x()
- AmitAronovitch: memo: I should use this to write a test case (iterate some text, and compare the iterator's extent to what you get by string_index_to_x())
When iterating text in visual order, the characters must be iterated in an order such that their positions on the screen (as defined above) move from left to right.
- AmitAronovitch: This only defines the order of cursor positions on the screen. It does not tell me anything about the order in which characters within a cluster should be enumerated. Keep in mind that Hebrew/Arabic points are placed above/inside/below the base char and there's no horizontal ordering (many times there's more than one mark on a single base). You are just saying that the cursor should move in some direction, but because of the arbitrary(?)logical(?) division into equal horizontal slices, the positions no longer correspond to real glyph pixels anymore (even if they did, there's no general mechanism for mapping the glyphs to characters) - so you must explicitly define to which character each such position corresponds.
  - OwenTaylor: What I'm saying is that I don't want multiple ideas of the extents of a character in Pango. What is used for hit-testing and cursor placement should also be used during iteration. Pango currently has no way of knowing how the characters within a cluster correspond to glyphs, so it uses the simplest mechanism ... linear progression in the reading direction.
    - AmitAronovitch: You probably mean to say 'visual order' - that's reversed reading direction in the RTL case (especially in the case of ligatures). We realy do read from right to left I really should have had a look in the cursor placement code, it would have cleared some of the ambiguities for me. I just hadn't thought of that.
    It's important to note that in the case of base characters + combining marks, the division of a cluster into characters has no meaning for cursor positioning, because the cluster makes up a single grapheme and cursors can only be positioned at grapheme boundaries.
Note that a cluster isn't just base + combining characters, but might can also be a ligature ... a special way of writing to two adjacent characters as a single glyph. (In English, fi is frequently written as a ligature.) If you had the same thing in a RTL language .... an "IF" ligature, then it should be pretty clear that iterating the characters in visual order must first go to I and then go to F.
- AmitAronovitch: So, if I understand correctly - your rule is: decreasing indices for RTL runs, increasing for LTR, regardless of the ordering of the Glyphs? You have to understand that compared to the very common case of combining marks (they actually represent the vowels in the relevant languages), ligatures are a remote case.
  - OwenTaylor: It's going to depend a lot on the script. Standard writing of Arabic has one mandatory ligature (lam-aleph) and no vowels.
    - AmitAronovitch: In fact, Arabic is the same as Hebrew in that respect (vowels are points, and they are ommitted in most texts). Lam-Aleph actually has it's own Unicode entry (U+FEFB and the surrounding chars), and it's own key on the keyboard (placed in a central place, next to other common letters) - I never looked at the code of recent Arabic editors, but I suspect that this ligature is handled at the character level - for the same historical reasons that made the Unicode Committee include this character in the first place (note: I'm not saying we should not handle it - just that we might not normally encounter it in the texts we get from the app).
    OwenTaylor: But the thing here to note is that if there are no internal cursor positions (combining mark case) then the division of the cluster into characters doesn't matter. For the ligature case, the linear division works a lot better.
    - AmitAronovitch: True. That's why I had to go as far as Thai (which is LTR) to demonstrate that this linear division can sometimes be worse than no division at all...
  AmitAronovitch: I was (naturally?) assuming that "visual order" of characters should match the order of the glyphs in the GlyphString - at least in the case where there's 1-1 correspondence (and this is the common, almost exclusive, case). Suppose that we had a glyph iterator (we don't - see comment below) - people would get very confused if iterating by glyphs would produce radically different results than when iterating by characters...
  - OwenTaylor: Someone iterating by glyphs has to expect that the order of the glyphs is arbitrary.
The way that a cluster is defined in Pango is simply as the minimal set of characters that interact when shaping ... depending on the situation, you might have:
- one character => one glyph N characters => one glyph one character => M Glyphs N characters => M glyphs As long as you can't divide that into two unrelated groups, it's one cluster.
  - AmitAronovitch: Now that you raised the char <=> glyph issue, note that the LayoutIterator does not provide any Glyph interface (there's next_char & get_index, but no next_glyph). In fact, in the general N => M case, it's just not possible to provide both char and glyph interface - pango holds no information on the internal char <=> glyph mapping within a cluster, so if you advance to the next glyph you'll have no idea where to move the character pointer (well, you could make next_glyph disable next_char and vice versa until you reach the end of the cluster - but you might be better off just implementing another iterator type).
    - OwenTaylor: At a minimum you'd have to say that you can only iterate by characters or by glyphs, but not mixed. I don't know a use for a glyph iterator. When someone has one, we can figure that out

Order of enumeration

The iterator works with text which was already reordered by lower level funcs into "visual order". Lines are divided into runs, which are divided into clusters. Each cluster is composed of a sequence of glyphs, and corresponds to a specific sub-sequence of the original text string (in many cases, each glyph corresponds to a single character, but this is not always the case). You should note that the basic unit of reordering is the cluster - not character or glyph. A cluster is composed of a 'base-character', and zero or more 'combining marks' (in Hebrew and Arabic these are usually 'points'). The points are rendered above/below/inside the base character, so they all have the same logical extents, and no natural "visual" ordering.

So, in which order do we expect PangoLayoutIter::next_char to enumerate them? The standard is inconclusive:

L3. Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed.

The Hebrew shaper leaves the points in their original position (after the base char). I'm not sure about the other bidi engines, but from a brief look at the code, it seems like the Arabic and Syriac engines reverse the whole run (so points should come before the base char). I don't think that any of this matters, since both ways are not really "visual", nor "natural" (visual iteration on rtl text is opposite the "natural" order anyways).

OwenTaylor: It doesn't really matter what the shaping engine does. The position of characters within a cluster is defined by linear advance in the run direction ... where the glyphs actually end up on screen or how the glyphs are ordered in the glyph string doesn't matter.
- AmitAronovitch: They do matter if you want your 'visual order' to match them. It seems that you don't, which confirms my current understanding of your iteration rule. We should probably just state the rule clearly in the API docs. p.s. - when we're done with the patch, maybe I should edit this page into some kind of documentation.
  p.s.p.s - when talking about character iteration, we should avoid using left/right terminology. use incresing/decreasing-index or before/after instead.
  - OwenTaylor: Having some of this information incorporated into the Pango docs would be wonderful. Because PangoLayoutIter iterates in *visual* order, left/right is actually meaningful for it. If it iterated in *logical* order, then you'd want different terminology.
    - AmitAronovitch: I mean in discussions about definition of "visual order" (left-right ness) of characters. When you talk about the character 'to the right' of character at index 14, you don't know if it's the one at index 12 or 16 unless you already know visual ordering.
      Specificly, I was referring to this paragraph (sorry - I may be messing up the diff. look in the wiki if it causes trouble). I believe that when you said "if the run direction is RTL, the characters advance from right to left" you actually ment to say something like "the characters advance in decreasing index order". Phrased as it was, your sentence carried no information - which was the main reason of my misunderstanding, and prompted me to write the paragraph that followed. Anyways - thanks for taking the time to respond here. It cleared a lot of ambiguities for me, and I think I now clearly understand what you want the iterator to do.
In fact, within a cluster, there is no definition of what character maps to what glyph, all we know is that all the characters in the cluster mapped to all the glyphs in the cluster.

The current PangoLayoutIter implementation implements next_char by calling g_utf8_next_char on ltr text, while using g_utf8_prev_char for rtl text. This makes the code require special 'if' clauses for handling cluster boundaries depending on directionality. It is also inconsistant with what the hebrew engine does (Comment: I did not know what the Arabic engine does when I started to work on this patch).

OwenTaylor: This needs to be preserved.

My personal feeling is that since there is no real "natural" ordering, we should prefer the "utf8_next_char always" approach, because it requires less special-casing and makes the code easier to understand.

More ordering: the x coordinates

Another nitpicking question we have to consider, is what should be considered the "logical extents" of characters within a cluster (i.e. what should
PangoLayoutIter::get_char_extents return. My first guess after reading the documentation was that all chars within a cluster would get the same extents. What actually happens in the Iterator code is that the cluster's logical width is divided equally between the cluster's characters, and the parts are iterated from right to left (i.e. - the first char (the one that comes latest in the utf8 string - see above) would get the highest x value).
- OwenTaylor: Hopefully the ligature example makes it clear why the current behavior makes more sense then assigning every character in the string the same extents. In the ligature case, and in other cases, we actually have *cursor positions* within the cluster, which is another reason why we need for the positions to advance ... if all the characters had the same position, then you couldn't tell where the cursor was.
  - AmitAronovitch: But except for the ligature case, the 'cursor positions' do not have a general known horizontal ordering. For all you know, the chars might have been reordered within the cluster (e.g.: Thai is LTR, but many Thai vowels are displayed to the left of their consonant - you'd get the cursor positioned right on top on the consonant, while in fact really pointing to the vowel and vice versa. This shows that sometimes arbitrary position indication is much worse than none at all). Still, I admit that while following the ordering of the glyphs would probably achieve better results, Pango has no means to do that. Besides the fact that there's no mapping available, the rendering engines don't even guarantee a consistant ordering (so trying to follow their behaviour is futile and not well-defined). So, while your rule may seem arbitrary to speakers of some non-latin-alphabet languages, it does at least have the advantage of being well defined...
    - OwenTaylor: In the combining mark case, there is no 1-dimensional ordering in general. Is a mark that's above a character before or after the character? And if you make some decision there, there is no reason that the glyphs have to be ordered the same way in the glyph string... that's just a question of font technology.
On the other hand, if we check the width property of the glyphs, we find that the hebrew engine (the fallback method at least, in hebrew-fc.c) assigns the whole logical width to the last glyph in the cluster, and zero width to all others.
- OwenTaylor: The width property on the glyphs is used when drawing text to the screen to figure out where to place glyphs... the glyph origin is advanced by the width after drawing a glyph.
  - AmitAronovitch: Hmm... This means that with combining marks you should always place the width on the *last* glyph of the cluster, no matter if it's the base char or a point. Nice to know
    - OwenTaylor: It really depends on how the font was designed. In some fonts, marks are placed to the left of the origin, in other fonts to the right.
  It's also true that the width of a cluster is the sum of the widths of the individual *glyphs*. But it has no effect on where Pango considers the *characters* in the cluster to be.
I don't know what is the 'right' answer to this. It probably depends on what you use the iterator for. Currently, my patch does the same thing as the current implementation.
- OwenTaylor: This sounds correct.

One more thing which is inexplicable to me, is that although
PangoLayoutIter::next_cluster does seem to follow the order in the GlyphString (which should be visual by now - i.e. increasing x coordinates), the iterator's cluster_x property is decreased by the first glyph's width when using rtl text. (Note: are they expecting the glyphs to have negative width? Should check if other engines do such things)
- OwenTaylor: This looks like buggy code that has never been tested. When you move to the next cluster, cluster_x should be increased by the sum of the widths of all the glyphs in the cluster. Probably the sign difference comes from the definition of 'cluster_x' in a comment: position where cluster begins, where "begins" means left orright side according to text direction. But you'd still want to increase cluster_x even if that is correct. The only difference is whether you'd increase it by the width of the old or new cluster.
  - AmitAronovitch: That definition of 'begins' (left side on ltr, right side on rtl) matches what pango_glyph_string_index_to_x() does. However, it does not match your reasoning for the sub-cluster character positioning above. If you want the cursor to have a different position for each character, the cursor must always be placed on the same side (i.e. left). Otherwise the position of the (visually) last cluster in an embedded RTL run, would be the same as that of the first one in the following LTR run. My new patch (will submit soon) always puts cluster_x on the left side (as it seems more logical this way to me). However, I might add commented out code to support the "string_index_to_x() way", for easy integration if you want it that way.

The next_cluster_start issue

The PangoLayoutIter class has a (private) member named next_cluster_start. It holds the index of the first Glyph of the next cluster, relative to the PangoGlyphString representing the current run.

This property is used in several places, for two purposes:

To mark the 'end boundary' of the cluster within the current run's GlyphString (in ::get_cluster_extents).
For updating the cluster_start pointer (ptr to 1st glyph of current cluster) when iterating into the next cluster.
For calculating current cluster's text boundaries (after converting to utf-8 index using cluster_end_index)

The first two purposes are consistant, but consider what happens in the case of an RTL run, when using next_cluster_index for the third purpose:

Suppose the run was 'ABCD' (caps representing RTL chars). Since each hebrew/arabic letter is 2-bytes long in UTF-8, we'll use 'AABBCCDD' to represent the UTF8 text. Also, since this is a RTL run, the clusters are reordered to 'DCBA'. Suppose that the iterator now points to the third cluster of the run (the 'B'), so cluster_index is 2, and next_cluster_index is 3 - pointing to the 'A'.

When the iterator tries to calculates the unicode index boundaries of the current cluster, it uses the pointers stored in the GlyphString. So, for 'start' it takes 2 (place of the first byte of B in the UTF8 text), and for 'end' it takes 0 - the start of the 'A'. Now, the code is aware of the fact that 'end' comes out lower than 'start' in rtl runs, and takes care of it by swapping them (this probably made it pass some tests). However, this is still a bug! If you look at the UTF8 string, you'll see that even after swapping, the boundaries (0 to 2) correspond to the 'A' which is the previous character, not the current one.

This behaviour comes about because when we want to refer to the unicode text, we actually need the start index of the next logical cluster, not the next visual one. The next visual cluster is only relevant for use within the reordered PangoGlyphItem.

OwenTaylor: The way I would implement things is to keep the following quantities in the PangoLayoutIter structure for the cluster:
- next_cluster_glyph: The first glyph in the next cluster. (May be glyphs->num_glyphs) cluster_end_index: The byte index of the *last* character in the cluster in visual order. (So, for the common case of the single character cluster, it's also the first character in the cluster.) cluster_x: The x offset of the left side of the current cluster cluster_width: The width of the current cluster
And then the following items for the current character
- index: byte index of the current character (already there) character_position: the visual position of the character in the cluster (so, always increases 0,1,2,3... as we iterate through a cluster)
Note that the size of the PangoLayoutIter structure doesn't have to be preserved, since it's internal to pangolayout.c. With the above quantities, then next_char(), next_cluster(), char_extents(), cluster_extents() are all pretty easy to implement as I defined them above.
- AmitAronovitch: Work in progress... It seems to me that with the above struct, char_extents() would still be cumbersome, because you must calculate the cluster's number of characters (so division would be right). Better to keep cluster_num_chars on the struct. This has the extra advantage that now you don't need cluster_end_index. (I'm assuming that cluster_end_index was kept so that next_char() would know when we're done. But now I can use cluster_num_chars for that). (following is older comment:)
  AmitAronovitch: That's a big relief (allowing changes in the struct). I was quite annoyed by having to recalculate stuff - it's quite hard to make the iterator behave nicely if you stick to current available fields, but I was more worried about making my patch acceptable and not breaking anything.

My fixes

(Lines 96c97)

As explained above, the next_cluster_start member of the PangoLayoutIter struct is used for two conflicting purposes. The solution to that should be splitting it to two properties with different names.

I decided on the name cluster_end to distinguish the third usecase (cluster's end boundary for utf8-text lookup) from the second one (next value for cluster_start). To make the struct remain the same size (as a form of binary compatability), I decided to keep just one of the two properties (cluster_end) stored on the struct - for the other cases the next_cluster_start function is called directly when needed.

(Lines 4318,4336c4350,4365)

The function cluster_end_index (uses the stored property to get the utf8 index relative to start of the run) is kept much the same. However, note that now the RTL is no special case - end of the run is indicated by item->length (not 0) as in LTR. This matches the fact that now next_char (4649c4678) also always increases in utf8 order.

(Lines 4317a4318,4334)

The function next_cluster_start is kept as is. Because it is not stored on the struct, it is now called directly for the first or second usecases explained above. To calculate the value of the new property, a new function cluster_end was added. With LTR it uses a.m. next_cluster_start, but for RTL it uses it's counterpart - next_logical_cluster_start_rtl. This new method is needed because in RTL text the next logical cluster comes before (to the left) of this one.

(Lines 4390,4394c4419)

As mentioned above, I believe x coordinates should go in "visual" order, no matter the run's directionality - so we should start on the left.

(Lines 4415c4441)

Note that when we enter a new RTL run, the clusters had been reversed. iter->index (current utf8 index) should point to the starting character of the current cluster, and this is usually NOT the first char in the run.

(Lines 4486,4516c4512,4547)

pango_layout_get_iter builds a new iterator for the PangoLayout. In case the layout text have both RTL and LTR runs on the first line, their order within the line might have been exchanged (e.g. if it's a RTL paragraph, first 'visual' run will be the last RTL run which fits into the line). At the beginning, iter->index should point to the first visual run, which is not always 0.

(Lines 4629c4660)

To avoid confusion, and emphasize the new property introduced above, I renamed next_cluster_index to abs_cluster_end_index.

(Lines 4644,4661c4675,4682)

As explained above, I iterate 'always forwards' within a cluster.

(Lines 4698,4704c4721,4728)

This includes the increasing x issue. More importantly: note the fact that I take the cluster width from the last glyph of the cluster - this is because that's where the hebrew engine stores it (I did NOT check if it works for multi-glyph clusters of other engines - just thought that maybe hebrew engine does this to comply with common practice - we should check it).

(Lines 4723c4747, Lines 4747a4772)

In the process of reading the code, I renamed prev_run_end to next_run_start. This makes more sense to me, but revert if you think otherwise. As before, start of the run is taken from the run itself.

(Lines 4853,4859c4878)

As explained above, the current range calculation is wrong, even if you do swap start & end.

(Lines 4888c4914)

I left the strange in-cluster logical behaviour intact (except a minor fix to make it match at the edge) - see above. p.s. - you can remove my comment-to-self a few lines above that.

Terminology

Bidi Algorithm

Text in right-to-left languages usually contains embedded left-to-right stuff (such as numbers, or latin words and phrases). The "Bidi Algorithm" is the algorithm employed to reorder the memory representation (logical order) of such bidirectional text into "visual order" (e.g. sorted by increasing (x,y) coordinates).

Unicode specifies a standard for this. Roughly, this works by dividing the paragraph into runs of consecutive characters with matching directionality. Clusters within each run are layed out from right to left or LTR according to the run's directionality. Then, the runs themselves are layed out within each line, to match the paragraphs directionality.

Character

Unicode character (code point). Unicode code points are numbered 0 to 0x10FFFF, and don't fit into a single byte or short-int. To hold them in memory, Pango uses the UTF-8 encoding, in which each character is represented by a fixed sequence of 1-4 bytes. UTF-8 is backwards compatable to ASCII (ASCII chars are represented by a single byte). For details, see

Unicode: Section 3.4

Glyph

An entity that matches a single font-entry (image). A font might use several

images to compose a single character, or use a single image for representing a sequence of two characters. This definition is similiar, but not identical to Unicode's Glyph.

Cluster

A minimal set of characters (or the matching sequence of glyphs) which interact graphicly when shaping. A cluster does not always consist of a single character. One common case is when the cluster corresponds to a single "letter", which is composed of a "base character" plus zero or more "combining characters" (e.g. Hebrew or Arabic "Points"). Another common case is Ligatures. The Bidi Algorithm reorders clusters as units. This is similiar, but not identical to Unicode's Grapheme Cluster.

Run

A sequence of clusters (or the corresponding glyphs or characters that represent them) that have matching values for attributes relevant to the bidi algorithm (language and directionality). A run corresponds to a single "level run" of Unicode's bidi algorithm. The PangoGlyphItem class (and it's alias PangoLayoutRun) is used for holding info relevant to a single run.

Ligature

A ligature is a sequence of consecutive characters, rendered as a single glyph. This is similiar to Unicode's Ligature.