This site has been retired. For up to date information, see handbook.gnome.org or gitlab.gnome.org.



PangoLayoutIter Bidi Support

This page is intended to facilitate my patch for Pango bug #89541. As OwenTaylor said, this issue involves some "ill-defined" work, so I use this page to explain my understanding of the underlying issues as clearly as I can. If you think I'm wrong on some point, please feel free to correct it directly on this page.

Current implementation

The current implementation of PangoLayoutIter works well for LTR-only text. However, although it does seem to contain code which is specific for RTL/bidi text, when trying to use it for real RTL text (even very simple cases) you quickly get "assertion errors". These errors indicate that the LayoutIter encounters situations which the programmer does not expect to happen (i.e.: lower-level Pango functions such as pango_shape and pango_itemize return GlyphItems whose structure is not understood by PangoLayoutIter).

According to the documentation, a PangoLayoutIter is supposed to iterate the text in "visual order". This usually means something like lexicographically increasing (x,y) positions, but as I explain below, there is still some room for interpretation. I'm not at all sure what the current implementation was trying to achieve (it's hard to follow the code's reasoning, because it does not seem to produce sensible results), but I still try to make my patches match what I do understand as far as possible.

Order of enumeration

The iterator works with text which was already reordered by lower-level functions into "visual order". Lines are divided into runs, which are divided into clusters. Each cluster is composed of a sequence of glyphs, and corresponds to a specific sub-sequence of the original text string (in many cases, each glyph corresponds to a single character, but this is not always the case). Note that the basic unit of reordering is the cluster - not the character or the glyph. A cluster is composed of a 'base-character' and zero or more 'combining marks' (in Hebrew and Arabic these are usually 'points'). The points are rendered above/below/inside the base character, so they all have the same logical extents, and no natural "visual" ordering.

So, in which order do we expect PangoLayoutIter::next_char to enumerate them? The standard (rule L3 of the Unicode Bidirectional Algorithm, UAX #9) is inconclusive:

L3. Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed.

The Hebrew shaper leaves the points in their original position (after the base char). I'm not sure about the other bidi engines, but from a brief look at the code, it seems that the Arabic and Syriac engines reverse the whole run (so points should come before the base char). I don't think that any of this matters, since neither way is really "visual" or "natural" (visual iteration on RTL text is opposite to the "natural" order anyway).

The current PangoLayoutIter implementation implements next_char by calling g_utf8_next_char on LTR text, while using g_utf8_prev_char for RTL text. This forces the code to use special 'if' clauses for handling cluster boundaries depending on directionality. It is also inconsistent with what the Hebrew engine does (Comment: I did not know what the Arabic engine does when I started to work on this patch).

My personal feeling is that since there is no real "natural" ordering, we should prefer the "utf8_next_char always" approach, because it requires less special-casing and makes the code easier to understand.
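The "utf8_next_char always" approach can be sketched as follows. This is only an illustration under stated assumptions, not the patch itself: utf8_next_offset is a hypothetical stand-in for GLib's g_utf8_next_char that works on byte offsets, and next_char_in_cluster is a made-up helper name. The point is that the run's direction never appears as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for GLib's g_utf8_next_char(): byte offset of the
 * character after the one starting at `offset`. Assumes valid UTF-8. */
static size_t utf8_next_offset(const char *text, size_t offset)
{
    unsigned char lead = (unsigned char) text[offset];

    assert(lead != 0);               /* must not step past the terminator */
    assert((lead & 0xC0) != 0x80);   /* must start on a character boundary */
    if (lead < 0x80)           return offset + 1;
    if ((lead & 0xE0) == 0xC0) return offset + 2;
    if ((lead & 0xF0) == 0xE0) return offset + 3;
    return offset + 4;
}

/* Advance one character inside the cluster [index, cluster_end).
 * Returns the new byte index, or cluster_end once the cluster is exhausted
 * (the caller would then move on to the next visual cluster).
 * The run's direction is deliberately not a parameter: LTR and RTL
 * clusters are walked the same way. */
static size_t next_char_in_cluster(const char *text, size_t index,
                                   size_t cluster_end)
{
    size_t next;

    assert(index < cluster_end);
    next = utf8_next_offset(text, index);
    return next < cluster_end ? next : cluster_end;
}
```

For a Hebrew cluster made of a base letter followed by two points (three 2-byte characters), repeated calls starting from offset 0 visit the base character first and then each point in storage order, which matches what the Hebrew shaper leaves in place.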

More ordering: the x coordinates

The next_cluster_start issue

The PangoLayoutIter class has a (private) member named next_cluster_start. It holds the index of the first Glyph of the next cluster, relative to the PangoGlyphString representing the current run.

This property is used in several places, for two purposes:

The first two purposes are consistent, but consider what happens in the case of an RTL run, when using next_cluster_index for the third purpose:

Suppose the run was 'ABCD' (caps representing RTL chars). Since each Hebrew/Arabic letter is 2 bytes long in UTF-8, we'll use 'AABBCCDD' to represent the UTF-8 text. Also, since this is an RTL run, the clusters are reordered to 'DCBA'. Suppose that the iterator now points to the third cluster of the run (the 'B'), so cluster_index is 2, and next_cluster_index is 3 - pointing to the 'A'.

When the iterator tries to calculate the UTF-8 index boundaries of the current cluster, it uses the pointers stored in the GlyphString. So, for 'start' it takes 2 (place of the first byte of B in the UTF-8 text), and for 'end' it takes 0 - the start of the 'A'. Now, the code is aware of the fact that 'end' comes out lower than 'start' in RTL runs, and takes care of it by swapping them (this probably made it pass some tests). However, this is still a bug! If you look at the UTF-8 string, you'll see that even after swapping, the boundaries (0 to 2) correspond to the 'A', which is the previous logical character, not the current one.

This behaviour comes about because when we want to refer to the UTF-8 text, we actually need the start index of the next logical cluster, not the next visual one. The next visual cluster is only relevant for use within the reordered PangoGlyphItem.
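The 'ABCD' example can be reproduced with a small model. The sketch below is a reconstruction from the description above, not actual Pango code: the plain int array plays the role of the glyph string's log_clusters (byte offset of each glyph's cluster, stored in visual order), it assumes one glyph per cluster, and the function name and the fallback to 0 at the end of the run are assumptions of mine.

```c
#include <assert.h>

typedef struct {
    int start;  /* first byte of the cluster in the run's UTF-8 text */
    int end;    /* one past the last byte */
} ByteRange;

/* Reconstruction of the faulty calculation: the end boundary is taken from
 * the next *visual* cluster, then start and end are swapped if they come
 * out reversed. Assumes one glyph per cluster. */
static ByteRange range_via_next_visual_cluster(const int *log_clusters,
                                               int n_glyphs,
                                               int cluster_start)
{
    ByteRange r;
    int next_visual;

    assert(cluster_start >= 0 && cluster_start < n_glyphs);
    next_visual = cluster_start + 1;

    r.start = log_clusters[cluster_start];
    /* Assumption: past the last glyph of an RTL run, the end falls back to 0. */
    r.end = next_visual < n_glyphs ? log_clusters[next_visual] : 0;

    if (r.end < r.start) {  /* the swap that hides the problem */
        int tmp = r.start;
        r.start = r.end;
        r.end = tmp;
    }
    return r;
}
```

With log_clusters = {6, 4, 2, 0} (the visual order 'DCBA') and cluster_start = 2 (the 'B'), this yields the range 0 to 2, which covers the bytes of 'A'. The bytes of 'B' are 2 to 4, and 4 is the offset stored for 'C', the next logical cluster.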

My fixes

As explained above, the next_cluster_start member of the PangoLayoutIter struct is used for two conflicting purposes. The solution is to split it into two properties with different names.

I decided on the name cluster_end to distinguish the third use case (the cluster's end boundary for UTF-8 text lookup) from the second one (the next value for cluster_start). To keep the struct the same size (as a form of binary compatibility), I decided to store just one of the two properties (cluster_end) on the struct - for the other cases the next_cluster_start function is called directly when needed.

The function cluster_end_index, which uses the stored property to get the UTF-8 index relative to the start of the run, is kept much the same. However, note that RTL is no longer a special case - the end of the run is indicated by item->length (not 0), as in LTR. This matches the fact that next_char (4649c4678) now also always advances in UTF-8 order.

The function next_cluster_start is kept as is. Because its result is no longer stored on the struct, it is now called directly for the first or second use cases explained above. To calculate the value of the new property, a new function cluster_end was added. With LTR it uses the above-mentioned next_cluster_start, but for RTL it uses its counterpart - next_logical_cluster_start_rtl. This new function is needed because in RTL text the next logical cluster comes before (to the left of) this one.
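A rough sketch of how these functions could fit together is given below. The function names follow the text, but the signatures, the plain int array standing in for the glyph string's log_clusters, and the -1 "no such cluster" return value are assumptions made for illustration. The patch may differ in detail.

```c
#include <assert.h>

/* Glyph index of the first glyph of the next *visual* cluster,
 * or n_glyphs if the current cluster is the last one in the run. */
static int next_cluster_start(const int *log_clusters, int n_glyphs,
                              int cluster_start)
{
    int i;

    assert(cluster_start >= 0 && cluster_start < n_glyphs);
    i = cluster_start + 1;
    while (i < n_glyphs && log_clusters[i] == log_clusters[cluster_start])
        i++;  /* skip the remaining glyphs of the current cluster */
    return i;
}

/* RTL only: index of a glyph belonging to the next *logical* cluster, which
 * sits immediately to the left of the current one, or -1 if there is none.
 * Every glyph of a cluster carries the same log_clusters value, so any glyph
 * of the neighbouring cluster will do for looking up its byte offset. */
static int next_logical_cluster_start_rtl(int cluster_start)
{
    assert(cluster_start >= 0);
    return cluster_start - 1;
}

/* Byte offset, relative to the start of the run, at which the current
 * cluster ends. For both directions the end of the run is item_length. */
static int cluster_end_index(const int *log_clusters, int n_glyphs,
                             int item_length, int cluster_start, int is_rtl)
{
    int next;

    if (is_rtl) {
        next = next_logical_cluster_start_rtl(cluster_start);
        return next >= 0 ? log_clusters[next] : item_length;
    }
    next = next_cluster_start(log_clusters, n_glyphs, cluster_start);
    return next < n_glyphs ? log_clusters[next] : item_length;
}
```

For the 'DCBA' run (log_clusters = {6, 4, 2, 0}, item_length = 8), the cluster at glyph index 2 now ends at byte 4, and the leftmost cluster ends at byte 8, so 'end' is never lower than 'start' and no swapping is needed.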

As mentioned above, I believe x coordinates should go in "visual" order, no matter the run's directionality - so we should start on the left.

Note that when we enter a new RTL run, the clusters have been reversed. iter->index (current UTF-8 index) should point to the starting character of the current cluster, and this is usually NOT the first char in the run.

pango_layout_get_iter builds a new iterator for the PangoLayout. If the layout text has both RTL and LTR runs on the first line, their order within the line might have been exchanged (e.g. if it's an RTL paragraph, the first 'visual' run will be the last RTL run which fits into the line). At the beginning, iter->index should point to the first visual run, which is not always 0.
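The two points above, about the index when entering a run and at the start of the layout, can be illustrated with a toy model. VisualRun and run_start_index are hypothetical names. The offset field mirrors PangoItem's offset, and the array mirrors the glyph string's log_clusters (byte offsets relative to the run, stored in visual order). This is a sketch of the idea, not the patched code.

```c
#include <assert.h>

/* Toy model of one run as it appears on a line, in visual order. */
typedef struct {
    int offset;               /* byte offset of the run in the layout text */
    const int *log_clusters;  /* per-glyph byte offsets relative to `offset` */
    int n_glyphs;
} VisualRun;

/* Byte index the iterator should hold when it enters `run`: the start of
 * the cluster that owns the leftmost glyph. For an RTL run that is the
 * logically last cluster, not the first byte of the run. */
static int run_start_index(const VisualRun *run)
{
    assert(run->n_glyphs > 0);
    return run->offset + run->log_clusters[0];
}
```

Take an RTL paragraph whose text is the 8-byte Hebrew run 'ABCD' followed by the Latin run 'xyz'. On screen the Latin run comes first (leftmost), so a fresh iterator would start at index 8. For a line holding only the 'ABCD' run, it would start at index 6, the first byte of 'D'.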

To avoid confusion, and emphasize the new property introduced above, I renamed next_cluster_index to abs_cluster_end_index.

As explained above, I iterate 'always forwards' within a cluster.

This includes the increasing x issue. More importantly: note the fact that I take the cluster width from the last glyph of the cluster - this is because that's where the Hebrew engine stores it (I did NOT check whether it works for multi-glyph clusters of other engines - I just thought that maybe the Hebrew engine does this to comply with common practice - we should check it).

In the process of reading the code, I renamed prev_run_end to next_run_start. This makes more sense to me, but revert if you think otherwise. As before, start of the run is taken from the run itself.

As explained above, the current range calculation is wrong, even if you do swap start & end.

I left the strange in-cluster logical behaviour intact (except a minor fix to make it match at the edge) - see above. p.s. - you can remove my comment-to-self a few lines above that.

Terminology


2024-10-23 11:37