Unicode BiDi Algorithm - tq08g2z - ChinaUnix Blog

Unicode BiDi Algorithm

Category: C/C++

2012-06-29 21:46:03


1 Introduction

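The Unicode Standard prescribes that a string is stored in memory in logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has a uniform horizontal direction, then the ordering of the display text is unambiguous.

However, because these right-to-left scripts use digits that are written from left to right, the text is actually bidirectional: a mixture of right-to-left and left-to-right text. In addition to digits, embedded words from English and other left-to-right scripts also produce bidirectional text. Without a clear specification, ambiguities can arise in determining the ordering of the displayed characters when the horizontal direction of the text is not uniform.

This annex describes the algorithm used to determine the directionality of bidirectional Unicode text. The algorithm extends the implicit model widely used in existing implementations and adds explicit formatting codes for special circumstances. In most cases, there is no need to include additional information with the text to obtain the correct display ordering.

However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text. To deal with these cases, a minimal set of directional formatting codes is defined to control the ordering of characters when rendered. This allows exact control of the display ordering for legible interchange and ensures that plain text used for simple items like filenames or labels can always be correctly ordered for display.

The directional formatting codes are used only to influence the display ordering of text. In all other respects they should be ignored: they have no effect on the comparison of text or on word breaks, parsing, or numeric analysis.

When working with bidirectional text, the characters are still interpreted in logical order; only the display is affected. The display ordering of bidirectional text depends on the directional properties of the characters in the text. Note that there are important security issues connected with bidirectional text: see [] for more information.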

2 Directional Formatting Codes

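Two types of explicit codes are used to modify the standard Unicode Bidirectional Algorithm (UBA). In addition, there are implicit ordering codes, the right-to-left and left-to-right marks. The effects of all of these codes are limited to the current paragraph; thus, they are terminated by a paragraph separator. The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters.

These controls all have the property Bidi_Control, and are divided into two groups:

Implicit Bidi Controls

U+200E..U+200F LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK

Explicit Bidi Controls

U+202A..U+202E LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE

On web pages, the explicit bidi controls should be replaced by the dir attribute with the value dir="ltr" or dir="rtl". For more information, see [].

Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. On the contrary, characters within an embedding can affect the ordering of characters outside it. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation is defined by reference to the behavior of the explicit codes in this algorithm.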

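2.1 Explicit Directional Embedding

The following codes signal that a piece of text is to be treated as embedded. For example, an English quotation in the middle of an Arabic sentence could be marked as being embedded left-to-right text. If there were a Hebrew phrase in the middle of the English quotation, that phrase could be marked as being embedded right-to-left. These codes allow for nested embeddings.

Abbr.	Code	Name	Description
LRE	U+202A	LEFT-TO-RIGHT EMBEDDING	Treat the following text as embedded left-to-right.
RLE	U+202B	RIGHT-TO-LEFT EMBEDDING	Treat the following text as embedded right-to-left.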

The effect of right-left line direction, for example, can be accomplished by embedding the text with RLE...PDF.
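
As a small illustration of the RLE...PDF pairing, the sketch below wraps a string in the two codes and prints their character names. The helper name `embed_rtl` is hypothetical and only serves this example; the code points are the ones listed in the tables of this section.

```python
import unicodedata

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def embed_rtl(text):
    """Wrap text so that it is treated as an embedded right-to-left run.

    Hypothetical helper: the explicit code goes first and the matching
    PDF closes its scope.
    """
    return RLE + text + PDF

wrapped = embed_rtl("abc 123")
# Show which control characters were added at both ends.
print([unicodedata.name(ch) for ch in (wrapped[0], wrapped[-1])])
```

The wrapped text still contains the original characters in logical order; only a renderer that implements the algorithm changes how they are displayed.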

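2.2 Explicit Directional Override

The following codes allow the bidirectional character types to be overridden when required for special cases, such as for part numbers. These codes allow for nested directional overrides. For security reasons, these characters should be avoided wherever possible. For more information, see [].

Abbr.	Code	Name	Description
LRO	U+202D	LEFT-TO-RIGHT OVERRIDE	Force the following characters to be treated as strong left-to-right characters.
RLO	U+202E	RIGHT-TO-LEFT OVERRIDE	Force the following characters to be treated as strong right-to-left characters.

The precise meaning of these codes will be made clear in the discussion of the algorithm. The right-to-left override, for example, can be used to force a part number made of mixed English, digits, and Hebrew letters to be written from right to left.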

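2.3 Terminating Explicit Directional Code

The following code terminates the effects of the last explicit code (either embedding or override) and restores the bidirectional state to what it was before that code.

Abbr.	Code	Name	Description
PDF	U+202C	POP DIRECTIONAL FORMATTING	Restore the bidirectional state to what it was before the last LRE, RLE, RLO, or LRO.

The precise meaning of this code will be made clear in the discussion of the algorithm.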

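2.4 Implicit Directional Marks

These characters are very lightweight codes. They act exactly like right-to-left or left-to-right characters, except that they do not display or have any other semantic effect. Because their scope is much more local, they are more convenient to use than explicit embeddings or overrides.

Abbr.	Code	Name	Description
LRM	U+200E	LEFT-TO-RIGHT MARK	Left-to-right zero-width character
RLM	U+200F	RIGHT-TO-LEFT MARK	Right-to-left zero-width character

There is no special mention of the implicit directional marks in the following algorithm. That is because their effect on bidirectional ordering is exactly the same as that of a corresponding strong directional character; the only difference is that they do not appear in the display.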
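
The claim that the marks behave like strong characters can be checked against the character database that ships with Python. This is only a quick sketch; the dictionary name `mark_info` is made up for the example.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Map each mark to (Bidi_Class, General_Category) as reported by unicodedata.
mark_info = {
    unicodedata.name(mark): (unicodedata.bidirectional(mark),
                             unicodedata.category(mark))
    for mark in (LRM, RLM)
}
for name, (bidi_class, category) in mark_info.items():
    print(name, bidi_class, category)
```

Both marks are format characters (category Cf) with the strong types L and R respectively, which is why the algorithm needs no dedicated rule for them.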

3 Basic Display Algorithm

The Unicode Bidirectional Algorithm (UBA) takes a stream of text as input and proceeds in four main phases:

Separation into paragraphs. The rest of the algorithm is applied separately to the text within each paragraph.
Initialization. A list of directional character types is initialized, with one entry for each character in the original text. The value of each entry is the Bidi_Class property of the respective character. After this point, the original characters are no longer referenced until the reordering phase. A list of embedding levels, with one level per character, is then initialized.
Resolution of the embedding levels. A series of rules are applied to the lists of embedding levels and directional character types. Each rule is based on the current values of those lists, and can modify those values. Each rule is applied to each of the values in sequence before continuing to the next rule. The result of this phase is a modified list of embedding levels; the list of directional character types is no longer needed.
Reordering. The text within each paragraph is reordered for display: first, the text in the paragraph is broken into lines, then the resolved embedding levels are used to reorder the text of each line for display.

The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality, and Section 5.8, Newline Guidelines of []). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs.

Combining characters always attach to the preceding base character in the memory representation. Even after reordering for display and performing character shaping, the glyph representing a combining character will attach to the glyph representing its base character in memory. Depending on the line orientation and the placement direction of base letterform glyphs, it may, for example, attach to the glyph on the left, or on the right, or above.

This annex uses the numbering conventions for normative definitions and rules in Table 1.

Table 1. Normative Definitions and Rules

Numbering	Section
BDn	Definitions
Pn	Paragraph levels
Xn	Explicit levels and directions
Wn	Weak types
Nn	Neutral types
In	Implicit levels
Ln	Resolved levels

3.1 Definitions

BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters. The formal property name in the Unicode Character Database [] is Bidi_Class.

BD2. Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61.

Embedding levels are explicitly set by both override format codes and by embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.

BD3. The default direction of the current embedding level (for the character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd.

For example, in a particular piece of text, Level 0 is plain English text. Level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embedding level will become clear when the reordering algorithm is discussed, but the following provides an example of how the algorithm works.

BD4. The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph.

BD5. The direction of the paragraph embedding level is called the paragraph direction.

In some contexts the paragraph direction is also known as the base direction.

BD6. The directional override status determines whether the bidirectional type of characters is to be reset. The override status is set by using explicit directional controls. This status has three states, as shown in Table 2.

Table 2. Directional Override Status

Status	Interpretation
Neutral	No override is currently active
Right-to-left	Characters are to be reset to R
Left-to-right	Characters are to be reset to L

BD7. A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level (a level run is also known as a directional run).

Example

In this and the following examples, case is used to indicate different implicit character types for those unfamiliar with right-to-left letters. Uppercase letters stand for right-to-left characters (such as Arabic or Hebrew), and lowercase letters stand for left-to-right characters (such as English or Russian).

Memory:           car is THE CAR in arabic
Character types:  LLL-LL-RRR-RRR-LL-LLLLLL
Resolved levels:  000000011111110000000000

Notice that the neutral character (space) between THE and CAR gets the level of the surrounding characters. The level of the neutral characters can also be changed by inserting appropriate directional marks around neutral characters. These marks have no other effects.
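
The notion of a level run can be made concrete with a short sketch that splits the resolved levels of the example above into maximal runs. The function name `level_runs` and the `(start, end, level)` tuple layout are choices made for this illustration only.

```python
from itertools import groupby

def level_runs(levels):
    """Split a list of embedding levels into maximal runs.

    Returns (start, end, level) triples, with `end` exclusive.
    """
    runs = []
    pos = 0
    for level, group in groupby(levels):
        length = len(list(group))
        runs.append((pos, pos + length, level))
        pos += length
    return runs

text = "car is THE CAR in arabic"
levels = [int(c) for c in "000000011111110000000000"]
for start, end, level in level_runs(levels):
    print(level, repr(text[start:end]))
```

For this input the three runs are `'car is '` at level 0, `'THE CAR'` at level 1, and `' in arabic'` at level 0.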

Table 3 lists additional abbreviations used in the examples and internal character types used in the algorithm.

Table 3. Abbreviations for Examples and Internal Types

Symbol	Description
N	Neutral or Separator (B, S, WS, ON)
e	The text ordering type (L or R) that matches the embedding level direction (even or odd)
sor	The text ordering type (L or R) assigned to the position before a level run.
eor	The text ordering type (L or R) assigned to the position after a level run.

3.2 Bidirectional Character Types

The normative bidirectional character types for each character are specified in the [] and are summarized in Table 4. This is a summary only: there are exceptions to the general scope. For example, certain characters such as U+0CBF kannada vowel sign I are given Type L (instead of NSM) to preserve canonical equivalence.

The term European digits is used to refer to decimal forms common in Europe and elsewhere, and Arabic-Indic digits to refer to the native Arabic forms. (See Section 8.2, Arabic of [], for more details on naming digits.)
Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirectional types may change. For assignments to character types, see DerivedBidiClass.txt [] in the [].
Private-use characters can be assigned different values by a conformant implementation.
For the purpose of the Bidirectional Algorithm, inline objects (such as graphics) are treated as if they are an U+FFFC object replacement character.
As of Unicode 4.0, the Bidirectional Character Types of a few Indic characters were altered so that the Bidirectional Algorithm preserves canonical equivalence. That is, two canonically equivalent strings will result in equivalent ordering after applying the algorithm. This invariant will be maintained in the future.
Note: The Bidirectional Algorithm does not preserve compatibility equivalence.

Table 4. Bidirectional Character Types

Category	Type	Description	General Scope
Strong	L	Left-to-Right	LRM, most alphabetic, syllabic, Han ideographs, non-European or non-Arabic digits, ...
	LRE	Left-to-Right Embedding	LRE
	LRO	Left-to-Right Override	LRO
	R	Right-to-Left	RLM, Hebrew alphabet, and related punctuation
	AL	Right-to-Left Arabic	Arabic, Thaana, and Syriac alphabets, most punctuation specific to those scripts, ...
	RLE	Right-to-Left Embedding	RLE
	RLO	Right-to-Left Override	RLO
Weak	PDF	Pop Directional Format	PDF
	EN	European Number	European digits, Eastern Arabic-Indic digits, ...
	ES	European Number Separator	plus sign, minus sign
	ET	European Number Terminator	degree sign, currency symbols, ...
	AN	Arabic Number	Arabic-Indic digits, Arabic decimal and thousands separators, ...
	CS	Common Number Separator	colon, comma, full stop (period), no-break space, ...
	NSM	Nonspacing Mark	Characters marked Mn (Nonspacing_Mark) and Me (Enclosing_Mark) in the Unicode Character Database
	BN	Boundary Neutral	Default ignorables, non-characters, and control characters, other than those explicitly given other types.
Neutral	B	Paragraph Separator	paragraph separator, appropriate Newline Functions, higher-level protocol paragraph determination
	S	Segment Separator	Tab
	WS	Whitespace	space, figure space, line separator, form feed, General Punctuation spaces, ...
	ON	Other Neutrals	All other characters, including object replacement character

3.3 Resolving Embedding Levels

The body of the Bidirectional Algorithm uses character types and explicit codes to produce a list of resolved levels. This resolution process consists of five steps: (1) determining the paragraph level; (2) determining explicit embedding levels and directions; (3) resolving weak types; (4) resolving neutral types; and (5) resolving implicit embedding levels.

3.3.1 The Paragraph Level

P1. Split the text into separate paragraphs. A paragraph separator is kept with the previous paragraph. Within each paragraph, apply all the other rules of this algorithm.

P2. In each paragraph, find the first character of type L, AL, or R.

Because paragraph separators delimit text in this algorithm, this will generally be the first strong character after a paragraph separator or at the very beginning of the text. Note that the characters of type LRE, LRO, RLE, or RLO are ignored in this rule. This is because typically they are used to indicate that the embedded text is in the opposite direction from the paragraph level.

P3. If a character is found in P2 and it is of type AL or R, then set the paragraph embedding level to one; otherwise, set it to zero.

Whenever a higher-level protocol specifies the paragraph level, rules P2 and P3 may be overridden.
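
A minimal sketch of the two rules that pick the paragraph level, assuming the paragraph has already been split off and that Python's `unicodedata.bidirectional` is an acceptable source for Bidi_Class values. The function name `paragraph_level` is illustrative, and the character database bundled with Python may be newer than the version of the algorithm described here.

```python
import unicodedata

def paragraph_level(paragraph):
    """Return 0 or 1 based on the first strong character.

    Explicit codes such as RLE or LRO have their own Bidi_Class values,
    so they are skipped automatically, as the rule requires.
    """
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0  # no strong character found

print(paragraph_level("car means \u05d0"))   # starts with Latin letters
print(paragraph_level("123 \u0627 abc"))     # digits and space are not strong
```

The first call yields level 0 and the second yields level 1, because the Arabic letter is the first strong character even though digits and a space precede it.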

3.3.2 Explicit Levels and Directions

All explicit embedding levels are determined from the embedding and override codes, by applying the explicit level rules X1 through X9. These rules are applied as part of the same logical pass over the input. As each character is processed, the current embedding level and the directional override status are tracked, being adjusted or kept the same depending on the type of that character. In turn, the current embedding level and the directional override status affect the assignment of the explicit embedding level for each character as defined by rules X1 through X9.

Explicit Embeddings

X1. Begin by setting the current embedding level to the paragraph embedding level. Set the directional override status to neutral. Process each character iteratively, applying rules X2 through X9. Only embedding levels from 0 to 61 are valid in this phase.

In the resolution of levels in rules I1 and I2, the maximum embedding level of 62 can be reached.

X2. With each RLE, compute the least greater odd embedding level.

a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to neutral.
b. If the new level would not be valid, then this code is invalid. Do not change the current level or override status.

For example, level 0 → 1; levels 1, 2 → 3; levels 3, 4 → 5; ...59, 60 → 61; above 60, no change (do not change levels with RLE if the new level would be invalid).

X3. With each LRE, compute the least greater even embedding level.

a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to neutral.
b. If the new level would not be valid, then this code is invalid. Do not change the current level or override status.

For example, levels 0, 1 → 2; levels 2, 3 → 4; levels 4, 5 → 6; ...58, 59 → 60; above 59, no change (do not change levels with LRE if the new level would be invalid).
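
The "least greater odd" and "least greater even" computations can be written with two small bit tricks. This is a sketch; the function names and the use of `None` to signal an invalid result are assumptions made for the example.

```python
MAX_EXPLICIT_LEVEL = 61

def next_odd_level(level):
    """Least odd level greater than `level`, or None if it would be invalid."""
    new_level = (level + 1) | 1
    return new_level if new_level <= MAX_EXPLICIT_LEVEL else None

def next_even_level(level):
    """Least even level greater than `level`, or None if it would be invalid."""
    new_level = (level + 2) & ~1
    return new_level if new_level <= MAX_EXPLICIT_LEVEL else None

print([next_odd_level(n) for n in (0, 1, 2, 59, 60, 61)])
print([next_even_level(n) for n in (0, 1, 2, 58, 59, 60)])
```

The printed lists reproduce the transitions given in the two examples above, with `None` standing for "no change".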

Explicit Overrides

An explicit directional override sets the embedding level in the same way the explicit embedding codes do, but also changes the directional character type of affected characters to the override direction.

X4. With each RLO, compute the least greater odd embedding level.

a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to right-to-left.
b. If the new level would not be valid, then this code is invalid. Do not change the current level or override status.

X5. With each LRO, compute the least greater even embedding level.

a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to left-to-right.
b. If the new level would not be valid, then this code is invalid. Do not change the current level or override status.

X6. For all types besides BN, RLE, LRE, RLO, LRO, and PDF:

a. Set the level of the current character to the current embedding level.
b. Whenever the directional override status is not neutral, reset the current character type according to the directional override status.

If the directional override status is neutral, then characters retain their normal types: Arabic characters stay AL, Latin characters stay L, neutrals stay N, and so on. If the directional override status is R, then characters become R. If the directional override status is L, then characters become L. The current embedding level is not changed by this rule.

Terminating Embeddings and Overrides

There is a single code to terminate the scope of the current explicit code, whether an embedding or a directional override. All codes and pushed states are completely popped at the end of paragraphs.

X7. With each PDF, determine the matching embedding or override code. If there was a valid matching code, restore (pop) the last remembered (pushed) embedding level and directional override.

X8. All explicit directional embeddings and overrides are completely terminated at the end of each paragraph. Paragraph separators are not included in the embedding.
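
The explicit-level pass described so far can be sketched as a single loop over a list of Bidi_Class abbreviations. The function name `explicit_levels`, its return shape, and the way invalid codes are remembered (a `None` marker on the stack, so that a later PDF matches the invalid code and changes nothing) are assumptions of this sketch, not something the text prescribes. Explicit codes and BN are simply left out of the output.

```python
MAX_LEVEL = 61
# code -> (direction of the new level, override status it sets)
EXPLICIT = {"RLE": ("R", None), "LRE": ("L", None),
            "RLO": ("R", "R"), "LRO": ("L", "L")}

def explicit_levels(types, paragraph_level):
    """Return (types, levels) for the characters that remain after the pass."""
    level, override = paragraph_level, None   # start at the paragraph level
    stack = []                                # saved states; None = invalid code
    out_types, out_levels = [], []
    for t in types:
        if t in EXPLICIT:
            direction, new_override = EXPLICIT[t]
            new_level = (level + 1) | 1 if direction == "R" else (level + 2) & ~1
            if new_level <= MAX_LEVEL:
                stack.append((level, override))
                level, override = new_level, new_override
            else:
                stack.append(None)            # invalid: state is unchanged
        elif t == "PDF":
            if stack:
                saved = stack.pop()
                if saved is not None:         # only a valid code restores state
                    level, override = saved
        elif t == "B":                        # paragraph end terminates everything
            stack.clear()
            level, override = paragraph_level, None
            out_types.append(t)
            out_levels.append(level)
        elif t != "BN":                       # ordinary character
            out_types.append(override or t)
            out_levels.append(level)
    return out_types, out_levels

print(explicit_levels(["L", "RLE", "R", "LRE", "L", "PDF", "R", "PDF", "L"], 0))
print(explicit_levels(["L", "RLO", "L", "EN", "PDF", "L"], 0))
```

In the first call the levels come out as 0, 1, 2, 1, 0. In the second, the two characters inside the override are retyped to R at level 1.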

X9. Remove all RLE, LRE, RLO, LRO, PDF, and BN codes.

Note that an implementation does not have to actually remove the codes; it just has to behave as though the codes were not present for the remainder of the algorithm. Conformance does not require any particular placement of these codes as long as all other characters are ordered correctly.
See Section 5 for information on implementing the algorithm without removing the formatting codes.
The zero width joiner and non-joiner affect the shaping of the adjacent characters (those that are adjacent in the original backing-store order), even though those characters may end up being rearranged to be non-adjacent by the Bidirectional Algorithm. For more information, see Section 5.3.

X10. The remaining rules are applied to each run of characters at the same level. For each run, determine the start-of-level-run (sor) and end-of-level-run (eor) type, either L or R. This depends on the higher of the two levels on either side of the boundary (at the start or end of the paragraph, the level of the “other” run is the base embedding level). If the higher level is odd, the type is R; otherwise, it is L.

For example:

Levels:  0   0   0   1   1   1   2
Runs:    <--- 1 ---> <--- 2 ---> <3>

Run 1 is at level 0, sor is L, eor is R.
Run 2 is at level 1, sor is R, eor is L.
Run 3 is at level 2, sor is L, eor is L.

For two adjacent runs, the eor of the first run is the same as the sor of the second.
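
The sor and eor values of the example can be reproduced with a few lines. This is a sketch under the assumption that the paragraph level is 0, which the example implies; `run_boundaries` is a name invented for the illustration.

```python
from itertools import groupby

def run_boundaries(levels, paragraph_level):
    """Return (level, sor, eor) for every level run."""
    run_levels = [level for level, _ in groupby(levels)]
    result = []
    for i, level in enumerate(run_levels):
        # At the paragraph edges the "other" run is the paragraph level.
        before = run_levels[i - 1] if i > 0 else paragraph_level
        after = run_levels[i + 1] if i + 1 < len(run_levels) else paragraph_level
        sor = "R" if max(before, level) % 2 else "L"
        eor = "R" if max(level, after) % 2 else "L"
        result.append((level, sor, eor))
    return result

for level, sor, eor in run_boundaries([0, 0, 0, 1, 1, 1, 2], 0):
    print(f"level {level}: sor is {sor}, eor is {eor}")
```

The output matches the three runs listed above, and the eor of each run equals the sor of the run that follows it.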

3.3.3 Resolving Weak Types

Weak types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used.

Nonspacing marks are now resolved based on the previous characters.

W1. Examine each nonspacing mark (NSM) in the level run, and change the type of the NSM to the type of the previous character. If the NSM is at the start of the level run, it will get the type of sor.

Assume in this example that sor is R:

AL NSM NSM → AL AL AL
sor NSM → sor R

The text is next parsed for numbers. This pass will change the directional types European Number Separator, European Number Terminator, and Common Number Separator to be European Number text, Arabic Number text, or Other Neutral text. The text to be scanned may have already had its type altered by directional overrides. If so, then it will not parse as numeric.

W2. Search backward from each instance of a European number until the first strong type (R, L, AL, or sor) is found. If an AL is found, change the type of the European number to Arabic number.

AL EN → AL AN
AL N EN → AL N AN
sor N EN → sor N EN
L N EN → L N EN
R N EN → R N EN

W3. Change all ALs to R.

W4. A single European separator between two European numbers changes to a European number. A single common separator between two numbers of the same type changes to that type.

EN ES EN → EN EN EN
EN CS EN → EN EN EN
AN CS AN → AN AN AN

W5. A sequence of European terminators adjacent to European numbers changes to all European numbers.

ET ET EN → EN EN EN
EN ET ET → EN EN EN
AN ET EN → AN EN EN

W6. Otherwise, separators and terminators change to Other Neutral.

AN ET → AN ON
L ES EN → L ON EN
EN CS AN → EN ON AN
ET AN → ON AN

W7. Search backward from each instance of a European number until the first strong type (R, L, or sor) is found. If an L is found, then change the type of the European number to L.

L N EN → L N L
R N EN → R N EN
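
The weak-type rules can be sketched as a sequence of passes over one level run. The input is assumed to be a list of Bidi_Class abbreviations for the run (explicit codes and BN already removed) together with the run's sor; the name `resolve_weak` is hypothetical. Note that the examples above show the effect of a single rule, whereas the function applies all seven in order.

```python
def resolve_weak(types, sor):
    """Apply the weak-type rules, in order, to a single level run."""
    t = list(types)
    n = len(t)
    for i in range(n):                        # NSM takes the previous type
        if t[i] == "NSM":
            t[i] = t[i - 1] if i > 0 else sor
    last_strong = sor                         # EN after AL becomes AN
    for i in range(n):
        if t[i] in ("L", "R", "AL"):
            last_strong = t[i]
        elif t[i] == "EN" and last_strong == "AL":
            t[i] = "AN"
    t = ["R" if x == "AL" else x for x in t]  # AL becomes R
    for i in range(1, n - 1):                 # single separator between numbers
        if t[i] == "ES" and t[i - 1] == "EN" and t[i + 1] == "EN":
            t[i] = "EN"
        elif t[i] == "CS" and t[i - 1] == t[i + 1] and t[i - 1] in ("EN", "AN"):
            t[i] = t[i - 1]
    i = 0
    while i < n:                              # terminators next to EN
        if t[i] == "ET":
            j = i
            while j < n and t[j] == "ET":
                j += 1
            if (i > 0 and t[i - 1] == "EN") or (j < n and t[j] == "EN"):
                t[i:j] = ["EN"] * (j - i)
            i = j
        else:
            i += 1
    t = ["ON" if x in ("ES", "ET", "CS") else x for x in t]  # leftovers
    last_strong = sor                         # EN after L becomes L
    for i in range(n):
        if t[i] in ("L", "R"):
            last_strong = t[i]
        elif t[i] == "EN" and last_strong == "L":
            t[i] = "L"
    return t

print(resolve_weak(["AL", "ON", "EN"], "L"))
print(resolve_weak(["L", "ES", "EN"], "R"))
```

The first call turns the digit into an Arabic number and the AL into R. The second leaves the lone separator as ON and then retypes the digit to L because the nearest preceding strong type is L.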

3.3.4 Resolving Neutral Types

Neutral types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used.

The next phase resolves the direction of the neutrals. The results of this phase are that all neutrals become either R or L. Generally, neutrals take on the direction of the surrounding text. In case of a conflict, they take on the embedding direction.

N1. A sequence of neutrals takes the direction of the surrounding strong text if the text on both sides has the same direction. European and Arabic numbers act as if they were R in terms of their influence on neutrals. Start-of-level-run (sor) and end-of-level-run (eor) are used at level run boundaries.

L N L → L L L
R N R → R R R
R N AN → R R AN
R N EN → R R EN
AN N R → AN R R
AN N AN → AN R AN
AN N EN → AN R EN
EN N R → EN R R
EN N AN → EN R AN
EN N EN → EN R EN

N2. Any remaining neutrals take the embedding direction.

N → e

The embedding direction for the given neutral character is derived from its embedding level: L if the character is set to an even level, and R if the level is odd. (See BD3.)

Assume in the following example that eor is L and sor is R. Then an application of N1 and N2 yields the following:

L N eor → L L eor
R N eor → R e eor
sor N L → sor e L
sor N R → sor R R
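
Both neutral rules fit in one pass over a level run, as in the sketch below. It assumes the weak types have already been resolved, so every entry is L, R, EN, AN, or a neutral (B, S, WS, ON); the function name `resolve_neutrals` and its parameters are illustrative.

```python
NEUTRALS = {"B", "S", "WS", "ON"}

def resolve_neutrals(types, level, sor, eor):
    """Give every neutral in a level run the type L or R."""
    t = list(types)
    n = len(t)
    embedding_direction = "R" if level % 2 else "L"

    def direction(bidi_type):
        # European and Arabic numbers influence neutrals as if they were R.
        return "R" if bidi_type in ("R", "EN", "AN") else "L"

    i = 0
    while i < n:
        if t[i] in NEUTRALS:
            j = i
            while j < n and t[j] in NEUTRALS:
                j += 1
            before = direction(t[i - 1]) if i > 0 else sor
            after = direction(t[j]) if j < n else eor
            # Same direction on both sides wins; otherwise use the embedding direction.
            fill = before if before == after else embedding_direction
            t[i:j] = [fill] * (j - i)
            i = j
        else:
            i += 1
    return t

print(resolve_neutrals(["AN", "WS", "EN"], 1, "R", "R"))
print(resolve_neutrals(["R", "ON"], 1, "R", "L"))
```

In the second call the neutral sits between R and an eor of L, so the sides disagree and it takes the embedding direction of the level-1 run, which is R.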

Examples. A list of numbers separated by neutrals and embedded in a directional run will come out in the run’s order.

Storage: he said "THE VALUES ARE 123, 456, 789, OK".
Display: he said "KO ,789 ,456 ,123 ERA SEULAV EHT".

In this case, both the comma and the space between the numbers take on the direction of the surrounding text (uppercase = right-to-left), ignoring the numbers. The commas are not considered part of the number because they are not surrounded on both sides by digits (see Section 3.3.3, Resolving Weak Types). However, if there is a preceding left-to-right sequence, then European numbers will adopt that direction:

Storage: IT IS A bmw 500, OK.
Display: .KO ,bmw 500 A SI TI

3.3.5 Resolving Implicit Levels

In the final phase, the embedding level of text may be increased, based on the resolved character type. Right-to-left text will always end up with an odd level, and left-to-right and numeric text will always end up with an even level. In addition, numeric text will always end up with a higher level than the paragraph level. (Note that it is possible for text to end up at levels higher than 61 as a result of this process.) This results in the following rules:

I1. For all characters with an even (left-to-right) embedding direction, those of type R go up one level and those of type AN or EN go up two levels.

I2. For all characters with an odd (right-to-left) embedding direction, those of type L, EN or AN go up one level.

Table 5 summarizes the results of the implicit algorithm.

Table 5. Resolving Implicit Levels

Type	Even embedding level	Odd embedding level
L	EL	EL+1
R	EL+1	EL
AN	EL+2	EL+1
EN	EL+2	EL+1
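
Table 5 translates directly into a lookup. A minimal sketch, assuming the type has already been resolved to one of L, R, AN, or EN; the names `implicit_level`, `EVEN_BUMP`, and `ODD_BUMP` are invented for the example.

```python
# Amount added to the embedding level, per resolved type (see Table 5).
EVEN_BUMP = {"L": 0, "R": 1, "AN": 2, "EN": 2}
ODD_BUMP = {"L": 1, "R": 0, "AN": 1, "EN": 1}

def implicit_level(bidi_type, embedding_level):
    """Return the resolved level for a character of the given resolved type."""
    bump = EVEN_BUMP if embedding_level % 2 == 0 else ODD_BUMP
    return embedding_level + bump[bidi_type]

for bidi_type in ("L", "R", "AN", "EN"):
    print(bidi_type, implicit_level(bidi_type, 0), implicit_level(bidi_type, 1))
```

Calling `implicit_level("EN", 61)` gives 62, which illustrates the remark that text can end up above level 61 in this phase.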

3.4 Reordering Resolved Levels

The following rules describe the logical process of finding the correct display order. As opposed to resolution phases, these rules act on a per-line basis and are applied after any line wrapping is applied to the paragraph.

Logically there are the following steps:

The levels of the text are determined according to the previous rules.
The characters are shaped into glyphs according to their context (taking the embedding levels into account for mirroring).
The accumulated widths of those glyphs (in logical order) are used to determine line breaks.
For each line, rules L1 through L4 are used to reorder the characters on that line.
The glyphs corresponding to the characters on the line are displayed in that order.

L1. On each line, reset the embedding level of the following characters to the paragraph embedding level:

Segment separators,
Paragraph separators,
Any sequence of whitespace characters preceding a segment separator or paragraph separator, and
Any sequence of white space characters at the end of the line.

The types of characters used here are the original types, not those modified by the previous phase.
Because a paragraph separator breaks lines, there will be at most one per line, at the end of that line.

In combination with the following rule, this means that trailing whitespace will appear at the visual end of the line (in the paragraph direction). Tabulation will always have a consistent direction within a paragraph.
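
The level reset for separators and trailing whitespace can be sketched as a single backward scan over one line. The inputs are assumed to be the original Bidi_Class abbreviations (not the resolved ones) and the resolved levels for that line; `reset_separator_levels` is a hypothetical name.

```python
def reset_separator_levels(original_types, levels, paragraph_level):
    """Return a copy of `levels` with separators and trailing whitespace reset."""
    out = list(levels)
    # True while we are in whitespace that precedes a separator or the line end.
    trailing = True
    for i in range(len(out) - 1, -1, -1):
        bidi_type = original_types[i]
        if bidi_type in ("S", "B"):
            out[i] = paragraph_level
            trailing = True
        elif bidi_type == "WS" and trailing:
            out[i] = paragraph_level
        else:
            trailing = False
    return out

# Two spaces at the end of a line that were inside a right-to-left embedding.
print(reset_separator_levels(["L", "S", "R", "WS", "WS"], [0, 0, 1, 1, 1], 0))
```

Scanning from the end makes it easy to recognise whitespace that precedes a separator, since the separator is seen first.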

L2. From the highest level found in the text to the lowest odd level on each line, including intermediate levels not actually present in the text, reverse any contiguous sequence of characters that are at that level or higher.

This rule reverses a progressively larger series of substrings.

The following examples illustrate the reordering, showing the successive steps in application of Rule L2. The original text, including any embedding codes for producing the particular levels, is shown in the "Storage" row in the example tables. The application of the rules from Section 3.3 and of Rule L1 results in (a) text with Bidi Controls and BN characters removed, plus (b) resolved levels. These are listed in the rows "Before Reordering" and "Resolved Levels". Each successive row thereafter shows one pass of reversal from Rule L2, such as "Reverse levels 1-2". At each iteration, the underlining shows the text that has been reversed.

The paragraph embedding level for the first and third examples is 0 (left-to-right direction), and for the second and fourth examples is 1 (right-to-left direction).

Example 1 (embedding level = 0)

Storage:	car means CAR.
Before Reordering:	car means CAR.
Resolved levels:	00000000001110
Reverse level 1:	car means RAC.

Example 2 (embedding level = 1)

Storage:	car MEANS CAR.
Before Reordering:	car MEANS CAR.
Resolved levels:	22211111111111
Reverse level 2:	rac MEANS CAR.
Reverse levels 1-2:	.RAC SNAEM car

Example 3 (embedding level = 0)

Storage:	he said “car MEANS CAR.”
Before Reordering:	he said “car MEANS CAR.”
Resolved levels:	000000000222111111111100
Reverse level 2:	he said “rac MEANS CAR.”
Reverse levels 1-2:	he said “RAC SNAEM car.”

Example 4 (embedding level = 1)

Storage:	DID YOU SAY ’he said “car MEANS CAR”‘?
Before Reordering:	DID YOU SAY ’he said “car MEANS CAR”‘?
Resolved levels:	11111111111112222222224443333333333211
Reverse level 4:	DID YOU SAY ’he said “rac MEANS CAR”‘?
Reverse levels 3-4:	DID YOU SAY ’he said “RAC SNAEM car”‘?
Reverse levels 2-4:	DID YOU SAY ’”rac MEANS CAR“ dias eh‘?
Reverse levels 1-4:	?‘he said “RAC SNAEM car”’ YAS UOY DID
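
The successive reversals shown in the examples can be reproduced with a short function. This is a sketch that works on characters of a single line and ignores glyph shaping and combining marks; `reorder_line` is a name chosen for the illustration.

```python
def reorder_line(text, levels):
    """Reverse runs from the highest level down to the lowest odd level."""
    chars = list(text)
    odd_levels = [level for level in levels if level % 2]
    if not odd_levels:
        return text                      # nothing is right-to-left
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                # The set of positions at or above a lower level does not
                # change, so the level list need not be permuted as well.
                chars[i:j] = chars[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)

print(reorder_line("car means CAR.", [int(c) for c in "00000000001110"]))
print(reorder_line("car MEANS CAR.", [int(c) for c in "22211111111111"]))
```

The two calls correspond to Example 1 and Example 2 and print `car means RAC.` and `.RAC SNAEM car`.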

L3. Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed.

Many font designers provide default metrics for combining marks that support rendering by simple overhang. Because of the reordering for right-to-left characters, it is common practice to make the glyphs for most combining characters overhang to the left (thus assuming the characters will be applied to left-to-right base characters) and make the glyphs for combining characters in right-to-left scripts overhang to the right (thus assuming that the characters will be applied to right-to-left base characters). With such fonts, the display ordering of the marks and base glyphs may need to be adjusted when combining marks are applied to “unmatching” base characters. See Section 5.13, Rendering Nonspacing Marks of [], for more information.

L4. A character is depicted by a mirrored glyph if and only if (a) the resolved directionality of that character is R, and (b) the Bidi_Mirrored property value of that character is true.

The Bidi_Mirrored property is defined by Section 4.7, Bidi Mirrored—Normative of []; the property values are specified in [].
This rule can be overridden in certain cases.

For example, U+0028 left parenthesis—which is interpreted in the Unicode Standard as an opening parenthesis—appears as “(” when its resolved level is even, and as the mirrored glyph “)” when its resolved level is odd. Note that for backward compatibility the characters U+FD3E (﴾) ornate left parenthesis and U+FD3F (﴿) ornate right parenthesis are not mirrored.
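
The mirroring condition can be checked with the Bidi_Mirrored data exposed by Python's `unicodedata.mirrored`. This sketch only decides whether a mirrored glyph is called for; choosing the actual glyph is left to the font or rendering engine. It treats an odd resolved level as resolved directionality R, as in the parenthesis example, and `needs_mirrored_glyph` is a made-up name.

```python
import unicodedata

def needs_mirrored_glyph(ch, resolved_level):
    """True if the character should be drawn with a mirrored glyph."""
    is_rtl = resolved_level % 2 == 1
    return is_rtl and bool(unicodedata.mirrored(ch))

for level in (0, 1):
    print(level, needs_mirrored_glyph("(", level), needs_mirrored_glyph("a", level))
```

Only the parenthesis at the odd level reports that a mirrored glyph is needed.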
