<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html lang="en">
<HEAD>
<meta name="copyright" content="Copyright (c) IBM Corporation and others 2012. This page is made available under license. For full details see the LEGAL in the documentation book that contains this page." >
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
<META HTTP-EQUIV="Content-Style-Type" CONTENT="text/css">
<LINK REL="STYLESHEET" HREF="../book.css" CHARSET="ISO-8859-1" TYPE="text/css">
<TITLE>Structured Text Design</TITLE>
<link rel="stylesheet" type="text/css" HREF="../book.css">
</HEAD>
<BODY BGCOLOR="#ffffff">
<h2><a name="overview"></a>Design Overview</h2>
<h3>Terms and Abbreviations</h3>
<table border="1" cellpadding="3" cellspacing="3" width="70%">
<tbody><tr>
<td>Bidi</td> <td>Bidirectional</td>
</tr>
<tr>
<td>LTR</td> <td>Left to Right</td>
</tr>
<tr>
<td>RTL</td> <td>Right to Left</td>
</tr>
<tr>
<td>LRM</td> <td>Left-to-Right Mark</td>
</tr>
<tr>
<td>RLM</td> <td>Right-to-Left Mark</td>
</tr>
<tr>
<td>LRE</td> <td>Left-to-Right Embedding</td>
</tr>
<tr>
<td>RLE</td> <td>Right-to-Left Embedding</td>
</tr>
<tr>
<td>PDF</td> <td>Pop Directional Formatting</td>
</tr>
</tbody></table>
<h3>General Definitions, Terminology and Conventions</h3>
<p>Every instance of bidi text has a base text direction. Bidi text in Arabic or
Hebrew has an RTL base direction, even if it includes numbers or Latin phrases
which are written from left to right. Bidi text in English or Greek has an LTR
base direction, even if it includes Arabic or Hebrew phrases which are written
from right to left.</p>
<p>Structured expressions also have a base text direction, which is often
determined by the type of structured expression, but may also be affected by the
content of the expression (whether it contains Arabic or Hebrew words).</p>
<p>This document addresses two groups of problematic cases:</p>
<ol>
<li><b>Expressions with simple internal structure</b>: this category
covers cases in which strings are concatenated in simple ways
using known separators, for example variable names, "name = value"
specifications and file paths.<br>
</li>
<li><b>Expressions with complex internal structure</b>: this category
covers structured text such as regular expressions, XPath expressions and
Java code. It differs from the previous category in that its expressions
have a syntax of their own, which cannot be described as a
concatenation of string segments joined by separators.</li>
</ol>
<p>We will see that the same algorithms can handle both groups, with some
adaptations in the details.</p>
<p>In the <b>examples</b> appearing in this document, upper case Latin letters
represent Arabic or Hebrew text, and lower case Latin letters represent English
text.</p>
<p>"<b>@</b>" represents an LRM, "<b>&amp;</b>" represents an RLM.</p>
<p>Notations like <b>LRE+LRM</b> represent the character LRE
immediately followed by the character LRM.</p>
<h3>Bidirectional Control Characters</h3>
<p>When bidi text is displayed incorrectly, the problem can often be fixed by
adding bidi control characters at appropriate locations in the text. This
document considers seven bidi control characters: LRM, RLM, LRE, RLE, LRO, RLO
and PDF. Since this design has no use for LRO and RLO (Left-to-Right Override
and Right-to-Left Override, respectively), the following paragraphs describe
the effect of the other five characters.</p>
<ul>
<li><b>LRM</b> (Left-to-Right Mark): LRM is an invisible character which
behaves like a letter in a Left to Right script such as Latin or Greek. It can
be used when a segment of LTR text starts or ends with characters which are
not intrinsically LTR and is displayed in a component with an RTL orientation.
<br>
Example: assume in memory the string "\\myserver\myshare(mydirectory)". We
want it displayed identically, but within a component with RTL
orientation it would be displayed as "(myserver\myshare(mydirectory\\". Adding
one LRM character at the beginning of the string will cause the leading
backslashes to be displayed on the left side, and adding one LRM character
at the end of the string will cause the trailing parenthesis to be displayed
on the right side.<br>
</li>
<li><b>RLM</b> (Right-to-Left Mark): RLM is an invisible character which
behaves like a letter in a Right to Left script such as Hebrew. It can be used
when a segment of RTL text starts or ends with characters which are not
intrinsically RTL and is displayed in a component with an LTR
orientation.<br>
Example: assume in memory the string "HELLO WORLD !". We want it displayed
as "! DLROW OLLEH", but within a component with an LTR orientation it
would be displayed as "DLROW OLLEH !" (exclamation mark on the right side).
Adding one RLM character at the end of the string will cause the trailing
exclamation mark to be displayed on the left side.<br>
</li>
<li><b>LRE</b> (Left-to-Right Embedding): LRE can be used to give an LTR
base direction to a piece of text. It is most useful for mixed text which
contains both LTR and RTL segments.<br>
Example: assume in memory the string "i love RACHEL and LEA" which should be
displayed as "i love LEHCAR and AEL". However, within a component with RTL
orientation, it would be displayed as "AEL and LEHCAR i love". Adding one
LRE character at the beginning of the string and one PDF character (see
below) at the end of the string will cause proper display.<br>
</li>
<li><b>RLE</b> (Right-to-Left Embedding): RLE can be used to give an RTL
base direction to a piece of text. It is most useful for mixed text which
contains both LTR and RTL segments.<br>
Example: assume in memory the string "I LOVE london AND paris" which should
be displayed as "paris DNA london EVOL I". However, within a component with
LTR orientation, it would be displayed as "EVOL I london DNA paris".
Adding one RLE character at the beginning of the string and one PDF
character (see below) at the end of the string will cause proper display.<br>
</li>
<li><b>PDF</b> (Pop Directional Formatting): PDF may be used to limit the
effect of a preceding LRE or RLE. It may be omitted if it is not followed by
any text.</li>
</ul>
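<p>As a minimal sketch (written in Python, which this design does not
prescribe), the five control characters can be expressed as string constants
using their Unicode code points. The helper names <code>embed_ltr</code>,
<code>embed_rtl</code> and <code>mark_ltr_edges</code> are hypothetical and
only illustrate the examples given in the list above.</p>

```python
# Unicode code points of the five bidi control characters used by this design.
LRM = "\u200e"  # Left-to-Right Mark
RLM = "\u200f"  # Right-to-Left Mark
LRE = "\u202a"  # Left-to-Right Embedding
RLE = "\u202b"  # Right-to-Left Embedding
PDF = "\u202c"  # Pop Directional Formatting


def embed_ltr(text):
    """Give `text` an LTR base direction by bracketing it with LRE ... PDF."""
    return LRE + text + PDF


def embed_rtl(text):
    """Give `text` an RTL base direction by bracketing it with RLE ... PDF."""
    return RLE + text + PDF


def mark_ltr_edges(text):
    """Add an LRM at both ends so that leading and trailing neutral
    characters (backslashes, parentheses) keep their LTR position."""
    return LRM + text + LRM
```

<p>For instance, <code>mark_ltr_edges</code> applied to the file path of the
LRM example adds one LRM before the leading backslashes and one after the
trailing parenthesis.</p>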
<p>Note that pieces of text bracketed between LRE/PDF or RLE/PDF can be
contained within larger pieces of text themselves bracketed between LRE/PDF or
RLE/PDF. This is why the "E" of LRE and RLE means "embedding". This could happen
if we have for instance a Hebrew sentence containing an English phrase itself
containing an Arabic segment. In practice, such complex cases should be avoided
if possible. The present design does not use more than one level of LRE/PDF or
RLE/PDF, except possibly in regular expressions.</p>
<h3>Bidi Classification</h3>
<p>Characters can be classified according to their bidi type as described in the
Unicode Standard (see
<a href="http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types">
Bidirectional_Character_Types</a> for a full description of the bidi types). For
our purpose, we will distinguish the following types of characters:</p>
<ul>
<li><b>"Strong" characters</b>: those with a bidi type of L, R or AL
(letters in LTR or RTL scripts);</li>
<li><b>Numbers</b>: European Numbers (type EN) or Arabic Numbers (type AN);</li>
<li><b>Neutrals</b>: all the rest.</li>
</ul>
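<p>The three-way classification above can be sketched with Python's standard
<code>unicodedata</code> module, whose <code>bidirectional()</code> function
returns the bidi type of a character as a string. The function name
<code>bidi_class</code> and its return values are assumptions made for this
illustration.</p>

```python
import unicodedata

STRONG_TYPES = {"L", "R", "AL"}  # letters in LTR or RTL scripts
NUMBER_TYPES = {"EN", "AN"}      # European and Arabic numbers


def bidi_class(ch):
    """Return 'strong', 'number' or 'neutral' for a single character."""
    bidi_type = unicodedata.bidirectional(ch)
    if bidi_type in STRONG_TYPES:
        return "strong"
    if bidi_type in NUMBER_TYPES:
        return "number"
    # Everything else (whitespace, punctuation, symbols, ...) is
    # treated as neutral for the purpose of this design.
    return "neutral"
```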
<h3>Text Analysis</h3>
<p>In all the structured expressions that we are addressing, we can see characters
with a special syntactical role that we will call "separators", and pieces of
text between separators that we will call "tokens". The separators vary
according to the type of structured expression. Often they are punctuation signs
like colon (:), backslash (\) and full stop (.), or mathematical signs like plus
(+) or equals (=).</p>
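<p>A minimal sketch of this split into separators and tokens follows. It
assumes that every separator is a single character and that the set of
separators is passed in as a string; the function name
<code>split_expression</code> is hypothetical. Expressions with complex
internal structure would need a real parser instead.</p>

```python
def split_expression(text, separators):
    """Split `text` into ("token", ...) and ("separator", ...) parts,
    kept in their logical (in-memory) order."""
    parts = []
    token_start = 0
    for i, ch in enumerate(text):
        if ch in separators:
            if i > token_start:
                # Text accumulated since the previous separator is a token.
                parts.append(("token", text[token_start:i]))
            parts.append(("separator", ch))
            token_start = i + 1
    if token_start < len(text):
        parts.append(("token", text[token_start:]))
    return parts
```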
<p><b>Our objective is that the relative progression of the
tokens and separators for display should always follow the base text direction
of the text, while each token goes LTR or RTL depending on its content and
according to the Unicode Bidirectional Algorithm (UBA).</b></p>
<p>For this to happen, the following must be done:</p>
<ol>
<li>Parse the expression to locate the separators and the tokens.<br></li>
<li>While parsing, note the bidi classification of the characters parsed.<br></li>
<li>Depending on the bidi types of the characters before a token and in that
token, an LRM or an RLM may have to be added. The algorithm for this is detailed below.<br></li>
<li>If the expression has an LTR base direction and the component where
it is displayed has an RTL orientation, add LRE+LRM at the beginning of
the expression and LRM+PDF at its end.<br></li>
<li>If the expression has an RTL base direction and the component where
it is displayed has an LTR orientation, add RLE+RLM at the beginning of
the expression and RLM+PDF at its end.<br></li>
</ol>
<p>The original structured expression, before addition of directional formatting
characters, is called <em><strong>lean</strong></em> text.</p>
<p>The processed expression, after addition of directional formatting
characters, is called <em><strong>full</strong></em> text.</p>
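<p>The last two steps of the procedure, which bracket the expression when its
base direction differs from the orientation of the component, can be sketched
as follows. The function name <code>add_embedding</code> and the use of the
strings <code>"ltr"</code> and <code>"rtl"</code> to denote directions are
assumptions made for this illustration.</p>

```python
LRM, RLM = "\u200e", "\u200f"
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"


def add_embedding(text, base_direction, component_orientation):
    """Bracket `text` when its base direction ("ltr" or "rtl") differs
    from the orientation of the component that displays it."""
    if base_direction == "ltr" and component_orientation == "rtl":
        return LRE + LRM + text + LRM + PDF
    if base_direction == "rtl" and component_orientation == "ltr":
        return RLE + RLM + text + RLM + PDF
    # Direction and orientation agree: no bracketing is needed.
    return text
```

<p>In this sketch, <code>text</code> is expected to already contain any LRM or
RLM added between tokens, so that the returned string is the full text.</p>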
<h3>LRM Addition (structured text with LTR base text direction)</h3>
<p>An LRM is added before a token if either of the following conditions is satisfied:</p>
<ul>
<li>The last strong character before the token has a bidi type equal to R or
AL and the first non-neutral character in the token itself has a bidi type
equal to R, AL, EN or AN.</li>
</ul>
<p>Examples (strings in logical order where "@" represents where an LRM should
be added):</p>
<pre> HEBREW @= ARABIC
 HEBREW @= 123
</pre>
<p>OR</p>
<ul>
<li>The last non-neutral character before the token has a bidi type equal to
AN and the first non-neutral character in the token has a bidi type equal to
R, AL or AN.</li>
</ul>
<p>Examples (strings in logical order where "@" represents where an LRM should
be added):</p>
<pre> ARABIC NUMBER 123 @&lt; MAX
 ARABIC NUMBER 123 @&lt; 456
</pre>
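<p>The two conditions can be sketched as a single pass over the lean text. As
in the examples, the LRM is placed before the separator that precedes the
token. This sketch assumes single-character separators and assumes that an
inserted LRM then counts as the last strong character (type L) for later
decisions; the names <code>add_lrms</code> and <code>first_non_neutral</code>
are hypothetical.</p>

```python
import unicodedata

LRM = "\u200e"
STRONG = {"L", "R", "AL"}
NON_NEUTRAL = STRONG | {"EN", "AN"}


def first_non_neutral(lean, start, separators):
    """Bidi type of the first non-neutral character of the token that
    begins at index `start`, or None if the token has none."""
    for ch in lean[start:]:
        if ch in separators:
            break  # the token ends at the next separator
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in NON_NEUTRAL:
            return bidi_type
    return None


def add_lrms(lean, separators):
    """Insert LRMs into LTR-based lean text according to the two rules."""
    out = []
    last_strong = None       # bidi type of the last strong character seen
    last_non_neutral = None  # bidi type of the last strong or number seen
    for i, ch in enumerate(lean):
        if ch in separators:
            first = first_non_neutral(lean, i + 1, separators)
            rule1 = (last_strong in ("R", "AL")
                     and first in ("R", "AL", "EN", "AN"))
            rule2 = last_non_neutral == "AN" and first in ("R", "AL", "AN")
            if rule1 or rule2:
                out.append(LRM)
                # Assumption: the inserted LRM acts as a strong L character.
                last_strong = last_non_neutral = "L"
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in STRONG:
            last_strong = bidi_type
        if bidi_type in NON_NEUTRAL:
            last_non_neutral = bidi_type
        out.append(ch)
    return "".join(out)
```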
<h3>RLM Addition (structured text with RTL base text direction)</h3>
<p>An RLM is added before a token if the following condition is satisfied:</p>
<ul>
<li>The last strong character before the token has a bidi type equal
to L and the first non-neutral character in the token itself has a bidi
type equal to L or EN.</li>
</ul>
<p>Example (string in logical order where "&amp;" represents where an RLM should
be added):</p>
<pre> my_pet &amp;= dog
</pre>
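<p>The RTL counterpart can be sketched in the same way. It assumes
single-character separators and assumes that an inserted RLM then counts as
the last strong character (type R); the name <code>add_rlms</code> is
hypothetical.</p>

```python
import unicodedata

RLM = "\u200f"
STRONG = {"L", "R", "AL"}
NON_NEUTRAL = STRONG | {"EN", "AN"}


def add_rlms(lean, separators):
    """Insert RLMs into RTL-based lean text according to the rule above."""
    out = []
    last_strong = None  # bidi type of the last strong character seen
    for i, ch in enumerate(lean):
        if ch in separators and last_strong == "L":
            # Find the first non-neutral character of the following token.
            first = None
            for nxt in lean[i + 1:]:
                if nxt in separators:
                    break  # the token ends at the next separator
                bidi_type = unicodedata.bidirectional(nxt)
                if bidi_type in NON_NEUTRAL:
                    first = bidi_type
                    break
            if first in ("L", "EN"):
                out.append(RLM)
                # Assumption: the inserted RLM acts as a strong R character.
                last_strong = "R"
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in STRONG:
            last_strong = bidi_type
        out.append(ch)
    return "".join(out)
```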
</BODY>
</HTML>