| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html lang="en"> |
| <HEAD> |
| |
| <meta name="copyright" content="Copyright (c) IBM Corporation and others 2012. This page is made available under license. For full details see the LEGAL in the documentation book that contains this page." > |
| |
| <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"> |
| <META HTTP-EQUIV="Content-Style-Type" CONTENT="text/css"> |
| |
| <LINK REL="STYLESHEET" HREF="../book.css" CHARSET="ISO-8859-1" TYPE="text/css"> |
| <TITLE>Structured Text Design</TITLE> |
| |
| <link rel="stylesheet" type="text/css" HREF="../book.css"> |
| </HEAD> |
| <BODY BGCOLOR="#ffffff"> |
| |
| <h2><a name="overview"></a>Design Overview</h2> |
| |
| <h3>Terms and Abbreviations</h3> |
| <table border="1" callpadding = "3" cellspacing = "3" width="70%"> |
| <tbody><tr> |
| <td>Bidi</td> <td>Bidirectional</td> |
| </tr> |
| <tr> |
| <td>LTR</td> <td>Left to Right</td> |
| </tr> |
| <tr> |
| <td>RTL</td> <td>Right to Left</td> |
| </tr> |
| <tr> |
| <td>LRM</td> <td>Left-to-Right Mark</td> |
| </tr> |
| <tr> |
| <td>RLM</td> <td>Right-to-Left Mark</td> |
| </tr> |
| <tr> |
| <td>LRE</td> <td>Left-to-Right Embedding</td> |
| </tr> |
| <tr> |
| <td>RLE</td> <td>Right-to-Left Embedding</td> |
| </tr> |
| <tr> |
| <td>PDF</td> <td>Pop Directional Formatting</td> |
| </tr> |
| </tbody></table> |
| |
| <h3>General Definitions, Terminology and Conventions</h3> |
| <p>Every instance of bidi text has a base text direction. Bidi text in Arabic or |
| Hebrew has a RTL base direction, even if it includes numbers or Latin phrases |
| which are written from left to right. Bidi text in English or Greek has a LTR |
| base direction, even if it includes Arabic or Hebrew phrases which are written |
| from right to left.</p> |
| <p>Structured expressions also have a base text direction, which is often |
| determined by the type of structured expression, but may also be affected by the |
| content of the expression (whether it contains Arabic or Hebrew words).</p> |
| <p>This document addresses two groups of problematic cases:</p> |
| <ol> |
| <li><b>Expressions with simple internal structure</b>: this category |
| regroups cases in which strings are concatenated together in simple ways |
| using known separators. For example: variable names, "name = value" |
| specifications, file path, etc...<br> |
| </li> |
| <li><b>Expressions with complex internal structure</b>: this category |
| regroups structured text like regular expressions, XPath expressions and |
| Java code. This category differs from the previous one since the expressions |
| belonging to it have a unique syntax which cannot be described by |
| concatenation of string segments using separators.</li> |
| </ol> |
| <p>We will see that the same algorithms can handle both groups, with some |
| adaptations in the details.</p> |
| <p>In the <b>examples</b> appearing in this document, upper case Latin letters |
| represent Arabic or Hebrew text, lower case Latin letters represent English |
| text.</p> |
| <p>"<b>@</b>" represents an LRM, "<b>&</b>" represents an RLM.</p> |
| <p>Notations like <b>LRE+LRM</b> represent character LRE |
| immediately followed by character LRM.</p> |
| |
| <h3>Bidirectional Control Characters</h3> |
| <p>When there are problems of wrong display of bidi text, it is often possible |
| to cure them by adding some bidi control characters at appropriate locations in |
| the text. There are 7 bidi control characters: LRM, RLM, LRE, RLE, LRO, RLO and |
| PDF. Since this design has no use for LRO and RLO (Left-to-Right and |
| Right-to-Left Override, respectively), the following paragraphs will describe |
| the effect of the 5 other characters.</p> |
| <ul> |
| <li><b>LRM</b> (Left-to-Right Mark): LRM is an invisible character which |
| behaves like a letter in a Left to Right script such as Latin or Greek. It can |
| be used when a segment of LTR text starts or ends with characters which are |
| not intrinsically LTR and is displayed in a component with a RTL orientation. |
| <br> |
| Example: assume in memory the string "\\myserver\myshare(mydirectory)". We |
| want it displayed identically, but within a component with RTL |
| orientation it would be displayed as "(myserver\myshare(mydirectory\\". Adding |
| one LRM character at the beginning of the string will cause the leading |
| backslashes to be displayed on the left side, and adding one LRM character |
| at the end of the string will cause the trailing parenthesis to be displayed |
| on the right side.<br> |
| </li> |
| <li><b>RLM</b> (Right-to-Left Mark): RLM is an invisible character which |
| behaves like a letter in a Right to Left script like Hebrew. It can be used |
| when a segment of RTL text starts or ends with characters which are not |
| intrinsically RTL and is displayed in a component with a LTR |
| orientation.<br> |
| Example: assume in memory the string "HELLO WORLD !". We want it displayed |
| as "! DLROW OLLEH", but within a component with a LTR orientation it |
| would be displayed as "DLROW OLLEH !" (exclamation mark on the right side). |
| Adding one RLM character at the end of the string will cause the trailing |
| exclamation mark to be displayed on the left side.<br> |
| </li> |
| <li><b>LRE</b> (Left-to-Right Embedding): LRE can be used to give a base |
| LTR direction to a piece of text. It is most useful for mixed text which |
| contains both LTR and RTL segments.<br> |
| Example: assume in memory the string "i love RACHEL and LEA" which should be |
| displayed as "i love LEHCAR and AEL". However, within a component with RTL |
| orientation, it would be displayed as "AEL and LEHCAR i love". Adding one |
| LRE character at the beginning of the string and one PDF (see below) |
| character at the end of the string will cause proper display.<br> |
| </li> |
| <li><b>RLE</b> (Right-to-Left Embedding): RLE can be used to give a base |
| RTL direction to a piece of text. It is most useful for mixed text which |
| contains both LTR and RTL segments.<br> |
| Example: assume in memory the string "I LOVE london AND paris" which should |
| be displayed as "paris DNA london EVOL I". However, within a component with |
| LTR orientation, it would be displayed as "EVOL I london DNA paris". |
| Adding one RLE character at the beginning of the string and adding one PDF |
| (see below) character at the end of the string will cause proper display.<br> |
| </li> |
| <li><b>PDF</b> (Pop Directional Formatting): PDF may be used to limit the |
| effect of a preceding LRE or RLE. It may be omitted if not followed by any |
| text.</li> |
| </ul> |
| <p>Note that pieces of text bracketed between LRE/PDF or RLE/PDF can be |
| contained within larger pieces of text themselves bracketed between LRE/PDF or |
| RLE/PDF. This is why the "E" of LRE and RLE means "embedding". This could happen |
| if we have for instance a Hebrew sentence containing an English phrase itself |
| containing an Arabic segment. In practice, such complex cases should be avoided |
| if possible. The present design does not use more than one level of LRE/PDF or |
| RLE/PDF, except possibly in regular expressions.</p> |
| |
| <h3>Bidi Classification</h3> |
| <p>Characters can be classified according to their bidi type as described in the |
| Unicode Standard (see |
| <a href="http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types"> |
| Bidirectional_Character_Types</a> for a full description of the bidi types). For |
| our purpose, we will distinguish the following types of characters:</p> |
| <ul> |
| <li><b>"Strong" characters</b>: those with a bidi type of L, R or AL |
| (letters in LTR or RTL scripts);</li> |
| <li><b>Numbers</b>: European Numbers (type EN) or Arabic Numbers (type AN);</li> |
| <li><b>Neutrals</b>: all the rest.</li> |
| </ul> |
| |
| <h3>Text Analysis</h3> |
| <p>In all the structured expressions that we are addressing, we can see characters |
| with a special syntactical role that we will call "separators", and pieces of |
| text between separators that we will call "tokens". The separators vary |
| according to the type of structured expression. Often they are punctuation signs |
| like colon (:), backslash (\) and full stop (.), or mathematical signs like Plus |
| (+) or Equal (=).</p> |
| <p><b>Our objective is that the relative progression of the |
| tokens and separators for display should always follow the base text direction |
| of the text, while each token will go LTR or RTL depending on its content and |
| according to the UBA.</b></p> |
| <p>For this to happen, the following must be done:</p> |
| <ol> |
| <li>Parse the expression to locate the separators and the tokens.<br></li> |
| <li>While parsing, note the bidi classification of characters parsed.<br></li> |
| <li>Depending on the bidi types of the characters before a token and in that |
| token, a LRM or a RLM may have to be added. The algorithm for this is detailed below.<br></li> |
| <li>If the expression has a LTR base direction and the component where |
| it is displayed has a RTL orientation, add LRE+LRM at the beginning of |
| the expression and LRM+PDF at its end.<br></li> |
| <li>If the expression has a RTL base direction and the component where |
| it is displayed has a LTR orientation, add RLE+RLM at the beginning of |
| the expression and RLM+PDF at its end.<br></li> |
| </ol> |
| <p>The original structured expression, before addition of directional formatting |
| characters, is called <em><strong>lean</strong></em> text.</p> |
| <p>The processed expression, after addition of directional formatting |
| characters, is called <em><strong>full</strong></em> text.</p> |
| |
| <h3>LRM Addition (structured text with LTR base text direction)</h3> |
| <p>A LRM will be added before a token if the following conditions are satisfied:</p> |
| <ul> |
| <li>The last strong character before the token has a bidi type equal to R or |
| AL and the first non-neutral character in the token itself has a bidi type |
| equal to R, AL, EN or AN.</li> |
| </ul> |
| <p>Examples (strings in logical order where "@" represents where an LRM should |
| be added):</p> |
| <pre> HEBREW @= ARABIC |
| HEBREW @= 123 |
| </pre> |
| <p>OR</p> |
| <ul> |
| <li>The last non-neutral character before the token has a bidi type equal to |
| AN and the first non-neutral character in the token has a bidi type equal to |
| R, AL or AN.</li> |
| </ul> |
| <p>Examples (strings in logical order where "@" represents where an LRM should |
| be added):</p> |
| <pre> ARABIC NUMBER 123 @< MAX |
| ARABIC NUMBER 123 @< 456 |
| </pre> |
| |
| <h3>RLM Addition (structured text with RTL base text direction)</h3> |
| <p>A RLM will be added before a token if the following conditions are satisfied:</p> |
| <ul> |
| <li>The last strong character before the token has a bidi type equal |
| to L and the first non-neutral character in the token itself has a bidi |
| type |
| equal to L or EN.</li> |
| </ul> |
| <p>Example (string in logical order where "&" represents where an RLM should |
| be added): |
| </p><pre> my_pet &= dog |
| </pre> |
| |
| </BODY> |
| </HTML> |