
 <!DOCTYPE html>
 <html lang="en">
 <head>
     <meta charset="utf-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
     <meta name="generator" content="rustdoc">
     <meta name="description" content="API documentation for the Rust `utf8_ranges` crate.">
     <meta name="keywords" content="rust, rustlang, rust-lang, utf8_ranges">

     <title>utf8_ranges - Rust</title>

     <link rel="stylesheet" type="text/css" href="../normalize.css">
     <link rel="stylesheet" type="text/css" href="../rustdoc.css">
     <link rel="stylesheet" type="text/css" href="../main.css">


 </head>
 <body class="rustdoc mod">
     <!--[if lte IE 8]>
     <div class="warning">
         This old browser is unsupported and will most likely display funky
         things.
     </div>
     <![endif]-->


     <nav class="sidebar">

         <p class='location'>Crate utf8_ranges</p><div class="block items"><ul><li><a href="#structs">Structs</a></li><li><a href="#enums">Enums</a></li></ul></div><p class='location'></p><script>window.sidebarCurrent = {name: 'utf8_ranges', ty: 'mod', relpath: '../'};</script>
     </nav>

     <nav class="sub">
         <form class="search-form js-only">
             <div class="search-container">
                 <input class="search-input" name="search"
                        autocomplete="off"
                        placeholder="Click or press ‘S’ to search, ‘?’ for more options…"
                        type="search">
             </div>
         </form>
     </nav>

     <section id='main' class="content">
 <h1 class='fqn'><span class='in-band'>Crate <a class="mod" href=''>utf8_ranges</a></span><span class='out-of-band'><span id='render-detail'>
                    <a id="toggle-all-docs" href="javascript:void(0)" title="collapse all docs">
                        [<span class='inner'>&#x2212;</span>]
                    </a>
                </span><a class='srclink' href='../src/utf8_ranges/lib.rs.html#1-511' title='goto source code'>[src]</a></span></h1>
 <div class='docblock'><p>Crate <code>utf8-ranges</code> converts ranges of Unicode scalar values to equivalent
 ranges of UTF-8 bytes. This is useful for constructing byte-based automata
 that need to embed UTF-8 decoding.</p>

 <p>See the documentation on the <code>Utf8Sequences</code> iterator for more details and
 an example.</p>

 <h1 id='wait-what-is-this' class='section-header'><a href='#wait-what-is-this'>Wait, what is this?</a></h1>
 <p>This is simplest to explain with an example. Let&#39;s say you wanted to test
 whether a particular byte sequence was a Cyrillic character. One possible
 scalar value range is <code>[0400-04FF]</code>. The set of allowed bytes for this
 range can be expressed as a sequence of byte ranges:</p>

 <pre class="rust rust-example-rendered">
 [<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>

 <p>This is straightforward: encode the boundaries (<code>0400</code> encodes to
 <code>D0 80</code> and <code>04FF</code> encodes to <code>D3 BF</code>), then create ranges from each
 corresponding pair of bytes: <code>D0</code> to <code>D3</code> and <code>80</code> to <code>BF</code>.</p>
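 <p>The boundary-encoding step described above can be sketched with the standard
 library alone. The helper name <code>naive_byte_ranges</code> is hypothetical and is not
 part of this crate; it only pairs up the bytes of the two encoded boundaries,
 which is all the simple case needs.</p>

```rust
// Hypothetical helper (not part of utf8-ranges): encode both boundaries
// and pair up their bytes position by position. Returns an empty string
// when the encodings differ in length, since the pairing is then undefined.
fn naive_byte_ranges(start: char, end: char) -> String {
    let (mut a, mut b) = ([0u8; 4], [0u8; 4]);
    let lo = start.encode_utf8(&mut a).as_bytes();
    let hi = end.encode_utf8(&mut b).as_bytes();
    if lo.len() != hi.len() {
        return String::new();
    }
    lo.iter()
        .zip(hi)
        .map(|(l, h)| format!("[{:02X}-{:02X}]", l, h))
        .collect()
}

fn main() {
    // 0400 encodes to D0 80 and 04FF encodes to D3 BF.
    println!("{}", naive_byte_ranges('\u{0400}', '\u{04FF}')); // [D0-D3][80-BF]
}
```

 <p>This pairing is only a sketch of the simple case; it is not correct for
 arbitrary ranges.</p>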

 <p>However, what if you wanted to add the Cyrillic Supplementary characters to
 your range? Your range might then become <code>[0400-052F]</code>. The same procedure
 as above doesn&#39;t quite work because <code>052F</code> encodes to <code>D4 AF</code>. The byte ranges
 you&#39;d get from the previous transformation would be <code>[D0-D4][80-AF]</code>. However,
 this is incorrect because the range misses many characters, for example
 <code>04FF</code> (its last byte, <code>BF</code>, isn&#39;t in the range <code>80-AF</code>).</p>

 <p>Instead, you need multiple sequences of byte ranges:</p>

 <pre class="rust rust-example-rendered">
 [<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]  # matches codepoints 0400-04FF
 [<span class="ident">D4</span>][<span class="number">80</span><span class="op">-</span><span class="ident">AF</span>]     # matches codepoints 0500-052F</pre>
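 <p>The difference between the single naive sequence and the two split sequences
 can be checked by brute force against the standard library&#39;s encoder. This is a
 minimal sketch: <code>seq_matches</code> and <code>count_missed</code> are illustrative names, not
 part of this crate, and a sequence is modelled as a slice of inclusive
 <code>(low, high)</code> byte pairs.</p>

```rust
// True if `bytes` has one byte per range and each byte falls in its range.
fn seq_matches(seq: &[(u8, u8)], bytes: &[u8]) -> bool {
    seq.len() == bytes.len()
        && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
}

// Count the scalar values in [start, end] whose UTF-8 encoding is matched
// by none of the given sequences.
fn count_missed(seqs: &[&[(u8, u8)]], start: u32, end: u32) -> usize {
    (start..=end)
        .filter_map(char::from_u32)
        .filter(|c| {
            let mut buf = [0u8; 4];
            let bytes = c.encode_utf8(&mut buf).as_bytes();
            !seqs.iter().any(|seq| seq_matches(seq, bytes))
        })
        .count()
}

fn main() {
    // The single sequence obtained by pairing the encoded boundaries.
    let naive: &[&[(u8, u8)]] = &[&[(0xD0, 0xD4), (0x80, 0xAF)]];
    // The two sequences shown above.
    let split: &[&[(u8, u8)]] = &[
        &[(0xD0, 0xD3), (0x80, 0xBF)],
        &[(0xD4, 0xD4), (0x80, 0xAF)],
    ];
    println!("naive misses {}", count_missed(naive, 0x0400, 0x052F)); // 64
    println!("split misses {}", count_missed(split, 0x0400, 0x052F)); // 0
}
```

 <p>The 64 misses are the scalar values whose second byte lies in <code>B0-BF</code> under
 each of the four lead bytes <code>D0</code> through <code>D3</code>.</p>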

 <p>This gets even more complicated if you want bigger ranges, particularly if
 they naively contain surrogate codepoints. For example, the sequences of byte
 ranges for the Basic Multilingual Plane (<code>[0000-FFFF]</code>) look like this:</p>

 <pre class="rust rust-example-rendered">
 [<span class="number">0</span><span class="op">-</span><span class="number">7F</span>]
 [<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>

 <p>Note that the byte ranges above will <em>not</em> match any erroneous encoding of
 UTF-8, including encodings of surrogate codepoints.</p>

 <p>And, of course, for all of Unicode (<code>[000000-10FFFF]</code>):</p>

 <pre class="rust rust-example-rendered">
 [<span class="number">0</span><span class="op">-</span><span class="number">7F</span>]
 [<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">F0</span>][<span class="number">90</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">F1</span><span class="op">-</span><span class="ident">F3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
 [<span class="ident">F4</span>][<span class="number">80</span><span class="op">-</span><span class="number">8F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>
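 <p>As a sanity check, the table above can be transcribed by hand and compared
 against the standard library&#39;s encoder. The sketch below does not use this
 crate; <code>ALL_UNICODE</code> and <code>matches_any</code> are illustrative names, and the
 constant is simply the nine sequences listed above written as inclusive
 <code>(low, high)</code> byte pairs.</p>

```rust
// The nine sequences for [000000-10FFFF], transcribed from the table above.
const ALL_UNICODE: &[&[(u8, u8)]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
];

// True if `bytes` is matched, byte for byte, by one of the sequences.
fn matches_any(bytes: &[u8]) -> bool {
    ALL_UNICODE.iter().any(|seq| {
        seq.len() == bytes.len()
            && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
    })
}

fn main() {
    // char::from_u32 skips surrogates, so this visits every scalar value.
    let all_match = (0..=0x10FFFFu32).filter_map(char::from_u32).all(|c| {
        let mut buf = [0u8; 4];
        matches_any(c.encode_utf8(&mut buf).as_bytes())
    });
    println!("all scalar values match: {}", all_match);
    // ED A0 80 is what a naive encoder would emit for the surrogate D800.
    println!("surrogate ED A0 80 matches: {}", matches_any(&[0xED, 0xA0, 0x80]));
    // C0 80 is an overlong encoding of 0000.
    println!("overlong C0 80 matches: {}", matches_any(&[0xC0, 0x80]));
}
```

 <p>The two rejected inputs illustrate the point made earlier: surrogate and
 overlong encodings fall outside every sequence.</p>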

 <p>This crate automates the process of creating these byte ranges from ranges of
 Unicode scalar values.</p>

 <h1 id='why-would-i-ever-use-this' class='section-header'><a href='#why-would-i-ever-use-this'>Why would I ever use this?</a></h1>
 <p>You probably won&#39;t ever need this. In 99% of cases, you just decode the byte
 sequence into a Unicode scalar value and compare scalar values directly.
 However, this explicit decoding step isn&#39;t always possible. For example, the
 construction of some finite state machines may benefit from converting ranges
 of scalar values into UTF-8 decoder automata (e.g., for character classes in
 regular expressions).</p>

 <h1 id='lineage' class='section-header'><a href='#lineage'>Lineage</a></h1>
 <p>I got the idea and general implementation strategy from Russ Cox in his
 <a href="https://swtch.com/%7Ersc/regexp/regexp3.html">article on regexps</a> and RE2.
 Russ Cox got it from Ken Thompson&#39;s <code>grep</code> (no source, folklore?).
 I also got the idea from
 <a href="https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java">Lucene</a>,
 which uses it for executing automata on their term index.</p>
 </div><h2 id='structs' class='section-header'><a href="#structs">Structs</a></h2>
 <table>
                        <tr class=' module-item'>
                            <td><a class="struct" href="struct.Utf8Range.html"
                                   title='struct utf8_ranges::Utf8Range'>Utf8Range</a></td>
                            <td class='docblock-short'>
                                 <p>A single inclusive range of UTF-8 bytes.</p>
                            </td>
                        </tr>
                        <tr class=' module-item'>
                            <td><a class="struct" href="struct.Utf8Sequences.html"
                                   title='struct utf8_ranges::Utf8Sequences'>Utf8Sequences</a></td>
                            <td class='docblock-short'>
                                 <p>An iterator over ranges of matching UTF-8 byte sequences.</p>
                            </td>
                        </tr></table><h2 id='enums' class='section-header'><a href="#enums">Enums</a></h2>
 <table>
                        <tr class=' module-item'>
                            <td><a class="enum" href="enum.Utf8Sequence.html"
                                   title='enum utf8_ranges::Utf8Sequence'>Utf8Sequence</a></td>
                            <td class='docblock-short'>
                                 <p>Utf8Sequence represents a sequence of byte ranges.</p>
                            </td>
                        </tr></table></section>
     <section id='search' class="content hidden"></section>

     <section class="footer"></section>

     <aside id="help" class="hidden">
         <div>
             <h1 class="hidden">Help</h1>

             <div class="shortcuts">
                 <h2>Keyboard Shortcuts</h2>

                 <dl>
                     <dt>?</dt>
                     <dd>Show this help dialog</dd>
                     <dt>S</dt>
                     <dd>Focus the search field</dd>
                     <dt>&larrb;</dt>
                     <dd>Move up in search results</dd>
                     <dt>&rarrb;</dt>
                     <dd>Move down in search results</dd>
                     <dt>&#9166;</dt>
                     <dd>Go to active search result</dd>
                     <dt>+</dt>
                     <dd>Collapse/expand all sections</dd>
                 </dl>
             </div>

             <div class="infos">
                 <h2>Search Tricks</h2>

                 <p>
                     Prefix searches with a type followed by a colon (e.g.
                     <code>fn:</code>) to restrict the search to a given type.
                 </p>

                 <p>
                     Accepted types are: <code>fn</code>, <code>mod</code>,
                     <code>struct</code>, <code>enum</code>,
                     <code>trait</code>, <code>type</code>, <code>macro</code>,
                     and <code>const</code>.
                 </p>

                 <p>
                     Search functions by type signature (e.g.
                     <code>vec -> usize</code> or <code>* -> vec</code>)
                 </p>
             </div>
         </div>
     </aside>


     <script>
         window.rootPath = "../";
         window.currentCrate = "utf8_ranges";
     </script>
     <script src="../main.js"></script>
     <script defer src="../search-index.js"></script>
 </body>
 </html>