| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| <meta name="generator" content="rustdoc"> |
| <meta name="description" content="API documentation for the Rust `utf8_ranges` crate."> |
| <meta name="keywords" content="rust, rustlang, rust-lang, utf8_ranges"> |
| |
| <title>utf8_ranges - Rust</title> |
| |
| <link rel="stylesheet" type="text/css" href="../normalize.css"> |
| <link rel="stylesheet" type="text/css" href="../rustdoc.css"> |
| <link rel="stylesheet" type="text/css" href="../main.css"> |
| |
| |
| |
| |
| </head> |
| <body class="rustdoc mod"> |
| |
| |
| |
| <nav class="sidebar"> |
| |
| <p class='location'>Crate utf8_ranges</p><div class="block items"><ul><li><a href="#structs">Structs</a></li><li><a href="#enums">Enums</a></li></ul></div><p class='location'></p><script>window.sidebarCurrent = {name: 'utf8_ranges', ty: 'mod', relpath: '../'};</script> |
| </nav> |
| |
| |
| <section id='main' class="content"> |
| <h1 class='fqn'><span class='in-band'>Crate <a class="mod" href=''>utf8_ranges</a></span><span class='out-of-band'><span id='render-detail'> |
| <a id="toggle-all-docs" href="javascript:void(0)" title="collapse all docs"> |
| [<span class='inner'>−</span>] |
| </a> |
| </span><a class='srclink' href='../src/utf8_ranges/lib.rs.html#1-511' title='goto source code'>[src]</a></span></h1> |
| <div class='docblock'><p>Crate <code>utf8-ranges</code> converts ranges of Unicode scalar values to equivalent |
| ranges of UTF-8 bytes. This is useful for constructing byte-based automata |
| that need to embed UTF-8 decoding.</p> |
| |
| <p>See the documentation on the <code>Utf8Sequences</code> iterator for more details and |
| an example.</p> |
| |
| <h1 id='wait-what-is-this' class='section-header'><a href='#wait-what-is-this'>Wait, what is this?</a></h1> |
| <p>This is simplest to explain with an example. Let's say you wanted to test |
| whether a particular byte sequence was a Cyrillic character. One possible |
| scalar value range is <code>[0400-04FF]</code>. The set of allowed bytes for this |
| range can be expressed as a sequence of byte ranges:</p> |
| |
| <pre class="rust rust-example-rendered"> |
| [<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre> |
| |
| <p>This is simple enough: encode the two boundaries (<code>0400</code> encodes to |
| <code>D0 80</code> and <code>04FF</code> encodes to <code>D3 BF</code>), then create a range from each |
| corresponding pair of bytes: <code>D0</code> to <code>D3</code> and <code>80</code> to <code>BF</code>.</p> |
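<p>The boundary-encoding step can be sketched with the standard library alone. The helper names <code>utf8_bytes</code> and <code>naive_byte_ranges</code> are hypothetical and are not part of this crate's API; the sketch only pairs up the bytes of the two encoded boundaries, which is the simple procedure described above and nothing more.</p>

```rust
/// Encode a Unicode scalar value as UTF-8 bytes (standard library only).
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: pair up the bytes of the two encoded boundaries.
/// Returns `None` when the boundaries encode to different lengths, since
/// there is then no byte-for-byte pairing to make.
fn naive_byte_ranges(start: char, end: char) -> Option<Vec<(u8, u8)>> {
    let (s, e) = (utf8_bytes(start), utf8_bytes(end));
    if s.len() != e.len() {
        return None;
    }
    Some(s.into_iter().zip(e).collect())
}

fn main() {
    // Basic Cyrillic block: U+0400 through U+04FF.
    let ranges = naive_byte_ranges('\u{0400}', '\u{04FF}').unwrap();
    for (lo, hi) in &ranges {
        print!("[{:02X}-{:02X}]", lo, hi);
    }
    println!();
}
```

<p>For <code>0400</code> and <code>04FF</code> this prints <code>[D0-D3][80-BF]</code>. The pairing is only correct for some ranges, so treat it as an illustration rather than a general construction.</p>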
| |
| <p>However, what if you wanted to add the Cyrillic Supplement characters to |
| your range? Your range might then become <code>[0400-052F]</code>. The same procedure |
| as above doesn't quite work because <code>052F</code> encodes to <code>D4 AF</code>. The byte ranges |
| you'd get from the previous transformation would be <code>[D0-D4][80-AF]</code>. This |
| isn't correct because it fails to match many characters in the range. For |
| example, it misses <code>04FF</code>, whose last byte, <code>BF</code>, isn't in the range <code>80-AF</code>.</p> |
| |
| <p>Instead, you need multiple sequences of byte ranges:</p> |
| |
| <pre class="rust rust-example-rendered"> |
| [<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] <span class="attribute"># <span class="ident">matches</span> <span class="ident">codepoints</span> <span class="number">0400</span><span class="op">-</span><span class="number">04FF</span></span> |
| [<span class="ident">D4</span>][<span class="number">80</span><span class="op">-</span><span class="ident">AF</span>] <span class="attribute"># <span class="ident">matches</span> <span class="ident">codepoints</span> <span class="number">0500</span><span class="op">-</span><span class="number">052F</span></span></pre> |
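<p>To see the difference concretely, the following sketch counts how many scalar values in <code>[0400-052F]</code> each set of byte ranges fails to match. It uses only the standard library; the names <code>NAIVE</code>, <code>SPLIT</code>, <code>matches_any</code> and <code>missed</code> are assumptions made for illustration and do not come from this crate.</p>

```rust
/// One sequence of inclusive byte ranges, e.g. [D0-D3][80-BF].
type Sequence = &'static [(u8, u8)];

/// The single sequence produced by pairing up the encoded boundaries.
const NAIVE: &[Sequence] = &[&[(0xD0, 0xD4), (0x80, 0xAF)]];

/// The two sequences given in the text above.
const SPLIT: &[Sequence] = &[
    &[(0xD0, 0xD3), (0x80, 0xBF)],
    &[(0xD4, 0xD4), (0x80, 0xAF)],
];

/// True if `bytes` has the same length as some sequence and every byte
/// falls inside the corresponding range of that sequence.
fn matches_any(seqs: &[Sequence], bytes: &[u8]) -> bool {
    seqs.iter().any(|seq| {
        seq.len() == bytes.len()
            && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
    })
}

/// Count the scalar values in `start..=end` whose UTF-8 encoding is NOT matched.
fn missed(seqs: &[Sequence], start: u32, end: u32) -> usize {
    (start..=end)
        .filter_map(char::from_u32)
        .filter(|c| {
            let mut buf = [0u8; 4];
            !matches_any(seqs, c.encode_utf8(&mut buf).as_bytes())
        })
        .count()
}

fn main() {
    println!("naive misses: {}", missed(NAIVE, 0x0400, 0x052F));
    println!("split misses: {}", missed(SPLIT, 0x0400, 0x052F));
}
```

<p>The naive sequence misses every character whose second byte lies in <code>B0-BF</code> under the lead bytes <code>D0</code> through <code>D3</code>, while the two split sequences miss none.</p>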
| |
| <p>This gets even more complicated if you want bigger ranges, particularly if |
| they naively contain surrogate codepoints. For example, the sequences of byte |
| ranges for the basic multilingual plane (<code>[0000-FFFF]</code>) look like this:</p> |
| |
| <pre class="rust rust-example-rendered"> |
| [<span class="number">0</span><span class="op">-</span><span class="number">7F</span>] |
| [<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre> |
| |
| <p>Note that the byte ranges above will <em>not</em> match any erroneous encoding of |
| UTF-8, including encodings of surrogate codepoints.</p> |
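<p>This claim can be spot-checked with a small standard-library sketch that transcribes the six sequences above into a table. The names <code>BMP</code>, <code>bmp_matches</code> and <code>covers_bmp</code> are hypothetical; the sketch checks that every scalar value in the plane is matched and that a few well-known invalid encodings are rejected. It is not an exhaustive proof.</p>

```rust
/// The six sequences for [0000-FFFF], transcribed from the listing above.
const BMP: &[&[(u8, u8)]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// True if `bytes` is matched, byte for byte, by one of the sequences.
fn bmp_matches(bytes: &[u8]) -> bool {
    BMP.iter().any(|seq| {
        seq.len() == bytes.len()
            && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
    })
}

/// True if every scalar value in 0000-FFFF is matched. `char::from_u32`
/// returns `None` for surrogates, so they are skipped here.
fn covers_bmp() -> bool {
    (0u32..=0xFFFF).filter_map(char::from_u32).all(|c| {
        let mut buf = [0u8; 4];
        bmp_matches(c.encode_utf8(&mut buf).as_bytes())
    })
}

fn main() {
    println!("covers all BMP scalar values: {}", covers_bmp());
    // ED A0 80 would be the encoding of the surrogate D800.
    println!("matches ED A0 80: {}", bmp_matches(&[0xED, 0xA0, 0x80]));
    // C0 80 is an overlong encoding of 0000.
    println!("matches C0 80: {}", bmp_matches(&[0xC0, 0x80]));
}
```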
| |
| <p>And, of course, for all of Unicode (<code>[000000-10FFFF]</code>):</p> |
| |
| <pre class="rust rust-example-rendered"> |
| [<span class="number">0</span><span class="op">-</span><span class="number">7F</span>] |
| [<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">F0</span>][<span class="number">90</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">F1</span><span class="op">-</span><span class="ident">F3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] |
| [<span class="ident">F4</span>][<span class="number">80</span><span class="op">-</span><span class="number">8F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre> |
| |
| <p>This crate automates the process of creating these byte ranges from ranges of |
| Unicode scalar values.</p> |
| |
| <h1 id='why-would-i-ever-use-this' class='section-header'><a href='#why-would-i-ever-use-this'>Why would I ever use this?</a></h1> |
| <p>You probably won't ever need this. In 99% of cases, you just decode the byte |
| sequence into a Unicode scalar value and compare scalar values directly. |
| However, this explicit decoding step isn't always possible. For example, the |
| construction of some finite state machines may benefit from converting ranges |
| of scalar values into UTF-8 decoder automata (e.g., for character classes in |
| regular expressions).</p> |
| |
| <h1 id='lineage' class='section-header'><a href='#lineage'>Lineage</a></h1> |
| <p>I got the idea and general implementation strategy from Russ Cox in his |
| <a href="https://swtch.com/%7Ersc/regexp/regexp3.html">article on regexps</a> and RE2. |
| Russ Cox got it from Ken Thompson's <code>grep</code> (no source, folklore?). |
| I also got the idea from |
| <a href="https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java">Lucene</a>, |
| which uses it for executing automata on its term index.</p> |
| </div><h2 id='structs' class='section-header'><a href="#structs">Structs</a></h2> |
| <table> |
| <tr class=' module-item'> |
| <td><a class="struct" href="struct.Utf8Range.html" |
| title='struct utf8_ranges::Utf8Range'>Utf8Range</a></td> |
| <td class='docblock-short'> |
| <p>A single inclusive range of UTF-8 bytes.</p> |
| </td> |
| </tr> |
| <tr class=' module-item'> |
| <td><a class="struct" href="struct.Utf8Sequences.html" |
| title='struct utf8_ranges::Utf8Sequences'>Utf8Sequences</a></td> |
| <td class='docblock-short'> |
| <p>An iterator over ranges of matching UTF-8 byte sequences.</p> |
| </td> |
| </tr></table><h2 id='enums' class='section-header'><a href="#enums">Enums</a></h2> |
| <table> |
| <tr class=' module-item'> |
| <td><a class="enum" href="enum.Utf8Sequence.html" |
| title='enum utf8_ranges::Utf8Sequence'>Utf8Sequence</a></td> |
| <td class='docblock-short'> |
| <p>Utf8Sequence represents a sequence of byte ranges.</p> |
| </td> |
| </tr></table></section> |
| <section id='search' class="content hidden"></section> |
| |
| <section class="footer"></section> |
| |
| |
| |
| |
| <script> |
| window.rootPath = "../"; |
| window.currentCrate = "utf8_ranges"; |
| </script> |
| <script src="../main.js"></script> |
| <script defer src="../search-index.js"></script> |
| </body> |
| </html> |