| <html> |
| <head> |
| <title>Non-uniform file encodings in the Eclipse Platform</title> |
| <link rel="stylesheet" href="http://dev.eclipse.org/default_style.css" type="text/css"> |
| </head> |
| <body text="#000000" bgcolor="#FFFFFF"> |
| <h1>Non-uniform file encodings in the Eclipse Platform</h1> |
| <p><font size="-1">Last modified: June 12, 2003</font> </p> |
| <cite><strong>Plan item description:</strong> Eclipse 2.1 uses |
| a single global file encoding setting for reading and writing files in the |
| workspace. This is problematic; for example, when Java source files in the |
| workspace use OS default file encoding while XML files in the workspace use |
| UTF-8 file encoding. The Platform should support non-uniform file encodings. |
| [Platform Core, Platform UI, Text, Search, Compare, JDT UI, JDT Core] [Theme: |
| User experience] (bug <a href="http://bugs.eclipse.org/bugs/show_bug.cgi?id=37933">37933</a>, |
| <a href="http://dev.eclipse.org/bugs/show_bug.cgi?id=5399">5399</a>) </cite> |
| <p>The current situation is as follows:</p> |
| <ul> |
| <li><code>ResourcesPlugin.getEncoding</code> returns the default encoding for |
| the workspace (the <code>org.eclipse.core.resources.encoding</code> preference |
| value if available, otherwise the value of the <code>file.encoding</code> |
| Java system property).</li> |
| <li><code>IFile.getContents</code>/<code>setContents</code> work with byte streams |
| - no encoding can be applied.</li> |
| <li><code>IFile.getEncoding</code> tries to guess the file encoding (looking |
| for the <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Byte Order |
| Mark</a>), which is not enough. Also, this API has no known client |
| so far. This API method would be deprecated.</li> |
| <li>the Java compiler supports non-uniform encondings for Java source files, |
| but in Eclipse it relies on <code>ResourcesPlugin.getEncoding (same value |
| for all sources)</code>.</li> |
| <li>the text editor framework supports setting the encoding for files being |
| edited (setting a persistent property on the file resource), but there is |
| no support for setting the encoding of multiple files simultaneously, and |
| other components are not aware of the encoding settings.</li> |
| </ul> |
| <h2>Requirements</h2> |
| <ul> |
| <li>users should be able to set the encoding for a file or a default encoding |
| for a container (folders, projects, the workspace root) and its children.</li> |
| <li>users should be able to share encoding settings in a team repository (metadata |
| should reside in the project content area).</li> |
| <li>file-specific encoding set by users prevails upon file contents-based encoding.</li> |
| <li>file contents-based encoding prevails upon the inherited |
| encoding setting.</li> |
| </ul> |
| <h2>Proposed solution</h2> |
| <p>The encoding for a resource (as returned by <code>IResource.getCharset</code> |
| - see API changes) will be:</p> |
| <ol> |
| <li> the encoding explictly set by a client/user (with <code>IResource.setCharset</code> - |
| see API changes), if any, or</li> |
| <li> for a file resource, the encoding discovered by an <em>encoding interpreter</em> |
| associated to the file extension, if one exists <em>and</em> can determine |
| the encoding, or</li> |
| <li> for a file resource, the file encoding determined by its Byte Order Mark, |
| if it exists, or</li> |
| <li> the resource parent's encoding (except for the workspace root, whose encoding |
| is equivalent to ResourcesPlugin.getEncoding()).</li> |
| </ol> |
| <p>Regarding #2, an extension-point would allow file format-aware encoding interpreters |
| to register to the encoding discovery mechanism for specific file types (extensions) |
| or to associate existing encoding interpreters to their own file extensions. |
| Users would be able to associate more file extensions for the known interpreters |
| (preference).</p> |
| <p>All clients, when creating character-based streams when reading/writing the |
| contents of a file resource, should pass along the charset string obtained from |
| <code>IFile.getCharset</code> instead of the one provided by <code>ResourcesPlugin.getEncoding</code>. |
| Examples are: text editors, compiler, search, compare. </p> |
| <p>Also, setting the encoding for a resource would generate a resource change |
| event, but only for the directly affected resource (if clients are interested |
| on what effects the change in a directory had on files inside it, they will |
| have to find it out by themselves). </p> |
| <h3>API changes</h3> |
| <h4>Added:</h4> |
| <p><code>public void IResource.setCharset(String charsetName) throws CoreException</code></p> |
| <p>Sets the charset name for this resource. May be <code>null</code>, which sets |
| it to default. For the workspace root, it sets the workspace's default encoding |
| preference to the charset's canonical name (or to the default encoding, if <code>null</code> |
| was provided). </p> |
| <p><code>public String IResource.getCharset() throws CoreException</code></p> |
| <p>Returns the name of the charset for this resource. For files, if none has been |
| defined (with <code>setCharset</code>), returns the default charset. To determine |
| the default charset, it tries to guess it by a) inspecting the file contents |
| (BOM), b) calling the corresponding encoding interpreter (if any). Otherwise, |
| the parent's charset is returned. For the workspace root, a charset corresponding |
| to the workspace's default encoding preference is returned.</p> |
| <p><code>public boolean IResource.isDefaultCharset() throws CoreException</code></p> |
| <p>Returns <code>true</code> if the currently configured charset was not explicitly |
| set by the user - (has a default value either guessed by file contents, or inherited from parent).</p> |
| <p><code>public static final int IResourceDelta.ENCODING = 0x100000;</code></p> |
| <p><code>public String IResourceDelta.getNewCharset();</code></p> |
| <p><code>public String IResourceDelta.getOldCharset();</code></p> |
| <p>For notifying changes in file encodings. Both methods should only be called |
| only valid when <code>getKind()==CHANGE</code>, and <code>(getFlags()&ENCODING)!=0</code>.</p> |
| <pre>public interface IEncodingInterpreter { |
| /** returns null if the charset cannot be determined. */ |
| public String interpretCharset(java.io.InputStream input); |
| }</pre> |
| <p>Encoding interpreters will be associated to file types through a new core resources |
| extension point. Users can associate additional file extensions ia preferences.</p> |
| <p>The platform would provide itself implementations for xml and other popular |
| (?) file formats.</p> |
| <h4>Deprecated:</h4> |
| <pre>public int IFile.getEncoding() |
| public int IFile.ENCODING_* constants</pre> |
| <h3>Encoding settings metadata</h3> |
| <p>The encoding settings metadata will be stored inside the project's content |
| area so it can be easily shared.</p> |
| <h3>Scenarios</h3> |
| <ul> |
| <li>The user opens in an editor a text file whose contents where created using |
| encoding "MS932" in a workspace whose default encoding is "US-ASCII". |
| It was not possible to guess the file encoding automatically, so what the |
| user sees is gibberish. The user figures out the cause of the problem and |
| expliclty sets the encoding for that specific file to be "MS932". |
| The editor will get notified and might offer to the user the option of reloading |
| the file contents.</li> |
| <li>The user gets the source code for a set of classes he/she needs to use, |
| but the classes do not compile because the author used internal Java identifiers |
| not supported in the user workspace's current encoding. The user then selects |
| the offending source files, and apply to them the correct encoding. The encoding |
| change in the affected files will be reported in subsequent resource change |
| events (to listeners and builders). Builders may recompute build state affected |
| by changed encodings. Views depending on the contents of the affected files |
| may decide to reload the contents using the new encoding. </li> |
| <li> If the user changes the encoding for a bunch of <em>directories</em>, only |
| the directly affected resources will appear in the delta. Clients may want |
| to re-read/re-build the contents of files whose parent changed encoding. Otherwise, |
| the user will have to trigger a full build in the affected project. Refresh |
| will not help.</li> |
| <li>The user moves a text file with default encoding to a directory which has |
| a different encoding than the previous parent (which means the encoding for |
| the file has changed). No encoding changes will be reported. Clients may want |
| to re-read the file when it is moved if it reports a different encoding than |
| the one originally used to load its contents.</li> |
| </ul> |
| </body></html> |