blob: 14984ee2c9371d2d4c2474bfcda4fc5e3a2874fe [file] [log] [blame]
<html>
<head>
<title>Non-uniform file encodings in the Eclipse Platform</title>
<link rel="stylesheet" href="http://dev.eclipse.org/default_style.css" type="text/css">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<h1>Non-uniform file encodings in the Eclipse Platform</h1>
<p><font size="-1">Last modified: June 12, 2003</font> </p>
<cite><strong>Plan item description:</strong> Eclipse 2.1 uses
a single global file encoding setting for reading and writing files in the
workspace. This is problematic; for example, when Java source files in the
workspace use OS default file encoding while XML files in the workspace use
UTF-8 file encoding. The Platform should support non-uniform file encodings.
[Platform Core, Platform UI, Text, Search, Compare, JDT UI, JDT Core] [Theme:
User experience] (bug <a href="http://bugs.eclipse.org/bugs/show_bug.cgi?id=37933">37933</a>,
<a href="http://dev.eclipse.org/bugs/show_bug.cgi?id=5399">5399</a>) </cite>
<p>The current situation is as follows:</p>
<ul>
<li><code>ResourcesPlugin.getEncoding</code> returns the default encoding for
the workspace (the <code>org.eclipse.core.resources.encoding</code> preference
value if available, otherwise the value of the <code>file.encoding</code>
Java system property).</li>
<li><code>IFile.getContents</code>/<code>setContents</code> work with byte streams
- no encoding can be applied.</li>
<li><code>IFile.getEncoding</code> tries to guess the file encoding (looking
for the <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Byte Order
Mark</a>), which is not enough. Also, this API has no known client
so far. This API method would be deprecated.</li>
<li>the Java compiler supports non-uniform encondings for Java source files,
but in Eclipse it relies on <code>ResourcesPlugin.getEncoding (same value
for all sources)</code>.</li>
<li>the text editor framework supports setting the encoding for files being
edited (setting a persistent property on the file resource), but there is
no support for setting the encoding of multiple files simultaneously, and
other components are not aware of the encoding settings.</li>
</ul>
<h2>Requirements</h2>
<ul>
<li>users should be able to set the encoding for a file or a default encoding
for a container (folders, projects, the workspace root) and its children.</li>
<li>users should be able to share encoding settings in a team repository (metadata
should reside in the project content area).</li>
<li>file-specific encoding set by users prevails upon file contents-based encoding.</li>
<li>file contents-based encoding prevails upon the inherited
encoding setting.</li>
</ul>
<h2>Proposed solution</h2>
<p>The encoding for a resource (as returned by <code>IResource.getCharset</code>
- see API changes) will be:</p>
<ol>
<li> the encoding explictly set by a client/user (with <code>IResource.setCharset</code> -
see API changes), if any, or</li>
<li> for a file resource, the encoding discovered by an <em>encoding interpreter</em>
associated to the file extension, if one exists <em>and</em> can determine
the encoding, or</li>
<li> for a file resource, the file encoding determined by its Byte Order Mark,
if it exists, or</li>
<li> the resource parent's encoding (except for the workspace root, whose encoding
is equivalent to ResourcesPlugin.getEncoding()).</li>
</ol>
<p>Regarding #2, an extension-point would allow file format-aware encoding interpreters
to register to the encoding discovery mechanism for specific file types (extensions)
or to associate existing encoding interpreters to their own file extensions.
Users would be able to associate more file extensions for the known interpreters
(preference).</p>
<p>All clients, when creating character-based streams when reading/writing the
contents of a file resource, should pass along the charset string obtained from
<code>IFile.getCharset</code> instead of the one provided by <code>ResourcesPlugin.getEncoding</code>.
Examples are: text editors, compiler, search, compare. </p>
<p>Also, setting the encoding for a resource would generate a resource change
event, but only for the directly affected resource (if clients are interested
on what effects the change in a directory had on files inside it, they will
have to find it out by themselves). </p>
<h3>API changes</h3>
<h4>Added:</h4>
<p><code>public void IResource.setCharset(String charsetName) throws CoreException</code></p>
<p>Sets the charset name for this resource. May be <code>null</code>, which sets
it to default. For the workspace root, it sets the workspace's default encoding
preference to the charset's canonical name (or to the default encoding, if <code>null</code>
was provided). </p>
<p><code>public String IResource.getCharset() throws CoreException</code></p>
<p>Returns the name of the charset for this resource. For files, if none has been
defined (with <code>setCharset</code>), returns the default charset. To determine
the default charset, it tries to guess it by a) inspecting the file contents
(BOM), b) calling the corresponding encoding interpreter (if any). Otherwise,
the parent's charset is returned. For the workspace root, a charset corresponding
to the workspace's default encoding preference is returned.</p>
<p><code>public boolean IResource.isDefaultCharset() throws CoreException</code></p>
<p>Returns <code>true</code> if the currently configured charset was not explicitly
set by the user - (has a default value either guessed by file contents, or inherited from parent).</p>
<p><code>public static final int IResourceDelta.ENCODING = 0x100000;</code></p>
<p><code>public String IResourceDelta.getNewCharset();</code></p>
<p><code>public String IResourceDelta.getOldCharset();</code></p>
<p>For notifying changes in file encodings. Both methods should only be called
only valid when <code>getKind()==CHANGE</code>, and <code>(getFlags()&amp;ENCODING)!=0</code>.</p>
<pre>public interface IEncodingInterpreter {
/** returns null if the charset cannot be determined. */
public String interpretCharset(java.io.InputStream input);
}</pre>
<p>Encoding interpreters will be associated to file types through a new core resources
extension point. Users can associate additional file extensions ia preferences.</p>
<p>The platform would provide itself implementations for xml and other popular
(?) file formats.</p>
<h4>Deprecated:</h4>
<pre>public int IFile.getEncoding()
public int IFile.ENCODING_* constants</pre>
<h3>Encoding settings metadata</h3>
<p>The encoding settings metadata will be stored inside the project's content
area so it can be easily shared.</p>
<h3>Scenarios</h3>
<ul>
<li>The user opens in an editor a text file whose contents where created using
encoding &quot;MS932&quot; in a workspace whose default encoding is &quot;US-ASCII&quot;.
It was not possible to guess the file encoding automatically, so what the
user sees is gibberish. The user figures out the cause of the problem and
expliclty sets the encoding for that specific file to be &quot;MS932&quot;.
The editor will get notified and might offer to the user the option of reloading
the file contents.</li>
<li>The user gets the source code for a set of classes he/she needs to use,
but the classes do not compile because the author used internal Java identifiers
not supported in the user workspace's current encoding. The user then selects
the offending source files, and apply to them the correct encoding. The encoding
change in the affected files will be reported in subsequent resource change
events (to listeners and builders). Builders may recompute build state affected
by changed encodings. Views depending on the contents of the affected files
may decide to reload the contents using the new encoding. </li>
<li> If the user changes the encoding for a bunch of <em>directories</em>, only
the directly affected resources will appear in the delta. Clients may want
to re-read/re-build the contents of files whose parent changed encoding. Otherwise,
the user will have to trigger a full build in the affected project. Refresh
will not help.</li>
<li>The user moves a text file with default encoding to a directory which has
a different encoding than the previous parent (which means the encoding for
the file has changed). No encoding changes will be reported. Clients may want
to re-read the file when it is moved if it reports a different encoding than
the one originally used to load its contents.</li>
</ul>
</body></html>