| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
| <html> |
| <head> |
| <title>Non-uniform file encodings in the Eclipse Platform</title> |
| <link rel="stylesheet" href="../default_style.css" type="text/css"> |
| </head> |
| <body text="#000000" bgcolor="#ffffff"> |
| <h1>Non-uniform file encodings in the Eclipse Platform</h1> |
| <p><font size="-1">Last modified: February 23, 2004</font> </p> |
| <cite><strong>Plan item description:</strong> Eclipse 2.1 uses a single |
| global file encoding setting for reading and writing files in the |
| workspace. This is problematic; for example, when Java source files in |
| the workspace use OS default file encoding while XML files in the |
| workspace use UTF-8 file encoding. The Platform should support |
| non-uniform file encodings. [Platform Core, Platform UI, Text, Search, |
| Compare, JDT UI, JDT Core] [Theme: User experience] (bug <a |
| href="http://bugs.eclipse.org/bugs/show_bug.cgi?id=37933">37933</a>, <a |
| href="http://dev.eclipse.org/bugs/show_bug.cgi?id=5399">5399</a>) </cite> |
| <p>The pre-M7 situation is as follows:</p> |
| <ul> |
| <li><code>ResourcesPlugin.getEncoding</code> returns the default |
| encoding for the workspace (the <code>org.eclipse.core.resources.encoding</code> |
| preference value if available, otherwise the value of the <code>file.encoding</code> |
| Java system property).</li> |
| <li><code>IFile.getContents</code>/<code>setContents</code> work with |
| byte streams - no encoding can be applied.</li> |
| <li><code>IFile.getEncoding</code> tries to guess the file encoding |
| (looking for the <a |
| href="http://www.unicode.org/unicode/faq/utf_bom.html">Byte Order Mark</a>), |
| which is not enough, since files without a BOM cannot be identified this way. |
| This API also has no known clients so far and would be deprecated.</li> |
| <li>the Java compiler supports non-uniform encodings for Java source |
| files, but in Eclipse it relies on <code>ResourcesPlugin.getEncoding<span |
| style="font-family: helvetica,arial,sans-serif;"> |
| (same value for all sources)</span></code><span |
| style="font-family: helvetica,arial,sans-serif;">.</span></li> |
| <li>the text editor framework supports setting the encoding for files |
| being edited (setting a persistent property on the file resource), but |
| there is no support for setting the encoding of multiple files |
| simultaneously, and other components are not aware of the encoding |
| settings.</li> |
| </ul> |
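| <p>To illustrate why BOM sniffing alone is insufficient, here is a minimal sketch of such a check. The class and method names are hypothetical and are not part of any Eclipse API. The check yields an answer only for files that actually begin with a BOM.</p> |

```java
final class BomSniffer {

    /**
     * Returns the charset name implied by a leading byte order mark,
     * or null if the bytes do not start with a recognized BOM.
     * Only UTF-8 and UTF-16 are handled; a UTF-32LE BOM (FF FE 00 00)
     * would be reported as UTF-16LE by this simplified check.
     */
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: the encoding remains unknown
    }
}
```

| <p>A file saved in MS932 or ISO-8859-1 carries no BOM, so this check returns <code>null</code> for it and a fallback is still required.</p> |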
| <h2>Requirements </h2> |
| <ul> |
| <li>the encoding of a file should be automatically determined by |
| considering the file's content and/or its name or extension.</li> |
| <li>encoding information should be available not only for workspace |
| resources (e.g. IFiles) but for external files too. This has become |
| more important in light of the recent RCP work since use of the |
| resource plugin has become optional.</li> |
| <li>encoding information should be available for local history |
| contents (IFileState) and archives (e.g. *.zip, and *.jar).</li> |
| <li>in the future it should be possible to use the content based |
| encoding interpreter for determining more information about the file |
| (e.g. content type) without duplicating the mechanism but rather by |
| augmenting |
| it.</li> |
| <li>users should be able to set the default encoding for a project.</li> |
| <li>users should be able to share the default encoding settings in a |
| team |
| repository (metadata should reside in the project content area).</li> |
| <li>an encoding determined from a file's contents takes precedence over |
| the inherited encoding setting.</li> |
| <li>users should be able to easily store a file in a different |
| encoding (in order to change its encoding).</li> |
| </ul> |
| <h2>Proposed solution</h2> |
| In addition to the existing approach of having a single global encoding for a |
| workbench, we propose: |
| <ol> |
| <li>an extensible mechanism to determine the encoding of a stream by |
| analyzing its contents or, if available, its file name,</li> |
| <li>to add a default encoding property to projects. This default |
| encoding is used if no encoding could be determined in the first step.</li> |
| </ol> |
| We do not (yet) propose a settable encoding attribute per file because |
| <ul> |
| <li>we do not see an immediate need for this fine level of |
| granularity,</li> |
| <li>Eclipse has no shareable file attributes, so a per-file encoding setting |
| could not easily be shared along with the files it applies to.</li> |
| </ul> |
| The encoding for a stream or an IStorage (as returned by the two <code>getCharset</code> |
| methods; see API changes) will be: |
| <ol> |
| <li> the encoding discovered by a <span style="font-style: italic;">content</span><em> |
| interpreter</em> associated to the file extension (or file type), if one exists |
| <em>and</em> can determine the encoding, or</li> |
| <li> the default encoding defined for the enclosing project, if any, or</li> |
| <li>the global workspace encoding (equivalent to |
| ResourcesPlugin.getEncoding()).</li> |
| </ol> |
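| <p>The precedence above can be sketched as a simple fallback chain. This is only an illustration: the class, the method and its parameters are hypothetical, and <code>null</code> stands for "could not be determined" or "not set".</p> |

```java
final class CharsetResolution {

    /**
     * Applies the proposed precedence:
     * 1. encoding discovered from the content, if any;
     * 2. default encoding of the enclosing project, if any;
     * 3. global workspace encoding (always available).
     */
    static String resolve(String fromContent, String projectDefault, String workspaceDefault) {
        if (fromContent != null) {
            return fromContent;
        }
        if (projectDefault != null) {
            return projectDefault;
        }
        return workspaceDefault;
    }
}
```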
| <p>Regarding #1, an extension point would allow file format-aware |
| encoding interpreters to register with the encoding discovery mechanism |
| for specific file types (extensions), or to associate existing encoding |
| interpreters with their own file extensions. Users would be able to |
| associate additional file extensions with the known interpreters through a preference.</p> |
| <p>When creating character-based streams to read or write |
| the contents of a file resource, all clients should pass along the |
| charset string obtained from one of the <code>getCharset</code> |
| methods instead of |
| the one provided by <code>ResourcesPlugin.getEncoding</code>. Examples |
| are text editors, the compiler, search, and compare. </p> |
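| <p>On the client side this amounts to passing the charset name to the reader or writer that wraps the byte stream. The following sketch uses only standard <code>java.io</code> classes; the helper class and the way the charset name is obtained are assumptions made for illustration.</p> |

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

final class CharsetAwareReading {

    /**
     * Decodes the whole stream using the given charset name, which a client
     * would obtain from one of the proposed getCharset methods.
     */
    static String readAll(InputStream contents, String charsetName) throws IOException {
        // InputStreamReader throws UnsupportedEncodingException for unknown names.
        Reader reader = new BufferedReader(new InputStreamReader(contents, charsetName));
        try {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } finally {
            reader.close();
        }
    }
}
```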
| <h3>API changes</h3> |
| <h4>Added:</h4> |
| To make the encoding support available for non-workspace based |
| resources we propose to add the following method to |
| org.eclipse.core.runtime.IPlatform:<br> |
| <pre>public interface <span style="font-weight: bold;">IPlatform</span> {<br> // ...<br> public String <span |
| style="font-weight: bold;">getCharset</span>(InputStream stream, String fileExtension) throws CoreException;<br> // ...<br>}<br></pre> |
| <p>The InputStream seems to be the most widely used and scalable |
| mechanism to get |
| access to any kind of byte content. InputStreams can be easily created |
| for a |
| java.io.File, an IStorage (which subsumes IFile and IFileState; see |
| below), as |
| well as for bytes in memory (ByteArrayInputStream).<br> |
| The optional file extension argument can be used to quickly rule out the more |
| expensive ways of inferring the encoding from the contents.<br> |
| </p> |
| A corresponding implementation (based on IContentInterpreters; see |
| below) lives in org.eclipse.core.runtime.Platform.<br> |
| <br> |
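| <p>One way to picture the role of the file extension argument is a lookup table keyed by extension that decides whether any content analysis is attempted at all. The sketch below is hypothetical: it models a detector as a plain function from a stream to a charset name, and it assumes that a missing or unknown extension yields <code>null</code>, which the proposal does not specify.</p> |

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class CharsetDetectors {

    // Detectors keyed by lower-cased file extension (without the dot).
    private final Map<String, Function<InputStream, String>> byExtension = new HashMap<>();

    /** Associates a detector with a file extension, e.g. "xml". */
    void associate(String fileExtension, Function<InputStream, String> detector) {
        byExtension.put(fileExtension.toLowerCase(Locale.ROOT), detector);
    }

    /**
     * Returns the detected charset name, or null if no detector is
     * registered for the extension or the detector cannot decide.
     * The stream is only read when a matching detector exists.
     */
    String getCharset(InputStream stream, String fileExtension) {
        if (fileExtension == null) {
            return null; // assumption: nothing to dispatch on
        }
        Function<InputStream, String> detector =
                byExtension.get(fileExtension.toLowerCase(Locale.ROOT));
        return detector == null ? null : detector.apply(stream);
    }
}
```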
| For the resource plugin we propose to add a new interface |
| IEncodedStorage that adds the single method getCharset to the existing |
| IStorage interface:<br> |
| <pre>interface <span style="font-weight: bold;">IEncodedStorage</span> extends IStorage {<br> public String <span |
| style="font-weight: bold;">getCharset</span>() throws CoreException;<br>}<br></pre> |
| <p>Its method getCharset returns the name of the encoding for an |
| IStorage. It would make sense to add this method directly to the |
| IStorage interface, since any InputStream can only be interpreted |
| correctly if the used encoding is known. But because clients are |
| allowed to implement IStorage this would be a breaking API change, so |
| we decided to introduce a separate extension to IStorage. <br> |
| </p> |
| <p>Two existing interfaces will extend IEncodedStorage: IFile and |
| IFileState. Two concrete classes will provide an implementation: File and |
| FileState.<br> |
| </p> |
| <p>For both files and file states, the implementation of getCharset |
| first uses IPlatform.getCharset(...) from above to find an encoding |
| based on any registered IContentInterpreters. If no encoding can be |
| determined, File.getCharset() locates the enclosing project of the file |
| and queries its IProjectDescription for a default encoding. For this we |
| need the following two new methods on IProjectDescription:<br> |
| </p> |
| <pre>interface <span style="font-weight: bold;">IProjectDescription</span> {<br> // ...<br> public String <span |
| style="font-weight: bold;">getDefaultCharset</span>();<br> public void <span |
| style="font-weight: bold;">setDefaultCharset</span>(String charset);<br> // ...<br>}<br></pre> |
| If no default encoding has been defined for the project, the workspace's |
| default encoding preference is returned (via the existing API).<br> |
| <br> |
| Other implementers of IStorage will have to decide whether they should |
| base their implementation on IEncodedStorage.<br> |
| <br> |
| The implementation of Platform.getCharset will make use of content interpreters |
| that implement the IContentInterpreter interface and can be associated with file |
| types through a new Core Runtime extension point "org.eclipse.core.runtime.contentInterpreter". |
| Users can associate additional file extensions via preferences.<br> |
| <p>The method interpretContent does not return the detected encoding but stores |
| it into a result object of type IContentInfo that is passed in as an argument. |
| This approach makes it possible to allow for collecting additional information |
| (like 'type'/'subtype') instead of just the encoding.</p> |
| <pre>interface <span style="font-weight: bold;">IContentInterpreter</span> { |
| public void <span style="font-weight: bold;">interpretContent</span>(IContentInfo result, InputStream contents); |
| }</pre> |
| The IContentInfo is: |
| <pre>public interface <span style="font-weight: bold;">IContentInfo</span> { |
| public void <span style="font-weight: bold;">setCharset</span>(String charset); |
| public String <span style="font-weight: bold;">getCharset</span>(); |
| }</pre> |
| Since we would not allow clients to implement (or extend) IContentInfo, |
| we will be able to extend the API with new setters and getters in the |
| future without breaking API. <br> |
| <p>The platform itself would provide implementations of |
| IContentInterpreter for XML and other |
| popular file formats.<br> |
| </p> |
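| <p>As an illustration of what such an interpreter might look like, the sketch below reads the encoding from an XML declaration. The two interfaces are local stand-ins that mirror the proposed signatures; <code>SimpleContentInfo</code> and <code>XmlDeclarationInterpreter</code> are hypothetical names. The sketch assumes an ASCII-compatible encoding for the declaration itself, so UTF-16 documents would need additional BOM handling.</p> |

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local stand-ins mirroring the proposed interfaces.
interface IContentInfo {
    void setCharset(String charset);
    String getCharset();
}

interface IContentInterpreter {
    void interpretContent(IContentInfo result, InputStream contents);
}

// Stand-in for the result object that the platform would supply.
final class SimpleContentInfo implements IContentInfo {
    private String charset;
    public void setCharset(String charset) { this.charset = charset; }
    public String getCharset() { return charset; }
}

final class XmlDeclarationInterpreter implements IContentInterpreter {

    // Matches the encoding pseudo-attribute of a leading XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    public void interpretContent(IContentInfo result, InputStream contents) {
        byte[] head = new byte[256]; // the declaration sits at the very start
        int n = 0;
        try {
            while (n < head.length) {
                int r = contents.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        } catch (IOException e) {
            return; // unreadable content: leave the result untouched
        }
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        if (m.find()) {
            result.setCharset(m.group(1));
        }
    }
}
```

| <p>When no declaration is found, the result object is left unchanged, which lets the caller fall back to the project or workspace default.</p> |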
| <h4>Deprecated:</h4> |
| <span style="font-family: monospace;">public int IFile.getEncoding()</span><br> |
| <span style="font-family: monospace;">public int IFile.ENCODING_</span>* |
| constants<br> |
| <br> |
| <code><span style="font-family: monospace;">public String |
| ResourcesPlugin.getEncoding()</span>: </code>Since all clients of this |
| method will most likely have to adapt their code, we suggest |
| deprecating getEncoding() and introducing a new method getDefaultCharset() |
| that better reflects its real purpose (and brings it in line with |
| IProjectDescription.getDefaultCharset()).<br> |
| <br> |
| <h3>UI Changes<br> |
| </h3> |
| We need to add new UI for changing the default encoding of a project. |
| A good place for this would be the project's Properties dialog, |
| since the encoding can be considered a property of the project, similar to |
| the read-only property. The Properties dialog for files |
| would only show the current value of the encoding and would not allow |
| changing it. <br> |
| <br> |
| We should provide a "<span style="font-style: italic;">Convert Encoding</span>" |
| action that converts the contents of a file (or all files in a hierarchy) to a |
| different encoding. This action would ask the user for two encodings: the first |
| is used when reading all selected files and the second when writing these files |
| back to the workspace. <br> |
| The action would not |
| change the encoding value returned by getCharset() but it would provide |
| a means to make the encoding of multiple files consistent with the |
| default encoding of the enclosing project.<br> |
| (An alternative to this UI would be to provide something like a "Save with |
| encoding" action for editors. But this UI seems to be less convenient if |
| the encoding of multiple files needs to be changed).<br> |
| <br> |
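| <p>At its core, the conversion performed by such an action is a decode with the first encoding followed by an encode with the second. The sketch below shows this for the bytes of a single file; the class and method names are hypothetical, and reading from and writing to the workspace is left out.</p> |

```java
import java.nio.charset.Charset;

final class EncodingConverter {

    /**
     * Re-encodes file contents: decodes the bytes with fromCharset and
     * encodes the resulting text with toCharset.
     * Charset.forName throws an unchecked exception for unknown names.
     */
    static byte[] convert(byte[] contents, String fromCharset, String toCharset) {
        String text = new String(contents, Charset.forName(fromCharset));
        return text.getBytes(Charset.forName(toCharset));
    }
}
```

| <p>Note that this simple form silently substitutes replacement characters for bytes that are malformed in the source encoding or for characters that the target encoding cannot represent, so a real action would probably want to detect and report such losses.</p> |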
| To make sharing files with heterogeneous encodings easier, |
| we will have to enhance the compare/merge tools to work with |
| heterogeneous encodings.<br> |
| <br> |
| To facilitate that, we try to automatically determine the encoding of |
| the remote resource<br> |
| <ul> |
| <li>by knowing the default encoding of the remote project if the |
| .project file (containing the encoding attribute) has been synchronized |
| first, or<br> |
| </li> |
| <li>by using the local IContentInterpreter mechanism for the remote |
| resources (which are available as streams), or<br> |
| </li> |
| <li>by allowing the user to change the encoding for the remote |
| resource on the fly until it displays correctly.</li> |
| </ul> |
| With these means it becomes possible to compare and merge files |
| regardless of whether the same encodings are used on both |
| sides.<br> |
| <br> |
| However, if we want to use the same encoding (that is, if we catch up with the remote |
| .project file), we will have to convert our local files to |
| the new encoding. For this we will provide the "<span style="font-style: italic;">Convert |
| Encoding</span>" action in the Compare/Merge tools where required.<br> |
| <br> |
| <h3>Scenarios</h3> |
| <ul> |
| <li>The user opens text files whose contents were created using encoding "MS932" |
| in a workspace whose default encoding is "US-ASCII". The file encoding cannot |
| be guessed automatically, so what the user sees is |
| gibberish. The user figures out the cause of the problem and explicitly sets |
| the encoding of the project containing the files to "MS932". The user then |
| has to reload all editors to see the contents correctly and has |
| to trigger a full build of the affected project. </li> |
| <li>The user has a Java project with a few Java files, no explicitly specified |
| project encoding, and a CP1252 workspace encoding. Now he wants to start using |
| all kinds of Unicode characters in his Java files. He sets the default encoding |
| of his project to UTF-16 and he converts all existing Java files to the UTF-16 |
| encoding. All newly created Java files will automatically have the correct |
| encoding. Project metadata files like ".project" or "plugin.xml" |
| files will still be read in their correct encoding since IContentInterpreters |
| still apply to them.</li> |
| <li>Determining the encoding to use for newly created files: Normally the encoding |
| becomes relevant on saving the file for the first time. Since the IFile already |
| exists and knows its project, the encoding to use can be determined by the |
| proposed API. A potential problem arises from the fact that a newly created |
| file should use an encoding that is consistent with the encoding it would |
| later get from an IContentInterpreter. Examples are *.properties and *.xml files, |
| whose formats define their own default encoding (ISO-8859-1 for Java properties |
| files, UTF-8 for XML without an encoding declaration) even if the enclosing |
| project uses a different encoding. The code that writes these files to disk (and |
| defines the initial encoding) must understand this. It can neither use the |
| encoding for the project, nor can it ask for the encoding of the file (because |
| the file is still empty when the IContentInterpreter tries to determine its |
| encoding).<br> |
| </li> |
| </ul> |
| </body> |
| </html> |