| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN2"> |
| <html> |
| <head> |
| <title>Removing restrictions on valid characters in paths</title> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <meta http-equiv="Content-Style-Type" content="text/css"> |
| <link href="http://dev.eclipse.org/default_style.css" rel="stylesheet" type="text/css"> |
| </head> |
| <body bgcolor="#ffffff"> |
| |
| <table width="100%"> |
| <tr><td style="background:#0080C0"><b><span style="color:white">Proposal</span></b></td></tr> |
| </table> |
| |
| <h1>Removing restrictions on valid characters in paths</h1> |
| |
| <blockquote> <b>Summary</b> <br> |
| The data structure <tt>org.eclipse.core.runtime.IPath</tt> and its |
| canonical implementation, <tt>org.eclipse.core.runtime.Path</tt>, impose |
| restrictions on segment names that can be more restrictive |
| than those for file names in the underlying file system. When Eclipse paths |
| are used to represent file system paths, these restrictions prevent |
| valid files from being added to the Eclipse workspace. This document |
| describes the current set of restrictions in Eclipse 3.0, and proposes changes to |
| lift these restrictions. |
| <p> |
| <font size="-1">Last modified: October 18, 2004</font> |
| </blockquote> |
| <p> |
| <h3>Background</h3> |
| <p> |
| <tt>IPath</tt> is an abstract data structure supplied by the <tt>org.eclipse.core.runtime</tt> |
| plug-in, and consists of the following parts: |
| <ul> |
| <li>An optional device (a <tt>String</tt>)</li> |
| <li>Zero or more ordered segments (represented as <tt>String</tt>)</li> |
| <li>Optional leading, trailing, and UNC separators</li> |
| </ul> |
| <p> |
| To facilitate conversion between <tt>IPath</tt> and <tt>String</tt> |
| instances, <tt>IPath</tt> reserves the colon (':') character as the device |
| delimiter, and the forward and back slash ('/' and '\') characters as segment delimiters. |
| The API javadoc for <tt>IPath.isValidSegment</tt> outlines the complete set |
| of restrictions: |
| <ul> |
| <li> the empty string is not valid |
| <li> any string containing the colon character (":") is not valid |
| <li> any string containing the slash character ("/") is not valid |
| <li> any string containing the backslash character ("\") is not valid |
| <li> any string starting or ending with a whitespace character is not valid |
| <li> all other strings are valid |
| </ul> |
| Contrast this with the restrictions on file names in Unix (as defined in the "Base Definitions" volume |
| of IEEE Std. 1003.1-2001, section 3.169 Filename): |
| <ul> |
| <li> the empty string is not valid </li> |
| <li> any string containing the slash character ('/') is not valid</li> |
| <li> the strings "." and ".." have special meaning</li> |
| <li> any string containing the null byte is not valid</li> |
| <li> all other strings are valid</li> |
| </ul> |
| Thus when Eclipse <tt>IPath</tt> objects are used to represent Unix file names, |
| they are unable to represent file names containing '\' or ':', or file names with leading |
| or trailing white space. |
| |
| <h3>Proposed Solution</h3> |
| <p> |
| Lifting the restriction on paths with leading or trailing whitespace and paths |
| containing the '\' character is easily achieved by specifying a new constructor for |
| creation of paths. Lifting the restriction on the ':' character is more |
| difficult, since it is needed as the path delimiter on operating systems that |
| support a device. |
| <p> |
| The solution must accomodate the two interesting categories |
| of <tt>IPath</tt> users: |
| <ul> |
| <li>Code that needs a platform-specific representation. In other words, someone that |
| actually wants to open a stream on a file, delete a file, etc. They essentially need to |
| translate between an <tt>IPath</tt> and a <tt>java.io.File</tt>.</li> |
| <li>Code that is reading, storing, and manipulating file system paths, but does |
| not need the platform-specific manifestation. In particular, there is a strong need |
| for a serialized representation of paths in a platform-neutral way so that they can |
| be later read and interpreted on a different platform (typically in the form of a path |
| that is relative to some platform specific prefix represented by a variable). |
| </ul> |
| <tt>IPath</tt> already acknowledges these two uses in its <tt>toString</tt> methods. |
| The standard <tt>toString</tt> method creates a platform-neutral encoding of |
| the path as a <tt>String</tt>. The <tt>toOSString</tt> method creates a platform-specific |
| encoding suitable for passing to <tt>java.io.File</tt> or other API that deals directly |
| with the file system. |
| <p> |
| The proposed solution is to introduce two constructors for creating <tt>IPath</tt> |
| that perform the inverse of the two existing <tt>toString</tt> methods: |
| <ul> |
| <li><code>Path.fromOSString</code>: A factory method that decodes a platform-specific string. |
| For example, this will parse the output of a previous call to <tt>IPath.toOSString</tt>, |
| or the value returned by <tt>java.io.File.getAbsolutePath</tt>.</li> |
| <li><code>Path.fromPortableString</code>: A factory method that decodes a platform-neutral |
| string, such as the output of a previous call to <tt>IPath.toPortableString</tt>.</li> |
| </ul> |
| <p> |
| Since changing the behaviour of the existing <code>toString</code> |
| method would cause too much breakage, an new method, <code>toPortableString</code> |
| will be introduced for creating a platform-neutral string representation of paths. |
| The existing <code>toString</code> method will remain unchanged. |
| <p> |
| Most clients will use the platform-specific form of paths. The path |
| can be converted to/from a platform-neutral representation when a path |
| needs to be serialized in a portable fashion. |
| <p> |
| The platform-neutral encoding of paths (<code>IPath.toPortableString</code>) will allow |
| all characters except slash ('/') in segment names, and include an optional device |
| separated from the segments by a single colon character. Literal colon characters in path |
| segments are escaped through doubling (one colon becomes two colons). |
| The following are some examples of windows file system paths and the |
| corresponding platform-neutral encoding: |
| <ul> |
| <li>"C:\folder\file.txt" becomes "C:/folder/file.txt"</li> |
| <li>"C:folder\file.txt" becomes "C:folder/file.txt"</li> |
| <li>"C:\folder\" becomes "C:/folder/"</li> |
| </ul> |
| Canonical Unix paths look identical to their platform-neutral encoding, except |
| in the presence of segments containing the colon character. The following are |
| some Unix paths and the corresponding encoding by <tt>IPath.toPortableString</tt>: |
| <ul> |
| <li>"/etc/" is encoded as "/etc/"</li> |
| <li>"/etc/passwd" is encoded as "/etc/passwd"</li> |
| <li>"/etc/timeNowIs4:25:12PM" is encoded as "/etc/timeNowIs4::25::12PM"</li> |
| <li>"c:/folder/file.txt" is encoded as "c::/folder/file.txt"</li> |
| </ul> |
| <p> |
| UNC paths, which typically have no device but have a double leading separator |
| will generally be the same |
| <ul> |
| <li>"//Server/Volume" becomes "//Server/Volume"</li> |
| <li>"//Server/TimeIs4:25:12PM" becomes "//Server/TimeIs4::25::12PM"</li> |
| </ul> |
| If for some reason a UNC path had a device, it will preceed the slashes: |
| <ul> |
| <li>"C://Server/Volume" becomes "C://Server/Volume"</li> |
| <li>"C://Server/TimeIs4:25:12PM" becomes "C://Server/TimeIs4::25::12PM"</li> |
| </ul> |
| <p> |
| This platform-neutral encoding unambiguously encodes all possible paths on |
| all supported platforms. Most importantly, this <tt>toPortableString</tt> implementation |
| is fully backward compatible with the Eclipse 3.0 implementation of <tt>IPath.toString</tt> |
| for all paths that can be created in Eclipse 3.0. This means that clients who |
| previously used <code>toString</code> for serializing paths can move to the |
| new <code>toPortableString/fromPortableString</code> methods without |
| migrating file formats. |
| <p> |
| The platform-specific Path factory method will impose the minimum platform-specific |
| requirements needed to unambiguosly parse all possible paths on that platform. |
| The Windows implementation, for example, will interpret everything up to the |
| first ':' as the device, and treat both '/' and '\' as path segment separators. No |
| other rules will be imposed. Thus the existing restriction on paths that prevents |
| path segments from having leading or trailing whitespace will no longer be enforced |
| on any platform. |
| <p> |
| As before, detailed validation of all legal characters and names on that platform |
| will not be enforced. Some clients use technology such as Cygwin or Samba to |
| mount foreign file systems on a platform. In these situations, path name rules |
| for the local file system do not apply. While it is difficult to fully support these users, |
| any additional platform-specific verification performed on paths causes further |
| problems for these users. Imposing the absolute minimum requirements for |
| unamiguously parsing paths allows the majority of users to function without |
| further impacting the corner cases. |
| </p> |
| |
| <h3>API Details</h3> |
| <p> |
| The following existing methods on <tt>IPath</tt> and <tt>Path</tt> are affected: |
| <ul> |
| <li><tt>isValidPath/isValidSegment</tt>. The implementation of these |
| methods will change to match the <code>fromOSString</code> factory |
| method. In other words, path validity becomes a platform-specific issue. |
| The specification will change to a more ambiguous wording stating only that |
| certain characters are reserved on some operating systems. In implementation, |
| it will just check for the device separator on operating systems that require it. |
| The restriction on leading and trailing spaces in segment names will be |
| removed on all operating systems that allow such paths.</li> |
| <li><tt>Path(String, String)</tt>. This constructor previously had the unspecified |
| behaviour of extracting the device from the second argument when no device |
| argument was supplied. This behaviour will change to no longer parse the |
| device from the second argument. This constructor will now handle literal |
| colon characters in path segments on all platforms.</li> |
| <li><tt>IPath.append(String)</tt>. This method implicitly constructs or interprets |
| a path literal from the provided parameter. This method will use the os-specific |
| rules when interpreting the provided path ('\' and ':' treated as segment and |
| device delimiters on Windows only).</li> |
| </ul> |
| New methods for Eclipse 3.1: |
| <ul> |
| <li><tt>Path.fromPortableString</tt>. A factory method for producing an <tt>IPath</tt> |
| given a platform-neutral encoding of a string. This is the inverse of the |
| <tt>IPath.toPortableString</tt> method. In particular, double colons will be interpreted |
| as single colon characters in segment names, and the first single colon (if any) |
| will be treated as the device separator.</li> |
| <li><tt>toPortableString</tt>. A method for producing a platform-neutral |
| encoding of a path as a string, suitable for storing in files that need to |
| be platform-neutral. This encoding will escape literal colons in path segments |
| using a double colon.</li> |
| <li><tt>Path.fromOSString</tt>. A factory method for producing an <tt>IPath</tt> |
| given a platform-specific encoding of a string. This is the inverse of the |
| <tt>IPath.toOSString</tt> method. On Windows the colon character is treated |
| as the device separator, and both varieties of slash are treated as segment separators.</li> |
| </ul> |
| </p> |
| <h3>What do we do with the <tt>Path(String)</tt> constructor?</h3> |
| <p> |
| This proposal introduces two factory methods that clearly distinguish |
| platform-neutral and platform-specific encodings of paths. The difficult |
| question is what to do with old single argument <tt>Path</tt> constructor. The two options are: |
| <ol> |
| <li> |
| Leave the implementation of this constructor unchanged, but deprecate |
| it. The advantage of this solution is that it does not break the API contract |
| spelled out in the current <code>Path</code> constructor, which explicitly |
| states how it handles ':' and '\' characters. The disadvantage is that this will |
| require all callers of the existing <tt>Path</tt> constructors |
| to migrate to one of the two path factory methods, depending on the origin |
| of the path string being used. Clients that do not migrate to the new factory |
| methods risk errors introduced when trying to construct <code>IPath</code> |
| instances corresponding to file system paths that were previously treated as |
| invalid. For example, the resources plugin would allow introduction of |
| resources with the ':' and '\' characters. Other plugins trying to create |
| a path corresponding to those resources using the old constructors will |
| fail. Experiments with this solution showed that plug-ins that failed to |
| migrate to the new factory method were broken due to the unexpected |
| introduction of previously invalid paths. This presents a bleak picture |
| for backwards-compatibility, regardless of the fact that no API contracts |
| are broken. |
| </li><li> |
| The second option is to change the existing single argument path constructor to be |
| platform-specific. In other words, the Windows implementation of these |
| methods would remain unchanged, but implementations on other platforms |
| would stop treating ':' as the device separator, and no longer treat '\' |
| as a path segment separator. This clearly violates the existing API |
| specification of the <code>Path</code> constructor. On the positive |
| side, this introduces very little breakage in practice. The net effect |
| is of removing old restrictions on some operating systems. The only |
| breakage will be caused to clients who use a device for some reason |
| on all operating systems, and clients that need to construct <code>IPath</code> |
| objects representing file system paths from platforms other than the one |
| that the current Eclipse instance is running in. For example, a plug-in running |
| on Linux would not be able to use the old constructors to create <code>IPath</code> |
| objects representing files from a remote Windows system. |
| </li></ol> |
| <p> |
| After investigating the implementation of both of the above approaches, |
| the second option introduces the smallest breakage by far. For example, |
| the first option requires almost all of the 600 references to the <code>Path</code> |
| constructors found in the current edition of the Eclipse platform. The second |
| option requires only a small set of localized changes in code that deals |
| with serializing and deserializing paths in a platform-neutral manner. Based |
| on testing the implementation of these two options, this proposal recommends |
| option two. |
| <h3>Examples</h3> |
| <p> |
| The following examples illustrate the behaviour of the various <tt>Path</tt> |
| constructors and <tt>to*String</tt> methods. |
| <p> |
| Given the absolute path with device "C:" and single segment "foo", the following |
| <tt>IPath</tt> methods will produce: |
| <ul> |
| <li>toString -> "C:/foo"</li> |
| <li>toPortableString -> "C:/foo" </li> |
| <li>toOSString (windows) -> "C:\foo" </li> |
| <li>toOSString (Linux) -> "C:/foo" </li> |
| <li>getDevice -> "C:"</li> |
| </ul> |
| Given the relative path with null device, and two segments "C:" and "foo": |
| <ul> |
| <li>toString -> "C:/foo"</li> |
| <li>toPortableString -> "C::/foo"</li> |
| <li>toOSString (windows) -> "C:\foo"</li> |
| <li>toOSString (Linux) -> "C:/foo"</li> |
| <li>getDevice -> (null)</li> |
| </ul> |
| Given the string "C:\\foo" (single backslash escaped in Java literal format), the following constructors will produce: |
| <ul> |
| <li>fromOSString and Path(String) (windows) -> Absolute path with device "C:" and single segment "foo" |
| <li>fromOSString and Path(String) (Linux) -> Relative path with null device and single segment "C:\foo" |
| <li>fromPortableString -> Relative path with device "C:" and single segment "\foo" |
| </ul> |
| Given the string "C:/foo": |
| <ul> |
| <li>fromOSString and Path(String) (windows) -> Absolute path with device "C:" and single segment "foo" |
| <li>fromOSString and Path(String) (Linux) -> Relative path with null device and two segments "C:" and "foo" |
| <li>fromPortableString -> Absolute path with device "C:" and single segment "foo" |
| </ul> |
| Given the string "C::/foo": |
| <ul> |
| <li>fromOSString and Path(String) (windows) -> Relative path with device "C:" and two segments ":" and "foo" (an invalid path) |
| <li>fromOSString and Path(String) (Linux) -> Relative path with null device and two segments "C::" and "foo" |
| <li>fromPortableString -> Relative path with null device and two segments "C:" and "foo" |
| </ul> |
| <h3>Other migration issues</h3> |
| <p> |
| All clients who store absolute <tt>IPath</tt> objects as platform-neutral |
| strings in a serialized form (as produced by <tt>IPath.toString</tt> in Eclipse 3.0), |
| should switch to the new <tt>fromPortableString/toPortableString</tt> methods |
| rather than the <tt>Path</tt> constructor and the <tt>toString</tt> method. |
| Backward compatibility with files written by Eclipse 3.0 is automatic |
| (no changes to file format or changing file format version numbers required). |
| Examples of files that contain string representations of paths that will |
| need to migrate include the workspace <tt>.project</tt> and <tt>.classpath</tt> files. |
| </p> |
| <h3>Other Observations</h3> |
| <p> |
| Under this proposal <tt>IPath.toPortableString</tt> and <tt>Path.fromPortableString</tt> |
| are perfect inverses of each other. In other words, the expression |
| <pre> |
| path.equals(Path.fromPortableString(path.toPortableString())) |
| </pre> |
| will be true for all paths, and |
| <pre> |
| string.equals(Path.fromPortableString(string).toPortableString()) |
| </pre> |
| will be true for all strings that represent canonical paths (strings with duplicate slashes |
| or "." and ".." references will turn out differently). Furthermore, the Eclipse 3.1 implementation of |
| <tt>Path.fromPortableString</tt> will be the perfect inverse of the Eclipse 3.0 implementation |
| of <tt>IPath.toString</tt>. |
| <p> |
| On Unix, the <tt>toOSString</tt> and <tt>fromOSString</tt> methods will be |
| inverses of each other. On Windows, the same can only be said for paths that |
| do not contain colon or backslash characters within segment names (such paths |
| are invalid on Windows anyway). Consider the following example: |
| <pre> |
| String input = "foo::bar"; |
| IPath pathOne = Path.fromPortableString(input); |
| IPath pathTwo = Path.fromOSString(pathOne.toOSString()); |
| pathOne.equals(pathTwo) -> false! |
| </pre> |
| The input string represents a path with no device, and a single segment whose |
| name is "foo:bar" (invalid on Windows). When this is output using <tt>toOSString</tt>, |
| it is encoded as "foo:bar". The <tt>fromOSString</tt> then interprets this as |
| a path with device "foo:" and first segment "bar". Similar mangling occurs if you |
| create a path with a segment containing the backslash character: |
| <pre> |
| String input = "foo\\bar"; |
| IPath pathOne = Path.fromPortableString(input); |
| IPath pathTwo = Path.fromOSString(pathOne.toOSString()); |
| pathOne.equals(pathTwo) -> false! |
| </pre> |
| In this case, the input is a path with one segment whose name is "foo\bar". This |
| is interpreted by <tt>fromOSString</tt> as a path with two segments "foo" and "bar". |
| In other words, under this proposal you cannot reliably manipulate paths containing |
| backslash or colon using to/fromOSString on Windows. This seems to be an acceptable |
| limitation. |
| </p> |
| <h3>References</h3> |
| <p> |
| <ul> |
| <li><a href="path_migration.html">Draft migration guide entry</a></li> |
| <li><a href="https://bugs.eclipse.org/bugs/show_bug.cgi?id=24152">bug 24152</a></li> |
| </ul> |
| </body> |
| </html> |