Unicode file names

Unicode file names

Marian Ciobanu
Hi,


I've read the posts at http://thread.gmane.org/gmane.comp.gnu.mingw.user/23536 and http://thread.gmane.org/gmane.comp.gnu.mingw.user/14517 but I hope I'm missing something.

It seems to me that the only way to access files whose names can't be represented in the 8-bit system locale is to use native Windows functions or some external library. Is that so? (By access I mean fopen(), readdir(), stat(), ifstream, ...)

If that's the case, I wonder what led to the decision to make things this way, since this seems to favor porting from MSVC and other Windows compilers to GCC, or creating DLLs that are compatible with those compilers, while making it a pain to write stand-alone applications that work on both Windows and Linux/UNIX. Was this the intended goal? (Or perhaps there's a higher goal that I fail to see.)

(I'm using MinGW 3.4.2 / 3.4.5 and switching is currently not an option, but did anything change in 4.4.0?)

Thanks

------------------------------------------------------------------------------
_______________________________________________
MinGW-users mailing list
[hidden email]

This list observes the Etiquette found at
http://www.mingw.org/Mailing_Lists.
We ask that you be polite and do the same.

Most annoying abuses are:
1) Top posting
2) Thread hijacking
3) HTML/MIME encoded mail
4) Improper quoting
5) Improper trimming
_______________________________________________
You may change your MinGW Account Options or unsubscribe at:
https://lists.sourceforge.net/lists/listinfo/mingw-users

Re: Unicode file names

Tor Lillqvist
> It seems to me that the only way to access files whose names can't be represented in the 8-bit system locale is to use native Windows functions or some external library. Is that so? (By access I mean fopen(), readdir(), stat(), ifstream, ...)

The C library used by MinGW, i.e. the Microsoft one, has wide
character versions of all APIs that handle file names. These take file
name parameters as wchar_t strings instead of as char strings.

(In Windows C compilers, including MinGW, the wchar_t type is 16 bits,
standing for one UTF-16 "unit", and wchar_t strings are UTF-16 (LE).
Note that thus some (rare) Unicode characters, those outside the BMP,
take two wchar_t "units", so-called surrogate pairs.)
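
A minimal sketch of the surrogate-pair arithmetic mentioned above, assuming a C++11 compiler. The function name encode_utf16 is made up for illustration and is not part of any library; it also assumes the input is a valid code point (at most U+10FFFF and not itself a surrogate).

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Code points inside the BMP take one unit; those above U+FFFF take two:
// a high surrogate (0xD800..0xDBFF) followed by a low one (0xDC00..0xDFFF).
// Hypothetical helper; assumes cp is a valid scalar value.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));    // top 10 bits
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));  // low 10 bits
    }
    return units;
}
```

For example, U+1D11E (a character outside the BMP) comes out as the two units 0xD834 0xDD1E, while U+00E9 stays a single unit.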

These functions have the same names as their normal "narrow char"
counterparts, but prefixed with _w. For instance _wfopen(),
_wreaddir(), _wstat(). As for C++ stuff like the ifstream you mention
I have no personal experience, but a quick glance does show me that
there is something called wifstream (etc, check the <iosfwd> header).
I am not interested enough in C++ to bother finding out whether the
"wideness" of these classes relates to just the data being written to /
read from them or also the names of files, though. (It does seem so,
unfortunately, so in that case you probably need to use a mix of C and
C++ to use wide character file names in C++.)

> If that's the case, I wonder what led to the decision to make things this way, since this seems to favor porting from MSVC and other Windows compilers to GCC, or creating DLLs that are compatible with those compilers, while making it a pain to write stand-alone applications that work on both Windows and Linux/UNIX. Was this the intended goal?

The intent of MinGW is not to make Windows look like Unix. (There are
separate projects/products for that, like Cygwin.) (But then, Cygwin
does not really handle the issue of file names not representable in
the system codepage either, as far as I know.)

The way file names are handled is a fundamental difference between
Windows and Unix. On Windows, file names are UTF-16. On Unix, file
names are arbitrary sequences of bytes ("char" in C). So if you want
to be able to handle arbitrary files, you need to take that into
consideration. (Or let somebody else try to handle it, i.e. use a
programming language like Java or C#, or a library like GLib.)

--tml


Re: Unicode file names

Marian Ciobanu
> These functions have the same names as their normal "narrow char"
> counterparts, but prefixed with _w. For instance _wfopen(),
> _wreaddir(), _wstat(). As for C++ stuff like the ifstream you mention
> I have no personal experience, but a quick glance does show me that
> there is something called wifstream (etc, check the <iosfwd> header).
> I am not interested enough in C++ to bother finding out whether the
> "wideness" of these classes relates to just the data being written to /
> read from them or also the names of files, though. (It does seem so,
> unfortunately, so in that case you probably need to use a mix of C and
> C++ to use wide character file names in C++.)

Thank you for these details, though I already knew most of them; I was just hoping that I missed something, because what happens doesn't make sense to me. As far as I know, Microsoft has been trying to move coders from the 8-bit local codepage to Unicode since at least Windows 2000 (by using CreateFileW() or _wfopen() rather than CreateFileA() or fopen()), but it has to provide the "A" functions for "legacy" applications. (Well, I'm aware that many new programs just use the extended ASCII charset and don't care about Unicode, but they weren't supposed to.)

OTOH MinGW didn't have the "legacy" issue, so it could have gone UTF8 from the beginning. To me it makes a lot more sense to do it this way, because it both removes inconsistent behavior and makes it easier to write programs that work in various countries, on Linux and on Windows. Since the conversion between UTF8 and UTF16 is trivial, it seems not to be a big deal to call the 16bit functions instead of the 8bit ones.

So my idea of how fopen() should work in MinGW is that it should allocate a temporary buffer in which to run a UTF8-to-UTF16 conversion and then it should call Microsoft's _wfopen(), or just pass a "ccs=UTF-8" in newer versions of MS's fopen(). This would make writing programs that are portable between Windows and Linux significantly easier.
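
A rough sketch of the wrapper idea described above, under several assumptions: a C++11 compiler (newer than the GCC 3.4.x mentioned in this thread), and the made-up names utf8_to_utf16 and fopen_utf8, which are not MinGW or Microsoft functions. The decoder is deliberately minimal: malformed input becomes U+FFFD, and overlong forms or encoded surrogates are not rejected. Only the non-Windows branch has been reasoned through in detail; the Windows branch simply hands the converted strings to _wfopen().

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <wchar.h>   // _wfopen
#endif

// Decode UTF-8 bytes into UTF-16 code units (hypothetical helper).
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        int extra;  // number of continuation bytes expected after the lead byte
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }  // invalid lead byte
        ++i;
        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (cp > 0x10FFFF) cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// fopen() lookalike that always takes UTF-8 names (hypothetical wrapper).
std::FILE* fopen_utf8(const char* name, const char* mode)
{
#ifdef _WIN32
    const std::u16string n = utf8_to_utf16(name);
    const std::u16string m = utf8_to_utf16(mode);
    // wchar_t is 16 bits on Windows, so the code units copy over one-to-one.
    const std::wstring wn(n.begin(), n.end());
    const std::wstring wm(m.begin(), m.end());
    return _wfopen(wn.c_str(), wm.c_str());
#else
    return std::fopen(name, mode);  // the bytes are passed through unchanged
#endif
}
```

Calling code would then use fopen_utf8(name, "rb") everywhere, with the single #ifdef confined to the wrapper.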

I realize that there must be some issues with this approach, but all I could think of was some performance penalty (which should be insignificant in most cases), the possibility of introducing bugs (as the code would be more complicated), passing UTF-8 names to non-MinGW DLLs that expect local codepage names, and the human resource requirement that somebody should actually do it. However, none of these issues seems big enough to me to justify not supporting Unicode names. OTOH I obviously don't know enough about MinGW's history, and quite likely there are issues with my approach that I don't see yet.


> The way file names are handled is a fundamental difference between
> Windows and Unix. On Windows, file names are UTF-16. On Unix, file
> names are arbitrary sequences of bytes ("char" in C). So if you want

Yes, that's true in theory, but there are some practical considerations: as far as I know Linux uses UTF8 by default on its partitions. Sure, in your particular program you can use Latin-1 or something else to create and read files and it will work just fine except that some characters will not look OK when running a "ls". However, programs don't seem to force their encoding, so for practical purposes the names are UTF8.

This brings me back to this question: I have a program that processes files on Linux and I want to make it work on Windows; so what do I do? I'm not willing to pepper my code with #ifdefs, and I HAVE to use a C++ class to handle the files anyway. The answer is that I'm going to get rid of ofstream, readdir() and the rest, and use QFile and QDir from Qt instead, since this is a Qt program anyway. Also, I should keep in mind that ofstream cannot be used in programs that hope to achieve some portability. That's a bit sad given that ofstream is supposed to be THE portable class to use in C++ to write to files. Note that in Linux one can use ofstream even if the file names are not UTF8, but with MinGW's ofstream it's just not possible to create files whose names have some foreign characters, and I guess this is my main issue. (The "w" in wifstream is about text files that contain 16bit, wchar_t characters; to open one of them you still pass a char* file name.) At least MS's ofstream has a wchar_t* constructor, so you can take care of portability by calling a macro that converts UTF8 to UTF16 on Windows and does nothing on Linux.
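
The macro approach mentioned in the last sentence might look roughly like the sketch below. FS_NAME, utf8_to_wide and write_line are names invented here; the Windows branch assumes Microsoft's compiler and its wchar_t* ofstream constructor, and uses the Win32 MultiByteToWideChar() call for the conversion. That branch is a sketch only and has not been tried against a real build.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

#ifdef _MSC_VER
#include <windows.h>

// Convert a NUL-terminated UTF-8 string to UTF-16 (hypothetical helper).
inline std::wstring utf8_to_wide(const char* s)
{
    // First call asks for the required length, terminator included.
    const int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s, -1, &w[0], n);
    w.resize(static_cast<std::size_t>(n) - 1);  // drop the terminator the API wrote
    return w;
}
// The temporary wstring lives only until the end of the full expression,
// so use the macro directly as a constructor argument; never store the pointer.
#define FS_NAME(s) (utf8_to_wide(s).c_str())
#else
// Elsewhere the UTF-8 bytes are handed to the stream unchanged.
#define FS_NAME(s) (s)
#endif

// Write one line to a file whose name is given in UTF-8 (illustrative only).
bool write_line(const char* utf8_name, const std::string& line)
{
    std::ofstream out(FS_NAME(utf8_name));
    if (!out) return false;
    out << line << '\n';
    return out.good();
}
```

This does not help with MinGW's own ofstream, which, as noted above, only accepts char* names.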


Now I don't want to seem too ungrateful. I appreciate the effort that was put in to get MinGW where it is today, and I'm thankful for that, but, from my limited point of view, some things could have been done better.

Thanks


Re: Unicode file names

Tor Lillqvist
> OTOH MinGW didn't have the "legacy" issue, so it could have gone UTF8 from the beginning.

I don't see what you mean here, as MinGW by definition uses the
Microsoft C library, which is not UTF-8 oriented. MinGW very much had a
legacy issue in that it wanted, to the extent possible, to present
the same API as Microsoft's compiler. (As far as I know.)

Sure, one could write a new C library for Windows that would present a
"saner" interface to the system, one point being using UTF-8 for all
strings passed to and returned from the C library. Not a light
undertaking, though. When MinGW was conceived, UTF-8 was not yet as
ubiquitous and obvious as it is today.

> So my idea of how fopen() should work in MinGW is that it should allocate a temporary buffer in which to run a UTF8-to-UTF16 conversion and then it should call Microsoft's _wfopen(), or just pass a "ccs=UTF-8" in newer versions of MS's fopen(). This would make writing programs that are portable between Windows and Linux significantly easier.

What you describe would be fopen() in a new C library then, not the
fopen() in the Microsoft C library that MinGW-built code uses.

> I realize that there should be some issues with this approach, but all I could think of was some performance penalty

The major issue would be incompatibility with other compilers for
Windows (including the normal MinGW). Any non-trivial software package
would need to be separately ported to such a compiler, introducing
ifdefs.

It would be questionable, IMHO, whether a compiler targeting such a C
library should even predefine the _WIN32 macro, as that is generally
used around code that assumes the C library is the Microsoft one. So
in practice such a compiler would mean a wholly new target to port to.

Still, that doesn't mean it wouldn't be an interesting endeavour. And
with time, most users of MinGW might even be convinced to switch to
this C library and compiler. If you decide to start such a project, I
might be interested in taking part. (Not in any C++ stuff, though;)

> as far as I know Linux uses UTF8 by default on its partitions

As you know, Linux file systems / partitions (or UNIX ones in general)
don't "use" any character set and encoding. File names (directory
entries) consist of arbitrary byte sequences (any bytes except slash
and nul). It's entirely up to user-level code, desktop environments
etc, to interpret the byte sequences as being some specific character
set and encoding (like, for instance, Unicode in UTF-8).

At any non-US site with old file volumes still around, unless the
BOFHs have been very aggressive and consistent throughout history, it
is fairly certain there exists a significant number of files with
names that are non-ASCII but not UTF-8 (but for instance ISO8859-1,
which was much more popular than UTF-8 in a large part of Western
Europe for a quite long time).
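
Since the interpretation is left to user-level code, a program that wants to treat names as UTF-8 has to check for itself. A minimal sketch of such a check follows; the function name is_valid_utf8 is made up here, and a real program would more likely rely on an existing library.

```cpp
#include <cstddef>
#include <string>

// Return true if the bytes of a file name form well-formed UTF-8.
// Rejects stray or missing continuation bytes, overlong encodings,
// encoded surrogates and values above U+10FFFF. Hypothetical helper.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;       // continuation bytes expected
        unsigned long cp, min;   // decoded value and smallest legal value
        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;                           // not a valid lead byte
        if (s.size() - i - 1 < extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += extra + 1;
    }
    return true;
}
```

A name stored in ISO8859-1, such as the bytes "caf\xE9", fails this check, while the UTF-8 form "caf\xC3\xA9" passes; pure ASCII names pass either way.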

--tml
