FW: Re: Unicode file names

FW: Re: Unicode file names

Marian Ciobanu



>> OTOH MinGW didn't have the "legacy" issue, so it could have gone UTF8
>> from the beginning.
>
> I don't see what you mean here as MinGW by definition uses the
> Microsoft C library which is not UTF-8 oriented. MinGW very much had a
> legacy issue in that it wanted, to the extent possible, to present
> the same API as Microsoft's compiler. (As far as I know.)

Oh, I wasn't aware of that. I thought it was supposed to be a simpler, faster, not-POSIX-compatible alternative to Cygwin.


> The major issue would be incompatibility with other compilers for
> Windows (including the normal MinGW). Any non-trivial software package
> would need to be separately ported to such a compiler, introducing
> ifdefs.

Well, yes, ..., but software that uses the 8-bit calls is broken anyway in the sense that it stops working the moment the codepage changes. So to actually make it work predictably, you'd have to either switch to 16-bit or use such a UTF-8 library / compiler.


> Still, that doesn't mean it wouldn't be an interesting endeavour. And
> with time, most users of MinGW might even be convinced to switch to
> this C library and compiler. If you decide to start such a project, I
> might be interested in taking part. (Not in any C++ stuff, though;)

For me to start such a project is quite unlikely, both because there are already many things that I want to do (and don't have the time for) and because I'm not sure about its usefulness. Had I used C, I would have just implemented my own "fopenutf8()" and been done with it; no need to make such a fuss about this. Then, if I thought others might be interested, I could have published a library containing this "fopenutf8()" and other related functions. They seem quite easy to write, but I don't care about them, because I'm actually a C++ user and they don't do me any good. For C++ I could probably derive from ifstream, but I don't like the idea (and it might not work anyway).

(There are issues with "fopenutf8()" as well, like mixing code that uses it with code that doesn't, but I believe this approach to be adequate in some cases.)
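For readers wondering what such a function might look like, here is a minimal sketch. The name fopen_utf8 and the helper utf8_to_utf16 are hypothetical and not part of any C library. On Windows the sketch assumes the Win32 call MultiByteToWideChar and the Microsoft CRT function _wfopen are available (as they are with MinGW); elsewhere it simply forwards to fopen, on the assumption that the file system accepts UTF-8 bytes directly.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <wchar.h>    // _wfopen (Microsoft CRT extension)
#include <windows.h>  // MultiByteToWideChar

// Hypothetical helper: convert a NUL-terminated UTF-8 string to UTF-16.
// Returns an empty string if the conversion fails.
static std::wstring utf8_to_utf16(const char* s)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
    if (len <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s, -1, &w[0], len);
    w.resize(static_cast<std::size_t>(len) - 1);  // drop the NUL that len includes
    return w;
}
#endif

// Hypothetical fopen() look-alike that always interprets the path as UTF-8.
std::FILE* fopen_utf8(const char* utf8_path, const char* mode)
{
    assert(utf8_path != NULL && mode != NULL);
#ifdef _WIN32
    const std::wstring wpath = utf8_to_utf16(utf8_path);
    const std::wstring wmode = utf8_to_utf16(mode);
    if (wpath.empty() || wmode.empty()) return NULL;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // Assumption: the file system takes UTF-8 bytes as they are.
    return std::fopen(utf8_path, mode);
#endif
}
```

The returned FILE* is an ordinary stream, so the rest of the program keeps using fread(), fprintf() and fclose() as before. This does not address the mixing problem mentioned above: any path that reaches plain fopen() is still interpreted in the local codepage.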

------------------------------------------------------------------------------
Enter the BlackBerry Developer Challenge  
This is your chance to win up to $100,000 in prizes! For a limited time,
vendors submitting new applications to BlackBerry App World(TM) will have
the opportunity to enter the BlackBerry Developer Challenge. See full prize  
details at: http://p.sf.net/sfu/Challenge
_______________________________________________
MinGW-users mailing list
[hidden email]

This list observes the Etiquette found at
http://www.mingw.org/Mailing_Lists.
We ask that you be polite and do the same.

Most annoying abuses are:
1) Top posting
2) Thread hijacking
3) HTML/MIME encoded mail
4) Improper quoting
5) Improper trimming
_______________________________________________
You may change your MinGW Account Options or unsubscribe at:
https://lists.sourceforge.net/lists/listinfo/mingw-users

Re: FW: Re: Unicode file names

Tor Lillqvist
> Well, yes, ..., but software that uses the 8-bit calls is broken anyway in the sense that it stops working the moment the codepage changes.

I don't know what you mean with "the moment the codepage changes". You
make the situation sound worse than it is. The system codepage of a
machine does not change without overwriting the Windows installation
with a different language edition of Windows, as far as I know.

Code that is written to use the "normal" C library (plain "char") APIs
(and the A-suffixed versions of Win32 APIs, which is what the unsuffixed
names resolve to when UNICODE is not defined), for instance code ported
straight from Unix, does work fine in most cases on various language
editions of Windows with different system codepages, and is able to
handle non-ASCII file names in the system codepages in question.

You don't need to write such code to work just in one particular
system codepage. (In fact, it would be hard to intentionally do it.)

"Narrow char" code will usually, to the best of my knowledge, work
fine on a Western Windows installation, a Greek one, an Arabic one, or
a Hebrew one etc without recompilation and will handle files with
names in those codepages (which all do include plain ASCII in the
7-bit half).

(Then, on systems with East Asian double-byte system codepages, such
"plain" code will also work mostly fine, except that doing things like
strchr(filename, '\\') to find directory separators will break as some
double-byte characters have '\\' as the second byte. To properly handle
strings that may be encoded in a double-byte system codepage, one
should use the multi-byte string functions like _mbschr().)
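The trail-byte problem can be illustrated without a Windows machine. The sketch below is an assumption-laden stand-in, not a replacement for _mbschr(): it hard-codes the lead-byte ranges of code page 932 (Shift-JIS) only, and both function names are hypothetical. Real code would use the multi-byte string functions, which follow the actual codepage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Simplified lead-byte test, valid for code page 932 (Shift-JIS) only.
// This hard-coded table is an assumption made for the sketch.
static bool is_cp932_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Hypothetical helper: like strrchr(path, '\\'), but skips any 0x5C byte
// that is the second half of a double-byte character.
const char* last_backslash_cp932(const char* path)
{
    assert(path != NULL);
    const char* last = NULL;
    for (const char* p = path; *p != '\0'; ++p) {
        if (is_cp932_lead_byte(static_cast<unsigned char>(*p)) && p[1] != '\0') {
            ++p;  // skip the trail byte, even when it is 0x5C
        } else if (*p == '\\') {
            last = p;
        }
    }
    return last;
}
```

For the byte string "C:\\dir\\\x83\x5C.mp3" (in code page 932 the pair 0x83 0x5C is a single katakana character), strrchr() reports the 0x5C at offset 8, in the middle of the file name, while last_backslash_cp932() reports the real separator at offset 6.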

It is just the case where a system has files with names containing
characters not in the system codepage that absolutely *requires* using
Unicode APIs, wide character strings and wide character APIs, to
handle such files.

As such, I have no idea how common or rare such situations are, but
they might be quite common in some parts of the world, or in
institutions that regularly handle files from different parts of the
world. In my personal opinion, it is important to be prepared for such
situations. That is why I tend to bring up the issue of being Unicode
aware.

--tml


Re: FW: Re: Unicode file names

Роман Донченко
Tor Lillqvist <[hidden email]> wrote in his message of Thu,  
09 Jul 2009 14:21:55 +0400:

>> Well, yes, ..., but software that uses the 8-bit calls is broken anyway  
>> in the sense that it stops working the moment the codepage changes.
>
> I don't know what you mean with "the moment the codepage changes". You
> make the situation sound worse than it is. The system codepage of a
> machine does not change without overwriting the Windows installation
> with a different language edition of Windows, as far as I know.

I think the focused dropdown in this picture:  
<http://www.i18nwithvb.com/images/RLO_Tab3.jpg> does it. I'm too afraid to  
try it, though. 8=]

> It is just the case where a system has files with names containing
> characters not in the system codepage that absolutely *requires* using
> Unicode APIs, wide character strings and wide character APIs, to
> handle such files.
>
> As such, I have no idea how common or rare such situations are, but
> they might be quite common in some parts of the world, or in
> institutions that regularly handle files from different parts of the
> world.

I have a few music files with ISO 8859-1 characters. The ability to handle  
them is usually a sign of good software. 8=]

> In my personal opinion, it is important to be prepared for such
> situations. That is why I tend to bring up the issue of being Unicode
> aware.

Being a long time code page sufferer, I thank you.

By the way, one of the reasons my "From" field is in Cyrillic is so that I  
can see who has sucky mailers. Their replies are usually addressed to  
"????? ????????". ;=]

Roman.



Re: FW: Re: Unicode file names

Marian Ciobanu
In reply to this post by Tor Lillqvist
>> Well, yes, ..., but software that uses the 8-bit calls is broken anyway
>> in the sense that it stops working the moment the codepage changes.
>
> I don't know what you mean with "the moment the codepage changes". You
> make the situation sound worse than it is. The system codepage of a
> machine does not change without overwriting the Windows installation
> with a different language edition of Windows, as far as I know.

I thought you could change it from Control Panel or at least from RegEdit. Anyway, that's not what I meant; even if you can do it, it's not a normal use scenario. I was thinking more in terms of transferring data between computers with different codepages, e.g. unpacking an archive that your "different codepage" friend sent you.

As for making the situation sound worse than it is, you're right. Most applications work fine with 8-bit calls and the local codepage. My personal case is an MP3 diagnosis + correction tool (called MP3 Diags), which I wrote on Linux and then ported to Windows. It seems to work fine as long as you stay with ASCII, but it doesn't see anything above that, because of a mix of UTF-8 and local codepage calls. I can improve it to always use the local codepage, but that doesn't really fix it, because on my computer I have files whose names don't fit in the local codepage (and if I change the codepage, then others won't fit).

If you go beyond Western Europe and North America, you see a sudden increase in the chance of people having MP3 files whose names fall outside the local codepage. Now besides being invisible (for now) to my program, they can't be copied, opened, compared, played ... by tools that only comprehend the local codepage.

I think MP3 files are more likely than others to have "wrong" names because while the users have been taught to stick to ASCII when creating files if they want to keep out of trouble, CD rippers can use whatever characters they please.

So this sort of brings me to my main point: we're stuck in a situation where users would like to use more characters in the file names, but they don't because many tools can't deal with such names; the tools, on the other hand, have little incentive to change because the users learnt their lesson and just use ASCII, and perhaps the local codepage. So my suggestion is that the tools should change. Then why don't "I" start making the change? Well, I'm already spending a lot of time on the freely available MP3 tool, and there are other things that I need / want to do. Also, it would take me a lot more time to make, say, MinGW UTF8-aware, than it would take somebody who is familiar with it.


Re: FW: Re: Unicode file names

Tor Lillqvist
> So this sort of brings me to my main point: we're stuck in a situation where users would like to use more characters in the file names, but they don't because many tools can't deal with such names;

Thus it is a good thing that modern programming environments like Java
and C# use Unicode for file names (all strings in fact). Just write
your code in Java or C# and there is no problem with arbitrary file
names on Windows.

Also, for instance the GTK+ stack API uses Unicode (UTF-8) for file
names on Windows. I don't know what other comparable toolkits like Qt
do. Hopefully they do the same. So if you for some reason don't want
to use Java or C#, use C (or C++) but with a library / toolkit that
provides a UTF-8 view of the file system.

Writing application code in plain C, in this century, without any
higher-level toolkit than the C library or basic Win32 APIs, sounds a
bit odd to me.

--tml


Re: FW: Re: Unicode file names

Mark-32
Tor Lillqvist wrote:
> Writing application code in plain C, in this century, without any
> higher-level toolkit than the C library or basic Win32 APIs, sounds a
> bit odd to me.
>
> --tml
I can't let that pass really :-)

There may be good reasons. For instance, my current project is a Windows
port of a highly portable browser. Although its browser function needs
a few libs, it makes sense for its GUI to be as unbloated as possible,
and [hopefully, when I get it right :-) ] to work compatibly in as many
different flavours of Windows as possible; hence w32api / GDI.

Best

Mark

http://www.halloit.com

Key ID 046B65CF



Re: FW: Re: Unicode file names

Marian Ciobanu
In reply to this post by Marian Ciobanu
> So this sort of brings me to my main point: we're stuck in a situation
> where users would like to use more characters in the file names, but they
> don't because many tools can't deal with such names; the tools, on the
> other hand, have little incentive to change because the users learnt
> their lesson and just use ASCII, and perhaps the local codepage. So my
> suggestion is that the tools should change. Then why don't "I" start
> making the change? Well, I'm already spending a lot of time on the freely
> available MP3 tool, and there are other things that I need / want to do.
> Also, it would take me a lot more time to make, say, MinGW UTF8-aware,
> than it would take somebody who is familiar with it.


In case somebody stumbles upon this thread, looking for a solution to the same problem, here's what I did: I created drop-in replacement classes for fstream / ifstream / ofstream, which take Unicode names (UTF-8 or UTF-16) on their constructors and on their open() methods.

The code is available for download from the MP3 Diags project. You need the files fstream_unicode.h and fstream_unicode.cpp.

You can also download or take a look at the files at http://mp3diags.svn.sourceforge.net/viewvc/mp3diags/src/fstream_unicode.h?view=markup and http://mp3diags.svn.sourceforge.net/viewvc/mp3diags/src/fstream_unicode.cpp?view=markup

They haven't been heavily tested, but they seem to work fine for me.
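To give an idea of the general shape of such a class, here is a much smaller sketch that is not taken from fstream_unicode.h and may well differ from it. It covers reading only, accepts only UTF-8 names, and has no open() method. The class name ifstream_utf8 and the helper utf8_to_utf16 are hypothetical. The Windows branch assumes MinGW's libstdc++ extension __gnu_cxx::stdio_filebuf together with the CRT function _wopen; the other branch assumes the file system takes UTF-8 bytes directly.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <memory>
#include <string>
#ifdef _WIN32
#include <ext/stdio_filebuf.h>  // GNU libstdc++ extension
#include <fcntl.h>              // _O_RDONLY, _O_BINARY
#include <io.h>                 // _wopen
#include <windows.h>            // MultiByteToWideChar

// Hypothetical helper: UTF-8 to UTF-16, empty string on failure.
static std::wstring utf8_to_utf16(const char* s)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
    if (len <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s, -1, &w[0], len);
    w.resize(static_cast<std::size_t>(len) - 1);  // drop the NUL that len includes
    return w;
}
#endif

// Hypothetical read-only stream whose constructor takes a UTF-8 file name.
class ifstream_utf8 : public std::istream {
public:
    explicit ifstream_utf8(const char* utf8_name)
        : std::istream(nullptr)  // no buffer yet, so the stream starts out failed
    {
        assert(utf8_name != nullptr);
#ifdef _WIN32
        const std::wstring wname = utf8_to_utf16(utf8_name);
        const int fd = wname.empty()
            ? -1
            : _wopen(wname.c_str(), _O_RDONLY | _O_BINARY);
        if (fd != -1) {
            // The buffer takes over the descriptor and closes it when destroyed.
            buf_.reset(new __gnu_cxx::stdio_filebuf<char>(
                fd, std::ios_base::in | std::ios_base::binary));
        }
#else
        // Assumption: the file system takes UTF-8 bytes as they are.
        buf_.reset(new std::filebuf());
        if (!buf_->open(utf8_name, std::ios_base::in | std::ios_base::binary)) {
            buf_.reset();
        }
#endif
        if (buf_) rdbuf(buf_.get());  // attaching a buffer also clears the error state
    }

private:
    std::unique_ptr<std::filebuf> buf_;
};
```

Because the class derives from std::istream, existing code that takes a std::istream& (operator>>, std::getline, read) works unchanged; only the place where the file is opened has to switch types.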
