The mysterious case of LoadLibrary("fmifs.dll") and tolower()

The mysterious case of LoadLibrary("fmifs.dll") and tolower()

Pete Batard
Hi,

I'm experiencing what I can only qualify as a very weird issue with
MinGW32 (gcc 4.6.2) and MinGW-w64 (gcc 4.6.1/tdm64-1), when running the
simple program below:

-----------------------------------------------------------------------
#include <windows.h>
#include <stdio.h>
#include <ctype.h>

int main(int argc, char** argv)
{
   int c = 0xe9;
   HMODULE h;
       
   printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
   h = LoadLibraryA("fmifs.dll");
   printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
   return 0;
}
-----------------------------------------------------------------------

The output from this program is as follows:

o On Windows 7 x64 with an English/Ireland locale, and when compiled
with either MinGW32 or MinGW-w64:
   tolower(0xe9) = 0xe9
   tolower(0xe9) = 0xa3

o On Windows 7 x86 with an English/US locale, when compiled with MinGW32:
   tolower(0xe9) = 0xe9
   tolower(0xe9) = 0x3f

Obviously, the expectation is for tolower() to produce the same output
regardless of the call to LoadLibrary(), especially as, when testing the
same program with MSVC (Visual Studio 2010) or cygwin, the output
remains 0xe9 both before and after the call.

Now, the very weird part is that this behaviour only seems to manifest
itself when using fmifs.dll. None of the other DLLs I tried seem to have
as dramatic an impact on the behaviour of tolower() as fmifs has.

For the sake of completeness, this issue was first highlighted in a
formatting application (hence the use of fmifs.dll, which is a
formatting DLL), that also extracts ISO images, and converts ISO9660
filenames using tolower(). On an en_US machine, and as per the above,
one can end up with a filename containing a '?' character (0x3F), which
is invalid for a Windows file. I tried to play with forcing the locale
before calling the second tolower(), but it didn't seem to change
anything. I am also aware of at least 3 completely independent machines,
including one with a brand new installation of Windows, that replicate
the problem.

As this is a very intriguing behaviour, any insight on its cause would
be appreciated...

Regards,

/Pete




------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
MinGW-users mailing list
[hidden email]

This list observes the Etiquette found at
http://www.mingw.org/Mailing_Lists.
We ask that you be polite and do the same.  Disregard for the list etiquette may cause your account to be moderated.

_______________________________________________
You may change your MinGW Account Options or unsubscribe at:
https://lists.sourceforge.net/lists/listinfo/mingw-users
Also: mailto:[hidden email]?subject=unsubscribe

Re: The mysterious case of LoadLibrary("fmifs.dll") and tolower()

Xiaofan Chen
On Wed, May 16, 2012 at 7:15 AM, Pete Batard <[hidden email]> wrote:

> Hi,
>
> I'm experiencing what I can only qualify as a very weird issue with
> MinGW32 (gcc 4.6.2) and MinGW-w64 (gcc 4.6.1/tdm64-1), when running the
> simple program below:
>
> -----------------------------------------------------------------------
> #include <windows.h>
> #include <stdio.h>
> #include <ctype.h>
>
> int main(int argc, char** argv)
> {
>   int c = 0xe9;
>   HMODULE h;
>
>   printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
>   h = LoadLibraryA("fmifs.dll");
>   printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
>   return 0;
> }
> -----------------------------------------------------------------------
>
> The output from this program is as follows:
>
> o On Windows 7 x64 with an English/Ireland locale, and when compiled
> with either MinGW32 or MinGW-w64:
>   tolower(0xe9) = 0xe9
>   tolower(0xe9) = 0xa3
>
> o On Windows 7 x86 with an English/US locale, when compiled with MinGW32:
>   tolower(0xe9) = 0xe9
>   tolower(0xe9) = 0x3f
>

On my English Windows XP SP3 with a Chinese locale (Language for
Non-Unicode Programs set to Chinese):

1) Build with MinGW gcc 4.6.2
mcuee@dellxp /c/work/mingw/test
$ ./test1_mingw32.exe
tolower(0xe9) = 0xe9
tolower(0xe9) = 0xe9

2) Build with TDM64 (MinGW-w64 4.6.1), adding "-m32" to build
32bit binary
mcuee@dellxp /c/work/mingw/test
$ ./test1_tdm64_win32.exe
tolower(0xe9) = 0xe9
tolower(0xe9) = 0xe9

--
Xiaofan


Re: The mysterious case of LoadLibrary("fmifs.dll") and tolower()

Pete Batard
On 2012.05.16 00:29, Xiaofan Chen wrote:

> Under my English Windows XP SP3 with Chinese Locale (Language for
> Non-Unicode Program set to Chinese).
>
> 1) Build with MinGW gcc 4.6.2
> mcuee@dellxp /c/work/mingw/test
> $ ./test1_mingw32.exe
> tolower(0xe9) = 0xe9
> tolower(0xe9) = 0xe9
>
> 2) Build with TDM64 (MinGW-w64 4.6.1), adding "-m32" to build
> 32bit binary
> mcuee@dellxp /c/work/mingw/test
> $ ./test1_tdm64_win32.exe
> tolower(0xe9) = 0xe9
> tolower(0xe9) = 0xe9

Interesting.

I just tested a 4th machine with XP installed and an "English (Ireland)"
locale, and still observed the same results as with 7 (second output =
0xa3). And in the virtual XP mode of Windows 7, when running the same
program, I got 0x3f. Thus I would think the problem will also manifest
itself with XP, as long as a western locale is in use.

What I suspect, and what would explain your results, is that the second
tolower() call depends on the system locale/codepage for lowercase
conversion, and AFAIK (correct me if I'm wrong) there's no lowercase
conversion for Chinese characters. That might explain why 0xe9, in
a Chinese codepage, would still be left unchanged.

Regards,

/Pete


Re: The mysterious case of LoadLibrary("fmifs.dll") and tolower()

Greg Chicares
In reply to this post by Pete Batard
On 2012-05-15 23:15Z, Pete Batard wrote:
>
>    int c = 0xe9;
>    HMODULE h;
>
>    printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
>    h = LoadLibraryA("fmifs.dll");
>    printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
[...last quoted line doesn't always print 0xe9 as it should...]
> Now, the very weird part is that, this behaviour only seems to manifest
> itself when using fmifs.dll. None of the other DLLs I tried seem to have
> as dramatic an impact on the behaviour of tolower() as fmifs has.

Might that dll use a different version of the C runtime library,
or even implement its own tolower()? Just wild guesses...


Re: The mysterious case of LoadLibrary("fmifs.dll") and tolower()

Pete Batard
On 2012.05.16 01:13, Greg Chicares wrote:
> Might that dll use a different version of the C runtime library,
> or even implement its own tolower()?

I don't think fmifs.dll implements tolower(), and, from new test results
below, I don't think we switch to a different version of the C runtime
library. But that was a good guess, and it gave me some more insight as
to what might be happening.

As per [1], we know that fmifs's DllMain() is going to be called on
LoadLibrary(), and it seems to do something else that messes up the
locale, which causes tolower() to behave differently.

For instance, if it sets the locale to one that uses a codepage such as
cp850, where 0xE9 is capital U acute ('Ú'), tolower() will convert it to
lowercase U acute ('ú'), and this is in effect what we seem to be
observing on en_IE systems, as 'ú' is 0xA3 in cp850.
On the other hand, it seems that, for en_US, cp437 is being used, with
0xE9 being the capital Greek letter theta ('Θ'). Since no lowercase
theta ('θ') seems to be available in cp437, tolower() appears to fall
back to '?' (which I would say is a bug of tolower(), as the manpage says
it should not translate the character if the conversion is not possible).

I think the switching of the codepage in fmifs.dll is the likely
explanation, especially since, if I change the source to the following:

-----------------------------------------------------------------------
#include <windows.h>
#include <stdio.h>
#include <ctype.h>
#include <locale.h>

int main(int argc, char** argv)
{
   int c = 0xe9;
   HMODULE h = NULL;
       
   printf ("Locale is: %s\n", setlocale(LC_ALL,NULL) );
   printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
   h = LoadLibraryA("fmifs.dll");
   printf ("Locale is: %s\n", setlocale(LC_ALL,NULL) );
   printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
   setlocale(LC_ALL,"C");
   printf ("Locale is: %s\n", setlocale(LC_ALL,NULL) );
   printf("tolower(0x%02x) = 0x%02x\n", c, tolower(c));
   return 0;
}
-----------------------------------------------------------------------

Then, on an en_IE system I get:

Locale is: C
tolower(0xe9) = 0xe9
Locale is: English_Ireland.850
tolower(0xe9) = 0xa3
Locale is: C
tolower(0xe9) = 0xe9

For en_US, I also get the expected English_United States.437 as the locale.

I think my previous attempt to set the locale before calling tolower()
didn't use C, which is probably why it didn't appear to work. For now,
I'll make sure to save the current locale in my app, before calling
LoadLibrary, and restore it afterwards.
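
The save-and-restore workaround described above could look roughly like
the sketch below. It is an illustration only, not the application's code:
call_preserving_locale() and change_locale_as_side_effect() are
hypothetical names, and the second function merely stands in for
LoadLibraryA("fmifs.dll") so that the sketch can be tried on any platform.
It assumes the program and the loaded DLL share one C runtime, and
therefore one locale.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of any library): runs fn(arg), then
 * restores the C runtime locale that was active before the call.
 * Returns 0 on success, -1 if the locale could not be saved or restored
 * (in which case fn may not have been called). */
int call_preserving_locale(void (*fn)(void *), void *arg)
{
    /* setlocale(LC_ALL, NULL) only queries the locale. The returned
     * string may be overwritten by later setlocale() calls, so keep a
     * private copy of it. */
    const char *current = setlocale(LC_ALL, NULL);
    char *saved;
    int restored;

    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    fn(arg);  /* on Windows: a wrapper around LoadLibraryA("fmifs.dll") */

    restored = (setlocale(LC_ALL, saved) != NULL);
    free(saved);
    return restored ? 0 : -1;
}

/* Stand-in for a call that changes the locale behind the caller's back. */
void change_locale_as_side_effect(void *unused)
{
    (void)unused;
    setlocale(LC_ALL, "");  /* switch to the environment's default locale */
}
```

Calling call_preserving_locale(change_locale_as_side_effect, NULL) from a
program that starts in the "C" locale should leave setlocale(LC_ALL, NULL)
reporting "C" afterwards. Because setlocale() changes state shared by the
whole program, this is only safe if no other thread depends on the locale
while the call is in progress.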

Still, the same program compiled with MSVC returns a "C" locale
everywhere, so there's a difference between the MinGW and MSVC behaviour
there. And tolower() doesn't seem to behave as per the expectation when
unable to convert a character.

Regards,

/Pete

[1]
http://msdn.microsoft.com/en-us/library/windows/desktop/ms682583%28v=vs.85%29.aspx


Re: The mysterious case of LoadLibrary("fmifs.dll") and tolower()

Ross Ridge
In reply to this post by Pete Batard
Pete Batard writes:
> On the other hand, it seems that, for en_US, cp437 is being used ...

It's using CP1252, the Microsoft extended version of ISO 8859-1,
and it should be using this character set in the Irish locale as well.
The difference in behaviour is just because the locales specify different
sets of upper and lower case letters.

You may be confused because what you're seeing on screen is in a different
code page.  If you're using a standard console window then it's probably
using CP437, but this doesn't affect the character set that the C runtime
library uses.

>Still, the same program compiled with MSVC returns a "C" locale
>everywhere, so there's a difference between the MinGW and MSVC behaviour
>there.

This is explained by the fact that MinGW and fmifs.dll use the same
C runtime (MSVCRT.DLL) while your version of MSVC uses a different one
(MSVCR100.DLL).  Each has its own idea of what the locale is.

You probably need to rethink completely how you handle character sets
in your application.  You may not be able to rely on the C runtime
library the way you expect to.  You're definitely going to need a better
understanding of how Windows uses character sets, notably the difference
between the "OEM" and "ANSI" character sets.  Windows only supports
the OEM character sets like CP437 in a few places, like console windows
and filenames.  Most of the rest of Windows only supports Unicode and
the ANSI code pages.

                                        Ross Ridge



Re: The mysterious case of LoadLibrary("fmifs.dll") and tolower()

Pete Batard
On 2012.05.17 21:56, Ross Ridge wrote:
> Pete Batard writes:
>> On the other hand, it seems that, for en_US, cp437 is being used ...
>
> It's using CP1252,

setlocale(LC_ALL,NULL) does report 437 ("English_United States.437") for
en_US and 850 for en_IE (English_Ireland.850).

> and it should be using this character set in the Irish locale as well.
> The difference in behaviour is just because the locales specify different
> sets of upper and lower case letters

Yes, I got that.

> You may be confused because what you're seeing on screen is in a different
> code page.  If you're using a standard console window then it's probably
> using CP437, but this doesn't affect the character set that the C runtime
> library uses.

Actually, the sample provided was very simplified from the real-life
scenario. My application is fully Unicode, in that I ensured that any
string where extended characters are expected is UTF-8.

However, some people also wanted the app to support non-compliant ISO9660
images that don't use Joliet for extended characters and instead return
filenames that abuse the ISO9660 specs (apparently with the blessing of
Microsoft) by simply storing extended characters using the default
system codepage. From what I can see in the ISO9660 specs, such a method
of storing extended characters is non-compliant [1], but still this
shouldn't be much of an issue if tolower() behaves as per its
specifications and doesn't overstep its boundaries by returning '?'.

I guess the likely explanation for the '?' character is that the
tolower() implementation first converts the uppercase character from
whatever codepage is used to Unicode, then, if possible, to lowercase
Unicode, and finally tries to match the lowercase Unicode character in
the original codepage. If that's not possible, whoever programmed the
function probably decided to fall back to '?', without realizing that
this would create a problem.
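
The guessed round trip can be modelled with a toy program. The sketch
below is NOT the CRT's implementation, which is not shown anywhere in this
thread. It only encodes the hypothesis above, using hypothetical names
(cp_entry, cp437_excerpt, cp850_excerpt, toy_unicode_lower, model_tolower)
and tables reduced to the characters under discussion.

```c
#include <stddef.h>

/* One entry of a toy code page: a byte and the Unicode code point it encodes. */
struct cp_entry { unsigned char byte; unsigned int ucs; };

/* Excerpts only: just the characters mentioned in this thread. */
const struct cp_entry cp437_excerpt[] = {
    { 0xE9, 0x0398 },  /* GREEK CAPITAL LETTER THETA; CP437 has no small theta */
};
const struct cp_entry cp850_excerpt[] = {
    { 0xE9, 0x00DA },  /* LATIN CAPITAL LETTER U WITH ACUTE */
    { 0xA3, 0x00FA },  /* LATIN SMALL LETTER U WITH ACUTE */
};

/* Toy Unicode lowercase mapping covering only the two capitals above. */
unsigned int toy_unicode_lower(unsigned int ucs)
{
    switch (ucs) {
    case 0x0398: return 0x03B8;  /* small theta */
    case 0x00DA: return 0x00FA;  /* small u with acute */
    default:     return ucs;
    }
}

/* Model of the hypothesis: code page -> Unicode -> lowercase -> code page,
 * with '?' as the fallback when the last step finds no match. */
int model_tolower(const struct cp_entry *cp, size_t count, int c)
{
    size_t i, j;

    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');          /* plain ASCII letters */

    for (i = 0; i < count; i++) {
        if (cp[i].byte == (unsigned char)c) {
            unsigned int lower = toy_unicode_lower(cp[i].ucs);
            for (j = 0; j < count; j++)
                if (cp[j].ucs == lower)
                    return cp[j].byte;   /* lowercase exists in this code page */
            return '?';                  /* hypothesized fallback */
        }
    }
    return c;  /* byte not covered by the excerpt: left unchanged */
}
```

With these tables, model_tolower(cp437_excerpt, 1, 0xE9) yields 0x3f and
model_tolower(cp850_excerpt, 2, 0xE9) yields 0xa3, which matches the
outputs reported for en_US and en_IE. That agreement makes the guess
plausible but does not prove it.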

> This is explained by the fact that MinGW and fmifs.dll use the same
> C runtime (MSVCRT.DLL) while your version of MSVC uses a different one
> (MSVCR100.DLL).  Each has its own idea of what the locale is.

Makes sense.

> You probably need to rethink completely how you handle character sets
> in your application.

I think I'm fine there. I went to great lengths to be Unicode everywhere
in my app, precisely to avoid this kind of headache. My only problem is
in trying to support non-compliant ISO9660 filesystems, with the
expectation that the library I reuse (libcdio) can rely on a tolower()
that never returns invalid filename characters, as per the specs. As I
use the Unicode version of CreateFile, if anything other than '?' is
returned, I'm fine, and I actually don't care much about having the
right codepage set during the tolower() call, even if that means I get
unexpected lowercase characters. But an unexpected '?', which isn't a
lowercase character, is a problem.

>  You're definitely going to need a better
> understanding of how Windows uses character sets, notably the difference
> between the "OEM" and "ANSI" character sets.  Windows only supports
> the OEM character sets like CP437 in a few places, like console windows
> and filenames.  Most of the rest of Windows only supports Unicode and
> the ANSI code pages.

That's good advice, but I'm well aware of that. I have had to deal with
the various headaches of codepage conversions for many years and as a
result, I have become a strong advocate of avoiding anything that is not
Unicode. For the record, I even go as far as using a custom UTF-8 layer
for the Windows API in my app, as I find that the default W one from
Microsoft is too limited [2].

It's only because users of my app have requested non-compliant ISO
support that I have an issue, which I could just as easily have brushed
away as "not supported".

Still, I do not think tolower() should ever return '?', regardless of
the codepage being used. On UNIX, the man page clearly states: "The
value returned is that of the converted letter, or c if the conversion
was not possible", which is pretty explicit.

The MSDN page [3] is a bit more fuzzy with: "(...) converts a copy of c
to lower case if the conversion is possible, and returns the result.
There is no return value reserved to indicate an error". However, the
last statement can be interpreted as meaning that a special character
such as '?' shouldn't be returned if the conversion is not possible, as
otherwise it would contradict the whole statement about not having a
return value reserved to indicate an error.
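
Whatever the documentation intends, a caller can enforce the "or c if the
conversion was not possible" behaviour itself. The wrapper below is a
minimal sketch under that assumption. tolower_or_unchanged() is a
hypothetical name, not a CRT or libcdio function.

```c
#include <ctype.h>

/* Hypothetical wrapper: behaves like tolower(), except that a '?'
 * substituted for a letter the runtime could not convert is replaced
 * by the original character. */
int tolower_or_unchanged(int c)
{
    int lower;

    /* tolower() is only defined for EOF and unsigned char values, so
     * pass anything else through untouched. */
    if (c < 0 || c > 0xFF)
        return c;

    lower = tolower(c);

    /* A genuine '?' maps to itself; any other input that comes back as
     * '?' was not really converted, so keep the input instead. */
    if (lower == '?' && c != '?')
        return c;
    return lower;
}
```

In the "C" locale this returns the same values as tolower(). With a locale
such as "English_United States.437" active, 0xe9 would be expected to come
back as 0xe9 rather than 0x3f, so the resulting filename stays valid even
if the character is not the lowercase one that was wanted.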

This being said, and after further testing, it looks like this is a pure
Microsoft issue with the CRT DLL, which I guess is what MinGW reuses. If
one sets the locale to "English_United States.437" and compiles with
MSVC, '?' is also returned.

Therefore, I'll see what I can do to report this issue to Microsoft and
ensure that they either clarify the tolower() behaviour in the doc, or
fix it in a next version of the CRT.

Regards,

/Pete

[1] http://reboot.pro/16374/page__st__275#entry152757
[2] https://github.com/pbatard/rufus/blob/master/src/msapi_utf8.h
[3] http://msdn.microsoft.com/en-us/library/8h19t214.aspx
