Bug re. Unicode on the console

Joel C. Salomon
C:\>chcp
Active code page: 437

C:\>type unicode-test.txt
Some Unicode symbols: ╬▒╬▓╬│╬┤.
C:\>cat unicode-test.txt
Some Unicode symbols: I±I²I3I'.
C:\>chcp 65001
Active code page: 65001

C:\>type unicode-test.txt
Some Unicode symbols: αβγδ.
C:\>cat unicode-test.txt
cat: write error: Permission denied

C:\>

In case the mailer mangles the diagnostic:  With code page 437 active,
‘type’ and ‘cat’ print different garbage symbols to the console.  With
code page 65001 active, ‘type’ correctly prints the first four letters
of the Greek alphabet, but ‘cat’ complains.
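
For readers who want to see where the code page 437 garbage comes from, here is a minimal sketch of UTF-8 encoding. The function name `utf8_encode` is made up for this illustration and is not part of any program in this report. It shows that α (U+03B1) becomes the two bytes CE B1, which code page 437 displays as the two glyphs ╬ and ▒.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only.
 * Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                   /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0x3B1, buf)` fills `buf` with CE B1. The other three letters differ only in the second byte (B2, B3, B4), which matches the repeated ╬ in the output of ‘type’ above.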

—Joel Salomon

------------------------------------------------------------------------------
Are you an open source citizen? Join us for the Open Source Bridge conference!
Portland, OR, June 17-19. Two days of sessions, one day of unconference: $250.
Need another reason to go? 24-hour hacker lounge. Register today!
http://ad.doubleclick.net/clk;215844324;13503038;v?http://opensourcebridge.org
_______________________________________________
MinGW-users mailing list
[hidden email]

This list observes the Etiquette found at
http://www.mingw.org/Mailing_Lists.
We ask that you be polite and do the same.

Most annoying abuses are:
1) Top posting
2) Thread hijacking
3) HTML/MIME encoded mail
4) Improper quoting
5) Improper trimming
_______________________________________________
You may change your MinGW Account Options or unsubscribe at:
https://lists.sourceforge.net/lists/listinfo/mingw-users

Re: Bug re. Unicode on the console

Tor Lillqvist
I don't intend to be rude, but I think it is fairly naïve to even expect stuff like that to work...

--tml



Re: Bug re. Unicode on the console

Joel C. Salomon
Tor Lillqvist wrote:
> I think it is fairly naïve to even expect stuff like that to work...

Why?  The file is encoded in UTF-8, so ‘type’, given the proper code
page, outputs it correctly.  Obviously (given the different default
outputs between the programs) ‘type’ & ‘cat’ handle the console
differently.  But—
• Why? and
• Why on earth does ‘cat’ fail with the message
  “write error: Permission denied”?

And so I tested things.  I wrote a simple ‘cat’ program (I converted
<http://plan9.bell-labs.com/sources/plan9/sys/src/cmd/cat.c> to ANSI C;
attached) and tried it:

C:\Users\chesky\AppData\Local\Temp\cat>gcc -o ct -Wall ct.c

C:\Users\chesky\AppData\Local\Temp\cat>chcp
Active code page: 437

C:\Users\chesky\AppData\Local\Temp\cat>ct utest.txt
Some Unicode symbols: ╬▒╬▓╬│╬┤.
C:\Users\chesky\AppData\Local\Temp\cat>cat utest.txt
Some Unicode symbols: I±I²I3I'.
C:\Users\chesky\AppData\Local\Temp\cat>type utest.txt
Some Unicode symbols: ╬▒╬▓╬│╬┤.
C:\Users\chesky\AppData\Local\Temp\cat>chcp 65001
Active code page: 65001

C:\Users\chesky\AppData\Local\Temp\cat>ct utest.txt
Some Unicode symbols: αβγδ.ct: write error copying utest.txt: No error

C:\Users\chesky\AppData\Local\Temp\cat>cat utest.txt
cat: write error: Permission denied

C:\Users\chesky\AppData\Local\Temp\cat>type utest.txt
Some Unicode symbols: αβγδ.
C:\Users\chesky\AppData\Local\Temp\cat>

I tried this test with ‘ct’ opening the file in both text and binary
modes.  Now there’s obviously trouble somewhere, since ferror(stdout) is
set after utf-8 output is sent.  (The command “ct ct.c” works without
error.)  But that doesn’t explain the error from ‘cat’, nor why it’s
“naïve” to expect this to work.

—Joel Salomon

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *argv0;

void
cat(FILE *f, char *s)
{
        char buf[8192];
        size_t n;

        do {
                n = fread(buf, 1, sizeof(buf), f);
                if (ferror(f)) {
                        fprintf(stderr, "%s: error reading %s: %s",
                                argv0, s, strerror(errno));
                        exit(EXIT_FAILURE);
                }
                fwrite(buf, 1, n, stdout);
                if (ferror(stdout)) {
                        fprintf(stderr, "%s: write error copying %s: %s",
                                argv0, s, strerror(errno));
                        exit(EXIT_FAILURE);
                }
        } while (!feof(f));
}



int
main(int argc, char *argv[])
{
        int i;
        FILE *f;

        argv0 = argv[0];
        if (argc == 1)
                cat(stdin, "<stdin>");
        else for (i = 1; i < argc; i++) {
                f = fopen(argv[i], "r");
                if (f == NULL) {
                        fprintf(stderr, "%s: can't open %s: %s",
                                argv0, argv[i], strerror(errno));
                        exit(EXIT_FAILURE);
                }
                else {
                        cat(f, argv[i]);
                        fclose(f);
                }
        }
                        cat(f, argv[i]);
                        fclose(f);
                }
        }
        exit(EXIT_SUCCESS);
}
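
A note on the “No error” text in the output above: the C standard does not require fread or fwrite to set errno, so strerror(errno) can be uninformative after a stdio failure. The following is a minimal sketch of a copy loop that relies on the fwrite return count instead. The name `copy_stream` and its return codes are assumptions made for this illustration. It is not a fix for the code page 65001 problem; it only reports the failure without depending on errno.

```c
#include <stdio.h>

/* Hypothetical variant of the copy loop, for illustration only.
 * Copies everything from in to out.
 * Returns 0 on success, -1 on a read error, -2 on a write error. */
int copy_stream(FILE *in, FILE *out)
{
    char buf[8192];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        /* A short count from fwrite means the write failed. */
        if (fwrite(buf, 1, n, out) != n)
            return -2;
    }
    if (ferror(in))
        return -1;
    /* Flush so that errors from buffered data are seen here. */
    if (fflush(out) == EOF)
        return -2;
    return 0;
}
```

A caller could print its own fixed message for each return code rather than passing errno to strerror.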



Re: Bug re. Unicode on the console

Yongwei Wu
2009/6/21 Joel C. Salomon <[hidden email]>

>
> Tor Lillqvist wrote:
> > I think it is fairly naïve to even expect stuff like that to work...
>
> Why?  The file is encoded in UTF-8, so ‘type’, given the proper code
> page, outputs it correctly.  Obviously (given the different default
> outputs between the programs) ‘type’ & ‘cat’ handle the console
> differently.  But—
> • Why? and
> • Why on earth does ‘cat’ fail with the message
>  “write error: Permission denied”?
>
> And so I tested things.  I wrote a simple ‘cat’ program (I converted
> <http://plan9.bell-labs.com/sources/plan9/sys/src/cmd/cat.c> to ANSI C;
> attached) and tried it:
>
> C:\Users\chesky\AppData\Local\Temp\cat>gcc -o ct -Wall ct.c
>
> C:\Users\chesky\AppData\Local\Temp\cat>chcp
> Active code page: 437
>
> C:\Users\chesky\AppData\Local\Temp\cat>ct utest.txt
> Some Unicode symbols: ╬▒╬▓╬│╬┤.
> C:\Users\chesky\AppData\Local\Temp\cat>cat utest.txt
> Some Unicode symbols: I±I²I3I'.
> C:\Users\chesky\AppData\Local\Temp\cat>type utest.txt
> Some Unicode symbols: ╬▒╬▓╬│╬┤.
> C:\Users\chesky\AppData\Local\Temp\cat>chcp 65001
> Active code page: 65001

This seems an interesting case, but it probably cannot work. Microsoft
has indicated clearly that UTF-8 cannot be used as a valid locale,
because its runtime supports only double-byte character sets (DBCS),
that is, no more than two bytes per character. In fact, your code does
not run under MSVC either, so it is not a GCC failure.

And, strangely enough, batch files refuse to run after I issue "chcp
65001", on my Windows XP system.

--
Wu Yongwei
URL: http://wyw.dcweb.cn/


Re: Bug re. Unicode on the console

Yongwei Wu
2009/6/22 Yongwei Wu <[hidden email]>

> This seems an interesting case, but it probably cannot work. Microsoft
> has indicated clearly that UTF-8 cannot be used as a valid locale,
> because its runtime supports only double-byte character sets (DBCS),
> that is, no more than two bytes per character. In fact, your code does
> not run under MSVC either, so it is not a GCC failure.
>
> And, strangely enough, batch files refuse to run after I issue "chcp
> 65001", on my Windows XP system.

I also tried changing your cat function to the following effect:

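void cat(FILE *f, char *s)
{
    int ch;

    (void)s;    /* the file name is not used in this variant */

    /* Copy one byte at a time so each failing byte can be reported. */
    for (;;) {
        ch = getc(f);
        if (ch == EOF)
            break;
        if (putchar(ch) == EOF) {
            fflush(stdout);
            fprintf(stderr, "[Error on %x: %s]",
                    (unsigned)ch, strerror(errno));
        }
    }
}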

Operating on a UTF-8 text file with the following two lines:

Hello World
Ægean

Both the GCC- and MSVC-generated executables give this result (CHCP 65001):

Hello World
[Error on c3: Permission denied][Error on 86: Permission denied]gean

It works if the code page is something else (such as 437 or 936). So it
seems the Microsoft runtime thinks non-ASCII bytes are not allowed when
the console code page is 65001.
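
The two bytes reported, c3 and 86, are the two halves of the UTF-8 encoding of Æ (U+00C6). This is consistent with the runtime rejecting each byte of a multi-byte sequence when it is written on its own, though it does not prove it. Below is a minimal sketch that classifies and decodes those bytes. The names `utf8_seq_len` and `utf8_decode2` are made up for this illustration and are not part of the test program.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Length of the UTF-8 sequence introduced by byte b, judged only by
 * its leading bits: 1 to 4 for a lead byte, 0 for a continuation
 * byte or a byte that cannot start a sequence. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;   /* 0xxxxxxx: ASCII          */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx                 */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx                 */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx                 */
    return 0;                           /* 10xxxxxx or invalid      */
}

/* Decode a two-byte sequence. Returns the code point, or UINT32_MAX
 * if the bytes do not form a well-formed two-byte sequence. */
uint32_t utf8_decode2(unsigned char b0, unsigned char b1)
{
    uint32_t cp;

    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return UINT32_MAX;
    cp = ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
    /* Values below 0x80 would be an overlong encoding. */
    return cp < 0x80 ? UINT32_MAX : cp;
}
```

Here `utf8_seq_len(0xC3)` is 2, `utf8_seq_len(0x86)` is 0 because 86 is a continuation byte, and `utf8_decode2(0xC3, 0x86)` yields 0xC6, the code point of Æ. The ASCII bytes of “gean” each form a complete one-byte sequence, which fits the fact that they were printed without error.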

Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/
