Re: filesysbox ntfs ubs massStorage problem

Posted: Wed Mar 12, 2014 10:44 am
by colinw
salass00 wrote: Personally I would rather just use UTF-8 and get rid of all this character set conversion garbage in filesysbox but it just isn't going to happen.
On the contrary, that's exactly what needs to happen. Anything else is a kludge that will also cripple the filesystem's efficiency,
so you might as well take the first step towards UTF-8 compatibility.

Re: filesysbox ntfs ubs massStorage problem

Posted: Wed Mar 12, 2014 3:28 pm
by salass00
colinw wrote: On the contrary, that's exactly what needs to happen. Anything else is a kludge that will also cripple the filesystem's efficiency,
so you might as well take the first step towards UTF-8 compatibility.
Well, FWIW, I will change filesysbox to use UTF-8 strings exclusively, but I don't expect anyone to be spurred to implement UTF-8 support in the rest of the OS because of this.

Re: filesysbox ntfs ubs massStorage problem

Posted: Wed Mar 12, 2014 10:07 pm
by colinw
salass00 wrote: Well, FWIW, I will change filesysbox to use UTF-8 strings exclusively, but I don't expect anyone to be spurred to implement
UTF-8 support in the rest of the OS because of this.
Actually, that's exactly what I expect will spur further development. When the "wobbly characters" from
other OSes' data on USB sticks start to piss off enough people, action will be taken.
You know as well as I do that nothing gets fixed unless it's obviously problematic.

I have been going over ENV-handler, RAM-handler, APPDIR-handler and "other stuff" for UTF-8 compatibility for a while now,
and have fixed anything that could be problematic, even if it is largely untested at this time.

We have been using ISO#### and ASCII encoding for a very long time already, and what we need now
is to NOT hardcode a wall in front of UTF-8 compatibility, so we can have a smooth transition.

The writing is on the wall, and it's written in UTF-8 encoding.

Re: filesysbox ntfs ubs massStorage problem

Posted: Thu Mar 13, 2014 10:18 am
by salass00
To implement case insensitive string comparison and hash functions I need a toupper() function that supports unicode.

AFAICT, if I use setlocale(LC_CTYPE, "C-UTF-8") first, I should then be able to use towupper() for this purpose, but I guess this doesn't work so well in a shared library, where it will be called from many different programs?
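
To illustrate the concern: in standard C, setlocale() changes state for the whole calling process, so a library that switches LC_CTYPE also switches it for the program that called it. The sketch below saves and restores the caller's locale around a single towupper() call. The helper name towupper_in_locale() is hypothetical, and whether a locale named "C-UTF-8" is available depends on the C library in use. The save/restore does not make this safe if another task calls locale-dependent functions in between.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical helper: upper-case one wide character under a temporarily
 * selected LC_CTYPE locale, then restore the caller's locale.
 * Returns wc unchanged if the requested locale cannot be selected. */
wint_t towupper_in_locale(wint_t wc, const char *locale_name)
{
    assert(locale_name != NULL);

    /* The string returned by setlocale() may be overwritten by the next
     * call, so keep a private copy of the current locale name. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL)
        return wc;
    char *saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return wc;
    strcpy(saved, cur);

    wint_t result = wc;
    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        result = towupper(wc);          /* locale-dependent mapping */
        setlocale(LC_CTYPE, saved);     /* put the caller's locale back */
    }
    free(saved);
    return result;
}
```

A call such as towupper_in_locale(wc, "C-UTF-8") would use the locale name from the post above. Paying for two setlocale() calls per character is one more reason this approach is unattractive inside a filesystem.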

Re: filesysbox ntfs ubs massStorage problem

Posted: Thu Mar 13, 2014 12:45 pm
by chris
salass00 wrote: To implement case insensitive string comparison and hash functions I need a toupper() function that supports unicode.
You could use libunistring until locale.library gets UTF-8 support?

Re: filesysbox ntfs ubs massStorage problem

Posted: Thu Mar 13, 2014 1:37 pm
by colinw
salass00 wrote: To implement case insensitive string comparison and hash functions I need a toupper() function that supports unicode.
As I mentioned in a previous post, you must avoid single byte operations on a UTF-8 stream because
one byte != one glyph anymore, at least for byte values above 0x7F. ToUpper() / ToLower() can't work.

For example, take the Angstrom character in UTF-8: its code point value is U+212B and it is
represented by 3 bytes, 0xE2 0x84 0xAB. Using a function that performs single byte operations
on each of those 3 bytes within the UTF-8 stream is simply not going to work.

To make things even more interesting, the Angstrom can also be "composed" in UTF-8 by using the
capital Latin letter "A" followed by a combining ring above (U+030A).
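
As a small illustration of the byte sequences discussed above, here is a minimal UTF-8 encoder sketch. utf8_encode() is a hypothetical helper written for this example, not a filesysbox or AmigaOS function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * out must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding U+212B with this sketch yields the three bytes 0xE2 0x84 0xAB, while the composed form is "A" (0x41) followed by U+030A (0xCC 0x8A). The two spellings share no bytes, so a byte-wise comparison sees them as different names unless the strings are normalized first.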

Re: filesysbox ntfs ubs massStorage problem

Posted: Thu Mar 13, 2014 1:50 pm
by salass00
colinw wrote:
salass00 wrote: To implement case insensitive string comparison and hash functions I need a toupper() function that supports unicode.
As I mentioned in a previous post, you must avoid single byte operations on a UTF-8 stream because
one byte != one glyph anymore, at least for byte values above 0x7F. ToUpper() / ToLower() can't work.

For example, take the Angstrom character in UTF-8: its code point value is U+212B and it is
represented by 3 bytes, 0xE2 0x84 0xAB. Using a function that performs single byte operations
on each of those 3 bytes within the UTF-8 stream is simply not going to work.
I wasn't talking about doing any single byte operations or even using toupper() itself. I already have code for decoding UTF-8 multibyte sequences into 32-bit unicode values. What I need is a toupper()-like function which takes this 32-bit unicode value and converts it into its equivalent upper case unicode value if it has one.

The newlib.library towupper() accepts a wchar_t which is a 32-bit integer which is why I mentioned it.
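
A rough sketch of that approach, decoding each sequence to a 32-bit value, mapping it to upper case, and comparing the results, might look like the following. All three function names are hypothetical, and the case mapping is deliberately incomplete: it covers only ASCII and the Latin-1 letters U+00E0 to U+00FE. A real implementation would need a table generated from the Unicode case mapping data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from a NUL-terminated UTF-8 string.
 * Returns the number of bytes consumed, or 0 if the sequence is malformed. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);
    if (s[0] < 0x80) { *cp = s[0]; return 1; }

    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation or invalid lead byte */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* also stops at the terminating NUL */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Incomplete upper-case mapping: ASCII and Latin-1 letters only. */
uint32_t ucs4_toupper_sketch(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - 0x20;
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7)  /* U+00F7 is the division sign */
        return cp - 0x20;
    return cp;
}

/* Case-insensitive comparison of two NUL-terminated UTF-8 strings.
 * A malformed byte is compared as its raw value. */
int utf8_casecmp_sketch(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca, cb;
        size_t na = utf8_decode(pa, &ca);
        size_t nb = utf8_decode(pb, &cb);

        if (na == 0) { ca = *pa; na = 1; }
        if (nb == 0) { cb = *pb; nb = 1; }

        ca = ucs4_toupper_sketch(ca);
        cb = ucs4_toupper_sketch(cb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)                 /* both strings ended together */
            return 0;
        pa += na;
        pb += nb;
    }
}
```

The same decode-then-map loop could feed a hash function instead of a comparison, so that names differing only in case hash to the same bucket.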

Re: filesysbox ntfs ubs massStorage problem

Posted: Thu Mar 13, 2014 2:04 pm
by salass00
chris wrote:
salass00 wrote: To implement case insensitive string comparison and hash functions I need a toupper() function that supports unicode.
You could use libunistring until locale.library gets UTF-8 support?
I would rather not use GPL code in this case and this does much more than I need it to do, but thanks for the suggestion anyway.

Re: filesysbox ntfs ubs massStorage problem

Posted: Thu Mar 13, 2014 2:06 pm
by colinw
salass00 wrote: The newlib.library towupper() accepts a wchar_t which is a 32-bit integer which is why I mentioned it.
Careful, I think the wchar_t is defined as 16 bit in our includes, better check it.

Re: filesysbox ntfs ubs massStorage problem

Posted: Thu Mar 13, 2014 2:14 pm
by salass00
colinw wrote: Careful, I think the wchar_t is defined as 16 bit in our includes, better check it.
I know it's not (it was discussed on the developer mailing list a while back), but it probably doesn't matter because the UTF-8 support in the wide char functions has to be enabled first with setlocale() which means I probably won't be able to use it anyway.

In fact this is its definition from SDK/newlib/include/stddef.h:

typedef int wchar_t;
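
For anyone who wants to verify this on their own toolchain, a quick check along these lines should do. The function names are made up for the example, and the result depends on the compiler and headers in use. With the typedef above it reports 32 bits only if int is 32 bits wide on the target.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of wchar_t, in bits, for the toolchain this is compiled with. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero if every code point up to U+10FFFF fits in a wchar_t. */
int wchar_holds_all_unicode(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}

/* Optional start-up check for code that stores full 32-bit code points
 * in wchar_t; aborts when wchar_t is too narrow (for example 16 bits). */
void require_wide_wchar(void)
{
    assert(wchar_holds_all_unicode());
}
```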