UTF8 Glyph/Ideograph CodePoint validation and Rendering...

AmigaOS users can make feature requests in this forum.
Belxjander
Posts: 314
Joined: Mon May 14, 2012 10:26 pm
Location: 日本千葉県松戸市 / Matsudo City, Chiba, Japan

Post by Belxjander »

Can the graphics.library/Text() routine please be modified for the following...

Testing for a UTF8-encoded multibyte character: the high bit (0x80) is set in every octet of the character, and the first octet starts with a run of 1 bits (highest to lowest), followed by a 0 bit, where the number of 1 bits equals the number of octets that make up the character.

CodePoint 4E00 is encoded as 0xE4B880 and is the ideograph meaning "1" in Chinese and Japanese.
CodePoint 4E09 is encoded as 0xE4B889 and is the ideograph meaning "3" in Chinese and Japanese.

CodePoints 3041 through 30FF are the Hiragana and Katakana Unicode ranges.
Hex E38181 [CodePoint 3041] through E381BF, then E38280 through E38296 [CodePoint 3096], for valid Hiragana display.
Hex E382A0 [CodePoint 30A0] through E382BF, continuing with E38380 through E383BA [CodePoint 30FA], for valid Katakana.
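
As a cross-check of the hex values above, here is a minimal Python sketch of the UTF-8 bit layout. The name `encode_utf8` is made up for illustration and is not part of any AmigaOS API. The sketch assumes a valid code point and does not reject surrogates.

```python
def encode_utf8(code_point):
    """Encode one Unicode code point as UTF-8 octets (illustrative only)."""
    if code_point < 0x80:       # 1 octet:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Print the encodings of the code points quoted in the post.
for cp in (0x4E00, 0x4E09, 0x3041, 0x3096, 0x30A0, 0x30FA):
    print("U+%04X -> %s" % (cp, encode_utf8(cp).hex().upper()))
```

All the CJK and kana code points in this post fall in the 3-octet branch, which is why every sequence starts with E3 or E4.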

Validation: the first octet always has its high bit set. With the leading 1 bits set as described above, the first octet of a two-octet sequence compares equal to itself after masking with "DF" (first octets C0 through DF),
the first octet of a 3-octet encoding compares equal after masking with "EF", and the first octet of a 4-octet encoded sequence compares equal after masking with "F7".

Could there be some (optional) means of rendering the detected UTF8 sequences?
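
The detection described above can be sketched in Python as follows. This is only an illustration under my own assumptions: the function names are hypothetical, and the sketch classifies the first octet only, without rejecting overlong forms such as C0 or C1. `sequence_length` uses the usual mask-and-compare form, while `passes_mask_compare` spells out the DF / EF / F7 test from the post.

```python
def sequence_length(lead):
    """Expected octet count for a first octet, or 0 if it cannot start a sequence."""
    if (lead & 0x80) == 0x00:   # 0xxxxxxx: plain ASCII
        return 1
    if (lead & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110xxx
        return 4
    return 0                    # continuation octet (10xxxxxx) or invalid


def passes_mask_compare(lead, length):
    """The post's test: leading 1 bits set, and (lead & mask) == lead."""
    masks = {2: 0xDF, 3: 0xEF, 4: 0xF7}
    leading_ones = {2: 0xC0, 3: 0xE0, 4: 0xF0}
    return ((lead & leading_ones[length]) == leading_ones[length]
            and (lead & masks[length]) == lead)


print(sequence_length(0xE4))            # first octet of E4 B8 80
print(passes_mask_compare(0xE4, 3))
```

Both forms accept exactly the same first octets; the mask comparison just checks that the bit after the run of 1 bits is clear.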

I have several text files I need to edit and I would like to at least clearly see the Ideographs I am working with.
colinw
AmigaOS Core Developer
Posts: 207
Joined: Mon Aug 15, 2011 9:20 am
Location: Brisbane, QLD. Australia.

Re: UTF8 Glyph/Ideograph CodePoint validation and Rendering.

Post by colinw »

Belxjander wrote:Can the graphics.library/Text() routine please be modified for the following...
[...]
I have several text files I need to edit and I would like to at least clearly see the Ideographs I am working with.
Me too, join the club, it's hard to read the words to my favorite Klingon Opera.... ;)

It's going to take more than a couple of mods to the Text() function and friends, I'm afraid.
We need a proper rendering engine to provide Unicode support. Just decoding a UTF-8 byte stream is not going
to do it; that's the really easy part, and it's already built into the latest beta version of utility.library.
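
To show how small the decoding part is, here is a rough Python sketch. It is not the utility.library code, whose interface is not shown in this thread; `decode_utf8` is a hypothetical name, and the sketch does not reject overlong forms or surrogates.

```python
def decode_utf8(data):
    """Decode UTF-8 octets into a list of code points (illustrative only)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                  # 0xxxxxxx
            length, cp = 1, lead
        elif (lead & 0xE0) == 0xC0:      # 110xxxxx
            length, cp = 2, lead & 0x1F
        elif (lead & 0xF0) == 0xE0:      # 1110xxxx
            length, cp = 3, lead & 0x0F
        elif (lead & 0xF8) == 0xF0:      # 11110xxx
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid first octet at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for octet in data[i + 1:i + length]:
            if (octet & 0xC0) != 0x80:   # continuation octets are 10xxxxxx
                raise ValueError("invalid continuation octet after offset %d" % i)
            cp = (cp << 6) | (octet & 0x3F)
        code_points.append(cp)
        i += length
    return code_points


print([hex(cp) for cp in decode_utf8(bytes.fromhex("E4B880E4B889"))])
```

Turning those code points into glyphs on screen (fonts, layout, combining marks) is the part that needs the rendering engine.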

You'll just have to wait until the powers that be raise Unicode rendering to a higher priority; there are somewhat
more pressing issues to solve at the moment.
Belxjander
Posts: 314
Joined: Mon May 14, 2012 10:26 pm
Location: 日本千葉県松戸市 / Matsudo City, Chiba, Japan

Re: UTF8 Glyph/Ideograph CodePoint validation and Rendering.

Post by Belxjander »

colinw wrote:
Belxjander wrote:Can the graphics.library/Text() routine please be modified for the following...
[...]
I have several text files I need to edit and I would like to at least clearly see the Ideographs I am working with.
Me too, join the club, it's hard to read the words to my favorite Klingon Opera.... ;)

It's going to take more than a couple of mods to the Text() function and friends, I'm afraid.
We need a proper rendering engine to provide Unicode support. Just decoding a UTF-8 byte stream is not going
to do it; that's the really easy part, and it's already built into the latest beta version of utility.library.

You'll just have to wait until the powers that be raise Unicode rendering to a higher priority; there are somewhat
more pressing issues to solve at the moment.
Well, I am currently working with TimberWolf for the rendering through Cairo, and I have inconsistent results from my IME work so far.
I'm just trying to pin down a large dataset into something I can work with reversibly at the moment.

After that I'm hoping to make it a lot more reliable (and the system only requires the rendering in graphics.library so far).

One issue I am running into is entirely down to timing, and I have a couple of workarounds I need to check out.

The initial Hiragana Encoding for example...

[ぁ][あ] [ぃ][い] [ぅ][う] [ぇ][え] [ぉ][お]
[か][が] [き][ぎ] [く][ぐ] [け][げ] [こ][ご]
[さ][ざ] [し][じ] [す][ず] [せ][ぜ] [そ][ぞ]
[た][だ] [ち][ぢ] [っ][つ][づ] [て][で] [と][ど]
[な] [に] [ぬ] [ね] [の]
[は][ば][ぱ] [ひ][び][ぴ] [ふ][ぶ][ぷ] [へ][べ][ぺ] [ほ][ぼ][ぽ]
[ま] [み] [む] [め] [も]
[ゃ][や] [ゅ][ゆ] [ょ][よ]
[ら] [り] [る] [れ] [ろ]
[ゎ][わ] [ゐ] [ゑ] [を]
[ん]
[ゔ] [ゕ] [ゖ]

The main delay at the moment is the second translation step, from Hiragana to Kanji,
with the following examples in Unicode order from 4E00 (non-consecutive iteration):
[一]=[ひとつ] [丁]=[ひのと] [七]=[ななつ] [万]=[よろず] [丈]=[たけ] [三]=[みっつ]
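
For illustration only, the reading-to-Kanji step above could be sketched in Python as a lookup table. The structure and the names `PAIRS`, `READING_INDEX` and `candidates` are my own assumptions and not the IME's actual data format; the entries are just the six examples listed above.

```python
# (kanji, reading) pairs copied from the examples above, in Unicode order.
PAIRS = [
    ("一", "ひとつ"),
    ("丁", "ひのと"),
    ("七", "ななつ"),
    ("万", "よろず"),
    ("丈", "たけ"),
    ("三", "みっつ"),
]

# Index by reading. Each reading maps to a list, because in a full
# dataset one reading would normally have several candidate Kanji.
READING_INDEX = {}
for kanji, reading in PAIRS:
    READING_INDEX.setdefault(reading, []).append(kanji)


def candidates(reading):
    """Return the candidate Kanji for a Hiragana reading (empty list if unknown)."""
    return READING_INDEX.get(reading, [])


for kanji in candidates("みっつ"):
    print("U+%04X %s" % (ord(kanji), kanji))
```

Keeping the pairs as a plain list alongside the index means the same data can also be walked in the other direction (Kanji back to reading).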
ssolie
Beta Tester
Posts: 1010
Joined: Mon Dec 20, 2010 8:51 pm
Location: Canada

Re: UTF8 Glyph/Ideograph CodePoint validation and Rendering.

Post by ssolie »

Belxjander wrote:Can the graphics.library/Text() routine please be modified for the following...
I'm still waiting for that IME which installs into the input stream and outputs UTF-8 sequences.

I would appreciate it if you focused only on the IME as we agreed.

The rest will follow but we must have that IME.
ExecSG Team Lead
Belxjander
Posts: 314
Joined: Mon May 14, 2012 10:26 pm
Location: 日本千葉県松戸市 / Matsudo City, Chiba, Japan

Re: UTF8 Glyph/Ideograph CodePoint validation and Rendering.

Post by Belxjander »

I've been focusing on the IME, and looking at Unicode CodePoints and UTF8 encodings trying to think of a short method to index Japanese Kanji by readings.

I specifically have "chord"ing of input strings now completed and an initial alpha test build is on os4depot.net for anyone to check out.

There is "KDEBUG()" output on whatever channel Sashimi listens to.

Just don't modify or change the IME files in LIBS: or LOCALE: while it is running...that causes an immediate crash and I don't yet know why.

The [Chord=XXXXXXXX] values are the search keys into a TagItem array that is currently used to look up Unicode CodePoints.

Just put Perception.Library into LIBS: and the Japanese.Language file into Locale:Languages/

Select Japanese in the "Locale" preferences for a preferred language...and it will be active on the next restart.

I also have code set aside for the UTF8 conversion.

Do I need to present the UTF8 as a modified "deadkey" InputEvent? or by some other method?

P.S. I still have further code to add on top of what is already there; I've been focused on getting layered chording happening properly.

I've deliberately offloaded the IME into its own process, separate from input.device, so that it is possible to push events back through input.device if that is the way to go.
ssolie
Beta Tester
Posts: 1010
Joined: Mon Dec 20, 2010 8:51 pm
Location: Canada

Re: UTF8 Glyph/Ideograph CodePoint validation and Rendering.

Post by ssolie »

Belxjander wrote:Do I need to present the UTF8 as a modified "deadkey" InputEvent? or by some other method?
Having the IME produce a valid UTF-8 sequence output to anywhere (e.g. serial, stdout, file) is sufficient for now. Nothing more is needed yet.
ExecSG Team Lead
Belxjander
Posts: 314
Joined: Mon May 14, 2012 10:26 pm
Location: 日本千葉県松戸市 / Matsudo City, Chiba, Japan

Re: UTF8 Glyph/Ideograph CodePoint validation and Rendering.

Post by Belxjander »

ssolie wrote:
Belxjander wrote:Do I need to present the UTF8 as a modified "deadkey" InputEvent? or by some other method?
Having the IME produce a valid UTF-8 sequence output to anywhere (e.g. serial, stdout, file) is sufficient for now. Nothing more is needed yet.
Does the Sashimi debug channel qualify...???
