[solved] The disguised ö - or an encoding problem

Stephan
New php-forum User
Posts: 2
Joined: Sun May 16, 2021 8:57 pm

Sun May 16, 2021 9:41 pm

Hello community,

I have a strange encoding problem. My application is UTF-8 throughout and works fine, so this is not the usual UTF-8 problem.

First some base information about the environment:

I am writing a ticketing system that also handles e-mail. The system fetches mail over IMAP from configured mailboxes using the PHP IMAP functions and stores the mail content in a MySQL database.

Outgoing mail is sent over SMTP with the PHPMailer library. I write the e-mail content to a table for outgoing mail, and a cron job sends it later.

I add a process ID to every e-mail subject when sending. If someone replies to such a mail, I search for this ID in the subject, and if it matches an existing process in the system I merge the reply into that process. All entries grouped into one process share the same process ID in the database.

If you send a mail from this system to another address that is also handled by this system, the process ID would cause the incoming mail to be merged back into the process it was sent from. To prevent that, I look in the outgoing mail table for a mail with the same from address, to address, and subject. If one exists, I create a new process for the receiving user instead. This works fine so far.

All tables use utf8mb4 with the utf8mb4_general_ci collation, so up to 4 bytes of UTF-8 per character, to be able to store every character that can appear in an e-mail. Many of the icons used in e-mail subjects can only be stored with this character set.

Now the problem:

If I send a mail out with PHPMailer and fetch it again from the mailbox of the receiver, the encoding of special characters in the subject has changed. In any editor that supports UTF-8 both versions look identical. Only if you paste them into Notepad++ and switch the encoding from UTF-8 to ANSI do you see the difference.

The two strings pasted into a UTF-8 encoded document:

Auflösungsvertrag
Auflösungsvertrag


After switching to ANSI in the Encoding menu:

Auflösungsvertrag
Auflösungsvertrag

The first string is the one from the database before it is sent out by mail. The second is what comes back.
Both seem to be correct UTF-8 encodings of the character, but MySQL handles them as two different characters, so searching for the subject doesn't work. If I manually delete the ö and type it again, it works.
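To make the difference visible without Notepad++, here is a minimal sketch that writes both spellings with explicit code point escapes and dumps their bytes. The variable names are only for illustration. The byte values match the ANSI view above: `Ã¶` is C3 B6, and `oÌˆ` is a plain `o` (6F) followed by CC 88.

```php
<?php
// Both spellings of the subject, written with \u{...} escapes (PHP 7.0+)
// so the difference survives copy and paste.
$composed   = "Aufl\u{00F6}sungsvertrag";  // U+00F6, one precomposed code point
$decomposed = "Auflo\u{0308}sungsvertrag"; // "o" followed by U+0308 COMBINING DIAERESIS

// The two forms of "ö" render the same but are stored as different bytes.
echo bin2hex("\u{00F6}"), "\n";  // c3b6
echo bin2hex("o\u{0308}"), "\n"; // 6fcc88

// A byte-wise comparison therefore fails, and the lengths differ by one byte.
var_dump($composed === $decomposed); // bool(false)
var_dump(strlen($composed));         // int(18)
var_dump(strlen($decomposed));       // int(19)
```

Retyping the ö on the keyboard produces the single code point U+00F6, which would explain why the search works again after deleting and retyping it.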

I think it has something to do with UTF-8 versus UTF-32 encoding, but I cannot find any information about it or about how to handle it the right way. Perhaps there is a PHP function that can convert from string A to string B.

I did some testing with mb_convert_encoding, utf8_encode and iconv, but I could not turn the one string into the other.
I also looked for a setting that makes MySQL treat them as one character and did not find anything.
The common UTF-8 problem also makes this hard to search for, because there are hundreds of posts in that direction :)
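A small check, assuming the mbstring extension is loaded, suggests why the conversion attempts had no effect: both strings are already well-formed UTF-8, and converting between encodings only changes how the same sequence of code points is serialized. The sketch below uses only mb_convert_encoding; the variable names are illustrative.

```php
<?php
$composed   = "Aufl\u{00F6}sungsvertrag";  // as stored in the database
$decomposed = "Auflo\u{0308}sungsvertrag"; // as fetched back over IMAP

// Neither string is "broken": both validate as UTF-8.
var_dump(mb_check_encoding($composed, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($decomposed, 'UTF-8')); // bool(true)

// Converting UTF-8 to UTF-8 leaves the code points untouched.
$roundTrip = mb_convert_encoding($decomposed, 'UTF-8', 'UTF-8');
var_dump($roundTrip === $decomposed); // bool(true)

// A detour through UTF-32 also returns the same code points.
$asUtf32  = mb_convert_encoding($decomposed, 'UTF-32BE', 'UTF-8');
$viaUtf32 = mb_convert_encoding($asUtf32, 'UTF-8', 'UTF-32BE');
var_dump($viaUtf32 === $decomposed); // bool(true)
var_dump($viaUtf32 === $composed);   // bool(false)

// Counting code points shows the real difference: 17 versus 18.
var_dump(mb_strlen($composed, 'UTF-8'));   // int(17)
var_dump(mb_strlen($decomposed, 'UTF-8')); // int(18)
```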

And I don't want to do manual search-and-replace, because then I only catch the characters I explicitly handle.
There must be a way to get this done right.

Thanks for reading, and perhaps for a helpful answer.
Last edited by Stephan on Sun May 16, 2021 10:53 pm, edited 1 time in total.
Stephan

Sun May 16, 2021 10:53 pm

I found the solution.

The topic is called Unicode equivalence, and there are methods that normalize a string to one canonical form.

https://en.wikipedia.org/wiki/Unicode_equivalence

PHP also has a class for this.

https://www.php.net/manual/de/normalizer.normalize.php

I had to call

normalizer_normalize( $myString, Normalizer::NFKC );
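A minimal sketch of how this can be applied, assuming the intl extension is installed (it provides the Normalizer class). The helper name normalizeSubject is hypothetical and not part of my actual code. NFC alone is enough to compose `o` + U+0308 into `ö`; NFKC, which I used, additionally folds compatibility characters such as ligatures, so which form fits best depends on how strictly subjects should match.

```php
<?php
// Requires the intl extension (Normalizer class).
$stored   = "Aufl\u{00F6}sungsvertrag";  // composed form from the database
$received = "Auflo\u{0308}sungsvertrag"; // decomposed form fetched from the mailbox

// Canonical composition turns "o" + U+0308 into the single code point U+00F6.
var_dump(Normalizer::normalize($received, Normalizer::NFC) === $stored);  // bool(true)
var_dump(Normalizer::normalize($received, Normalizer::NFKC) === $stored); // bool(true)

// Difference between the two forms: NFC keeps a ligature, NFKC expands it.
$ligature = "\u{FB01}"; // LATIN SMALL LIGATURE FI
var_dump(Normalizer::normalize($ligature, Normalizer::NFC) === $ligature); // bool(true)
var_dump(Normalizer::normalize($ligature, Normalizer::NFKC));              // string(2) "fi"

// Hypothetical helper: normalize a subject before storing or comparing it.
// Normalizer::normalize() returns false on failure (e.g. invalid UTF-8),
// in which case the input is returned unchanged.
function normalizeSubject(string $subject): string
{
    $result = Normalizer::normalize($subject, Normalizer::NFKC);
    return $result === false ? $subject : $result;
}

var_dump(normalizeSubject($received) === normalizeSubject($stored)); // bool(true)
```

The comparison only works reliably if both sides go through the same normalization, so it seems safest to normalize the subject when writing the outgoing mail table as well as when processing fetched mail.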