Ticket #127 (assigned defect)

Opened 10 months ago

Last modified 3 months ago

Support for UTF-8 tags is buggy

Reported by: mikeshi
Owned by: moeffju
Priority: critical
Milestone: 0.6
Component: Habari Core Software
Version: SVN
Keywords: tag
Cc:

Description (last modified by moeffju)

Tags with characters outside the ASCII range sometimes break on save and display.

Change History

Changed 9 months ago by dmondark

  • milestone changed from 0.4 to Undetermined

Moving to a later milestone

Changed 8 months ago by moeffju

  • owner set to moeffju
  • status changed from new to assigned
  • summary changed from "Not support Chinese Tag" to "Support for UTF-8 tags is buggy"
  • description modified
  • milestone changed from Undetermined to 0.5

It seems that tags somehow end up in Latin-1 or Windows-1252 encoding. Either way, the tags are mangled somewhere during tag processing.

Changed 8 months ago by moeffju

  • description modified

This is fixed for me in SVN HEAD. If you converted an old installation, make sure your habaritags table has VARCHAR columns, not VARBINARY.

Can someone try to reproduce with CJK characters?

Changed 8 months ago by moeffju

OK, CJK characters break horribly:

Error: PDOStatement::execute() [function.PDOStatement-execute]: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '221-269' for key 1

(Key 1 is the UNIQUE on tag_text.)

This was upon trying to save a post with random CJK chars in the title and tags.

Changed 6 months ago by jaypipes

Verified on my local SVN install that UTF-8 characters are being mangled. I attempted to tag one of my posts with the Chinese character 喂 (for hello) and it produces a question mark in the tagging interface. So the tag saves, but it is mangled.

On further investigation, it turns out that even if the database is set to accept utf8 (which it is not by default, even though system/schema/mysql/connection.php sets the charset to utf8), the PHP or AJAX calls still mangle the character:

mysql> insert into tags values (null,'喂','喂');
Query OK, 1 row affected (0.00 sec)

mysql> select * from tags;
+----+----------------+----------------+
| id | tag_text       | tag_slug       |
+----+----------------+----------------+
|  1 | Thought Tree   | thought-tree   | 
|  2 | DB Rambles     | db-rambles     | 
|  3 | Open Source    | open-source    | 
|  4 | SEO Annoyances | seo-annoyances | 
|  5 | Articles       | articles       | 
|  6 | MySQL          | mysql          | 
|  7 | Web 2.0        | web-2-0        | 
|  8 | PHP            | php            | 
|  9 | PostgreSQL     | postgresql     | 
| 13 | 喂            | 喂            | 
+----+----------------+----------------+
10 rows in set (0.00 sec)

As you can see, the tag I inserted manually is stored intact in the tags table.

However, in the admin section, a couple of different behaviours are evident:

1) If I go to a new post and click the "Tags" selector, I see "å–" output for the Chinese character tag. Checking the content type of the HTML source, I see "text/html; charset=utf-8", so something is clearly mangling the character before it gets to the output stage.

2) If I type 喂 into the blog post and save, it is replaced with a ? mark.

Clearly, either InputFilter or something else is not saving UTF-8 characters properly.
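
Both symptoms are consistent with the UTF-8 bytes passing through a single-byte charset somewhere. The following is only a sketch in Python (not Habari code; the variable names are made up for illustration), assuming the culprit is a Windows-1252 or Latin-1 step in the pipeline:

```python
tag = "喂"                           # U+5582
utf8_bytes = tag.encode("utf-8")    # three bytes: e5 96 82

# Symptom 1: the UTF-8 bytes are read back as if they were Windows-1252.
# The result starts with the same "å–" seen in the Tags selector.
mojibake = utf8_bytes.decode("cp1252")

# Symptom 2: the character is forced into Latin-1, which cannot represent it,
# so the converter substitutes a question mark.
replaced = tag.encode("latin-1", errors="replace")

print(utf8_bytes.hex())   # e5 96 82 without the spaces
print(ascii(mojibake))    # escaped form, safe on any terminal
print(replaced)           # b'?'
```

If this assumption holds, the first symptom points at a wrong decode on the way out of the database, and the second at a lossy conversion on the way in.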

Changed 5 months ago by gsnedders

Also note that you cannot have a tag called "é" and a tag called "á" without causing issues, as they both create the same slug (a single dash, assuming they are entered in their NFC forms).
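
To make the collision concrete, here is a rough Python sketch. naive_slug is a hypothetical stand-in for an ASCII-only slug routine (it is not the actual Habari slug code), and folded_slug is just one possible mitigation, not a proposed patch:

```python
import re
import unicodedata


def naive_slug(text):
    """Hypothetical ASCII-only slug: any run of other characters becomes '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower())


def folded_slug(text):
    """Possible mitigation: decompose (NFKD) and drop accents before slugging."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return naive_slug(ascii_only)


e_acute = unicodedata.normalize("NFC", "\u00e9")  # é as a single code point
a_acute = unicodedata.normalize("NFC", "\u00e1")  # á as a single code point

print(naive_slug(e_acute), naive_slug(a_acute))    # both are '-'
print(folded_slug(e_acute), folded_slug(a_acute))  # 'e' and 'a'
```

Accent folding only helps for characters that decompose to ASCII. A CJK tag such as 喂 still ends up with an empty slug under folded_slug, so it would need a different scheme (for example, keeping the UTF-8 characters in the slug).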

Changed 3 months ago by chrismeller

  • milestone changed from 0.5 to 0.6

Changed 3 months ago by rickc

This may be due to Post::parsetags(), which uses str_replace and regular expression functions without the 'u' modifier, neither of which is UTF-8 safe.
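
As a rough analogue of the difference the 'u' modifier makes, the Python sketch below runs the same patterns once over raw bytes and once over decoded text. This is an illustration under that assumption only. It is not the parsetags() code, and the sample tag string is made up:

```python
import re

tags = "喂, \u00e9"          # two tags: 喂 and é
raw = tags.encode("utf-8")  # the bytes a byte-oriented regex would see

# Byte-oriented matching: each byte counts as one "character",
# so 喂 (3 bytes) and é (2 bytes) are split into 5 units.
byte_units = re.findall(rb"[^, ]", raw)

# Character-oriented matching: each code point is one unit.
char_units = re.findall(r"[^, ]", tags)

# Replacing "non-word" characters shows the damage: in byte mode every
# non-ASCII byte is treated as punctuation and replaced.
byte_sub = re.sub(rb"\W", b"-", raw)
char_sub = re.sub(r"\W", "-", tags)

print(len(byte_units), len(char_units))  # 5 2
print(byte_sub)                          # b'-------'
print(ascii(char_sub))                   # escaped form of 喂--é
```

In byte mode the tag text is destroyed before it ever reaches the database, which would match the mangling reported above.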

Changed 3 months ago by rickc

Have the changes in r2430 fixed this?
