likes
comments
collection
share

全网最详细 | 数据库教程10:MySQL中字符集和排序规则的详细介绍,为什么中文等字符编码(Unicode)下一定要使用utf8mb4?

作者站长头像
站长
· 阅读数 10

两个不同的字符集不能有相同的排序规则。且每个字符集都有一个默认的排序规则。

本篇大部分内容出自官网文档...

查看当前支持的字符集

INFORMATION_SCHEMA.CHARACTER_SETS

INFORMATION_SCHEMA.CHARACTER_SETS 表中可以查看可用的字符集。

MAXLEN 列表示:The maximum number of bytes required to store one character.

MariaDB [(none)]> select * from INFORMATION_SCHEMA.CHARACTER_SETS;
+--------------------+----------------------+-----------------------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION                 | MAXLEN |
+--------------------+----------------------+-----------------------------+--------+
| big5               | big5_chinese_ci      | Big5 Traditional Chinese    |      2 |
| dec8               | dec8_swedish_ci      | DEC West European           |      1 |
| cp850              | cp850_general_ci     | DOS West European           |      1 |
| hp8                | hp8_english_ci       | HP West European            |      1 |
| koi8r              | koi8r_general_ci     | KOI8-R Relcom Russian       |      1 |
| latin1             | latin1_swedish_ci    | cp1252 West European        |      1 |
| latin2             | latin2_general_ci    | ISO 8859-2 Central European |      1 |
...
...

或者查看包含 utf8 的字符集,可以使用LIKE子句或WHERE子句。

MariaDB [(none)]> SELECT * FROM INFORMATION_SCHEMA.CHARACTER_SETS
    ->   WHERE CHARACTER_SET_NAME LIKE '%utf8%';
+--------------------+----------------------+---------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION   | MAXLEN |
+--------------------+----------------------+---------------+--------+
| utf8               | utf8_general_ci      | UTF-8 Unicode |      3 |
| utf8mb4            | utf8mb4_general_ci   | UTF-8 Unicode |      4 |
+--------------------+----------------------+---------------+--------+
2 rows in set (0.000 sec)

SHOW CHARACTER SET

相对于 INFORMATION_SCHEMA.CHARACTER_SETS 表,SHOW CHARACTER SET 语句更常用于查看支持的字符集。

MariaDB [(none)]> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
...
...

或者查看包含 utf8 的字符集,可以使用LIKE子句或WHERE子句。

MariaDB [(none)]> SHOW CHARACTER SET LIKE '%utf8%';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
+---------+---------------+--------------------+--------+
2 rows in set (0.005 sec)

查看当前支持的排序规则Collation

一个给定的字符集至少有一个排序规则,大多数字符集有多个。

使用 INFORMATION_SCHEMA.COLLATIONS 表或 SHOW COLLATION 语句,可以显示一个字符集的排序规则。

SHOW COLLATION默认显示所有的排序规则,使用WHERE、LIKE等子句可以过滤显示指定字符集的排序规则。

查看默认字符集 utf8mb4 的排序规则。

MariaDB [(none)]> SHOW COLLATION WHERE Charset = 'utf8mb4';
+------------------------------+---------+------+---------+----------+---------+
| Collation                    | Charset | Id   | Default | Compiled | Sortlen |
+------------------------------+---------+------+---------+----------+---------+
| utf8mb4_general_ci           | utf8mb4 |   45 | Yes     | Yes      |       1 |
| utf8mb4_bin                  | utf8mb4 |   46 |         | Yes      |       1 |
| utf8mb4_unicode_ci           | utf8mb4 |  224 |         | Yes      |       8 |
| utf8mb4_icelandic_ci         | utf8mb4 |  225 |         | Yes      |       8 |
...
...
...
| utf8mb4_unicode_nopad_ci     | utf8mb4 | 1248 |         | Yes      |       8 |
| utf8mb4_unicode_520_nopad_ci | utf8mb4 | 1270 |         | Yes      |       8 |
+------------------------------+---------+------+---------+----------+---------+
33 rows in set (0.008 sec)

[为什么推荐使用utf8mb4?]MySQL中的utf8和utf8mb4字符集

真正支持存储Unicode字符

MySQL/MariaDB 的 utf8 编码并不是真正的UTF-8编码。在MySQL/MariaDB中,utf8最多只支持3个字节,上面的查询中已经可以看到。

而简体中文在 utf8 编码中,一般占3个字节,使用扩展区的中文会占用4个字节。UTF-8编码是可变长编码,英文等字母字符占1个字节,其他复杂字符会占用2-4个字节。

但是目前常使用的emoji表情、繁体字、特殊字符、不常见汉字等,会占用4个字节。这就导致,

如果在 MySQL/MariaDB 中使用 utf8 编码,将无法存储4个字节的汉字、emoji等表情。MySQL/MariaDB 的 utf8 编码占用最多3个字节是由于历史原因造成的。

utf8mb4才是真正的UTF-8编码,可以最多支持4个字节

所以实际开发中,推荐使用 utf8mb4 编码字符集,这是真正的Unicode编码。尤其是在用到中文、emoji等字符的环境中。否则就会遇到字节数不正确,报错不正确的字符值(Incorrect string value

MySQL 8.0开始,默认的字符编码已经改为utf8mb4,MariaDB 10.0仍为latin1

UTF-8编码是U+2528D,属于CJK Unified Ideographs Extension B(中日韩统一表意文字扩充B)字符集的字符,处于第二辅助平面(SIP,表意文字补充平面),最多支持4个字节。

而Mysql的utf8编码则属于常见的基本多文种平面(BMP,即Unicode编码范围在0000-FFFF之内)的字符,最多支持3个字节。

关于utf8mb4_bin排序规则

使用utf8mb4字符集时,通常推荐使用utf8mb4_unicode_ci(或utf8mb4_general_ci)排序规则。

但,如果想要区分区大小写且同时区分重音,推荐使用utf8mb4_bin。一般区分大小写的也都区分重音。utf8_unicode_cs并不是很推荐!

指定字符集和排序规则

修改MySQL服务器的排序规则

Linux系统下,修改 /etc/my.cnf /etc/my.cnf.d/server.cnf 文件;Windows系统下,修改MySQL/Mariadb安装目录下data/my.ini文件

在对应[mysql]、[mysqld]下设置:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

设置完后重启(restart)MySQL/MariaDB。

还可以使用命令行,设置字符集和排序规则

mysqld --character-set-server=utf8 --collation-server=utf8_unicode_cs

创建数据库时指定

创建数据库时指定字符集和排序规则的语法如下:

CREATE DATABASE IF NOT EXISTS database_name 
    DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;

创建并查看一个数据库的排序规则和字符集:

CREATE DATABASE IF NOT EXISTS CHARSETTest 
    DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- 查看数据库
show create database CHARSETTest;

-- 返回
-- CREATE DATABASE `CHARSETTest` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */

或:

CREATE DATABASE IF NOT EXISTS CHARSETTest1 
    DEFAULT CHARSET latin1 COLLATE latin1_swedish_ci;

-- 查看数据库
show create database CHARSETTest1;
-- 返回
-- CREATE DATABASE `CHARSETTest1` /*!40100 DEFAULT CHARACTER SET latin1 */

指定表和列的字符集和排序规则

如下,可以在创建一个表时,指定字符集和排序规则DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci。不指定时,默认使用所在数据库的字符集和排序规则。

同时,可以在列名后,指定当前列的排序规则,如 id char(4) COLLATE utf8mb4_unicode_ci

CREATE TABLE `Product` (
  `id` char(4) COLLATE utf8mb4_unicode_ci NOT NULL,
  `name` varchar(100) COLLATE utf8mb4_unicode_ci NOT NULL,
  `type` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `sale_price` int(11) DEFAULT NULL,
  `purchase_price` int(11) DEFAULT NULL,
  `regist_date` date DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

通过SHOW CREATE TABLE tablename;语句,可以完整查看创建表和列的语句,及其中用到的字符集、排序规则

对应的,可以通过 ALTER TABLE 语句实现对列使用的字符编码的修改。

修改数据库的字符集和排序规则

ALTER DATABASE <db_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

示例:

MariaDB [test]> ALTER DATABASE CHARSETTest1 CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Query OK, 1 row affected (0.008 sec)

MariaDB [test]> show create database CHARSETTest1;
+--------------+-----------------------------------------------------------------------------------------------------+
| Database     | Create Database                                                                                     |
+--------------+-----------------------------------------------------------------------------------------------------+
| CHARSETTest1 | CREATE DATABASE `CHARSETTest1` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */ |
+--------------+-----------------------------------------------------------------------------------------------------+
1 row in set (0.006 sec)

修改表的字符集

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

示例:

MariaDB [test]> ALTER TABLE OrderTest CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Query OK, 4 rows affected (0.035 sec)
Records: 4  Duplicates: 0  Warnings: 0

MariaDB [test]> show create table OrderTest;
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table     | Create Table                                                                                                                                         |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| OrderTest | CREATE TABLE `OrderTest` (...) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.002 sec)

修改某一个列的排序规则

ALTER TABLE t_name MODIFY c_name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL;

查看不同对象的字符编码或排序规则

SHOW VARIABLES查看MySQL系统设置的排序规则

SHOW VARIABLES系统变量可以返回和排序规则相关的各个级别的默认设置。

如下,显示与字符集和排序规则相关的所有变量的设置:

MariaDB [(none)]> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | utf8mb4            |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_general_ci |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.001 sec)

最后三行分别是服务器、数据库和连接的默认排序规则。

SELECT @@collation_server;

查看当前数据库的字符集和排序规则

全局变量

use <db_name>;
SELECT @@character_set_database, @@collation_database;

如下查看:

MariaDB [test]> SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| latin1                   | latin1_swedish_ci    |
+--------------------------+----------------------+
1 row in set (0.004 sec)

如果未指定数据库时,将返回MySQL系统级别的字符集和排序规则。

使用information_schema.schemata

查询information_schema.schemata表,也可以获取数据库的排序规则。而且不需要切换数据库,可直接查看指定数据库的collate。

如下:

MariaDB [(none)]> SELECT
    ->    default_character_set_name,
    ->    default_collation_name
    -> FROM information_schema.schemata
    -> WHERE schema_name = 'test';
+----------------------------+------------------------+
| default_character_set_name | default_collation_name |
+----------------------------+------------------------+
| latin1                     | latin1_swedish_ci      |
+----------------------------+------------------------+
1 row in set (0.001 sec)

使用环境变量variables

USE db_name;
show variables like "character_set_database";
-- or:
-- show variables like "collation_database";

查看指定表的排序规则

创建一个存放ABab的数据表。后续操作基于此表演示。

CREATE TABLE OrderTest(
  letter char(1) NOT NULL
);
INSERT INTO OrderTest values('B'),('b'),('A'),('a');

通过查询 INFORMATION_SCHEMA.TABLES 表,可以查看某个表的排序规则。

比如:

SELECT TABLE_SCHEMA  -- 会有警告,结果为dbname
    , TABLE_NAME
    , TABLE_COLLATION 
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 't_name';
MariaDB [test]> SELECT TABLE_SCHEMA
    ->     , TABLE_NAME
    ->     , TABLE_COLLATION
    -> FROM INFORMATION_SCHEMA.TABLES
    -> WHERE TABLE_NAME = 'OrderTest';
+--------------+------------+-------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION   |
+--------------+------------+-------------------+
| test         | ordertest  | latin1_swedish_ci |
+--------------+------------+-------------------+
1 row in set, 2 warnings (0.003 sec)

除此之外,还可以使用 SHOW TABLE STATUS LIKE 't_name';SHOW TABLE STATUS where name like 't_name'; 语句查看表的 Collation 。(缺点是无法选择输出列...)

MariaDB [test]> SHOW TABLE STATUS LIKE 'OrderTest'\G
*************************** 1. row ***************************
            Name: OrderTest
          Engine: InnoDB
         Version: 10
      Row_format: Dynamic
            Rows: 4
  Avg_row_length: 4096
     Data_length: 16384
 Max_data_length: 0
    Index_length: 0
       Data_free: 0
  Auto_increment: NULL
     Create_time: 2021-08-10 10:38:06
     Update_time: 2021-08-10 10:38:06
      Check_time: NULL
       Collation: latin1_swedish_ci
        Checksum: NULL
  Create_options:
         Comment:
Max_index_length: 0
       Temporary: N
1 row in set (0.001 sec)

查看列的字符集和排序规则

INFORMATION_SCHEMA.COLUMNS表

INFORMATION_SCHEMA.COLUMNS 表中,可以查看列的排序规则。

SELECT TABLE_NAME 
    , COLUMN_NAME 
    , character_set_name
    , COLLATION_NAME 
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 't_name';

如果想要查看所在数据库,可以在附加上TABLE_SCHEMA列。

如下:

MariaDB [(none)]> SELECT TABLE_NAME
    ->     , COLUMN_NAME
    ->     , character_set_name
    ->     , COLLATION_NAME
    -> FROM INFORMATION_SCHEMA.COLUMNS
    -> WHERE TABLE_NAME = 'OrderTest';
+------------+-------------+--------------------+-------------------+
| TABLE_NAME | COLUMN_NAME | character_set_name | COLLATION_NAME    |
+------------+-------------+--------------------+-------------------+
| ordertest  | letter      | latin1             | latin1_swedish_ci |
+------------+-------------+--------------------+-------------------+
1 row in set (0.011 sec)

SHOW FULL COLUMNS FROM <table>

SHOW FULL COLUMNS FROM <table>可以返回一个表所有列的信息,包括排序规则。

MariaDB [test]> SHOW FULL COLUMNS FROM OrderTest;
+--------+---------+-------------------+------+-----+---------+-------+---------------------------------+---------+
| Field  | Type    | Collation         | Null | Key | Default | Extra | Privileges                      | Comment |
+--------+---------+-------------------+------+-----+---------+-------+---------------------------------+---------+
| letter | char(1) | latin1_swedish_ci | NO   |     | NULL    |       | select,insert,update,references |         |
+--------+---------+-------------------+------+-----+---------+-------+---------------------------------+---------+
1 row in set (0.005 sec)

SQL语句中的显式使用排序规则

使用COLLATE子句指定排序规则

默认查询使用默认排序规则,不区分大小写:

MariaDB [test]> select * from OrderTest where letter='a';
+--------+
| letter |
+--------+
| A      |
| a      |
+--------+
2 rows in set (0.000 sec)

指定字符集,查找严格区分大小写:

MariaDB [test]> select * from OrderTest where letter collate latin1_general_cs ='a';
+--------+
| letter |
+--------+
| a      |
+--------+
1 row in set (0.007 sec)

MySQL/MariaDB排序中如何严格按照字符的码点排序?

从下面的查询,可以看到,MySQL/MariaDB的Order By子句排序查询中,指定排序列的Collation为区分大小写的latin1_general_cs,并不会使结果按照字符码点严格排序。

MariaDB [test]> select * from OrderTest order by letter;
+--------+
| letter |
+--------+
| A      |
| a      |
| B      |
| b      |
+--------+
4 rows in set (0.000 sec)

MariaDB [test]> select * from OrderTest order by letter collate latin1_general_cs;
+--------+
| letter |
+--------+
| A      |
| a      |
| B      |
| b      |
+--------+
4 rows in set (0.002 sec)

而要想实现按照字符的Unicode码值排序,可以使用二进制的latin1_binutf8mb4_bin等排序规则。如下:

MariaDB [test]> select * from OrderTest order by letter collate latin1_bin;
+--------+
| letter |
+--------+
| A      |
| B      |
| a      |
| b      |
+--------+
4 rows in set (0.002 sec)
转载自:https://juejin.cn/post/6995483137904115719
评论
请登录