[One-Stop Guide] Java Regular Expressions (This One Article Is All You Need)

Webmaster

2023-07-26 16:46 · 67 views

Regular Expressions

First, what is a regular expression?

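A regular expression is a powerful text-processing tool that can match, search, replace, and extract strings.

The main reasons to use regular expressions are:

Matching and validating text: a regular expression can check whether text such as an email address, phone number, or URL follows a required format, quickly and precisely.
Searching and replacing text: a regular expression can find a specific pattern in text, for example every occurrence of a keyword in a file, and replace it with something else.
Extracting data: a regular expression can pull specific data out of text, for example email addresses or phone numbers from a web page.
Automation: regular expressions help automate text work such as generating code, renaming files in bulk, or batch-processing data.

In short, regular expressions are a powerful and flexible text-processing tool that can greatly improve both the speed and the accuracy of working with text.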

For example, suppose we need to check whether a mobile number entered by the user is well-formed:

public static void main(String[] args) {
    // An arbitrary 11-digit phone number
    String tel = "19999999999";
    int len = tel.length();
    if (len != 11) {
        System.out.println("Wrong length");
        return;
    }
    for (int i = 0; i < len; i++) {
        // Check each character of the number
        char cur = tel.charAt(i);
        // Anything outside '0'..'9' is not a digit
        if (cur < '0' || cur > '9') {
            System.out.println("Invalid input");
            break;
        }
    }
}

With a regular expression, the same check becomes:

String tel = "19999999999";
System.out.println(tel.matches("\\d{11}")); // true

Of course, a real phone number has more requirements than this; the example only shows why regular expressions are convenient.
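As a rough illustration of "more requirements", here is a minimal sketch that also constrains the first two digits. The rule itself (first digit 1, second digit 3 to 9) and the names PhoneCheck and isMobile are assumptions made for this example, not something the article specifies; the [3-9] character class it uses is explained later under range matching.

```java
import java.util.regex.Pattern;

class PhoneCheck {
    // Assumed rule, not from the article: 11 digits, starting with 1,
    // with a second digit between 3 and 9.
    private static final Pattern MOBILE = Pattern.compile("1[3-9]\\d{9}");

    static boolean isMobile(String tel) {
        // matches() requires the whole input to fit the pattern
        return tel != null && MOBILE.matcher(tel).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMobile("19999999999"));  // true
        System.out.println(isMobile("12999999999"));  // false: second digit is 2
        System.out.println(isMobile("199999999999")); // false: 12 digits
    }
}
```

Adjust the pattern to whatever numbering rules actually apply in your case.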

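Usage

A regular expression is written as a string and passed to the matches method of String. matches takes one parameter: the regular expression.

In practice:

Since a regular expression is just a string, the simplest ones are plain text such as "123" or "abc", which match exactly the strings "123" and "abc".

Exact matching is boring! I might as well use equals! You are right, which is why regular expressions are normally used for fuzzy matching.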

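How fuzzy matching works

As everyone knows, "\" is the escape character, and it is the key to regular expressions. Putting the escape character "\" in front of d, D, w, W, and so on gives them a special meaning.

For example, \d matches one digit 0-9, and \w matches one letter, underscore, or digit.

For example: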

System.out.println("123".matches("\\d\\d\\d")); // true
System.out.println("a_bc".matches("\\w\\w\\w\\w")); // true
System.out.println("a_b$".matches("\\w\\w\\w\\w")); // false

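Why write "\\"? In a Java string literal a single \ starts an escape sequence, so "\\" is how you write one ordinary "\".

Why write so many \\d and \\w? Because one \d matches exactly one digit, and one \w matches exactly one letter (or underscore or digit).

Spelling out every character like this is tedious, isn't it?

Regular expressions have a solution: quantifiers.

* matches zero or more times (>= 0): "123".matches("\\d*") // true
+ matches one or more times (>= 1): "a12".matches("\\w+") // true, "ab12".matches("\\w+") // true
? matches zero or one time (0 or 1): "12".matches("\\w?12") // true, "a12".matches("\\w?12") // true

What if I want a specific number of repetitions? Use {n} for an exact count, or {n,m} for a range of counts.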

System.out.println("12".matches("\\d{2}"));  // true
System.out.println("12".matches("\\d{3}"));  // false
System.out.println("a12".matches("\\w{1,3}")); // true
System.out.println("a123".matches("\\w{1,3}")); // false

Summary of basic usage

[Image: summary of basic usage, from Liao Xuefeng's blog]

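Tip

To match a special character literally, escape it with \. For example, \. matches a literal dot (written "\\." in Java source). The characters ^, $, ., |, ?, *, +, (, ), {, }, [, ], and \ itself all need a backslash when you mean the character itself.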
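A minimal sketch of the two usual ways to treat a metacharacter literally: escaping it by hand, or letting Pattern.quote escape a whole string. The class and method names (EscapeDemo, isDottedVersion, containsLiteral) are made up for this example.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // "\\." in Java source is the regex \. and matches only a literal dot.
    static boolean isDottedVersion(String s) {
        return s.matches("\\d+\\.\\d+");
    }

    // Pattern.quote turns an arbitrary string into a pattern that matches it literally.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(isDottedVersion("1.5"));          // true
        System.out.println(isDottedVersion("1x5"));          // false
        // An unescaped dot matches any character, so this one is true:
        System.out.println("1x5".matches("\\d+.\\d+"));      // true
        System.out.println(containsLiteral("price: (1+1)*2", "(1+1)*")); // true
    }
}
```

Pattern.quote is handy when the text to search for comes from user input and may contain any of the special characters listed above.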

That covers basic matching.

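Complex matching

Start and end

What should we use to match the start or the end of a string?

In a regular expression, ^ matches the start of the string and $ matches the end. Anchoring a pattern with ^ and $ makes it match only a whole string that fits, not some substring of it. (String.matches always requires the whole string to match, so the anchors matter mostly when searching with Matcher.find().)

For example, to test whether a string starts with a digit: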

String regex = "^\\d.*";

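Here ^ matches the start of the string, \d matches one digit, and .* matches any number of characters. The expression therefore matches any string that starts with a digit, and rejects a string whose digits only appear later.

Similarly, to test whether a string ends with a letter: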

String regex = ".*[a-zA-Z]$";

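Here $ matches the end of the string, [a-zA-Z] matches one letter, and .* matches any number of characters. The expression therefore matches any string that ends with a letter, and rejects a string whose letters only appear earlier.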
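The anchors show their effect most clearly with Matcher.find(), which searches for the pattern anywhere in the input instead of requiring the whole input to match. The following is a small sketch; AnchorDemo and its method names are invented for illustration.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // Without an anchor, find() succeeds if a digit appears anywhere.
    static boolean containsDigit(String s) {
        return Pattern.compile("\\d").matcher(s).find();
    }

    // With ^, the digit must be the first character.
    static boolean startsWithDigit(String s) {
        return Pattern.compile("^\\d").matcher(s).find();
    }

    // With $, the letter must sit at the end of the input.
    static boolean endsWithLetter(String s) {
        return Pattern.compile("[a-zA-Z]$").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsDigit("a1bc"));   // true
        System.out.println(startsWithDigit("a1bc")); // false
        System.out.println(startsWithDigit("1abc")); // true
        System.out.println(endsWithLetter("123x"));  // true
        System.out.println(endsWithLetter("x123"));  // false
    }
}
```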

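Range matching

\d matches 0 through 9, but what if I only want 3 through 5? The answer is square brackets []. In practice:

// List the allowed characters inside the brackets
System.out.println("a34".matches("[abc][345][345]")); // true
// Use - to express a range from one character to another
System.out.println("a34".matches("[a-c][3-5][3-5]")); // true
// Brackets can be combined with quantifiers
System.out.println("a34".matches("[a-c][3-5]+"));     // true

What if I want any character except a and c? Surely I don't have to list all the others. Put ^ at the start of the brackets to exclude: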

System.out.println("a34".matches("[^ac]34")); // false
System.out.println("z34".matches("[^ac]34")); // true

Alternation

Question: how do I match one of several phrases, say i love you or i love dog? You might write s.matches("i love you") || s.matches("i love dog"), or just use equals, right? That is clumsy.

Use | to express "or":

System.out.println("i love you".matches("i love (you|dog)")); // true
System.out.println("i love dog".matches("i love (you|dog)")); // true
// Matching is case-sensitive
System.out.println("i love dog".matches("i love (you|Dog)")); // false
// The parentheses are required; without them the choice is between "i love you" and "dog"
System.out.println("i love dog".matches("i love you|dog")); // false
System.out.println("dog".matches("i love you|dog")); // true

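Group matching

When a string is not one uniform run but has a prefix, a middle part, and a suffix, and we need to pull out one of those parts, what do we do? For example: "110-1340-220".

For this there are capturing groups, written with parentheses () and used together with Pattern and Matcher.

First, how to use them: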

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d{3})-(\\d{4})-(\\d{3})");
        Matcher m = p.matcher("110-1340-220");
        if (m.matches()) {
            String g1 = m.group(1);
            String g2 = m.group(2);
            String g3 = m.group(3);
            System.out.println(g1); // 110
            System.out.println(g2); // 1340
            System.out.println(g3); // 220
        } else {
            System.out.println("No match!");
        }
    }
}

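The following explains why the calls look like this; feel free to skip it.

Pattern.compile builds a Pattern object. The Pattern class keeps a String field named pattern, which compile creates and assigns. Calling matcher then creates a Matcher object:

public Matcher matcher(CharSequence input) {
    if (!compiled) {
        synchronized(this) {
            if (!compiled)
                compile();
        }
    }
    // I can't follow the rest, but this much is clear:
    // the current Pattern instance is passed in
    Matcher m = new Matcher(this, input);
    return m;
}

// Creating the Matcher also reads the number of groups and initializes the basic fields, such as the size of the groups array (based on how many groups your regex has)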

Matcher(Pattern parent, CharSequence text) {
    this.parentPattern = parent;
    this.text = text;

    // Allocate state storage
    int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
    groups = new int[parentGroupCount * 2];
    locals = new int[parent.localCount];
    localsPos = new IntHashSet[parent.localTCNCount];

    // Put fields into initial states
    reset();
}

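The Matcher then calls matches, which in turn calls match to fill the groups array. At that point the Matcher holds the group information, and the group method reads it back out. I only follow the general idea; how it is actually implemented is beyond me.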
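The group information described above can be inspected through the public Matcher API: groupCount() gives the number of capturing groups, group(0) is the whole match, and start(i)/end(i) give each group's position in the input. The sketch below uses a made-up helper named GroupInfo.describe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupInfo {
    // Lists every group as index:text[start,end), one per line.
    static String describe(String input) {
        Matcher m = Pattern.compile("(\\d{3})-(\\d{4})-(\\d{3})").matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        // Group 0 is the entire match; groups 1..groupCount() are the parentheses.
        for (int i = 0; i <= m.groupCount(); i++) {
            sb.append(i).append(':').append(m.group(i))
              .append('[').append(m.start(i)).append(',').append(m.end(i)).append(")\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("110-1340-220"));
        // 0:110-1340-220[0,12)
        // 1:110[0,3)
        // 2:1340[4,8)
        // 3:220[9,12)
    }
}
```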

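So where does the String.matches() we have been using come from? The String source shows that it still creates a Pattern and then a Matcher to do the matching. Every call to String.matches creates a new Pattern and a new Matcher, which is unnecessary when the pattern never changes.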

public boolean matches(String regex) {
    return Pattern.matches(regex, this);
}

public static boolean matches(String regex, CharSequence input) {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    return m.matches();
}

If we match against one fixed format, there is no need to call String.matches() every time; we can compile a Pattern once and reuse it.

Pattern p = Pattern.compile("(\\d{3})-(\\d{4})-(\\d{3})");

System.out.println(p.matcher("110-1340-220").matches()); // true
System.out.println(p.matcher("110-150-220").matches()); // false
System.out.println(p.matcher("1120-1340-220").matches()); // false

I found that there is a reset() method, so even the Matcher object can be reused:

Pattern p = Pattern.compile("(\\d{3})-(\\d{4})-(\\d{3})");
Matcher matcher = p.matcher("110-1340-220");
System.out.println(matcher.matches()); // true
// Reset the internal state and point the same Matcher at a new input
matcher.reset("115-1240-33");
System.out.println(matcher.matches()); // false
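Reusing a Matcher pays off mainly in a loop. Here is a minimal sketch that checks many inputs with a single Pattern and a single Matcher through reset(CharSequence); the names MatcherReuse and countValid are assumptions for this example. Keep in mind that a Matcher is not safe to share between threads.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherReuse {
    private static final Pattern CODE = Pattern.compile("\\d{3}-\\d{4}-\\d{3}");

    static int countValid(String... inputs) {
        // One Matcher, created once and re-pointed at each input
        Matcher m = CODE.matcher("");
        int valid = 0;
        for (String s : inputs) {
            // reset(CharSequence) clears the state and returns the same Matcher
            if (m.reset(s).matches()) {
                valid++;
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        System.out.println(countValid("110-1340-220", "110-150-220", "1120-1340-220")); // 1
    }
}
```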

The above is for background only.

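Non-greedy matching

Quantifiers are greedy by default: \d+ matches as many digits as it can. Take the example from Liao Xuefeng's blog, which tries to capture the trailing zeros of a number: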

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+)(0*)");
        Matcher matcher = pattern.matcher("1230000");
        if (matcher.matches()) {
            System.out.println("group1=" + matcher.group(1)); // "1230000"
            System.out.println("group2=" + matcher.group(2)); // ""
        }
    }
}

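Following these steps, group1 captures every digit, because the greedy \d+ consumes everything from the 1 onward. That is clearly not what we want, so we add ? after the quantifier to make it non-greedy.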

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+?)(0*)");
        Matcher matcher = pattern.matcher("1230000");
        if (matcher.matches()) {
            System.out.println("group1=" + matcher.group(1)); // "123"
            System.out.println("group2=" + matcher.group(2)); // "0000"
        }
    }
}

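With the non-greedy marker, \d+? no longer swallows the trailing zeros. It leaves them to the following 0* and matches only as far as is needed for the rest of the expression to match.

Another example: \d{2,8} matches as many digits as possible, and adding ? makes it match as few as possible.

Without ?:

Pattern p = Pattern.compile("(\\d{2,8})(0*)");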
Matcher matcher = p.matcher("2220000000");
boolean matches = matcher.matches();
if (matches){
    System.out.println(matcher.group(1)); // 22200000
    System.out.println(matcher.group(2));// 00
}

With ?:

Pattern p = Pattern.compile("(\\d{2,8}?)(0*)");
Matcher matcher = p.matcher("2220000000");
boolean matches = matcher.matches();
if (matches){
    System.out.println(matcher.group(1)); // 222
    System.out.println(matcher.group(2)); // 0000000
}

The difference is plain to see!
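Greedy versus non-greedy matters even more when searching inside a longer text. The sketch below is an illustration that is not from the article: it collects every match of a pattern with find(), and the tag-like input and the names LazyDemo and findAll are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyDemo {
    // Returns every substring of input that the regex matches, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<b>bold</b> and <i>italic</i>";
        // Greedy: .+ runs to the last '>' it can reach, so there is one big match
        System.out.println(findAll("<.+>", text));  // [<b>bold</b> and <i>italic</i>]
        // Non-greedy: .+? stops at the first '>' each time
        System.out.println(findAll("<.+?>", text)); // [<b>, </b>, <i>, </i>]
    }
}
```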

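Search and replace

Splitting strings

String.split() takes a regular expression as its argument, so a regex can cut a messy, irregularly separated string into clean pieces.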

"a b c".split("\\s"); // { "a", "b", "c" }
"a b  c".split("[\\s]+"); // { "a", "b", "c" }
"a, b ;; c".split("[\\,\\;\\s]+"); // { "a", "b", "c" }

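Searching strings

String.indexOf() may come to mind. It is very commonly used, but it is not flexible enough: what if we want strings shaped like xox, doc, wop (an o in the middle with a letter on each side)? We can use:

Pattern p = Pattern.compile("\\wo\\w");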
Matcher matcher = p.matcher("i dog fox wo od hhh opp ppo and");
while (matcher.find()){
    System.out.println(matcher.group()); // dog fox
}
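Like indexOf, a Matcher can also report where each match was found, through start() and end(). A minimal sketch follows; FindPositions and locate are names invented here, and [a-z] is used instead of \w so that only lowercase letters count as the neighbours of o.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindPositions {
    // Returns each match together with its start index, e.g. "dog@2".
    static List<String> locate(String input) {
        Matcher m = Pattern.compile("[a-z]o[a-z]").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group() + "@" + m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(locate("i dog fox wo od hhh opp ppo and")); // [dog@2, fox@6]
    }
}
```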

Back-references

When using String.replaceAll, the replacement string can refer back to the text captured by a group:

String s = "the quick brown fox jumps over the lazy dog.";
String r = s.replaceAll("\\s([a-z]{4})\\s", " <b>$1</b> ");
System.out.println(r); // the quick brown fox jumps <b>over</b> the <b>lazy</b> dog.

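The expression above wraps each matched word in <b></b>.

So what is $1, and why is it written that way?

Looking only at the surface doesn't deepen my understanding, so let's read the source. String.replaceAll simply calls replaceAll on a Matcher, and every call creates a new Pattern and a new Matcher.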

public String replaceAll(String regex, String replacement) {
    return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}

Step into Matcher.replaceAll:

public String replaceAll(String replacement) {
    // Clear the current state of this Matcher
    reset();
    // Call find() to look for the first match
    boolean result = find();
    if (result) {
        // Found one, so start building the result
        StringBuilder sb = new StringBuilder();
        do {
            appendReplacement(sb, replacement);
            result = find();
            // Keep going until every match has been replaced
        } while (result);
        appendTail(sb);
        return sb.toString();
    }
    return text.toString();
}

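Clicking down layer by layer finally leads to a method called appendExpandedReplacement, which spells out how the replacement string is processed. On reading \, it appends the character that follows as a literal. On reading $, it checks which kind of reference follows: { means a named capturing group (referenced by the name given in the pattern, described below), and a digit means a numbered capturing group (numbered by the position of its parentheses). The captured text is appended, and processing continues in the same way.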
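Because \ and $ are special in the replacement string, a replacement that should contain a literal $ has to be escaped. The sketch below shows both options: Matcher.quoteReplacement, and escaping by hand with \$. The {price} placeholder and the names ReplacementEscape and fillPrice are hypothetical, chosen only for this example.

```java
import java.util.regex.Matcher;

class ReplacementEscape {
    // Replaces the hypothetical {price} placeholder with the amount, taken literally.
    static String fillPrice(String template, String amount) {
        // quoteReplacement escapes any \ and $ in amount
        return template.replaceAll("\\{price\\}", Matcher.quoteReplacement(amount));
    }

    public static void main(String[] args) {
        System.out.println(fillPrice("Total: {price}", "$5")); // Total: $5
        // Escaping by hand: \$ is a literal dollar sign, $1 and $2 are group references
        System.out.println("a-b".replaceAll("(\\w)-(\\w)", "$2\\$$1")); // b$a
    }
}
```

Without the escaping, "$5" would be read as a reference to group 5, and since the pattern has no such group the call should fail with an IndexOutOfBoundsException.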

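Named capturing groups:

These look like (?<year>\\d{4}): the group is wrapped in parentheses, with ?<name> at the start giving its name.

An example: I name the parenthesized expression four and refer to it in the replacement with ${name}.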

String s = "the quick brown fox jumps over the lazy dog.";
String r = s.replaceAll("\\s(?<four>[a-z]{4})\\s", " <b>${four}</b> ");
System.out.println(r); // the quick brown fox jumps <b>over</b> the <b>lazy</b> dog.
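A named group can also be read directly from a Matcher with group(name). The sketch below extends the (?<year>\\d{4}) form mentioned above to a full date; the yyyy-MM-dd input format and the names DateParts, yearOf, and toDayFirst are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateParts {
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Reads a group by name instead of by number.
    static int yearOf(String date) {
        Matcher m = DATE.matcher(date);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a yyyy-MM-dd date: " + date);
        }
        return Integer.parseInt(m.group("year"));
    }

    // Uses ${name} references in the replacement to reorder the parts.
    static String toDayFirst(String date) {
        return DATE.matcher(date).replaceAll("${day}/${month}/${year}");
    }

    public static void main(String[] args) {
        System.out.println(yearOf("2023-07-26"));     // 2023
        System.out.println(toDayFirst("2023-07-26")); // 26/07/2023
    }
}
```

Names keep working when more groups are added to the pattern, whereas numeric references shift.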

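Numbered capturing groups:

These are referenced with $ followed by a number, counted by the position of the parentheses.

Example: capture groups one and two with parentheses and swap them (one pair matched):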

String input = "Hello, world! How are you?";
Pattern pattern = Pattern.compile("(\\w+),\\s+(\\w+)!");
Matcher matcher = pattern.matcher(input);
String output = matcher.replaceAll("$2, $1!");
System.out.println(output); // "world, Hello! How are you?"

Capture groups one and two with parentheses and swap them (two pairs matched):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String input = "Hello, world! How, are you?";
        Pattern pattern = Pattern.compile("(\\w+),\\s+(\\w+)");
        Matcher matcher = pattern.matcher(input);
        String output = matcher.replaceAll("$2, $1");
        System.out.println(output); // world, Hello! are, How you?
    }
}

A parenthesized capturing group, wrapping each matched word in <b></b>:

String s = "the quick brown fox jumps over the lazy dog.";
String r = s.replaceAll("\\s([a-z]{4})\\s", " <b>$1</b> ");
System.out.println(r); // the quick brown fox jumps <b>over</b> the <b>lazy</b> dog.

That's about all. See you next time.

Reposted from: https://juejin.cn/post/7219925045228519482