Skip to the content.

[TOC]

Pattern

贪婪与非贪婪匹配

定义

贪婪匹配的写法是.*,非贪婪匹配的写法是.*?,多了一个?

举例

例如:

贪婪匹配

/**
 * 贪婪匹配
 * 
 * 运行结果:
 * group:A12
 * group:3
 * replaceAll:
 */
@Test
public void test2() {
    Pattern pattern = Pattern.compile("WORD_(.*)(\\d+)_321");
    Matcher matcher = pattern.matcher("WORD_A123_321");
    while (matcher.find()) {
        System.out.println("group:" + matcher.group(1));
        System.out.println("group:" + matcher.group(2));
    }
    System.out.println("replaceAll:" + matcher.replaceAll(""));
}

非贪婪匹配

/**
 * 非贪婪匹配
 * 
 * 运行结果:
 * group:A
 * group:123
 * replaceAll:
 */
@Test
public void test3() {
    Pattern pattern = Pattern.compile("WORD_(.*?)(\\d+)_321");
    Matcher matcher = pattern.matcher("WORD_A123_321");
    while (matcher.find()) {
        System.out.println("group:" + matcher.group(1));
        System.out.println("group:" + matcher.group(2));
    }
    System.out.println("replaceAll:" + matcher.replaceAll(""));
}

但是这里需要注意,如果匹配的结果在字符结尾,.*?就有可能匹配不到任何结果了,因为他会尽可能匹配少的字符(换句话说,都到末尾了,我尽可能少,那就是不匹配),例:

/**
 * 贪婪匹配
 * 
 * 运行结果:
 * group:A12
 * group:3
 * group:1
 * replaceAll:
 */
@Test
public void test4() {
    Pattern pattern = Pattern.compile("WORD_(.*)(\\d+)_32(.*)");
    Matcher matcher = pattern.matcher("WORD_A123_321");
    while (matcher.find()) {
        System.out.println("group:" + matcher.group(1));
        System.out.println("group:" + matcher.group(2));
        System.out.println("group:" + matcher.group(3));
    }
    System.out.println("replaceAll:" + matcher.replaceAll(""));
}

/**
 * 非贪婪匹配
 * 
 * 运行结果:
 * group:A
 * group:123
 * group:
 * replaceAll:1
 */
@Test
public void test5() {
    Pattern pattern = Pattern.compile("WORD_(.*?)(\\d+)_32(.*?)");
    Matcher matcher = pattern.matcher("WORD_A123_321");
    while (matcher.find()) {
        System.out.println("group:" + matcher.group(1));
        System.out.println("group:" + matcher.group(2));
        System.out.println("group:" + matcher.group(3));
    }
    System.out.println("replaceAll:" + matcher.replaceAll(""));
}

从Compile到Matcher

下面是使用正则的一种经典写法:

    @Test
    public void test1() {
        Pattern pattern = Pattern.compile("\\d+");
        Matcher matcher = pattern.matcher("123A4234A234");
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }

从这个写法中,我们不难看出其中两大重要方法:compile()、matcher()。

接下来我们将从这两个方法入手,深入看下,源码中如何玩转的。

Compile

阿里巴巴开发规范中,有这么一条:

【强制】在使用正则表达式时,利用好其预编译功能,可以有效加快正则匹配速度。

说明:不要在方法体内定义: Pattern pattern = Pattern.compile(“ 规则 ”);

大概我们能猜到compile()会做一些耗时的事情,那么,带着我们的猜想去看看compile()干了啥。

public static Pattern compile(String regex) {
    return new Pattern(regex, 0);
}

这边是主动调用了私有的一个构造器,我们接着看:

    private Pattern(String p, int f) {
        pattern = p;
        flags = f;

        // to use UNICODE_CASE if UNICODE_CHARACTER_CLASS present
        if ((flags & UNICODE_CHARACTER_CLASS) != 0)
            flags |= UNICODE_CASE;

        // Reset group index count
        capturingGroupCount = 1;
        localCount = 0;

        if (pattern.length() > 0) {
            // 长度大于0调用compile()
            compile();
        } else {
            root = new Start(lastAccept);
            matchRoot = lastAccept;
        }
    }

仔细看下构造器,其他的东西都不怎么引人注目,也不会很耗时,只有compile()这个方法(切记此compile(),不要和Pattern.compile()混淆哈),可能有点神秘,我们继续往下看:

/**
 * Copies regular expression to an int array and invokes the parsing
 * of the expression which will create the object tree.
 */
private void compile() {
    // Handle canonical equivalences
    if (has(CANON_EQ) && !has(LITERAL)) {
        normalize();
    } else {
        normalizedPattern = pattern;
    }
    patternLength = normalizedPattern.length();

    // Copy pattern to int array for convenience
    // Use double zero to terminate pattern
    temp = new int[patternLength + 2];

    hasSupplementary = false;
    int c, count = 0;
    // Convert all chars into code points
    for (int x = 0; x < patternLength; x += Character.charCount(c)) {
        c = normalizedPattern.codePointAt(x);
        if (isSupplementary(c)) {
            hasSupplementary = true;
        }
        temp[count++] = c;
    }

    patternLength = count;   // patternLength now in code points

    if (! has(LITERAL))
        RemoveQEQuoting();

    // Allocate all temporary objects here.
    buffer = new int[32];
    groupNodes = new GroupHead[10];
    namedGroups = null;

    if (has(LITERAL)) {
        // Literal pattern handling
        matchRoot = newSlice(temp, patternLength, hasSupplementary);
        matchRoot.next = lastAccept;
    } else {
        // Start recursive descent parsing
        // 开始递归下降解析,
        matchRoot = expr(lastAccept);
        // Check extra pattern characters
        if (patternLength != cursor) {
            if (peek() == ')') {
                throw error("Unmatched closing ')'");
            } else {
                throw error("Unexpected internal error");
            }
        }
    }

    // Peephole optimization
    if (matchRoot instanceof Slice) {
        root = BnM.optimize(matchRoot);
        if (root == matchRoot) {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }
    } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
        root = matchRoot;
    } else {
        root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
    }

    // Release temporary storage
    temp = null;
    buffer = null;
    groupNodes = null;
    patternLength = 0;
    compiled = true;
}

matchRoot = expr(lastAccept);,这一行,看起来有一点说法:

/**
 * The expression is parsed with branch nodes added for alternations.
 * This may be called recursively to parse sub expressions that may
 * contain alternations.
 */
private Node expr(Node end) {
    Node prev = null;
    Node firstTail = null;
    Branch branch = null;
    Node branchConn = null;

    for (;;) {
        Node node = sequence(end);
        Node nodeTail = root;      //double return
        if (prev == null) {
            prev = node;
            firstTail = nodeTail;
        } else {
            // Branch
            if (branchConn == null) {
                branchConn = new BranchConn();
                branchConn.next = end;
            }
            if (node == end) {
                // if the node returned from sequence() is "end"
                // we have an empty expr, set a null atom into
                // the branch to indicate to go "next" directly.
                node = null;
            } else {
                // the "tail.next" of each atom goes to branchConn
                nodeTail.next = branchConn;
            }
            if (prev == branch) {
                branch.add(node);
            } else {
                if (prev == end) {
                    prev = null;
                } else {
                    // replace the "end" with "branchConn" at its tail.next
                    // when put the "prev" into the branch as the first atom.
                    firstTail.next = branchConn;
                }
                prev = branch = new Branch(prev, node, branchConn);
            }
        }
        if (peek() != '|') {
            return prev;
        }
        next();
    }
}

Node node = sequence(end);,其他的,看起来都是在组装Node,只有这个是在获取Node,我们接着看:

/**
 * Parsing of sequences between alternations.
 */
private Node sequence(Node end) {
    Node head = null;
    Node tail = null;
    Node node = null;
LOOP:
    for (;;) {
        int ch = peek();
        switch (ch) {
        case '(':
            // Because group handles its own closure,
            // we need to treat it differently
            node = group0();
            // Check for comment or flag group
            if (node == null)
                continue;
            if (head == null)
                head = node;
            else
                tail.next = node;
            // Double return: Tail was returned in root
            tail = root;
            continue;
        case '[':
            node = clazz(true);
            break;
        case '\\':
            ch = nextEscaped();
            if (ch == 'p' || ch == 'P') {
                boolean oneLetter = true;
                boolean comp = (ch == 'P');
                ch = next(); // Consume { if present
                if (ch != '{') {
                    unread();
                } else {
                    oneLetter = false;
                }
                node = family(oneLetter, comp);
            } else {
                unread();
                node = atom();
            }
            break;
        case '^':
            next();
            if (has(MULTILINE)) {
                if (has(UNIX_LINES))
                    node = new UnixCaret();
                else
                    node = new Caret();
            } else {
                node = new Begin();
            }
            break;
        case '$':
            next();
            if (has(UNIX_LINES))
                node = new UnixDollar(has(MULTILINE));
            else
                node = new Dollar(has(MULTILINE));
            break;
        case '.':
            next();
            if (has(DOTALL)) {
                node = new All();
            } else {
                if (has(UNIX_LINES))
                    node = new UnixDot();
                else {
                    node = new Dot();
                }
            }
            break;
        case '|':
        case ')':
            break LOOP;
        case ']': // Now interpreting dangling ] and } as literals
        case '}':
            node = atom();
            break;
        case '?':
        case '*':
        case '+':
            next();
            throw error("Dangling meta character '" + ((char)ch) + "'");
        case 0:
            if (cursor >= patternLength) {
                break LOOP;
            }
            // Fall through
        default:
            node = atom();
            break;
        }

        node = closure(node);

        if (head == null) {
            head = tail = node;
        } else {
            tail.next = node;
            tail = node;
        }
    }
    if (head == null) {
        return end;
    }
    tail.next = end;
    root = tail;      //double return
    return head;
}

看到这里,好像有点清晰了,我们在来梳理下,Node node = sequence(end);负责根据不同的字符,产生不同的Node,然后拼接到Node中去,而matchRoot = expr(lastAccept);看起来就更清晰了,根据输入的字符串通过Node node = sequence(end);生成不同的Node,然后在连接在一起,所以我们可以很自信的讲出,原来我们写的正则表达式,最后会在compile()方法中,被matchRoot = expr(lastAccept);解析成一种单链表结构。当然compile()方法还有一些其他判断处理,这些对于常规的表达式来说不是重点,就现在而言,你可以认为她是在处理一些特殊情况,后续我们在详细过一遍源码的细节。

防止大家还有点模糊,说这一大堆文字,谁看的懂呀,不画个图意思意思?OK,图来:

st=>start: Pattern.compile()
o1=>operation: new Pattern(regex, 0)
o2=>operation: compile()
o3=>operation: matchRoot = expr(lastAccept)
o4=>operation: Node node = sequence(end);

c1=>condition: pattern.length()>0
c2=>condition: peek() != 竖线
end=>end: 放行

st->o1->c1
c1(yes)>o2->o3->o4->c2
c1(no)->end

c2(yes)->end
c2(no)->o4

这样够清晰了吗,嘻嘻,好叻这样的一个流程,我们应该十分清晰了,下面将从其他细节部分再次深入了解。

Matcher

好了,看完Compile,我们再来看看经典写法:

    @Test
    public void test1() {
        Pattern pattern = Pattern.compile("\\d+");
        Matcher matcher = pattern.matcher("123A4234A234");
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }

很明显,我们每次判断是否匹配上时,都通过matcher.find()进行查找。于是,我们大胆猜测,匹配的算法就在find方法中,让我们一起打开她的神秘面纱:

public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
        nextSearchIndex++;

    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;

    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        return false;
    }
    return search(nextSearchIndex);
}

同样的,我们抛去不太重要的代码行,直接来看核心search(nextSearchIndex)

boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from        = from < 0 ? 0 : from;
    this.first  = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}

可以非常清晰的看到 boolean result = parentPattern.root.match(this, from, text);这一行,在结合我们上面Compile的探索,发现此行的目的为:从根节点Root开始match。

有心的小伙伴,肯定有些纳闷,单链表的访问至少也有个指针负责遍历吧,这里咋就看起来遍历一次,就没了呢,别急,继续往下看。

之前我们看Compile也讲过,Pattern类中有着不同的Node实现,她们之前有一个叫做Start的类,上面的parentPattern.root便是这个类的实例,下面我们来看看:

    static class Node extends Object {
        Node next;
        Node() {
            next = Pattern.accept;
        }
        /**
         * This method implements the classic accept node.
         */
        boolean match(Matcher matcher, int i, CharSequence seq) {
            matcher.last = i;
            matcher.groups[0] = matcher.first;
            matcher.groups[1] = matcher.last;
            return true;
        }
        /**
         * This method is good for all zero length assertions.
         */
        boolean study(TreeInfo info) {
            if (next != null) {
                return next.study(info);
            } else {
                return info.deterministic;
            }
        }
    } 

    static class Start extends Node {
        int minLength;
        Start(Node node) {
            this.next = node;
            TreeInfo info = new TreeInfo();
            next.study(info);
            minLength = info.minLength;
        }
        boolean match(Matcher matcher, int i, CharSequence seq) {
            if (i > matcher.to - minLength) {
                matcher.hitEnd = true;
                return false;
            }
            int guard = matcher.to - minLength;
            for (; i <= guard; i++) {
                if (next.match(matcher, i, seq)) {
                    matcher.first = i;
                    matcher.groups[0] = matcher.first;
                    matcher.groups[1] = matcher.last;
                    return true;
                }
            }
            matcher.hitEnd = true;
            return false;
        }
        boolean study(TreeInfo info) {
            next.study(info);
            info.maxValid = false;
            info.deterministic = false;
            return false;
        }
    }

看到这边,是不是有点熟悉的味道了,有循环,有下一个节点,但是又有点不同,还是没有找到我们最熟悉的指针,仔细观察方法的返回值,是个boolean类型,那看起来是每个实现类都完成自己的匹配,并返回匹配结果,然后不属于自己匹配的部分,交由Next去匹配,我们暂且称作链表匹配,带着我们猜测,再去看看其他Node的实现:

static final class UnixCaret extends Node {
    boolean match(Matcher matcher, int i, CharSequence seq) {
        int startIndex = matcher.from;
        int endIndex = matcher.to;
        if (!matcher.anchoringBounds) {
            startIndex = 0;
            endIndex = matcher.getTextLength();
        }
        // Perl does not match ^ at end of input even after newline
        if (i == endIndex) {
            matcher.hitEnd = true;
            return false;
        }
        if (i > startIndex) {
            char ch = seq.charAt(i-1);
            if (ch != '\n') {
                return false;
            }
        }
        return next.match(matcher, i, seq);
    }
}
static final class UnixDollar extends Node {
    boolean multiline;
    UnixDollar(boolean mul) {
        multiline = mul;
    }
    boolean match(Matcher matcher, int i, CharSequence seq) {
        int endIndex = (matcher.anchoringBounds) ?
            matcher.to : matcher.getTextLength();
        if (i < endIndex) {
            char ch = seq.charAt(i);
            if (ch == '\n') {
                // If not multiline, then only possible to
                // match at very end or one before end
                if (multiline == false && i != endIndex - 1)
                    return false;
                // If multiline return next.match without setting
                // matcher.hitEnd
                if (multiline)
                    return next.match(matcher, i, seq);
            } else {
                return false;
            }
        }
        // Matching because at the end or 1 before the end;
        // more input could change this so set hitEnd
        matcher.hitEnd = true;
        // If a $ matches because of end of input, then more input
        // could cause it to fail!
        matcher.requireEnd = true;
        return next.match(matcher, i, seq);
    }
    boolean study(TreeInfo info) {
        next.study(info);
        return info.deterministic;
    }
}

等等。如果你有耐心,大可以看完所有的Node实现,最后一定会发现,每个Node完成自己的任务后,调用Next的Match方法。

最后,我们还是画上一张图,来说明下核心调用过程(涉及到的Node,仅为了演示,不分先后顺序):

st=>start: Start
o1=>operation: UnixCaret
o2=>operation: Dollar
o3=>operation: ...
end=>end: End

st->o1->o2->o3->end