正規語言相關文章參考資料為陳穎平教授上課講義

L2 Context-Free Languages

Context-Free Grammars

CFGs: 上下文無關文法,指的是字元在進行替換時不需要考慮上下文,其擁有足夠的表現力來表示大多數程式設計語言的語法

Formal Definition of CFGs

Definition

A context-free grammar is a 4-tuple (V,Σ,R,S)(V, Σ, R, S), where

  1. VV is a finite set called the variables;
  2. ΣΣ is a finite set, disjoint from VV ,called the terminals;
  3. RR is a finite set of rules (or productions), with each rule being a variable and a string of variables and terminals;
  4. SVS ∈ V is the start variable.

Each rule consists of

  • head: variable in VV
  • symbol →
  • body: a string in (VΣ)(V ∪ Σ)^∗
  • If u,v,w(VΣ)u, v, w ∈ (V ∪ Σ)^∗, and AwA → w is a rule of the grammar, we say that uAvuAv yields uwvuwv, written uAvuwvuAv ⇒ uwv.
  • Say that uu derives vv, written uvu ⇒^* v, if u=vu = v or if a sequence u1,u2,,uku_1, u_2, ··· , u_k exists for k0k ≥ 0 and uu1u2vu ⇒ u_1 ⇒ u_2 ⇒ ··· ⇒ v .
  • The language of the grammar is {wΣ  Sw}\{w ∈ Σ^∗~|~S ⇒^* w\}.

Ambiguity

Leftmost derivation: A derivation of a string w in a grammar G is a leftmost derivation if at every step the leftmost remaining variable is the one replaced.

Definition

A string ww is derived ambiguously in context-free grammar GG if it has two or more different leftmost derivations. Grammar GG is ambiguous if it generates some string ambiguously.

Inherently ambiguous: The context-free languages that can be generated only by ambiguous grammars. For example, {aibjck  i=j or j=k}\{a^ib^jc^k~|~i = j~or~j = k\}.

Chomsky Normal Form

Definition

A context-free grammar is in Chomsky normal form if every rule is of the form ABC or AaA → BC~or~A → a, where aa is any terminal and AA, BB, and CC are any variables except that BB and CC may not be the start variable. In addition we permit the rule SεS → ε, where SS is the start variable.

如果原有的 grammar 包含 SεS → ε,則有兩種處理方式:

  1. L(G)=L(G){ε}L(G') = L(G) - \{ε\}
  2. 允許 SεS → ε rule
Theorem

Any context-free language is generated by a context-free grammar in Chomsky normal form.

  • 證明: Convert any context-free grammar into Chomsky normal form:
    1. Add a new start variable;
    2. Eliminate all εε rules (AεA → ε);
    3. Handle all unit rules (ABA → B);
    4. Convert all rules into the proper form.

Pushdown Automata

PDA: 下推自動機

Formal Definition of PDAs

Definition

A pushdown automaton is a 6-tuple (Q,Σ,Γ,δ,q0,F)(Q, Σ, Γ, δ, q_0, F ),where QQ, ΣΣ, ΓΓ, and FF are all finite sets, and

  1. QQ is the set of states;
  2. ΣΣ is the input alphabet;
  3. ΓΓ is the stack alphabet;
  4. δδ : Q×Σε×ΓεP(Q×Γε)Q × Σ_ε × Γ_ε → P(Q × Γ_ε) is the transition function;
  5. q0Qq0 ∈ Q is the start state;
  6. FQF ⊆ Q is the set of accept states.

A pushdown automaton M=(Q,Σ,Γ,δ,q0,F)M = (Q, Σ, Γ, δ, q_0, F ) accepts input ww if ww can be written as w=w1w2wmw = w_1w_2 ··· w_m, where wiΣεw_i ∈ Σ_ε and sequences of states r0,r1,,rmQr_0, r_1, ··· , r_m ∈ Q and strings s0,s1,,smΓs_0, s_1, ··· , s_m ∈ Γ^∗ exist that satisfy the following three conditions. The stings sis_i represent the sequences of stack contents that MM has on the accepting branch of the computation.

  1. r0=q0r_0 = q_0 and s0=εs_0 = ε;
  2. For i=0,,m1i = 0, ··· , m−1, we have (ri+1,b)δ(ri,wi+1,a)(r_{i+1}, b) ∈ δ(r_i, w_{i+1}, a), where si=ats_i = at and si+1=bts_{i+1} = bt for some a,bΓεa, b ∈ Γ_ε and tΓt ∈ Γ^∗;
  3. rmFr_m ∈ F .

Frequently used mechanisms:

  • Empty stack detection: $\$;
    • 使用一個不屬於 stack alphabet 的符號用來檢測 stack 是否為空,可以發現之前的範例,會在開始以及最後 push, pop 這個符號
  • Input end detection: Accept states.
    • 為了與其他自動機統一格式而加上 Accept states 來判斷,實際上也是在檢測 stack 是否為空來判斷是否接受這個 string

Equivalence with CFGs

Theorem

A language is context free if and only if some pushdown automaton recognizes it.

Lemma

If a language is context free, then some pushdown automaton recognizes it.

  • 證明: Convert the given CFG into a PDA.
    • 利用 PDA 的 nondeterminism 特性,用 leftmost derivation 產生所有可能的 substitution sequence, 將 intermediate strings 儲存在 stack 中並且與 input string 進行對應,如果能到達 Accept states 代表能夠接受
    • For the symbol on the stack top, two kinds of transitions:
      • For terminal aa, match the input: a,aεa, a → ε
      • For variable AA, apply the rule: ε,Awε, A → w
Lemma

If a pushdown automaton recognizes some language, then it is context free.

  • 證明: Convert the given PDA into an equivalent CFG.
    • PDAeach pair of states 產生一個 ApqA_{pq} 代表從 ppqq 能夠產生的所有 strings
    • Modify the PDA PP such that
      1. It has a single accept state, qacceptq_{accept}.
        • 將原有的 accept states 指向新的 qacceptq_{accept} (εε arrow)
      2. It empties its stack before accepting.
        • 使用 $\$ 符號
      3. Each transition either pushes a symbol onto the stack
        (a push move) or pops one off the stack (a pop move),
        but it does not do both at the same time.
        • aba → b 拆為 aε,εba → ε, ε → b
    • For a PDA P=(Q,Σ,Γ,δ,q0,{qaccept})P = (Q, Σ, Γ, δ, q_0, \{q_{accept}\}), we construct a CFG G=(V,Σ,R,S)G = (V, Σ, R, S) with the following steps:
      1. V={Apq  p,qQ}V = \{A_{pq}~|~p, q ∈ Q\}
      2. S=Aq0,qacceptS = A_{q_0,q_{accept}}
      3. RR contains the following three sets of rules:
        • For each p,q,r,sQp, q, r, s ∈ Q, tΓt ∈ Γ, and a,bΣεa, b ∈ Σ_ε, if (r,t)δ(p,a,ε)(r, t) ∈ δ(p, a, ε) and (q,ε)δ(s,b,t)(q, ε) ∈ δ(s, b, t), the rule ApqaArsbA_{pq} → aA_{rs}b
          • tt 進入到 stack 中,之後又被排出的過程
        • For each p,q,rQp, q, r ∈ Q, the rule ApqAprArqA_{pq} → A_{pr}A_{rq}
        • For each pQp ∈ Q, the rule AppεA_{pp} → ε
Claim

If ApqA_{pq} generates xx, then xx can bring PP from pp with empty stack to qq with empty stack.

  • 證明: Induction on the number of steps in the derivation. (數學歸納法,從 state 轉移的角度來看)
Claim

If xx can bring PP from pp with empty stack to qq with empty stack, ApqA_{pq} generates xx.

  • 證明: Induction on the number of steps in the computation. (數學歸納法,從 stack 的情況來看)
Corollary

Every regular language is context free.
* NFA 可以視為 PDA 的特例

Non-Context-Free Languages

我們要如何去證明一個 LanguagesNon-Context-Free Languages 呢?
ex: B={anbncn  n0}B = \{ a^nb^nc^n~|~n ≥ 0 \} .

Pumping Lemma for CFLs

Theorem

If AA is a context-free language, then there is a number p (the pumping length) where, if ss is any string in AA of length at least pp, then ss may be divided into five pieces, s=uvxyzs = uvxyz satisfying the conditions:

  1. for each i0i ≥ 0, uvixyizAuv^ixy^iz ∈ A;
  2. vy>0|vy| > 0; v, y 不同時為空字串
  3. vxyp|vxy| ≤ p.
  • 證明: 類似 Pumping Lemma for RL 的證明方式,透過鴿籠原理,當 parse tree 的樹高足夠時 (> variables 總數),必定存在 variable R 在一條路徑中經過兩次以上,因此可以對其產生出的 string 進行 Pumping 而仍然屬於 CFL

注意:

  • context-free language 一定滿足 Pumping Lemma for CFLs
  • Pumping Lemma for CFLscontext-free language 的必要條件,不是充分條件
    • 如果一個 language 滿足 Pumping Lemma for CFLs,其不一定為 context-free language

有了 Pumping Lemma for CFLs,我們可以利用反證法去證明一個 Languagesnon-context-free languages:

  1. Assume the language, say, BB, is context free in order to obtain a contradiction.
  2. By the pumping lemma for CFLs, a pumping length p exists, and any string wBw ∈ B can be pumped if wp|w| ≥ p.
  3. Find a string sBs ∈ B, sp|s| ≥ p, that ss cannot be pumped as described in the pumping lemma. 尋找反例
  4. The contradiction is obtained, and therefore, BB is proved to be non-context-free.

Ogden’s Lemma for CFLs

Pumping Lemma for CFLs 的進階版

Theorem

If AA is a context-free language, then there is a number p where, if ss is any string in AA of length at least pp, in which we select any pp or more positions to be distinguished, then ss may be divided into five pieces, s=uvxyzs = uvxyz satisfying the conditions:

  1. for each i0i ≥ 0, uvixyizAuv^ixy^iz ∈ A;
  2. vyvy has at least one distinguished positions;
  3. vxyvxy has at most pp distinguished positions.
  • 如果挑選的 positions 是整個字串,就等價於 Pumping Lemma for CFLs

Example: L={aibjckdl  i=0 or j=k=l}L = \{a^ib^jc^kd^l~|~i = 0~or~j = k = l \} .

  • 使用 Pumping Lemma for CFLs 無法證明其為 non-context-free

Ogden’s Lemma for CFGs

Lemma

Let GG be a CFG. There exists a natural number p (depending on G) such that for all sL(G)s ∈ L(G), if ss is marked so that it has pp or more distinguished positions, then there exists a variable AA and terminal strings u,v,x,y,zu, v, x, y, z satisfying the following conditions:

  1. s=uvxyzs = uvxyz;
  2. either vv or yy has distinguished positions;
  3. vxyvxy has no more than pp distinguished positions;
  4. SuAzS ⇒^* uAz;
  5. AvAyA ⇒^* vAy;
  6. AxA ⇒^* x;
  7. for each i0i ≥ 0, AvixyiA ⇒^* v^ixy^i. 5 + 6

Inherently Ambiguous CFLs

證明 L={aibjck  i=j or j=k}L = \{a^ib^jc^k~|~i = j~or~j = k\}Inherently Ambiguous,所有能夠產生它的 grammer 都是 Ambiguous

  • 透過 Ogden’s Lemma for CFGs,可以證明不論用什麼 CFG 產生 LL,其中的 ambmcma^mb^mc^m 必定存在兩種不同的 parse trees,代表該 CFGAmbiguous,故 LLInherently Ambiguous

Deterministic Context-Free Languages

[skip]

Properties of CFLs

Substitutions

is a generalization of the homomorphism we defined
in the study of regular languages

  • Homomorphism: h:ΣΠh : Σ → Π^∗
  • Substitutions: h:ΣP(Π)h : Σ → P(Π^∗)
    • 一個 symbol 可以對應到多個 string
  • can be extended in for strings and languages:
    • s:ΣP(Π)s: Σ^* → P(Π^∗)
      • s(ε)=εs(ε) = ε
      • s(xa)=s(x)s(a)s(xa) = s(x)s(a)
    • s:P(Σ)P(Π)s: P(Σ^*) → P(Π^∗)
      • s(L)={s(w)  wL}s(L) = \{s(w)~|~w ∈ L\}
Theorem

If LL is a CFL over ΣΣ, ss is a substitution on ΣΣ such that s(a)s(a) is a CFL for each aΣa ∈ Σ, then s(L)s(L) is also a CFL.

  • 證明:
    • Let G=(V,Σ,R,S)G = (V, Σ, R, S) be the CFG for LL and Ga=(Va,Σa,Ra,Sa)G_a = (V_a, Σ_a, R_a, S_a) be the CFG for s(a)s(a), for each symbol aΣa ∈ Σ. Construct a new grammar G=(V,Σ,R,S)G' = (V', Σ', R', S') for s(L)s(L):
      • VV' is the union of VV and all the VaV_a’s for aΣa ∈ Σ.
      • ΣΣ' is the union of all the ΣaΣ_a’s for aΣa ∈ Σ.
      • RR' consists of
        1. All rules in each RaR_a;
        2. The rule in RR but with each terminal a in their
          bodies replaced by SaS_a everywhere a occurs.
    • 將原有的 parse trees 後面接上 s(a)s(a)parse trees

Closure Properties

Let AA and BB be context-free languages. Let RR be a regular language. The results of the following operations are all context-free languages:

  • Substition: s(A)s(A);
  • Union: ABA ∪ B;
    • 證明: 加上 SSA  SBS → S_A~|~S_B
  • Concatenation: ABAB;
    • 證明: 加上 SSASBS → S_AS_B
  • Star: AA^∗;
    • 證明: 加上 SAε  SASAS_A → ε~|~S_AS_A
  • Reversal: ARA^R;
    • 證明: 將所有 rules 的 body 部分顛倒
  • Homomorphism: h(A)h(A);
    • 證明: Substition 的特例
  • Inverse homomorphism: h1(A)h^{−1}(A);
  • Intersection: ARA ∩ R;
    • ABA ∩ B 不一定是 context-free languages
    • 證明: 類似合併兩 DFA 的方式,將 AAPDABBDFA 合併
  • Difference: ARA − R (AR=ARA − R = A ∩ \overline R).

A\overline A 不一定是 context-free languages

  • 因為 ABA ∪ Bcontext-free languages,但 ABA ∩ B 不一定是 context-free languages,根據笛摩根定律,如果 A\overline Acontext-free languages 會產生矛盾

Membership Test: CYK Algorithm

檢測 wL?w ∈ L?

  1. Let G=(V,Σ,R,S)G = (V, Σ, R, S) is the context-free grammar for LL in Chomsky normal form.
  2. The input string w=a1a2anw = a_1a_2 ··· a_n is of length nn.
  3. Construct the table with horizontal axis (ii) being the positions of the symbols in ww, XijX_{ij} is the set of variables AA such that Aaiai+1ajA ⇒^* a_ia_{i+1} ··· a_j. We ask whether SX1nS ∈ X_{1n} (i.e., Sa1a2anS ⇒^* a_1a_2 ··· a_n.)
  • Basis: XiiX_{ii} is the set of variables AA such that AaiA → a_i can be found in the rule.
  • Induction step: XijX_{ij} contains AA if we can find variables BB and CC, and integer kk such that
    1. ik<ji ≤ k < j ;
    2. BB is in XikX_{ik} ;
    3. CC is in Xk+1,jX_{k+1,j} ;
    4. ABCA → BC is a rule.
  • 由下往上填滿,最後檢查 SS 有沒有在最上面的格子中
  • 也能夠檢查 ww 的子串列是否也屬於 LL (對應的格子中是否有 SS)