yaobin.wen

Yaobin's Blog

21 February 2017

Understanding Python BNF Notation

by yaobin.wen

The Python 2 documentation describes its grammar in a modified BNF notation, which is detailed in section 1.2, “Notation”, of the language reference.

Although it is described as a modified version of BNF, I personally think it is more like a mixture of BNF and regular expressions. In the rest of this article, I will refer to this modified BNF as Python BNF.

In general, Python BNF follows the form of standard BNF, in which each rule is written as:

symbol ::= expression

If the expression has alternative forms, they are separated by the vertical bar “|”.

For example, the grammar token for a letter in Python is defined here:

letter     ::=  lowercase | uppercase
lowercase  ::=  "a"..."z"
uppercase  ::=  "A"..."Z"

The definition shows that a letter is either a lowercase or an uppercase character, and that lowercase and uppercase are in turn defined on the next two lines.
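To make the reading concrete, here is a minimal sketch of the three rules as Python code. The names `LOWERCASE`, `UPPERCASE`, and `is_letter` are hypothetical and chosen only for illustration; this is not how the Python tokenizer itself is implemented.

```python
import string

# "a"..."z" and "A"..."Z" are inclusive ASCII ranges, so each rule is
# modelled here as a set of 26 characters (illustrative names only).
LOWERCASE = set(string.ascii_lowercase)
UPPERCASE = set(string.ascii_uppercase)


def is_letter(ch):
    """Mirror the rule: letter ::= lowercase | uppercase."""
    # The vertical bar in the rule becomes a plain "or" between alternatives.
    return ch in LOWERCASE or ch in UPPERCASE


if __name__ == "__main__":
    for sample in ["a", "Z", "1", "_"]:
        print(repr(sample), is_letter(sample))
```

Each alternative separated by `|` maps onto one branch of the `or`, which is the whole point of the vertical bar.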

These two symbols, the ::= and |, are probably the only elements Python BNF inherits from the standard BNF.

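The expression itself is written more like a regular expression. The documentation lists the other supported constructs: a star (*) means zero or more repetitions of the preceding item; a plus (+) means one or more repetitions; a phrase enclosed in square brackets ([ ]) is optional; parentheses are used for grouping; literal strings are enclosed in quotes; in lexical definitions, two literal characters separated by three dots (such as "a"..."z") mean any single character in that inclusive ASCII range; and a phrase between angle brackets (<...>) gives an informal description of the symbol instead of a formal rule.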

We can look at some examples.

The identifier is defined as follows:

identifier ::= (letter|"_") (letter | digit | "_")*

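This expression can be interpreted as follows: an identifier starts with one character that is either a letter or an underscore, followed by zero or more characters, each of which is a letter, a digit, or an underscore.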
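Because the rule reads so much like a regular expression, it can be translated into one almost symbol by symbol. The sketch below assumes that digit means "0"..."9" (that rule is not quoted in this article) and uses Python 3's `re.fullmatch`; the names `IDENTIFIER` and `is_identifier` are hypothetical. It checks only the shape described by the rule and does not exclude reserved keywords.

```python
import re

# (letter|"_")           -> [A-Za-z_]
# (letter|digit|"_")*    -> [A-Za-z0-9_]*
# Assumption: digit ::= "0"..."9", which is not shown in the article.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text):
    """Return True if the whole string has the shape of the identifier rule."""
    return IDENTIFIER.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["_foo1", "spam", "1abc", "a-b", ""]:
        print(repr(sample), is_identifier(sample))
```

The parenthesised groups become character classes, and the trailing `*` keeps its meaning of zero or more repetitions.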

We can look at a more complex example, the string literals:

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "ur" | "R" | "U" | "UR" | "Ur" | "uR"
                     | "b" | "B" | "br" | "Br" | "bR" | "BR"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''"
                     | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | escapeseq
longstringitem  ::=  longstringchar | escapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
escapeseq       ::=  "\" <any ASCII character>

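The interpretation goes as follows: a string literal is an optional prefix followed by either a short string or a long string. The prefix is one of the listed combinations of r, u, and b, in either case. A short string is zero or more short string items enclosed in a matching pair of single or double quotes; a long string is zero or more long string items enclosed in a matching pair of triple quotes. Each item is either an ordinary character or an escape sequence, which is a backslash followed by any ASCII character. An ordinary character in a short string is anything except a backslash, a newline, or the enclosing quote, while in a long string it is anything except a backslash.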
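The same symbol-by-symbol translation works for the string literal rules. The following is a rough sketch in Python 3, with hypothetical names throughout. It follows the grammar exactly as written, so it accepts some inputs that the real tokenizer would split differently (for example, a long string containing an unescaped triple quote in the middle); it is meant as a reading aid, not as a replacement for the tokenizer.

```python
import re

# stringprefix: "ur"/"br" in any case mix, or a single r, u, or b.
PREFIX = r"(?:[uU][rR]|[bB][rR]|[rRuUbB])?"
# escapeseq: a backslash followed by any ASCII character.
ESCAPE = r"\\[\x00-\x7f]"
# shortstringitem*: no backslash, newline, or enclosing quote unless escaped.
SHORT_SINGLE = r"'(?:[^\\\n']|" + ESCAPE + r")*'"
SHORT_DOUBLE = r'"(?:[^\\\n"]|' + ESCAPE + r')*"'
# longstringitem*: anything except a bare backslash, newlines included.
LONG_SINGLE = r"'''(?:[^\\]|" + ESCAPE + r")*'''"
LONG_DOUBLE = r'"""(?:[^\\]|' + ESCAPE + r')*"""'

# stringliteral ::= [stringprefix](shortstring | longstring)
STRING_LITERAL = re.compile(
    PREFIX
    + "(?:"
    + "|".join([LONG_SINGLE, LONG_DOUBLE, SHORT_SINGLE, SHORT_DOUBLE])
    + ")"
)


def is_string_literal(text):
    """Return True if the whole string has the shape of stringliteral."""
    return STRING_LITERAL.fullmatch(text) is not None


if __name__ == "__main__":
    samples = ["'abc'", 'ur"a\\nb"', "'''two\nlines'''", "'abc", "x'abc'"]
    for sample in samples:
        print(repr(sample), is_string_literal(sample))
```

Each named pattern corresponds to one rule in the grammar, so the regular expression can be checked against the BNF line by line.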

Tags: Tech