21 February 2017

Understanding Python BNF Notation

by yaobin.wen

The Python 2 documentation uses a modified BNF grammar notation, which is described in its section "1.2. Notation".

Although it is described as a modified version of BNF, I personally think it is more like a mixture of BNF and regular expressions. In the rest of this article, I will refer to this modified BNF as Python BNF.

The Python BNF, in general, follows the form of standard BNF, in which each rule is written as follows:

symbol ::= expression

If the expression may have alternative forms, they are divided by the vertical bar “|”.

For example, the grammar token for a letter in Python is defined here:

letter     ::=  lowercase | uppercase
lowercase  ::=  "a"..."z"
uppercase  ::=  "A"..."Z"

The definition shows that a letter is either a lowercase or an uppercase character, and the lowercase and uppercase are subsequently defined in the next two lines.

These two symbols, the ::= and |, are probably the only elements Python BNF inherits from the standard BNF.

The expression itself reads more like a regular expression. The documentation lists the other supported notations:

A star (*) means zero or more repetitions of the preceding item.
A plus (+) means one or more repetitions.
A phrase enclosed in square brackets ([ ]) means zero or one occurrences (in other words, the enclosed phrase is optional).
Parentheses are used for grouping.
Literal strings are enclosed in quotes.
Two literal characters separated by three dots mean a choice of any single character in the given (inclusive) range of ASCII characters.
A phrase between angular brackets (<…>) gives an informal description of the symbol defined.
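To make the correspondence with regular expressions concrete, here is a minimal sketch that translates a small rule into Python's re syntax. The version rule and the names DIGIT, NUMBER, VERSION and is_version are made up for illustration; they are not part of the Python grammar.

```python
import re

# Hypothetical grammar, written in Python BNF (not from the Python docs):
#   version ::= ["v"] number ("." number)*
#   number  ::= digit+
#   digit   ::= "0"..."9"

DIGIT = "[0-9]"                  # "0"..."9": any single character in the range
NUMBER = DIGIT + "+"             # digit+: one or more repetitions
VERSION = re.compile(
    "(?:v)?"                     # ["v"]: the bracketed phrase is optional
    + NUMBER
    + r"(?:\." + NUMBER + ")*"   # ("." number)*: a group repeated zero or more times
    + r"\Z"                      # require the whole text to match
)

def is_version(text):
    """Return True if text matches the hypothetical version rule."""
    return VERSION.match(text) is not None

print(is_version("v1.2.3"))  # True
print(is_version("1."))      # False: a "." must be followed by a number
```

Each piece of the notation maps onto one regular expression construct: square brackets become ?, the star and plus keep their meaning, parentheses become a group, and a character range becomes a character class.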

We can look at some examples.

The identifier is defined as follows:

identifier ::= (letter|"_") (letter | digit | "_")*

This expression can be interpreted as below:

An identifier consists of two parts: the first part is a letter or an underscore, as specified in (letter|"_"); the second part is a sequence of zero or more letters, digits, or underscores, and it may be empty because the group is followed by a star (*).
Because the first part does not allow a digit, a valid identifier cannot start with a digit.
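The identifier rule can be sketched as a regular expression. This is only a rough translation: it assumes digit means "0"..."9" (the rule for digit is not shown in this article), and it checks the grammar rule alone, so reserved words such as "if" also pass. The names IDENTIFIER and is_identifier are my own.

```python
import re

# identifier ::= (letter|"_") (letter | digit | "_")*
# letter is "a"..."z" or "A"..."Z"; digit is assumed to be "0"..."9".
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

def is_identifier(text):
    """Return True if text matches the identifier rule above."""
    return IDENTIFIER.match(text) is not None

print(is_identifier("_x1"))  # True
print(is_identifier("1x"))   # False: the first part cannot be a digit
```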

We can look at a more complex example, the string literals:

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "ur" | "R" | "U" | "UR" | "Ur" | "uR"
                     | "b" | "B" | "br" | "Br" | "bR" | "BR"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''"
                     | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | escapeseq
longstringitem  ::=  longstringchar | escapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
escapeseq       ::=  "\" <any ASCII character>

The interpretation goes as follows:

A string literal consists of two parts: The first part is a stringprefix which is optional because it is enclosed in square brackets; the second part is either a shortstring or a longstring.
A stringprefix is any one of the listed prefixes.
A shortstring has two forms. In the first form, the shortstringitems are enclosed in a pair of single quotation marks, while in the second form, they are enclosed in a pair of double quotation marks.
A longstring differs from a shortstring in two aspects: the first is that a longstring is enclosed in a pair of triple quotation marks; the second is that a longstring can contain unescaped newlines and quote characters, because longstringchar excludes only the backslash.
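The same translation can be tried on the string literal rules. The sketch below follows the grammar rule by rule; the helper names (shortstring, longstring, is_string_literal) are my own, and it only tests whether a piece of text has the shape of a string literal. It is not how the real tokenizer works: in particular, the grammar alone does not say where a long string ends, while the real tokenizer ends it at the first run of three unescaped quotes.

```python
import re

# stringprefix: exactly the prefixes listed in the grammar.
PREFIXES = ["ur", "UR", "Ur", "uR", "br", "Br", "bR", "BR",
            "r", "u", "R", "U", "b", "B"]
STRINGPREFIX = "(?:" + "|".join(PREFIXES) + ")"

# escapeseq ::= "\" <any ASCII character>
ESCAPESEQ = r"\\[\x00-\x7f]"

def shortstring(quote):
    # shortstringchar: any character except "\", newline, or the quote
    return quote + r"(?:[^\\\n" + quote + "]|" + ESCAPESEQ + ")*" + quote

def longstring(quote):
    # longstringchar: any character except "\" (newlines and quotes allowed)
    triple = quote * 3
    return triple + r"(?:[^\\]|" + ESCAPESEQ + ")*" + triple

# stringliteral ::= [stringprefix](shortstring | longstring)
STRINGLITERAL = re.compile(
    STRINGPREFIX + "?"
    + "(?:" + "|".join([shortstring("'"), shortstring('"'),
                        longstring("'"), longstring('"')]) + ")"
    + r"\Z"  # require the whole text to match
)

def is_string_literal(text):
    """Return True if text has the shape described by stringliteral."""
    return STRINGLITERAL.match(text) is not None

print(is_string_literal("ur'abc'"))         # True
print(is_string_literal("'one\ntwo'"))      # False: newline in a shortstring
print(is_string_literal("'''one\ntwo'''"))  # True: newline in a longstring
```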

Copyright © 2016-2023 Yaobin Wen All rights reserved.

If present, the "License" field in the articles overrides the copyright disclaimer. See "LICENSE.md" in the GitHub repository for more details.