BNF

Covfefe allows context free grammars to be specified using a language that is a superset of BNF.

Contents

  1. Productions
  2. Terminals
  3. Alternations
  4. Sequence Grouping
  5. Optional Sequences
  6. Sequence Repetitions
  7. Character Ranges
  8. Full Grammar

Productions

A production in a context free grammar has the form X -> b, which indicates that X can be replaced by b. X is a non-terminal, b is a string of terminals and non-terminals.

In BNF, a non-terminal is written as <A>, a terminal is written as 'a' or "a". An assignment is written as lhs ::= rhs.

A grammar that produces Hello World can thereby be expressed as

<S> ::= 'hello' <whitespace> 'world'
<whitespace> ::= ' '
<whitespace> ::= '\t'

Terminals

Terminals are strings of characters that are delimited either by 's or "s.

Escaping

To express characters such as newlines, tabs and unicode symbols, backslashes can be used:

<S> ::= '\n' | '\t' | '\\' | '\u{1F602}'

This grammar produces a single newline, a single tab, a single \ or a single 😂.

Warning

When adding grammars as strings in Swift code, backslashes need to be escaped.

The above grammar then becomes:

let grammarString = "<S> ::= '\\n' | '\\t' | '\\\\' | '\\u{1F602}'"

Alternations

In the above example, whitespace can be replaced either by ' ' or by '\t'. This can be written as

<whitespace> ::= ' ' | '\t'

Concatenation has a higher precedence than alternations, so the following grammar produces ab and cd

<S> ::= 'a' 'b' | 'c' 'd'

Sequence Grouping

Parentheses (( and )) can be used to group symbols together.

The following grammar produces abd and acd:

<S> ::= 'a' ('b' | 'c') 'd'

Optional Sequences

To mark a sequence as optional, brackets ([ and ]) can be used.

The following grammar produces abc and ac:

<S> ::= 'a' ['b'] 'c'

Sequence Repetitions

To repeat a sequence, braces ({ and }) can be used.

The following grammar produces aba, abba, abbba, etc. but not aa:

<S> ::= 'a' {'b'} 'a'

To also generate aa, repetitions can be matched with optional sequences:

<S> ::= 'a' [{'b'}] 'a'

It is preferred to make the entire repetition optional instead of repeating an optional sequence.

The repetition is generated using a left-recursive auxiliary rule.

Character Ranges

To make it easier to specify grammars that recognize a large alphabet, character ranges can be used. The following grammar recognizes all upper and lower case roman letters:

<S> ::= 'A' ... 'Z' | 'a' ... 'z'

The lower bound of a character range must be less than or equal the upper bound (e.g. 'b' ... 'a' is an invalid range). The bounds of a range must be exactly one character. The characters can consist of multiple unicode scalars.

Full Grammar

The grammar that is recognized is the following:

(* Production rules are separated by newlines and optional whitespace *)
syntax = optional-whitespace | newlines | rule | rule, newlines | syntax, newlines, rule, newlines | syntax, newlines, rule;

(* A rule consists of a non-terminal pattern name, an assignment operator and a production expression *)
rule = optional-whitespace, rule-name-container, optional-whitespace, assignment-operator, optional-whitespace, expression, optional-whitespace;

(* Rule names are strings of alphanumeric characters *)
rule-name-container = "<", rule-name, ">";
rule-name = rule-name, rule-name-char | "";
rule-name-char = "[a-zA-Z0-9-_]";

assignment-operator = ":", ":", "=";

(* An expression can either be a concatenation or an alternation *)
expression = concatenation | alternation;

(* An alternation is a sequence of one or more concatenations separated by | *)
alternation = expression, optional-whitespace, "|", optional-whitespace, concatenation;

(* A concatenation is a string of terminals and non-terminals *)
concatenation = expression-element | concatenation, optional-whitespace, expression-element;

(* An atom of a expression can either be a terminal literal, a non-terminal or a group *)
expression-element = literal | rule-name-container | expression-group | expression-repetition | expression-optional;

(* Expression containers *)
expression-group = "(", optional-whitespace, expression, optional-whitespace, ")";
expression-optional = "[", optional-whitespace, expression, optional-whitespace, "]";
expression-repetition = "{", optional-whitespace, expression, optional-whitespace, "}";

(* Terminal literals *)
literal = "'", string-1, "'" | '"', string-2, '"' | range-literal;
range-literal = single-char-literal, optional-whitespace, ".", ".", ".", optional-whitespace, single-char-literal;

(* Strings and characters *)
string-1 = string-1, string-1-char | "";
string-1-char = ? any character except '"' ? | string-escaped-char | escaped-single-quote;
string-2 = string-2, string-2-char | "";
string-2-char = ? any character except "'" ? | string-escaped-char | escaped-double-quote;

single-char-literal = "'", string-1-char, "'" | '"', string-2-char, '"';

(* Escape sequences *)
string-escaped-char = unicode-scalar | carriage-return | line-feed | tab-char | backslash;

backslash = "\\", "\\";
line-feed = "\\", "n";
carriage-return = "\\", "r";
tab-char = "\\", "t";

escaped-double-quote = "\\", '"';
escaped-single-quote = "\\", "'";

unicode-scalar = "\\", "u", "{", unicode-scalar-digits, "}";
unicode-scalar-digits = [digit], [digit], [digit], [digit], [digit], [digit], [digit], digit;
digit = '0' ... '9' | 'A' ... 'F' | 'a' ... 'f';

(* Whitespace *)
newlines = "\n" | "\n", optional-whitespace, newlines;

optional-whitespace = "" | whitespace, optional-whitespace;
whitespace = " " | "\t" | "\n" | comment;

(* Comments *)
comment = "(", "*", comment-content, "*", ")" | "(", "*", "*", "*", ")";
comment-asterisk = comment-asterisk, "*" | "*";
comment-content = comment-content, comment-content-char | "";
comment-content-char = "[^*(]" | comment-asterisk, "[^)]" | comment-open-parenthesis, "[^*]" | comment;
comment-open-parenthesis = comment-open-parenthesis, "(" | "(";