Terminal Symbols

Topic Overview

Purpose

This topic explains a Grammar’s terminal symbols.

Required background

The following topics are prerequisites to understanding this topic:

Topic Purpose

This topic provides an overview of the Syntax Parsing Engine.

This topic provides an overview of the Syntax Parsing Engine’s Grammar.

This topic explains the lexical analysis performed by the Syntax Parsing Engine.

In this topic

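This topic contains the following sections:

  • Terminal Symbols Overview

  • Defining Terminal Symbols

  • Regular Expression Support

  • Related Content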

Terminal Symbols Overview

Terminal symbols summary

A terminal symbol defines a group of characters that the lexical analyzer should recognize as a unit when it creates tokens. In addition to a unique name, which allows the parser to distinguish it from other symbols, a terminal symbol defines the text it can represent. For example, a grammar might have a terminal symbol named “SemicolonToken”, which can only represent “;”, and a terminal symbol named “IdentifierToken”, which can represent an underscore or letter followed by zero or more underscores, letters, or digits.

The terminal symbols are represented by the TerminalSymbol class in the Infragistics.Documents.Parsing namespace.

Defining Terminal Symbols

Defining terminal symbols summary

Each LexerState has a Symbols collection containing the TerminalSymbol instances that the lexer can match when it is in that state. The Symbols collection is also the means by which new TerminalSymbol instances are created, since TerminalSymbol does not have a public constructor. Here is the signature of the Add method on the Symbols collection, which creates a TerminalSymbol:

In C#:

public TerminalSymbol Add(
    string name,
    string value = null,
    TerminalSymbolComparison comparison = TerminalSymbolComparison.Literal,
    bool isExitSymbol = false)

Note

There is another Add method on the Symbols collection which takes a TerminalSymbol instance. It can be used to add a TerminalSymbol to multiple lexer states: create the symbol in one LexerState, then add the created instance to other LexerState instances in the same Grammar. In this way, a TerminalSymbol can be matched in multiple states of the lexer.
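As a rough illustration of both Add overloads, the sketch below defines the two example symbols from the overview and then shares one of them with a second lexer state. It is not taken from the product documentation. The class name, the DefineSymbols helper, and its two LexerState parameters are hypothetical, and the caller is assumed to supply lexer states belonging to the same Grammar. The enum member TerminalSymbolComparison.RegularExpression is inferred from the text of this topic, and the code compiles only with the Infragistics parsing assembly referenced.

```csharp
using Infragistics.Documents.Parsing;

public static class TerminalSymbolSketch
{
    // Hypothetical helper: both states are assumed to belong to the same Grammar.
    public static void DefineSymbols(LexerState firstState, LexerState secondState)
    {
        // Literal comparison is the default, so only the name and value are needed.
        TerminalSymbol semicolon = firstState.Symbols.Add("SemicolonToken", ";");

        // An underscore or letter followed by zero or more underscores, letters, or digits.
        // The enum member name RegularExpression is an assumption based on this topic's text.
        TerminalSymbol identifier = firstState.Symbols.Add(
            "IdentifierToken",
            "[_a-zA-Z][_a-zA-Z0-9]*",
            TerminalSymbolComparison.RegularExpression);

        // The overload taking an existing instance lets a second state match the same symbol.
        secondState.Symbols.Add(identifier);
    }
}
```

The identifier pattern uses only character groups, ranges, and the * quantifier, so it stays within the regular expression features this topic lists as supported.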

Regular Expression Support

Summary

If the comparison specified for a new TerminalSymbol in the Symbols collection’s Add method is RegularExpression, the value parameter is interpreted as a regular expression string. Regular expressions allow for a concise description of a set of strings.

Basic features

In addition to literal strings in the expression, there are certain operators that can be used to combine and enhance regular expressions to define larger and more complex sets. Here is a breakdown of some very basic regular expression features supported by the Syntax Parsing Engine:

| Construct | Description | Example | Set of strings described by the example |
| --- | --- | --- | --- |
| literal character | Any literal character that is not reserved as another kind of construct. This represents the set consisting of one string of just that character. | `A` | { “A” } |
| `(expression)` | Grouping. Groups a regular expression within the parentheses and does not change the set described by that expression. | `(A)` | { “A” } |
| `expression1\|expression2` | Alternation. Any two regular expressions separated by a ‘\|’. Represents the set of all strings described by either the first expression or the second. | `A\|B` | { “A”, “B” } |
| expression1 expression2 | Concatenation. Any regular expression followed by another regular expression. Represents the set of all strings in the set of the first expression concatenated with all strings in the set of the second expression. | `(A\|B)C` | { “AC”, “BC” } |
| `expression*` | Kleene Closure. A regular expression followed by ‘*’. Represents zero or more concatenations of the set described by the expression preceding the ‘*’. | `(A\|B)*` | { “”, “A”, “B”, “AA”, “AB”, “BA”, “BB”, … } |
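The set-of-strings view of these constructs can be checked with a small sketch. Because the Syntax Parsing Engine uses .NET regular expression syntax, the example below uses the .NET Regex class as a stand-in. It does not call the Parsing Engine itself, and the RegexSetDemo class and InSet helper are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSetDemo
{
    // Returns true when 'candidate' belongs to the set of strings described by 'pattern'.
    // The ^ and \z anchors are .NET Regex features used here only to force a whole-string match.
    public static bool InSet(string pattern, string candidate)
    {
        return Regex.IsMatch(candidate, "^(?:" + pattern + @")\z");
    }

    public static void Main()
    {
        Console.WriteLine(InSet("A|B", "B"));       // alternation: True
        Console.WriteLine(InSet("(A|B)C", "BC"));   // concatenation: True
        Console.WriteLine(InSet("(A|B)C", "AB"));   // not in { "AC", "BC" }: False
        Console.WriteLine(InSet("(A|B)*", ""));     // closure includes the empty string: True
        Console.WriteLine(InSet("(A|B)*", "ABBA")); // any sequence of A and B: True
        Console.WriteLine(InSet("(A|B)*", "ABC"));  // C is not in the alphabet: False
    }
}
```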

Supported functionality

The Syntax Parsing Engine uses the same syntax as .NET Framework regular expressions but supports only a subset of their functionality. This allows much faster lexical analysis at the cost of the ability to define certain regular expressions. The following table lists the .NET Framework regular expression features and indicates whether each is currently supported by the Parsing Engine (the list is taken from MSDN):

| Feature | Description | Supported in the Syntax Parsing Engine? |
| --- | --- | --- |
| `\a` | Matches a bell character, \u0007. | Yes |
| `\b` | In a character class, matches a backspace, \u0008. | Yes |
| `\t` | Matches a tab, \u0009. | Yes |
| `\r` | Matches a carriage return, \u000D. | Yes |
| `\v` | Matches a vertical tab, \u000B. | Yes |
| `\f` | Matches a form feed, \u000C. | Yes |
| `\n` | Matches a new line, \u000A. | Yes |
| `\e` | Matches an escape, \u001B. | Yes |
| `\nnn` | Uses octal representation to specify a character (nnn consists of two or three digits). | Yes |
| `\xnn` | Uses hexadecimal representation to specify a character (nn consists of exactly two digits). | Yes |
| `\cX`, `\cx` | Matches the ASCII control character that is specified by X or x, where X or x is the letter of the control character. | No |
| `\unnnn` | Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by nnnn). | Yes |
| `\` | When followed by a character that is not recognized as an escaped character in the tables in this topic, matches that character. For example, `\*` is the same as `\x2A`, and `\.` is the same as `\x2E`. This allows the regular expression engine to disambiguate language elements (such as `*` or `?`) and character literals (represented by `\*` or `\?`). | Yes |
| `[character_group]` | Matches any single character in character_group. By default, the match is case-sensitive. Example: `[ae]` | Yes |
| `[^character_group]` | Negation: Matches any single character that is not in character_group. By default, characters in character_group are case-sensitive. Example: `[^ae]` | Yes |
| `first-last` | Character range: Matches any single character in the range from first to last. Can only be used within a character group or negation character group. | Yes |
| `.` | Wildcard: Matches any single character except \n. To match a literal period character (. or \u002E), you must precede it with the escape character (`\.`). | Yes |
| `\p{name}` | Matches any single character in the Unicode general category or named block specified by name. Example: `\p{IsCyrillic}` | Yes |
| `\P{name}` | Matches any single character that is not in the Unicode general category or named block specified by name. Example: `\P{Lu}` | Yes |
| `\w` | Matches any word character. | Yes |
| `\W` | Matches any non-word character. | Yes |
| `\s` | Matches any white-space character. | Yes |
| `\S` | Matches any non-white-space character. | Yes |
| `\d` | Matches any decimal digit. | Yes |
| `\D` | Matches any character other than a decimal digit. | Yes |
| `^` | The match must start at the beginning of the string or line. | No |
| `$` | The match must occur at the end of the string or before \n at the end of the line or string. | No |
| `\A` | The match must occur at the start of the string. | No |
| `\Z` | The match must occur at the end of the string or before \n at the end of the string. | No |
| `\z` | The match must occur at the end of the string. | Yes |
| `\G` | The match must occur at the point where the previous match ended. | No |
| `\b` | The match must occur on a boundary between a \w (alphanumeric) and a \W (nonalphanumeric) character. | No |
| `\B` | The match must not occur on a \b boundary. | No |
| `(subexpression)` | Captures the matched subexpression and assigns it a zero-based ordinal number. | Yes |
| `(?<name>subexpression)` | Captures the matched subexpression into a named group. | No |
| `(?<name1-name2>subexpression)` | Defines a balancing group definition. | No |
| `(?:subexpression)` | Defines a noncapturing group. | Yes |
| `(?imnsx-imnsx:subexpression)` | Applies or disables the specified options within subexpression. Available options: i – Use case-insensitive matching; m – Use multiline mode, in which ^ and $ match the beginning and end of a line instead of the beginning and end of a string; n – Do not capture unnamed groups; s – Use single-line mode; x – Ignore unescaped white space in the regular expression pattern. | Yes (only for the `i` option) |
| `(?=subexpression)` | Zero-width positive lookahead assertion. | No |
| `(?!subexpression)` | Zero-width negative lookahead assertion. | No |
| `(?<=subexpression)` | Zero-width positive lookbehind assertion. | No |
| `(?<!subexpression)` | Zero-width negative lookbehind assertion. | No |
| `(?>subexpression)` | Nonbacktracking (or "greedy") subexpression. | No |
| `*` | Matches the previous element zero or more times. Example: `\d*\.\d` | Yes |
| `+` | Matches the previous element one or more times. | Yes |
| `?` | Matches the previous element zero or one time. | Yes |
| `{n}` | Matches the previous element exactly n times. Example: `,\d{3}` | Yes |
| `{n,}` | Matches the previous element at least n times. | Yes |
| `{n,m}` | Matches the previous element at least n times, but no more than m times. | Yes |
| `*?` | Matches the previous element zero or more times, but as few times as possible. | No |
| `+?` | Matches the previous element one or more times, but as few times as possible. | No |
| `??` | Matches the previous element zero or one time, but as few times as possible. | No |
| `{n}?` | Matches the preceding element exactly n times. | No |
| `{n,}?` | Matches the previous element at least n times, but as few times as possible. | No |
| `{n,m}?` | Matches the previous element between n and m times, but as few times as possible. | No |
| `\number` | Backreference. Matches the value of a numbered subexpression. Example: `(\w)\1` | No |
| `\k<name>` | Named backreference. Matches the value of a named expression. Example: `(?<char>\w)\k<char>` | No |
| `\|` | Matches any one element separated by the vertical bar (\|) character. Example: `th(e\|is\|at)` | Yes |
| `(?(expression)yes\|no)` | Matches yes if the regular expression pattern designated by expression matches; otherwise, matches the optional no part. expression is interpreted as a zero-width assertion. Example: `(?(A)A\d{2}\b\|\b\d{3}\b)` | No |
| `(?(name)yes\|no)` | Matches yes if name, a named or numbered capturing group, has a match; otherwise, matches the optional no. Example: `(?<quoted>")?(?(quoted).+?"\|\S+\s)` | No |
| `(?# comment)` | Inline comment. The comment ends at the first closing parenthesis. | Yes |

# [to end of line]
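A pattern that relies on an unsupported feature can sometimes be rewritten using supported features only. The sketch below is one possible approach, not a documented recommendation. It replaces a lazy quantifier, which is listed as unsupported, with a negated character group and a plain * quantifier. It runs against the .NET Regex class rather than the Parsing Engine, and the class and member names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LazyQuantifierRewrite
{
    // ".*?"  relies on the lazy quantifier *?, listed above as unsupported.
    public const string LazyPattern = "\".*?\"";

    // "[^"\n]*"  uses only a negated character group and *, both listed as supported.
    public const string SupportedPattern = "\"[^\"\\n]*\"";

    // Collects every match of 'pattern' in 'input' using the .NET Regex class.
    public static string[] Matches(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string input = "say \"hi\" and \"bye\"";
        string[] lazy = Matches(LazyPattern, input);
        string[] rewritten = Matches(SupportedPattern, input);

        Console.WriteLine(string.Join(", ", rewritten));   // "hi", "bye"
        Console.WriteLine(lazy.SequenceEqual(rewritten));  // True
    }
}
```

The negated group excludes \n as well as the quote so that it mirrors the wildcard, which does not match \n.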

Related Content

Topics

The following topics provide additional information related to this topic.

Topic Purpose

This topic explains a grammar’s non-terminal symbols.

This topic explains the syntax analysis performed by the Syntax Parsing Engine.

This topic explains the grammar analysis performed by the Syntax Parsing Engine.