Hey guys! I was silent for a while due to a lack of good topics to discuss. Today I want to present another piece of my class work for the university “Compiler Development” course. The task is to write a manual lexical parser for a language of my choice. I decided to take JSON, because its syntax is relatively simple and still exercises the most common parsing techniques. In addition, it has a well-presented BNF grammar to guide custom parser implementations.

The purpose of lexical analysis is to read the source code and convert it into a sequence of tokens (lexemes), which are the minimal parts of the language. It is important to understand that lexical analysis doesn’t perform semantic (meaning) validation. That is, lexical analysis determines whether the source code can be written in a specific language’s alphabet. It doesn’t mean that the code will execute successfully. Source code semantics are validated only after lexical analysis, and that stage consumes the lexer’s product (a set of tables: keywords, operators, literals, identifiers, etc.).
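To illustrate the difference, here is a small sketch of my own (not part of the class work): the text below uses only characters and tokens from the JSON alphabet, so a lexer happily tokenizes it, yet a full parser such as ConvertFrom-Json rejects it because the token sequence makes no semantic sense.

```powershell
# Lexically fine (every token exists in JSON), semantically broken
# (two values after one key), so the full parser throws.
$bad = '{"a": true false}'
try {
    $null = $bad | ConvertFrom-Json
    $semanticallyValid = $true
} catch {
    $semanticallyValid = $false
}
$semanticallyValid   # $false: tokenizing alone cannot catch this
```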

You might think there is no need to write your own lexical parser, because there are LOTS of them already. For example, PowerShell contains a built-in JSON encoder and decoder via the ConvertTo-Json and ConvertFrom-Json cmdlets. However, these cmdlets completely hide the parsing stage and go straight to object conversion; you can’t access the internal parser to look at the exact results of tokenization. Yet the results of lexical parsers are actively used on the web. For example, JS-based syntax highlighters use a lexical parser to split the source code into tokens and colorize or highlight them for better readability. My website does it as well (though not via JS): all XML and PowerShell code snippets on my blog are colorized using lexical parsers, with colors assigned according to token types. For PowerShell code I’m using the Tokenize method of the System.Management.Automation.PSParser class, and for XML strings a custom XML tokenizer.
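As a quick illustration of that built-in tokenizer (this snippet is mine, not part of the assignment), the static Tokenize method returns PSToken objects carrying Type, Content, Start and Length properties, which is exactly the kind of output a colorizer needs:

```powershell
# Tokenize a PowerShell one-liner with the built-in parser
$parseErrors = $null
$tokens = [System.Management.Automation.PSParser]::Tokenize(
    'Get-Date | Out-String', [ref]$parseErrors)
# each token carries its class and position in the source string
$tokens | Select-Object Type, Content, Start, Length
```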

The JSON BNF grammar is presented on the project’s main page at https://www.json.org/. A useful thing is that it is given in two forms: as text and as railroad diagrams. The idea of JSON lexical parsing consists of defining the following collections:

  • language keywords ("{", "}", "[", "]", ":", ",", "\"");
  • literal constants (true, false, null);
  • separators (not really used in this particular example);
  • special characters ("\"", "\\", "\/", "\b", "\f", "\n", "\r", "\t", "\u").

The parser then loops over each character and traverses the graph according to the BNF grammar. If it successfully converts the entire text into tokens, the text is valid for this particular language. My script may not be very efficient, because (it is still class work) I’m allowed to use only basic functions, data structures and operations. For example, I’m not allowed to use regex directly, but I can use built-in functions that determine the class of a particular character (is it a digit, a letter, a special char and so on). On the other hand, this simplicity lets you read the code line by line and follow its logic. The code is refactored into several functions, both to reuse logic and to move specific pieces into their own functions rather than writing everything in the main loop:

  • newToken – creates an empty token object. This object holds the basic properties of the token: class, start offset (based on the source string), length and exact value
  • getNextChar – reads the next char from the source text
  • appendChar – appends the current char to the token value and advances to the next char
  • validateEscape – validates that the string contains only the escape characters defined in the BNF grammar
  • parseString – is called when a string literal is processed. The function checks whether the string is well-formed and validates escape characters (by calling validateEscape)
  • parseNumber – is called when the current char starts a number literal
  • parseExponent – numbers in JSON can be written in various forms, including exponent notation. This function parses the exponent part of the number.

And here is the full code:

function __newToken ($type) {
    # generate default token object
    $token = New-Object psobject -Property @{
        Type = $type;
        Value = [string]::Empty;
        StartOffset = $i;
    }
    # add script-property to get the length of the Value property automatically
    $token | Add-Member -MemberType ScriptProperty -Name "Length" -Value {
        # if token type is "String", its length includes quotes which are not
        # included in Value member
        if ($this.Type -eq "String") {$this.Value.Length + 2} else {$this.Value.Length}
    } -PassThru
}
function __getNextChar {
    # use the script: scope modifier to change a variable defined in the outer (script) scope
    $script:i++
    $str[$i]
}
function __appendChar($token) {
    $c = $str[$i]
    $token.Value += $c
    __getNextChar
}
function __validateEscape($token) {
    $c = __appendChar $token
    # use the case-sensitive "-cnotcontains" operator: the plain
    # "-notcontains" is case-insensitive and would accept wrong-case escapes
    if ($escapes -cnotcontains $c) {
        Write-Warning "Bad escape at $i"
    }
    # again, case-sensitive check
    if ($c -ceq "u") {
        for ($n = 0; $n -lt 4; $n++) {
            $c = __appendChar $token
            # \u must be followed by 4 HEX digits, so [char]::IsDigit is not
            # enough: IsHexDigit also accepts a-f/A-F
            if (![System.Uri]::IsHexDigit($c)) {
                Write-Warning "Bad unicode escape at $i"
            }
        }
    }
}
function __parseString($token) {
    $c = $str[$i]
    # read chars until the unescaped closing quote. A "while" loop (instead
    # of "do/while") also handles the empty string "" correctly.
    while ($c -ne '"') {
        # escape sequences, including the escaped quote \", are fully
        # consumed here, so the quote that stops the loop is always the
        # real closing quote. The content of the string is irrelevant.
        if ($c -eq "\") {
            __validateEscape $token
        }
        $c = __appendChar $token
    }
}
function __parseExponent($token) {
    # the exponent part consists of an "E" letter, an optional sign and 1 or more digits
    $c = __getNextChar
    # the "E" can optionally be followed by a plus or minus sign
    if ($c -eq "+" -or $c -eq "-") {
        $c = __appendChar $token
    }
    # and must be followed by 1 or more digit chars. Multiple leading
    # zeros are allowed in an exponent.
    if (![char]::IsDigit($c)) {
        Write-Warning "Bad exponent at $i"
    }
    do {
        $c = __appendChar $token
    } while ([char]::IsDigit($c))
}
function __parseNumber($token) {
    $c = $str[$i]
    # if the number starts with "0", the integer part ends there: JSON forbids
    # leading zeros, so only a decimal dot, an exponent or a delimiter may follow
    if ($c -eq "0") {
        $c = __appendChar $token
    } elseif ([char]::IsDigit($c)) {
        # if it is non-zero digit, read a sequence of numbers (1 or more)
        do {
            $c = __appendChar $token
        } while ([char]::IsDigit($c))
    }
    # the integer part has been read. Check whether a decimal dot is present.
    if ($c -eq ".") {
        if (![char]::IsDigit($str[$script:i + 1])) {
            Write-Warning "Wrong fract at $i"
            return
        }
        do {
            $c = __appendChar $token
        } while ([char]::IsDigit($c))
    }
    # after the integer and fractional parts are read, check whether an
    # exponent is present. The "e" letter can be upper or lower case, and the
    # default "-eq" operator is conveniently case-insensitive
    if ($c -eq "E") {
        $token.Value += $c
        __parseExponent $token
    }
}
# define JSON keywords and constants
$keywords = @("{", "}", "[", "]", ":", ",", "`"")
$constants = @("true", "false", "null")
$escapes = @('"', "\", "/", "b", "f", "n", "r", "t", "u")
# main loop
for ($i = 0; $i -lt $str.Length; $i++) {
    $c = $str[$i]
    if ($keywords -contains $c) {
        $token = __newToken "String"
        $token.Value = $str[$i]
        switch ($c) {
            "{" {$token.Type = "GroupStart"}
            "}" {$token.Type = "GroupEnd"}
            "[" {$token.Type = "ArrayStart"}
            "]" {$token.Type = "ArrayEnd"}
            ":" {$token.Type = "NameSeparator"}
            "," {$token.Type = "ValueSeparator"}
            '"' {
                # quotation mark is not added to the token value
                $token.Value = [string]::Empty
                $i++
                __parseString $token
            }
        }
    } elseif ([char]::IsDigit($c) -or $c -eq "-") {
        $token = __newToken "Number"
        if ($c -eq "-") {
            $c = __appendChar $token
            if (![char]::IsDigit($c)) {
                Write-Warning "NAN: minus sign is followed by a non-number at $i"
            }       
        }
        __parseNumber $token
        $i--
    } elseif ([char]::IsLetter($c)) {
        $token = __newToken "ConstantLiteral"
        do {
            $c = __appendChar $token
        } while ([char]::IsLetter($c))
        if (!($constants -ccontains $token.Value)) {
            Write-Warning "Wrong constant at $i : $($token.Value)"
            return
        }
        $i--
    } elseif ([char]::IsWhiteSpace($c)) {
        continue
    } else {
        Write-Warning "unknown literal at $i : $($str[$i])"
        return
    }
    $token
}

Now let’s parse a sample JSON text that includes all common JSON features:

{
    "key": 3.5,
    "true": true,
    "false": false,
    "null": null,
    "list": [1, 2, 3],
    "object": {
        "k1": "v1"
    },
    "numbers": [0e1, 1e-1, -1e-1, -1e+1],
    "string": "\"\u1234'abc[]{}..."
}

The result will be as follows:

StartOffset               Value Type            Length
-----------               ----- ----            ------
          0                   { GroupStart           1
          6                 key String               5
         11                   : NameSeparator        1
         13                 3.5 Number               3
         16                   , ValueSeparator       1
         22                true String               6
         28                   : NameSeparator        1
         30                true ConstantLiteral      4
         34                   , ValueSeparator       1
         40               false String               7
         47                   : NameSeparator        1
         49               false ConstantLiteral      5
         54                   , ValueSeparator       1
         60                null String               6
         66                   : NameSeparator        1
         68                null ConstantLiteral      4
         72                   , ValueSeparator       1
         78                list String               6
         84                   : NameSeparator        1
         86                   [ ArrayStart           1
         87                   1 Number               1
         88                   , ValueSeparator       1
         90                   2 Number               1
         91                   , ValueSeparator       1
         93                   3 Number               1
         94                   ] ArrayEnd             1
         95                   , ValueSeparator       1
        101              object String               8
        109                   : NameSeparator        1
        111                   { GroupStart           1
        121                  k1 String               4
        125                   : NameSeparator        1
        127                  v1 String               4
        136                   } GroupEnd             1
        137                   , ValueSeparator       1
        143             numbers String               9
        152                   : NameSeparator        1
        154                   [ ArrayStart           1
        155                 0e1 Number               3
        158                   , ValueSeparator       1
        160                1e-1 Number               4
        164                   , ValueSeparator       1
        166               -1e-1 Number               5
        171                   , ValueSeparator       1
        173               -1e+1 Number               5
        178                   ] ArrayEnd             1
        179                   , ValueSeparator       1
        185              string String               8
        193                   : NameSeparator        1
        195 \"\u1234'abc[]{}... String              21
        217                   } GroupEnd             1

And if you want to make your own colorizer, just process the output token by token and apply colors based on token types.
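For example, here is a minimal sketch of such a colorizer (my own illustration: the Format-JsonToken helper name and the color palette are assumptions, and the token objects are the ones produced by the script above):

```powershell
# map token types to console colors; unknown types fall back to Gray
function Format-JsonToken($token) {
    $palette = @{
        String          = "Green"
        Number          = "Cyan"
        ConstantLiteral = "Magenta"
    }
    $color = if ($palette.ContainsKey($token.Type)) {$palette[$token.Type]} else {"Gray"}
    # return the value paired with its color; a real colorizer would instead
    # call: Write-Host $token.Value -ForegroundColor $color -NoNewline
    [pscustomobject]@{Value = $token.Value; Color = $color}
}

# usage with a hand-made token object
Format-JsonToken ([pscustomobject]@{Type = "Number"; Value = "3.5"})
```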

