General Expression Language (GEL)

This document provides a detailed technical specification and data sheet for the General Expression Language (GEL). The specification covers its lexing, parsing, and AST data structures, serving as a reference for both implementers and power users.

1. Introduction
2. Overview of GEL
3. Lexer
4. Parser
5. Token Types
6. Grammar and AST
7. Configuration and Known Fields
8. Usage Examples
9. Error Handling
10. References

1. Introduction

GEL (General Expression Language) is a domain-specific language intended for constructing logical and arithmetic expressions. These expressions can filter data records, define conditions, or implement advanced matching logic. The language supports:

Logical operators (and, or, not).
Comparison operators (eq, ne, lt, le, gt, ge, in, has).
Field references (e.g., file.name).
Function calls with type-checking.
Literal values (numbers, strings, bytes, booleans).
Collections and set-literals ({...} syntax for in checks).
Array and map subscript notation (arr[index] or obj[key]).

2. Overview of GEL

GEL expressions are written in a compact form that closely resembles pseudo-code:

Example Expression:
not process.name eq "cmd.exe" and (file.extension in { "exe" "dll" })

Internally, a GEL expression is processed in two phases:

Tokenization (via the Lexer)
Parsing into an AST (via the Parser)

Once parsed into an Expression AST, the language elements can be further analyzed or executed by an interpreter or evaluator.

3. Lexer

The lexer is defined in lexer_optimized.ts, providing an optimized tokenization phase. Key optimizations include:

Inline checks for digits, alpha characters, etc.
Early short-circuit when reaching end-of-input.
Single-pass loops for reading strings, raw strings, and byte strings.
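
The inline checks and single-pass string loop listed above can be illustrated with a minimal sketch. It is not taken from lexer_optimized.ts: the helper names isDigit, isAlpha, and readQuoted are hypothetical, and the escape rule (a backslash keeps the next character verbatim) is an assumption.

```typescript
// Hypothetical helpers; names and escape rules are assumptions, not the real lexer code.

// Inline character-class checks on char codes avoid regex overhead in the hot loop.
function isDigit(code: number): boolean {
  return code >= 48 && code <= 57; // '0'..'9'
}

function isAlpha(code: number): boolean {
  return (
    (code >= 65 && code <= 90) || // 'A'..'Z'
    (code >= 97 && code <= 122) || // 'a'..'z'
    code === 95 // '_'
  );
}

// Single-pass string reader. `start` is the index of the opening quote.
// Returns the decoded value and the index just past the closing quote.
function readQuoted(input: string, start: number): { value: string; end: number } {
  const quote = input.charAt(start);
  let value = "";
  let i = start + 1;
  while (i < input.length) {
    const ch = input.charAt(i);
    if (ch === "\\" && i + 1 < input.length) {
      value += input.charAt(i + 1); // assumed rule: keep the escaped character verbatim
      i += 2;
    } else if (ch === quote) {
      return { value, end: i + 1 };
    } else {
      value += ch;
      i += 1;
    }
  }
  // End of input reached without a closing quote.
  throw new Error("Unterminated string literal");
}
```

Each character is visited once, and the loop exits as soon as the closing quote or the end of input is reached.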

3.1 `GelLexer` Class

The GelLexer converts an input string into an array of Token objects. Notable methods:

tokenize(): Main entry point that returns Token[].
readString(quoteChar): Reads a quoted string literal.
readRawString(): Reads a raw string (e.g. r#"..."# syntax).
readBytesString(): Reads a bytes literal prefixed by b or B.
readNumber(): Reads a numeric literal (integer or decimal).
readIdentifier(): Reads an identifier, such as field names or operator keywords.

Note: The GelLexer also tracks line and column information to aid in detailed error reporting for invalid tokens.
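
The line and column tracking described in the note can be sketched with a much smaller tokenizer. This is an illustrative assumption, not the GelLexer source: sketchTokenize, SketchToken, and the reduced token set are hypothetical, and strings, raw strings, and bytes literals are deliberately left out.

```typescript
// Simplified sketch; the real GelLexer handles many more token kinds.
type SketchTokenType = "identifier" | "number" | "dot" | "lparen" | "rparen" | "eof";

interface SketchToken {
  type: SketchTokenType;
  value: string;
  line: number; // 1-based line of the token's first character
  column: number; // 1-based column of the token's first character
}

function sketchTokenize(input: string): SketchToken[] {
  const tokens: SketchToken[] = [];
  let pos = 0;
  let line = 1;
  let column = 1;

  const isDigitCh = (c: string): boolean => c >= "0" && c <= "9";
  const isIdentCh = (c: string): boolean =>
    (c >= "a" && c <= "z") || (c >= "A" && c <= "Z") || c === "_";

  while (pos < input.length) {
    const ch = input.charAt(pos);
    // A newline advances the line counter and resets the column.
    if (ch === "\n") { pos++; line++; column = 1; continue; }
    if (ch === " " || ch === "\t" || ch === "\r") { pos++; column++; continue; }

    const start = pos;
    const startColumn = column;
    let tokenType: SketchTokenType;
    if (isIdentCh(ch)) {
      while (pos < input.length && (isIdentCh(input.charAt(pos)) || isDigitCh(input.charAt(pos)))) pos++;
      tokenType = "identifier";
    } else if (isDigitCh(ch)) {
      while (pos < input.length && isDigitCh(input.charAt(pos))) pos++;
      // Consume a fractional part only when a digit follows the dot.
      if (input.charAt(pos) === "." && isDigitCh(input.charAt(pos + 1))) {
        pos++;
        while (pos < input.length && isDigitCh(input.charAt(pos))) pos++;
      }
      tokenType = "number";
    } else if (ch === ".") { pos++; tokenType = "dot"; }
    else if (ch === "(") { pos++; tokenType = "lparen"; }
    else if (ch === ")") { pos++; tokenType = "rparen"; }
    else {
      throw new Error(`Unexpected character '${ch}' at line ${line}, column ${column}`);
    }
    column += pos - start;
    tokens.push({ type: tokenType, value: input.slice(start, pos), line, column: startColumn });
  }
  tokens.push({ type: "eof", value: "", line, column });
  return tokens;
}
```

Recording the position at the start of each token, rather than at the end, lets an error message point at the first offending character.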

4. Parser

The parser, defined in parser.ts (or parser_optimized.ts in your final distribution), reads the tokens produced by the lexer and constructs an Abstract Syntax Tree (AST). The parser enforces correct syntax and performs semantic checks, such as type matching for function calls.

4.1 `GelParser` Class

constructor(input: string, config: GelConfig): Accepts a raw expression string and a GelConfig for field/function definitions. Internally calls the lexer.
parse(): Main entry point. Returns a root Expression node. Throws ParseError if syntax or semantic checks fail.

4.1.1 Key Parsing Methods

parseExpression(): Handles the top-level parse logic.
parseOrExpression(), parseAndExpression(): Implement logical operator precedence, e.g. expr1 or expr2, expr1 and expr2.
parseUnaryExpression(): Parses an optional not prefix for expressions.
parsePrimaryExpression(): Parses parentheses or delegates to parseOperandOrFunctionWithOperator().
parseOperandOrFunction(): Determines whether an identifier is a field reference or a function call.
parseInSet(), parseInSetElement(): Parse set-literal syntax ({ ... }) for in operations.

The parser also performs type inference and type checking (for instance, ensuring that lt is applied only to numeric types). It looks up known field types in this.config.fields and known function signatures in this.config.signatures.
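
How the chained parse methods encode operator precedence can be shown with a small recursive-descent sketch. It is a simplification under stated assumptions, not the GelParser source: the class name PrecedenceSketchParser and the SketchExpr type are hypothetical, tokens are plain whitespace-separated words, and operands are bare identifiers with no comparison operators.

```typescript
// Hypothetical sketch: only the or / and / not layering is modelled.
type SketchExpr =
  | { kind: "operand"; name: string; negated: boolean }
  | { kind: "logical"; operator: "and" | "or"; left: SketchExpr; right: SketchExpr; negated: boolean };

class PrecedenceSketchParser {
  private pos = 0;
  private readonly words: string[];

  constructor(words: string[]) {
    this.words = words;
  }

  parse(): SketchExpr {
    const expr = this.parseOrExpression();
    if (this.pos < this.words.length) {
      throw new Error(`Unexpected token '${this.words[this.pos]}'`);
    }
    return expr;
  }

  // Lowest precedence: "or" binds more loosely than "and".
  private parseOrExpression(): SketchExpr {
    let left = this.parseAndExpression();
    while (this.peek() === "or") {
      this.pos++;
      const right = this.parseAndExpression();
      left = { kind: "logical", operator: "or", left, right, negated: false };
    }
    return left;
  }

  private parseAndExpression(): SketchExpr {
    let left = this.parseUnaryExpression();
    while (this.peek() === "and") {
      this.pos++;
      const right = this.parseUnaryExpression();
      left = { kind: "logical", operator: "and", left, right, negated: false };
    }
    return left;
  }

  // An optional "not" prefix toggles the negated flag of the node that follows.
  private parseUnaryExpression(): SketchExpr {
    if (this.peek() === "not") {
      this.pos++;
      const inner = this.parsePrimaryExpression();
      inner.negated = !inner.negated;
      return inner;
    }
    return this.parsePrimaryExpression();
  }

  private parsePrimaryExpression(): SketchExpr {
    const word = this.peek();
    if (word === "(") {
      this.pos++;
      const inner = this.parseOrExpression();
      if (this.peek() !== ")") throw new Error("Expected ')'");
      this.pos++;
      return inner;
    }
    if (word === undefined || word === ")" || word === "and" || word === "or" || word === "not") {
      throw new Error("Expected an operand");
    }
    this.pos++;
    return { kind: "operand", name: word, negated: false };
  }

  private peek(): string | undefined {
    return this.words[this.pos];
  }
}
```

Because parseOrExpression() calls parseAndExpression() for each of its operands, an input such as a or b and not c groups as a or (b and (not c)) without any explicit precedence table.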

5. Token Types

The lexer outputs tokens defined by TokenType. Below is a summary table:

TokenType	Description / Example
`lparen`, `rparen`	Left `(` or Right `)` parenthesis
`lbrace`, `rbrace`	Left `{` or Right `}` brace
`lbracket`, `rbracket`	Left `[` or Right `]` bracket
`comma`	Comma (`,`) separator
`dot`	Period (`.`) used for dotted field paths or part of `dotdot`
`dotdot`	Double-dot (`..`) used for numeric ranges in set literals
`star`	Asterisk (`*`) used for subscript expansions (`[*]`)
`identifier`	Keywords (`and`, `or`, `not`) or user-defined fields (`process`) or operator tokens (`eq`, `lt`, etc.)
`string`	A quoted string (`"foo"` or `'bar'`) or raw string (`r#"something"#`)
`bytes`	A bytes literal (`b"\x41\x42"`)
`number`	Numeric literal (`123`, `3.14`)
`less_than`, `greater_than`	`<` or `>` (not generally used in the parser, but recognized by the lexer)
`eof`	End of input marker
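
The table can be mirrored in TypeScript roughly as follows. This is a sketch reconstructed from the table, not the actual declarations: whether TokenType is a string union or an enum is an assumption, the value field on Token is an assumption (line and column follow the lexer note and the ParseError example), and PUNCTUATION and punctuationType are hypothetical helpers.

```typescript
// Sketch reconstructed from the token table; the real declarations may differ.
type TokenType =
  | "lparen" | "rparen"
  | "lbrace" | "rbrace"
  | "lbracket" | "rbracket"
  | "comma" | "dot" | "dotdot" | "star"
  | "identifier" | "string" | "bytes" | "number"
  | "less_than" | "greater_than"
  | "eof";

interface Token {
  type: TokenType;
  value: string; // assumed: the raw or decoded text of the token
  line: number;
  column: number;
}

// Hypothetical lookup for single-character punctuation.
// `dotdot` is absent because it spans two characters and needs lookahead.
const PUNCTUATION = new Map<string, TokenType>([
  ["(", "lparen"],
  [")", "rparen"],
  ["{", "lbrace"],
  ["}", "rbrace"],
  ["[", "lbracket"],
  ["]", "rbracket"],
  [",", "comma"],
  [".", "dot"],
  ["*", "star"],
  ["<", "less_than"],
  [">", "greater_than"],
]);

function punctuationType(ch: string): TokenType | undefined {
  return PUNCTUATION.get(ch);
}
```

Note that keywords such as and, or, not, and the comparison operators have no dedicated token type; they arrive as identifier tokens and are distinguished by the parser.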

6. Grammar and AST

The overall grammar hierarchy is as follows (simplified EBNF notation, where * means zero or more and ? means optional):


Expression           := OrExpression

OrExpression         := AndExpression ("or" AndExpression)*
AndExpression        := UnaryExpression ("and" UnaryExpression)*
UnaryExpression      := ("not")? PrimaryExpression

PrimaryExpression    := "(" Expression ")"
                      | OperandOrFunctionWithOperator

OperandOrFunctionWithOperator
                     := OperandOrFunction (ComparisonOperator OperandOrFunction)?

OperandOrFunction    := Operand ("(" ArgList? ")" )?
                        ("[" SubscriptIndex "]")*

Operand              := NumberLiteral
                      | StringLiteral
                      | BooleanLiteral
                      | BytesLiteral
                      | InSet
                      | FieldReference

InSet                := "{" (InSetElement (InSetElement)*)? "}"

InSetElement         := NumberLiteral (".." NumberLiteral)?
                      | StringLiteral
                      | ...

ComparisonOperator   := "eq" | "ne" | "lt" | "le" | "gt" | "ge" | "in" | "has"

ArgList              := Expression ("," Expression)*

SubscriptIndex       := "*" | Expression

The parser transforms the tokens into a strongly typed AST. The core node types are:

Node Type	Role
`LogicalNode`	Represents `expr1 and expr2` or `expr1 or expr2`.
`ComparisonNode`	Binary operator (`eq`, `lt`, etc.) with `left` and `right` sub-expressions.
`FunctionCallNode`	Function invocation (`starts_with(field, "xyz")`). Holds argument list and return type.
`FieldReferenceNode`	Refers to a field name, possibly dotted (`file.path`). Has an inferred `fieldType`.
`LiteralNode`	Represents a literal (`number`, `string`, `boolean`, or `bytes`).
`InSetNode`	Used for the `in` operator to hold multiple possible values or expansions from numeric ranges.
`SubscriptNode`	Array or map indexing (`arr[idx]`, `obj[key]`, or `arr[*]` for expansion).

Note: Most node interfaces share a negated field, which records whether the node was prefixed with not. The parser sets or toggles this flag while parsing unary expressions.
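
One way to model these nodes in TypeScript is a discriminated union with a shared negated flag. The shapes below are a hedged sketch: only a subset of the node types is shown, the kind discriminator and field names such as name and value are assumptions, and applyNot is a hypothetical helper illustrating the toggle behaviour.

```typescript
// Hypothetical shapes inferred from the node table; real interfaces may differ.
interface BaseNode {
  negated: boolean; // true when the node was prefixed with "not"
}

interface LiteralNode extends BaseNode {
  kind: "literal";
  value: string | number | boolean;
}

interface FieldReferenceNode extends BaseNode {
  kind: "field";
  name: string; // possibly dotted, e.g. "file.path"
  fieldType: string; // inferred from the configuration
}

interface ComparisonNode extends BaseNode {
  kind: "comparison";
  operator: string; // "eq", "lt", ...
  left: Expression;
  right: Expression;
}

interface LogicalNode extends BaseNode {
  kind: "logical";
  operator: "and" | "or";
  left: Expression;
  right: Expression;
}

type Expression = LiteralNode | FieldReferenceNode | ComparisonNode | LogicalNode;

// Returns a copy with the flag flipped. Toggling (rather than setting) means
// that negating an already negated node cancels out.
function applyNot<T extends Expression>(node: T): T {
  return { ...node, negated: !node.negated };
}

const pidField: FieldReferenceNode = {
  kind: "field",
  name: "process.pid",
  fieldType: "number",
  negated: false,
};
const negatedPid = applyNot(pidField);
```

Keeping negation as a flag on the node, instead of wrapping it in a separate "not" node, keeps the tree shallow at the cost of every consumer having to honour the flag.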

7. Configuration and Known Fields

The parser references a GelConfig object containing:

fields: A map of field name strings to FieldType, e.g.:

{
  "file.name": "string",
  "process.pid": "number",
  "network.ports": "array"
}

signatures: Known function signatures. For example:

{
  "starts_with": {
    name: "starts_with",
    parameters: [
      { paramName: "haystack", allowedTypes: ["string"] },
      { paramName: "needle", allowedTypes: ["string"] }
    ],
    returnType: "boolean"
  }
}
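
The two configuration fragments can be combined into typed form, together with a sketch of the kind of signature check the parser performs on a function call. The interface names FunctionParameter and FunctionSignature, the treatment of FieldType as a plain string, and the checkCall helper are assumptions made for illustration; the actual checks in the parser may differ.

```typescript
// Shapes follow the two examples above; checkCall is a hypothetical helper.
type FieldType = string; // e.g. "string", "number", "array", "boolean"

interface FunctionParameter {
  paramName: string;
  allowedTypes: FieldType[];
}

interface FunctionSignature {
  name: string;
  parameters: FunctionParameter[];
  returnType: FieldType;
}

interface GelConfig {
  fields: { [fieldName: string]: FieldType };
  signatures: { [functionName: string]: FunctionSignature };
}

const exampleConfig: GelConfig = {
  fields: {
    "file.name": "string",
    "process.pid": "number",
    "network.ports": "array",
  },
  signatures: {
    starts_with: {
      name: "starts_with",
      parameters: [
        { paramName: "haystack", allowedTypes: ["string"] },
        { paramName: "needle", allowedTypes: ["string"] },
      ],
      returnType: "boolean",
    },
  },
};

// Returns the function's return type, or throws when the call does not match.
function checkCall(cfg: GelConfig, fnName: string, argTypes: FieldType[]): FieldType {
  const known = Object.prototype.hasOwnProperty.call(cfg.signatures, fnName);
  const sig = known ? cfg.signatures[fnName] : undefined;
  if (!sig) {
    throw new Error(`Unknown function '${fnName}'.`);
  }
  if (argTypes.length !== sig.parameters.length) {
    throw new Error(`'${fnName}' expects ${sig.parameters.length} argument(s), got ${argTypes.length}.`);
  }
  sig.parameters.forEach((param, i) => {
    const actual = argTypes[i];
    if (actual === undefined || param.allowedTypes.indexOf(actual) === -1) {
      throw new Error(`Argument '${param.paramName}' of '${fnName}' does not accept type '${actual}'.`);
    }
  });
  return sig.returnType;
}
```

For a call such as starts_with(file.name, "test"), the argument types would be the field type of file.name and the literal type string, so the check passes and yields boolean.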

8. Usage Examples

8.1 Simple Expression


    not user.is_admin eq true and file.extension in {"exe" "dll"}

After tokenization, a sequence of Token objects is produced, e.g. (identifier=not), (identifier=user), (dot=.), (identifier=is_admin), (identifier=eq), (identifier=true), (identifier=and), .... The parser then builds an AST:

LogicalNode {
  operator: "and",
  left: ComparisonNode {
    operator: "eq",
    left: FieldReferenceNode("user.is_admin"),
    right: LiteralNode(boolean=true),
    negated: true
  },
  right: ComparisonNode {
    operator: "in",
    left: FieldReferenceNode("file.extension"),
    right: InSetNode { values: ["exe", "dll"] },
    negated: false
  },
  negated: false
}
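
To show how the negated flags in this tree affect the result, the following sketch evaluates a hand-built copy of the AST against a flat data record. It is an illustrative assumption only: the real GelEvaluator is not shown in this document, and EvalNode, DataRecord, resolveOperand, and evaluateNode are hypothetical names that support just the eq and in operators.

```typescript
// Hypothetical evaluator sketch; the real GelEvaluator may behave differently.
type Value = string | number | boolean;

type EvalNode =
  | { kind: "literal"; value: Value; negated: boolean }
  | { kind: "field"; name: string; negated: boolean }
  | { kind: "inSet"; values: Value[]; negated: boolean }
  | { kind: "comparison"; operator: "eq" | "in"; left: EvalNode; right: EvalNode; negated: boolean }
  | { kind: "logical"; operator: "and" | "or"; left: EvalNode; right: EvalNode; negated: boolean };

type DataRecord = { [field: string]: Value };

function resolveOperand(node: EvalNode, data: DataRecord): Value {
  if (node.kind === "literal") return node.value;
  if (node.kind === "field") {
    const v = data[node.name];
    if (v === undefined) throw new Error(`Undefined field '${node.name}'`);
    return v;
  }
  throw new Error("Expected a literal or field reference");
}

function evaluateNode(node: EvalNode, data: DataRecord): boolean {
  let result: boolean;
  switch (node.kind) {
    case "logical": {
      const left = evaluateNode(node.left, data);
      result = node.operator === "and"
        ? left && evaluateNode(node.right, data)
        : left || evaluateNode(node.right, data);
      break;
    }
    case "comparison": {
      const left = resolveOperand(node.left, data);
      if (node.operator === "in") {
        if (node.right.kind !== "inSet") throw new Error("'in' expects a set literal");
        result = node.right.values.indexOf(left) !== -1;
      } else {
        result = left === resolveOperand(node.right, data);
      }
      break;
    }
    default:
      throw new Error(`Cannot evaluate a bare ${node.kind} node as a condition`);
  }
  // The flag is applied last, so it inverts the node's own result.
  return node.negated ? !result : result;
}

// Hand-built equivalent of the AST shown above.
const exampleAst: EvalNode = {
  kind: "logical", operator: "and", negated: false,
  left: {
    kind: "comparison", operator: "eq", negated: true,
    left: { kind: "field", name: "user.is_admin", negated: false },
    right: { kind: "literal", value: true, negated: false },
  },
  right: {
    kind: "comparison", operator: "in", negated: false,
    left: { kind: "field", name: "file.extension", negated: false },
    right: { kind: "inSet", values: ["exe", "dll"], negated: false },
  },
};

const nonAdminExe = evaluateNode(exampleAst, { "user.is_admin": false, "file.extension": "exe" }); // true
```

In this sketch the not applies to the whole eq comparison, matching the negated: true flag on the first ComparisonNode rather than negating the field reference alone.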

8.2 Function Call


    starts_with(file.name, "test") or ends_with(file.name, ".txt")

If starts_with and ends_with are known functions returning booleans, the parser will produce a FunctionCallNode for each call and wrap them in a LogicalNode with operator="or".

9. Error Handling

ParseError is thrown for invalid syntax, type mismatches, or unknown functions. It captures line and column for user-friendly error messages:

throw new ParseError(
  "Unknown function 'some_bad_function'.",
  token.line,
  token.column
);

When evaluating the AST, an UndefinedFieldError may be thrown if a required field is missing from the data at runtime.
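
A ParseError class compatible with the constructor call above could look like the following sketch. Only the three constructor arguments are taken from the example; the field names line and column, the name property, and the formatError helper are assumptions.

```typescript
// Sketch only; the actual ParseError definition is not shown in this document.
class ParseError extends Error {
  readonly line: number;
  readonly column: number;

  constructor(message: string, line: number, column: number) {
    super(message);
    this.name = "ParseError";
    this.line = line;
    this.column = column;
    // Keeps `instanceof ParseError` working when compiling to ES5 targets.
    Object.setPrototypeOf(this, ParseError.prototype);
  }
}

// Hypothetical helper that turns any thrown value into a user-facing message.
function formatError(err: unknown): string {
  if (err instanceof ParseError) {
    return `${err.name} at ${err.line}:${err.column}: ${err.message}`;
  }
  return String(err);
}

// Usage: catch the error at the call site and report its position.
let reported = "";
try {
  throw new ParseError("Unknown function 'some_bad_function'.", 1, 5);
} catch (err) {
  reported = formatError(err);
}
```

Carrying the position as structured fields, rather than embedding it in the message string, lets callers highlight the offending span in an editor or log it separately.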

10. References

Reference	Description
evaluator.ts	Implementation of `GelEvaluator`.
engine.ts	Higher-level `GelEngine` showcasing parsing + evaluation.

For more examples, see unit tests or integration tests in the main repository.