StarLasu: Overview

StarLasu is a methodology to design and implement Language Engineering applications. The methodology is supported by a set of libraries that permit its application on different platforms.

The libraries are:

Examples of applications that could use StarLasu:

StarLasu libraries permit to work with Abstract Syntax Trees (ASTs). These are tree-like data structures to represent the information contained in a piece of formal language or “code”. For example, the statements in a procedural program, the data elements in a SQL query, or the steps in a business workflow.

We can use the StarLasu libraries to build different modules producing ASTs, enriching ASTs, transforming ASTs, and finally consuming ASTs. These modules can be combined into different pipelines for different goals. E.g., a parser will transform a lower-level parse tree into a higher-level AST, then perhaps resolve names and references, and type-check the code. A transpiler will also perhaps transform the source code into some intermediate representation, then into the target language AST, from which it will generate the resulting code string.

The name StarLasu is derived from *lasu, to indicate all the libraries part of the family. The suffix lasu stands for Language Support. The original library has been Kolasu, which stands for Kotlin Language Support. Successive libraries have taken derived names.

Support on the different platforms

The different libraries have a different level of maturity. While eventually all libraries should be aligned, as of this moment Kolasu has all the features, Tylasu has most of them, while Pylasu has support for the main features.

Goal of this document

The goal of this document is to list the features provided by StarLasu and to describe the strategy suggested for implementing certain features. It should also help ensuring that the different libraries for the different languages provide the same features with the same logic.

This is not a substitution for library-specific tutorials. Please refer to each individual library for additional resources, API documentation, etc.

Typical usages of StarLasu

These are the most typical applications of StarLasu:

StarLasu can be used to develop other Language Engineering applications (for example, inside editors, to provide Language Intelligence). One of the goals of StarLasu is indeed to provide modularity, in according to the principles of Model-Driven Development. Because of this, the different modules built with StarLasu can be combined in original ways, for many different purposes.

Features of StarLasu

At its core StarLasu permits to define ASTs.

The nodes of each AST can specify positions. Related to this topic, nodes can specify their origin and destination (i.e., the object they’ve been generated from, and the object that’s been generated from them). This is discussed under Origin and Destination.

StarLasu provides APIs for traversing ASTs and for transforming ASTs.

On top of the above core features, StarLasu builds several advanced concepts, detailed in the following sections.

Symbol Resolution

Symbol Resolution is the process of giving a meaning to the each name in the source code. For example: * to connect a given use of a variable to its declaration; * to reconstruct the definition of a SQL column from an alias in a subquery; * to identify a user-defined therapy plan in a medical support DSL.

Symbol resolution relies on the naming facilities in StarLasu.

Type Systems

We’re also planning to add support for the definition of type systems.

Code Generation

Initially, with StarLasu we focused on creating parsers. However, ASTs are also useful to build code generators. For this reason, work is going on to support code generation in StarLasu. When we have both a parser and a code generator for the same language, we can define a language module.

Other Features

Other APIs include:

Building a parser

Building parsers is the most traditional use case. Kolasu was initially created with this goal in mind and tens of commercial parsers have been implemented with it.

Our preferred approach is to initially use an ANTLR parser and then wrap it with StarLasu ASTs.

Process

Initially, we define an ANTLR grammar. Then, we define AST classes are using StarLasu. An initial version may also be generated from the ANTLR grammar, by using StarLasu-Tools.

We then organize the parser into a pipeline: 1. First-stage parsing using ANTLR. We obtain a parse tree and, possibly, a series of errors. 2. Second-stage parsing. The parse tree is mapped into the AST 3. Potentially, for some parsers, we perform additional steps, such as: 1. Symbol resolution 2. Type checking 3. Advanced calculations such as lineage, data flow analysis, linting, etc.

The StarLasu ASTs provide a more convenient API with respect to the ANTLR APIs for parse trees. In fact, StarLasu comes with additional features such as serialization and support for symbol resolution. Finally, when mapping the parse tree into an AST, we drop implementation details of the grammar from the tree, and we organize the information in the most convenient way for users.

Building a transpiler

The general approach to design transpilers has been described in the article How to write a transpiler.

In general these are the steps:

For example, suppose we want to translate RPG into Java:

Building a code generator

A code generator is a component that given an AST permits to generate code (i.e., text).

It can be a component inside a transpiler, or it can be used by itself to output source code from an AST that we’ve constructed programmatically (that is, explicitly creating and combining nodes, rather than starting from the output of a parser).

This use case is currently being improved in StarLasu. At the moment, we’re experimenting in internal projects such as StarLasu-Tools (to generate Kotlin and Python). In the future more support is planned to be added in StarLasu itself.

For more information about code generation options and techniques, please refer to our Guide to Code Generation. # AST Definition

An Abstract Syntax Tree (AST) defined with StarLasu, has these characteristics:

All StarLasu implementations target object-oriented languages. So, the preferred way of defining a node type is by inheritance, i.e. subclassing the Node class. [Kolasu] [Pylasu] [Tylasu]

Given we support references, the AST is actually a graph. However, references are usually rarer than containment relations, and the parent-child relationship between nodes is the most common path for traversal, so the tree aspect is dominant.

Suppose we have a Node representing a Java method declaration. It may have the following properties:

Attribute Values

Attribute values can have one of these types:

Derived Properties

Some properties could be calculated, i.e., derived from other properties. They are typically not serialized, e.g. to JSON, with the rationale that we can recompute them.

These properties cannot represent containment, only references. To mark a property as derived, we mark it with @Derived. [Kolasu]

Currently, neither Tylasu nor Pylasu supports this.

References by Name

In textual languages we have typically references resolved by name (see Naming). For this reason in StarLasu we have special support for the type ReferenceByName. This type has a generic parameter type, which should extend PossiblyNamed. This type indicates the type of thing that can be referred to.

Suppose to have a language with the Node Type ReferenceExpression. The Node Type ReferenceExpression could have an attribute with type ReferenceByName<VariableDeclaration>, as it permits to refer to Variable Declarations. We would expect VariableDeclaration to extend PossiblyNamed (or Named).

Note that we can accept to refer to Node Type which could potentially be anonymous, however we would typically only be able to refer to instances which actually have a name. Imagine a language that permits to define Classes and anonymous Classes. The Node Type ClassDeclaration would be PossiblyNamed, however when using a class in the typical class instantiation we would not be able to refer to anonymous classes but only to classes with a name.

It has two values:

The referred value is typically null initially and it is later set to a proper value during symbol resolution.

Error nodes

Special AST nodes can be used to represent errors.

All nodes representing errors should implement the same interface called ErrorNode. [Kolasu] [Pylasu] [Tylasu]

Notes on Kolasu implementation

When a reference has no associated string, the property is marked as Link.

For example:

class MethodDeclaration {
  @Link val overriddenMethod: MethodDeclaration?
}

NodeType

The interface NodeType can be used to specify that a certain Class represent a Node. This is intended to be used on abstract classes that do not directly extend Node or interfaces. All concrete classes deriving from them should also extend Node, indirectly.

For example, if we define:

@NodeType
interface Declaration

and later we have a Node using it:

class CompilationUnit {
  val elements: List<Declaration> // we know this is a containment, because elements are expected to be Nodes
}

We could then have different classes implementing Declaration:

sealed class Statement: Node

class VarDeclaration : Statement, Declaration // VarDeclaration indirectly extend Node

Position

To indicate positions in textual files we use the class Position.

A Position has a start and an end Point. Each position is relative to a certain Source, which is optional.

A Point has a Line and a Column (both Int). The first line of a file is Line 1. The first Column of each line is 0. The starting point of file has therefore Line=1 and Column=0.

Important Source sub-classes are:

See in Kolasu

Origin and Destination

A Node can have an Origin. An Origin indicates where the nodes come from.

Let’s see some examples: - In the case of Nodes built by a parser, the Origin will indicate a point in some source file. - In the case of Nodes built by a transformer, the Origin may indicate some Node which provided the original information.

Consider this example: we are building a transpiler from RPG to Java.

We first build an RPG parser that given RPG code product an RPG AST. The root of the AST may have type RPGCompilationUnit. The origin of an RPGCompilationUnit will represent the RPG file name and the position in that file. In the case of the root a position ranging from the first character of the first line, to the last character of the last line.

Then we build a transformer that given an RPG AST produces a Java AST. The root of such AST may have type JavaCompilationUnit. The origin of a JavaCompilationUnit will refer to another node, in this case having type RPGCompilationUnit.

In a Node we may also indicate what can be created from a Node. For example, if we build programmatically an AST and then we want to generate code from it, we may want to specify for each node what portion of code has been generated. For example, a node representing an If Statement in Java could end up being represented by a 5 lines of Java code inside a large Java file. In that case, the Destination will specify the name of the file and the lines and columns representing our If Statement.

Traversing the AST

To process the information contained in an AST, we’ll have to traverse it. That is, process each node in a given sequence.

Manual Traversal

We can traverse a tree “manually” i.e. programmatically following all relevant references in the nodes. E.g. start from the root node Program, then visit the contents of a child collection statements, and so on.

However, this approach produces a traversal algorithm that depends on the precise structure of the AST. Therefore, the algorithm is not reusable, and this approach doesn’t scale over a growing number of node instances and node types. It’s also brittle in evolving code bases where the structure of the AST changes over time.

Traversal Functions

So, StarLasu provides a number of generic traversal functions that we can apply and adapt to our ASTs for our use cases. In languages where it’s possible to do so, StarLasu exposes these functions as “extension methods” on the Node class. These functions do not return a collection of nodes, rather they are “generators” or “iterators” and similar concepts depending on the implementation language. These are facilities that return or “yield” one node after the other, in the intended sequence.

The basic traversal method is walk, that visits all the nodes in the tree in a depth-first sequence.

See an example in: - Kolasu - Pylasu - Tylasu

Most other methods are derived from walk and may change the order of traversal.

An exception is methods that travel upwards from a node to its ancestors, rather than downwards to its children. This is the case of the walkAncestors method.

Listeners and Visitors

In StarLasu we do not encourage the usage of visitors or listeners, that are commonly used with ANTLR for example. These have to be generated and produce interfaces with a number of methods depending on the number of node types, which can grow high. Moreover, listeners/visitors make it difficult to organize code in a structured manner.

Finding Nodes of a Given Type

Given an AST there is one essential operation we may want to do: find nodes of a given type. We can search for nodes of a given type in two directions: looking among descendants of the current node and look among ancestors.

For example, given a field declaration we may want to know in which class is declared. In that case we will look for the closest ancestor of type ClassDeclaration.

On the other hand, we may identify all the return statement inside a MethodDeclaration. In that case we will look for them among the descendants of the node.

See in: - Kolasu - Tylasu

Transforming ASTs

In general we may want to process ASTs and use their information to produce something else. For example, to generate a diagram or to generate code.

A particular case is the transformation of an AST into another AST. This is typically done within transpilers.

Example of refactoring within the same language

One usage of transformations is to perform refactoring within the same AST. For example, let’s suppose that we have this language:

enum class Operator {
    PLUS, MULT
}
sealed class Expression : Node()
data class IntLiteral(val value: Int) : Expression()
data class GenericBinaryExpression(val operator: Operator, val left: Expression, val right: Expression) : Node()
data class Mult(val left: Expression, val right: Expression) : Node()
data class Sum(val left: Expression, val right: Expression) : Node()

We then decide to transform an AST by removing instances of GenericBinaryExpression and replace them with Mult or Sum. We can do that in this way:

val myTransformer = ASTTransformer(allowGenericNode = false).apply {
    registerNodeFactory(GenericBinaryExpression::class) { source: GenericBinaryExpression ->
        when (source.operator) {
            Operator.MULT -> Mult(transform(source.left) as Expression, transform(source.right) as Expression)
            Operator.PLUS -> Sum(transform(source.left) as Expression, transform(source.right) as Expression)
        }
    }
    // This may benefit of specific support: for example a NodeFactory that returns the same element
    registerNodeFactory(IntLiteral::class) { source: IntLiteral -> source }
}
assertASTsAreEqual(
    Mult(IntLiteral(7), IntLiteral(8)),
    myTransformer.transform(GenericBinaryExpression(Operator.MULT, IntLiteral(7), IntLiteral(8)))!!
)
assertASTsAreEqual(
    Sum(IntLiteral(7), IntLiteral(8)),
    myTransformer.transform(GenericBinaryExpression(Operator.PLUS, IntLiteral(7), IntLiteral(8)))!!
)

Example of translation to another language

Let’s consider two languages. In this example they have exactly the same structure:

sealed class ALangExpression : Node()
data class ALangIntLiteral(val value: Int) : ALangExpression()
data class ALangSum(val left: ALangExpression, val right: ALangExpression) : ALangExpression()
data class ALangMult(val left: ALangExpression, val right: ALangExpression) : ALangExpression()

sealed class BLangExpression : Node()
data class BLangIntLiteral(val value: Int) : BLangExpression()
data class BLangSum(val left: BLangExpression, val right: BLangExpression) : BLangExpression()
data class BLangMult(val left: BLangExpression, val right: BLangExpression) : BLangExpression()

While this is a toy example it is true that many languages shares similar structures. Think of literals, mathematical operations, or basic control flow structures such as if-statements: they have the same structures in languages which are very different.

We could build a transformer that given an AST of ALang produces the corresponding AST of BLang:

val myTransformer = ASTTransformer(allowGenericNode = false).apply {
    registerNodeFactory(ALangIntLiteral::class) { source: ALangIntLiteral -> BLangIntLiteral(source.value) }
    registerNodeFactory(ALangSum::class) { source: ALangSum ->
        BLangSum(transform(source.left) as BLangExpression, transform(source.right) as BLangExpression)
    }
    registerNodeFactory(ALangMult::class) { source: ALangMult ->
        BLangMult(transform(source.left) as BLangExpression, transform(source.right) as BLangExpression)
    }
}
assertASTsAreEqual(
    BLangMult(
        BLangSum(
            BLangIntLiteral(1),
            BLangMult(BLangIntLiteral(2), BLangIntLiteral(3))
        ),
        BLangIntLiteral(4)
    ),
    myTransformer.transform(
        ALangMult(
            ALangSum(
                ALangIntLiteral(1),
                ALangMult(ALangIntLiteral(2), ALangIntLiteral(3))
            ),
            ALangIntLiteral(4)
        )
    )!!
)

Debug Print Format

This functionality permits to print as a String an AST, in a format easy to inspect for a human.

For example, given this AST: Add(Add(Number(3), Number(9)), Number(1))

We could produce this string:

Add {
  left = [
    Add {
      left = [
        Number {
          value = 3
        } // Number
      ]
      right = [
        Number {
          value = 9
        } // Number
      ]
    } // Add
  ]
  right = [
    Number {
      value = 1
    } // Number
  ]
} // Add

We can use a DebugPrintConfiguration instance to control which elements to filter, if to print empty lists, etc.

This feature is present in Kolasu but it will not be necessarily replicated in all libraries.

See in Kolasu

This is not yet supported in Pylasu.

Serialization

In addition to what is described here, there is also EMF serialization, which is discussed separately. See EMF.

StarLasu supports exporting ASTs to JSON and XML.

See in Kolasu.

Naming

Two interfaces are defined:

For example, a FunctionDeclaration will be PossiblyNamed in languages which permits anonymous functions.

In the StarLasu libraries we provide special support for things which are Named or PossiblyNamed. We can used these interfaces in Symbol Resolution, for example.

Symbol Resolution

The objective of symbol resolution consists in linking name-based textual references to the corresponding node entity in the Abstract Syntax Tree (AST). StarLasu provides support for implementing such process with the following building blocks:

Representing references among nodes

References between nodes are implemented using ReferenceByName instances in StarLasu. These keep track of the relationship between a name and the referred node, which might be absent until the symbol resolution phase and must be a sub-type of PossiblyNamed.

In Kolasu, for example, we can define a node Person containing a reference friend towards another Person as follows:

data class Person(
    override val name: String,
    val friend: ReferenceByName<Person> // <-- reference to another `Person`
) : PossiblyNamed

Instances can then be created providing the name of the referred Person instance. As regards the actual referenced object, it might be provided as initialReferred value if known or left unresolved until symbol resolution.

// reference by name using `name` only
val first: Person = Person(friend = ReferenceByName("second"))
// reference by name using `initialReferred` value and `name`
val second: Person = Person(friend = ReferenceByName("first", first))

In general, references can be resolved using one or more candidates, as follows

second.tryToResolve(first) // <-- `first` is the only candidate
second.tryToResolve(listOf(first, second, others)) // <-- list of candidates

While it is possible to manually implement symbol resolution by traversing the AST and updating the referred value for each ReferenceByName property, StarLasu provides support for the declarative specification of symbol resolution rules, as shown in the next section.

Declarative symbol resolution

As mentioned in the previous section, it is surely possible to manually implement symbol resolution as some kind of tree-traversal algorithm. However, StarLasu provides support to ease such task and allows the developer to focus on language-specific concerns by providing rules for each reference in a given AST.

Symbol resolution rule specifications consist of three parts:

Each rule produces a Scope instance that is used to resolve a given property or all properties of a given type. Given a property, StarLasu adopts a precise rule resolution schema. Considering Person::friend, for example, the following steps will be performed:

  1. lookup for a property-based rule having Person::friend as guard and Person as context;
  2. lookup for a property-based rule having Person::friend as guard and any ancestor of the Person node as context;
  3. lookup for a type-based rule having Person as guard and Person as context;
  4. lookup for a type-based rule having Person as guard and any ancestor of the Person node as context;

As soon as one rule is found, the symbol resolver will use it to resolve the reference.

In our example, we could define that friend reference candidates should correspond to aggregating all Person instances of the AST as follows:

val symbolResolver = symbolResolver {
    // property-based rule for Person::friend property
    scopeFor(Person::friend) { context: CompilationUnit ->
        Scope().apply {
            context.walk().filterIsInstance<Person>().forEach { define(it) }
        }
    }
    // type-based rule for references to Person instances
    scopeFor(Person::class) { context: CompilationUnit -> 
        Scope().apply {
            context.walk().filterIsInstance<Person>().forEach { define(it) }
        }
    }
}

In the example, CompilationUnit represent a container node in the AST. The type-based rule will never be executed as the property-based rule will be found before following the above mentioned resolution schema. Analogously, a property-based rule with Person as context would be executed in place of the first rule in our example. # Typesystem

Support for typesystem has not yet been implemented as part of StarLasu. It has been implemented in several parsers built using StarLasu but it has yet to be ported to StarLasu itself.

Code generation

Support for code generation is currently being worked on. It is planned to be included in Kolasu 1.6 and successively in Tylasu and Pylasu.

Language Module

A LanguageModule is an object that can both parse code into an AST and generate code from an AST. Basically it is the combination of a parser and a code generator operating on the same node types.

See in Kolasu

AST Common Elements

Most programming languages share some concepts. We identified these common concepts and defined marker types for them. In this way, we can treat these elements similary in all languages.

They are:

See in Kolasu

This is not yet supported in other StarLasu libraries.

Command Line Tools

StarLasu supports the definition of CLI tools for each parser. These CLI tools are intended for the parser users.

These tools permit to:

In addition to that, a series of tools for supporting parser developers. These are part of a separate (private) project named StarLasu tools.

See in Kolasu

This is not yet supported in Pylasu and Tylasu.

Parse Tree to AST mapping

Suppose we have a simple language and we want to parse this piece of code:

set foo = 123
display 456

We would like to parse this with ANTLR, and then translate the resulting parse-tree into an AST. StarLasu offers transformers to implement such mappings. For example, with Kolasu we may write:

val transformer = ParseTreeToASTTransformer()
transformer.registerNodeFactory(SimpleLangParser.CompilationUnitContext::class, CU::class)
    .withChild(SimpleLangParser.CompilationUnitContext::statement, CU::statements)
transformer.registerNodeFactory(SimpleLangParser.DisplayStmtContext::class) { ctx ->
    DisplayIntStatement(value = ctx.expression().INT_LIT().text.toInt())
}
transformer.registerNodeFactory(SimpleLangParser.SetStmtContext::class) { ctx ->
    SetStatement(variable = ctx.ID().text, value = ctx.expression().INT_LIT().text.toInt())
}

val transformedCU = transformer.transform(pt)!!
// The result would be equivalent to the following:
val cu = CU(
    statements = listOf(
        SetStatement(variable = "foo", value = 123).withParseTreeNode(pt.statement(0)),
        DisplayIntStatement(value = 456).withParseTreeNode(pt.statement(1))
    )
).withParseTreeNode(pt)

Read more about this topic for: - Kolasu (source code) - Pylasu - Tylasu

Validation

We have a concept of Issue. Each issue has a message, a position, a IssueSeverity, and an IssueType.

The IssueType can be lexical, syntactic, semantic, or translation. The IssueSeverity can be info, warning, or error.

Results are objects that provide a value and a collection of issues.

See in Kolasu.

Testing

StarLasu offers support for comparing parse trees and ASTs.

See assertParseTreeStr, assertParsingResultsAreEqual, and assertASTsAreEqual in Kolasu.

Coverage of the grammar

Related to this, there is experimental support for verifying the Coverage of a grammar by the examples we have. See CoverageListener in Kolasu.

The goal is that, given a grammar and a set of examples, we want to understand:

Another solution is to verify the coverage of the generated ANTLR Parser.

Performance testing

To be written.

Test the parser on examples

In practice it is often convenient to run the parser on a larget set of examples and just check if the parser can handle them without finding errors.

EMF

The Eclipse Modeling Framework (EMF) has been very successful in the MDE world. It can be regarded as an exchange format supported by different tools.

For this reason we built support for exporting Metamodels and Models to EMF. We in particular support the serialization to EMF-JSON. EMF-JSON is not as well-defined and supported as XMI (based on XML), but the request for JSON-based formats on some of the platforms supported brought us to focus on it.

In our case, metamodels are the definition of the node types. For example, they specify which properties each node has.

Instead, we can represent and serialize ASTs as models (i.e., instances of metamodels). A model describes the value of each node’s properties, and the relationships among the nodes.

In some of the StarLasu libraries, we can also import EMF metamodels and models (generated with other StarLasu libraries) in order to consume the results of a tool (e.g. a parser) from a different language. This is what the Strumenta Playground web app does, for example. This is also useful for cross-platform parsers.

Read more about this topic in: - Kolasu (source code) - Pylasu (source code) note: support in Pylasu is a work in progress. - Tylasu (source code)

Strumenta Playground

Strumenta Playground is a tool under development to visualize ASTs and transpilation traces.

This is intended to help users of our parsers and transpilers to assess our work.

Cross-platform parsers

Different users may want to use our parsers from different platforms such as the browser, Python, the JVM, etc.

On the other hand we want to avoid having to rewrite the parsers for the same languages across multiple platforms. For this reason we have developed an approach to use parsers across platforms. In essence, we write a parser for any platform we want, using Kolasu, Pylasu, or Tylasu. We then access the AST produced by such parsers on another platform by serializing the AST produced on one platform and unserializing it on another one. To make this possible we built tools to:

In this way we could write a parser for RPG in Kotlin, using Kolasu. We would then automatically generate equivalent AST classes in Pylasu, and code to unserialize an AST instantiating those AST classes. In the end, we would obtain a parser usable from Python, which expose AST classes in Python. The implementation would call the parser written in Kotlin, obtain the AST serialized and unserialize it behind the scenes.