Transpiling TypeScript to C
The code for this project is available on GitHub.
Over the past few days, I've been working on a little program which transpiles
TypeScript to C. It's called type-c
. This is an interesting task, because:
- TypeScript and C are two very different languages,
- Using TypeScript allows us to skip the analysis stage and work straight with the AST
type-c
doesn't do much. It doesn't inject polyfills into your code, so your
TypeScript program must adapt to C's conventions and paradigm. That goes
without saying that it doesn't have support for JS/TS's standard library, like
console
, Array
, or any FP-related functions like forEach
or map
.
This design choice made the project's goal clearer: to be able to write C with the syntax of TypeScript. Obviously, there's no particular reason that anyone would want to do this, but it's still a fun experiment.
What it looks like
I mentioned the fact that type-c
doesn't expose a console
object, so how
does one log something? the answer is printf
.
printf("Hello, C!\n");
But that's not a correct type-c
program. You need to export a main
function, just like C programs. Oh, and you need to import printf
.
import { printf } from "stdio.h";
export function main(argc: number, argv: string[]): number {
printf("Hello, C!\n");
return 0;
}
This compiles to the following:
#include <stdio.h>
int main(int argc, char** argv) {
printf("Hello, C!\n");
return 0;
}
Imports
You likely noticed the import -- it imports printf
from stdio.h
. This is
the way to #include
header files from within TypeScript in type-c
. The item
that is being imported doesn't really matter, but it matters very much that
there is an import statement with the header file which needs to be included.
Of course, the same applies for booleans.
import { true, false } from "stdbool.h";
Types
You may also notice that we only really have one number type: number
, which
is just translated to int
in C. That is because JS/TS uses a single type
for all kinds of numbers (backed by a f64
).
I decided to map number
to an int
rather than a double
, because int
is
more commonly used, particularly in critical parts such as the main
function's return type.
Pointers
One thing that was fun to implement were pointers. TypeScript obviously won't
allow us to use &
and *
characters like we do in C, so I opted for a
solution that feels more native to TypeScript.
let foo: number = 8;
let fooPtr: Pointer<number> = ptr(foo);
foo = deref(fooPtr);
This transpiles to:
int foo = 8;
int *fooPtr = &foo;
foo = *fooPtr;
ptr
and deref
are currently the only builtin functions. They're obviously
not implemented in TypeScript -- type-c
syntactically replaces them to the
equivalent C syntax.
Arrays
Since pointers work, I also implemented arrays, sort of. I couldn't implement fixed-sized arrays because of TypeScript not allowing the syntax.
let primes: number[] = [2, 3, 5, 7, 11, 13];
transpiles into the following:
int primes[] = {2, 3, 5, 7, 11, 13};
You can even use sizeof
like so:
let primes: number[] = [2, 3, 5, 7, 11, 13];
let primes_len: number = sizeof(primes) / sizeof(primes[0]);
A more common practice in C when wanting to get the length of an array is to
divide by the size of the data type which the array holds, like sizeof(int)
.
This is possible in type-c
:
sizeof(int); // => 4
Control Flow
while
loops were trivial to implement -- the syntax is for the most part
identical between TypeScript and C:
let i: number = 0;
while (i < 99) {
printf("i = %d\n", i);
i++;
}
transpiles to:
int i = 0;
while (i < 99) {
printf("i = %d\n", i);
i = i + 1;
}
I did have some trouble implementing for
loops, primarily because of the
semicolons.
Here is my for
loop struct:
pub struct ForStatement {
pub init: Box<Statement>,
pub test: Expression,
pub update: Box<Statement>,
pub body: Box<Statement>,
}
You can see that it primarily contains three fields (the body is not very relevant to this issue.)
To make sure we're on the same page, this is what each of the fields mean.
-
init
is the first part of the for loop where you can initialize or declare a variable. It's a statement. -
test
is the second part, after the first semicolon, which tests whether we should continue with the loop. It's an expression. -
update
is the last part after the second semicolon which updates the counter, usually the same variable declared byinit
. It's a statement.
In the following case:
for(int i = 0; i < 5; i++) {
...
}
int i = 0
isinit
i < 5
istest
i++
isupdate
The problem is that my transpiler would always write a semicolon after
statements. If I was to revisit the project I would change that, but as of
writing, this is how it's currently done. The problem is that the update
statement would also have a semicolon, which is invalid syntax in C, at least
for gcc
.
I wasn't sure of how to fix it so I just opted in for a hacky solution, which I'm really not proud of:
statement.update.trim_end_matches(";");
Now, it works:
for (let j: number = 0; j < 99; j++) {
printf("j = %d\n", j);
}
transpiles to
for (int i = 0; i < 99; i = i + 1) {
printf("i = %d\n", i);
}
Bubble Sort
type-c
can successfuly compile Bubble sort.
export function bubbleSort(arr: number[], n: number): void {
let i: number;
let j: number;
let temp: number;
for (i = 0; i < n - 1; i++) {
for (j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j + 1]) {
temp = arr[j];
arr[j] = arr[j + 1];
arr[j + 1] = temp;
}
}
}
}
This transpiles into the following:
void bubbleSort(int arr[], int n) {
int i;
int j;
int temp;
for (i = 0; i < n - 1; i = i + 1) {
for (j = 0; j < n - i - 1; j = j + 1) {
if (arr[j] > arr[j + 1]) {
temp = arr[j];
arr[j] = arr[j + 1];
arr[j + 1] = temp;
}
}
}
}
Implementation
My first implementation was in TypeScript itself, since I thought It would have
better support for the language. I used swc's parser's bindings for
TypeScript. I quickly realized that I was not enjoying the library or the
workflow, so I opted in for Rust, which swc
is written in.
I decided to create my own IR, which is basically a dumbed down AST which is derived from SWC's AST.
This is what my IR's root looks like:
pub struct Program {
pub imports: Vec<Import>,
pub methods: Vec<Method>,
}
This is what the Method
struct looks like:
pub struct Method {
pub name: String,
pub return_type: Type,
pub parameters: Vec<MethodParameter>,
pub body: Vec<Statement>,
}
And my Statement
struct.
pub enum Statement {
VariableDeclaration(VariableDeclaration),
If(IfStatement),
While(WhileStatement),
For(ForStatement),
Return(ReturnStatement),
Expression(ExpressionStatement),
Block(BlockStatement),
}
I used SWC's Visit API. I initially didn't fully understand how the library or the Visitor Pattern worked, but I eventually figured it out.
I implemented a Visitor
struct implementing its Visit
trait. The struct
contains some state:
pub struct Visitor {
pub program: Program,
pub current_function_name: Option<String>,
pub current_function_params: Option<Vec<MethodParameter>>,
pub current_function_return_type: Option<Type>,
pub current_function_body: Option<Vec<Statement>>,
}
It implements these functions:
fn visit_import_decl(&mut self, node: &swc_ecma_ast::ImportDecl) { ... }
fn visit_fn_decl(&mut self, node: &swc_ecma_ast::FnDecl) { ... }
fn visit_function(&mut self, node: &swc_ecma_ast::Function) { ... }
fn visit_param(&mut self, node: &swc_ecma_ast::Param) { ... }
fn visit_stmt(&mut self, node: &swc_ecma_ast::Stmt) { ... }
I won't go into detail as to what they do specifically, you can look into that yourself but short: they recursively visit the AST nodes and convert them into my IR. In the codebase I call this step "parsing", even though it's not really parsing, but it's close enough.
I came up with a trait called ToIR
pub trait ToIR<T> {
fn to_ir(&self) -> Result<T>;
}
Obviously I could've just used From
but I think this organizes the code
better. I might eventually refactor the codebase to use From
.
My code structure went through quite a few different phases, but it ultimately ended up looking like this.
├── expr
│ ├── array.rs
│ ├── call.rs
│ ├── mod.rs
│ └── ...
├── statement
│ ├── return_s.rs
│ ├── while_s.rs
│ ├── mod.rs
│ └── ...
├── types
│ ├── array.rs
│ ├── mod.rs
│ ├── primitive.rs
│ └── ...
└── ...
A "parser" (an implementation of ToIR
) looks like this:
def_parser!(CallExpr, Expression, |expr| {
let callee = *expr.callee.as_expr().unwrap().clone();
let name = match callee {
Expr::Ident(ident) => ident.sym.to_string(),
other => bail!("non-supported callee kind: {:?}", other),
};
let arguments: Vec<Expression> = expr
.args
.iter()
.map(|expr| expr.expr.to_ir())
.map(Result::unwrap)
.collect();
Ok(Expression::MethodCall(MethodCall {
name, arguments
}))
});
Right now, the error handling is nasty. I am not proud of the code quality. I will slowly start to refactor it now that the project is for the most part functional.
The important part is that I implemented a macro for implementing parsers, and it greatly helped remove boilerplate.
I did the same for code generation. I have almost the same structure for code
generation: a ToC
struct and a def_codegen!
macro.
def_codegen!(Import, |import| {
let mut buffer = CodeBuffer::default();
buffer.write("#include <");
buffer.write(import.module.as_str());
buffer.write(">");
Ok(buffer)
});
You will notice the CodeBuffer
struct. This is my poor attempt at making some
sort of StringBuilder equivalent for Rust. It's not fully needed but I
think it's a pretty clean way to write code generation code. This is roughly
what its API looks like.
pub struct CodeBuffer {
lines: Vec<String>,
}
impl CodeBuffer {
pub fn write<S: Into<String>>(&mut self, content: S) { ... }
pub fn write_line<S: Into<String>>(&mut self, content: S) { ... }
pub fn concat(&mut self, other: &CodeBuffer) { ... }
}
I found the concat
method useful for combining CodeBuffer
s. I look at them
as some sort of combinators, because I implement ToC
for tiny units of code,
and then concatenate them like this:
def_codegen!(Expression, |expr| {
let mut buffer = CodeBuffer::default();
match &expr {
...
Self::MemberAccess(access) => {
buffer.write(access.object.to_c()?);
buffer.write("[");
buffer.write(access.index.to_c()?);
buffer.write("]");
}
...
}
Ok(buffer);
});
Recap
To recap, the code goes through this process:
- Parsed by SWC, converting it to an AST
- AST is walked, converting it to the IR
- IR is walked, converting it into C
PS: The smallest part of the codebase is the codegen, so It'd be super easy to change the language it's transpiled to.