[RFC] Architecture of the Language Server #107

psteinroe · 2024-01-07T15:54:49Z

psteinroe
Jan 7, 2024
Maintainer

I did some research about language server architecture and want to share my findings and a proposal for this project.

First, I will present some resources I used to get a better understanding and links to other projects that can serve as inspiration, before describing the special requirements a Postgres language server has and finally presenting my proposal.

Please note that this is just the result of a bit of research. I am in no means an expert and am very happy to get feedback on this.

This blog post on Three Architectures for a responsive IDE is a great starting point. It describes different architectures with their benefits and drawbacks. I can also recommend going through its references:

Another great resource is this blog post that explores a new idea for a languages server data model for the zig language server.

One important learning is that even though every language server is basically doing the same thing, the design of the language determines the architecture of the underlying data model. For example, if the language supports header files and has a declaration before use rule, the language server can compile the headers once and cache them. When the user types within a file, the compiler restarts from the point just after the header section of the very same file. All other files and headers are read from cache, and the compilation unit is reasonably small enough.

// start of header. iostream is parsed and analyzed once.
#include <iostream>
// end of header
// if the user types below, we just have to re-analyse these three lines. Everything else, including types and definitions from other files, can be read from cache.
void main() {
    std::cout << "Hello, World!" << std::
}

This does not work for every language though. For example, in Rust, the compilation unit is a whole crate, not a single file. And there is no declaration before use rule.

But we do not want to build a language server for C++ or Rust, but Postgres. And there is one very significant difference. Usually, types are defined within the codebase, and resolved from there. In the example below, the language server first has to compilefoo.ts and up a data base with Foo to resolve the type of bar and check whether it is correct.

// foo.ts
export type Foo = {
    bar: string
}

// bar.ts
import { Foo } from ‘./foo’

const bar: Foo = { bar: “I am a string” };

So the data model of the languages server is built by parsing and analysing the entire codebase. For Postgres (or any SQL dialect really), this is very different for the most part: while types can be defined within the source code e.g. within a migration file by using declarative schema management, the database itself is the single source of truth, and there is no relation between files. The variety of schema and migration management tools employed today means we cannot make a lot of assumptions about how the source code is structured and whether it reflects the current state.

There are three exceptions to this. First, types that are declared or altered in the same sql source before the current statement. Take for example

alter table contact add column email text;

select email from contact;

Until this migration is executed, the schema cache does not include the column email and our language server will report an error that the column does not exist on contact. What we could do to fix this is to alter the schema cache with every statement above the one currently analysed. From my experience, this is not a super important case, since ddl and dml are rarely written in the same file. And a developer could always execute the ddl statements once to update the schema cache before writing any dml statements.

The second execution that breaks the “no relationship between files” assumption is \include, which allows executing another sql file as it would have been part of the current one. Again, this is only relevant for ddl statements, and can be supported by simply treating the statements from the imported file as part of the current file.

The third exception is plpsql functions, e.g.

CREATE OR REPLACE FUNCTION public.f1()
RETURNS void
LANGUAGE plpgsql
AS $function$
DECLARE r record;
BEGIN
  FOR r IN SELECT * FROM t1
  LOOP
    RAISE NOTICE '%', r.c; -- there is bug - table t1 missing "c" column
  END LOOP;
END;
$function$;

where local variable declarations are possible. At first, we can rely on https://github.com/okbob/plpgsql_check/tree/master for the static analysis. Eventually, we might want to write our own plpgsql linter. But either way: all types are declared local to the create function statement, or can be queried from the database schema.

Based on the above, we can make the following assumptions:

The smallest unit of compilation is a single statement
The database is the single source of truth for any type information
A parsed statement is only used to analyse the very same statement

Given the assumptions, I propose the following architecture:

We use a schema cache similar to PostgREST to store all relevant schema information such as tables, columns, functions and the like in-memory.
We process every statement on its own. This means we can use map-reduce to parse and analyse the source in parallel. First per-file, and once we cut a source into its statements, also per-statement.
We can identify every statement with a file id and a statement index.
Any user input just triggers re-processing of the statement(s) that were changed.
There will be just a single level of abstraction above the AST node of a statement designed exclusively for analysing. For example, to resolve the columns used in a select statement and e.g. report a nonexistent one.
based on feedback on our recent HN post, I would also like to add tree-sitter-sql as a fallback parser for invalid statements.
In later iterations, we can apply any ddl statement located before the currently analysed one in the source to the schema cache.

That’s it. Its still very high level, but I will start will implementing some proof of concepts now and would love some feedback!

Best
Philipp

Donnerstagnacht · 2025-05-22T08:07:26Z

Donnerstagnacht
May 22, 2025

Since this language server project is somehow connected to supabase and supabase encourages to use declarative schemas, I wanted to ask if it would be (besides your mentioned limitations, theoretically) possible to define a directory or schema (for example supabase/schemas) that can (somehow) extend/overwrite the ground truth provided by the database.

In that case, database types/symbols managed by the developer declaratively could be mapped to the code/files. The latter would have the benefit that a saved file change could be immediately available to the language server, so that developers are not required to run a migration after every (small) change and that required changes in dependent sql structures would be immediately visible.

It also could enable many unique features that are not supported by any existing sql tooling tool and therefore set this project apart from others in the long term - like goto definitions that actually jump to the source code instead of generating a definition from the database.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RFC] Architecture of the Language Server #107

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

[RFC] Architecture of the Language Server #107

Uh oh!

Uh oh!

psteinroe Jan 7, 2024 Maintainer

Replies: 1 comment

Uh oh!

Uh oh!

Donnerstagnacht May 22, 2025

psteinroe
Jan 7, 2024
Maintainer

Donnerstagnacht
May 22, 2025