Skip to content

Mechanics of interpreter generation from DSL #479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gvanrossum opened this issue Oct 12, 2022 · 8 comments
Closed

Mechanics of interpreter generation from DSL #479

gvanrossum opened this issue Oct 12, 2022 · 8 comments
Labels
epic-generator DSL-based code generator for the interpreter(s)

Comments

@gvanrossum
Copy link
Collaborator

Assuming we have a DSL roughly like this (modified per #477), we have a few ways to integrate this.

  1. We could place the generator input in a separate file, and the generator script (automatically run by make when the DSL file is updated) writes the interpreter cases to a separate file (I'll call it cases.h for now, we can bikeshed the actual name later). In ceval.c we replace all the cases (roughly 4000 lines) with a single #include "cases.h". (Presumably we'd check cases.h into git, just so people can bootstrap more easily.)
  2. Place the generator input in a separate file, but have the generator update ceval.c in place, sort of like . A downside is that people will undoubtedly confused and try to edit the generator output. (We could add comments warning them off, but we'd probably have to repeat those comments for each case.)
  3. Place the generator input in ceval.c, exactly where the cases are currently, but surrounded by #if etc. so that the compiler uses the #include "cases.h" but tooling (e.g. VS Code) sees the generator input and believes this is the actual code. (This may actually be a little bit complicated, because at least VS Code understands #if and expands them according to the configured variables. We'd have to fool it, otherwise it'll think that the generator input is all dead code and dim its appearance.) The generator can find its input by reading ceval.c, looking for the appropriate marker #if and #else. It still writes its output to cases.h.

(3) feels like a bit more complicated to implement, but is probably the most user-friendly for people trying to read or edit the generator input, since they see it in a context where their editor can show them both the C code for an instruction and the surrounding C code (static functions they call, and the non-generated parts of the function containing the switch). It also preserves the git history of the C code included in the generator input.

I'm not keen on (2), because of the mechanics of rewriting ceval.c in place. But it does preserve git history, like (3).

The simplest to build is probably (1). It does not preserve git history, however (the code of all instructions would be owned by the initial creator of the input file).

I think I've convinced myself to prototype (3) and see how this works out. Now on to designing the architecture of the generator. :-)

@gvanrossum
Copy link
Collaborator Author

gvanrossum commented Oct 13, 2022

I've got a very early prototype of a hybrid of (1) and (3) in my branch generate-ceval-switch. The generator script is Tools/scripts/generate_cases.py. It reads Python/bytecodes.inst and writes Python/cases.h.

It's a hybrid because the code generator reads bytecodes.inst and writes cases.h, which is then #included by ceval.c -- that's (1), but it retains the original cases in ceval.c, and there's a separate script (Tools/scripts/extract_cases.py) that regenerates bytecodes.inst from the original cases in ceval.c -- the two scripts together are almost (3). It round-trips, and the contents of cases.h is identical to the full list of cases (really TARGETs) in ceval.c.

Most things here are still fake -- the code generator has a regex that basically looks for 'inst' and then copies a block of code without looking at it, and it ignores the stack effect. The extractor tries to guess the numeric stack effect by calling dis.stack_effect(). But it turns out that for super-instructions and specialized instructions the dis module doesn't know the stack effect. And there are additional complications relating to variable stack effects. Moreover, even for those opcodes whose stack effect is fixed and known, we don't actually know how many variables it consumes from the stack and how many it produces -- e.g. BINARY_OP consumes two values and produces one, but all we know is that the net stack effect is -1, which we represent as (__0 -- ), whereas it should really be something like (left, right -- value).

Next steps? I have many ideas.

  • Write an actual parser for use by the code generator.
  • Generate family declarations from dis._specializations.
  • Derive stack effect for specialized and super-instructions from their components, in the extractor.
  • Derive linear variable stack effects in the extractor.
  • Many cases end withDISPATCH(); -- we could remove this in the extractor (leaving behind a flag, maybe) and put it back in the generator.
  • Auto-insert PREDICTED() macros.
  • If a case ends with PUSH(value), and its stack effect is 1, mark that in the extractor and use it in the generator.
  • Ditto with simple POP() cases.
  • Work through the examples from this.

@carljm
Copy link

carljm commented Oct 14, 2022

Are there any design decisions here that would preclude adding the ability to have multiple separate input files to the code generator? (This would be useful for an extension like Static Python that wants to generate an extended interpreter loop.) I read over the design doc and didn't see anything that looked like it would be a problem.

@gvanrossum
Copy link
Collaborator Author

Yeah, that should be simple enough.

@markshannon
Copy link
Member

I would prefer 1 (separate file). Having the input in a separate file makes it simpler to have two forms of the macros, the "fake" ones in at the top of the input to help IDEs and such, and the real ones in ceval.c. Unless, we don't want any macros in ceval.c and have the tooling generate the expanded form directly.

Having separate inputs files, also makes the tooling more composable.
e.g.
interpreter_gen --template ceval_template.c instructions.c > ceval.c
interpreter_gen --template trace_template.c instructions.c trace_super_instructions.c > trace_eval.c
For Cinder:
interpreter_gen --template cinder_template.c instructions.c cinder_extras.c > ceval.c

@markshannon
Copy link
Member

I'd suggest delaying doing any stack analysis, until after we have a generator merged.
For now the generator only need to understand the form without stack comment.

The definition file in generate-ceval-switch has both stack comments and POPs, which would mean popping the values twice.
The stack comment should tell the code generator which POPs and PUSHes to generate.

@gvanrossum
Copy link
Collaborator Author

I would prefer 1 (separate file). [...]

Sure, I can do that. Although I don't think we should generate all of ceval.c (too much duplication) -- I like the current process that generates cases.h which is #included by ceval.c. The cases can then be removed from ceval.c, but the rest will stay. No need for a template.

I'd suggest delaying doing any stack analysis, until after we have a generator merged. [...]

Agreed. In the extractor (a temporary hack) I just generated the stack comments (to the extent that I could derive them) so I wouldn't have to do that later. The generator ignores them. I have a vague transition plan where the generator ignores stack effects if they start with __. I'll leave this alone for now.

@gvanrossum
Copy link
Collaborator Author

Once python/cpython#98830 is merged we should close this.

@gvanrossum
Copy link
Collaborator Author

This has served its purpose. We've settled on the following workflow:

  • The instruction definitions exist only in Python/bytecodes.c.
  • Tools/cases_generator/generate_cases.py reads Python/bytecodes.c and writes Python/generated_cases.c.h.
  • Python/ceval.c #includes Python/generated_cases.c.h.
  • For most people, make regen-cases (or make regen-all) runs the cases generator script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic-generator DSL-based code generator for the interpreter(s)
Projects
None yet
Development

No branches or pull requests

3 participants