[RFC]: support for structured package data #1147

kgryte · 2023-11-29T21:16:22Z

Description

This RFC proposes adding structured package data to facilitate automation and scaffolding.

Overview

The need for structured package data has been discussed at various points during stdlib development. This need has become more pressing when seeking to automate specialized package generation for packages which wrap "base" packages for use with other data structures. The most prominent example is the math/base/special/* APIs, which are wrapped to generate a variety of higher-order packages, including

math/iter
math/strided
math/ generics supporting ndarrays, arrays, and scalars

and more recently in work exposing those APIs in spreadsheet contexts. In each context, one needs to

specify parameters, including types, names, and descriptions
add example values
generate random values for benchmarking, examples, and tests
specify aliases
specify related keywords

and in some contexts

create native implementations

While various attempts have been made to automate scaffolding of higher-order packages, where possible, each attempt has relied on manual entry of necessary scaffold data, including parameter names, descriptions, and example values. To date, we have not created a centralized database from which we pull desired package meta data.

Proposal

In this RFC, I propose adding structured meta data to "base" packages. This structured meta data can then be used in various automation contexts, the most prominent of which is automated scaffolding.

The meta data would be stored as JSON in a subfield of the __stdlib__ configuration object of package.json files. The choice of JSON stems from the ability to use JSON Schema for validation and linting.
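As a rough sketch of how tooling might pull this meta data out of a parsed package.json: the subfield name (`meta`), the placeholder schema identifier, and the list of required fields below are all assumptions made for illustration. The RFC does not name the subfield, and the field list is only inferred from the examples in this issue. A real implementation would presumably validate against a JSON Schema rather than use the hand-rolled check shown here.

```javascript
// Hypothetical subfield name; the RFC does not specify one:
var FIELD = 'meta';

// Fields appearing in every example in this RFC (an assumption, not a finalized schema):
var REQUIRED = [
    '$schema',
    'base_alias',
    'alias',
    'pkg_desc',
    'desc',
    'parameters',
    'returns',
    'keywords'
];

/**
* Returns structured meta data from a parsed `package.json` object.
*
* @param {Object} pkg - parsed `package.json`
* @returns {(Object|null)} meta data or `null` if absent
*/
function getStructuredData( pkg ) {
    var cfg = pkg.__stdlib__;
    if ( cfg && typeof cfg === 'object' && cfg[ FIELD ] && typeof cfg[ FIELD ] === 'object' ) {
        return cfg[ FIELD ];
    }
    return null;
}

/**
* Returns the required fields which are missing from a meta data object.
*
* @param {Object} meta - structured meta data
* @returns {Array<string>} list of missing field names
*/
function missingFields( meta ) {
    return REQUIRED.filter( function isMissing( key ) {
        return !Object.prototype.hasOwnProperty.call( meta, key );
    });
}

// A deliberately incomplete entry to exercise the check:
var pkg = {
    '__stdlib__': {
        'meta': {
            '$schema': '<namespace>@<version>', // placeholder identifier
            'base_alias': 'add',
            'alias': 'add',
            'pkg_desc': 'add two double-precision floating-point numbers',
            'desc': 'adds two double-precision floating-point numbers'
        }
    }
};

console.log( missingFields( getStructuredData( pkg ) ) );
// => [ 'parameters', 'returns', 'keywords' ]
```

Keeping lookup and validation separate would let a linting tool report missing fields without failing on packages which have not yet adopted the meta data.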

Examples

I've included two examples below.

math/base/ops/add:

{
    "$schema": "math/[email protected]",
    "base_alias": "add",
    "alias": "add",
    "pkg_desc": "add two double-precision floating-point numbers",
    "desc": "adds two double-precision floating-point numbers",
    "short_desc": "",
    "parameters": [
        {
            "name": "x",
            "desc": "first input value",
            "type": {
                "javascript": "number",
                "jsdoc": "number",
                "c": "double",
                "dtype": "float64"
            },
            "domain": [
                {
                    "min": "-infinity",
                    "max": "infinity"
                }
            ],
            "rand": {
                "prng": "random/base/uniform",
                "parameters": [
                    -10.0,
                    10.0
                ]
            },
            "example_values": [
                -1.2,
                2.0,
                -3.1,
                -4.7,
                5.5,
                6.7
            ]
        },
        {
            "name": "y",
            "desc": "second input value",
            "type": {
                "javascript": "number",
                "jsdoc": "number",
                "c": "double",
                "dtype": "float64"
            },
            "domain": [
                {
                    "min": "-infinity",
                    "max": "infinity"
                }
            ],
            "rand": {
                "prng": "random/base/uniform",
                "parameters": [
                    -10.0,
                    10.0
                ]
            },
            "example_values": [
                3.1,
                -4.2,
                5.0,
                -1.0,
                -2.0,
                6.2
            ]
        }
    ],
    "returns": {
        "desc": "sum",
        "type": {
            "javascript": "number",
            "jsdoc": "number",
            "c": "double",
            "dtype": "float64"
        }
    },
    "keywords": [
        "sum",
        "add",
        "addition",
        "total",
        "summation"
    ],
    "extra_keywords": []
}

stats/base/dists/arcsine/pdf:

{
    "$schema": "stats/base/[email protected]",
    "base_alias": "pdf",
    "alias": "pdf",
    "pkg_desc": "arcsine distribution probability density function (PDF)",
    "desc": "evaluates the probability density function (PDF) for an arcsine distribution with parameters `a` (minimum support) and `b` (maximum support)",
    "short_desc": "probability density function (PDF) for an arcsine distribution",
    "parameters": [
        {
            "name": "x",
            "desc": "input value",
            "type": {
                "javascript": "number",
                "jsdoc": "number",
                "c": "double",
                "dtype": "float64"
            },
            "domain": [
                {
                    "min": "-infinity",
                    "max": "infinity"
                }
            ],
            "rand": {
                "prng": "random/base/uniform",
                "parameters": [
                    -10.0,
                    10.0
                ]
            },
            "example_values": [
                2.0,
                5.0,
                0.25,
                1.0,
                -0.5,
                -3.0
            ]
        },
        {
            "name": "a",
            "desc": "minimum support",
            "type": {
                "javascript": "number",
                "jsdoc": "number",
                "c": "double",
                "dtype": "float64"
            },
            "domain": [
                {
                    "min": "-infinity",
                    "max": "infinity"
                }
            ],
            "rand": {
                "prng": "random/base/uniform",
                "parameters": [
                    -10.0,
                    10.0
                ]
            },
            "example_values": [
                0.0,
                3.0,
                -2.5,
                1.0,
                -1.25,
                -5.0
            ]
        },
        {
            "name": "b",
            "desc": "maximum support",
            "type": {
                "javascript": "number",
                "jsdoc": "number",
                "c": "double",
                "dtype": "float64"
            },
            "domain": [
                {
                    "min": "-infinity",
                    "max": "infinity"
                }
            ],
            "rand": {
                "prng": "random/base/uniform",
                "parameters": [
                    10.0,
                    20.0
                ]
            },
            "example_values": [
                3.0,
                7.0,
                2.5,
                2.0,
                10.0,
                -2.0
            ]
        }
    ],
    "returns": {
        "desc": "evaluated PDF",
        "type": {
            "javascript": "number",
            "jsdoc": "number",
            "c": "double",
            "dtype": "float64"
        }
    },
    "keywords": [
        "probability",
        "pdf",
        "arcsine",
        "continuous",
        "univariate"
    ],
    "extra_keywords": []
}
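To illustrate the scaffolding use case, here is a minimal sketch which renders a JSDoc comment from the `desc`, `parameters`, and `returns` fields of a meta data object shaped like the examples above. The function name `renderJSDoc` is hypothetical, and the output format only approximates the JSDoc style used in stdlib source files.

```javascript
/**
* Renders a JSDoc comment from structured meta data.
*
* @param {Object} meta - structured meta data
* @returns {string} JSDoc comment
*/
function renderJSDoc( meta ) {
    var lines = [ '/**' ];

    // Capitalize the description and terminate it with a period:
    lines.push( '* ' + meta.desc.charAt( 0 ).toUpperCase() + meta.desc.slice( 1 ) + '.' );
    lines.push( '*' );

    // One `@param` tag per parameter, using the JSDoc-specific type:
    meta.parameters.forEach( function onParam( p ) {
        lines.push( '* @param {' + p.type.jsdoc + '} ' + p.name + ' - ' + p.desc );
    });
    lines.push( '* @returns {' + meta.returns.type.jsdoc + '} ' + meta.returns.desc );
    lines.push( '*/' );
    return lines.join( '\n' );
}

// Trimmed copy of the math/base/ops/add example (only the fields used here):
var meta = {
    'desc': 'adds two double-precision floating-point numbers',
    'parameters': [
        {
            'name': 'x',
            'desc': 'first input value',
            'type': { 'jsdoc': 'number' }
        },
        {
            'name': 'y',
            'desc': 'second input value',
            'type': { 'jsdoc': 'number' }
        }
    ],
    'returns': {
        'desc': 'sum',
        'type': { 'jsdoc': 'number' }
    }
};

console.log( renderJSDoc( meta ) );
```

The same meta data object could feed analogous renderers for README sections or REPL help text, which is where a single source of truth would pay off.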

Annotated Overview

{
    // Each configuration object should include the schema name and version so that tooling can gracefully handle migrations and eventual schema evolution:
    "$schema": "math/[email protected]", // math/base indicates that this schema applies to those packages within the math/base namespace. Different namespaces are likely to have different schema needs; hence, the requirement to specify which schema the structured package meta data is expected to conform to.

    // The "base" alias is the alias without, e.g., Hungarian notation prefixes and suffixes:
    "base_alias": "add",

    // The alias is the "base" alias and any additional type information:
    "alias": "add",

    // The package description used in the `package.json` and README:
    "pkg_desc": "add two double-precision floating-point numbers",

    // The description used when documenting JSDoc and REPL.txt files:
    "desc": "adds two double-precision floating-point numbers",

    // A short description which can be used by higher order packages or in other contexts:
    "short_desc": "",

    // A list of API parameters:
    "parameters": [
        {
            // The parameter name as used in API signatures and JSDoc:
            "name": "x",

            // A parameter description:
            "desc": "first input value",

            // Parameter type information as conveyed in various implementation contexts:
            "type": {
                "javascript": "number",
                "jsdoc": "number",
                "c": "double",

                // This field would have more prominence in higher-order APIs, such as those involving ndarrays, where the JavaScript value may be `ndarray`, but we want to ensure we use an ndarray object having a float64 data type:
                "dtype": "float64"
            },

            // The mathematical domain of accepted values (note: this is an array as some math functions have split domains):
            "domain": [
                {
                    "min": "-infinity",
                    "max": "infinity"
                }
            ],

            // Configuration for generating valid random values for this parameter:
            "rand": {
                // A package name for a suitable PRNG:
                "prng": "random/base/uniform",

                // Parameter values to be supplied to the PRNG:
                "parameters": [
                    -10.0,
                    10.0
                ]
            },

            // Concrete values to be used in examples (note: these could possibly be automatically generated according to the `rand` configuration above):
            "example_values": [
                -1.2,
                2.0,
                -3.1,
                -4.7,
                5.5,
                6.7
            ]
        },
        ...
    ],

    // Configuration for the return value (if one exists):
    "returns": {
        // Return value description, as might be used in JSDoc and REPL.txt:
        "desc": "sum",

        // Return value type information:
        "type": {
            "javascript": "number",
            "jsdoc": "number",
            "c": "double",
            "dtype": "float64"
        }
    },

    // A list of keywords without all the boilerplate keywords commonly included in `package.json`:
    "keywords": [
        "sum",
        "add",
        "addition",
        "total",
        "summation"
    ],

    // Additional keywords (e.g., the built-in API equivalent, such as Math.abs):
    "extra_keywords": []
}
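The `domain` and `rand` fields could be consumed along the following lines. This is a sketch under several assumptions: the lookup table stands in for loading the package named by `rand.prng` (here backed by `Math.random` rather than an actual stdlib PRNG), domain bounds are treated as inclusive, and the helper names are hypothetical.

```javascript
// Hypothetical lookup table standing in for loading the package named by `rand.prng`:
var PRNGS = {
    'random/base/uniform': function uniform( a, b ) {
        return a + ( Math.random() * ( b - a ) );
    }
};

// Converts a domain bound, which may be the string "infinity" or "-infinity", to a number:
function toBound( v ) {
    if ( v === 'infinity' ) {
        return Infinity;
    }
    if ( v === '-infinity' ) {
        return -Infinity;
    }
    return Number( v );
}

// Tests whether a value falls within any interval of a (possibly split) domain.
// Assumption: bounds are inclusive; the RFC does not say either way.
function inDomain( value, domain ) {
    return domain.some( function check( d ) {
        return value >= toBound( d.min ) && value <= toBound( d.max );
    });
}

// Generates up to `n` random values for a parameter according to its `rand` configuration:
function randomValues( param, n ) {
    var prng = PRNGS[ param.rand.prng ];
    var out = [];
    var v;
    var i;
    if ( typeof prng !== 'function' ) {
        throw new Error( 'unsupported PRNG: ' + param.rand.prng );
    }
    // Rejection sampling, capped so that a misconfigured `rand` entry cannot loop forever:
    for ( i = 0; i < 100 * n && out.length < n; i++ ) {
        v = prng.apply( null, param.rand.parameters );
        if ( inDomain( v, param.domain ) ) {
            out.push( v );
        }
    }
    return out;
}

// The `x` parameter from the annotated example (only the fields used here):
var param = {
    'name': 'x',
    'domain': [ { 'min': '-infinity', 'max': 'infinity' } ],
    'rand': {
        'prng': 'random/base/uniform',
        'parameters': [ -10.0, 10.0 ]
    }
};

console.log( randomValues( param, 5 ) );
```

Filtering generated values through the declared domain is one way to catch a `rand` configuration which has drifted out of sync with the domain.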

Discussion

The most prominent risk is that this is yet another place where meta data can drift and something more we need to maintain. While true, I think having structured meta data has benefits which outweigh the additional costs, particularly when we consider how often we wrap "base" functionality as part of higher order APIs. Given that we've had a recurring need for such meta data, we'll eventually need some sort of standardized way of storing this meta data.
One benefit of having this structured meta data is that this could better enable AI tools, such as those provided by OpenAI, to scaffold new packages involving "base" implementations.

Related Issues

No.

Questions

What other data, if any, should be included?
One open question is whether we should include support for constraints? E.g., in the arcsine PDF function a < b. In the example JSON, I've simply manually adjusted the PRNG parameters and the example values to ensure we don't run afoul of that constraint. It was not clear to me how we might include such constraints in a universal way which is machine parseable and actionable in scaffolding tools.
Which other package namespaces might benefit from structured meta data and how would their schemas differ from the examples above?
The proposal above suggests adding the meta data to package.json files. This could lead to bloat in the package.json files. Another possibility is putting such info in a separate .stdlibrc file in the root package directory. Would this be preferable?
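To make the constraints question more concrete, here is one possible encoding, sketched purely for discussion and not part of the proposal: a hypothetical `constraints` list of pairwise relations between parameter names, checked against the example values. It assumes that the i-th example values of each parameter are meant to be used together, which the arcsine example suggests but the RFC does not state.

```javascript
// Supported relational operators for the hypothetical `constraints` entries:
var OPS = {
    '<': function lt( u, v ) {
        return u < v;
    },
    '<=': function lte( u, v ) {
        return u <= v;
    },
    '>': function gt( u, v ) {
        return u > v;
    },
    '>=': function gte( u, v ) {
        return u >= v;
    }
};

/**
* Returns the indices of example value tuples which violate at least one constraint.
*
* @param {Array<Object>} parameters - parameter meta data
* @param {Array<Object>} constraints - hypothetical constraint list
* @returns {Array<number>} indices of violating tuples
*/
function findViolations( parameters, constraints ) {
    var out = [];
    var n = parameters[ 0 ].example_values.length; // assumes equal-length lists
    var i;

    function tupleAt( idx ) {
        var vals = {};
        parameters.forEach( function collect( p ) {
            vals[ p.name ] = p.example_values[ idx ];
        });
        return vals;
    }
    for ( i = 0; i < n; i++ ) {
        if ( !satisfiesAll( tupleAt( i ), constraints ) ) {
            out.push( i );
        }
    }
    return out;
}

// Tests whether a name-to-value map satisfies every constraint:
function satisfiesAll( vals, constraints ) {
    return constraints.every( function check( c ) {
        return OPS[ c.op ]( vals[ c.lhs ], vals[ c.rhs ] );
    });
}

// Example values copied from the arcsine PDF example above:
var parameters = [
    { 'name': 'x', 'example_values': [ 2.0, 5.0, 0.25, 1.0, -0.5, -3.0 ] },
    { 'name': 'a', 'example_values': [ 0.0, 3.0, -2.5, 1.0, -1.25, -5.0 ] },
    { 'name': 'b', 'example_values': [ 3.0, 7.0, 2.5, 2.0, 10.0, -2.0 ] }
];

// Hypothetical encoding of the requirement `a < b`:
var constraints = [
    { 'lhs': 'a', 'op': '<', 'rhs': 'b' }
];

console.log( findViolations( parameters, constraints ) );
// => []
```

Pairwise relations of this kind cover `a < b` but not richer constraints (e.g., those involving arithmetic on parameters), which is part of why a universal encoding is hard to pin down.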

Other

No.

cc @Planeshifter

Checklist

I have read and understood the Code of Conduct.
Searched for existing issues and pull requests.
The issue name begins with RFC:.

kgryte · 2023-12-01T01:40:08Z

@Planeshifter Given your previous efforts to build scaffolding tooling, would be good to get your opinion on the above proposal and what, if any, additional structured information might be useful.

kgryte · 2024-01-28T06:57:44Z

@Planeshifter Pinging you here, in case you have forgotten about this issue.

Snehil-Shah · 2024-03-06T20:50:15Z

@kgryte is this in the works?

kgryte · 2024-03-07T01:24:53Z

@Snehil-Shah Sort of. We've created a Google sheet for collecting this information, but that effort has stalled. Something like this would be rather useful, but it involves a fair amount of manual labor, and we haven't had the bandwidth to push forward.

adityacodes30 · 2024-04-09T11:04:55Z

Opening up a tracking issue for this one should help us move forward, since it does require a good number of additions to be made. Should we open one?

adityacodes30 · 2024-04-09T11:10:23Z

From what I gather, resolving this will help in the scaffolding process for both the GSheets project and developing C implementations, right?

kgryte · 2024-04-09T20:53:20Z

@adityacodes30 Before opening up a tracking issue, we need to settle on the desired path forward. But, yes, this is also relevant to the scaffolding process for both GSheets and the C implementation work.

adityacodes30 · 2024-04-12T09:02:10Z

Generally, I think we should start with math/base/special; the second priority would be blas/ext/base. But this is as it pertains to GSheets. We would have to see what the community thinks.

Planeshifter · 2024-05-02T01:59:45Z

My main concern, and it is for me a serious one, is that this increases duplication of package documentation even more, which is already quite excessive. If we undertake this, I think it's necessary to at the same time build tooling (either LLM-assisted or just deterministic) that scaffolds out the required other files such as repl.txt. There is a trap that this will be decently easy to add for existing packages but then cause an additional burden when trying to add new packages. I feel we have encountered this several times in the past so we should have a good answer to address this concern.

As for the proposed schema, it seems sensible. I would drop keywords and extra_keywords and instead follow the previously discussed approach of excluding all boilerplate keywords from the keywords array of the package.json in the development repo and then populate them during the release process.

Here are my answers to the raised questions:

What other data, if any, should be included?

Not sure. Maybe something for testing. Should it support options object definitions?

One open question is whether we should include support for constraints? E.g., in the arcsine PDF function a < b. In the example JSON, I've simply manually adjusted the PRNG parameters and the example values to ensure we don't run afoul of that constraint. It was not clear to me how we might include such constraints in a universal way which is machine parseable and actionable in scaffolding tools.

This is a pretty deep rabbit hole and not something that should be encoded in metadata, I think. Burden would be on the person populating the metadata to make sure any constraints are satisfied.

Which other package namespaces might benefit from structured meta data and how would their schemas differ from the examples above?

Probably most that have base implementations and those that need package variants that operate on ndarrays and strided arrays. So stats/base/dists, string, etc. But keeping scope limited and not branching out to all kinds of packages seems prudent.

The proposal above suggests adding the meta data to package.json files. This could lead to bloat in the package.json files. Another possibility is putting such info in a separate .stdlibrc file in the root package directory. Would this be preferable?

In my view, bloat will not be an issue. Metadata would be stripped when publishing packages, so this would only affect the development environment. package.json is familiar to folks as source of package metadata, and easily loadable as JSON. Custom file would be unfamiliar to developers.

kgryte · 2024-05-02T02:51:00Z

Re: extra keywords. The point here is that there are keywords which are universal for a particular conceptual function and which should be included in all downstream scaffolded packages, and others which are not universal and which a scaffolding tool may, or may not, be interested in using.
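A minimal sketch of that distinction, assuming a scaffolding tool opts in to `extra_keywords` and optionally prepends boilerplate keywords: the function name, the option names, and the sample keyword values below are all hypothetical.

```javascript
/**
* Resolves the keyword list for a scaffolded package.
*
* @param {Object} meta - structured meta data
* @param {Object} [opts] - options
* @param {boolean} [opts.includeExtra] - whether to include `extra_keywords`
* @param {Array<string>} [opts.boilerplate] - boilerplate keywords to prepend
* @returns {Array<string>} keyword list without duplicates
*/
function resolveKeywords( meta, opts ) {
    var options = opts || {};
    var list = ( options.boilerplate || [] ).concat( meta.keywords );

    // Non-universal keywords are only included on request:
    if ( options.includeExtra ) {
        list = list.concat( meta.extra_keywords );
    }
    // Remove duplicates while preserving order:
    return list.filter( function isFirst( k, i ) {
        return list.indexOf( k ) === i;
    });
}

// Illustrative values only (not taken from an actual package):
var meta = {
    'keywords': [ 'abs', 'absolute', 'magnitude' ],
    'extra_keywords': [ 'math.abs' ]
};

console.log( resolveKeywords( meta ) );
// => [ 'abs', 'absolute', 'magnitude' ]

console.log( resolveKeywords( meta, { 'includeExtra': true } ) );
// => [ 'abs', 'absolute', 'magnitude', 'math.abs' ]
```

Prepending boilerplate at resolution time would also be compatible with populating boilerplate keywords during the release process rather than storing them in the development repository.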

kgryte · 2024-05-02T02:55:30Z

increases duplication of package documentation even more...this will be decently easy to add for existing packages but then cause an additional burden when trying to add new packages.

I don't have a simple answer here. To me, it is a balance of trade-offs. Right now, the situation is not tenable, as we need to individually define example ranges, aliases, etc, for all higher order packages (e.g., strided, iter, ndarray), which vastly outweighs the maintenance and creation burden if we bite the bullet when creating a base package in the first place.

kgryte added RFC Request for comments. Feature requests and proposed changes. Feature Issue or pull request for adding a new feature. labels Nov 29, 2023

kgryte added the Needs Discussion Needs further discussion. label Mar 7, 2024

gunjjoshi mentioned this issue Mar 17, 2024

[RFC]: develop C implementations for base special mathematical functions stdlib-js/google-summer-of-code#41

Closed

5 tasks

gunjjoshi mentioned this issue Sep 11, 2024

chore: add structured package data for math/base/special/exp #2893

Merged

1 task

This was referenced Sep 17, 2024

chore: add structured package data for math/base/special/cbrt #2909

Merged

chore: add structured package data for math/base/special/pow #2912

Merged

docs: update descriptions and add structured package data for math/base/special/gcd #2914

Merged

kgryte pushed a commit that referenced this issue Sep 17, 2024

chore: add structured package data for math/base/special/pow

f51140f

PR-URL: #2912 Ref: #1147 Reviewed-by: Athan Reines <[email protected]>

gunjjoshi mentioned this issue Sep 18, 2024

build: add a script to find packages with structured meta data #2922

Merged

1 task

kgryte added a commit that referenced this issue Sep 18, 2024

build: add a script to find packages with structured meta data

d965575

PR-URL: #2922 Ref: #1147 Co-authored-by: Athan Reines <[email protected]> Reviewed-by: Athan Reines <[email protected]>

gunjjoshi mentioned this issue Sep 20, 2024

build: update scaffolding for creating unary math iterator packages #2927

Merged

1 task

gunjjoshi mentioned this issue Oct 8, 2024

build: update scaffolding for creating unary math strided packages #2993

Open

1 task

Neerajpathak07 mentioned this issue Mar 17, 2025

[RFC]: develop C implementation for base special mathematical & base statistical distribution functions stdlib-js/google-summer-of-code#107

Open

7 tasks
