Skip to content

Enable Excel function implementations for use in "array formulae" #2551

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
MarkBaker opened this issue Feb 4, 2022 · 2 comments
Open

Enable Excel function implementations for use in "array formulae" #2551

MarkBaker opened this issue Feb 4, 2022 · 2 comments
Labels
calculation engine pinned pinned issue to avoid them becoming stale
Milestone

Comments

@MarkBaker
Copy link
Member

MarkBaker commented Feb 4, 2022

This is:

- [ ] a bug report
- [X] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

Preparation of Excel function implementations allowing them to be used in "array formulae".

Typically in MS Excel, arguments passed to a function are simple "scalar" values, e.g.
=ABS(-3)

But MS Excel does allow for that argument passed to a function also to be a matrix/array of values:
=ABS({-3, 2.5, -1; 0, -1, -2})
This is still considered a single argument, but the argument is a matrix rather than a simple "scalar". In this case, MS Excel will return a matrix with the same dimensions, where the function has been applied to each individual "scalar" value in that matrix. In the above case, this will result in a response of
{3, 2.5, 1; 0, 1, 2}
That result may then be processed by other Excel functions in the formula (e.g. MAX()) to reduce it to a single value; or it may be displayed across multiple cells.


Currently, if we write a cell with the formula =MAX(ABS({-3, 2.5, -1; 0, -1, -12})), the Calculation Engine will make the call to ABS() with the array of arguments, but will discard all but the first value before determining the absolute of that value, which gives a result of 3 (the absolute value of the first array argument of -3); so the subsequent call to MAX() will only receive the value 3, giving a final (incorrect) result of 3.

Evaluation Log:
Array
(
    [0] => Testing cache value for cell Formula Test!A1
    [1] => Evaluating formula for cell Formula Test!A1
    [2] => Formula for cell Formula Test!A1 is MAX(ABS({-3, 2.5, -1; 0, -1, -12}))
    [3] => Formula Test!A1 => Evaluating Negation of 3
    [4] => Formula Test!A1 => Evaluation Result is an integer number with a value of -3
    [5] => Formula Test!A1 => Evaluating Negation of 1
    [6] => Formula Test!A1 => Evaluation Result is an integer number with a value of -1
    [7] => Formula Test!A1 => Evaluating Negation of 1
    [8] => Formula Test!A1 => Evaluation Result is an integer number with a value of -1
    [9] => Formula Test!A1 => Evaluating Negation of 12
    [10] => Formula Test!A1 => Evaluation Result is an integer number with a value of -12
    [11] => Formula Test!A1 => Evaluating Function ABS() with 1 argument
    [12] => Formula Test!A1 => Evaluating ABS( { -3, 2.5, -1; 0, -1, -12 } )
    [13] => Formula Test!A1 => Evaluation Result for ABS() function call is an integer number with a value of 3
    [14] => Formula Test!A1 => Evaluating Function MAX() with 1 argument
    [15] => Formula Test!A1 => Evaluating MAX( 3 )
    [16] => Formula Test!A1 => Evaluation Result for MAX() function call is an integer number with a value of 3
)

With these changes, ABS() evaluates every entry in the array, and returns an array of absolute values that is then passed to MAX(), so the call to max has the full set of values (3, 2., 1, 1, 12) and can identify the correct maximum value of 12.

Evaluation Log:
Array
(
    [0] => Testing cache value for cell Formula Test!A1
    [1] => Evaluating formula for cell Formula Test!A1
    [2] => Formula for cell Formula Test!A1 is MAX(ABS({-3, 2.5, -1; 0, -1, -12}))
    [3] => Formula Test!A1 => Evaluating Negation of 3
    [4] => Formula Test!A1 => Evaluation Result is an integer number with a value of -3
    [5] => Formula Test!A1 => Evaluating Negation of 1
    [6] => Formula Test!A1 => Evaluation Result is an integer number with a value of -1
    [7] => Formula Test!A1 => Evaluating Negation of 1
    [8] => Formula Test!A1 => Evaluation Result is an integer number with a value of -1
    [9] => Formula Test!A1 => Evaluating Negation of 12
    [10] => Formula Test!A1 => Evaluation Result is an integer number with a value of -12
    [11] => Formula Test!A1 => Evaluating Function ABS() with 1 argument
    [12] => Formula Test!A1 => Evaluating ABS( { -3, 2.5, -1; 0, -1, -12 } )
    [13] => Formula Test!A1 => Evaluation Result for ABS() function call is a matrix with a value of { 3, 2.5, 1; 0, 1, 12 }
    [14] => Formula Test!A1 => Evaluating Function MAX() with 1 argument
    [15] => Formula Test!A1 => Evaluating MAX( { 3, 2.5, 1; 0, 1, 12 } )
    [16] => Formula Test!A1 => Evaluation Result for MAX() function call is an integer number with a value of 12
)

This series of changes will ensure that arrays are returned by function calls when appropriate (rather than single values), and are then passed correctly through the Calculation Engine call stack as arrays rather than single scalar values. In the case of our formula MAX(ABS({-3, 2.5, -1; 0, -1, -12})), it will allow a correct result of 12 rather than an incorrect result of 3 to be returned.


It will not change the way that a cell handles the result of a matrix being returned to the getCalculatedValue() call; it will still be reduced to the value of the first entry in that matrix. Handling for that change is ongoing as part of PR #2539, which includes a BC break, meaning that it will only be released with PhpSpreadsheet 2.0.0.

image

However, in addition to resolving returns from the Excel function implementations so that values aren't lost in a chain of calls, this is also preparatory work for that final PR #2539 handling of array results. This change ensures that a call passing an array to an Excel function results in the return of an array from that function so that the final cell-level handling in PR #2539 receives an array when it should.

Example of an existing function (The Math/Trig ABS() function)

This is relatively straightforward to implement for functions that only accept a single argument: the following shows the changes required for a typical function.

Current implementation:

class Absolute
{
    /**
     * ABS.
     *
     * Returns the result of builtin function abs after validating args.
     *
     * @param mixed $number Should be numeric, 
     *
     * @return float|int|string Rounded number
     */
    public static function evaluate($number)
    {
        try {
            $number = Helpers::validateNumericNullBool($number);
        } catch (Exception $e) {
            return $e->getMessage();
        }

        return abs($number);
    }
}

Updated (array function enabled implementation)

class Absolute
{
    use ArrayEnabled;

    /**
     * ABS.
     *
     * Returns the result of builtin function abs after validating args.
     *
     * @param mixed $number Should be numeric, or can be an array of numbers
     *
     * @return array|float|int|string rounded number
     *         If an array of numbers is passed as the argument, then the returned result will also be an array
     *            with the same dimensions
     */
    public static function evaluate($number)
    {
        if (is_array($number)) {
            return self::evaluateSingleArgumentArray([self::class, __FUNCTION__], $number);
        }

        try {
            $number = Helpers::validateNumericNullBool($number);
        } catch (Exception $e) {
            return $e->getMessage();
        }

        return abs($number);
    }
}

Changes to the unit tests to verify correct behaviour when functions are enabled for use in array formulae

    /**
     * @dataProvider providerAbsArray
     */
    public function testAbsoluteArray(array $expectedResult, string $array): void
    {
        $calculation = Calculation::getInstance();

        $formula = "=ABS({$array})";
        $result = $calculation->_calculateFormulaValue($formula);
        self::assertEqualsWithDelta($expectedResult, $result, 1.0e-14);
    }

    public function providerAbsArray(): array
    {
        return [
            'row vector' => [[[1, 0, 1]], '{-1, 0, 1}'],
            'column vector' => [[[1], [0], [1]], '{-1; 0; 1}'],
            'matrix' => [[[1, 0], [1, 12]], '{-1, 0; 1, -12}'],
        ];
    }

Example of an existing function that accepts multiple arguments (The Math/Trig ATAN2() function)... (comments deleted to reduce code noise)

Current implementation:

class Tangent
{
    /**
     * ATAN2.
     *
     * Excel Function:
     *        ATAN2(xCoordinate,yCoordinate)
     *
     * @param mixed $xCoordinate should be float, the x-coordinate of the point
     * @param mixed $yCoordinate should be float, the y-coordinate of the point
     *
     * @return float|string the inverse tangent of the specified x- and y-coordinates, or a string containing an error
     */
    public static function atan2($xCoordinate, $yCoordinate)
    {
        try {
            $xCoordinate = Helpers::validateNumericNullBool($xCoordinate);
            $yCoordinate = Helpers::validateNumericNullBool($yCoordinate);
        } catch (Exception $e) {
            return $e->getMessage();
        }

        if (($xCoordinate == 0) && ($yCoordinate == 0)) {
            return Functions::DIV0();
        }

        return atan2($yCoordinate, $xCoordinate);
    }
}

Updated (array function enabled implementation)

class Tangent
{
    use ArrayEnabled;

    /**
     * ATAN2.
     *
     * Excel Function:
     *        ATAN2(xCoordinate,yCoordinate)
     *
     * @param mixed $xCoordinate should be float, the x-coordinate of the point, or can be an array of numbers
     * @param mixed $yCoordinate should be float, the y-coordinate of the point, or can be an array of numbers
     *
     * @return array|float|string
     *         The inverse tangent of the specified x- and y-coordinates, or a string containing an error
     *         If an array of numbers is passed as one of the arguments, then the returned result will also be an array
     *            with the same dimensions
     */
    public static function atan2($xCoordinate, $yCoordinate)
    {
        if (is_array($xCoordinate) || is_array($yCoordinate)) {
            return self::evaluateArrayArguments([self::class, __FUNCTION__], $xCoordinate, $yCoordinate);
        }

        try {
            $xCoordinate = Helpers::validateNumericNullBool($xCoordinate);
            $yCoordinate = Helpers::validateNumericNullBool($yCoordinate);
        } catch (Exception $e) {
            return $e->getMessage();
        }

        if (($xCoordinate == 0) && ($yCoordinate == 0)) {
            return Functions::DIV0();
        }

        return atan2($yCoordinate, $xCoordinate);
    }
}

Changes to the unit tests to verify correct behaviour when functions are enabled for use in array formulae

    /**
     * @dataProvider providerAtan2Array
     */
    public function testAtan2Array(array $expectedResult, string $argument1, string $argument2): void
    {
        $calculation = Calculation::getInstance();

        $formula = "=ATAN2({$argument1},{$argument2})";
        $result = $calculation->_calculateFormulaValue($formula);
        self::assertEqualsWithDelta($expectedResult, $result, 1.0e-14);
    }

    public function providerAtan2Array(): array
    {
        return [
            'first argument row vector' => [
                [[1.81577498992176, 1.17600520709514]],
                '{-0.75, 1.25}',
                '3',
            ],
            'first argument column vector' => [
                [[1.17600520709514], [0.98279372324733]],
                '{1.25; 2}',
                '3',
            ],
            'first argument matrix' => [
                [[2.03444393579570, 1.48765509490646], [1.57079632679490, 1.24904577239825]],
                '{-1.5, 0.25; 0, 1}',
                '3',
            ],
            'second argument row vector' => [
                [[-0.24497866312686, 0.39479111969976]],
                '3',
                '{-0.75, 1.25}',
            ],
            'second argument column vector' => [
                [[0.39479111969976], [0.58800260354757]],
                '3',
                '{1.25; 2}',
            ],
            'second argument matrix' => [
                [[-0.46364760900081, 0.08314123188844], [0.0, 0.32175055439664]],
                '3',
                '{-1.5, 0.25; 0, 1}',
            ],
            'A row and a column vector' => [
                [
                    [-2.21429743558818, 2.81984209919315, 2.55359005004223, 1.92956699706547],
                    [-1.69515132134166, 2.03444393579570, 1.81577498992176, 1.63321513679085],
                    [-1.01219701145133, 0.38050637711237, 0.67474094222355, 1.26791145841993],
                    [-0.51914611424652, 0.14189705460416, 0.27829965900511, 0.85196632717327],
                ],
                '{-1.5; -0.25; 1.25; 3.5}',
                '{-2, 0.5, 1, 4}',
            ],
            'Two row vectors' => [
                [[-2.21429743558818, 2.03444393579570, 0.67474094222355, 0.85196632717327]],
                '{-1.5, -0.25, 1.25, 3.5}',
                '{-2, 0.5, 1, 4}',
            ],
            'Two column vectors' => [
                [[-2.21429743558818], [2.03444393579570], [0.67474094222355], [0.85196632717327]],
                '{-1.5; -0.25; 1.25; 3.5}',
                '{-2; 0.5; 1; 4}',
            ],
        ];
    }

The function test for ROUND() contains a lot more variations on different combinations and sizes of arrays.

Functions that accept arguments using the splat operator, or more than 2 arguments, need to be assessed on a case by case basis. However, for the example of WORKDAYS(), which accepts two "static" arguments, then uses the splat operator to accept additional arguments, the evaluateArrayArgumentsSubset() method allows the call to indicate the static arguments that could be arrays, but to process the array of trailing arguments accepted by the method "normally":

    public static function date($startDate, $endDays, ...$dateArgs)
    {
        if (is_array($startDate) || is_array($endDays)) {
            return self::evaluateArrayArgumentsSubset(
                [self::class, __FUNCTION__],
                2,
                $startDate,
                $endDays,
                ...$dateArgs
            );
        }

        ...
    }

We only check to see if the "static" argument values are arrays; and we call evaluateArrayArgumentsSubset() with the additional limit argument, that tells the code logic that only the first two arguments should be processed for the purposes of array testing.

Why is this required?

Besides fixing some basic array formulae when passed through the call stack in the Calculation engine (as described above):

One of the planned new features for PhpSpreadsheet 2.0 is support for array formulae, including the new array functions like SEQUENCE(), SORT(), FILTER(), etc; and also the new Spill and Single operators.

While the PhpSpreadsheet operators already support Excel matrix/array handling, most of the function implementations don't yet; so this is preparation work in anticipation of providing full support for array formulae in version 2.0.

This is linked to the work ongoing in PR #2539 on branch CalculationEngine-Array-Formulae-Initial-Work

These changes to the function implementations can be done ahead in the current codebase ahead of the 2.0 release. The existing Calculation Engine will simply discard all but the very first scalar value from any matrix of values returned by an Excel function: that is already existing behaviour; but it may provide correct results when using arrays as arguments when the current implementation does not (as in the case of our =MAX(ABS({-3, 2.5, -1; 0, -1, -12})) example).

Examples of Excel array functions, and the proposed support/implementation can be found in the documentation

@MarkBaker MarkBaker added calculation engine pinned pinned issue to avoid them becoming stale labels Feb 4, 2022
@MarkBaker MarkBaker added this to the 2.0 milestone Feb 4, 2022
@MarkBaker
Copy link
Member Author

MarkBaker commented Feb 4, 2022

With some initial work; I've decided to move the evaluateArray() method out into an ArrayEnabled trait that can then be used for many different functions

This also means that use of the ArrayEnabled trait in a function implementation class is immediately visible at the top of the class file. While it doesn't mean that all function implementations in that class have been modified to use the trait, it does give some indication that those functions have probably already been updated, but it's absence does indicate that the functions in that class do need assessing.

Methods in the ArrayEnable trait that have visibility of protected should be the only methods called from function implementations; methods set as private are intended purely for internal use inside the trait; and ma be separated out into a separate class at some point.

@MarkBaker
Copy link
Member Author

MarkBaker commented Feb 7, 2022

Logic currently supports either a single array argument, which can be

  • a row vector,
  • a column vector
  • or a matrix

or two array arguments which can be

  • a row vector and a column vector
  • a row vector and a row vector
  • a column vector and a column vector
  • a matrix and a row vector
  • a matrix and a column vector
  • a matrix and a matrix

Currently unsupported

  • more than 2 array arguments (throws an exception)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
calculation engine pinned pinned issue to avoid them becoming stale
Development

No branches or pull requests

1 participant