Add ARRAY_UNIQUE_IDENTICAL option #9804

Closed
wants to merge 1 commit

Conversation

iluuu1994
Member

Alternative to GH-9788 with better performance characteristics. Heavily based on GH-7806.

@TysonAndre
Contributor

This has all of the fixes I'd wanted in #7806 and looks correct for floating-point edge cases and array edge cases: if PHP's === returned true for a pair of elements, then array_unique() would filter out the duplicate.

Limiting this flag and proposal to array_unique() for an RFC is the best approach in my opinion, given that there's no intuitive sort order for all edge cases of user-defined and internal classes and their subclasses.
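As a rough userland sketch of the ===-based deduplication described above (a polyfill for illustration only; the function name is made up and the PR implements this in C):

```php
<?php

// Illustrative polyfill: keep the first occurrence of each value,
// comparing candidates with === only (no stringification, no sorting).
function array_unique_identical_polyfill(array $input): array
{
    $result = [];
    foreach ($input as $key => $value) {
        foreach ($result as $kept) {
            if ($kept === $value) {
                continue 2; // strictly identical duplicate, drop it
            }
        }
        $result[$key] = $value; // original keys preserved, like array_unique
    }
    return $result;
}

// Unlike the SORT_STRING default, 1, '1' and 1.0 all stay distinct:
var_export(array_unique_identical_polyfill([1, '1', 1.0, 1]));
```

This is O(n²), so it only illustrates the semantics, not the performance being discussed.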

@TysonAndre
Contributor

TysonAndre commented Oct 22, 2022

I'd compared a few approaches to getting unique values (benchmarking for the case of sets of int values with a lot of duplicates - you may want to try different cases as well) - this is adapted from https://github.com/TysonAndre/pecl-teds/blob/main/benchmarks/unique_values.php

It seems like a hash table would compare better at different sizes (e.g. comparing array_flip with the array_unique implementation here). The unique_values function from that PECL uses a hash set internally and returns a list rather than an associative array, which is faster, but I'd expect the performance of array_flip to be much better.

  • Having this new flag perform faster than the default array_unique (while working on all of php's value types, including arrays or objects (especially enums) or mixes of scalars) might be a compelling argument for it
  • Hash tables also do better than
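The array_flip comparison mentioned above, for reference (this only works when every value is a valid array key, i.e. int or string, so it is not a general replacement for array_unique):

```php
<?php

// Deduplicate integers by flipping values into keys: duplicate values
// collapse onto one key slot, and keys keep first-insertion order.
$values = [3, 1, 3, 2, 1];
$unique = array_keys(array_flip($values));
var_export($unique); // a reindexed list of the distinct values
```

Because array_keys reindexes, this returns a list rather than preserving the original keys, similar to Teds\unique_values.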

If this went with a hash-set-based approach, I'd recommend putting the hash struct and function definitions in ext/standard/array.c and not making them public C APIs (i.e. leaving them as internal implementation details, keeping only the parts needed, and allowing for zend_long)

I could help in porting those so that any PHP values that are === will have identical hashes

Benchmark results

(There's some variance in results, this was only one run)

Benchmarking of an array of many integers with few duplicates, n random numbers from 0 to n*4 (click to expand)
Results for php 8.3.0-dev debug=0 with opcache enabled=true

Note that Teds\sorted_set is also sorting the elements and maintaining a balanced binary search tree.
           bench_unique_values n=       1 iterations= 8000000 time=0.378 sum=24000000
           bench_teds_hash_set n=       1 iterations= 8000000 time=1.383 sum=24000000
         bench_array_flip_keys n=       1 iterations= 8000000 time=0.895 sum=24000000
   bench_teds_sortedvector_set n=       1 iterations= 8000000 time=1.249 sum=24000000
           bench_teds_tree_set n=       1 iterations= 8000000 time=1.377 sum=24000000
            bench_array_unique n=       1 iterations= 8000000 time=0.311 sum=24000000
  bench_array_unique_identical n=       1 iterations= 8000000 time=0.312 sum=24000000

           bench_unique_values n=       4 iterations= 2000000 time=0.247 sum=58000000
           bench_teds_hash_set n=       4 iterations= 2000000 time=0.447 sum=58000000
         bench_array_flip_keys n=       4 iterations= 2000000 time=0.322 sum=58000000
   bench_teds_sortedvector_set n=       4 iterations= 2000000 time=0.675 sum=58000000
           bench_teds_tree_set n=       4 iterations= 2000000 time=0.589 sum=58000000
            bench_array_unique n=       4 iterations= 2000000 time=0.424 sum=58000000
  bench_array_unique_identical n=       4 iterations= 2000000 time=0.334 sum=58000000

           bench_unique_values n=       8 iterations= 1000000 time=0.186 sum=142000000
           bench_teds_hash_set n=       8 iterations= 1000000 time=0.299 sum=142000000
         bench_array_flip_keys n=       8 iterations= 1000000 time=0.232 sum=142000000
   bench_teds_sortedvector_set n=       8 iterations= 1000000 time=0.571 sum=142000000
           bench_teds_tree_set n=       8 iterations= 1000000 time=0.484 sum=142000000
            bench_array_unique n=       8 iterations= 1000000 time=0.472 sum=142000000
  bench_array_unique_identical n=       8 iterations= 1000000 time=0.270 sum=142000000

           bench_unique_values n=    1024 iterations=   20000 time=0.452 sum=38490120000
           bench_teds_hash_set n=    1024 iterations=   20000 time=0.415 sum=38490120000
         bench_array_flip_keys n=    1024 iterations=   20000 time=0.403 sum=38490120000
   bench_teds_sortedvector_set n=    1024 iterations=   20000 time=4.000 sum=38490120000
           bench_teds_tree_set n=    1024 iterations=   20000 time=2.989 sum=38490120000
            bench_array_unique n=    1024 iterations=   20000 time=1.393 sum=38490120000
  bench_array_unique_identical n=    1024 iterations=   20000 time=2.672 sum=38490120000

           bench_unique_values n= 1048576 iterations=      20 time=2.203 sum=38933104158740
           bench_teds_hash_set n= 1048576 iterations=      20 time=2.353 sum=38933104158740
         bench_array_flip_keys n= 1048576 iterations=      20 time=2.113 sum=38933104158740
   bench_teds_sortedvector_set n= 1048576 iterations=      20 time=8.897 sum=38933104158740
           bench_teds_tree_set n= 1048576 iterations=      20 time=21.730 sum=38933104158740
            bench_array_unique n= 1048576 iterations=      20 time=5.045 sum=38933104158740
  bench_array_unique_identical n= 1048576 iterations=      20 time=6.479 sum=38933104158740

Implementation
<?php

use function Teds\unique_values;

function bench_unique_values(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n << 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum(unique_values($values));
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_tree_set(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n << 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum((new Teds\StrictTreeSet($values))->values());
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_sortedvector_set(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n << 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum((new Teds\StrictSortedVectorSet($values))->values());
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_hash_set(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n << 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum((new Teds\StrictHashSet($values))->values());
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_array_unique(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n << 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum(array_unique($values));
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
if (defined('ARRAY_UNIQUE_IDENTICAL')) {
function bench_array_unique_identical(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n << 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum(array_unique($values, ARRAY_UNIQUE_IDENTICAL));
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
} /*defined('ARRAY_UNIQUE_IDENTICAL')  */
function bench_array_flip_keys(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n << 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum(array_keys(array_flip($values)));
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
$n = 2**20;
$iterations = 10;
$sizes = [
    [1, 8000000],
    [4, 2000000],
    [8, 1000000],
    [2**10, 20000],
    [2**20, 20],
];
printf(
    "Results for php %s debug=%s with opcache enabled=%s\n\n",
    PHP_VERSION,
    json_encode(PHP_DEBUG),
    json_encode(function_exists('opcache_get_status') && (opcache_get_status(false)['opcache_enabled'] ?? false))
);
echo "Note that Teds\sorted_set is also sorting the elements and maintaining a balanced binary search tree.\n";

foreach ($sizes as [$n, $iterations]) {
    bench_unique_values($n, $iterations);
    bench_teds_hash_set($n, $iterations);
    bench_array_flip_keys($n, $iterations);
    bench_teds_sortedvector_set($n, $iterations);
    bench_teds_tree_set($n, $iterations);
    bench_array_unique($n, $iterations);
    bench_array_unique_identical($n, $iterations);
    echo "\n";
}
Benchmarks of various options (including PECL) for deduplicating values in a list of integers with many duplicates (click to expand)
<?php

use function Teds\unique_values;

function bench_unique_values(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n >> 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum(unique_values($values));
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_tree_set(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n >> 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum((new Teds\StrictTreeSet($values))->values());
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_sortedvector_set(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n >> 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum((new Teds\StrictSortedVectorSet($values))->values());
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_hash_set(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n >> 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum((new Teds\StrictHashSet($values))->values());
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_array_unique(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n >> 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum(array_unique($values));
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
if (defined('ARRAY_UNIQUE_IDENTICAL')) {
function bench_array_unique_identical(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n >> 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum(array_unique($values, ARRAY_UNIQUE_IDENTICAL));
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
} /*defined('ARRAY_UNIQUE_IDENTICAL')  */
function bench_array_flip_keys(int $n, int $iterations) {
    $values = [];
    srand(1234);
    for ($i = 0; $i < $n; $i++) {
        $values[] = rand(0, ($n >> 2) - 1);
    }
    $start = hrtime(true);
    $sum = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $sum += array_sum(array_keys(array_flip($values)));
    }
    $end = hrtime(true);
    printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
$n = 2**20;
$iterations = 10;
$sizes = [
    [1, 8000000],
    [4, 2000000],
    [8, 1000000],
    [2**10, 20000],
    [2**20, 20],
];
printf(
    "Results for php %s debug=%s with opcache enabled=%s\n\n",
    PHP_VERSION,
    json_encode(PHP_DEBUG),
    json_encode(function_exists('opcache_get_status') && (opcache_get_status(false)['opcache_enabled'] ?? false))
);
echo "Note that Teds\sorted_set is also sorting the elements and maintaining a balanced binary search tree.\n";

foreach ($sizes as [$n, $iterations]) {
    bench_unique_values($n, $iterations);
    bench_teds_hash_set($n, $iterations);
    bench_array_flip_keys($n, $iterations);
    bench_teds_sortedvector_set($n, $iterations);
    bench_teds_tree_set($n, $iterations);
    bench_array_unique($n, $iterations);
    bench_array_unique_identical($n, $iterations);
    echo "\n";
}
Results for php 8.3.0-dev debug=0 with opcache enabled=true

Note that Teds\sorted_set is also sorting the elements and maintaining a balanced binary search tree.
           bench_unique_values n=       1 iterations= 8000000 time=0.371 sum=0
           bench_teds_hash_set n=       1 iterations= 8000000 time=1.376 sum=0
         bench_array_flip_keys n=       1 iterations= 8000000 time=0.802 sum=0
   bench_teds_sortedvector_set n=       1 iterations= 8000000 time=1.280 sum=0
           bench_teds_tree_set n=       1 iterations= 8000000 time=1.376 sum=0
            bench_array_unique n=       1 iterations= 8000000 time=0.311 sum=0
  bench_array_unique_identical n=       1 iterations= 8000000 time=0.313 sum=0

           bench_unique_values n=       4 iterations= 2000000 time=0.212 sum=0
           bench_teds_hash_set n=       4 iterations= 2000000 time=0.380 sum=0
         bench_array_flip_keys n=       4 iterations= 2000000 time=0.254 sum=0
   bench_teds_sortedvector_set n=       4 iterations= 2000000 time=0.615 sum=0
           bench_teds_tree_set n=       4 iterations= 2000000 time=0.418 sum=0
            bench_array_unique n=       4 iterations= 2000000 time=0.279 sum=0
  bench_array_unique_identical n=       4 iterations= 2000000 time=0.306 sum=0

           bench_unique_values n=       8 iterations= 1000000 time=0.133 sum=1000000
           bench_teds_hash_set n=       8 iterations= 1000000 time=0.221 sum=1000000
         bench_array_flip_keys n=       8 iterations= 1000000 time=0.202 sum=1000000
   bench_teds_sortedvector_set n=       8 iterations= 1000000 time=0.539 sum=1000000
           bench_teds_tree_set n=       8 iterations= 1000000 time=0.266 sum=1000000
            bench_array_unique n=       8 iterations= 1000000 time=0.214 sum=1000000
  bench_array_unique_identical n=       8 iterations= 1000000 time=0.328 sum=1000000

           bench_unique_values n=    1024 iterations=   20000 time=0.184 sum=636220000
           bench_teds_hash_set n=    1024 iterations=   20000 time=0.202 sum=636220000
         bench_array_flip_keys n=    1024 iterations=   20000 time=0.294 sum=636220000
   bench_teds_sortedvector_set n=    1024 iterations=   20000 time=3.965 sum=636220000
           bench_teds_tree_set n=    1024 iterations=   20000 time=1.829 sum=636220000
            bench_array_unique n=    1024 iterations=   20000 time=1.192 sum=636220000
  bench_array_unique_identical n=    1024 iterations=   20000 time=2.654 sum=636220000

           bench_unique_values n= 1048576 iterations=      20 time=0.821 sum=674350107920
           bench_teds_hash_set n= 1048576 iterations=      20 time=0.840 sum=674350107920
         bench_array_flip_keys n= 1048576 iterations=      20 time=0.971 sum=674350107920
   bench_teds_sortedvector_set n= 1048576 iterations=      20 time=8.582 sum=674350107920
           bench_teds_tree_set n= 1048576 iterations=      20 time=12.105 sum=674350107920
            bench_array_unique n= 1048576 iterations=      20 time=5.547 sum=674350107920
  bench_array_unique_identical n= 1048576 iterations=      20 time=6.798 sum=674350107920


@TysonAndre
Contributor

I've seen a hybrid approach elsewhere, since a hash table and sorting do worse than nested for loops in practice at very small sizes

  • For small n=count($input), use a regular foreach over buckets in the zend_array to be returned and call zend_is_identical one by one
  • For large n, use a hash table similar to the previous comment
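A userland sketch of that hybrid (the threshold of 32 and the helper name are illustrative guesses, not from the PR; a real implementation would live in ext/standard/array.c and use a C hash set keyed on zvals, for which serialize() stands in imperfectly here):

```php
<?php

function unique_identical_hybrid(array $input, int $threshold = 32): array
{
    $result = [];
    if (count($input) <= $threshold) {
        // Small inputs: quadratic scan with ===, no hashing overhead.
        foreach ($input as $key => $value) {
            foreach ($result as $kept) {
                if ($kept === $value) {
                    continue 2;
                }
            }
            $result[$key] = $value;
        }
        return $result;
    }
    // Large inputs: hash-set membership test. serialize() is only a
    // stand-in for a real zval hash; e.g. it would merge NANs, which
    // === treats as distinct.
    $seen = [];
    foreach ($input as $key => $value) {
        $h = serialize($value);
        if (!isset($seen[$h])) {
            $seen[$h] = true;
            $result[$key] = $value;
        }
    }
    return $result;
}
```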

@TysonAndre
Contributor

TysonAndre commented Oct 22, 2022

I didn't realize array_unique's default was SORT_STRING when writing that benchmark, and the way I used it was creating a lot of tiny strings and freeing them, so it's not a great benchmark.

This reminds me, though: another argument that could be brought up in favor of ARRAY_UNIQUE_IDENTICAL is that it doesn't depend on current ini settings and will work properly for mixes of large ints/floats.
(SORT_STRING converts to strings, which depends on the precision setting and is slower, and some things might depend on that for mixes of strings/Stringable and floats.)

And SORT_NUMERIC uses php's < operator, which has an edge case I'd worked around for https://github.com/TysonAndre/pecl-teds/#stable-comparison (I don't expect much interest in changing existing behavior of SORT_NUMERIC specifically for int/float at the cost of performance)

~ » php -d precision=3 -d serialize_precision=3 -r 'var_export(array_unique([(float)PHP_INT_MAX, PHP_INT_MAX * 0.999999, PHP_INT_MAX, PHP_INT_MAX - 1], SORT_STRING));' 
array (
  0 => 9.22E+18,
  2 => 9223372036854775807,
  3 => 9223372036854775806,
)
~ » php -r 'var_export(array_unique([(float)PHP_INT_MAX, PHP_INT_MAX, PHP_INT_MAX - 1], SORT_NUMERIC)); echo "\n";'
array (
  0 => 9.223372036854776E+18,
)
~ » php -r 'var_export(array_unique([(float)19007199254740995, 19007199254740997, 19007199254740996], SORT_NUMERIC)); echo "\n";'
array (
  0 => 19007199254740996.0,
)

# Actually check for identical values (===). Unlike SORT_STRING, does not stringify
~ » php -d precision=3 -d serialize_precision=3 -r 'var_export(array_unique([(float)PHP_INT_MAX, PHP_INT_MAX * 0.999999, PHP_INT_MAX, PHP_INT_MAX - 1], ARRAY_UNIQUE_IDENTICAL));'
array (
  0 => 9.22E+18,
  1 => 9.22E+18,
  2 => 9223372036854775807,
  3 => 9223372036854775806,
)

@iluuu1994
Member Author

@TysonAndre Thank you for the benchmark!

I didn't realize array_unique's default was SORT_STRING when writing that benchmark, and the way I used it was creating a lot of tiny strings and freeing them, so it's not a great benchmark.

SORT_STRING has a custom hashtable implementation with O(n) (n hashtable lookups of O(1)) so I would expect it to perform better than the other options for large numbers, including IDENTICAL. I'd expect IDENTICAL to perform better than all other options though. I'll need to verify this.

This reminds me, though: another argument that could be brought up in favor of ARRAY_UNIQUE_IDENTICAL is that it doesn't depend on current ini settings and will work properly for mixes of large ints/floats

Good point. I'll mention this on the mailing list.

If this went with a hash-set-based approach, I'd recommend putting the hash struct and function definitions in ext/standard/array.c and not making them public C APIs (i.e. leaving them as internal implementation details, keeping only the parts needed, and allowing for zend_long)

👍 I think as of right now, the performance is likely "good enough" compared to the existing options to justify adding the hash set implementation, especially without an RFC (which is still what I'm hoping for, if there are no complaints on the mailing list). Once hash set is added for ADTs anyway this is something I can look into.

I've seen a hybrid approach elsewhere, since a hash table and sorting do worse than nested for loops in practice at very small sizes

This sounds like a good option.

@TysonAndre
Contributor

SORT_STRING has a custom hashtable implementation with O(n) (n hashtable lookups of O(1)) so I would expect it to perform better than the other options for large numbers, including IDENTICAL. I'd expect IDENTICAL to perform better than all other options though. I'll need to verify this.

See 6c2c7a0

The optimization of switching, for strings, from the easy-to-implement sort algorithm (O(n log n) for an array of size n) to a longer hash table implementation (O(n) overall) was done in PHP 7.2; the short but less efficient implementation was around in php before that for years.

I'd be willing to write a PR adding the hybrid hash implementation instead

}
}
case IS_STRING:
return zend_binary_zval_strcmp(z1, z2);
Contributor


Separately from my previous comment: this can be optimized, since the sort order here is internal and not user-visible.

So if any algorithms using sorting internally (where the sort order is not user-visible) are added to php, they can avoid string comparisons entirely in the common cases by comparing the hash first:

I haven't benchmarked this, though (for the worst case of long strings with common prefixes, or the normal case of short strings with common prefixes and a mix of lengths).

  • compare ZSTR_HASH(s1) to ZSTR_HASH(s2) first (compute if not already computed) - these should almost always be different for typical inputs
  • Then check if they're the same pointer
  • Then by length
  • Then by memcmp
#define ZSTR_H(zstr)    (zstr)->h
#define ZSTR_HASH(zstr) zend_string_hash_val(zstr) // gets hash value - computes the hash if it isn't already computed

Member Author


Good idea, let's do that!

@iluuu1994
Member Author

the short but less efficient implementation was around in php before that for years.

I was just pointing out why the implementation for SORT_STRING was faster for large n in PHP 8.3.

I'd be willing to write a PR adding the hybrid hash implementation instead

I can look into how much the hash set can be condensed. If it's not too much code we can see what the reaction is from internals. Otherwise I'll throw together an RFC.

@iluuu1994
Member Author

iluuu1994 commented Oct 27, 2022

Sorting:
  bench_array_unique_identical n=       1 iterations= 8000000 time=0.248 sum=0
  bench_array_unique_identical n=       4 iterations= 2000000 time=0.217 sum=12000000
  bench_array_unique_identical n=       8 iterations= 1000000 time=0.218 sum=26000000
  bench_array_unique_identical n=    1024 iterations=   20000 time=1.648 sum=6729080000
  bench_array_unique_identical n= 1048576 iterations=      20 time=3.971 sum=6952544574380

StrictMap:
  bench_array_unique_identical n=       1 iterations= 8000000 time=0.255 sum=0
  bench_array_unique_identical n=       4 iterations= 2000000 time=0.207 sum=12000000
  bench_array_unique_identical n=       8 iterations= 1000000 time=0.169 sum=26000000
  bench_array_unique_identical n=    1024 iterations=   20000 time=0.524 sum=6729080000
  bench_array_unique_identical n= 1048576 iterations=      20 time=2.204 sum=6952544574380

In my tests strict map was faster under all circumstances. I tried different amounts of duplicates too but the difference was negligible or the ratio between ns was roughly the same. The implementation is much bigger though (~700 lines excluding tests). There are still some cleanups I need to make.

One problem for ADTs: The map currently does refcounting, which will not work with ADTs because the map should essentially be a weak map. We could remove refcounting (since we're not exposing this to userland) and then add a custom free handler for ADTs and remove the instance from the global ADT map.

One more note: NAN behavior seems pretty inconsistent. https://3v4l.org/VWf7pJ

$a = NAN;
var_dump($a === $a);  // false

$a = [NAN];
$b = [NAN];
var_dump($a === $a);  // true, because ht == ht
var_dump($a === $b);  // false

But with the current implementation ADTs at least should match this behavior:

$a = NAN;
var_dump(Option::Some($a) === Option::Some($a));  // false, new instance is created

$a = [NAN];
$b = [NAN];
var_dump(Option::Some($a) === Option::Some($a));  // true, because ht == ht
var_dump(Option::Some($a) === Option::Some($b));  // false, new instance is created

EDIT: Potentially Option::Some(NAN) could cause issues with deletion from the map, as we can't actually look up this item because zend_is_identical would always fail. The easiest solution here would be to just reject NAN passed directly as an ADT value. Actually, we can just delete it by searching for the value instead.

@iluuu1994
Member Author

iluuu1994 commented Oct 27, 2022

@TysonAndre Do you think recursion handling really makes sense when something as simple as this terminates PHP?

https://3v4l.org/d0jjM

$a = [];
$b = [&$a];
$a[0] = &$b;

var_dump($a === $b);

Fatal error: Nesting level too deep - recursive dependency?

(I'm trying to condense the code as much as possible)

@TysonAndre
Contributor

One more note: NAN behavior seems pretty inconsistent. https://3v4l.org/VWf7pJ

Yes. I know. NAN causes a lot of edge cases.

zend_is_identical compares the zend_array pointers first to improve performance, among other things. So $array === $array is true even when the array contains NAN.

Unrelatedly, that's why I changed the linked Teds\StrictHashSet/Map implementation in a PECL to treat NAN as equal to NAN (i.e. only allow one NAN entry), even in arrays. https://github.com/TysonAndre/pecl-teds/#tedsstricthashmap-and-tedsstricthashset

It seems like there are now more languages that I can point to that have introduced stricter equality types.

JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map#using_nan_as_map_keys has been the case for a while (only one NaN key; older JS versions may have issues polyfilling).

Python: This changed recently in Python 3.10 to treat nan as equal to nan with singleton NaNs (only one NaN key) in hash tables: https://bugs.python.org/issue43475

Ruby: https://bugs.ruby-lang.org/issues/13146

1: NaN is given a random hash (to avoid hash key collisions).

If array_unique uses the PHP === operator definition, we'd want to avoid hash key collisions in this rare use case by using random numbers or sequences (e.g. += 2 in module globals) as hashes, instead of the hardcoded value used for the approach in StrictHashSet. I was going to mention the same thing when this was updated.

  • I've been hesitant to propose extra types of equality
  • Getting multiple nans seems unhelpful in most cases
  • I forgot to mention: in numeric sorts, NAN breaks the sorting order, and in string sorts, it's deduplicated by the cast to string
php > var_export(array_unique([NAN, 1, NAN, -1])); // SORT_STRING
array (
  0 => NAN,
  1 => 1,
  3 => -1,
)
php > var_export(array_unique([NAN, 1, NAN, -1], SORT_NUMERIC));
array (
  0 => NAN,
  1 => 1,
  2 => NAN,
  3 => -1,
)
irb(main):001:0> m={}
irb(main):002:0> m[0/0.0] = 'a'
irb(main):003:0> m[0/0.0] = 'A'
irb(main):004:0> m
=> {NaN=>"a", NaN=>"A"}

@TysonAndre
Contributor

Do you think recursion handling really makes sense when something as simple as this terminates PHP?

php > $a = [];
php > $b = [&$a];
php > $a[0] = &$b;
array(1) {
  [0]=>
  array(1) {
    [0]=>
    &array(1) {
      [0]=>
      *RECURSION*
    }
  }
}
php > var_dump(array_unique([$a, $b], SORT_REGULAR));
Fatal error: Nesting level too deep - recursive dependency? in php shell code on line 1

It isn't a new problem with array_unique sorting algorithms, anyway.

The recursion check makes crashes less common, e.g. for n = 1. This is avoided in some (but not all) use cases by the HashTable * C equality check, when the arrays share the exact same recursive array (the same HashTable pointer) for some field. It avoids crashing before the identical check (e.g. for multiple values of ['config' => $commonButRecursive, 'i' => $i])

I don't expect reference cycles to be common in practice, unless you're writing tests for edge cases, and users would notice the crash and stop using === for that (or use objects, etc)

it looks correct for floating-point edge cases and array edge cases: if PHP's === returned true for a pair of elements, then array_unique() would filter out the duplicate

It also has the same behavior as a polyfill (if the polyfill doesn't have a fatal error), and avoids fatal errors when the hashes are different.

@iluuu1994
Member Author

@TysonAndre Here's an initial implementation. master...iluuu1994:php-src:array_unique_identical_strictmap

Some remarks:

  • The implementation is mostly the same. I removed unused things like iterators, adjusted the naming to match php-src, etc.
  • I copied the Teds license into the new files; let me know if there's anything else that needs to be done in that regard
  • I replaced teds_stricthashmap_entries_rehash_inplace with a free indices list since order doesn't matter here or for ADTs, which makes the implementation simpler and likely faster
  • I'd still like to adjust variable names, which are currently inconsistent (map, ht, array, etc)

@iluuu1994
Member Author

Out of interest I tried cswisstable but it wasn't significantly faster. Looks like it's a bit slower for small maps but faster for huge ones. The map uses emalloc/efree, the hash calculation from teds, and skips refcounting. I compiled with intrinsics and -O2. Not sure if there's something else it's missing.

  bench_array_unique_identical n=       1 iterations= 8000000 time=0.234 sum=0
  bench_array_unique_identical n=       4 iterations= 2000000 time=0.240 sum=12000000
  bench_array_unique_identical n=       8 iterations= 1000000 time=0.191 sum=26000000
  bench_array_unique_identical n=    1024 iterations=   20000 time=0.579 sum=6729080000
  bench_array_unique_identical n= 1048576 iterations=      20 time=1.753 sum=6952544574380

Reasons to favor cswisstable:

  • We could generate other, more tailored maps/sets. E.g. it can have custom storage, copy, dtor, hash and eq functions. Thus, we could have maps that skip refcounting, or use persistent memory, etc. A lot of php-src code uses HashTable because it's the most convenient. However, storing an entire zval is often unnecessary.
  • This code is (more) battle tested than a custom hash map implementation
  • We don't have to maintain it

Reasons to favor teds implementation:

  • It's faster for small maps, which is likely most maps
  • We can tweak it ourselves for best performance
  • It looks like it's not possible to create a custom dtor for keys in cswisstable, which would be necessary for refcounted maps with zvals as keys

@TysonAndre
Copy link
Contributor

TysonAndre commented Oct 28, 2022

  • This functionality isn't reusable as-is (e.g. no exported symbols, only used outside of Zend in one extension), and I wouldn't try (until we have real hash maps), so it doesn't make sense to put it in zend.

  • The file comments really should keep the note that it's based on code used in zend_hash, since it uses the same hashing/collision strategy, just with zvals as keys and reference counting handled for them.

  • `static zend_always_inline zend_ulong zend_rotr3(zend_ulong key)`

  • I'd recommend using hashset over hashmap for the tiny bit of improvement. See teds_array_unique_values in teds.c.

  • I was using StrictHashMap to distinguish it from other hashes if teds added a map that called a hash function for objects.

  • empty_entry_list is only needed for checking that __construct and __unserialize are called in collection objects.

  • zend_strictmap_dtor seems like it'll leak for reference counted strings such as $dynamic = 'dynamic'; array_unique(["xyz$dynamic"]);

    RETURN_ARR(teds_move_zend_array_from_entries(&array)); is used instead in teds for that reason, though Teds\unique_values returns a list instead.

    This only needs a hash set and only needs to efree(data) (EDIT: If the use of a hash set neither increments the reference count nor decrements it)

  • `/* TODO: This can be optimized based on zend_hash_rehash's special cases, e.g. HT_IS_WITHOUT_HOLES */`

  • The capacity checks are redundant in array_unique

  • `/* This assumes that pointers differ in low addresses rather than high addresses. */` - I should fix that comment in the PECL and keep the other two lines; the hash function was copied from igbinary and isn't hashing pointers.

  • guaranteed_new is always false. That was for cloning hash sets/maps, I believe.

@TysonAndre
Copy link
Contributor

TysonAndre commented Oct 29, 2022

I hadn't tried or seen cswisstable before.

Again, the main reason I used https://github.com/TysonAndre/pecl-teds/blob/1.2.8/teds.c#L486-L492 in that implementation was that it was already there (and it's a lot of code already).

Per https://github.com/google/cwisstable#compatibility-warnings, I guess it would be useful in C files.

The other thing is that cwisstable expects few hash collisions.

@iluuu1994
Copy link
Member Author

so it doesn't make sense to put it in zend.

So, where would you say this belongs?

static zend_always_inline zend_ulong zend_rotr3(zend_ulong key)

I'll need some more context for this one 🙂

I'd recommend using hashset over hashmap for the tiny bit of improvement. See teds_array_unique_values in teds.c.

The main reason to justify the complexity (IMO at least) is that the hash map is necessary for ADTs. Hash sets (while certainly useful) don't have an immediate use case.

empty_entry_list is only needed for checking that __construct and __unserialize are called in collection objects.

ZEND_STRICTMAP_FOREACH accesses data unconditionally, so this prevents a NULL pointer access when data has not been allocated. But since this is a private API, I'm fine with removing it along with the capacity checks.

This only needs a hash set and only needs to efree(data) (EDIT: If the use of a hash set neither increments the reference count nor decrements it)

I think it's better to remove refcounting for now, as neither array_unique nor ADTs will need it.

Are you passing a 32 bit integer to the size_t hash cwiss expects in those benchmarks?

No, I passed the 64-bit hash.

The other thing is that cwisstable expects few hash collisions

Yeah, good distribution is crucial for open addressing. There might be a way to check how many collisions there are.

@iluuu1994
Copy link
Member Author

iluuu1994 commented Oct 30, 2022

By the way, the current approach has a problem with references that we'll need to handle for ADTs:

  $foo = 'foo';
  $a1 = [&$foo];
  $a2 = ['foo'];

  $e1 = Option::Some($a1);
  $e2 = Option::Some($a2);
  $foo = 'bar';

  var_dump($e1->value); // ['bar'], questionable
  var_dump($e2->value); // ['bar'], should be ['foo']

For ADTs, we'll need to reject values that are references or arrays that contain references, because these values can be changed after the enum instance has been created.

@TysonAndre I haven't tested it, but there might be a similar issue in teds, where the stored hash no longer matches the given array value:

  $foo = 'foo';
  $a1 = [&$foo];
  $a2 = ['foo'];

  $strictHashMap = new StrictHashMap();
  $strictHashMap[$a1] = 'foo';
  $foo = 'bar';

  var_dump($strictHashMap[$a1]); // Hash doesn't match, not found
  var_dump($strictHashMap[$a2]); // Values not identical, not found
