Add ARRAY_UNIQUE_IDENTICAL option #9804
This has all of the fixes I'd wanted in #7806 and looks correct for floating point edge cases and array edge cases - if PHP's Limiting this flag and proposal to array_unique() for an RFC is the best approach in my opinion, given that there's no intuitive sort order for all edge cases of user-defined and internal classes and their subclasses.
I'd compared a few approaches to getting unique values (benchmarking for the case of sets of It seems like a hash table would compare better at different sizes (e.g. comparing array_flip with the array_unique implementation here) - the
If this went with a hash set based approach, I'd recommend putting the hash struct and function's definitions in ext/standard/array.c and not making them public C apis (i.e. leaving them as internal implementation details, and keeping the implementation as only the parts needed, and allowing for
I could help in porting those to so that any PHP values that are
Benchmark results (There's some variance in results, this was only one run)
Benchmarking of an array of many integers with few duplicates, n random numbers from 0 to n*4
Implementation
<?php
use function Teds\unique_values;
function bench_unique_values(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n << 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum(unique_values($values));
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_tree_set(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n << 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum((new Teds\StrictTreeSet($values))->values());
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_sortedvector_set(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n << 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum((new Teds\StrictSortedVectorSet($values))->values());
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_hash_set(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n << 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum((new Teds\StrictHashSet($values))->values());
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_array_unique(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n << 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum(array_unique($values));
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
if (defined('ARRAY_UNIQUE_IDENTICAL')) {
function bench_array_unique_identical(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n << 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum(array_unique($values, ARRAY_UNIQUE_IDENTICAL));
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
} /*defined('ARRAY_UNIQUE_IDENTICAL') */
function bench_array_flip_keys(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n << 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum(array_keys(array_flip($values)));
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
$n = 2**20;
$iterations = 10;
$sizes = [
[1, 8000000],
[4, 2000000],
[8, 1000000],
[2**10, 20000],
[2**20, 20],
];
printf(
"Results for php %s debug=%s with opcache enabled=%s\n\n",
PHP_VERSION,
json_encode(PHP_DEBUG),
json_encode(function_exists('opcache_get_status') && (opcache_get_status(false)['opcache_enabled'] ?? false))
);
echo "Note that Teds\sorted_set is also sorting the elements and maintaining a balanced binary search tree.\n";
foreach ($sizes as [$n, $iterations]) {
bench_unique_values($n, $iterations);
bench_teds_hash_set($n, $iterations);
bench_array_flip_keys($n, $iterations);
bench_teds_sortedvector_set($n, $iterations);
bench_teds_tree_set($n, $iterations);
bench_array_unique($n, $iterations);
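// Only defined above when the ARRAY_UNIQUE_IDENTICAL constant exists.
if (function_exists('bench_array_unique_identical')) {
bench_array_unique_identical($n, $iterations);
}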
echo "\n";
}
Benchmarks of various options (including PECL) for deduplicating values in a list of integers with many duplicates
<?php
use function Teds\unique_values;
function bench_unique_values(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n >> 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum(unique_values($values));
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_tree_set(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n >> 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum((new Teds\StrictTreeSet($values))->values());
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_sortedvector_set(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n >> 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum((new Teds\StrictSortedVectorSet($values))->values());
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_teds_hash_set(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n >> 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum((new Teds\StrictHashSet($values))->values());
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
function bench_array_unique(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n >> 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum(array_unique($values));
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
if (defined('ARRAY_UNIQUE_IDENTICAL')) {
function bench_array_unique_identical(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n >> 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum(array_unique($values, ARRAY_UNIQUE_IDENTICAL));
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
} /*defined('ARRAY_UNIQUE_IDENTICAL') */
function bench_array_flip_keys(int $n, int $iterations) {
$values = [];
srand(1234);
for ($i = 0; $i < $n; $i++) {
$values[] = rand(0, ($n >> 2) - 1);
}
$start = hrtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
$sum += array_sum(array_keys(array_flip($values)));
}
$end = hrtime(true);
printf("%30s n=%8d iterations=%8d time=%.3f sum=%d\n", __FUNCTION__, $n, $iterations, ($end - $start)/1e9, $sum);
}
$n = 2**20;
$iterations = 10;
$sizes = [
[1, 8000000],
[4, 2000000],
[8, 1000000],
[2**10, 20000],
[2**20, 20],
];
printf(
"Results for php %s debug=%s with opcache enabled=%s\n\n",
PHP_VERSION,
json_encode(PHP_DEBUG),
json_encode(function_exists('opcache_get_status') && (opcache_get_status(false)['opcache_enabled'] ?? false))
);
echo "Note that Teds\sorted_set is also sorting the elements and maintaining a balanced binary search tree.\n";
foreach ($sizes as [$n, $iterations]) {
bench_unique_values($n, $iterations);
bench_teds_hash_set($n, $iterations);
bench_array_flip_keys($n, $iterations);
bench_teds_sortedvector_set($n, $iterations);
bench_teds_tree_set($n, $iterations);
bench_array_unique($n, $iterations);
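// Only defined above when the ARRAY_UNIQUE_IDENTICAL constant exists.
if (function_exists('bench_array_unique_identical')) {
bench_array_unique_identical($n, $iterations);
}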
echo "\n";
}
I've seen a hybrid approach elsewhere, since a hash table and sorting do worse than nested for loops in practice at very small sizes
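As a rough userland illustration of that hybrid idea (not the php-src or Teds implementation), the sketch below uses nested loops with `===` for tiny inputs and a bucketed lookup for larger ones. The function names, the threshold of 8, and the bucket-key scheme are all assumptions made for this example.

```php
<?php
// Hypothetical helpers for illustration; none of these names exist in php-src or Teds.

/** Coarse bucket key: values that are identical (===) always get the same key. */
function bucket_key(mixed $value): string
{
    if (is_int($value)) {
        return 'i' . $value;
    }
    if (is_string($value)) {
        return 's' . $value;
    }
    if (is_float($value)) {
        // Adding 0.0 folds -0.0 into 0.0, which === treats as identical.
        return 'd' . serialize($value + 0.0);
    }
    // null, bool, array, object: one bucket per type, resolved by the === scan below.
    return 't' . gettype($value);
}

/** Deduplicate with ===, keeping the first occurrence and its key. */
function unique_identical_hybrid(array $values, int $threshold = 8): array
{
    $result = [];
    if (count($values) <= $threshold) {
        // Tiny input: quadratic scan, no hashing overhead.
        foreach ($values as $key => $value) {
            foreach ($result as $kept) {
                if ($kept === $value) {
                    continue 2;
                }
            }
            $result[$key] = $value;
        }
        return $result;
    }
    // Larger input: only compare against values that share a bucket key.
    $buckets = [];
    foreach ($values as $key => $value) {
        $bucket = bucket_key($value);
        foreach ($buckets[$bucket] ?? [] as $kept) {
            if ($kept === $value) {
                continue 2;
            }
        }
        $buckets[$bucket][] = $value;
        $result[$key] = $value;
    }
    return $result;
}
```

The threshold is arbitrary here; where the crossover actually sits would have to be measured, and arrays or objects all land in one bucket per type in this sketch, so it only helps for scalar-heavy inputs.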
I didn't realize array_unique's default was SORT_STRING when writing that benchmark, and that the way I used was creating a lot of tiny strings and freeing them, so not a great benchmark. This reminds me, though - Another argument that could be brought up in favor of ARRAY_UNIQUE_IDENTICAL is that it doesn't depend on current ini settings and will work properly for mixes of large ints/floats And SORT_NUMERIC uses php's
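To make the ini-setting point concrete, here is a small sketch of how the default SORT_STRING mode can depend on the `precision` setting, since floats are compared by their string form. The helper name is made up for this example; only `array_unique()`, `ini_set()` and `count()` are real APIs.

```php
<?php
// Illustrative helper (hypothetical name): counts unique values under a given `precision`.
function count_unique_with_precision(array $values, string $precision): int
{
    $previous = ini_set('precision', $precision);
    try {
        // Default flag is SORT_STRING, so floats are compared as strings.
        return count(array_unique($values));
    } finally {
        // Restore the previous setting so the caller is unaffected.
        ini_set('precision', (string) $previous);
    }
}

// Two different doubles that print the same at low precision.
$values = [0.1 + 0.2, 0.3];
$identical = ($values[0] === $values[1]);                  // false: the doubles differ
$low = count_unique_with_precision($values, '14');         // both stringify to "0.3"
$high = count_unique_with_precision($values, '17');        // the strings differ
```

A comparison based on `===` would treat the two values as distinct regardless of `precision`, which is the property being argued for above.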
@TysonAndre Thank you for the benchmark!
Good point. I'll mention this on the mailing list.
👍 I think as of right now, the performance is likely "good enough" compared to the existing options to justify adding the hash set implementation, especially without an RFC (which is still what I'm hoping for, if there are no complaints on the mailing list). Once hash set is added for ADTs anyway this is something I can look into.
This sounds like a good option.
See 6c2c7a0 The optimization of switching from the easy to implement sort algorithm of ( I'd be willing to write a PR adding the hybrid hash implementation instead
}
}
case IS_STRING:
return zend_binary_zval_strcmp(z1, z2);
Separately from my previous comment: This can be optimized, the sort order is internal and not user-visible.
So if any algorithms using sorting internally (where the sort order is not user-visible) are added to php, they can avoid string comparisons entirely in the common cases (comparing hash first):
I haven't benchmarked this, though
(for the worst case of long strings with common prefixes, or the normal case of short strings with common prefixes and a mix of lengths)
- compare ZSTR_HASH(s1) to ZSTR_HASH(s2) first (compute if not already computed) - these should almost always be different for typical inputs
- Then check if they're the same pointer
- Then by length
- Then by memcmp
#define ZSTR_H(zstr) (zstr)->h
#define ZSTR_HASH(zstr) zend_string_hash_val(zstr) // gets hash value - computes the hash if it isn't already computed
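The comparison order listed above can be sketched in userland PHP, purely as an analogue of the C-level idea. Here `crc32()` stands in for the cached `ZSTR_HASH()` value, the pointer-equality step has no PHP equivalent and is omitted, and both function names are hypothetical.

```php
<?php
// Userland analogue only; the real optimization would live in C and reuse the cached string hash.
function compare_hash_first(string $s1, string $s2): int
{
    // 1. Hashes differ for most distinct inputs, so this usually decides the order.
    $byHash = crc32($s1) <=> crc32($s2);
    if ($byHash !== 0) {
        return $byHash;
    }
    // 2. Same hash: compare lengths before touching the bytes.
    $byLength = strlen($s1) <=> strlen($s2);
    if ($byLength !== 0) {
        return $byLength;
    }
    // 3. Same hash and length: fall back to a byte-wise comparison.
    return strcmp($s1, $s2) <=> 0;
}

/** Sort by the internal order, then drop adjacent duplicates. */
function unique_strings_sorted(array $strings): array
{
    usort($strings, 'compare_hash_first');
    $unique = [];
    foreach ($strings as $string) {
        if ($unique === [] || $unique[array_key_last($unique)] !== $string) {
            $unique[] = $string;
        }
    }
    return $unique;
}
```

The resulting order follows the hash values, so it is only suitable where, as noted above, the sort order is internal and not user-visible. Unlike the engine, this sketch recomputes the hash on every comparison, so it says nothing about real performance.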
Good idea, let's do that!
I was just pointing out why the implementation for
I can look into how much the hash set can be condensed. If it's not too much code we can see what the reaction is from internals. Otherwise I'll throw together an RFC.
In my tests strict map was faster under all circumstances. I tried different amounts of duplicates too but the difference was negligible or the ratio between
One problem for ADTs: The map currently does refcounting, which will not work with ADTs because the map should essentially be a weak map. We could remove refcounting (since we're not exposing this to userland) and then add a custom free handler for ADTs and remove the instance from the global ADT map.
One more note:
$a = NAN;
var_dump($a === $a); // false
$a = [NAN];
$b = [NAN];
var_dump($a === $a); // true, because ht == ht
var_dump($a === $b); // false
But with the current implementation ADTs at least should match this behavior:
$a = NAN;
var_dump(Option::Some($a) === Option::Some($a)); // false, new instance is created
$a = [NAN];
$b = [NAN];
var_dump(Option::Some($a) === Option::Some($a)); // true, because ht == ht
var_dump(Option::Some($a) === Option::Some($b)); // false, new instance is created
EDIT:
@TysonAndre Do you think recursion handling really makes sense when something as simple as this terminates PHP?
$a = [];
$b = [&$a];
$a[0] = &$b;
var_dump($a === $b);
(I'm trying to condense the code as much as possible)
Yes. I know. NAN causes a lot of edge cases. zend_is_identical compares the zend_array pointers first to improve performance, among other things. So Unrelatedly, that's why I changed the linked
It seems like there are now more languages that I can point to that have introduced stricter equality types:
- JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map#using_nan_as_map_keys has been the case for a while (older js versions may have issues polyfilling) (only allow one nan key)
- Python: This changed recently in python 3.10 to have nan = nan and singleton nans (only allow one nan key) in hash tables. - https://bugs.python.org/issue43475
- Ruby: https://bugs.ruby-lang.org/issues/13146
If array_unique uses the PHP
php > var_export(array_unique([NAN, 1, NAN, -1])); // SORT_STRING
array (
0 => NAN,
1 => 1,
3 => -1,
)
php > var_export(array_unique([NAN, 1, NAN, -1], SORT_NUMERIC));
array (
0 => NAN,
1 => 1,
2 => NAN,
3 => -1,
)
irb(main):001:0> m={}
irb(main):002:0> m[0/0.0] = 'a'
irb(main):003:0> m[0/0.0] = 'A'
irb(main):004:0> m
=> {NaN=>"a", NaN=>"A"}
It isn't a new problem with array_unique sorting algorithms, anyway. The recursion check makes crashes less common, e.g. n = 1. This is avoided in some (but not all) use cases by the I don't expect reference cycles to be common in practice, unless you're writing tests for edge cases, and users would notice the crash and stop using
It also has the same behavior as a polyfill (if the polyfill doesn't have a fatal error), and avoids fatal errors when the hashes are different.
@TysonAndre Here's an initial implementation. master...iluuu1994:php-src:array_unique_identical_strictmap Some remarks:
Out of interest I tried cwisstable but it wasn't significantly faster. Looks like it's a bit slower for small maps but faster for huge ones. The map uses
Reasons to favor cwisstable:
Reasons to favor teds implementation:
I hadn't tried or seen cwisstable before. Again, the main reason I used https://github.com/TysonAndre/pecl-teds/blob/1.2.8/teds.c#L486-L492 in that implementation was because it was already there (and it's a lot of code already)
https://github.com/google/cwisstable#compatibility-warnings I guess it would be useful in C files. The other thing is that cwisstable expects few hash collisions
So, where would you say this belongs?
I'll need some more context for this one 🙂
The main reason to justify the complexity (IMO at least) is that the hash map is necessary for ADTs. Hash sets (while certainly useful) don't have an immediate use case.
I think it's better to remove refcounting for now, as neither
No, I passed the 64-bit hash.
Yeah, good distribution is crucial for open addressing. There might be a way to check how many collisions there were.
Btw, there's a problem with references with the current approach that we'll need to handle for ADTs:
$foo = 'foo';
$a1 = [&$foo];
$a2 = ['foo'];
$e1 = Option::Some($a1);
$e2 = Option::Some($a2);
$foo = 'bar';
var_dump($e1->value); // ['bar'], questionable
var_dump($e2->value); // ['bar'], should be ['foo']
For ADTs, we'll need to reject values that are references or arrays that contain references. That's because these values can be changed after the enum instance has been created.
@TysonAndre I haven't tested it but there might be a similar issue for teds where the hash no longer matches the given array value.
$foo = 'foo';
$a1 = [&$foo];
$a2 = ['foo'];
$strictHashMap = new StrictHashMap();
$strictHashMap[$a1] = 'foo';
$foo = 'bar';
var_dump($strictHashMap[$a1]); // Hash doesn't match, not found
var_dump($strictHashMap[$a2]); // Values not identical, not found
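The underlying effect can be reproduced with plain arrays, without `Option::Some()` or `StrictHashMap`. The sketch below shows a stored array drifting after the referenced variable changes, plus one possible userland check for arrays that contain references. Both function names are hypothetical, and the check assumes no cyclic arrays (it would recurse forever on those).

```php
<?php
// Plain-array illustration; the ADT and StrictHashMap types from the comment above are not used.

/** Hypothetical check: does the array (or any nested array) hold a reference slot? */
function contains_reference(array $values): bool
{
    foreach ($values as $key => $value) {
        // ReflectionReference::fromArrayElement() returns null for non-reference elements.
        if (ReflectionReference::fromArrayElement($values, $key) !== null) {
            return true;
        }
        if (is_array($value) && contains_reference($value)) {
            return true;
        }
    }
    return false;
}

/** Shows a stored copy changing after the fact because it kept the reference slot. */
function demo_reference_drift(): array
{
    $foo = 'foo';
    $a1 = [&$foo];
    $a2 = ['foo'];
    $stored = $a1;                          // copying the array keeps the reference slot
    $identicalBefore = ($stored === $a2);   // compared while $foo is still 'foo'
    $foo = 'bar';                           // also changes $stored[0]
    $identicalAfter = ($stored === $a2);
    return [$identicalBefore, $identicalAfter, $stored[0]];
}
```

A container that hashed `$stored` before the assignment to `$foo` would be left holding a stale hash, which is why rejecting such values up front is being suggested.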
Alternative to GH-9788 with better performance characteristics. Heavily based on GH-7806.