Skip to content

Remove memory bottleneck #11

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Remove memory bottleneck #11

wants to merge 3 commits into from

Conversation

mimrock
Copy link
Contributor

@mimrock mimrock commented Aug 8, 2014

I was using PHPUnit when a test, which compared two big DOMDocument files, stopped due to the exceeding of the memory limit. I investigated the story, and found that the method which generates the diff for the failed tests is using a dynamic programming-based approach. This algorithm have a quadratic running time, and requires quadratic memory.

If the available time-frame to run the tests is not extremely tight, the memory consumption is a stricter bottleneck.

In this pull request I replaced the old LCS method with a new one based on an algorithm called Hirschberg's algorithm, which also have a quadratic running time, but only uses linear space. The new algorithm is a bit slower (30-40%), but removes the memory bottleneck, thus makes PHPUnit suitable for comparing bigger items.

@mimrock mimrock changed the title Remove memory limit from diffing Remove memory bottleneck Aug 8, 2014
@sebastianbergmann
Copy link
Owner

I refactored your pull request and made the implementation pluggable. By default, the time-efficient implementation is used. In a next step (read: before I make a new release) I want to choose the implementation used (when no inmplementation is explicitly passed) based on the size of the input. Can you share some insight what a good threshold could be?

@mimrock
Copy link
Contributor Author

mimrock commented Aug 11, 2014

Great idea!

According to some ad hoc tests with memory_get_peak_usage() (and using php 5.5.3) the DP algorithm consumes ~N_N_150 bytes of memory where N is the size of the bigger smaller input.

I have two ideas: One is to assume that there is an amount of memory which is affordable to everybody. Given this limit is 100M, the longer string/array should not contain more then sqrt(10^8 / 150) ~ sqrt(660,000) ~ 800 elements for the time-efficient implementation.

The second is to assume that it is safe to use up a fixed portion of the free memory. It is possible to calculate this with reading memory_limit setting, and subtract the value of memory_get_usage, then calculate if the estimated memory cost will be greater then the 50% of this value.

Personally, I prefer the first solution, because it's simpler, and even if it fails, and let's the Differ run out of memory, it can be fixed by increasing the memory limit.

The second one, while faster in some cases, changes it's behaviour based on the settings of the php environment, which can lead to strange bugs.

@sebastianbergmann
Copy link
Owner

Thank you for your feedback, @mimrock. Do you want to send a pull request for the first solution?

@mimrock
Copy link
Contributor Author

mimrock commented Aug 11, 2014

Yes, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants