Most of the best-performing frameworks don't survive temporary db connectivity loss #8790
I assume you've got your browser tabs mixed up, because otherwise I don't have the slightest idea what orm/rust/redis question you, @zheby, are talking about.
cc @Xudong-Huang @fakeshadow @billywhizz @vietj @sagenschneider @fundon @jplatte @jknack @ecruz-te @fafhrd91 @ShreckYe @chrislearn @jkuhn1 as the most recent committers to the corresponding implementations. Do you find my concerns justified?
Hey, I do think that if things don't recover from temporary DB issues, that's pretty weird and probably shouldn't fly for the benchmarks. Re. axum, the one I contributed to, I can say that the code has some issues (though none that I know would certainly degrade performance), but I haven't had much motivation to take the time required to fix them. Which configurations are you talking about as the slow and fast ones?
Hi! This is actually a common theme for every (as far as I can tell) Rust implementation: employing a small set of raw, pipelined connections instead of a conventional connection pool.
I wouldn't mind at all having the …
If the TechEmpower team says recovery is required, it should be easy to add. I doubt it would change the results by a measurable amount.
@NateBrady23 could you please clarify the rules here?
Thanks for doing that QA, @itrofimow. I agree with everything said here. I would expect a production-grade framework to be able to recover from a temporary db connectivity loss. However, I also agree that it probably won't alter the results. Still, I'd like to see the frameworks here updated, though we probably won't be adding this to the test suite anytime soon. I'll leave this issue open for now. If anyone wanders here: it probably doesn't look good if your framework implementation doesn't recover in such a way.
Mandatory reconnectivity is a good rule to add, though it's unlikely to affect perf unless it's on a hot path of your system.
I am the original author of rule 1 of the general test requirements. You are right that it is intentionally vague on the specifics of production grade. My thinking at the time was that if there were specific matters, like the one you raise, the community could discuss them in precisely this manner. So thank you for raising the concern!

Our deference to each framework's definition of "production grade" grants us some luxury to avoid dictating the right or wrong ways to build web applications. For example, I would not want to reject a framework or platform whose developers stated something akin to, "It's normal to restart the application server in production if the database server restarts." While I personally would disagree with such a concession, I meanwhile would continue not using foreign keys in databases, and another developer might consider that reckless. I suspect we can all think of weird stuff we've seen being considered production-grade by other developers.

My inclination would be to allow for the current situation but re-emphasize a longer-term goal of the project: to offer more detailed implementation notes for visitors on the results website. Being able to understand with relative ease that a given framework provides database connection resilience (or does not) would be worthwhile.

In other words: don't forbid this arguably risky optimization (or lack of functionality), but call attention to it, and by so doing, nudge the developers to either explain how they recommend handling database outages in a production environment, present an argument about why it's not relevant in their ecosystem, or accept that consumers of the results data will be made aware of the concession.
Thanks for the clarification.
About that: pipelining unrelated queries over a relatively small number of connections seems to be instrumental to squeezing as much performance out of PostgreSQL as possible in these benchmarks; however, most widely used clients don't support it. Pooling connections takes away the possibility of pipelining unrelated queries, so one needs to increase the number of connections to achieve the same throughput. Curious to learn how folks would overcome this limitation; we'll see.

P.S. This issue (the GitHub issue, I mean, #8790) is not specifically about Rust implementations; it just so happens that they all fail the QA test for the exact same reason.

edit: typos
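A minimal sketch of what client-side pipelining looks like, using tokio-postgres as the example driver (my choice for illustration, not necessarily what the implementations here use): when two query futures on the same client are awaited concurrently, the driver can put both queries on the wire before the first response arrives, whereas a pool would check out two separate connections and lose that opportunity.

```rust
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    // Connection string is a placeholder.
    let (client, conn) =
        tokio_postgres::connect("host=localhost user=postgres", NoTls).await?;
    // The Connection future drives the socket; run it in the background.
    tokio::spawn(async move {
        if let Err(e) = conn.await {
            eprintln!("connection error: {e}");
        }
    });

    // Two unrelated queries on the SAME client...
    let q1 = client.query_one("SELECT 1::INT4", &[]);
    let q2 = client.query_one("SELECT 2::INT4", &[]);
    // ...awaited concurrently, so the driver can pipeline them over one
    // connection instead of waiting for the first response.
    let (r1, r2) = tokio::try_join!(q1, q2)?;
    println!("{} {}", r1.get::<_, i32>(0), r2.get::<_, i32>(0));
    Ok(())
}
```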
Given your interpretation, should I rename the issue to "Most of the best-performing frameworks don't survive temporary db connectivity loss", so it doesn't come across as strongly worded w.r.t. the rules? @bhauer
@itrofimow Up to you! That title revision might be clearer about the specific concern, but I don't think the title matters very much.
Hi @itrofimow, I mainly adapted the Vert.x benchmarks into Kotlin. Let me help you cc @vietj, @franz1981, and @pmlopes, who are the core developers for this.
Hi @itrofimow, I do agree that production-grade applications should be resilient to DB failure or restart. Now, looking at rule 1 (https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview), as you mentioned, it is not really obvious that this is required for the performance test ("should" is not "must"). If we want to go down that path, I think it might be good to clarify this: we can keep "should" to be inclusive and yet clearly state that implementations must be resilient to DB connection loss. But this raises some other concerns: TFB must then include a test for this, and if we test this, what else should we test to make sure an application is resilient? (Client connection loss? A client sending bad HTTP requests?) This can easily lead to testing frameworks' resiliency and no longer only pure performance. Don't get me wrong, this would be a good thing, but it would also widen the scope.

Inverno (and I guess many other Java-based frameworks) relies on the Vert.x SQL client (which is really good). The framework is still under heavy development and I wouldn't use it yet for mission-critical applications (the above issue just proved it), but my goal is mission-critical production grade, so I'll take some time to test this scenario and see if something can be done on my side; otherwise I guess @vietj would need to provide some fix or advice (please let me know if I can help).
Hi, I've run some tests on Inverno and refreshed my memory of the application used for the benchmark test. I'm posting my findings here because they might help explain why other frameworks based on the Vert.x SQL client (or not) are also failing (including Vert.x itself).

I decided to create a single DB connection bound to each IO thread (2*n, where n is the number of cores) rather than relying on a connection pool, which would be the standard setup for a production application. Using a single connection basically allows DB requests to be pipelined, which, as you know, has a significant impact on performance (I guess this was already noticed in previous comments). Considering resiliency was not in scope for such a performance test, I didn't find it useful to include something that would recreate the connection in case of errors; this would be doable, but not really important for assessing performance.

I tested the above scenario in an application using the standard pooled SQL client, and I can confirm Inverno is resilient to a DB restart, the Vert.x SQL client doing its job perfectly by recreating the dead connection. In a production application, people would most likely use a pooled SQL client, but that doesn't mean it is not possible to create specific connections like in the TFB app when pipelining is of interest; the only drawback is that you have to handle connection errors on your own. From what I saw, the issue seems to be the same for the Vert.x TFB application, tbc with @vietj.

So there's no issue with the Vert.x SQL client; for the rest, I think it depends on what you decide for the TFB rules: do we include resiliency to DB failure as part of the performance test or not? If so, I'll provide a fix to make this happen. Regards
Well, given the latter remark, this is simply not true. You are correct that the optimal case for PostgreSQL seems to be 1 database connection per logical processor core, so the optimal configuration is the one from the previous paragraph, given that the benchmarking environment the TechEmpower team has set up at the moment uses homogeneous hardware. However, back in the ServerCentral days (yeah, it has been a while) that was not the case; as a side note, it was also a NUMA environment. Furthermore, when I am testing my code in the cloud, I often bring up powerful client/load generator and database (virtual) machines and a relatively weak application server, in order to alleviate any potential bottlenecks outside of the framework. In such asymmetric cases it is useful to be able to use more than 1 database connection per pool, so I have implemented it in a way that does not require any code changes to switch.
You are right, I should've worded it better: by "this issue is not specifically about Rust" I meant the #8790 issue itself; the "pipelining vs pooling" issue is indeed mostly Rust-related.
It's my understanding that Postgres pipelining should be disabled for all frameworks in the tests: on every connection to Postgres, the client should wait to receive a response before putting another command on the wire. There has been a big discussion about this already, and a number of frameworks have had to turn this off, which drastically alters the results. If there are frameworks currently using Postgres pipelining, then it should probably be flagged up here and the authors notified.

Re. resilience: there is such a broad range of frameworks on here, in various states of maintenance, that I am not sure adding more rules to the existing tests is useful, and it will also likely have no impact on the results. IMO, TE exists to test performance/throughput and not much else. It's a yardstick to help folks uncover performance issues and get a sense of the performance differences between languages, frameworks, and approaches. If we want to test for correctness, spec compliance, etc., then I think that is really a whole different ballgame.
On the contrary, pipelining is explicitly allowed:
All the frameworks that had to be fixed were violating the transactionality requirement, specifically by skipping Sync messages between queries.
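For context, here is my own sketch of the wire-level difference in PostgreSQL's extended query protocol (an illustration, not a quote from the rules): a Sync after every query keeps each query in its own implicit transaction, while deferring a single Sync to the end of the batch lumps all the queries into one implicit transaction, which is the violation that had to be fixed.

```
-- compliant pipelining: each query carries its own Sync, so each runs in
-- its own implicit transaction
Parse  Bind  Execute  Sync    Parse  Bind  Execute  Sync    ...

-- the flagged violation: one Sync for the whole batch, so every query
-- shares a single implicit transaction
Parse  Bind  Execute    Parse  Bind  Execute    ...    Sync
```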
Yes, you are right. Apologies, it's been some time since I looked at this. 🙏 But there is still no verification that this is implemented "correctly", with a Sync for every query.
I have added a persistent db session with auto-recovery of the TCP connection and cached prepared statements. @itrofimow do you mind trying it and possibly updating the issue if it passes your test?
Hi @fakeshadow, I've tested your PR locally and can confirm that it indeed survives the database restart. Thanks for giving the fixes a start! edit: typo
Thanks for the test and the update. The lock sees zero contention on the happy path when no reconnecting happens, so it's just a couple of single-threaded atomic reads and a single CAS exchange; I would not worry about the perf impact. There is also room for optimization to remove the atomic operations completely.
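To make the shape of that happy path concrete, here is a minimal sketch of the pattern as I understand it (my own illustration, not the PR's actual code): every request performs one atomic load, and after a failure at most one caller wins the CAS and carries out the reconnect.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct ConnSlot {
    broken: AtomicBool,
    // The real connection handle would live alongside this flag.
}

impl ConnSlot {
    /// Called on a db error to flag the session as dead.
    fn mark_broken(&self) {
        self.broken.store(true, Ordering::Release);
    }

    /// Called before using the session.
    fn ensure_healthy(&self) {
        // Happy path: a single uncontended atomic read.
        if !self.broken.load(Ordering::Acquire) {
            return;
        }
        // Cold path: exactly one caller wins the CAS and reconnects;
        // the losers simply proceed (or retry) against the fresh session.
        if self
            .broken
            .compare_exchange(true, false, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            self.reconnect();
        }
    }

    fn reconnect(&self) {
        // Re-establish the TCP session and re-prepare statements here.
    }
}
```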
Hi, I've just created PR #8865 for this. Rgds
Closed accidentally?
It seems like it. The problem still persists, with plenty of those frameworks engineered just for the appearance of performance.
I think the issue is mostly concluded. The frameworks mentioned would have already responded and submitted patches if they were interested in fixing this in their bench code. Besides, as I and others mentioned, there will not be a major perf impact caused by a cold-path reconnection of the db session. From what I see, unless there is a rule change to the db benchmarks, there is no way to motivate everyone to make the change. (I don't advocate for a rule change.)
Well, the current rules seem to be enough to disqualify such frameworks.
How can a web framework that primarily uses a SQL database be considered production-grade if it fails to simply reconnect after a connection failure?
Yeah, personally I agree with you on this issue, and I would be happy to see every framework implement db session persistence. That said, from what I see here, people have drastically different opinions on what "production grade" means, and it's really hard and sometimes downright impossible to reach consensus. There are a bunch of similar issues left unresolved that come down to how to define words. It's kinda sad, but also normal for a big community.
Sorry for another ping. Recently I've found a flaw in this approach: user-customized types could be changed during db downtime and cause a conflict between the client's cached types and prepared statements. Therefore I gotta say there is a measurable perf regression with the change, around 1~2% in the db-related scores. And I'm confident that almost all Rust frameworks will suffer a perf regression if the connection pool approach is required. The severity of the regression would differ from framework to framework, but the majority of the top scorers will see a huge regression, as they use the same popular crates.
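To illustrate the flaw being described, here is a hypothetical sketch of a client-side statement cache (the names are mine, not from any particular crate): the cache pins type OIDs at prepare time, and if custom types are dropped and recreated while the connection is down, those OIDs go stale. The safe behavior is therefore to throw the whole cache away on reconnect and pay the cost of re-preparing, which is where a regression like the 1~2% above can come from.

```rust
use std::collections::HashMap;

/// Metadata captured when a statement is prepared.
struct PreparedMeta {
    server_name: String,  // name of the server-side prepared statement
    param_oids: Vec<u32>, // type OIDs recorded at prepare time
}

/// Client-side cache keyed by SQL text.
struct StatementCache {
    prepared: HashMap<String, PreparedMeta>,
}

impl StatementCache {
    /// After a reconnect, cached OIDs may no longer match the server
    /// (e.g. DROP TYPE / CREATE TYPE happened during the downtime), so
    /// everything must be invalidated and re-prepared on demand.
    fn on_reconnect(&mut self) {
        self.prepared.clear();
    }
}
```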
@Xudong-Huang could you add reconnection capabilities to your pool here? @itrofimow do you happen to know why userver dropped so significantly in round 23? It looks like a bug/regression.
TL;DR: in the current top 20, only 7 implementations survive a temporary database connectivity loss; the others either hang, crash, or get stuck returning HTTP 500.
I decided to become a QA engineer for a day and give the frameworks a touch of a real-world scenario, to see how they would react.
I did this for every framework in the top 20 of the most recent continuous run.
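A rough sketch of this kind of check (the container name, port, and endpoint below are illustrative, not the exact procedure used): hit a database-backed endpoint, restart the database container, and watch whether the implementation goes back to answering 200.

```rust
use std::process::Command;
use std::{thread, time::Duration};

// Shells out to curl for brevity; assumes curl and docker are installed.
fn http_status(url: &str) -> String {
    let out = Command::new("curl")
        .args(["-s", "-o", "/dev/null", "-w", "%{http_code}", url])
        .output()
        .expect("failed to run curl");
    String::from_utf8_lossy(&out.stdout).into_owned()
}

fn main() {
    let url = "http://localhost:8080/db"; // assumed endpoint
    println!("before restart: {}", http_status(url));

    Command::new("docker")
        .args(["restart", "tfb-database"]) // assumed container name
        .status()
        .expect("failed to restart the database container");

    // Poll for recovery for up to a minute.
    for attempt in 0..30 {
        thread::sleep(Duration::from_secs(2));
        let code = http_status(url);
        println!("attempt {attempt}: {code}");
        if code == "200" {
            println!("recovered");
            return;
        }
    }
    println!("did NOT recover (hangs, 5xx, or closed connections)");
}
```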
The results I've got:
Some of the rules in the spec are a bit vague, intentionally I believe, and Rule 1 of the General Test Requirements is definitely one of them, but I would be very surprised to learn that the behavior observed qualifies as "production-grade" in anyone's book.
Having a reliable way to recover from potential errors is unlikely to be free performance-wise.
Some of the frameworks above come with configurations that behave reasonably in the given scenario, but their reliability cost is not just "not free", it's staggering: axum, for example, which hits a very impressive 550K RPS and is ranked 9th in the "Single Query" test, has a configuration that survives a database restart. That configuration drops its performance to 90K RPS (a 6x degradation²) and rank 262, even below a MongoDB-based implementation for that test.
Considering the above, from the perspective of a web developer trying to choose the right tool for the job, I personally find the performance figures shown misleading.
Curious to hear the community's thoughts on the matter.
Footnotes
1. By 502 I mean "closes the client connection right away".
2. I think it's not configured optimally, but still.