Remove ignore system bootstrap checks #20511
Conversation
Today we allow system bootstrap checks to be ignored with a setting. Yet, the system bootstrap checks are as vital to the health of a production node as the non-system checks (e.g., the original bootstrap check, the file descriptor check, is critical for reducing the chances of data loss from the limit being set too low). This commit removes the ability to ignore system bootstrap checks.
I'm opening this PR for discussion; there is no need for a review at this time. Personally, I'm very torn on whether or not to do this, with the one factor pushing me towards removal being the file descriptor check. This is one of the most critical checks; it is a system check that is skipped when system bootstrap checks are ignored, yet it is also one of the most likely roadblocks.
I think that it just comes down to whether there's a legitimate scenario where someone in a development environment won't be able to start ES without this setting. I doubt there is, but I think you're more familiar with esoteric OS things than I am. I think removing it makes sense for production because even if they can't fix those settings, then it's not really a valid production environment; it's only trying to give developers a chance to get started that has me thinking at all.
I realized that most of the frustration around the bootstrap checks is coming from testing scenarios. For instance, Elasticsearch runs in a VM or some kind of container and must communicate with the outside via HTTP. Now setting
LGTM, let's do this.
```diff
 for (final Check check : checks) {
     if (check.check()) {
-        if ((!enforceLimits || (check.isSystemCheck() && ignoreSystemChecks)) && !check.alwaysEnforce()) {
+        if (!enforceLimits && !check.alwaysEnforce()) {
```
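To make the effect of the diff concrete, here is a minimal, self-contained sketch of the post-change enforcement logic. The class, interface, and method names are illustrative stand-ins, not the actual Elasticsearch internals (which live in `org.elasticsearch.bootstrap`):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BootstrapCheckSketch {

    // Hypothetical stand-in for a bootstrap check.
    interface Check {
        boolean check();            // true means the check failed
        String errorMessage();
        default boolean alwaysEnforce() { return false; }
    }

    // After this change there is no isSystemCheck()/ignoreSystemChecks branch:
    // every failed check is either enforced (collected as an error) or, in
    // development mode, merely logged as a warning.
    static List<String> run(final List<Check> checks, final boolean enforceLimits) {
        final List<String> errors = new ArrayList<>();
        for (final Check check : checks) {
            if (check.check()) {
                if (!enforceLimits && !check.alwaysEnforce()) {
                    System.out.println("warning: " + check.errorMessage());
                } else {
                    errors.add(check.errorMessage());
                }
            }
        }
        return errors;
    }

    public static void main(final String[] args) {
        final Check fileDescriptors = new Check() {
            @Override public boolean check() { return true; } // simulate a failure
            @Override public String errorMessage() { return "max file descriptors too low"; }
        };
        // Production mode: the failure is a hard error.
        System.out.println(run(Arrays.asList(fileDescriptors), true));
        // Development mode: the same failure is only a warning.
        System.out.println(run(Arrays.asList(fileDescriptors), false));
    }
}
```

The key point of the simplification: whether a failed check is fatal now depends only on `enforceLimits` (are we in production mode?) and `alwaysEnforce()`, with no per-category escape hatch.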
`man == false` reads so much easier :)
@s1monw So when setting
* master: (119 commits)
  - allow settings logging level via a sys config in unit tests
  - [fix] JSON Processor was not properly added (#20613)
  - [docs] [fix] `field` is no longer an option for the script processor (#20614)
  - [TEST] Fix NumberFieldMapperTests.testNoDocValues to call correct helper method.
  - [DOCS] Remove non-valid link to mapping migration document
  - Revert "Default `include_in_all` for numeric-like types to false"
  - test: add a test with ipv6 address
  - docs: clearify that both ip4 and ip6 addresses are supported
  - Include complex settings in settings requests
  - Add production warning for pre-release builds
  - Clean up confusing error message on unhandled endpoint
  - [TEST] Increase logging level in testDelayShards()
  - change health from string to enum (#20661)
  - Provide error message when plugin id is missing
  - Document that sliced scroll works for reindex
  - Make reindex-from-remote ignore unknown fields
  - Remove NoopGatewayAllocator in favor of a more realistic mock (#20637)
  - Remove Marvel character reference from guide
  - Fix documentation for setting Java I/O temp dir
  - Update client benchmarks to log4j2
  - ...
Today we allow system bootstrap checks to be ignored with a setting. Yet, the system bootstrap checks are as vital to the health of a production node as the non-system checks (e.g., the original bootstrap check, the file descriptor check, is critical for reducing the chances of data loss from being too low). This commit removes the ability to ignore system bootstrap checks. Relates #20511
@cwurm Yes.
Very bad decision. This change prevents creating and running a cluster on Docker when the host doesn't meet all the bootstrap conditions.
@ozlevka If you're trying to run production systems in a Docker container where you are unable to fix these issues, then you are going to crash and lose data later on. So it is a good thing that we're preventing you from doing this. If you just want to use these containers for testing, you can configure
Of course, the use of Docker is for functional testing and not for production.
@ozlevka Maybe next time just describe your use case in detail and ask how you can solve it instead of moaning. We would all appreciate that.
@s1monw Very simple use case: run a cluster in Docker containers...
@s1monw I participate in many open source communities, and comments like yours make this community one of the worst. I'm running into the same issue as @ozlevka in that after upgrading to 5.0 I can no longer run inside Docker. Why don't you let me worry about crashing and losing data, instead of preventing me from starting at all? I'm an adult and can deal with it myself.
@jasontedor Can you revert this? I want an option to run in less-than-ideal configurations.
@mingfang I'm very sorry, but this change will not be reverted. Note that if you're not forming a cluster, only running a single node for testing, you can expose HTTP via http.host and leave transport bound to localhost; this will avoid tripping the bootstrap checks.
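As a concrete illustration of that workaround, a single test node's elasticsearch.yml might look roughly like this (a sketch assuming Elasticsearch 5.x defaults; the key point is that transport stays on a loopback address):

```yaml
# elasticsearch.yml - single node used only for testing (illustrative)
# Expose HTTP so clients outside the VM/container can reach the node:
http.host: 0.0.0.0
# Do not set network.host or transport.host: transport then stays bound to
# localhost, the node remains in "development mode", and the bootstrap
# checks are not enforced.
```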
@XANi don't base that assumption on 2.x, which used a combination of mmapfs and niofs. Elasticsearch 5.x uses mmapfs only, which completely changes the max map count requirements. And if this value is not set high enough, you lose data. Hence the very sane checks.
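For reference, the max map count check corresponds to the `vm.max_map_count` kernel setting; the Elasticsearch documentation recommends at least 262144. This is system configuration, typically applied on the Docker host rather than inside the container:

```sh
# Inspect the current limit
sysctl vm.max_map_count

# Raise it for the running system (requires root)
sysctl -w vm.max_map_count=262144

# Persist the change across reboots
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf
```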
This might explain why some people are still running Elasticsearch 2.4 in OpenShift environments?
Just chipping in on this discussion, not throwing oil on the flames. (@s1monw I do tend to agree that the tone of your October comment above is misplaced.) I regret removing the possibility of running in 'development mode' even when clustering. We are big fans of Elasticsearch and it was one of the first technologies we deployed on our Hex devices. We wrap deployable services as Docker containers to deploy them on our 6-node microclusters, which by definition are development, demo, and POC setups, but yes, truly clustered. Because of this change, we cannot easily force ES into development mode or have it bypass the bootstrap checks. I'm not sure that tweaking all these system settings to production-grade values is a sound choice for 16GB/1TB microservers. I'll have to go through the list and provide patches to our install base to allow for these settings. Forcing dev mode, or a way to have the bootstrap checks act only as severe warnings, would have been better. But I'm assuming this will never come back in such a form, would it?
Correct: you can still cluster if you are binding to localhost, just not to other interfaces.
Kind of difficult to bind to localhost on a headless server and within Docker containers, no? Or am I missing something here?
Just for some perspective: what you consider a microserver is a full-sized server to some others. I have helped plenty of users with 8 GB of RAM. The users that have been burned by these settings lost data, and the number of advanced users who truly know the risks is tiny in comparison to that risk.
@pickypg Good to hear and know. So we'll have to push our Hexes more as production-grade clusters :-) I'll check on how to implement the settings upon install of the containers. But I'd like to avoid building our own containers as much as possible. I'll keep you posted!
Sorry if I missed something, but is there a solution for the following use case with Elasticsearch 5.2.x?
Currently this fails because of the bootstrap checks.
... which is preferable to you losing your data later on when mmap no longer works.
@clintongormley Not sure if you are trolling or what, but @moritzzimmer's comment clearly states he is trying to use this for integration tests and isn't going to need that data later on.
@termie I would never troll :) I missed the integration test comment.
Then just add the flag back and add a warning message, "@clintongormley told you so", for when that happens ;)
@clintongormley So do you have a solution for my use case?
@moritzzimmer sounds like an option would be to submit a PR rolling back this patch and changing the flag from
+1 ^. Given the existence of Docker testing use cases where we might not have control over kernel tunables and aren't interested in the long-term integrity of our data, it would be incredibly helpful to have the [non-default] option to ignore these bootstrap checks. Surely a stern warning against using the flag in production environments would be a less nuclear option than removing it entirely?
Or just a "developer/test mode" where:
which would prevent someone from accidentally leaving it on a production cluster, yet still allows running tests.
If your test environment resembles a production environment, but cannot be configured like a real production environment, then it is not a valid test environment. I think that's the bigger concern here: not that some production-level check is failing you.

I'm not at all familiar with Wercker, but I do know that a lot of people successfully use clustered environments of Dockerized Elasticsearch instances, which implies that they successfully use Docker with bootstrap checks. Therefore, I think the better option is to work with Docker/container authors to resolve these issues where you're forced to accept a system-level setting: it's their issue; we're just the symptom here. I'm not saying that it will be easy, but it's the right recourse here.

As humorous as the suggested reimplemented-setting rename is, it doesn't change the fact that a lot of people will progress from development into production with the exact same Puppet/Chef/Ansible/Salt/hand-rolled configuration. It also reopens the noted can of worms that the acknowledgement does not outweigh the concern; let's not pretend that other users won't flip the switch just to make it work because some random website told them how to work around a log message that they did not understand (copy/paste for the win). Even worse, it's often not the same person going from development to production, and they frequently don't have the experience that many of you do (because they're taking a "working" system configuration).
If you want this use case, then you can already bypass the bootstrap checks by standing up a single node and only setting
This would mean that there is code that we support and test just to support and test a non-production scenario, which means your real production scenario could trigger unexpected bugs and issues that are completely unreproducible in your test environment. No thanks.
@pickypg Reading through your comment confuses me a bit; you seem to flip-flop. Why do you allow a non-production mode? You spend the first paragraph chastising people for not meeting your philosophically pure definition of the word "testing" by apparently not literally cloning their production environment. Your second paragraph sort of implies that you think trying to make the gigantic ecosystem (read: much larger than Elastic or Elasticsearch) of Docker and containers change is a better option than keeping a flag that allows that gigantic community to use your product for testing in a way that produces better quality code for their product. The third paragraph says that even allowing for the possibility that these checks are not run means somebody might accidentally put a development configuration into prod. Then, in a magical turn of events, in the fourth paragraph you share that there is a non-production mode enabled by, somewhat surprisingly, an extremely innocuous combination of flags being set and unset.

To recap, we can't have this flag because: 1) testing against anything that is not literally equal to your prod is not worth testing, 2) everybody else is wrong (and you don't use containers), 3) somebody could accidentally configure their cluster wrong, 4) oh, but here is a really simple way you can configure your cluster wrong that is much less obvious to any code reviewer. If you really believe what you say in your 1st and 3rd paragraphs, remove the bug described in the 4th; it is in clear, direct opposition to your initial statements. If you believe what you say in your 2nd paragraph, make sure the only official distribution of your code is as VM images so you can control the environment and keep these pesky kids
If you really believe what you say in your 3rd paragraph, make sure your code only runs when it matches a checksum signed by Elastic (may as well remove access to the source, too). Otherwise some crazy programmer might try to change what is running to work in the environment they are trying to run it in, erroneously thinking that they are aware of the constraints and limitations of their decisions.
@termie, I don't agree with your summary of what is being said:
Such a single-node, not-production setup is what you can use to run your integration tests against, and hence it should work on wercker.com as well. I didn't think of this setup while reading these comments on CI and integration test setups; I required the bypass because I needed a clustered setup on our micro-datacenters. But I do agree that not all test setups need to totally resemble your production setup. It depends on your tests, obviously! Hope this helps!
@wimvanleuven Last message from me, as I have already spent more time in this thread than I'm sure anybody is happy with: while I think your words are clearer and less myopic than @pickypg's comments, the base conflict of your two statements seems to remain (#1: we shouldn't allow something that isn't for production via a simple configuration; #2: here is how to do something that isn't for production via a simple configuration). Regardless, I leave you with the specific test case involved, as commented above and not yet addressed (he can't exploit the http.host bug), because he is one of the real users here with a real, specific use case: #20511 (comment)
It's not a bug, it's by design. You can't form a cluster on multiple hosts if
This is a poor assumption. We prefer developing with more system context and proper integration tests. The single-node case is simply too trivial to do any infrastructure-as-code sanity checks on. This effectively killed the resources on our dev environment, and now our engineers don't test Ansible/infrastructure code changes in dev anymore before running plays against prod! I get that there's an effort to make ES idiot-proof... hence this reasoning:
BUT not all of your clients are negligent about prod vs. dev. I have a legitimate Ansible + Vagrant test suite that sets up a 5-node cluster mimicking (only functionally) our production cluster design. The purpose is to check that the Ansible orchestration, including rolling upgrades, works. I have explicitly set a difference between the dev env and the prod env (inventories, in Ansible terms). E.g. I tried to map those to We have multiple special ES node roles, i.e. dedicated masters, coordinating, hot data nodes (SSD) vs. cold data nodes (HDD), etc. Every ES node ends up being greedy and needing heaps of RAM (pun intended) that it might never actually touch while doing a dev functional cluster-build test... Unfortunately, our test VM environment to run this is currently limited to 16GB of RAM, so throwing away memory in dev just because of production bootstrap checks hurts a whole lot. I'd love a middle ground on the boot checks where most things are checked, but the memory requirements can at least be relaxed a little to allow things like
Your test clusters should pass the bootstrap checks too then, just like your production clusters.
The purpose of this flag is to be able to force the bootstrap checks in a testing environment where they might not otherwise be enabled because the test node binds to localhost or has a single-node discovery type.
If it's not needed, lower the heap to 512m?
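For anyone following along, that suggestion maps to two lines in config/jvm.options (512m is illustrative; the initial and max heap sizes should match, per the heap size bootstrap check):

```
-Xms512m
-Xmx512m
```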
Thanks, currently we seem to get away with If anyone else stumbles into this, I also found For the dev test workload, we don't know exactly how much heap is needed, and I was hoping to skip the effort of learning how to monitor/profile Java heap internals and just start small, letting the heap grow if need be. I understand Elasticsearch production is more stable if all the heap is allocated upfront (and locked), but for dev functional testing, allowing the heap to grow dynamically would have been a nice convenience. I used this for dev cluster testing with older versions of Elasticsearch. That's no longer possible due to the enforced bootstrap checks, and there's no opt-out.
Your paragraph about memory lock is not right. The memory lock setting defaults to false, and it doesn't impact allocation of the heap, only what is allocated at the time we lock (locking is performed early in startup). This in turn is impacted by the starting heap size equaling the max heap size, and by the always-pre-touch setting, which causes the JVM to touch every page of the heap on startup, forcing allocation.
Actually we advise disabling swap completely if possible, locking is a fallback if not. |
Thanks for clearing up my misconception! Totally appreciate the advice. So for sweating a small dev environment, something I could experiment with is jvm.options:
If Elasticsearch is the only thing running on the prod system (or in a container), that totally makes sense. On the other hand, if one is sharing a dev platform with a host of other components (e.g. Logstash, Kibana, nginx, Kafka, ZooKeeper, rsyslog), disabling swap might have the unintended consequence of hurting Linux's ability to overcommit memory. Overcommit is probably a bad thing on a dedicated prod node, but useful on a crowded dev/testing host?
For the heap sizes that you're talking about, I doubt that it would make much, if any, meaningful difference.
Agree, that's what I buried in "if possible".
Yes.
Just a quick recap of what, after many test cycles, I've found has helped avoid unnecessary memory pressure/greediness and pain in dev:
Adding