Change APIServer startup message #48
Conversation
We found that the APIServer was not fully up when it logged "Serving (in)securely on ...". We tried to find a different log line we can wait for to know when the APIServer is fully up. The autoregister controller came in with kubernetes/kubernetes#42732. It seems to be the last thing to come up before the server is ready to go.
/test pull-frameworks-test
I think it would be preferable to do a health check call to the server instead of attempting to parse arbitrary output. This would be consistent with how existing tests work in k8s.io/kubernetes. |
We now use the health check API of the APIServer to determine when the APIServer is ready. The implementation in the `processState` gives precedence to the health check API over the start message -- if you specify both, we only use the endpoint.
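Roughly, the intended precedence looks like the sketch below (not the exact diff; `waitForStartMessage` is a made-up name standing in for the existing stderr matching):

	// Prefer the health check endpoint; only fall back to the start message.
	if ps.HealthCheckEndpoint != "" {
		healthCheckURL := ps.URL
		healthCheckURL.Path = ps.HealthCheckEndpoint
		// Poll the endpoint until it answers 200 OK; StartMessage is ignored.
		go pollURLUntilOK(healthCheckURL, ready)
	} else if ps.StartMessage != "" {
		// Legacy behaviour: wait for the configured message on stderr.
		go waitForStartMessage(ps.StartMessage, ready)
	}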
@marun: PTAL :)
Thank you for your quick response, good progress! I have some suggestions on how it could be improved.
@@ -122,3 +121,27 @@ func isSomethingListeningOnPort(hostAndPort string) portChecker {
		return true
	}
}

func CheckAPIServerIsReady(kubeCtl *integration.KubeCtl) {
Why is this check required? It seems more like validation of the underlying kubeapi implementation than of the fixture that is managing a kube-apiserver binary.
We are porting tests in k/k and had an issue there. We took the code from this test and used it here. This should guard against the same issue re-appearing, even if we change the method by which we check that the APIServer is ready to go.
Consider documenting your rationale in a comment so that this knowledge is available to future maintainers.
integration/internal/process.go
Outdated
// assume the process is ready to operate. E.g. "/healthz". If this is set,
// we ignore StartMessage.
HealthCheckEndpoint string
// Message to wait for on stderr. If we receive this message, we assume the
Why is it desirable to still support waiting for a message on stderr? All kube api servers and controllers are likely to have health check endpoints, given that this is how pod health is generally determined.
- We didn't port all the things, only the APIServer, to use that new method
- We are not sure if etcd has a health check API (it probably has)
- We know that the virtual kubelet with the mock provider does not expose its health via an API (see Integration: Introduce notion of Controller Manager & Scheduler #41)
- I think it does not hurt or add too much complexity
- If we find that all the things we want to support have an API and all the processes are ported to use that new mechanism, we are happy to phase out the detection via waiting for a string on stderr.
Consider updating the comment for StartMessage, discouraging its use in favor of a health check. It's not the complexity I'm worried about, rather that anyone naively relying on a startup message appearing in stderr is likely to run into the same type of problem that prompted you to create this PR.
Mark as deprecated? (https://blog.golang.org/godoc-documenting-go-code)
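For example, following that godoc convention the field comment could end with a Deprecated: paragraph, roughly like this (the wording is only a suggestion):

	// StartMessage is the message to wait for on stderr. If we receive this
	// message, we assume the process is ready to operate. Ignored if
	// HealthCheckEndpoint is specified.
	//
	// Deprecated: use HealthCheckEndpoint instead; waiting for a log line can
	// report readiness before the process is actually able to serve.
	StartMessage string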
integration/internal/process_test.go
Outdated
})

It("hits the endpoint, and successfully starts", func() {
	processState := &ProcessState{}
DRY - Consider creating a function to initialize a ProcessState instance rather than duplicating those details in every test.
We try to keep the tests as easy to grasp as possible and to have them tell a story. That means that we sometimes/often deliberately end up with rather verbose tests and duplication in them.
Having said that, I am happy to revisit this. Especially, but not only, in this case it might be better to abstract all that into a func createSimpleProcessState() *ProcessState or such.
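A sketch of what such a helper could look like, using the fields that appear elsewhere in these tests (the concrete defaults are placeholders):

	// createSimpleProcessState returns a ProcessState pre-filled with the values
	// most tests need; each test then only overrides what it cares about.
	func createSimpleProcessState() *ProcessState {
		processState := &ProcessState{}
		processState.Path = "/nonexistent"
		processState.StartTimeout = 10 * time.Second
		return processState
	}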
integration/internal/process.go
Outdated
if ps.HealthCheckEndpoint != "" {
	healthCheckURL := ps.URL
	healthCheckURL.Path = ps.HealthCheckEndpoint
	go pollURLUntilOK(healthCheckURL, ready)
If the start timeout is reached, this goroutine should be terminated.
Right, thanks! 👍
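One possible shape for that, sketched against the names already used in this file (a stop channel for the poller; the reviewer suggests a very similar structure further down):

	stopPolling := make(chan struct{})
	go pollURLUntilOK(healthCheckURL, ready, stopPolling)

	select {
	case <-ready:
		return nil
	case <-timedOut:
		// Don't leak the polling goroutine once we give up.
		close(stopPolling)
		return fmt.Errorf("timeout waiting for process %s to start", path.Base(ps.Path))
	}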
integration/internal/process.go
Outdated
	ready <- true
	return
}
time.Sleep(time.Millisecond * 100)
Consider exposing the loop iteration interval as a configurable option alongside timeout to reduce the potential for negative interaction between interval and timeout (e.g. timeout < interval).
It might be worth cribbing from the apimachinery polling code:
https://github.com/kubernetes/apimachinery/blob/master/pkg/util/wait/wait.go#L240
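With k8s.io/apimachinery/pkg/util/wait, the polling helper could collapse to something like this (a sketch; pollInterval, startTimeout and healthCheckURL are assumed to be plumbed through from the ProcessState):

	// wait.Poll handles the interval/timeout interaction for us.
	err := wait.Poll(pollInterval, startTimeout, func() (bool, error) {
		res, err := http.Get(healthCheckURL.String())
		if err != nil {
			// Not reachable yet -- keep polling instead of failing.
			return false, nil
		}
		defer res.Body.Close()
		return res.StatusCode == http.StatusOK, nil
	})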
👍
@marun, thanks for the great input! PTAL
I just recognized that the poll frequency is not configurable yet. I can address this tomorrow.
Most of the changes look good! I have some inline suggestions for simplifying the goroutine handling.
integration/internal/process_test.go
Outdated
processState.Path = "/nonexistent"
Context("when the healthcheck returns ok", func() {
	BeforeEach(func() {
		server.RouteToHandler("GET", "/healthz", ghttp.RespondWith(http.StatusOK, ""))
YAGNI - What is the value of putting this in BeforeEach given that there is only a single test? Same comment on some of the other contexts in this file.
This is a pattern we are just used to using over here. I can see that it might look strange and overly complicated / too exhaustive / ... sometimes. Just to give you a bit of context on why we structured the tests as they are:
- We have the different Contexts, which each tell a different (sub)story of what we want to test in this particular case
- In each Context we create the state as described in the title of the Context, mostly with {Just,}BeforeEach or so
- In the It we do the actual test, doing what we have described in the title of the It, with the assumption that the context is already prepared as described
So it is not so much about sharing setup code for multiple tests (at least in this example), for us it is more about the story telling (and arguably(?) nicely readable ginkgo output maybe).
If I'd pull the server.RouteToHandler("GET", "/healthz", ...) from the Context's BeforeEach into the It, I'd feel an urge to change the title of the It to something like "hits an endpoint which always returns OK, and successfully starts".
Does my long-winded explanation make any sense to you?
It's still not clear to me why some code belongs in BeforeEach and other code does not, and I'm concerned that the distinction is arbitrary. I think separating out code in arbitrary ways imposes costs to readability and maintainability, because anything that's arbitrary is difficult to communicate or have common standards for. I'm not going to force you to change, I'm just providing my perspective.
integration/internal/process.go
Outdated
for {
	select {
	case <-time.Tick(time.Millisecond * 100):
The following comment remains unresolved: #48 (comment)
I'm also not sure of the value of adding indirection to such a simple function. Consider simplifying:
for {
	res, err := http.Get(url.String())
	if err == nil && res.StatusCode == http.StatusOK {
		ready <- true
		return
	}
	select {
	case <-stopCh:
		return
	default:
		time.Sleep(pollInterval)
	}
}
integration/internal/process.go
Outdated
	return io.MultiWriter(safeWriters...)
}

func pollURLUntilOK(url url.URL, ready chan bool, timedOut <-chan time.Time) {
(No action required) The name timedOut presumes the reason this function should exit. Consider using a name that is more functional (e.g. stopCh is a common convention in k/k).
integration/internal/process.go
Outdated
		return fmt.Errorf("timeout waiting for process %s to start", path.Base(ps.Path))
	}
}

func teeTimer(src <-chan time.Time) (chan time.Time, chan time.Time) {
I think this approach is a bit heavyweight if the goal is providing a stop channel to the goroutine. Consider the following example:
var stopCh chan struct{}
...
stopCh = make(chan struct{})
go pollURLUntilOK(healthCheckURL, ready, stopCh)
...
select {
...
case <-timedOut:
	if stopCh != nil {
		close(stopCh)
	}
}
@@ -122,3 +121,40 @@ func isSomethingListeningOnPort(hostAndPort string) portChecker {
		return true
	}
}

// CheckAPIServerIsReady checks if the APIServer is really ready and not only
Thank you!
// message, we assume the process is ready to operate. Ignored if
// HealthCheckEndpoint is specified.
//
// The usage of StartMessage is discouraged, favour HealthCheckEndpoint
Great!
})
It("returns a timeout error and stops health API checker", func() {
	processState.HealthCheckEndpoint = "/healthz"
	processState.StartTimeout = 500 * time.Millisecond
(No action required) Consider using a variable or constant in place of a magic number where possible. Naming a value can document its purpose and simplify reuse across tests:
shortTimeout = 100 * time.Millisecond
longTimeout = 500 * time.Millisecond
foreverTimeout = 20 * time.Second
The decision was made that the defaulting is not done in the `DoDefaulting` function, as this does not apply to each process (as of now). In the future we might consider splitting the `DoDefaulting` function into multiple functions and applying only certain ones per process.
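Concretely, that could mean defaulting the poll interval right where the health check is wired up instead of in `DoDefaulting`, along these lines (the field name is illustrative, not necessarily the one from this PR):

	// Only processes that use the health check endpoint need a poll interval.
	if ps.HealthCheckPollInterval == 0 {
		ps.HealthCheckPollInterval = 100 * time.Millisecond
	}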
@marun, thanks for the great review, comments & tips! Let me know what you think about the PR now. /test pull-frameworks-test
/lgtm
@marun: changing LGTM is restricted to assignees, and only kubernetes-sigs org members may be assigned issues.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: marun, totherme
Context("when the polling interval is not configured", func() {
	It("uses the default interval for polling", func() {
		processState.HealthCheckEndpoint = "/helathz"
typo
We found that the APIServer was not fully up when it logged "Serving (in)securely on ...". We tried to find a different log line we can wait for to know when the APIServer is fully up.
The autoregister controller came in with kubernetes/kubernetes#42732. It seems to be the last thing to come up before the server is ready to go.
Is there a better thing to wait for, or is this a good indicator that the APIServer is ready to use?