Skip to content
This repository was archived by the owner on Jan 17, 2020. It is now read-only.

Change APIServer startup message #48

Merged

Conversation

totherme
Copy link
Contributor

We found that the APIServer was not fully up when it logged "Serving
(in)securely on ...". We tried to find a different log line we can wait
for to know when the APIServer is fully up.

The autoregister controller came in with
kubernetes/kubernetes#42732. It seems to be the
last thing to come up before the server is ready to go.

Is there a better thing to wait for, or is this a good indicator that the APIServer is ready to use?

We found that the APIServer was not fully up when it logged "Serving
(in)securely on ...". We tried to find a different log line we can wait
for to know when the APIServer is fully up.

The autoregister controller came in with
kubernetes/kubernetes#42732. It seems to be the
last thing to come up before the server is ready to go.
@totherme totherme requested a review from apelisse March 19, 2018 11:47
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 19, 2018
@hoegaarden
Copy link
Contributor

/test pull-frameworks-test

@marun
Copy link
Contributor

marun commented Mar 19, 2018

I think it would be preferable to do a health check call to the server instead of attempting to parse arbitrary output. This would be consistent with how existing tests work in k8s.io/kubernetes.

We now use the health check API of the APIServer to determine when the
APIServer is ready.

The implementation in the `processState` gives precedence to the the
health check API in favour of the start message -- if you specify both,
we only use the endpoint.
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 20, 2018
@totherme
Copy link
Contributor Author

@marun : PTAL :)

Copy link
Contributor

@marun marun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your quick response, good progress! I have some suggestions on how it could be improved.

@@ -122,3 +121,27 @@ func isSomethingListeningOnPort(hostAndPort string) portChecker {
return true
}
}

func CheckAPIServerIsReady(kubeCtl *integration.KubeCtl) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this check required? It seems more like validation of the underlying kubeapi implementation than of the fixture that is managing a kube-apiserver binary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are porting tests in k/k and had an issue there. We took the code from this test and used it in here. This should guard against the same issue re-appearing, even if we change the method by which we check that the APIServer is ready to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider documenting your rationale in a comment so that this knowledge is available to future maintainers.

// assume the process is ready to operate. E.g. "/healthz". If this is set,
// we ignore StartMessage.
HealthCheckEndpoint string
// Message to wait for on stderr. If we recieve this message, we assume the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it desirable to still support waiting for a message on stderr? All kube api servers and controllers are likely to have healthcheck endpoints given that is how pod health is generally determined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We didn't port all the things, only the APIServer, to use that new method
  • We are not sure if etcd has an health check API (it probably has)
  • We know that the virtual kubelet with the mock provider does not expose its health via API (see Integration: Introduce notion of Controller Manager & Scheduler #41)
  • I think it does not hurt or adds too much more complexity
  • If we find that all the things we want to support have an API and all the processes are ported to use that new mechanism, we are happy to phase out the detection via waiting for a string on stderr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider updating the comment for StartMessage discouraging its use in favor of a health check. It's not the complexity I'm worried about, rather that anyone naively relying on a startup message appearing in stderr is likely to run into the same type of problem that prompted you to create this PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

})

It("hits the endpoint, and successfully starts", func() {
processState := &ProcessState{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DRY - Consider creating a function to initialize a ProcessState instance rather than duplicating those details in every test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to keep the tests as easy to grasp as possible and to have them tell a story. That means that we sometimes/often deliberately end up with rather verbose tests and duplication in them.

Having said that, I am happy to revisit this. Especially, but not only, in this case it might be better to abstract all that into a func createSimpleProcessState() *ProcessState or such.

if ps.HealthCheckEndpoint != "" {
healthCheckURL := ps.URL
healthCheckURL.Path = ps.HealthCheckEndpoint
go pollURLUntilOK(healthCheckURL, ready)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the start timeout is reached, this goroutine should be terminated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, thanks! 👍

ready <- true
return
}
time.Sleep(time.Millisecond * 100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider exposing the loop iteration interval as a configurable option alongside timeout to reduce the potential for negative interaction between interval and timeout (e.g. timeout < interval).

It might be worth cribbing from the apimachinery polling code:

https://github.com/kubernetes/apimachinery/blob/master/pkg/util/wait/wait.go#L240

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@hoegaarden
Copy link
Contributor

@marun , thanks for the great input! PTAL

@hoegaarden
Copy link
Contributor

I just recognized that the poll frequency is not configurable yet. I can address this tomorrow.

Copy link
Contributor

@marun marun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the changes look good! I have some inline suggestions for simplifying the goroutine handling.

processState.Path = "/nonexistent"
Context("when the healthcheck returns ok", func() {
BeforeEach(func() {
server.RouteToHandler("GET", "/healthz", ghttp.RespondWith(http.StatusOK, ""))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YAGNI - What is the value of putting this in BeforeEach given that there is only a single test? Same comment on some of the other contexts in this file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pattern we are just used to use over here. I can see that it might look strange and overly complicated / too exhaustive / ... sometimes. Just to give you a bit of context, why we structured the tests as they are:

  • We have the different Contexts which tell a different (sub)story of what we want to test in this particular case
  • In each Context we create the state as we described it in the title of the Context, mostly with {Just,}BeforeEach or so
  • In the It we do the actual test, do what we have described in the title of the It, with the assumption that the context is already prepared as described

So it is not so much about sharing setup code for multiple tests (at least in this example), for us it is more about the story telling (and arguably(?) nicely readable ginkgo output maybe).

If I'd pull the server.RouteToHandler("GET", "/healthz", ...) from the Context's BeforeEach into the It, I'd feel an urge to change the title of the It to something like

hits an endpoint which always returns OK, and successfully starts

Does my long-winded explanation make any sense to you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still not clear to me why some code belongs in BeforeEach and other code does not, and I'm concerned that the distinction is arbitrary. I think separating out code in arbitrary ways imposes costs to readability and maintainability, because anything that's arbitrary is difficult to communicate or have common standards for. I'm not going to force you to change, I'm just providing my perspective.


for {
select {
case <-time.Tick(time.Millisecond * 100):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following comment remains unresolved: #48 (comment)

I'm also not sure of the value of adding indirection to such a simple function. Consider simplifying:

for {
  res, err := http.Get(url.String())
  if err == nil && res.StatusCode == http.StatusOK {
    ready <- true
    return
  }
    
  select {
    case <-stopCh:
      return
    default:
      time.Sleep(pollInterval)
  }  
}

return io.MultiWriter(safeWriters...)
}

func pollURLUntilOK(url url.URL, ready chan bool, timedOut <-chan time.Time) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action required) The name timedOut presumes the reason this function should exit. Consider using a name that is more functional (e.g. stopCh is a common convention in k/k).

return fmt.Errorf("timeout waiting for process %s to start", path.Base(ps.Path))
}
}

func teeTimer(src <-chan time.Time) (chan time.Time, chan time.Time) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach is a bit heavyweight if the goal is providing a stop channel to the goroutine. Consider the following example:

var stopCh chan struct{}
...
  stopCh = make(chan struct{})
  go pollURLUntilOK(healthCheckURL, ready, stopCh)
...
select {
...
case <-timedOut:
  if stopCh != nil {
    close(stopCh)
  }
}

@@ -122,3 +121,40 @@ func isSomethingListeningOnPort(hostAndPort string) portChecker {
return true
}
}

// CheckAPIServerIsReady checks if the APIServer is really ready and not only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

// message, we assume the process is ready to operate. Ignored if
// HealthCheckEndpoint is specified.
//
// The usage of StartMessage is discouraged, favour HealthCheckEndpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

})
It("returns a timeout error and stops health API checker", func() {
processState.HealthCheckEndpoint = "/healthz"
processState.StartTimeout = 500 * time.Millisecond
Copy link
Contributor

@marun marun Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action required) Consider using a variable or constant in place of a magic number where possible. Naming a value can document its purpose and simplify reuse across tests:

shortTimeout = 100 * time.Millisecond
longTimeout = 500 * time.Millisecond
foreverTimeout = 20 * time.Second

The decission was made, that the defaulting is not done in the
`DoDefaulting` function, as this does not apply for each process (as off
now).
In future we might consider splitting the `DoDefaulting` function into
multiple functions and apply only certain on per process.
@hoegaarden
Copy link
Contributor

@marun, thanks for the great review, comments & tips! Let me know what you think about the PR now.

/test pull-frameworks-test

@marun
Copy link
Contributor

marun commented Mar 23, 2018

/lgtm

@k8s-ci-robot
Copy link
Contributor

@marun: changing LGTM is restricted to assignees, and only kubernetes-sigs org members may be assigned issues.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marun
Copy link
Contributor

marun commented Mar 23, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 23, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marun, totherme

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 2909416 into kubernetes-retired:master Mar 23, 2018

Context("when the polling interval is not configured", func() {
It("uses the default interval for polling", func() {
processState.HealthCheckEndpoint = "/helathz"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants