Skip to content

add custom problem detector plugin #145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

andyxning
Copy link
Member

@andyxning andyxning commented Oct 18, 2017

This PR will add custom plugin problem detector interface to node-problem-detector.

Proposal: https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#

Major changes:

  • move ApplyDefaultConfiguration for MonitorConfig from function to method. Ref
  • move ValidateRules for MonitorConfig from function to method for. Ref
  • move k8s.io/node-problem-detector/pkg/systemlogmonitor/util to k8s.io/node-problem-detector/pkg/util/tomb. Ref
  • move Rule type for system log config to pkg k8s.io/node-problem-detector/pkg/types. Ref
  • move LogMonitor interface to k8s.io/node-problem-detector/pkg/types and rename it to Monitor. Ref
  • Code clean. Ref

This change is Reviewable

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 18, 2017
@andyxning andyxning force-pushed the add_custom_problem_detector_plugin branch 16 times, most recently from 606d921 to b6e8d02 Compare October 19, 2017 07:57
@andyxning andyxning changed the title [WIP] add custom problem detector plugin add custom problem detector plugin Oct 19, 2017
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 19, 2017
@andyxning andyxning force-pushed the add_custom_problem_detector_plugin branch 4 times, most recently from 885e141 to 7b24581 Compare October 19, 2017 10:14
@andyxning
Copy link
Member Author

ping @Random-Liu @dchen1107 :)

@andyxning
Copy link
Member Author

Ping @Random-Liu @dchen1107

@andyxning andyxning force-pushed the add_custom_problem_detector_plugin branch 2 times, most recently from 63c0f89 to 1c8fd67 Compare October 29, 2017 07:06
@Random-Liu Random-Liu self-assigned this Nov 14, 2017
@Random-Liu
Copy link
Member

@andyxning I'll review this today.

@andyxning
Copy link
Member Author

@Random-Liu #144 is also ready to review. :)

Copy link
Member

@Random-Liu Random-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished half. Please address comments or reply.
I'll continue review tomorrow.

README.md Outdated
@@ -14,7 +14,7 @@ enabled by default in the GCE cluster.
# Background
There are tons of node problems could possibly affect the pods running on the
node such as:
* Hardware issues: Bad cpu, memory or disk;
* Hardware issues: Bad cpu, memory or disk, ntp service down;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ntp service down a hardware issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not. Just wan to add ntp as an example. :)

Will move to another type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add Infrastructure daemon issues type.

README.md Outdated
@@ -44,32 +44,38 @@ A problem daemon could be:
* An existing node health monitoring daemon integrated with node-problem-detector.

Currently, a problem daemon is running as a goroutine in the node-problem-detector
binary. In the future, we'll separate node-problem-detector and problem daemons into
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing this one? We are still not sure how this should work. :p

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am opinionated about this. :(

You're right. We have not concluded which way to run NPD with different problem daemons is the best one.

README.md Outdated

List of supported problem daemons:

| Problem Daemon | NodeCondition | Description |
|----------------|:---------------:|:------------|
| [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) | KernelDeadlock | A system log monitor monitors kernel log and reports problem according to predefined rules. |
| [AbrtAdaptor](https://github.com/kubernetes/node-problem-detector/blob/master/config/abrt-adaptor.json) | None | Monitor ABRT log messages and report them further. ABRT (Automatic Bug Report Tool) is health monitoring daemon able to catch kernel problems as well as application crashes of various kinds occurred on the host. For more information visit the [link](https://github.com/abrt). |
| [CustomPluginMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json) | On-demand(According to users configuration) | A custom plugin monitor for node-problem-detector to invoke and check various node problems with user defined check scripts. [Proposal is here](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#). |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Proposal is here/see proposal here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

README.md Outdated
* `--custom-plugin-monitors`: List of paths to custom plugin monitor config files, comma separated, e.g.
[config/custom-plugin-monitor.json](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json).
Node problem detector will start a separate custom plugin monitor for each configuration. You can
use different custom plugin monitors to monitor different node problems.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

http://APISERVER_IP:APISERVER_PORT?inClusterConfig=false
```
Refer [heapster docs](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes) for a complete list of available options.
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why indent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mainly used to make the code block indent with previous content in the same section.

// Validate configurations
err = l.config.Validate()
if err != nil {
glog.Fatalf("Failed to validate custom plugin config %+v. %v", l.config, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/./:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

for {
select {
case result := <-resultChan:
glog.V(3).Infof("Receive new plugin result. %+v", result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/./:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


// NewCustomPluginMonitorOrDie create a new customPluginMonitor, panic if error occurs.
func NewCustomPluginMonitorOrDie(configPath string) types.Monitor {
l := &customPluginMonitor{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do you want to change l to c or something else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. It is really difficult to recognize. :)

}
} else {
// For permanent error changes the condition
for i := range l.conditions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also generate an event XXChanged here? (Also do that for log monitor).

Then we don't need to add 2 entries for the same script.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, I prefer we separate events and conditions. This will make npd more flexible in case people we only need conditions aside from events.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is an issue I want to fix long time ago. We should always generate an event for condition switch, I think. We also do that in kubelet https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go#L430.

Usually condition change is not obvious enough to user.

// For permanent error changes the condition
for i := range l.conditions {
condition := &l.conditions[i]
if condition.Type == result.Rule.Condition {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For custom plugin, I feel like the condition may change back, right?
For log it's hard to identify, but for plugin, we could purely change condition status based on plugin return value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. With the output of npd plugins, we should update the condition status accordingly. Will do.

"type": "temporary",
"reason": "NTPIsDown",
"path": "./config/plugin/check_ntp.sh",
"timeout": "3s"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the per plugin timeout config. :)

@andyxning andyxning force-pushed the add_custom_problem_detector_plugin branch 5 times, most recently from f532d5c to 2b3e108 Compare November 15, 2017 15:30
"plugin": "custom",
"pluginConfig": {
"invoke_interval": "30s",
"timeout": "5s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this for now, but please add TODO in code. We should have per-rule interval.

"path": "./config/plugin/check_ntp.sh",
"timeout": "3s"
},
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should generate event for condition change. With that, you shouldn't need 2 rules here.

}
condition.Status = true
condition.Reason = result.Rule.Reason
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why break?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored. PTAL.

// For permanent error changes the condition
for i := range c.conditions {
condition := &c.conditions[i]
if condition.Type == result.Rule.Condition {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine the logic:

status = (result.ExitStatus >= cpmtypes.NonOK)
if condition.Status != status || condition.Reason != result.Rule.Reason {
  condition.Transition = timestamp
  condition.Message = result.Message
}
condition.Status = status
condition.Reason = result.Rule.Reason

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And please generate event. :P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have a conclusion on emitting events when condition change. I prefer adding this as a TODO and will be addressed in next PR. :)

"path": "./config/plugin/check_ntp.sh",
"timeout": "3s"
},
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes, for the same condition, status is not changed, but reason is changed. Without event, people will not even notice that.

}

p.Wait()
glog.Info("End to run custom plugins")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finish running custom plugins.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

stdout, err := cmd.Output()
if err != nil {
if _, ok := err.(*exec.ExitError); !ok {
glog.Errorf("Error in running plugin %q. %v. %v", rule.Path, err, string(stdout))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glog.Errorf("Error in running plugin %q: error - %v. output - %q", rule.Path, err, string(stdout))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


// trim suffix useless bytes
output = string(stdout)
output = strings.TrimSpace(output)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trimspace should have trimmed newline, I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. Have not realized this before. :)

output = strings.TrimSpace(output)
output = strings.TrimRight(output, "\n")

if cmd.ProcessState.Sys().(syscall.WaitStatus).Signaled() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the script is stopped because of timeout, I hope we could clarify that in the message. Currently, it seems that we'll only get signal: killed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enhanced log.


import (
"testing"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@Random-Liu
Copy link
Member

Random-Liu commented Nov 16, 2017

@andyxning I have a meeting soon. After that I'll come back and finish review.

import (
"fmt"
"time"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

type CustomPluginConfig struct {
// Plugin is the name of plugin which is currently used.
// Currently supported: custom.
Plugin string `json:"plugin, omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

Copy link
Member Author

@andyxning andyxning Nov 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also thought this when i prepare this PR. In order to be consistent with existing log configs, i choose to retain this Plugin field and the only valid value is custom.

// PluginConfig is global plugin configuration.
PluginGlobalConfig pluginGlobalConfig `json:"pluginConfig, omitempty"`
// Source is the source name of the custom plugin monitor
Source string `json:"source"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, just have Plugin name, and use Plugin name as the source?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mainly used to let users define the user specified event source. This should be retained, IMO.


timeout, err := time.ParseDuration(*cpc.PluginGlobalConfig.TimeoutString)
if err != nil {
return fmt.Errorf("Error in parsing global timeout %q. %v", *cpc.PluginGlobalConfig.TimeoutString, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Error/error
s/./:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


invoke_interval, err := time.ParseDuration(*cpc.PluginGlobalConfig.InvokeIntervalString)
if err != nil {
return fmt.Errorf("Error in parsing invoke interval %q. %v", *cpc.PluginGlobalConfig.InvokeIntervalString, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


for _, rule := range cpc.Rules {
if _, err := os.Stat(rule.Path); os.IsNotExist(err) {
return fmt.Errorf("Rule path %q does not exist. Rule: %+v", rule.Path, rule)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

}

for _, rule := range cpc.Rules {
if _, err := os.Stat(rule.Path); os.IsNotExist(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should return any error, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we have returned an error to represent the file not exist error. I am probably not understanding this comment quite well. :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we should return any error returned by Stat, right?

ruleTimeout := 1 * time.Second
ruleTimeoutString := ruleTimeout.String()

utMetas := []struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a map, add description for each test case https://github.com/kubernetes-incubator/cri-containerd/blob/master/pkg/server/helpers_test.go#L29.

Or else it's very hard to figure out what each case is testing in the future. And please do this for the other unit tests you added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@@ -76,3 +76,55 @@ type Status struct {
// newest node conditions in this field.
Conditions []Condition `json:"conditions"`
}

// Type is the type of the problem.
type Type string
Copy link
Member

@Random-Liu Random-Liu Nov 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer keeping these types in each monitor. :)
At top level, we only care about the status above. Rules are monitor specific.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


// Monitor monitors log and custom plugins and reports node problem condition and event according to
// the rules.
type Monitor interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep this here. I agree.

@Random-Liu
Copy link
Member

LGTM overall. Please take a look at the comments @andyxning

@andyxning andyxning force-pushed the add_custom_problem_detector_plugin branch from 2b3e108 to 28f8a8f Compare November 19, 2017 12:57
@andyxning andyxning force-pushed the add_custom_problem_detector_plugin branch 3 times, most recently from 8995b78 to 51b0a4e Compare November 19, 2017 14:26
@andyxning
Copy link
Member Author

@Random-Liu Comments addressed. PTAL.

condition.Transition = timestamp
condition.Message = result.Message
}
condition.Status = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

condition.Status = status

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the logic error. :(

@@ -76,3 +76,55 @@ type Status struct {
// newest node conditions in this field.
Conditions []Condition `json:"conditions"`
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@Random-Liu
Copy link
Member

Please address the last comment. I'll send a PR based on yours.

@Random-Liu Random-Liu mentioned this pull request Nov 21, 2017
@andyxning andyxning force-pushed the add_custom_problem_detector_plugin branch from 51b0a4e to 10dbfef Compare November 22, 2017 02:14
@andyxning
Copy link
Member Author

@Random-Liu Comment addressed. PTAL.

@Random-Liu
Copy link
Member

LGTM

@Random-Liu Random-Liu merged commit ec470b6 into kubernetes:master Nov 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants