add custom problem detector plugin #145
606d921 to b6e8d02
885e141 to 7b24581
ping @Random-Liu @dchen1107 :)
Ping @Random-Liu @dchen1107
63c0f89 to 1c8fd67
@andyxning I'll review this today.
@Random-Liu #144 is also ready to review. :)
Finished half. Please address comments or reply.
I'll continue review tomorrow.
README.md
Outdated
@@ -14,7 +14,7 @@ enabled by default in the GCE cluster.
# Background
There are tons of node problems could possibly affect the pods running on the
node such as:
-* Hardware issues: Bad cpu, memory or disk;
+* Hardware issues: Bad cpu, memory or disk, ntp service down;
Is ntp service down a hardware issue?
Actually not. Just want to add ntp as an example. :)
Will move to another type.
Add `Infrastructure daemon issues` type.
README.md
Outdated
@@ -44,32 +44,38 @@ A problem daemon could be:
* An existing node health monitoring daemon integrated with node-problem-detector.

Currently, a problem daemon is running as a goroutine in the node-problem-detector
binary. In the future, we'll separate node-problem-detector and problem daemons into
Why remove this one? We are still not sure how this should work. :p
I am opinionated about this. :(
You're right. We have not yet concluded which way of running NPD with different problem daemons is best.
README.md
Outdated
List of supported problem daemons:

| Problem Daemon | NodeCondition | Description |
|----------------|:---------------:|:------------|
| [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) | KernelDeadlock | A system log monitor monitors kernel log and reports problem according to predefined rules. |
| [AbrtAdaptor](https://github.com/kubernetes/node-problem-detector/blob/master/config/abrt-adaptor.json) | None | Monitor ABRT log messages and report them further. ABRT (Automatic Bug Report Tool) is health monitoring daemon able to catch kernel problems as well as application crashes of various kinds occurred on the host. For more information visit the [link](https://github.com/abrt). |
| [CustomPluginMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json) | On-demand(According to users configuration) | A custom plugin monitor for node-problem-detector to invoke and check various node problems with user defined check scripts. [Proposal is here](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#). |
s/Proposal is here/see proposal here
Will do.
README.md
Outdated
* `--custom-plugin-monitors`: List of paths to custom plugin monitor config files, comma separated, e.g.
[config/custom-plugin-monitor.json](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json).
Node problem detector will start a separate custom plugin monitor for each configuration. You can
use different custom plugin monitors to monitor different node problems.
indent
Will do.
http://APISERVER_IP:APISERVER_PORT?inClusterConfig=false | ||
``` | ||
Refer [heapster docs](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes) for a complete list of available options. | ||
``` |
Why indent?
This is mainly to align the code block with the preceding content in the same section.
// Validate configurations
err = l.config.Validate()
if err != nil {
	glog.Fatalf("Failed to validate custom plugin config %+v. %v", l.config, err)
s/./:
Will do.
for {
	select {
	case result := <-resultChan:
		glog.V(3).Infof("Receive new plugin result. %+v", result)
s/./:
Will do.
// NewCustomPluginMonitorOrDie create a new customPluginMonitor, panic if error occurs.
func NewCustomPluginMonitorOrDie(configPath string) types.Monitor {
	l := &customPluginMonitor{
nit: do you want to change `l` to `c` or something else?
Will do. It is really difficult to recognize. :)
	}
} else {
	// For permanent error changes the condition
	for i := range l.conditions {
Can we also generate an event `XXChanged` here? (Also do that for log monitor).
Then we don't need to add 2 entries for the same script.
IMHO, I prefer that we separate events and conditions. This will make npd more flexible in case people need only conditions and not events.
Actually this is an issue I have wanted to fix for a long time. We should always generate an event for a condition switch, I think. We also do that in kubelet https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go#L430.
Usually a condition change is not obvious enough to the user.
// For permanent error changes the condition
for i := range l.conditions {
	condition := &l.conditions[i]
	if condition.Type == result.Rule.Condition {
For custom plugin, I feel like the condition may change back, right?
For log it's hard to identify, but for plugin, we could purely change condition status based on plugin return value.
Yep. With the output of npd plugins, we should update the condition status accordingly. Will do.
"type": "temporary", | ||
"reason": "NTPIsDown", | ||
"path": "./config/plugin/check_ntp.sh", | ||
"timeout": "3s" |
This is the per plugin timeout config. :)
f532d5c to 2b3e108
"plugin": "custom", | ||
"pluginConfig": { | ||
"invoke_interval": "30s", | ||
"timeout": "5s", |
I'm fine with this for now, but please add TODO in code. We should have per-rule interval.
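A minimal sketch of how the global and per-rule timeouts in this config might be resolved, assuming a rule without its own `timeout` falls back to the global value. The type and function names (`globalConfig`, `rule`, `monitorConfig`, `effectiveTimeout`), the `rules` key, and the second rule are hypothetical and are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// globalConfig mirrors the "pluginConfig" object shown in the snippet above.
type globalConfig struct {
	InvokeInterval string `json:"invoke_interval"`
	Timeout        string `json:"timeout"`
}

// rule holds only the fields needed for this sketch; Timeout is optional.
type rule struct {
	Reason  string `json:"reason"`
	Path    string `json:"path"`
	Timeout string `json:"timeout,omitempty"`
}

// monitorConfig is a hypothetical top-level config; the "rules" key is assumed.
type monitorConfig struct {
	Plugin       string       `json:"plugin"`
	PluginConfig globalConfig `json:"pluginConfig"`
	Rules        []rule       `json:"rules"`
}

// effectiveTimeout returns the rule's own timeout if set, else the global one.
func effectiveTimeout(global globalConfig, r rule) (time.Duration, error) {
	raw := r.Timeout
	if raw == "" {
		raw = global.Timeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("error in parsing timeout %q: %v", raw, err)
	}
	return d, nil
}

func main() {
	data := []byte(`{
	  "plugin": "custom",
	  "pluginConfig": {"invoke_interval": "30s", "timeout": "5s"},
	  "rules": [
	    {"reason": "NTPIsDown", "path": "./config/plugin/check_ntp.sh", "timeout": "3s"},
	    {"reason": "HypotheticalRule", "path": "./hypothetical.sh"}
	  ]
	}`)
	var c monitorConfig
	if err := json.Unmarshal(data, &c); err != nil {
		fmt.Println("error in parsing config:", err)
		return
	}
	for _, r := range c.Rules {
		d, err := effectiveTimeout(c.PluginConfig, r)
		fmt.Println(r.Reason, d, err)
	}
}
```

A per-rule interval, as requested in the TODO above, could follow the same fallback pattern.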
"path": "./config/plugin/check_ntp.sh", | ||
"timeout": "3s" | ||
}, | ||
{ |
I think we should generate event for condition change. With that, you shouldn't need 2 rules here.
}
condition.Status = true
condition.Reason = result.Rule.Reason
break
why break?
Refactored. PTAL.
// For permanent error changes the condition
for i := range c.conditions {
	condition := &c.conditions[i]
	if condition.Type == result.Rule.Condition {
Combine the logic:
status = (result.ExitStatus >= cpmtypes.NonOK)
if condition.Status != status || condition.Reason != result.Rule.Reason {
condition.Transition = timestamp
condition.Message = result.Message
}
condition.Status = status
condition.Reason = result.Rule.Reason
And please generate event. :P
Once we have a conclusion on emitting events when a condition changes, I would prefer to add this as a TODO and address it in the next PR. :)
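The combined logic suggested in this thread can be sketched as a small standalone function that also reports whether anything changed, which is the point where an event could be generated. The types here are simplified, hypothetical stand-ins for the PR's types, and the `NTPProblem` condition name is an assumption for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a simplified stand-in for the monitor's condition type.
type condition struct {
	Type       string
	Status     bool
	Reason     string
	Message    string
	Transition time.Time
}

// result is a simplified stand-in for a plugin result.
type result struct {
	Condition string
	Reason    string
	Message   string
	NonOK     bool // true when the plugin exit status reports a problem
}

// updateCondition applies the plugin result to a matching condition and
// reports whether the status or the reason changed.
func updateCondition(c *condition, r result) bool {
	if c.Type != r.Condition {
		return false
	}
	status := r.NonOK
	changed := c.Status != status || c.Reason != r.Reason
	if changed {
		// Only a real transition updates the timestamp and message.
		c.Transition = time.Now()
		c.Message = r.Message
	}
	c.Status = status
	c.Reason = r.Reason
	return changed
}

func main() {
	c := condition{Type: "NTPProblem"}
	r := result{Condition: "NTPProblem", Reason: "NTPIsDown", Message: "ntp service is down", NonOK: true}
	if updateCondition(&c, r) {
		// A monitor could emit an event here so the change is visible to users.
		fmt.Printf("condition %s changed: status=%v reason=%s\n", c.Type, c.Status, c.Reason)
	}
}
```

Because the status is derived from the plugin result on every run, the condition can also switch back once the plugin reports OK again.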
"path": "./config/plugin/check_ntp.sh", | ||
"timeout": "3s" | ||
}, | ||
{ |
Sometimes, for the same condition, status is not changed, but reason is changed. Without event, people will not even notice that.
}

p.Wait()
glog.Info("End to run custom plugins")
Finish running custom plugins.
Will do.
stdout, err := cmd.Output()
if err != nil {
	if _, ok := err.(*exec.ExitError); !ok {
		glog.Errorf("Error in running plugin %q. %v. %v", rule.Path, err, string(stdout))
glog.Errorf("Error in running plugin %q: error - %v. output - %q", rule.Path, err, string(stdout))
Will do.
// trim suffix useless bytes
output = string(stdout)
output = strings.TrimSpace(output)
`TrimSpace` should already have trimmed the newline, I think.
Thanks for pointing this out. I had not realized this before. :)
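A short sketch of the point made here: `strings.TrimSpace` removes all leading and trailing Unicode whitespace, newlines included, so a following `strings.TrimRight(output, "\n")` has nothing left to trim. The helper name `cleanOutput` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanOutput trims leading and trailing whitespace, including newlines.
func cleanOutput(raw string) string {
	return strings.TrimSpace(raw)
}

func main() {
	raw := "  ntp service is down\n\n"
	cleaned := cleanOutput(raw)
	fmt.Printf("%q\n", cleaned)
	// The extra TrimRight leaves the already trimmed string unchanged.
	fmt.Println(strings.TrimRight(cleaned, "\n") == cleaned)
}
```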
output = strings.TrimSpace(output)
output = strings.TrimRight(output, "\n")

if cmd.ProcessState.Sys().(syscall.WaitStatus).Signaled() {
If the script is stopped because of timeout, I hope we could clarify that in the message. Currently, it seems that we'll only get `signal: killed`.
Enhanced log.
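One way to make a timeout explicit in the message is sketched below. It uses `exec.CommandContext` with a deadline rather than the signal check shown in the diff, so it is an alternative technique and not the PR's implementation. The function names `describeFailure` and `runPlugin` are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// describeFailure builds an error message that names a timeout explicitly
// instead of only surfacing "signal: killed".
func describeFailure(path string, timeout time.Duration, ctxErr, runErr error) string {
	if errors.Is(ctxErr, context.DeadlineExceeded) {
		return fmt.Sprintf("plugin %q timed out after %v and was killed: %v", path, timeout, runErr)
	}
	return fmt.Sprintf("plugin %q failed: %v", path, runErr)
}

// runPlugin runs the script with a deadline. exec.CommandContext kills the
// process once the context expires.
func runPlugin(path string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	out, err := exec.CommandContext(ctx, path).Output()
	if err != nil {
		return string(out), errors.New(describeFailure(path, timeout, ctx.Err(), err))
	}
	return string(out), nil
}

func main() {
	out, err := runPlugin("./config/plugin/check_ntp.sh", 3*time.Second)
	fmt.Printf("output=%q err=%v\n", out, err)
}
```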
|
||
import ( | ||
"testing" | ||
|
nit: remove empty line.
Will do.
@andyxning I have a meeting soon. After that I'll come back and finish review.
import ( | ||
"fmt" | ||
"time" | ||
|
nit: remove empty line.
Will do.
type CustomPluginConfig struct {
	// Plugin is the name of plugin which is currently used.
	// Currently supported: custom.
	Plugin string `json:"plugin, omitempty"`
Do we need this?
I also thought about this when preparing this PR. To be consistent with the existing log configs, I chose to retain this `Plugin` field; the only valid value is `custom`.
	// PluginConfig is global plugin configuration.
	PluginGlobalConfig pluginGlobalConfig `json:"pluginConfig, omitempty"`
	// Source is the source name of the custom plugin monitor
	Source string `json:"source"`
Probably, just have `Plugin` name, and use `Plugin` name as the source?
This is mainly used to let users define the user specified event source. This should be retained, IMO.
timeout, err := time.ParseDuration(*cpc.PluginGlobalConfig.TimeoutString)
if err != nil {
	return fmt.Errorf("Error in parsing global timeout %q. %v", *cpc.PluginGlobalConfig.TimeoutString, err)
s/Error/error
s/./:
Will do.
invoke_interval, err := time.ParseDuration(*cpc.PluginGlobalConfig.InvokeIntervalString)
if err != nil {
	return fmt.Errorf("Error in parsing invoke interval %q. %v", *cpc.PluginGlobalConfig.InvokeIntervalString, err)
ditto.
Will do.
for _, rule := range cpc.Rules {
	if _, err := os.Stat(rule.Path); os.IsNotExist(err) {
		return fmt.Errorf("Rule path %q does not exist. Rule: %+v", rule.Path, rule)
ditto.
}

for _, rule := range cpc.Rules {
	if _, err := os.Stat(rule.Path); os.IsNotExist(err) {
We should return any error, right?
Actually we have returned an error to represent the file not exist error. I am probably not understanding this comment quite well. :(
I mean we should return any error returned by `Stat`, right?
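A minimal sketch of what the reviewer is asking for: return whatever error `os.Stat` reports (for example a permission error), not only the "does not exist" case. The function name `validateRulePaths` is hypothetical, and the message style follows the `s/Error/error` and `s/./:` suggestions made earlier in this review.

```go
package main

import (
	"fmt"
	"os"
)

// validateRulePaths returns the first error os.Stat reports for any path,
// covering missing files as well as other failures such as permission errors.
func validateRulePaths(paths []string) error {
	for _, p := range paths {
		if _, err := os.Stat(p); err != nil {
			return fmt.Errorf("failed to stat rule path %q: %v", p, err)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateRulePaths([]string{".", "/no/such/plugin.sh"}))
}
```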
ruleTimeout := 1 * time.Second
ruleTimeoutString := ruleTimeout.String()

utMetas := []struct {
Use a map, add description for each test case https://github.com/kubernetes-incubator/cri-containerd/blob/master/pkg/server/helpers_test.go#L29.
Or else it's very hard to figure out what each case is testing in the future. And please do this for the other unit tests you added.
Will do.
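The map-based layout the reviewer suggests keys each case by a description, so a failure names the scenario directly. The sketch below shows the shape outside the `testing` package so that it runs standalone; in a real test the loop body would call `t.Errorf` with the description. `validateTimeout` and `failedCases` are hypothetical names, not functions from the PR.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// validateTimeout is a hypothetical function under test.
func validateTimeout(s string) error {
	d, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	if d <= 0 {
		return fmt.Errorf("timeout %q must be positive", s)
	}
	return nil
}

// failedCases runs every described case and returns the descriptions of
// those whose outcome did not match the expectation.
func failedCases() []string {
	cases := map[string]struct {
		input     string
		expectErr bool
	}{
		"valid timeout should pass":        {input: "3s", expectErr: false},
		"malformed timeout should fail":    {input: "3 seconds", expectErr: true},
		"non-positive timeout should fail": {input: "0s", expectErr: true},
	}
	var failed []string
	for desc, c := range cases {
		err := validateTimeout(c.input)
		if (err != nil) != c.expectErr {
			failed = append(failed, fmt.Sprintf("%s: got error %v", desc, err))
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(failed)
	return failed
}

func main() {
	fmt.Println("failed cases:", failedCases())
}
```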
@@ -76,3 +76,55 @@ type Status struct {
	// newest node conditions in this field.
	Conditions []Condition `json:"conditions"`
}

// Type is the type of the problem.
type Type string
I prefer keeping these types in each monitor. :)
At top level, we only care about the status above. Rules are monitor specific.
Will do.
pkg/types/types.go
Outdated
// Monitor monitors log and custom plugins and reports node problem condition and event according to
// the rules.
type Monitor interface {
We should keep this here. I agree.
LGTM overall. Please take a look at the comments @andyxning
2b3e108 to 28f8a8f
8995b78 to 51b0a4e
@Random-Liu Comments addressed. PTAL.
	condition.Transition = timestamp
	condition.Message = result.Message
}
condition.Status = true
condition.Status = status
Sorry for the logic error. :(
@@ -76,3 +76,55 @@ type Status struct {
	// newest node conditions in this field.
	Conditions []Condition `json:"conditions"`
}
OK.
Please address the last comment. I'll send a PR based on yours.
51b0a4e to 10dbfef
@Random-Liu Comment addressed. PTAL.
LGTM
This PR adds a custom plugin problem detector interface to node-problem-detector.
Proposal: https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#
Major changes:
* Change `ApplyDefaultConfiguration` for `MonitorConfig` from function to method. Ref
* Change `ValidateRules` for `MonitorConfig` from function to method. Ref
* Move `k8s.io/node-problem-detector/pkg/systemlogmonitor/util` to `k8s.io/node-problem-detector/pkg/util/tomb`. Ref
* Move `Rule` type for system log config to pkg `k8s.io/node-problem-detector/pkg/types`. Ref
* Move `LogMonitor` interface to `k8s.io/node-problem-detector/pkg/types` and rename it to `Monitor`. Ref