-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[JENKINS-48149] Add Pod Retention policies #354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation details in comments
@carlossg ptal
} catch (NoSuchMethodException ex) { | ||
return null; | ||
} | ||
e.setNoReconnect(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SlaveDisconnector
tells the slave pod's JNLP agent to not attempt any further reconnects. Without this, the slave JNLP agent will fail after the Jenkins build completes, leading to a final Pod phase of Failed
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again a code comment seems warranted here
if (template != null && template.getPodRetention() != PodRetention.DEFAULT) { | ||
retentionPolicy = template.getPodRetention(); | ||
} | ||
return retentionPolicy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pod retention policy can be overridden by the pod template. Defaults to the cloud-level setting.
|
||
boolean deletePod = shouldDeletePod(client, cloud); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check if we should delete the pod before disconnecting the JNLP agent. At this point a healthy build will have a running pod with a JNLP agent running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A code comment here seems warranted
deleteSlavePod(listener, client); | ||
} else { | ||
LOGGER.log(Level.WARNING, "Slave pod {0} was not deleted due to retention policy {1}.", | ||
new Object[] { name, getPodRetention(cloud) }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log a warning if the slave pod was not deleted, as this may mean a running pod is still active and consuming resources.
PodStatus podStatus = getSlavePodStatus(client); | ||
boolean slaveHasErrors = false; | ||
if (podStatus != null) { | ||
slaveHasErrors = podStatus.getPhase().toLowerCase().matches("(failed|unknown)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a pod has the phase Failed
or Unknown
, it is in an error state.
@DataBoundSetter | ||
public void setPodRetention(PodRetention podRetention) { | ||
this.podRetention = podRetention; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Groovy input we do not care if the PodRetention value is null.
LOGGER.log(Level.WARNING, "Failed to delete pod for agent {0}/{1}: not found", | ||
new String[] { client.getNamespace(), podTemplate.getName() }); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not delete the pod with the finished
callback - this is taken care of by KubernetesSlave._terminate()
} | ||
|
||
@Test | ||
public void testShouldDeletePod() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test logic for deleting pods
@@ -56,4 +75,162 @@ private void assertRegex(String name, String regex) { | |||
assertNotNull(name); | |||
assertTrue(String.format("Name does not match regex [%s]: '%s'", regex, name), name.matches(regex)); | |||
} | |||
|
|||
@Test | |||
public void testGetPodRetention() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test pod retention for the slave
@@ -371,4 +371,14 @@ public void runWithActiveDeadlineSeconds() throws Exception { | |||
r.assertLogNotContains("Hello from container!", b); | |||
} | |||
|
|||
@Test | |||
public void runInPodWithRetention() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integration test using the pod retention setting
@gabemontero may also want to take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of minor items/questions in line.
Also, since I disconnected once you got off the ground with this, just so I'm clear, you backed away from a new RetentionStrategy
implementation because of:
a) the JNLP reconnect thing ?
b) the once vs. cloud discrepancies made things cumbersome ?
c) it was just easier to code up ?
d) some combination of the above ?
e) something else ?
I'm curious as much as anything ... what you have integrated in just fine, but am curious of @carlossg or others on the jenkinsci side think this logic needs to be encapsulated there
Nice work :-) !
README.md
Outdated
@@ -130,7 +130,8 @@ Either way it provides access to the following fields: | |||
* **annotations** Annotations to apply to the pod. | |||
* **inheritFrom** List of one or more pod templates to inherit from *(more details below)*. | |||
* **slaveConnectTimeout** Timeout in seconds for an agent to be online. | |||
* **activeDeadlineSeconds** Pod is deleted after this deadline is passed. | |||
* **podRetention** Controls the behavior of keeping slave pods. Can be 'NEVER', 'ON_FAILURE', or 'ALWAYS' - if empty will default to the plugin's "Pod Retention" setting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be clearer to say ... if empty will default to deleting the pod after
activeDeadlineSeconds has passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's true if the cloud's setting isn't changed (which, granted, will cover most cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep ... and mentioning the "inherit from the cloud setting" should be mentioned somehow in this doc as well I would think ... so combine the two pieces here
} | ||
PodList templateSlaveList = client.pods().inNamespace(templateNamespace).withLabels(labelsMap).list(); | ||
List<Pod> allTemplateSlavePods = templateSlaveList.getItems().stream() | ||
.filter(x -> x.getStatus().getPhase().toLowerCase().matches("(running|pending)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name allTemplateSlavePod
implies to me that you would NOT filter based on phase .... can you reconcile either with removing the filter, changing the var name, comment explaining what is up, etc.
|
||
boolean deletePod = shouldDeletePod(client, cloud); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A code comment here seems warranted
} catch (NoSuchMethodException ex) { | ||
return null; | ||
} | ||
e.setNoReconnect(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again a code comment seems warranted here
@gabemontero at the end of the day I felt the delete pod logic belonged within the |
On Thu, Jun 28, 2018 at 2:08 PM, Adam Kaplan ***@***.***> wrote:
@gabemontero <https://github.com/gabemontero> at the end of the day I
felt the delete pod logic belonged within the KubernetesSlave class, and
should run as a part of the node termination process. IMO a caller of
node.terminate() expects the slave to shut down and clean up its
resources.
Yeah sounds fair. Admittedly as this has evolved and I've iterated over
these classes, the separation of concerns between `RetentionStrategy` and
`hudson.slaves.AbstractCloudSlave` had me wondering somewhat.
Most likely I was being overly pessimistic in anticipating possible
contention.
"sgtm"
… —
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#354 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ADbadEegLuirzwHTS9TRaeeP2A_KayIsks5uBRuPgaJpZM4U7t49>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good to me.
I would prefer using Describable/Descriptor though for PodRetention in order to make the system open for future extensions and keep the logic self-contained.
@@ -87,6 +87,8 @@ | |||
/** Default timeout for idle workers that don't correctly indicate exit. */ | |||
private static final int DEFAULT_RETENTION_TIMEOUT_MINUTES = 5; | |||
|
|||
static final PodRetention DEFAULT_POD_RETENTION_POLICY = PodRetention.NEVER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VisibleForTesting
@@ -0,0 +1,15 @@ | |||
package org.csanchez.jenkins.plugins.kubernetes; | |||
|
|||
public enum PodRetention { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More Jenkins-y to use a Describable/Descriptor
pattern here instead. Move implementation of shouldDeletePod
in the Describable class so that you can implement additional behaviours through extensions.
} | ||
// Ensure the slave supports setNoReconnect | ||
try { | ||
Engine.class.getMethod("setNoReconnect", boolean.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method has been there since 2009. I don't think this check is worth it. jenkinsci/remoting@e43e787
@Vlatombe Is there more recent documentation on selecting
I could also use guidance on supporting "static" instance types via the groovy DSL |
80a5b1b
to
70becc2
Compare
@Vlatombe refactored as Describable/Descriptor - ptal |
@Vlatombe have you had a chance to look at my updates? |
I'm sorry, I got swamped in other work. I'll take a look now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Some code could be pulled up in PodRetention
@DataBoundConstructor | ||
public PodTemplate() { | ||
} | ||
|
||
public PodTemplate(PodTemplate from) { | ||
public PodTemplate(PodTemplate from) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broken indentation
|
||
@Override | ||
public String toString() { | ||
return "Always"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe getDescriptor().getDisplayName()
instead ? Could be defined in the parent class
|
||
private static final long serialVersionUID = -5209499689925746138L; | ||
|
||
@DataBoundConstructor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation
3c5adb1
to
6b50978
Compare
Updated & squashed. I could pull up the Bad indents came from my IDE using tabs |
@carlossg ok to merge? |
New configuration to control how the kuberenetes-plugin deletes slave build pods. * Support for cloud and pod template configurations in Jenkins UI * Groovy support for pipeline builds * Describable implementation for future extensions OpenShift Bug 1515940
6b50978
to
0cee31c
Compare
Does this PR also address the following point from JENKINS-48149?
Probably not, but just want to confirm as we have a large-ish setup (~1000 builds/day) and run into OOMKilled often, which usually manifests as a |
no, getting it into the logs is a bit harder |
more like "almost impossible" as kubernetes OOM leads to SIGKILL being sent to the offending container process. This OOM should not be confused with a JVM If the kubelet kills the container because it exceeds its resource quota, OOMKilled will show up in the Pod data as the reason for the pod/container's current status. |
After installing the latest version on Jenkins with this change all my pods were marked as |
see #362 |
} | ||
|
||
@Extension | ||
@Symbol("default") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a valid symbol as per https://github.com/jenkinsci/structs-plugin/blob/a7765136112ec72dc2d686d3a958fd010e47e07a/annotation/src/main/java/org/jenkinsci/Symbol.java#L32 https://docs.groovy-lang.org/latest/html/documentation/#_keywords though it does not much matter since as the default value you would not explicitly select it in a script anyway.
New configuration to control how the kubernetes-plugin deletes
slave build pods.
OpenShift Bug 1515940