Skip to content

[JENKINS-48149] Add Pod Retention policies #354

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 31, 2018

Conversation

adambkaplan
Copy link
Contributor

@adambkaplan adambkaplan commented Jun 28, 2018

New configuration to control how the kubernetes-plugin deletes
slave build pods.

  • Support for cloud and pod template configurations in Jenkins UI
  • Groovy support for pipeline builds

OpenShift Bug 1515940

Copy link
Contributor Author

@adambkaplan adambkaplan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation details in comments
@carlossg ptal

} catch (NoSuchMethodException ex) {
return null;
}
e.setNoReconnect(true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SlaveDisconnector tells the slave pod's JNLP agent to not attempt any further reconnects. Without this, the slave JNLP agent will fail after the Jenkins build completes, leading to a final Pod phase of Failed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again a code comment seems warranted here

if (template != null && template.getPodRetention() != PodRetention.DEFAULT) {
retentionPolicy = template.getPodRetention();
}
return retentionPolicy;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pod retention policy can be overridden by the pod template. Defaults to the cloud-level setting.


boolean deletePod = shouldDeletePod(client, cloud);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if we should delete the pod before disconnecting the JNLP agent. At this point a healthy build will have a running pod with a JNLP agent running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A code comment here seems warranted

deleteSlavePod(listener, client);
} else {
LOGGER.log(Level.WARNING, "Slave pod {0} was not deleted due to retention policy {1}.",
new Object[] { name, getPodRetention(cloud) });
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log a warning if the slave pod was not deleted, as this may mean a running pod is still active and consuming resources.

PodStatus podStatus = getSlavePodStatus(client);
boolean slaveHasErrors = false;
if (podStatus != null) {
slaveHasErrors = podStatus.getPhase().toLowerCase().matches("(failed|unknown)");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a pod has the phase Failed or Unknown, it is in an error state.

@DataBoundSetter
public void setPodRetention(PodRetention podRetention) {
this.podRetention = podRetention;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Groovy input we do not care if the PodRetention value is null.

LOGGER.log(Level.WARNING, "Failed to delete pod for agent {0}/{1}: not found",
new String[] { client.getNamespace(), podTemplate.getName() });
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not delete the pod with the finished callback - this is taken care of by KubernetesSlave._terminate()

}

@Test
public void testShouldDeletePod() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test logic for deleting pods

@@ -56,4 +75,162 @@ private void assertRegex(String name, String regex) {
assertNotNull(name);
assertTrue(String.format("Name does not match regex [%s]: '%s'", regex, name), name.matches(regex));
}

@Test
public void testGetPodRetention() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test pod retention for the slave

@@ -371,4 +371,14 @@ public void runWithActiveDeadlineSeconds() throws Exception {
r.assertLogNotContains("Hello from container!", b);
}

@Test
public void runInPodWithRetention() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integration test using the pod retention setting

@adambkaplan
Copy link
Contributor Author

@gabemontero may also want to take a look

Copy link
Contributor

@gabemontero gabemontero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of minor items/questions in line.

Also, since I disconnected once you got off the ground with this, just so I'm clear, you backed away from a new RetentionStrategy implementation because of:
a) the JNLP reconnect thing ?
b) the once vs. cloud discrepancies made things cumbersome ?
c) it was just easier to code up ?
d) some combination of the above ?
e) something else ?

I'm curious as much as anything ... what you have integrated in just fine, but am curious of @carlossg or others on the jenkinsci side think this logic needs to be encapsulated there

Nice work :-) !

README.md Outdated
@@ -130,7 +130,8 @@ Either way it provides access to the following fields:
* **annotations** Annotations to apply to the pod.
* **inheritFrom** List of one or more pod templates to inherit from *(more details below)*.
* **slaveConnectTimeout** Timeout in seconds for an agent to be online.
* **activeDeadlineSeconds** Pod is deleted after this deadline is passed.
* **podRetention** Controls the behavior of keeping slave pods. Can be 'NEVER', 'ON_FAILURE', or 'ALWAYS' - if empty will default to the plugin's "Pod Retention" setting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be clearer to say ... if empty will default to deleting the pod after activeDeadlineSeconds has passed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's true if the cloud's setting isn't changed (which, granted, will cover most cases)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep ... and mentioning the "inherit from the cloud setting" should be mentioned somehow in this doc as well I would think ... so combine the two pieces here

}
PodList templateSlaveList = client.pods().inNamespace(templateNamespace).withLabels(labelsMap).list();
List<Pod> allTemplateSlavePods = templateSlaveList.getItems().stream()
.filter(x -> x.getStatus().getPhase().toLowerCase().matches("(running|pending)"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name allTemplateSlavePod implies to me that you would NOT filter based on phase .... can you reconcile either with removing the filter, changing the var name, comment explaining what is up, etc.


boolean deletePod = shouldDeletePod(client, cloud);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A code comment here seems warranted

} catch (NoSuchMethodException ex) {
return null;
}
e.setNoReconnect(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again a code comment seems warranted here

@adambkaplan
Copy link
Contributor Author

adambkaplan commented Jun 28, 2018

@gabemontero at the end of the day I felt the delete pod logic belonged within the KubernetesSlave class, and should run as a part of the node termination process. IMO a caller of node.terminate() expects the slave to shut down and clean up its resources. Using a separate RetentionStrategy would require the delete pod logic to be called externally of node.terminate()

@carlossg carlossg requested a review from Vlatombe June 28, 2018 18:12
@gabemontero
Copy link
Contributor

gabemontero commented Jun 28, 2018 via email

Copy link
Contributor

@gabemontero gabemontero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Member

@Vlatombe Vlatombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me.

I would prefer using Describable/Descriptor though for PodRetention in order to make the system open for future extensions and keep the logic self-contained.

@@ -87,6 +87,8 @@
/** Default timeout for idle workers that don't correctly indicate exit. */
private static final int DEFAULT_RETENTION_TIMEOUT_MINUTES = 5;

static final PodRetention DEFAULT_POD_RETENTION_POLICY = PodRetention.NEVER;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VisibleForTesting

@@ -0,0 +1,15 @@
package org.csanchez.jenkins.plugins.kubernetes;

public enum PodRetention {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More Jenkins-y to use a Describable/Descriptor pattern here instead. Move implementation of shouldDeletePod in the Describable class so that you can implement additional behaviours through extensions.

}
// Ensure the slave supports setNoReconnect
try {
Engine.class.getMethod("setNoReconnect", boolean.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method has been there since 2009. I don't think this check is worth it. jenkinsci/remoting@e43e787

@adambkaplan
Copy link
Contributor Author

@Vlatombe Is there more recent documentation on selecting Describable/Descriptor instances from the web UI? I've found these:

I could also use guidance on supporting "static" instance types via the groovy DSL

@adambkaplan adambkaplan force-pushed the JENKINS-48149 branch 2 times, most recently from 80a5b1b to 70becc2 Compare July 16, 2018 15:02
@adambkaplan
Copy link
Contributor Author

@Vlatombe refactored as Describable/Descriptor - ptal

@adambkaplan
Copy link
Contributor Author

@Vlatombe have you had a chance to look at my updates?

@Vlatombe
Copy link
Member

I'm sorry, I got swamped in other work. I'll take a look now

Copy link
Member

@Vlatombe Vlatombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some code could be pulled up in PodRetention

@DataBoundConstructor
public PodTemplate() {
}

public PodTemplate(PodTemplate from) {
public PodTemplate(PodTemplate from) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken indentation


@Override
public String toString() {
return "Always";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe getDescriptor().getDisplayName() instead ? Could be defined in the parent class


private static final long serialVersionUID = -5209499689925746138L;

@DataBoundConstructor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation

@adambkaplan
Copy link
Contributor Author

Updated & squashed. I could pull up the hashCode override as well, but I feel that would break standard Java conventions.

Bad indents came from my IDE using tabs ☹️

@adambkaplan
Copy link
Contributor Author

@carlossg ok to merge?

New configuration to control how the kuberenetes-plugin deletes
slave build pods.

* Support for cloud and pod template configurations in Jenkins UI
* Groovy support for pipeline builds
* Describable implementation for future extensions

OpenShift Bug 1515940
@adambkaplan
Copy link
Contributor Author

@carlossg @Vlatombe rebased to address merge conflicts. ptal.

@carlossg carlossg changed the title [JENKINS-48149] Slave Pod Retention [JENKINS-48149] Add Pod Retention policies Jul 31, 2018
@carlossg carlossg merged commit 65ae56a into jenkinsci:master Jul 31, 2018
@vickychijwani
Copy link
Member

Does this PR also address the following point from JENKINS-48149?

It'd be really nice if this log [Reason: OOMKilled] could appear in the build log as well as/instead of the Jenkins master log.

Probably not, but just want to confirm as we have a large-ish setup (~1000 builds/day) and run into OOMKilled often, which usually manifests as a ChannelClosedException.

@carlossg
Copy link
Contributor

carlossg commented Aug 1, 2018

no, getting it into the logs is a bit harder

@adambkaplan
Copy link
Contributor Author

adambkaplan commented Aug 1, 2018

no, getting it into the logs is a bit harder

more like "almost impossible" as kubernetes OOM leads to SIGKILL being sent to the offending container process. This OOM should not be confused with a JVM OutOfMemoryError - this is an ongoing challenge for Java applications running on Kubernetes.

If the kubelet kills the container because it exceeds its resource quota, OOMKilled will show up in the Pod data as the reason for the pod/container's current status.

@adambkaplan adambkaplan deleted the JENKINS-48149 branch August 1, 2018 21:33
@dioss-Machiel
Copy link

After installing the latest version on Jenkins with this change all my pods were marked as
Pod Retention: Always which was a bit unexpected.

@carlossg
Copy link
Contributor

carlossg commented Aug 3, 2018

see #362

}

@Extension
@Symbol("default")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants