Commit 66fd1fa

Added splunk, formated for better visualization, steps (openshift#61579)
Co-authored-by: Hector Vido <>
1 parent 96cb333 commit 66fd1fa

1 file changed

+68
-21
lines changed

docs/dptp-triage-sop/misc.md

+68-21
@@ -1,4 +1,5 @@
1-
## Probe Failing on ci-rpms
1+
Probe Failing on ci-rpms
2+
========================
23

34
```
45
[FIRING:1] ProbeFailing blackbox (https://artifacts-rpms-openshift-origin-ci-rpms.apps.ci.l2s4.p1.openshiftapps.com/openshift-origin-v3.11/repodata/repomd.xml critical)
@@ -9,9 +10,12 @@ The TP team does not own these services.
910

1011
Resolution before [DPTP-2981](https://issues.redhat.com/browse/DPTP-2981) is completed:
1112

12-
> oc --context app.ci delete --all pods --namespace=ci-rpms
13+
```bash
14+
oc --context app.ci delete --all pods --namespace=ci-rpms
15+
```
1316

14-
## Probe Failing on deck-internal
17+
Probe Failing on deck-internal
18+
==============================
1519

1620
```
1721
[FIRING:1] deck-internalDown (critical)
@@ -20,22 +24,61 @@ The service deck-internal has been down for 5 minutes.
2024

2125
Resolution before [DPTP-2712](https://issues.redhat.com/browse/DPTP-2712) is completed:
2226

23-
> oc --context app.ci delete pod -n ci -l app=prow,component=deck-internal
27+
```bash
28+
oc --context app.ci delete pod -n ci -l app=prow,component=deck-internal
29+
```
2430

25-
## Access internal job logs
26-
For jobs available in [`deck-internal`](https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/), the logs are stored in GCP project `openshift-ci-private`, bucket `origin-ci-private`.
31+
Access internal job logs
32+
========================
2733

28-
The logs can be deleted in case of leak of secrets or other sensitive information.
34+
For jobs available in [deck-internal](https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/), the logs are stored in GCP project `openshift-ci-private`, bucket `origin-ci-private`.
2935

36+
The logs can be deleted in case of leak of secrets or other sensitive information.
3037

31-
## quay-io-image-mirroring-failures
38+
quay-io-image-mirroring-failures
39+
================================
3240

3341
The alert is fired if there are many failures of `oc image mirror` in `ci-images-mirror`.
42+
43+
Choose one of the methods below (pod logs, CloudWatch, or Splunk) to find the failing command, then reproduce it locally:
44+
45+
```bash
46+
# get the credentials
47+
$ oc -n ci extract secret/registry-push-credentials-ci-images-mirror --to=- --keys .dockerconfigjson | jq . > /tmp/qci.json
48+
# the source and the target are taken from the log
49+
$ oc image mirror --keep-manifest-list --registry-config=/tmp/qci.json --continue-on-error registry.ci.openshift.org/origin/scos-4.16:cluster-capi-operator=quay.io/openshift/ci:origin_scos-4.16_cluster-capi-operator
50+
```
51+
52+
If the command reproduces the same error, the cause is most likely a broken source image. In that case, we should either:
53+
- Fix the source image, e.g. by rebuilding it. For the image from the **Pod logs** example below:
54+
  - Inside the `release` repo, search for the ci-operator config that promotes the image: `grep -r 'to: cluster-capi-operator'`
55+
  - Note the directory tree of the returned files: `ci-operator/config/openshift/cluster-capi-operator/`
56+
  - Find the equivalent `ProwJob` file, e.g. `ci-operator/jobs/openshift/cluster-capi-operator/openshift-cluster-capi-operator-release-4.16-postsubmits.yaml`
57+
  - Pick the right `ProwJob` from the file, e.g. `branch-ci-openshift-cluster-capi-operator-release-4.16-okd-scos-images`
58+
  - Run from inside the `release` repository: `make job JOB='branch-ci-openshift-cluster-capi-operator-release-4.16-okd-scos-images' BASE_REF=release-4.16`
59+
- Otherwise, ignore the mirroring failure: see [RFE-5363](https://issues.redhat.com/browse/) for an example.
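
The lookup in the first rebuild step can be sketched as a small shell helper. This is only an illustration: `find_promoting_configs` is a hypothetical name, and it assumes a local checkout of the `release` repo with the `ci-operator/config/` layout shown in the steps above.

```bash
#!/usr/bin/env bash
# Sketch only: find_promoting_configs is a hypothetical helper, not part of the release repo.
# $1: path to a local checkout of the release repo (assumed layout: ci-operator/config/...)
# $2: image tag taken from the failing mirror source, e.g. cluster-capi-operator
find_promoting_configs() {
  local release_dir="$1" tag="$2"
  # -r: recurse, -l: print only matching file names, -E: extended regex.
  # The tag is anchored at end of line so that longer tag names do not match.
  grep -rlE "to: ${tag}[[:space:]]*\$" "${release_dir}/ci-operator/config"
}

# Example (not run here; the checkout path is an assumption):
#   find_promoting_configs ~/src/release cluster-capi-operator
```

The returned paths point at the config directory; the matching `ProwJob` file then has to be picked by hand under `ci-operator/jobs/`, as described in the steps.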
60+
61+
62+
Pod logs
63+
--------
64+
3465
The pod has the logs to show the details:
3566

36-
> oc logs -n ci -l app=ci-images-mirror -c ci-images-mirror | grep "Running command failed." | grep "image mirror"
67+
```bash
68+
oc logs -n ci -l app=ci-images-mirror -c ci-images-mirror | grep -E 'Running command failed|manifest unknown'
69+
```
70+
71+
Example:
72+
73+
```
74+
{"args":"image mirror --keep-manifest-list --registry-config=/etc/push/.dockerconfigjson --continue-on-error --max-per-registry=20 registry.ci.openshift.org/origin/scos-4.16:cluster-capi-operator=quay.io/openshift/ci:origin_scos-4.16_cluster-capi-operator registry.ci.openshift.org/origin/scos-4.13:vertical-pod-autoscaler-operator=quay.io/openshift/ci:origin_scos-4.13_vertical-pod-autoscaler-operator","client":"/usr/bin/oc","component":"ci-images-mirror","error":"exit status 1","file":"/go/src/github.com/openshift/ci-tools/pkg/controller/quay_io_ci_images_distributor/oc_quay_io_image_helper.go:49","func":"github.com/openshift/ci-tools/pkg/controller/quay_io_ci_images_distributor.(*ocExecutor).Run","level":"debug","msg":"Running command failed.","output":"quay.io/
75+
error: unable to retrieve source image registry.ci.openshift.org/origin/scos-4.16 manifest #1 from manifest list: manifest unknown: manifest unknown
76+
```
77+
78+
The log above reports `unable to retrieve source image registry.ci.openshift.org/origin/scos-4.16 manifest #1 from manifest list: manifest unknown: manifest unknown`. Matching the repository named in that message against the command arguments shows that the affected image is `registry.ci.openshift.org/origin/scos-4.16:cluster-capi-operator`.
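
Matching the error against the arguments by eye gets tedious when the command mirrors many images. A minimal sketch of the same matching with `grep` and `awk` follows; `failing_sources` and `pairs_for_source` are hypothetical helper names, and the patterns assume the log format shown in the example above.

```bash
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and read log lines from stdin.

# Print each repository named in an "unable to retrieve source image" error.
failing_sources() {
  grep -oE 'unable to retrieve source image [^ ]+' | awk '{print $NF}' | sort -u
}

# Print the SRC=DST mirror arguments whose source lives in repository $1.
pairs_for_source() {
  local repo="$1"
  # First grep: every token of the form something=something (flags included).
  # Second grep: keep only the tokens that contain "<repo>:" as a fixed string.
  grep -oE '[^ "]+=[^ "]+' | grep -F -- "${repo}:" | sort -u
}

# Example (not run here):
#   oc logs -n ci -l app=ci-images-mirror -c ci-images-mirror > /tmp/mirror.log
#   failing_sources < /tmp/mirror.log
#   pairs_for_source registry.ci.openshift.org/origin/scos-4.16 < /tmp/mirror.log
```

If several tags from the same repository are mirrored in one command, `pairs_for_source` lists all of them as candidates, and each may need to be retried individually.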
3779

38-
Or on CloudWatch:
80+
CloudWatch
81+
----------
3982

4083
```txt
4184
fields @timestamp,structured.component as component,structured.msg as msg,structured.args as args, @message, @logStream, @log
@@ -44,7 +87,7 @@ fields @timestamp,structured.component as component,structured.msg as msg,struct
4487
| limit 20
4588
```
4689

47-
Example,
90+
Example:
4891

4992
```json
5093
{
@@ -62,17 +105,21 @@ Example,
62105
}
63106
```
64107

65-
The above log line indicates "quay.io/openshift/ci:ci_cert-manager-cainjector_v1.9.1\nerror: unable to push manifest to quay.io/openshift/ci:ci_fedora_latest: manifest invalid: manifest" is the problem.
108+
The log above indicates that `quay.io/openshift/ci:ci_cert-manager-cainjector_v1.9.1\nerror: unable to push manifest to quay.io/openshift/ci:ci_fedora_latest: manifest invalid: manifest` is the problem.
66109

67-
Then we can run the cmd with oc-cli in our laptop:
110+
Splunk
111+
------
68112

69-
```bash
70-
### get the credentials
71-
$ oc -n ci extract secret/registry-push-credentials-ci-images-mirror --to=- --keys .dockerconfigjson | jq > /tmp/qci.c
72-
### the source and the target are taken from the log
73-
$ oc image mirror --keep-manifest-list --registry-config=/tmp/qci.c --continue-on-error=true --max-per-registry=20 registry.fedoraproject.org/fedora:latest=quay.io/openshift/ci:ci_fedora_latest
113+
```txt
114+
index="rh_dptp-001" openshift.cluster_id="248ca8f0-5af8-4a45-a153-d2d9125390dd" kubernetes.namespace_name="ci" kubernetes.labels.app="ci-images-mirror" ("Running command failed")
74115
```
75116

76-
If it reproduces the same error, mostly, it is caused by a broken source image. In that case, we should
77-
- Fix the source image, e.g., by rebuilding the image.
78-
- Ignore the mirroring otherwise: See [RFE-5363](https://issues.redhat.com/browse/) for example.
117+
Example:
118+
119+
```json
120+
{
121+
"message" : {"args":"image mirror --keep-manifest-list --registry-config=/etc/push/.dockerconfigjson --continue-on-error=true --max-per-registry=20 registry.ci.openshift.org/ocp/builder:rhel-9-base-nodejs-openshift-4.19.art-arm64=quay.io/openshift/ci:ocp_builder_rhel-9-base-nodejs-openshift-4.19.art-arm64 registry.ci.openshift.org/origin/scos-4.16:cluster-capi-operator=quay.io/openshift/ci:origin_scos-4.16_cluster-capi-operator registry.ci.openshift.org/origin/scos-4.13:vertical-pod-autoscaler-operator=quay.io/openshift/ci:origin_scos-4.13_vertical-pod-autoscaler-operator","msg":"Running command failed.","output":"...\nerror: unable to retrieve source image registry.ci.openshift.org/origin/scos-4.16 manifest #1 from manifest list: manifest unknown: manifest unknown\n\ninfo: Mirroring completed in 4.54s (0B/s)\nerror: one or more errors occurred\n","severity":"debug","time":"2025-02-12T13:52:27Z"}
122+
}
123+
```
124+
125+
The log above reports `unable to retrieve source image registry.ci.openshift.org/origin/scos-4.16 manifest #1 from manifest list: manifest unknown: manifest unknown`. Matching the repository named in that message against the command arguments shows that the affected image is `registry.ci.openshift.org/origin/scos-4.16:cluster-capi-operator`.
