| num | title | status | authors | tags |
|---|---|---|---|---|
| 3 | TrustyAI Service Deployment using Operator pattern | Draft | | |
# TrustyAI Service Deployment using Operator pattern
The TrustyAI Service can be deployed manually as a standalone container or via ODH-manifest as part of an ODH KfDef. Both approaches have limitations: a plain Deployment is error-prone for users (some parameters are mandatory) and the ODH-manifest contains some hacks. A Kubernetes operator would provide a simple and consistent way to deploy and manage the TrustyAI service.
In addition to this, the deployment and the storage (a PVC, for now) must be created in a user-owned namespace to give users full control and prevent security issues. An operator can enforce this.
- Automate the deployment, management, and maintenance of the TrustyAI service.
- Reduce manual errors and increase consistency in deployments.
- Simplify updating the TrustyAI service.
- Implementing mechanisms that perform actions unrelated to the lifecycle of the TrustyAI service (create, upgrade, monitor, etc.) is out of scope.
- In the initial stage, distribution via OperatorHub is not a goal. This may be considered in the future.
Currently, TrustyAI service deployments are done manually or through scripts that do not fully take advantage of Kubernetes. This can be inefficient and lead to errors introduced by manual steps.
As an example, the TrustyAI service needs to update ModelMesh's configuration to add the TrustyAI service as a new endpoint. This is currently done via a deployment-time script that patches the ModelMesh configuration. This could be done automatically by the Operator.
Although TrustyAI's deployment needs to configure a considerable number of resources (e.g. `Deployment`, `Service`, `ConfigMap`, `Route`, `ServiceMonitor`), the actual configuration options available for a custom TrustyAI deployment are limited. This means that a custom TrustyAI Custom Resource Definition (CRD) would be quite simple, and the Operator would be able to handle the creation and management of the required resources.
We propose a stand-alone TrustyAI Kubernetes Operator which would create and manage the required `Deployment`, `Service`, `ConfigMap`, `Route`, and `ServiceMonitor` resources based on a simple Custom Resource, while keeping the actual state consistent with the desired one.[^1]
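To make the reconciliation idea concrete, the following is a minimal sketch of the controller-runtime loop such an operator could run. It is illustrative only (the reconciler and its structure are hypothetical, and the CR is fetched as an unstructured object for brevity), not the actual implementation:

```go
package controllers

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// TrustyAIServiceReconciler keeps the cluster state in sync with the
// TrustyAIService custom resource.
type TrustyAIServiceReconciler struct {
	client.Client
}

func (r *TrustyAIServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the custom resource that triggered this reconciliation.
	cr := &unstructured.Unstructured{}
	cr.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "trustyai.opendatahub.io", Version: "v1alpha1", Kind: "TrustyAIService",
	})
	if err := r.Get(ctx, req.NamespacedName, cr); err != nil {
		if apierrors.IsNotFound(err) {
			// CR deleted: owned resources are garbage-collected via owner references.
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	// From here the operator would create or update each owned resource
	// (PVC, Deployment, Service, Route, ServiceMonitor), patch the ModelMesh
	// configuration, and finally update the CR's status conditions.
	return ctrl.Result{}, nil
}
```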
An example of a custom resource is:
```yaml
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: TrustyAIService
metadata:
  name: trustyai-service-example
spec:
  storage:
    format: "PVC"
    folder: "/inputs"
    pv: "mypv"
    size: "1Gi"
  data:
    filename: "data.csv"
    format: "CSV"
  trustyaiMetrics:
    schedule: "15s"
status:
  phase: …
  replicas: …
  conditions:
    - type: Ready
      …
    - type: ModelMeshReady
      status: "True"
      lastTransitionTime: …
      reason: ModelMeshHealthy
      message: ModelMesh is running and healthy.
    - type: StorageReady
      status: "True"
      lastTransitionTime: …
      reason: StorageHealthy
      message: Storage system is functioning correctly.
  lastUpdateTime: …
```
In this example:
- `replicas` is an optional field that specifies the number of replicas of the TrustyAI service to run. If not provided, the default is one replica.
- `storage` is a mandatory field that specifies the storage details. It has four nested fields:
  - `format` - the storage format (example: a Persistent Volume Claim (PVC)).
  - `folder` - the folder path where data is stored.
  - `pv` - the name of an existing Persistent Volume (PV) to use.
  - `size` - the size of the PV to use (example: 1Gi).
- `data` is a mandatory field that specifies the data details. It has two nested fields:
  - `filename` - the suffix of the file that the service uses for data.
  - `format` - the format of the data file (example: a CSV file).
- `trustyaiMetrics` is a mandatory field that specifies the metrics details. It has one nested field:
  - `schedule` - the schedule for metrics calculation (example: every 15 seconds, as above).
- `status` - as part of the reconciliation process, the operator will add additional conditions, apart from the standard ones, to the custom resource to indicate the status of the deployment. These conditions will be:
  - `ModelMeshReady`, which indicates that the ModelMesh Serving component is running.
  - `StorageReady`, which indicates that the storage component is running.
The `storage`, `data` and `trustyaiMetrics` keys are currently the only mandatory configuration fields for the TrustyAI service. Future configuration keys can be added to the custom resource as needed.
The proposed `apiVersion` and `kind` are `trustyai.opendatahub.io/v1alpha1` and `TrustyAIService`, respectively.
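As a sketch of what the corresponding Go API types might look like (field names are inferred from the example CR above; the actual generated types may differ):

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// StorageSpec configures where the TrustyAI service persists inference data.
type StorageSpec struct {
	Format string `json:"format"` // storage format, e.g. "PVC"
	Folder string `json:"folder"` // folder path where data is stored
	PV     string `json:"pv"`     // name of an existing PersistentVolume
	Size   string `json:"size"`   // requested size, e.g. "1Gi"
}

// DataSpec configures the data files consumed by the service.
type DataSpec struct {
	Filename string `json:"filename"` // suffix of the data file
	Format   string `json:"format"`   // data format, e.g. "CSV"
}

// MetricsSpec configures metric calculation.
type MetricsSpec struct {
	Schedule string `json:"schedule"` // calculation interval, e.g. "15s"
}

// TrustyAIServiceSpec is the desired state of a TrustyAI deployment.
type TrustyAIServiceSpec struct {
	Replicas        *int32      `json:"replicas,omitempty"` // optional, defaults to 1
	Storage         StorageSpec `json:"storage"`
	Data            DataSpec    `json:"data"`
	TrustyAIMetrics MetricsSpec `json:"trustyaiMetrics"`
}

// TrustyAIServiceStatus holds observed state, including the extra conditions.
type TrustyAIServiceStatus struct {
	Phase      string             `json:"phase,omitempty"`
	Replicas   int32              `json:"replicas,omitempty"`
	Conditions []metav1.Condition `json:"conditions,omitempty"` // includes ModelMeshReady, StorageReady
}

// TrustyAIService is the schema for the trustyaiservices API.
type TrustyAIService struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   TrustyAIServiceSpec   `json:"spec,omitempty"`
	Status TrustyAIServiceStatus `json:"status,omitempty"`
}
```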
The operator also ensures the correct configuration of the ModelMesh Serving component. Once the TrustyAI Service is deployed and reachable, the operator will patch the ModelMesh Serving configuration to include a custom payload processor that points to the consumer endpoint of the deployed TrustyAI Service.
The processor configuration is embedded in a Kubernetes `ConfigMap` and follows the format:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    payloadProcessors: http://trustyai-service.$NAMESPACE.svc.cluster.local/consumer/kserve/v2
```
In this configuration, `$NAMESPACE` is replaced by the Operator with the namespace where the TrustyAI Service and ModelMesh Serving are deployed, ensuring that ModelMesh sends payloads correctly to the TrustyAI Service.
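A hedged sketch of how the operator might perform this substitution with controller-runtime follows; the `patchModelMeshConfig` helper is hypothetical, and the real operator may merge into the existing `config.yaml` content rather than overwrite it:

```go
package controllers

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchModelMeshConfig points ModelMesh's payload processors at the TrustyAI
// consumer endpoint in the given namespace.
func patchModelMeshConfig(ctx context.Context, c client.Client, namespace string) error {
	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, types.NamespacedName{
		Name: "model-serving-config", Namespace: namespace,
	}, cm); err != nil {
		return err
	}

	// Substitute the namespace into the consumer endpoint URL.
	endpoint := fmt.Sprintf(
		"http://trustyai-service.%s.svc.cluster.local/consumer/kserve/v2", namespace)

	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	// NOTE: simplified; the real config.yaml may contain other keys that must
	// be preserved rather than overwritten.
	cm.Data["config.yaml"] = fmt.Sprintf("payloadProcessors: %s\n", endpoint)

	return c.Update(ctx, cm)
}
```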
The TrustyAI Operator also creates a `ServiceMonitor` object which defines the services to be monitored by Prometheus. The `ServiceMonitor` will have the following configuration:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: trustyai-metrics
  labels:
    modelmesh-service: modelmesh-serving
spec:
  endpoints:
    - interval: 4s
      path: /q/metrics
      honorLabels: true
      honorTimestamps: true
      scrapeTimeout: 3s
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      targetPort: 8080
      scheme: http
      params:
        'match[]':
          - '{__name__= "trustyai_spd"}'
          - '{__name__= "trustyai_dir"}'
      metricRelabelings:
        - action: keep
          regex: trustyai_.*
          sourceLabels:
            - __name__
  selector:
    matchLabels:
      app.kubernetes.io/name: trustyai-service
```
The `ServiceMonitor` object targets the TrustyAI Service and specifies how Prometheus should scrape metrics from the service: the path to the metrics endpoint (`/q/metrics`), the interval at which to scrape (every 4 seconds), and the metrics to keep (those with names starting with `trustyai_`).
The selector would also be updated to match the labels of the TrustyAI Service from the Custom Resource.
The scrape interval and metrics names could potentially also be configurable via the custom resource (with the current values as defaults).
A possibility for ServiceMonitor customization is the inclusion of custom values using a nested `ref:` field under `serviceMonitoring`. This would allow users to specify a custom configuration for the `ServiceMonitor` object. For example:
```yaml
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: TrustyAIService
metadata:
  name: trustyai-service-example
spec:
  ...
  serviceMonitoring:
    ref:
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      ...
      spec:
        endpoints:
          - interval: 15s
```
If such configuration is not provided, the operator will use the default configuration.
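As a sketch of how that default could be materialized without depending on the Prometheus Operator Go API, the operator might build the `ServiceMonitor` as an unstructured object (helper names are illustrative, not the committed implementation):

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// defaultServiceMonitor builds the ServiceMonitor described above. The
// interval, scrape timeout, and trustyai_ metric filter mirror the defaults
// in this proposal and could later be overridden from the custom resource.
func defaultServiceMonitor(namespace string) *unstructured.Unstructured {
	sm := &unstructured.Unstructured{
		Object: map[string]interface{}{
			"metadata": map[string]interface{}{
				"name":      "trustyai-metrics",
				"namespace": namespace,
				"labels":    map[string]interface{}{"modelmesh-service": "modelmesh-serving"},
			},
			"spec": map[string]interface{}{
				"endpoints": []interface{}{
					map[string]interface{}{
						"interval":      "4s",
						"path":          "/q/metrics",
						"scheme":        "http",
						"scrapeTimeout": "3s",
						"targetPort":    int64(8080),
						"metricRelabelings": []interface{}{
							map[string]interface{}{
								"action":       "keep",
								"regex":        "trustyai_.*",
								"sourceLabels": []interface{}{"__name__"},
							},
						},
					},
				},
				"selector": map[string]interface{}{
					"matchLabels": map[string]interface{}{
						"app.kubernetes.io/name": "trustyai-service",
					},
				},
			},
		},
	}
	sm.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "monitoring.coreos.com", Version: "v1", Kind: "ServiceMonitor",
	})
	return sm
}

// createServiceMonitor applies the default when no custom ref is supplied.
func createServiceMonitor(ctx context.Context, c client.Client, namespace string) error {
	return c.Create(ctx, defaultServiceMonitor(namespace))
}
```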
If deployed on OpenShift, the Operator will also create a `Route` object to expose the TrustyAI Service to external clients. The `Route` object will have the following configuration:
```yaml
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: trustyai
  labels:
    app: trustyai
    app.kubernetes.io/name: trustyai-service
    app.kubernetes.io/part-of: trustyai
    app.kubernetes.io/version: 0.1.0
spec:
  to:
    kind: Service
    name: trustyai-service
  port:
    targetPort: http
  tls: null
```
Note that TrustyAI does not currently implement HTTPS endpoints, so the `tls` field is set to `null` for now. Once HTTPS is implemented, the `tls` field will be updated to include the TLS configuration.
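A sketch of how the Route could be created with the typed OpenShift API (assuming the `github.com/openshift/api/route/v1` module); the `ensureRoute` helper is illustrative, and a real operator would likely first detect Route support via API discovery:

```go
package controllers

import (
	"context"

	routev1 "github.com/openshift/api/route/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureRoute creates the external Route for the TrustyAI Service. TLS is
// deliberately left unset (null) until the service exposes HTTPS endpoints.
func ensureRoute(ctx context.Context, c client.Client, namespace string) error {
	route := &routev1.Route{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "trustyai",
			Namespace: namespace,
			Labels: map[string]string{
				"app":                    "trustyai",
				"app.kubernetes.io/name": "trustyai-service",
			},
		},
		Spec: routev1.RouteSpec{
			To: routev1.RouteTargetReference{
				Kind: "Service",
				Name: "trustyai-service",
			},
			Port: &routev1.RoutePort{
				TargetPort: intstr.FromString("http"),
			},
			// Spec.TLS omitted until HTTPS is implemented.
		},
	}
	return c.Create(ctx, route)
}
```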
The TrustyAI service requires storage to store inference data. Upon CR deployment, the operator will create a `PersistentVolumeClaim` object to request storage for the TrustyAI Service. The `PersistentVolumeClaim` object will have the following configuration:
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: trustyai-service-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  volumeMode: Filesystem
  storageClassName: ""
```
The operator will bind this claim to the TrustyAI Service deployment and the supplied PV. The PVC will be created in the same namespace as the TrustyAI Service.
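A sketch of how the operator could create this claim and pre-bind it to the PV named in the custom resource, using the standard `spec.volumeName` mechanism (the `ensurePVC` helper is illustrative):

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensurePVC requests storage for the TrustyAI Service and binds the claim to
// the user-supplied PersistentVolume.
func ensurePVC(ctx context.Context, c client.Client, namespace, pvName, size string) error {
	storageClass := "" // empty class so the claim binds to the named PV directly
	fsMode := corev1.PersistentVolumeFilesystem
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "trustyai-service-pvc",
			Namespace: namespace, // same namespace as the TrustyAI Service
		},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse(size), // e.g. "1Gi"
				},
			},
			VolumeName:       pvName, // bind to the existing PV from the CR
			VolumeMode:       &fsMode,
			StorageClassName: &storageClass,
		},
	}
	return c.Create(ctx, pvc)
}
```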
The testing and CI of the TrustyAI Operator will be performed using the following approaches:
- Unit tests for the Operator code, to ensure that the Operator's functionality is correct.
- Integration tests using envtest to ensure that the Operator is correctly deployed and configured (see the sketch after this list). These tests will, for instance, ensure that:
- The state is correctly updated when the Custom Resource is updated.
- Routes and ServiceMonitors are correctly created.
- ModelMesh Payload Processors are correctly configured.
- End-to-End (E2E) tests, by integrating with the work already being implemented for the TrustyAI E2E tests.
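For reference, a minimal envtest scaffold might look like the following sketch (paths and assertions are placeholders, not the real suite):

```go
package controllers

import (
	"path/filepath"
	"testing"

	"k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// TestOperatorReconciliation spins up a local control plane with the CRDs
// installed, so reconciliation can be exercised without a real cluster.
func TestOperatorReconciliation(t *testing.T) {
	testEnv := &envtest.Environment{
		// Load the TrustyAIService CRD generated for the operator
		// (hypothetical path).
		CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
	}
	cfg, err := testEnv.Start()
	if err != nil {
		t.Fatalf("failed to start envtest control plane: %v", err)
	}
	defer testEnv.Stop()

	k8sClient, err := client.New(cfg, client.Options{Scheme: scheme.Scheme})
	if err != nil {
		t.Fatalf("failed to create client: %v", err)
	}
	_ = k8sClient // create a TrustyAIService CR here and assert on the owned resources
}
```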
- No threats beyond those common to any operator, which include misconfiguration of the operator and security vulnerabilities in the operator code or in the resources it creates.
- Go and Kubernetes/OpenShift knowledge is required to develop the Operator.
None
If not completed, we will continue with the manual deployment and management of the TrustyAI service, which would make the service harder to scale and update.
Footnotes

[^1]: Initial implementation at https://github.com/ruivieira/trustyai-service-operator