chore: Update README #363


Draft · wants to merge 3 commits into `main`
92 changes: 42 additions & 50 deletions README.md
@@ -1,78 +1,70 @@
# Log Cache Release

The in-memory caching layer for logs and metrics in [Cloud Foundry]. Log Cache is a collection of microservices packaged and distributed as a [BOSH] release.

## Getting started

Some fundamental knowledge of [BOSH], [Cloud Foundry], and [Go](https://go.dev/) is recommended in order to understand this repo and its contents.

### Prerequisites

* [BOSH]: a deployment on an IaaS that supports [Ubuntu Stemcells].

### Deployment

See the `log-cache` instance group within [cf-deployment]. It is not recommended to run this release outside of a Cloud Foundry deployment.

#### Jobs

This release contains the following jobs:

* **log-cache ([spec](jobs/log-cache/spec)) ([main.go](src/cmd/log-cache/main.go))**: an in-memory cache for [loggregator envelopes].
* **log-cache-cf-auth-proxy ([spec](jobs/log-cache-cf-auth-proxy/spec)) ([main.go](src/cmd/cf-auth-proxy/main.go))**: a reverse-proxy for log-cache-gateway that authenticates Log Cache client HTTPS requests with [UAA](https://github.com/cloudfoundry/uaa) or [Cloud Controller].
* **log-cache-gateway ([spec](jobs/log-cache-gateway/spec)) ([main.go](src/cmd/gateway/main.go))**: a reverse-proxy for log-cache that forwards Log Cache client HTTP requests via gRPC.
* **log-cache-syslog-server ([spec](jobs/log-cache-syslog-server/spec)) ([main.go](src/cmd/syslog-server/main.go))**: a [syslog](https://en.wikipedia.org/wiki/Syslog) server that converts received messages to [loggregator envelopes] and forwards them to log-cache via gRPC.
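The conversion performed by log-cache-syslog-server can be pictured with a small sketch. This is not the release's actual parser (which handles full RFC 5424 framing and structured data); the `envelope` struct and the `fromSyslog` helper are simplified stand-ins introduced only for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// envelope is a stand-in for a loggregator v2 envelope; the real type
// lives in the go-loggregator library.
type envelope struct {
	SourceID string
	Message  string
}

// fromSyslog extracts a source ID and message from a simplified
// RFC 5424-style line: "<pri>version timestamp host app-name procid msgid sd msg".
// In Cloud Foundry, the app GUID travels in the app-name field.
func fromSyslog(line string) (envelope, error) {
	parts := strings.SplitN(line, " ", 8)
	if len(parts) < 8 {
		return envelope{}, fmt.Errorf("malformed syslog line: %q", line)
	}
	return envelope{SourceID: parts[3], Message: parts[7]}, nil
}

func main() {
	e, err := fromSyslog("<14>1 2024-01-01T00:00:00Z cell-1 app-guid - - - hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(e.SourceID, e.Message) // app-guid hello
}
```

The resulting envelopes are then forwarded to log-cache over gRPC.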

### Learn more

* [Logging and metrics in Cloud Foundry](https://docs.cloudfoundry.org/loggregator/data-sources.html)
* [Advanced documentation](docs)
* [#logging-and-metrics](https://cloudfoundry.slack.com/archives/CUW93AF3M) in Cloud Foundry Slack

## FAQ

### Accessing Log Cache directly

To access logs and metrics in Log Cache directly, install the [Log Cache cf CLI plugin](https://github.com/cloudfoundry/log-cache-cli#installing) or query Log Cache's [API](src/README.md). Only authenticated clients can communicate with Log Cache, and the logs and metrics that can be retrieved are determined by the level of access your authentication allows.

> [!NOTE]
> Accessing Log Cache directly is the only way to retrieve system component metrics from Log Cache.

### System component logs

System component logs are not currently stored in Log Cache. This is in order to prioritize application logs and metrics, as well as system component metrics.

### Reliability during deployments

Log Cache is an in-memory database, and as such drops its entire cache when it is restarted. Because of that, clients should plan for Log Cache to not be available 100% of the time. For example, [Cloud Controller] depends on Log Cache, but can function without it, and informs its users that Log Cache is unavailable when necessary.

### Availability during AZ failures

Log Cache is built to be horizontally scalable by hashing source IDs (e.g. application GUID, unique string, etc.) of logs and metrics, and assigning them to Log Cache nodes. However, assignment occurs during BOSH deployments, so logs and metrics assigned to a Log Cache node that becomes unavailable due to an AZ failure will also become unavailable.

In the case of an AZ failure, this can be rectified by redeploying with a configuration that does not attempt to place a Log Cache node in the failing AZ.

### Envelope retention

By default, Log Cache prunes envelopes based on how much memory remains available to it and on how many envelopes are currently retained per source.

The `max_per_source` property in the log-cache job determines the maximum number of envelopes that Log Cache will store for any given source ID. It defaults to 100,000. Operators can tweak this property as needed, but should note that adjusting it up without a corresponding increase in the amount of memory available to Log Cache might imbalance the envelope retention in favor of noisier apps.

If envelope retention is important to you, and you find that envelopes are being pruned before reaching your desired `max_per_source`, then you should increase the amount of memory available to Log Cache by:
* Increasing the number of Log Cache nodes.
* Increasing the amount of memory available to each Log Cache node.
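The per-source cap described above can be sketched as a simple bounded buffer. This is an illustrative model, not the release's actual storage engine (which also prunes on memory pressure); the `sourceCache` type is a hypothetical stand-in that only demonstrates the oldest-first retention trade-off:

```go
package main

import "fmt"

// sourceCache enforces a max_per_source-style cap for a single
// source ID by dropping the oldest envelopes first.
type sourceCache struct {
	maxPerSource int
	envelopes    []string // stand-ins for loggregator envelopes
}

// Put appends an envelope and prunes the oldest entries once the
// cap is exceeded.
func (c *sourceCache) Put(env string) {
	c.envelopes = append(c.envelopes, env)
	if len(c.envelopes) > c.maxPerSource {
		c.envelopes = c.envelopes[len(c.envelopes)-c.maxPerSource:]
	}
}

func main() {
	c := &sourceCache{maxPerSource: 3}
	for i := 1; i <= 5; i++ {
		c.Put(fmt.Sprintf("envelope-%d", i))
	}
	fmt.Println(c.envelopes) // [envelope-3 envelope-4 envelope-5]
}
```

A noisy source fills its cap quickly, which is why raising `max_per_source` without also raising available memory can starve quieter sources.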


[BOSH]: https://bosh.io/docs/
[cf-deployment]: https://github.com/cloudfoundry/cf-deployment
[Cloud Controller]: https://github.com/cloudfoundry/cloud_controller_ng
[Cloud Foundry]: https://www.cloudfoundry.org/
[loggregator envelopes]: https://github.com/cloudfoundry/loggregator-api#v2-envelope
[Ubuntu Stemcells]: https://bosh.io/stemcells
1 change: 1 addition & 0 deletions docs/api.md
@@ -0,0 +1 @@
# API
98 changes: 98 additions & 0 deletions docs/architecture.md
@@ -0,0 +1,98 @@
# Architecture

```mermaid
flowchart RL
origin([Syslog Agents on various VMs])
requests([Log Cache clients])

subgraph Log Cache VM
log-cache[(log-cache)]
log-cache-syslog-server
log-cache-cf-auth-proxy
log-cache-gateway
end

origin-. syslog .->log-cache-syslog-server
log-cache-syslog-server-- loggregator envelopes -->log-cache

requests-..->gorouter{GoRouter}
gorouter-..->log-cache-cf-auth-proxy
log-cache-cf-auth-proxy-. Authenticate .->uaa{UAA}
log-cache-cf-auth-proxy-->log-cache-gateway
log-cache-gateway-- gRPC -->log-cache
```

## Source IDs

Log Cache indexes data by the `source_id` field on the [Loggregator Envelope]. Source IDs should be unique across Cloud Foundry applications and system components.

> [!IMPORTANT]
> Application instances should all share the same source ID.

Typically, the source ID of a Cloud Foundry application is its GUID, and the source ID of a Cloud Foundry system component is its name (e.g. `doppler`).

## Envelope sharding

Log Cache is built to be horizontally scalable by sharding source IDs across nodes. The source IDs are hashed to determine which Log Cache node will host which envelopes. Envelopes can be sent to any Log Cache node and they will be forwarded on to the correct node. Similarly, Log Cache clients can reach out to any Log Cache node for logs or metrics, and will be forwarded to the appropriate node for the source ID they are requesting.
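The hash-based assignment can be sketched as follows. This is a minimal illustration, assuming FNV-1a modulo node count; the release's actual hashing scheme may differ, and the app GUID is a made-up placeholder:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodeFor hashes a source ID onto one of n Log Cache nodes.
// Illustrative only: the real release uses its own scheme.
func nodeFor(sourceID string, nodes int) int {
	h := fnv.New64a()
	h.Write([]byte(sourceID))
	return int(h.Sum64() % uint64(nodes))
}

func main() {
	appGUID := "0f2f24ba-39e8-4e40-9a3c-d16f5bc69a9f" // hypothetical app GUID
	// The same source ID always maps to the same node, so any node
	// can forward writes and reads deterministically.
	fmt.Println(nodeFor(appGUID, 3))
	fmt.Println(nodeFor("doppler", 3))
}
```

Because the mapping is deterministic, every node agrees on where a given source ID lives without coordinating at request time; the mapping only changes when the node count changes, which is why reassignment happens during BOSH deployments.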

## API

Log Cache's API contains three different sets of endpoints:
* `read` and `meta` are source ID-oriented.
* `query` and `query_range` are metric-oriented, and fulfill the Prometheus API.
* `info` is for service version discoverability.

### Authentication

Each request must be via HTTPS and have the `Authorization` header set with a UAA-provided token.

The token must have one of the following:
* `doppler.firehose` scope.
* `logs.admin` scope.
* A relevant Cloud Controller role.

Having either of those scopes is sufficient to return data for any source ID.
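An authenticated read request can be sketched like this. The `/api/v1/read/{source-id}` path follows the Log Cache API documentation; the host, token, and `buildReadRequest` helper are placeholders introduced for illustration:

```go
package main

import (
	"fmt"
	"net/http"
)

// buildReadRequest constructs an HTTPS read request for a source ID,
// carrying a UAA-provided token in the Authorization header.
func buildReadRequest(host, sourceID, uaaToken string) (*http.Request, error) {
	url := fmt.Sprintf("https://%s/api/v1/read/%s", host, sourceID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", uaaToken)
	return req, nil
}

func main() {
	// Placeholder host and token; a real client would obtain the
	// token from UAA first.
	req, err := buildReadRequest("log-cache.example.com", "doppler", "bearer TOKEN")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String())
	fmt.Println(req.Header.Get("Authorization"))
}
```

A request whose token lacks the required scope or role is rejected by log-cache-cf-auth-proxy before it reaches the gateway.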

> [!IMPORTANT]
> Either the `doppler.firehose` scope or the `logs.admin` scope is required in order to retrieve system component metrics from Log Cache.

> [!WARNING]
> These authentication requirements can be bypassed completely by querying the log-cache job directly via gRPC with the correct certificates for mTLS authentication.

## Clients

Log Cache clients should use the [Go Client Library for Log Cache](https://github.com/cloudfoundry/go-log-cache) to communicate with Log Cache. Here are some useful links within that library:
* [Protobuf definitions](https://github.com/cloudfoundry/go-log-cache/tree/main/api/v1)
* [Go generated code](https://github.com/cloudfoundry/go-log-cache/tree/main/rpc/logcache_v1)
* [Examples](https://github.com/cloudfoundry/go-log-cache/tree/main/examples)




---



#### Vertical Scaling

Operators should be mindful of the amount of memory they provide to Log Cache, as the memory available to each node determines how many envelopes it can retain.

Log Cache only needs sufficient CPU to accept incoming logs and metrics.

#### Log Cache Syslog Server TLS and mutual TLS configuration

Operators running a security-hardened Cloud Foundry may want to activate TLS, or even mutual TLS (mTLS), for incoming connections to the Log Cache Syslog Server. Both are optional and are activated by the presence of the relevant certificates in the BPM configuration: a syslog certificate and key for TLS, and additionally a syslog client CA certificate for mTLS. Check the BOSH [BPM template](jobs/log-cache-syslog-server/templates/bpm.yml.erb) and the [spec](jobs/log-cache-syslog-server/spec) for details.

### Log Cache API
Documentation about the internals of Log Cache and its API can be found in [src/README.md](https://github.com/cloudfoundry/log-cache-release/blob/main/src/README.md).

Log Cache is meant to be deployed as a collection of microservices on nodes within [Cloud Foundry].

Within [cf-deployment], Log Cache interacts with a number of other components by default:
* Log Cache receives data from Syslog Agents, which are added on to every VM.
* Cloud Controller queries Log Cache for app instance metrics such as CPU usage and memory when retrieving details for applications.
* The cf CLI queries Log Cache directly to retrieve recent logs for applications.

[Cloud Foundry]: https://www.cloudfoundry.org/
[gRPC]: https://grpc.io/
[Loggregator Envelope]: https://github.com/cloudfoundry/loggregator-api/blob/master/v2/envelope.proto
25 changes: 25 additions & 0 deletions docs/scaling.md
@@ -0,0 +1,25 @@
# Scaling

A factsheet for operators to consider when scaling Log Cache.

---

## Variables to tweak

Numerous variables affect the retention of Log Cache:

* **Number of nodes**: Increasing the number of Log Cache nodes adds more storage space, allows higher throughput and reduces contention between sources.
* **Max envelopes per source ID**: Increasing this allows more envelopes to be stored per source ID, but may decrease the storage of less noisy apps on the same node.
* **Memory per instance**: Increasing this allows more storage in general, but any given instance may not be able to take advantage of the increase due to the max-per-source-ID limit.
* **Memory limit**: Increasing the memory limit allows for more storage, but may cause out-of-memory errors and crashes if set too high for the total throughput of the system.
* **CPU per instance**: Increasing the CPU budget per instance should allow higher throughput.

Log Cache is known to exceed memory limits under high throughput or stress. If you see your log-cache instances using more memory than you have allotted, consider scaling up, either with more CPU per instance or with more instances.

You can monitor the performance of Log Cache per source ID (app or platform component) using the Log Cache CLI. The `cf log-meta` command shows the number of logs and metrics, and the period of time they cover, for each source on the system. Use this in conjunction with scaling to target your use cases: for simple pushes, a low retention period may be adequate, while running analysis on metrics for debugging and scaling may call for higher retention. Remember, though, that all logs and metrics are lost whenever log-cache crashes or is redeployed.