# New Relic Alert Policies: Configuration Template

**As of:** 2026-04-16
**Status:** Not yet configured. The NRQL queries are prepared and still have to be entered in the NR UI.

## Prerequisites

- New Relic account (EU)
- Notification channel (Slack / email) configured in NR
- Agent running on all 5 servers (`cluster = 'mtls-proxy'`)

## Alert Policy: mTLS Cluster

In New Relic: **Alerts & AI → Alert Policies → Create a policy** → Name: `mTLS Cluster`

### 1. Server Down (Infrastructure)

```sql
SELECT count(*) FROM SystemSample
WHERE cluster = 'mtls-proxy'
FACET hostname
```

- **Condition Type:** NRQL
- **Threshold:** Signal lost for 3 minutes
- **Severity:** Critical
- **Description:** A cluster server has stopped sending metrics
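
The signal-loss rule above can be illustrated with a small Python sketch. It is not how New Relic implements the check; it only shows the intended semantics: a host counts as down once its newest sample is older than the expiration time. The hostnames `proxy-1` and `proxy-2` and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def hosts_with_lost_signal(last_seen, now, expiration=timedelta(minutes=3)):
    """Return the hosts whose newest sample is older than `expiration`."""
    return sorted(host for host, ts in last_seen.items() if now - ts > expiration)


# Hypothetical data: proxy-1 reported 30 s ago, proxy-2 has been silent for 4 min.
now = datetime(2026, 4, 16, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "proxy-1": now - timedelta(seconds=30),
    "proxy-2": now - timedelta(minutes=4),
}
print(hosts_with_lost_signal(last_seen, now))  # ['proxy-2']
```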

### 2. Apache 5xx Error Rate

```sql
SELECT count(*) FROM Log
WHERE logtype = 'apache' AND cluster = 'mtls-proxy'
AND response LIKE '5%'
FACET hostname
SINCE 5 minutes ago
```

- **Threshold:** > 10 per 5-minute window
- **Severity:** Critical
- **Description:** Apache is returning HTTP 5xx errors to mTLS clients

### 3. Disk Space > 85%

```sql
SELECT latest(diskUsedPercent) FROM StorageSample
WHERE cluster = 'mtls-proxy'
FACET hostname, mountPoint
```

- **Threshold:** > 85% (Warning), > 95% (Critical)
- **Severity:** Warning / Critical
- **Description:** Disk is filling up (log rotation, CSR accumulation)
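
A condition with two thresholds maps one value to at most one severity, with Critical taking precedence. A minimal Python sketch of that mapping, assuming strict "greater than" comparisons as written in the thresholds above (the function name `severity` is illustrative):

```python
def severity(value, warning=85.0, critical=95.0):
    """Map a disk usage percentage to 'critical', 'warning', or None."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None


for used in (70.0, 85.0, 90.0, 96.5):
    print(used, severity(used))
```

Because the comparison is strict, a value of exactly 85% does not yet raise a warning.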

### 4. CPU > 90% (sustained)

```sql
SELECT average(cpuPercent) FROM SystemSample
WHERE cluster = 'mtls-proxy'
FACET hostname
```

- **Threshold:** > 90% for 5 minutes
- **Severity:** Warning
- **Description:** Unusual CPU load (DDoS, runaway process)
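
"Sustained" here means every data point in the last 5 minutes is above the threshold, so a single spike does not open an incident. The sketch below illustrates that reading in Python, assuming one aggregated value per minute; the sample values are invented.

```python
def sustained_breach(samples, threshold=90.0, duration=5):
    """True if the last `duration` one-minute samples all exceed `threshold`."""
    if len(samples) < duration:
        return False  # not enough data to cover the full window
    return all(value > threshold for value in samples[-duration:])


spike = [40.0, 97.0, 42.0, 41.0, 39.0, 43.0]       # one short peak
overload = [40.0, 93.0, 95.0, 97.0, 96.0, 94.0]    # five minutes above 90%
print(sustained_breach(spike), sustained_breach(overload))  # False True
```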

### 5. RAM > 90%

```sql
SELECT average(memoryUsedPercent) FROM SystemSample
WHERE cluster = 'mtls-proxy'
FACET hostname
```

- **Threshold:** > 90% for 5 minutes
- **Severity:** Warning
- **Description:** Memory pressure

### 6. DNS-Maintenance Events (Drain/Restore)

```sql
SELECT count(*) FROM Log
WHERE logtype = 'syslog' AND cluster = 'mtls-proxy'
AND message LIKE '%dns-maintenance%CRITICAL%'
FACET hostname
SINCE 5 minutes ago
```

- **Threshold:** > 0
- **Severity:** Critical
- **Description:** dns-maintenance reported a critical error (drain timeout, restore failure)
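
The pattern `'%dns-maintenance%CRITICAL%'` only matches messages in which `dns-maintenance` appears before `CRITICAL`. The Python sketch below emulates the `%` wildcard with a regular expression to make that ordering visible. It assumes case-sensitive matching and handles only `%`; the sample messages are invented, so check them against real syslog lines before relying on the condition.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern that uses only the % wildcard into a regex."""
    parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile("^" + ".*".join(parts) + "$", re.DOTALL)


rx = like_to_regex("%dns-maintenance%CRITICAL%")

# Hypothetical log messages
print(bool(rx.match("dns-maintenance[812]: CRITICAL drain timeout")))   # True
print(bool(rx.match("CRITICAL: dns-maintenance restore failed")))       # False
print(bool(rx.match("dns-maintenance[812]: critical drain timeout")))   # False
```

If the tool can also log the severity before its own name, a second `LIKE` clause or a different pattern would be needed.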

### 7. SSH Brute-Force

```sql
SELECT count(*) FROM Log
WHERE logtype = 'linux_auth' AND cluster = 'mtls-proxy'
AND message LIKE '%Failed password%'
FACET hostname
SINCE 10 minutes ago
```

- **Threshold:** > 20 per 10 minutes
- **Severity:** Warning
- **Description:** SSH brute-force attempt (Wazuh also reacts; this alert is a second safety net)

### 8. Certificate Expiry < 14 Days

```sql
SELECT count(*) FROM Log
WHERE logtype = 'syslog'
AND message LIKE '%certbot%'
AND message LIKE '%failure%'
FACET hostname
SINCE 24 hours ago
```

- **Threshold:** > 0 matches
- **Severity:** Critical
- **Description:** Certbot renewal failed, so the certificate is at risk of expiring. The query detects failed renewals; it does not measure the remaining validity.

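Alternative (external check, for example from a Synthetic monitor or an external host):
```shell
# Exit status 0 if the certificate is still valid 14 days (1209600 s) from now; otherwise treat as Critical
openssl s_client -connect cs.prd.bahn.business:443 -servername cs.prd.bahn.business </dev/null 2>/dev/null \
  | openssl x509 -noout -checkend 1209600
```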
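
The same external expiry check can be sketched in Python with the standard library. This is an illustration under assumptions, not a tested monitor: the function names are made up, and if the endpoint rejects connections without a client certificate, the handshake may fail until one is loaded with `ctx.load_cert_chain(...)`.

```python
import socket
import ssl
import time


def days_remaining(not_after, now=None):
    """Days until the certificate's notAfter date (e.g. 'Jan  1 00:00:00 2027 GMT')."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Open a TLS connection and return the server certificate's notAfter string."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host, min_days=14):
    """Return 'ok' if more than `min_days` of validity remain, else 'critical'."""
    return "ok" if days_remaining(fetch_not_after(host)) > min_days else "critical"


# Offline example with fixed dates: 10 days remain, which is below the 14-day limit.
fixed_now = ssl.cert_time_to_seconds("Dec 22 00:00:00 2026 GMT")
print(days_remaining("Jan  1 00:00:00 2027 GMT", now=fixed_now))  # 10.0
```

Calling `check("cs.prd.bahn.business")` would run the check against the live endpoint.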

### 9. New Relic Agent Itself Down

```sql
SELECT count(*) FROM SystemSample
WHERE cluster = 'mtls-proxy'
FACET hostname
```

- **Condition Type:** NRQL
- **Threshold:** Signal lost for 5 minutes
- **Severity:** Warning
- **Description:** NR agent has stopped sending data (agent crash, network)

### 10. Apache Error Log Spike

```sql
SELECT count(*) FROM Log
WHERE logtype = 'apache_error' AND cluster = 'mtls-proxy'
FACET hostname
SINCE 5 minutes ago
```

- **Threshold:** > 50 per 5 minutes (Warning), > 200 (Critical)
- **Severity:** Warning / Critical
- **Description:** Unusually high number of Apache errors

## Notification Channel

Recommended: Slack channel (the same one used for dns-maintenance alerts) + email to ops@.

In NR: **Alerts & AI → Destinations → Slack** → enter the webhook URL.

## Setup (Step by Step)

1. NR login → **Alerts & AI** → **Alert Policies** → **Create a policy**
2. Name: `mTLS Cluster`, Incident preference: `By condition and signal`
3. For each of the 10 conditions above: **Add a condition** → NRQL → paste the query (omit any `SINCE` clause; the evaluation window is configured in the condition) → set the threshold
4. **Notification Channels** → assign Slack + email
5. Test: check `NR → Alerts → Open incidents`
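
Instead of clicking through the UI ten times, the conditions could also be created through New Relic's NerdGraph GraphQL API. The Python sketch below only builds the request body for one condition (CPU > 90%) and does not send anything. The mutation, type, and field names are written from memory and should be verified in the NerdGraph API explorer before use; the account ID `1234567` and policy ID `"987654"` are placeholders.

```python
import json

# Assumed mutation shape; verify names in the NerdGraph explorer.
MUTATION = """
mutation($accountId: Int!, $policyId: ID!, $condition: AlertsNrqlConditionStaticInput!) {
  alertsNrqlConditionStaticCreate(accountId: $accountId, policyId: $policyId, condition: $condition) {
    id
    name
  }
}
"""


def build_request(account_id, policy_id, name, query, threshold, duration_s, priority):
    """Build the JSON body for creating one static NRQL alert condition."""
    condition = {
        "name": name,
        "enabled": True,
        "nrql": {"query": query},
        "terms": [{
            "operator": "ABOVE",
            "threshold": threshold,
            "thresholdDuration": duration_s,   # seconds the breach must last
            "thresholdOccurrences": "ALL",     # every data point in the window
            "priority": priority,              # "WARNING" or "CRITICAL"
        }],
    }
    variables = {"accountId": account_id, "policyId": policy_id, "condition": condition}
    return {"query": MUTATION, "variables": variables}


body = build_request(
    account_id=1234567,            # placeholder
    policy_id="987654",            # placeholder
    name="CPU > 90% (sustained)",
    query="SELECT average(cpuPercent) FROM SystemSample "
          "WHERE cluster = 'mtls-proxy' FACET hostname",
    threshold=90,
    duration_s=300,
    priority="WARNING",
)
print(json.dumps(body["variables"], indent=2))
```

For an EU account the body would be sent as a POST request to the EU NerdGraph endpoint with a user API key; that step is left out here on purpose.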
