Module 7: Monitoring and Logging

Master SLT monitoring tools, log analysis, and proactive alerting to ensure reliable replication.

1. Monitoring Dashboard (LTRC)

Real-Time Dashboard

Transaction: LTRC

┌─────────────────────────────────────────────┐
│ SLT Replication Dashboard │
├─────────────────────────────────────────────┤
│ Overall Status: ● Active (Green) │
│ MT_IDs Active: 3 │
│ Tables Replicating: 125 │
│ Throughput: 2,345 records/sec │
│ Avg Latency: 1.2 seconds │
│ Errors (24h): 5 │
│ │
│ Top 5 Active Tables: │
│ VBAP ██████████░░ 65% (1,523 rec/s) │
│ BSEG ████████░░░░ 40% (940 rec/s) │
│ MSEG ██████░░░░░░ 30% (705 rec/s) │
│ EKPO ████░░░░░░░░ 20% (470 rec/s) │
│ MARC ██░░░░░░░░░░ 10% (235 rec/s) │
└─────────────────────────────────────────────┘

2. Key Metrics to Monitor

Health Indicators

Metric           Check               Green     Yellow      Red
--------------   -----------------   -------   ---------   -------
Latency          Avg response time   < 2s      2-10s       > 10s
Throughput       Records/sec         > 1000    500-1000    < 500
Error Rate       Errors/total        < 0.1%    0.1-1%      > 1%
Log Table Size   GB used             < 5       5-10        > 10
Job Failures     Failed jobs         0         1-5         > 5
CPU Usage        SLT server          < 70%     70-85%      > 85%
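The thresholds above can be encoded in a small helper that maps a measured value to a health color. This is a minimal sketch: the threshold pairs are taken from the table, while the function name and signature are illustrative, not part of SLT.

```python
# Map a metric reading to Green/Yellow/Red using the table's thresholds.
# For "higher is better" metrics (e.g. throughput), pass higher_is_better=True.

def health_status(value, green_limit, red_limit, higher_is_better=False):
    """Return 'Green', 'Yellow', or 'Red' for a single metric reading."""
    if higher_is_better:
        if value > green_limit:
            return "Green"
        if value < red_limit:
            return "Red"
        return "Yellow"
    if value < green_limit:
        return "Green"
    if value > red_limit:
        return "Red"
    return "Yellow"

# Examples using the table's thresholds:
print(health_status(1.2, 2, 10))                             # latency 1.2s  -> Green
print(health_status(940, 1000, 500, higher_is_better=True))  # 940 rec/s     -> Yellow
print(health_status(1.5, 0.1, 1.0))                          # 1.5% errors   -> Red
```

The same helper covers all six rows of the table; only the limits and the direction flag change per metric.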

3. Log Files and Analysis

SLT Log Locations

Application Logs:

Location: /usr/sap/SLT/D00/work/

Key Files:
├── dev_w0 - Work process 0 log
├── dev_disp - Dispatcher log
├── dev_rfc* - RFC connection logs
└── syslog - System messages

Replication Logs:
├── /DMIS/LOG_* - Logging tables (database)
└── SM37 jobs - Background job logs

Log Analysis Queries

-- Find slow replication tables
SELECT
    TABLE_NAME,
    AVG(REPLICATION_TIME_MS) AS AVG_TIME,
    MAX(REPLICATION_TIME_MS) AS MAX_TIME,
    COUNT(*) AS RECORD_COUNT
FROM /DMIS/REPLICATION_STATS
WHERE TIMESTAMP >= ADD_DAYS(CURRENT_DATE, -1)
GROUP BY TABLE_NAME
HAVING AVG(REPLICATION_TIME_MS) > 1000
ORDER BY AVG_TIME DESC;

-- Identify error patterns
SELECT
    ERROR_TYPE,
    TABLE_NAME,
    COUNT(*) AS ERROR_COUNT,
    MAX(TIMESTAMP) AS LAST_OCCURRENCE
FROM /DMIS/ERROR_LOG
WHERE TIMESTAMP >= ADD_DAYS(CURRENT_DATE, -7)
GROUP BY ERROR_TYPE, TABLE_NAME
ORDER BY ERROR_COUNT DESC;

4. Performance Monitoring

System Metrics (ST06)

Transaction: ST06 (Operating System Monitor)

CPU Usage:
├── Total: 68% (healthy)
├── User: 42%
├── System: 26%
└── Idle: 32%

Memory:
├── Total: 64 GB
├── Used: 48 GB (75%)
└── Free: 16 GB

Disk I/O:
├── Read: 450 MB/s
└── Write: 280 MB/s

Database Performance (DB02)

Transaction: DB02

Tablespace Usage:
├── SLTLOG: 46% (9.2 GB / 20 GB)
├── SLTDATA: 32% (16 GB / 50 GB)
└── SLTTEMP: 15% (3 GB / 20 GB)
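A quick way to turn the DB02 sizes into usage percentages and flag tablespaces approaching capacity — a minimal sketch using the figures shown above; the 80% review threshold is an assumption, not an SLT default.

```python
# Compute usage percentages from the DB02 tablespace sizes above and
# flag anything at or past an illustrative 80% review threshold.
tablespaces = {
    "SLTLOG":  (9.2, 20),   # (used GB, total GB)
    "SLTDATA": (16, 50),
    "SLTTEMP": (3, 20),
}

def usage_pct(used_gb, total_gb):
    """Percentage of the tablespace in use, rounded to whole percent."""
    return round(100 * used_gb / total_gb)

for name, (used, total) in tablespaces.items():
    pct = usage_pct(used, total)
    flag = "  <- review" if pct >= 80 else ""
    print(f"{name}: {pct}% ({used} GB / {total} GB){flag}")
```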

Top SQL Statements:
1. SELECT FROM /DMIS/LOG_VBAP (28% DB time)
2. INSERT INTO SLTREPL.VBAP (15% DB time)
3. DELETE FROM /DMIS/LOG_BSEG (12% DB time)

5. Alerting and Notifications

Configure Email Alerts

Transaction: LTRC → Settings → Notifications

Alert Conditions:
☑ Latency > 30 seconds
☑ Error rate > 1%
☑ Job failure
☑ Logging table > 10 GB
☑ MT_ID status changed to Error

Recipients:
- slt-admin@company.com
- dba-team@company.com

Frequency: Immediate (real-time)
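The checked conditions above can be expressed as a small rule table evaluated against a metrics snapshot — a sketch for illustration; the metric key names and the dict shape are assumptions, while the thresholds mirror the notification settings.

```python
# Evaluate the configured alert conditions against a metrics snapshot.
# Thresholds match the LTRC notification settings above; the metric key
# names are illustrative, not an SLT API.

ALERT_RULES = [
    ("latency_seconds", lambda v: v > 30,       "Latency > 30 seconds"),
    ("error_rate_pct",  lambda v: v > 1.0,      "Error rate > 1%"),
    ("failed_jobs",     lambda v: v > 0,        "Job failure"),
    ("log_table_gb",    lambda v: v > 10,       "Logging table > 10 GB"),
    ("mt_id_status",    lambda v: v == "Error", "MT_ID status changed to Error"),
]

def triggered_alerts(metrics):
    """Return the messages of all alert conditions the snapshot violates."""
    return [msg for key, cond, msg in ALERT_RULES
            if key in metrics and cond(metrics[key])]

snapshot = {"latency_seconds": 45, "error_rate_pct": 0.3,
            "failed_jobs": 0, "log_table_gb": 11, "mt_id_status": "Active"}
print(triggered_alerts(snapshot))
# -> ['Latency > 30 seconds', 'Logging table > 10 GB']
```

Each triggered message can then be passed to the mail or webhook sender of your choice.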

SMS/Integration Alerts

" Custom alert integration
FUNCTION Z_SLT_SEND_ALERT.
IMPORTING
iv_mt_id TYPE string
iv_severity TYPE string
iv_message TYPE string.

" Send to monitoring system (Nagios/Splunk/etc)
CALL FUNCTION 'HTTP_POST'
EXPORTING
uri = 'https://monitoring.company.com/api/alert'
data = |{ "mtid": "{iv_mt_id}", "severity": "{iv_severity}", "message": "{iv_message}" }|.
ENDFUNCTION.

6. Trend Analysis

Historical Performance

-- Weekly throughput trends
SELECT
    TO_VARCHAR(DATE_TRUNC('DAY', TIMESTAMP), 'YYYY-MM-DD') AS DAY,
    SUM(RECORD_COUNT) AS TOTAL_RECORDS,
    AVG(LATENCY_MS) AS AVG_LATENCY
FROM /DMIS/STATISTICS
WHERE TIMESTAMP >= ADD_DAYS(CURRENT_DATE, -30)
GROUP BY DATE_TRUNC('DAY', TIMESTAMP)
ORDER BY DAY;

-- Result visualization:
Day          Records      Latency
2026-01-01   2,345,678    1.2s
2026-01-02   2,456,789    1.3s
2026-01-03   2,123,456    1.1s
...

Capacity Planning

Growth Analysis (Last 90 days):
- Data volume: +15% per month
- Tables added: +5 per month
- Throughput required: +12% per month

Projected Capacity (6 months, compounding the monthly growth rates above):
- Data volume: 150 GB → ~347 GB (+15% per month)
- Throughput: 2,000 rec/s → ~3,950 rec/s (+12% per month)
- Action: Plan hardware upgrade Q2 2026
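Under the stated monthly growth rates, the projection is simple compound growth: value after n months = current × (1 + monthly_rate)^n. A quick sketch of the arithmetic:

```python
# Compound-growth projection for SLT capacity planning:
# value_after_n_months = current * (1 + monthly_rate) ** n

def project(current, monthly_rate, months):
    """Project a value forward under a fixed monthly growth rate."""
    return current * (1 + monthly_rate) ** months

print(round(project(150, 0.15, 6)))    # data volume in GB after 6 months
print(round(project(2000, 0.12, 6)))   # throughput in rec/s after 6 months
```

Note that compounding matters: +15% per month over 6 months more than doubles the data volume, so linear extrapolation would badly understate the hardware needed.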

7. Reporting

Daily Health Report

Automated Daily Report (Email)

Subject: SLT Daily Health Report - 2026-01-21

Summary:
✅ Overall Status: Healthy
✅ Replication Active: 125/125 tables
✅ Avg Latency: 1.2 seconds
⚠️ Warnings: 2
❌ Errors: 5

Details:
- Total records replicated: 5,234,567
- Peak throughput: 3,456 rec/s (at 14:30)
- Lowest throughput: 890 rec/s (at 03:15)

Warnings:
1. Table VBAP: Latency spike to 15s at 08:45
2. Logging table BSEG: Size 8.5 GB (approaching limit)

Errors:
1. Table MARC: 3 FK violations (resolved)
2. Table KNA1: 2 target locks (retried successfully)

Actions Required:
- Review VBAP performance during business hours
- Schedule BSEG logging table cleanup

[View Full Report]
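A report like the one above can be assembled from the day's collected metrics before mailing it — a minimal sketch; the function signature and field names are illustrative, not part of SLT.

```python
# Build the summary portion of the daily health report from collected
# metrics. Parameter names are illustrative assumptions.

def build_report(date, tables_active, tables_total, avg_latency_s,
                 warnings, errors):
    """Return the report header and summary block as a single string."""
    lines = [
        f"Subject: SLT Daily Health Report - {date}",
        "",
        "Summary:",
        f"Replication Active: {tables_active}/{tables_total} tables",
        f"Avg Latency: {avg_latency_s} seconds",
        f"Warnings: {warnings}",
        f"Errors: {errors}",
    ]
    return "\n".join(lines)

print(build_report("2026-01-21", 125, 125, 1.2, 2, 5))
```

The warning and error detail sections would be appended the same way, then the whole string handed to the mail job scheduled in SM36.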

8. Custom Monitoring Scripts

Shell Script Example

#!/bin/bash
# SLT Health Check Script

# Check replication status; table names containing '/' must be
# double-quoted in SQL (escaped here for the shell)
status=$(hdbsql -n hanaserver:30015 -u SLTREPL -p "$PASSWORD" \
  "SELECT COUNT(*) FROM \"/DMIS/LOG_MARA\" WHERE PROCESSED = ''")

if [ "$status" -gt 10000 ]; then
  echo "WARNING: $status pending records in MARA"
  # Send alert
  curl -X POST https://monitoring/api/alert \
    -d '{"service":"SLT","status":"warning","message":"High pending count"}'
fi

# Check job status (assumes a site-specific 'sm37' CLI wrapper that dumps
# background job statuses; SM37 itself is a SAP GUI transaction, not a
# shell command). grep -E enables the alternation, -c counts matches.
jobs=$(sm37 | grep DMIS | grep -cE "Cancelled|Error")
if [ "$jobs" -gt 0 ]; then
  echo "ERROR: $jobs failed replication jobs"
fi

Python Monitoring Client

import requests
from datetime import datetime
from hdbcli import dbapi  # SAP HANA Python client

class SLTMonitor:
    def __init__(self, hana_host, user, password):
        self.conn = dbapi.connect(address=hana_host, port=30015,
                                  user=user, password=password)

    def check_latency(self, table="MARA", threshold_seconds=10):
        # One logging table per call: SQL has no /DMIS/LOG_* wildcard.
        # SECONDS_BETWEEN is HANA's equivalent of TIMESTAMPDIFF.
        query = f"""
            SELECT '{table}' AS TABLE_NAME,
                   SECONDS_BETWEEN(MIN(TIMESTAMP), CURRENT_TIMESTAMP) AS LATENCY
            FROM "/DMIS/LOG_{table}"
            WHERE PROCESSED = ''
            HAVING SECONDS_BETWEEN(MIN(TIMESTAMP), CURRENT_TIMESTAMP) > ?
        """
        cursor = self.conn.cursor()
        cursor.execute(query, (threshold_seconds,))
        for table_name, latency in cursor.fetchall():
            self.send_alert(f"High latency on {table_name}: {latency}s")

    def send_alert(self, message):
        requests.post('https://monitoring/api/alert',
                      json={'message': message,
                            'timestamp': datetime.now().isoformat()})

9. Best Practices

Monitoring Checklist

Daily:

  • ✅ Check dashboard for red/yellow indicators
  • ✅ Review overnight job logs
  • ✅ Verify no replication stopped
  • ✅ Check error queue

Weekly:

  • ✅ Analyze performance trends
  • ✅ Review logging table growth
  • ✅ Check CPU/memory utilization
  • ✅ Test alert notifications

Monthly:

  • ✅ Capacity planning review
  • ✅ Performance tuning assessment
  • ✅ Update monitoring thresholds
  • ✅ Disaster recovery test

Summary

✅ Real-time dashboard monitoring (LTRC)
✅ Key health metrics and thresholds
✅ Log file locations and analysis
✅ Performance monitoring (ST06, DB02)
✅ Alert configuration and notifications
✅ Trend analysis and capacity planning
✅ Custom monitoring scripts
✅ Best practices for proactive monitoring

Next: Module 8 - Error Handling & Recovery