Module 16: Backup and Disaster Recovery

Implement robust backup strategies and disaster recovery procedures for SLT environments.

1. Backup Architecture

graph TD
A[SLT Production] -->|Full Backup| B[Backup Storage]
A -->|Incremental| B
A -->|Log Backup| B
B -->|Replication| C[DR Site]
C -->|Standby| D[DR SLT/HANA]
A -->|Async Replication| D

2. HANA Backup Strategy

Backup Types

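| Type | Frequency | Duration | Storage | RPO | RTO |
|---|---|---|---|---|---|
| Full | Weekly | 2-4 hours | 500 GB | 1 week | 4-6 hours |
| Incremental | Daily | 30-60 min | 50 GB | 1 day | 2-3 hours |
| Differential | Daily | 1-2 hours | 100 GB | 1 day | 3-4 hours |
| Log | Every 15 min | 2-5 min | 10 GB | 15 min | 1-2 hours |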

Full Backup Configuration

-- Configure backup destination
ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM')
SET ('persistence', 'basepath_databackup') = '/backup/data',
('persistence', 'basepath_logbackup') = '/backup/log'
WITH RECONFIGURE;

-- Perform full backup
BACKUP DATA USING FILE ('COMPLETE_BACKUP_2026_01_15')
COMMENT 'Weekly full backup';

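-- Verify backup (sizes are stored per file in M_BACKUP_CATALOG_FILES)
SELECT
    c.BACKUP_ID,
    c.ENTRY_TYPE_NAME,
    c.STATE_NAME,
    c.SYS_START_TIME,
    c.SYS_END_TIME,
    SUM(f.BACKUP_SIZE) / 1024 / 1024 / 1024 AS SIZE_GB
FROM M_BACKUP_CATALOG c
JOIN M_BACKUP_CATALOG_FILES f ON f.ENTRY_ID = c.ENTRY_ID
GROUP BY c.BACKUP_ID, c.ENTRY_TYPE_NAME, c.STATE_NAME, c.SYS_START_TIME, c.SYS_END_TIME
ORDER BY c.SYS_START_TIME DESC
LIMIT 10;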

Incremental Backup

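-- Tune the backup buffer and the log backup interval.
-- These parameters do not schedule incremental backups; run the
-- BACKUP DATA INCREMENTAL statement below from a scheduler (for example cron).
ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM')
SET ('backup', 'data_backup_buffer_size') = '512',
    ('persistence', 'log_backup_timeout_s') = '900'
WITH RECONFIGURE;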

-- Incremental backup
BACKUP DATA INCREMENTAL USING FILE ('INCR_BACKUP_2026_01_15');

-- Example storage footprint for one week:
Full Backup: 500 GB (Sunday)
Incremental Mon: +45 GB
Incremental Tue: +38 GB
Incremental Wed: +52 GB
Incremental Thu: +41 GB
Incremental Fri: +48 GB
Incremental Sat: +39 GB
Total for week: 763 GB
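
The weekly total is the full backup plus the six incrementals. A minimal sketch of that arithmetic in shell, using the sizes listed above; the helper name `weekly_total_gb` is illustrative and not part of any SAP tool:

```shell
#!/bin/bash
# Sum a full backup and its incrementals (all sizes in GB).
# weekly_total_gb is an illustrative helper name.
weekly_total_gb() {
    local total=0 size
    for size in "$@"; do
        total=$(( total + size ))
    done
    echo "$total"
}

# Full backup (Sunday) followed by the Mon-Sat incrementals from the listing above.
weekly_total_gb 500 45 38 52 41 48 39
```

The same helper can be fed real sizes from the backup catalog to track how the weekly footprint drifts over time.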

Log Backup

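#!/bin/bash
# /usr/sap/HDB/SYS/global/hdb/backup/log_backup.sh
# HANA writes log backups itself when log_mode = normal and automatic log
# backup is enabled; log_backup_timeout_s = 900 caps the interval at 15 minutes.
# This script therefore only reports the latest log backup and prunes old files.

# Report the most recent successful log backup
hdbsql -U BACKUP_USER -a -x << EOF
SELECT MAX(UTC_END_TIME) FROM M_BACKUP_CATALOG
WHERE ENTRY_TYPE_NAME = 'log backup' AND STATE_NAME = 'successful';
EOF

# Remove log backup files older than 7 days.
# Deleting files directly leaves stale entries in the backup catalog;
# BACKUP CATALOG DELETE is the catalog-aware alternative.
find /backup/log -type f -mtime +7 -delete

# Result: RPO = 15 minutes max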

3. SLT Configuration Backup

Export MT_ID Configuration

Transaction: LTRC → Configuration → Export

MT_ID: MT_ID_PROD_01
Export Path: /backup/slt/configs/

Files Created:
├── MT_ID_PROD_01_config.xml (Configuration)
├── MT_ID_PROD_01_tables.xml (Table list)
├── MT_ID_PROD_01_mappings.xml (Transformations)
└── MT_ID_PROD_01_schedule.xml (Job settings)

Storage:
├── Local: /backup/slt/configs/
├── Network: \\backup-server\slt\
└── Cloud: s3://backup-bucket/slt/configs/
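
An export is only useful if all four files are present. The following sketch checks one export directory for the file names listed above; the helper name `missing_config_files` is hypothetical, and the check assumes the `<MT_ID>_<part>.xml` naming shown in the listing:

```shell
#!/bin/bash
# Print the expected export files that are missing for one MT_ID.
# The four suffixes come from the file list above; the helper name is illustrative.
missing_config_files() {
    local dir="$1" mt_id="$2" suffix
    for suffix in config tables mappings schedule; do
        [ -f "$dir/${mt_id}_${suffix}.xml" ] || echo "${mt_id}_${suffix}.xml"
    done
}

# Example against a scratch directory that holds only two of the four files.
demo_dir=$(mktemp -d)
touch "$demo_dir/MT_ID_PROD_01_config.xml" "$demo_dir/MT_ID_PROD_01_tables.xml"
missing_config_files "$demo_dir" MT_ID_PROD_01
rm -rf "$demo_dir"
```

An empty result means the export is complete; any printed name is a file to re-export before the backup is copied offsite.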

Automated Backup Script

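#!/bin/bash
# /usr/sap/SLT/scripts/backup_slt_config.sh

BACKUP_DATE="$(date +%Y%m%d)"
BACKUP_DIR="/backup/slt/$BACKUP_DATE"
mkdir -p "$BACKUP_DIR" || exit 1

# Mass transfer IDs to export. sapcontrol GetProcessList reports instance
# processes, not MT_IDs, so the list is maintained here explicitly.
MT_IDS="MT_ID_PROD_01 MT_ID_PROD_02"

for MT_ID in $MT_IDS
do
    echo "Backing up $MT_ID..."

    # Export via RFC. Rfc_call and LTRC_EXPORT are assumed to be a
    # site-specific RFC wrapper, not a standard SAP command-line tool.
    Rfc_call LTRC_EXPORT \
        MT_ID="$MT_ID" \
        EXPORT_PATH="$BACKUP_DIR/${MT_ID}_config.xml"
done

# Backup logging tables (EXPORT writes to a directory on the HANA server)
hdbsql -U BACKUP_USER << EOF
EXPORT "/DMIS/LOG" AS CSV INTO '$BACKUP_DIR/logs' WITH REPLACE;
EXPORT "/DMIS/DT_STATUS" AS CSV INTO '$BACKUP_DIR/status' WITH REPLACE;
EOF

# Compress (relative paths inside the archive) and upload to cloud
tar -czf "$BACKUP_DIR.tar.gz" -C /backup/slt "$BACKUP_DATE"
aws s3 cp "$BACKUP_DIR.tar.gz" s3://backup-bucket/slt/

echo "Backup completed: $BACKUP_DIR.tar.gz"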
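
Before an archive leaves the host, it can help to record a checksum so a later restore can confirm the file was not corrupted in transit. This is a sketch only: it assumes GNU coreutils `sha256sum` is available, and the helper names `write_checksum` and `verify_checksum` are illustrative.

```shell
#!/bin/bash
# Record a SHA-256 checksum next to an archive, then verify it later.
# Assumes GNU coreutils sha256sum; helper names are illustrative.
write_checksum() {
    # Store only the file name so the .sha256 file stays valid after a move.
    ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

verify_checksum() {
    ( cd "$(dirname "$1")" && sha256sum --check --quiet "$(basename "$1").sha256" )
}

# Example with a throwaway file standing in for the backup archive.
demo=$(mktemp -d)
echo "demo payload" > "$demo/backup.tar.gz"
write_checksum "$demo/backup.tar.gz"
verify_checksum "$demo/backup.tar.gz" && echo "checksum OK"
rm -rf "$demo"
```

Uploading the `.sha256` file alongside the archive lets the DR site run the same verification after download.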

4. Disaster Recovery Setup

DR Site Architecture

Primary Site (Production):
├── SLT Server: slt-prod (Active)
├── HANA: hana-prod (Active)
├── Replication: System Replication to DR
└── RTO: 2 hours, RPO: 5 minutes

DR Site (Standby):
├── SLT Server: slt-dr (Passive)
├── HANA: hana-dr (Secondary)
├── Replication: Async from Primary
└── Activation: Manual/Automatic

HANA System Replication

# On the primary site (as <sid>adm)
# Prerequisite: log_mode = normal and at least one full data backup.
# Replication mode and operation mode are set when the secondary registers,
# so no global.ini change is needed on the primary.

# Enable system replication
hdbnsutil -sr_enable --name=PRIMARY

# On the DR site (stop HANA first)
HDB stop

# Register as secondary
hdbnsutil -sr_register \
    --name=SECONDARY \
    --remoteHost=hana-prod \
    --remoteInstance=00 \
    --replicationMode=async \
    --operationMode=logreplay

# Start secondary
HDB start

# Check replication status
hdbnsutil -sr_state

# Result on the secondary (illustrative, abridged):
mode: async
operation mode: logreplay
mapping: PRIMARY -> SECONDARY
status: ACTIVE
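
Scripts often need a single field from this kind of status text. The sketch below parses the simplified `key: value` layout shown above; real `hdbnsutil -sr_state` output contains more lines and indentation, so treat the format and the helper name `sr_field` as assumptions.

```shell
#!/bin/bash
# Extract the value of one "key: value" line from status text on stdin.
# Assumes the simplified layout shown above; sr_field is an illustrative name.
sr_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

sample_output='mode: async
operation mode: logreplay
mapping: PRIMARY -> SECONDARY
status: ACTIVE'

printf '%s\n' "$sample_output" | sr_field "operation mode"
```

Matching on the exact key avoids confusing `mode` with `operation mode`, which a plain `grep "mode"` would do.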

SLT DR Configuration

SLT DR Setup:

1. Install SLT on DR site (same version as prod)
2. Restore configuration backups
3. Configure RFC destinations to DR ERP
4. Update target system to DR HANA
5. Keep MT_IDs in standby mode

Transaction: LTRC (on DR SLT)
MT_ID: MT_ID_PROD_01_DR
Source: ERP_PROD (via DR network)
Target: hana-dr
Status: ● Standby
Auto-start on failover: ☑ Enabled

5. Failover Procedures

Automatic Failover

#!/bin/bash
# /usr/sap/scripts/slt_failover.sh
# Run on the DR host. A failed ping alone does not prove the primary is down,
# so confirm the outage independently before running this unattended.

LOG_FILE="/var/log/slt_failover.log"

log() {
    echo "$(date): $1" | tee -a "$LOG_FILE"
}

# Health check primary
if ! ping -c 3 hana-prod > /dev/null 2>&1; then
    log "ERROR: Primary HANA unreachable"

    # Initiate failover
    log "Starting failover to DR site..."

    # 1. Takeover DR HANA (stop here if it fails)
    log "Step 1: HANA takeover"
    if ! su - hdbadm -c "hdbnsutil -sr_takeover"; then
        log "ERROR: HANA takeover failed, aborting failover"
        exit 1
    fi

    # 2. Start SLT on DR
    log "Step 2: Starting SLT on DR"
    if ! su - sltadm -c "startsap"; then
        log "ERROR: SLT start failed, aborting failover"
        exit 1
    fi

    # 3. Activate MT_IDs (Rfc_call is assumed to be a site-specific RFC wrapper)
    log "Step 3: Activating replication"
    for MT_ID in MT_ID_PROD_01_DR MT_ID_PROD_02_DR; do
        if Rfc_call LTRC_START MT_ID="$MT_ID"; then
            log "Activated: $MT_ID"
        else
            log "ERROR: Could not activate $MT_ID"
        fi
    done

    # 4. Update DNS
    log "Step 4: Updating DNS"
    aws route53 change-resource-record-sets \
        --hosted-zone-id Z123456 \
        --change-batch '{
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "hana.company.com",
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "10.20.30.40"}]
                }
            }]
        }'

    # 5. Notify team
    log "Step 5: Sending notifications"
    curl -X POST https://alerts.company.com/api/notify \
        -H 'Content-Type: application/json' \
        -d '{"message":"SLT failover to DR completed","severity":"critical"}'

    log "Failover completed"
else
    log "Primary site healthy - no action needed"
fi
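
A single failed probe can be a network blip rather than an outage, and an unnecessary takeover is expensive. One common safeguard is to require several consecutive failures before failing over. The sketch below shows that pattern with stand-in probes; `primary_is_down` and `CHECK_INTERVAL` are hypothetical names, not part of the script above.

```shell
#!/bin/bash
# Return success (0) only if the given probe command fails N times in a row.
# CHECK_INTERVAL is a hypothetical tuning knob: seconds to wait between probes.
CHECK_INTERVAL="${CHECK_INTERVAL:-0}"

primary_is_down() {
    local attempts="$1"; shift
    local i
    for (( i = 0; i < attempts; i++ )); do
        if "$@" > /dev/null 2>&1; then
            return 1        # one successful probe means the primary is alive
        fi
        sleep "$CHECK_INTERVAL"
    done
    return 0
}

# Example with stand-in probes; in a failover script the probe could be
# something like: ping -c 1 hana-prod
if primary_is_down 3 false; then echo "primary down"; fi
if ! primary_is_down 3 true; then echo "primary healthy"; fi
```

In practice the interval would be several seconds, and the probe would ideally test the database port rather than ICMP alone.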

Manual Failover Steps

1. Stop replication on primary (if accessible):
Transaction: LTRC → Stop All

2. Take over DR HANA:
hdbnsutil -sr_takeover

3. Verify DR HANA status:
hdbnsutil -sr_state
→ Should show: mode: primary

4. Start SLT on DR site:
startsap

5. Activate MT_IDs:
Transaction: LTRC → Start Replication

6. Verify data flow:
Check /DMIS/DT_STATUS

7. Update application connections:
Point apps to hana-dr:30015

8. Monitor for 1 hour:
Check throughput, latency, errors

9. Communicate to stakeholders:
Send status update email

Estimated time: 30-45 minutes

6. Failback Procedures

Planned Failback

Scenario: Primary site restored, failback from DR

Step 1: Sync primary with DR data
On the restored primary (HANA must be stopped before registering):
HDB stop
hdbnsutil -sr_register \
--name=PRIMARY_RESTORED \
--remoteHost=hana-dr \
--remoteInstance=00 \
--replicationMode=sync \
--operationMode=logreplay_readaccess
HDB start

Step 2: Wait for full sync
On hana-dr, query M_SERVICE_REPLICATION
→ Wait until: REPLICATION_STATUS = 'ACTIVE' for every service

Step 3: Stop SLT on DR
Transaction: LTRC → Stop All
su - sltadm -c "stopsap"

Step 4: Take over on the restored primary
hdbnsutil -sr_takeover

Step 5: Reconfigure replication (DR as secondary)
On DR site (stop HANA first):
HDB stop
hdbnsutil -sr_register \
--name=SECONDARY \
--remoteHost=hana-prod \
--remoteInstance=00 \
--replicationMode=async \
--operationMode=logreplay
HDB start

Step 6: Start SLT on primary
su - sltadm -c "startsap"
Transaction: LTRC → Start Replication

Step 7: Update DNS back to primary
aws route53 change-resource-record-sets ...

Step 8: Verify operations
Monitor for 24 hours

Estimated time: 2-3 hours
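
Step 2 is a wait-for-condition step, and it is safer with a timeout than as an open-ended manual check. A minimal sketch of such a polling loop follows; `wait_until` and `sync_done` are hypothetical names, and the marker file merely stands in for the real replication status check.

```shell
#!/bin/bash
# Poll a command until it succeeds or a timeout (in seconds) expires.
# wait_until is an illustrative helper name.
wait_until() {
    local timeout="$1" interval="$2"; shift 2
    local deadline=$(( SECONDS + timeout ))
    until "$@"; do
        if (( SECONDS >= deadline )); then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Stand-in for the real sync check (for example a replication status query).
marker=$(mktemp -u)
sync_done() { [ -e "$marker" ]; }

( sleep 1; touch "$marker" ) &
wait_until 5 1 sync_done && echo "fully synchronized"
wait
rm -f "$marker"
```

For a real failback the timeout would be measured in hours, and a timeout should stop the procedure before Step 3 rather than continue.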

7. Testing DR Plan

DR Test Schedule

Quarterly DR Test Plan:

Week 1: Planning
├── Review DR procedures
├── Update contact list
├── Check backup validity
└── Verify DR site readiness

Week 2: Execution
├── Announce test window
├── Perform controlled failover
├── Run application tests
├── Verify data consistency
└── Measure RTO/RPO

Week 3: Failback
├── Return to primary
├── Verify operations
└── Document issues

Week 4: Review
├── Analyze test results
├── Update DR procedures
├── Training for gaps
└── Report to management

DR Test Script

#!/bin/bash
# dr_test.sh - Non-disruptive DR test (run on the DR host as <sid>adm)
# DR_TEST and PROD_MONITOR are assumed hdbuserstore keys for hana-dr:30015
# and hana-prod:30015; they keep passwords off the command line.

TEST_DATE=$(date +%Y%m%d_%H%M%S)
TEST_LOG="/var/log/dr_test_$TEST_DATE.log"

log() {
    echo "$(date): $1" | tee -a "$TEST_LOG"
}

log "=== DR Test Started ==="

# 1. Verify DR HANA is registered as an async secondary
log "Checking HANA replication status..."
if hdbnsutil -sr_state | grep -q "mode: async"; then
    log "✓ HANA replication active"
else
    log "✗ HANA replication issue"
    exit 1
fi

# 2. Test read access on DR
# (needs operation mode logreplay_readaccess; a plain logreplay secondary
# does not accept SQL connections)
log "Testing DR HANA read access..."
if hdbsql -U DR_TEST "SELECT COUNT(*) FROM SLTREPL.VBAP"; then
    log "✓ DR HANA accessible"
else
    log "✗ DR HANA access failed"
    exit 1
fi

# 3. Verify SLT config backups
log "Checking SLT config backups..."
LATEST_BACKUP=$(ls -t /backup/slt/*.tar.gz 2>/dev/null | head -1)
if [ -f "$LATEST_BACKUP" ]; then
    BACKUP_AGE=$(find "$LATEST_BACKUP" -mtime -1)
    if [ -n "$BACKUP_AGE" ]; then
        log "✓ Recent backup found: $LATEST_BACKUP"
    else
        log "⚠ Backup older than 24 hours"
    fi
else
    log "✗ No backup found"
    exit 1
fi

# 4. Simulate failover (without actual takeover)
log "Simulating failover procedures..."
log " - DNS update command validated"
log " - SLT start command validated"
log " - MT_ID activation command validated"

# 5. Measure RPO from the newest log backup in the primary's backup catalog
# (-a and -x suppress headers and status lines; the timestamp is UTC)
log "Calculating RTO/RPO..."
LAST_LOG_BACKUP=$(hdbsql -U PROD_MONITOR -a -x \
    "SELECT MAX(UTC_START_TIME) FROM M_BACKUP_CATALOG WHERE ENTRY_TYPE_NAME='log backup'" \
    | tr -d '"')
RPO_MINUTES=$(( ($(date +%s) - $(date -u -d "$LAST_LOG_BACKUP" +%s)) / 60 ))
log "Current RPO: $RPO_MINUTES minutes (target: < 15 minutes)"

log "=== DR Test Completed ==="
log "Results summary:"
log " HANA Replication: ✓"
log " DR Access: ✓"
log " Backups: ✓"
log " RPO: $RPO_MINUTES minutes"
log " RTO Estimate: 30-45 minutes (based on procedure)"

# Email results
mail -s "DR Test Results - $TEST_DATE" \
    dr-team@company.com < "$TEST_LOG"
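
The RPO figure in the test script is the age of the newest log backup. Isolating that calculation in a function makes it easy to check against fixed timestamps. This sketch assumes GNU `date` (for `-d`) and UTC timestamps in `YYYY-MM-DD HH:MM:SS` form; the name `rpo_minutes` is illustrative.

```shell
#!/bin/bash
# Whole minutes between the last log backup and a reference time (both UTC).
# Requires GNU date for the -d option; rpo_minutes is an illustrative name.
rpo_minutes() {
    local last_backup="$1" now="${2:-$(date -u '+%Y-%m-%d %H:%M:%S')}"
    local last_s now_s
    last_s=$(date -u -d "$last_backup" +%s) || return 1
    now_s=$(date -u -d "$now" +%s) || return 1
    echo $(( (now_s - last_s) / 60 ))
}

# 12 minutes 30 seconds after the last log backup, rounded down.
rpo_minutes "2026-01-15 10:00:00" "2026-01-15 10:12:30"
```

Called with one argument, the function compares against the current time, which is what a monitoring job would use.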

8. Monitoring and Alerting

Backup Monitoring

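-- Check that a data backup (full or incremental) completed in the last 24 hours.
-- With weekly full backups, checking only 'complete data backup' would alert six days out of seven.
SELECT
    CASE
        WHEN MAX(UTC_END_TIME) >= ADD_SECONDS(CURRENT_UTCTIMESTAMP, -86400)
        THEN 'OK'
        ELSE 'ALERT'
    END AS BACKUP_STATUS,
    MAX(UTC_END_TIME) AS LAST_BACKUP,
    SECONDS_BETWEEN(MAX(UTC_END_TIME), CURRENT_UTCTIMESTAMP) / 3600 AS HOURS_AGO
FROM M_BACKUP_CATALOG
WHERE ENTRY_TYPE_NAME IN ('complete data backup', 'incremental data backup')
  AND STATE_NAME = 'successful';

-- Alert if backup failed
SELECT
    BACKUP_ID,
    STATE_NAME,
    MESSAGE
FROM M_BACKUP_CATALOG
WHERE STATE_NAME = 'failed'
  AND UTC_START_TIME >= ADD_DAYS(CURRENT_UTCDATE, -7);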

Replication Lag Monitoring

-- Monitor DR lag (run on the primary).
-- Lag is the gap between the last log written and the last log shipped.
SELECT
    SECONDARY_HOST,
    SECONDARY_PORT,
    REPLICATION_MODE,
    REPLICATION_STATUS,
    SECONDS_BETWEEN(SHIPPED_LOG_POSITION_TIME, LAST_LOG_POSITION_TIME) AS LAG_SEC
FROM M_SERVICE_REPLICATION
WHERE REPLICATION_STATUS <> 'ACTIVE'
   OR SECONDS_BETWEEN(SHIPPED_LOG_POSITION_TIME, LAST_LOG_POSITION_TIME) > 300;

-- Alert if lag > 5 minutes
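
A monitoring job still has to turn each result row into an alert decision. The sketch below applies the same rule as the query (status must be `ACTIVE`, lag at most 300 seconds) to one row; `replication_state` and `MAX_LAG_SEC` are hypothetical names.

```shell
#!/bin/bash
# Classify one replication row: status must be ACTIVE and lag at most 300 s.
# MAX_LAG_SEC mirrors the 5-minute threshold; names are illustrative.
MAX_LAG_SEC=300

replication_state() {
    local status="$1" lag_sec="$2"
    if [ "$status" != "ACTIVE" ] || [ "$lag_sec" -gt "$MAX_LAG_SEC" ]; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

replication_state ACTIVE 42
replication_state ACTIVE 301
replication_state ERROR 0
```

The three calls print `OK`, `ALERT`, and `ALERT`: a healthy row, a row over the lag threshold, and a row whose status is not active.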

9. Recovery Time Objective (RTO) / Recovery Point Objective (RPO)

RTO/RPO Matrix

| Scenario | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Basic Backup | 24 hours | 8 hours | Low | Low |
| Daily Backup + Logs | 15 minutes | 4 hours | Medium | Medium |
| System Replication (Async) | 5 minutes | 2 hours | Medium-High | Medium |
| System Replication (Sync) | 0 minutes | 1 hour | High | High |
| Active-Active | 0 minutes | 0 minutes | Very High | Very High |

Achieving Target RTO/RPO

Target: RTO = 2 hours, RPO = 5 minutes

Implementation:
├── HANA System Replication: Async (RPO = 5 min)
├── Automated Failover Script (RTO reduction)
├── Log Backup: Every 15 minutes (backup safety)
├── DR Testing: Quarterly (validate RTO)
└── 24/7 Monitoring (early detection)

Cost Analysis:
├── DR Hardware: $50K/year
├── Network (dedicated line): $12K/year
├── Storage: $8K/year
├── Personnel (on-call): $30K/year
└── Total: $100K/year

Vs. Downtime Cost:
├── Revenue loss: $50K/hour
├── 2-hour downtime = $100K
└── ROI: Break-even at one avoided 2-hour outage per year
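
The break-even point follows directly from the figures above. A short sketch of the arithmetic, in thousands of dollars, with illustrative variable names:

```shell
#!/bin/bash
# Annual DR cost versus downtime cost, in thousands of dollars.
dr_cost_k=$(( 50 + 12 + 8 + 30 ))      # hardware + network + storage + on-call
downtime_cost_k_per_hour=50

# Hours of avoided downtime per year needed to pay for the DR setup.
breakeven_hours=$(( dr_cost_k / downtime_cost_k_per_hour ))

echo "Annual DR cost: ${dr_cost_k}K"
echo "Break-even: ${breakeven_hours} hours of avoided downtime per year"
```

Any outage longer than the break-even figure, or more than one outage a year, puts the DR investment ahead.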

10. Best Practices

Backup Checklist

  • Full backup weekly (Sunday)
  • Incremental backup daily
  • Log backup every 15 minutes
  • SLT config backup daily
  • Test restores monthly
  • Offsite backup replication
  • Backup retention: 30 days online, 1 year archive
  • Encryption for backup data
  • Automated backup verification
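
The retention rule in the checklist (30 days online, 1 year archive) can be expressed as a small classifier that a cleanup job applies to each backup. This is a sketch under that stated policy; the function name `retention_tier` and the tier labels are illustrative.

```shell
#!/bin/bash
# Map a backup's age in days to a retention tier:
# up to 30 days online, up to 365 days in the archive, then eligible for deletion.
retention_tier() {
    local age_days="$1"
    if (( age_days <= 30 )); then
        echo "online"
    elif (( age_days <= 365 )); then
        echo "archive"
    else
        echo "expired"
    fi
}

for age in 7 30 31 365 400; do
    echo "$age days: $(retention_tier "$age")"
done
```

Keeping the thresholds in one function means a policy change touches a single place rather than every cleanup script.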

DR Checklist

  • DR site provisioned and tested
  • System replication configured
  • Automated failover scripts ready
  • DNS failover configured
  • Application connection strings updated
  • DR testing quarterly
  • Runbook documented and current
  • Team trained on procedures
  • RTO/RPO metrics monitored
  • 24/7 monitoring and alerting

Summary

✅ HANA backup strategies (full, incremental, log)
✅ SLT configuration backup procedures
✅ Disaster recovery architecture
✅ HANA system replication setup
✅ Automated failover procedures
✅ Planned failback processes
✅ DR testing methodology
✅ Monitoring and alerting
✅ RTO/RPO optimization
✅ Best practices checklist

Next: Module 17 - Migration Strategies