
Module 8: Error Handling and Recovery

Master error identification, resolution strategies, and recovery procedures to maintain reliable SLT replication.

1. Error Types and Categories

Error Classification

graph TD
A[SLT Errors] --> B[Source Errors]
A --> C[Network Errors]
A --> D[Target Errors]
A --> E[Application Errors]

B --> B1[Connection failure]
B --> B2[Authorization]
B --> B3[Table locked]

C --> C1[Timeout]
C --> C2[Connection lost]
C --> C3[Bandwidth]

D --> D1[Write failure]
D --> D2[Constraint violation]
D --> D3[Disk full]

E --> E1[Transformation error]
E --> E2[Data type mismatch]
E --> E3[Job failure]

2. Common Errors and Solutions

Error 1: RFC_COMMUNICATION_FAILURE

Error Message: RFC connection to source failed
Severity: CRITICAL
Impact: Replication stopped

Root Causes:
1. Network connectivity issue
2. Source system down
3. RFC user locked/password changed
4. Gateway service not running

Resolution Steps:
1. Check network: ping source-system
2. Test RFC: Transaction SM59 → Test Connection
3. Verify user: SU01 in source
4. Check gateway: Transaction SMGW (restart the gateway if it is not running)
5. Check firewall rules

Prevention:
- Heartbeat monitoring (every 60s)
- Redundant network paths
- RFC user of type System, so the password does not expire
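
The heartbeat item above could be sketched as a plain TCP reachability probe. This is a minimal sketch, not an SLT feature: the function names `is_reachable` and `heartbeat` are hypothetical, and the port is an assumption (an SAP gateway conventionally listens on 33<NN>, where NN is the instance number). It only confirms that the port accepts connections; the SM59 connection test is still needed to validate the RFC user.

```python
import socket
import time
from typing import Callable, Optional


def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def heartbeat(host: str, port: int, interval: float = 60.0,
              max_checks: Optional[int] = None,
              on_failure: Callable[[str], None] = print) -> int:
    """Probe host:port every `interval` seconds and return the number of failed probes.

    `max_checks=None` runs until interrupted; a number makes the loop finite.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not is_reachable(host, port):
            failures += 1
            on_failure(f"Heartbeat failed: {host}:{port} unreachable")
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval)
    return failures
```

A call such as `heartbeat('source-system', 3300)` would probe once per minute; replace `on_failure` with whatever alerting hook is in use.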

Error 2: Table Lock Timeout

Error: Cannot acquire lock on target table
Severity: WARNING
Impact: Temporary delay

Resolution:
1. Identify blocking query:
SELECT * FROM M_BLOCKED_TRANSACTIONS;

2. Kill blocking session (if safe):
ALTER SYSTEM CANCEL SESSION '<session_id>';

3. Automatic retry (configured):
Retry after 30 seconds (up to 3 attempts)

Prevention:
- Schedule long queries during low-replication periods
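- Keep write transactions and DDL on replicated tables short (SAP HANA does not offer READ UNCOMMITTED; plain reads use MVCC snapshots and do not wait for row locks)
- Increase lock timeout for the session on SAP HANA: SET TRANSACTION LOCK WAIT TIMEOUT 300000 (value in milliseconds)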

Error 3: Foreign Key Violation

Error: Referential integrity constraint violated
Example: Insert VBAP without parent VBAK

Resolution:
1. Check load sequence:
Tables must load in dependency order
VBAK (header) before VBAP (items)

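2. Temporary FK disable (initial load):
-- Illustrative syntax: the statement for disabling a constraint is database-specific,
-- so check the ALTER TABLE reference of your target before running it
ALTER TABLE SLTREPL.VBAP DISABLE FOREIGN KEY CONSTRAINT FK_VBAP_VBAK;
[Load data]
ALTER TABLE SLTREPL.VBAP ENABLE FOREIGN KEY CONSTRAINT FK_VBAP_VBAK;

3. Deferred constraint checking (illustrative; not every target database supports it):
SET CONSTRAINT_MODE = DEFERRED;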
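
The dependency-order rule from step 1 can be sketched with a topological sort. The dependency map below is a hypothetical example (VBEP, the schedule line table, is added only to show a third level); in practice it would be derived from the constraints defined on the target.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each table lists the parent tables it references.
DEPENDENCIES = {
    "VBAK": set(),       # sales document header, no parents
    "VBAP": {"VBAK"},    # items reference the header
    "VBEP": {"VBAP"},    # schedule lines reference the items
}


def load_order(dependencies):
    """Return table names ordered so that every parent precedes its children."""
    return list(TopologicalSorter(dependencies).static_order())


print(load_order(DEPENDENCIES))  # ['VBAK', 'VBAP', 'VBEP']
```

If the map contains a cycle, `static_order()` raises `graphlib.CycleError`, which is a useful signal that the tables cannot be loaded with constraints enabled.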

3. Error Queue Management

Viewing Error Queue

Transaction: LTRC → Errors Tab

┌──────────────────────────────────────────────────┐
│ Error Queue: ECC_HANA_01 │
├──────────────────────────────────────────────────┤
│ Total Errors: 45 │
│ Last Hour: 12 | Last 24h: 45 │
│ │
│ Table │ Operation │ Error Type │ Count │
│ ────────┼───────────┼─────────────────┼─────────│
│ VBAP │ INSERT │ FK Violation │ 23 │
│ MARA │ UPDATE │ Lock Timeout │ 15 │
│ KNA1 │ INSERT │ Duplicate Key │ 5 │
│ BSEG │ INSERT │ Data Too Long │ 2 │
│ │
│ [Retry All] [Retry Selected] [Skip] [Download] │
└──────────────────────────────────────────────────┘

Error Details

-- Query error details
SELECT
ERROR_ID,
TABLE_NAME,
OPERATION,
ERROR_MESSAGE,
ERROR_TIMESTAMP,
RETRY_COUNT,
SOURCE_DATA
FROM "/DMIS/ERROR_QUEUE"
WHERE MT_ID = 'ECC_HANA_01'
AND ERROR_TIMESTAMP >= ADD_DAYS(CURRENT_DATE, -1)
ORDER BY ERROR_TIMESTAMP DESC;

4. Automatic Recovery

Retry Configuration

LTRC → Advanced Settings → Error Handling

Retry Strategy:
┌──────────────────────────────────────┐
│ Automatic Retry Settings │
├──────────────────────────────────────┤
│ ☑ Enable Auto Retry │
│ │
│ Max Retry Attempts: [3] │
│ Initial Retry Delay: [30] seconds │
│ Backoff Strategy: ● Exponential │
│ ○ Linear │
│ ○ Fixed │
│ │
│ Retry Schedule: │
│ Attempt 1: After 30 seconds │
│ Attempt 2: After 60 seconds │
│ Attempt 3: After 120 seconds │
│ │
│ After Max Retries: │
│ ● Move to error queue │
│ ○ Skip record │
│ ○ Stop replication │
└──────────────────────────────────────┘
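
The retry schedule shown in the settings follows from the initial delay and the backoff strategy. A minimal sketch of that arithmetic, assuming "exponential" doubles the delay on each attempt and "linear" adds the initial delay on each attempt (the function name `retry_delays` is hypothetical):

```python
def retry_delays(initial=30, attempts=3, strategy="exponential"):
    """Return the wait time in seconds before each retry attempt."""
    if strategy == "exponential":
        # 30, 60, 120, ...: the delay doubles every attempt
        return [initial * 2 ** i for i in range(attempts)]
    if strategy == "linear":
        # 30, 60, 90, ...: the delay grows by `initial` every attempt (assumed definition)
        return [initial * (i + 1) for i in range(attempts)]
    if strategy == "fixed":
        # 30, 30, 30, ...: the delay never changes
        return [initial] * attempts
    raise ValueError(f"Unknown backoff strategy: {strategy}")


print(retry_delays())  # [30, 60, 120]
```

With the defaults this reproduces the schedule in the dialog; total waiting time before the record reaches the error queue is 210 seconds.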

Retry Workflow

sequenceDiagram
participant Job as Replication Job
participant Target as Target System
participant Queue as Error Queue

Job->>Target: Attempt to write
Target-->>Job: ❌ Error
Job->>Job: Wait 30s (Attempt 1)
Job->>Target: Retry write
Target-->>Job: ❌ Error
Job->>Job: Wait 60s (Attempt 2)
Job->>Target: Retry write
Target-->>Job: ❌ Error
Job->>Job: Wait 120s (Attempt 3)
Job->>Target: Retry write
Target-->>Job: ❌ Error
Job->>Queue: Move to error queue
Job->>Job: Alert administrator
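
The workflow above can be sketched as a small retry loop. This is an illustration of the control flow only, not SLT's internal implementation; `write_with_retry`, the `write` callback, and the list used as an error queue are all hypothetical stand-ins.

```python
import time


def write_with_retry(write, record, error_queue, delays=(30, 60, 120),
                     sleep=time.sleep, alert=print):
    """Call write(record); on failure wait and retry once per entry in `delays`.

    Returns True on success. After the last retry fails, the record is
    appended to `error_queue` and `alert` is called once.
    """
    last_error = None
    for wait in (0, *delays):  # the first attempt is immediate
        if wait:
            sleep(wait)
        try:
            write(record)
            return True
        except Exception as exc:  # any write failure triggers the next retry
            last_error = exc
    error_queue.append({"record": record, "error": str(last_error)})
    alert(f"Record moved to error queue after {len(delays)} retries: {last_error}")
    return False
```

Passing `sleep` and `alert` as parameters keeps the loop testable without real waiting and lets the alert go to any channel.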

5. Manual Error Resolution

Step-by-Step Resolution

Step 1: Identify Error Pattern

-- Group errors by type
SELECT
ERROR_TYPE,
COUNT(*) as ERROR_COUNT,
MAX(ERROR_MESSAGE) as SAMPLE_MESSAGE
FROM "/DMIS/ERROR_QUEUE"
GROUP BY ERROR_TYPE
ORDER BY ERROR_COUNT DESC;

-- Result:
ERROR_TYPE      ERROR_COUNT   SAMPLE_MESSAGE
FK_VIOLATION    23            Parent record not found
LOCK_TIMEOUT    15            Table locked by session 12345
DUPLICATE_KEY   5             Unique constraint violated

Step 2: Fix Root Cause

FK Violations → Adjust load sequence or disable FK temporarily
Lock Timeouts → Identify and terminate blocking queries
Duplicate Keys → Check for duplicate records in source
Data Issues → Clean source data or add transformation
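
Steps 1 and 2 can be combined into a small triage helper that counts errors by type and attaches the matching action. A minimal sketch: the error type strings follow the sample result above, while `DATA_TOO_LONG` and the fallback action are assumptions added for illustration.

```python
from collections import Counter

# Error type -> recommended action, taken from the list above.
ACTIONS = {
    "FK_VIOLATION": "Adjust load sequence or disable FK temporarily",
    "LOCK_TIMEOUT": "Identify and terminate blocking queries",
    "DUPLICATE_KEY": "Check for duplicate records in source",
    "DATA_TOO_LONG": "Clean source data or add transformation",  # assumed type name
}
DEFAULT_ACTION = "Investigate manually"  # assumption: fallback for unknown types


def triage(error_types):
    """Return (error_type, count, action) tuples, most frequent type first."""
    counts = Counter(error_types)
    return [(etype, n, ACTIONS.get(etype, DEFAULT_ACTION))
            for etype, n in counts.most_common()]


sample = ["FK_VIOLATION"] * 23 + ["LOCK_TIMEOUT"] * 15 + ["DUPLICATE_KEY"] * 5
for etype, n, action in triage(sample):
    print(f"{etype:<15}{n:>4}  {action}")
```

Working from the most frequent type downward usually clears the queue fastest, since one root cause often accounts for most entries.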

Step 3: Retry Failed Records

LTRC → Errors Tab
1. Select errors to retry
2. Click [Retry Selected]
3. Monitor success rate
4. Repeat if needed

Step 4: Skip Unrecoverable Errors

For errors that cannot be resolved:
1. Document the issue
2. Export error details
3. Click [Skip]
4. Monitor for recurrence

6. Recovery Procedures

Scenario 1: Complete Replication Failure

Symptoms:

  • All replication jobs stopped
  • MT_ID status: ● Red (Error)
  • Thousands of pending records

Recovery:

Step 1: Stop all replication
LTRC → [Stop Replication]

Step 2: Identify root cause
- Check logs: SM21, SM37
- Check connectivity: SM59
- Check target: DBACOCKPIT

Step 3: Fix issue
- Restart target database if needed
- Fix network connectivity
- Reset RFC connections

Step 4: Clear error queue
LTRC → Errors → [Clear Resolved]

Step 5: Resume replication
LTRC → [Resume Replication]

Step 6: Monitor catch-up
Watch logging table size decrease
Verify no new errors

Scenario 2: Logging Table Overflow

Symptoms:

  • Logging tables > 50 GB
  • Slow replication
  • Disk space warnings

Recovery:

-- Emergency cleanup
-- 1. Stop replication temporarily: LTRC → [Pause]

-- 2. Archive error records
CREATE TABLE "/DMIS/ERROR_ARCHIVE" AS
(SELECT * FROM "/DMIS/ERROR_QUEUE");

-- 3. Delete processed records
--    SQL does not accept wildcards in table names, so run this once
--    per logging table, replacing <NAME>
DELETE FROM "/DMIS/LOG_<NAME>" WHERE PROCESSED = 'X';

-- 4. Delete old errors (> 7 days)
DELETE FROM "/DMIS/ERROR_QUEUE"
WHERE ERROR_TIMESTAMP < ADD_DAYS(CURRENT_DATE, -7);

-- 5. Reorganize tables (again once per logging table)
MERGE DELTA OF "/DMIS/LOG_<NAME>";

-- 6. Resume replication: LTRC → [Resume]

Scenario 3: Data Corruption Detection

Detection:

-- Row count mismatch
Source: 1,000,000 rows
Target: 999,850 rows
Missing: 150 rows

-- Checksum validation
Source checksum: ABC123...
Target checksum: ABC456...
Result: MISMATCH ❌
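
The two detection checks can be sketched in a few lines once rows from both sides are available in memory. This is a simplified illustration that assumes each row is a tuple whose first element is the primary key; `table_checksum` and `compare_tables` are hypothetical helper names, and for large tables the comparison would normally be pushed down to the databases instead.

```python
import hashlib


def table_checksum(rows):
    """Order-independent checksum: hash every row, sort the hashes, hash the result."""
    digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()


def compare_tables(source_rows, target_rows, key=lambda row: row[0]):
    """Compare two row sets by key and report counts, differences, and checksum match."""
    source = {key(r): tuple(r) for r in source_rows}
    target = {key(r): tuple(r) for r in target_rows}
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing_keys": sorted(k for k in source if k not in target),
        "extra_keys": sorted(k for k in target if k not in source),
        "changed_keys": sorted(k for k in source if k in target and source[k] != target[k]),
        "checksums_match": table_checksum(source.values()) == table_checksum(target.values()),
    }
```

The `missing_keys` list is what feeds a partial reload: it names exactly the records that need to be loaded again.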

Recovery:

Step 1: Stop delta replication
Prevent further changes

Step 2: Identify missing/corrupt records
Compare source vs target
Use checksums or row counts

Step 3: Re-initialize table
Option A: Full reload
- Remove table from MT_ID
- Re-add with initial load

Option B: Partial reload
- Load only missing date ranges
- WHERE ERDAT >= '20260115'

Step 4: Validate consistency
- Verify row counts match
- Sample data comparison
- Checksum validation

Step 5: Resume delta
Restart triggers and replication

7. Preventive Measures

Proactive Monitoring

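# Automated health check script
from datetime import datetime, timedelta
from hdbcli import dbapi  # SAP HANA Python client

def send_alert(message):
    # Placeholder: connect this to email, chat, or your monitoring tool
    print(f'ALERT: {message}')

def check_slt_health(password):
    conn = dbapi.connect(address='hana', port=30015, user='SLTREPL', password=password)
    cur = conn.cursor()

    # Check 1: Logging table size (DISK_SIZE is reported in bytes)
    cur.execute("SELECT SUM(DISK_SIZE) FROM M_TABLE_PERSISTENCE_STATISTICS "
                "WHERE TABLE_NAME LIKE '/DMIS/LOG%'")
    size = cur.fetchone()[0] or 0
    if size > 10 * 1024**3:  # 10 GB
        send_alert('Logging tables > 10 GB')

    # Check 2: Error rate over the last hour
    cur.execute('SELECT COUNT(*) FROM "/DMIS/ERROR_QUEUE" WHERE ERROR_TIMESTAMP >= ?',
                (datetime.now() - timedelta(hours=1),))
    errors = cur.fetchone()[0]
    if errors > 100:
        send_alert(f'High error rate: {errors} errors/hour')

    # Check 3: Latency. SQL has no table-name wildcards, so query each logging table.
    cur.execute("SELECT TABLE_NAME FROM M_TABLES WHERE TABLE_NAME LIKE '/DMIS/LOG%'")
    latency = 0
    for (table,) in cur.fetchall():
        cur.execute(f"""
            SELECT MAX(SECONDS_BETWEEN("TIMESTAMP", CURRENT_TIMESTAMP))
            FROM "{table}" WHERE PROCESSED = ''
        """)
        latency = max(latency, cur.fetchone()[0] or 0)
    if latency > 300:  # 5 minutes
        send_alert(f'High latency: {latency} seconds')

    conn.close()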

Best Practices

✅ Schedule daily health checks
✅ Set up real-time alerts for critical errors
✅ Maintain error rate < 0.1%
✅ Keep logging tables < 10 GB
✅ Test recovery procedures monthly
✅ Document all error resolutions
✅ Maintain error resolution runbook
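
The two numeric targets in this list (error rate below 0.1%, logging tables below 10 GB) can be checked without any database access once the metrics have been collected. A minimal sketch; `evaluate_health` and its parameters are hypothetical, and the error rate is assumed to mean failed records divided by records processed in the same period.

```python
GIB = 1024 ** 3


def evaluate_health(logging_bytes, errors, records_processed,
                    max_logging_bytes=10 * GIB, max_error_rate=0.001):
    """Return a list of findings; an empty list means both targets are met."""
    findings = []
    if logging_bytes >= max_logging_bytes:
        findings.append(
            f"Logging tables at {logging_bytes / GIB:.1f} GB (target: < 10 GB)"
        )
    # Avoid division by zero when nothing was processed in the period.
    error_rate = errors / records_processed if records_processed else 0.0
    if error_rate >= max_error_rate:
        findings.append(f"Error rate at {error_rate:.2%} (target: < 0.1%)")
    return findings
```

Keeping the threshold logic separate from the queries makes it easy to unit test and to reuse in a daily report as well as in real-time alerts.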

8. Error Documentation Template

## Error Report Template

**Error ID:** ERR-2026-01-21-001
**Date/Time:** 2026-01-21 14:30:00
**Severity:** High
**MT_ID:** ECC_HANA_01
**Table:** VBAP

**Error Message:**
Foreign key constraint violation - parent record not found

**Impact:**
- 156 sales items failed to replicate
- Order 1234567 incomplete in target

**Root Cause:**
Sales header (VBAK) replication delayed due to network issue,
items (VBAP) attempted to replicate before header

**Resolution:**
1. Identified network latency spike at 14:25
2. Waited for VBAK replication to catch up
3. Retried VBAP records
4. All 156 records successfully replicated

**Time to Resolve:** 15 minutes
**Preventive Actions:**
- Increase retry delay for FK errors to 60s
- Add dependency checking before item replication

Summary

✅ Error types and classification
✅ Common errors and resolutions
✅ Error queue management
✅ Automatic retry configuration
✅ Manual resolution procedures
✅ Recovery scenarios (failure, overflow, corruption)
✅ Preventive monitoring
✅ Error documentation

Next: Module 9 - Performance Tuning