Module 8: Error Handling and Recovery

Master error identification, resolution strategies, and recovery procedures to maintain reliable SLT replication.

1. Error Types and Categories

Error Classification

graph TD
    A[SLT Errors] --> B[Source Errors]
    A --> C[Network Errors]
    A --> D[Target Errors]
    A --> E[Application Errors]
    
    B --> B1[Connection failure]
    B --> B2[Authorization]
    B --> B3[Table locked]
    
    C --> C1[Timeout]
    C --> C2[Connection lost]
    C --> C3[Bandwidth]
    
    D --> D1[Write failure]
    D --> D2[Constraint violation]
    D --> D3[Disk full]
    
    E --> E1[Transformation error]
    E --> E2[Data type mismatch]
    E --> E3[Job failure]
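
The classification above can also be kept as a lookup table so that monitoring scripts tag each error with its category. This is a minimal sketch: the dictionary is transcribed from the diagram, while the `categorize` helper and the "Unknown" fallback are assumptions made for illustration.

```python
# Category -> error types, transcribed from the classification diagram
ERROR_CATEGORIES = {
    "Source": ["Connection failure", "Authorization", "Table locked"],
    "Network": ["Timeout", "Connection lost", "Bandwidth"],
    "Target": ["Write failure", "Constraint violation", "Disk full"],
    "Application": ["Transformation error", "Data type mismatch", "Job failure"],
}

# Reverse index: lower-cased error type -> category
CATEGORY_OF = {
    error.lower(): category
    for category, errors in ERROR_CATEGORIES.items()
    for error in errors
}


def categorize(error_type: str) -> str:
    """Return the category of an error type, or 'Unknown' if it is not in the diagram."""
    return CATEGORY_OF.get(error_type.strip().lower(), "Unknown")


print(categorize("Disk full"))
```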

2. Common Errors and Solutions

Error 1: RFC_COMMUNICATION_FAILURE

Error Message: RFC connection to source failed
Severity: CRITICAL
Impact: Replication stopped

Root Causes:
1. Network connectivity issue
2. Source system down
3. RFC user locked/password changed
4. Gateway service not running

Resolution Steps:
1. Check network: ping source-system
2. Test RFC: Transaction SM59 → Test Connection
3. Verify user: SU01 in source
4. Check gateway: Transaction SMGW (Gateway Monitor); restart the gateway process if it is down
5. Check firewall rules

Prevention:
- Heartbeat monitoring (every 60s)
- Redundant network paths
- RFC user with no expiration

Error 2: Table Lock Timeout

Error: Cannot acquire lock on target table
Severity: WARNING
Impact: Temporary delay

Resolution:
1. Identify the blocking transaction:
   SELECT * FROM M_BLOCKED_TRANSACTIONS;
   
2. Cancel the blocking session (if safe), using the lock owner's connection ID:
   ALTER SYSTEM CANCEL SESSION '<connection_id>';
   
3. Automatic retry (configured):
   Retry after 30 seconds (up to 3 attempts)

Prevention:
- Schedule long queries during low-replication periods
- Run reporting queries at READ COMMITTED (the SAP HANA default) rather than a stricter isolation level
- Increase the lock wait timeout for the session: SET TRANSACTION LOCK WAIT TIMEOUT 300000 (value in milliseconds)

Error 3: Foreign Key Violation

Error: Referential integrity constraint violated
Example: Insert VBAP without parent VBAK

Resolution:
1. Check load sequence:
   Tables must load in dependency order
   VBAK (header) before VBAP (items)

2. Temporary FK disable (initial load):
   ALTER TABLE SLTREPL.VBAP DISABLE FOREIGN KEY CONSTRAINT FK_VBAP_VBAK;
   [Load data]
   ALTER TABLE SLTREPL.VBAP ENABLE FOREIGN KEY CONSTRAINT FK_VBAP_VBAK;

3. Deferred constraint checking:
   SET CONSTRAINT_MODE = DEFERRED;

3. Error Queue Management

Viewing Error Queue

Transaction: LTRC → Errors Tab

┌──────────────────────────────────────────────────┐
│ Error Queue: ECC_HANA_01                        │
├──────────────────────────────────────────────────┤
│ Total Errors: 45                                 │
│ Last Hour: 12  |  Last 24h: 45                  │
│                                                   │
│ Table   │ Operation │ Error Type      │ Count   │
│ ────────┼───────────┼─────────────────┼─────────│
│ VBAP    │ INSERT    │ FK Violation    │ 23      │
│ MARA    │ UPDATE    │ Lock Timeout    │ 15      │
│ KNA1    │ INSERT    │ Duplicate Key   │ 5       │
│ BSEG    │ INSERT    │ Data Too Long   │ 2       │
│                                                   │
│ [Retry All] [Retry Selected] [Skip] [Download]  │
└──────────────────────────────────────────────────┘

Error Details

-- Query error details
SELECT 
  ERROR_ID,
  TABLE_NAME,
  OPERATION,
  ERROR_MESSAGE,
  ERROR_TIMESTAMP,
  RETRY_COUNT,
  SOURCE_DATA
FROM "/DMIS/ERROR_QUEUE"
WHERE MT_ID = 'ECC_HANA_01'
AND ERROR_TIMESTAMP >= ADD_DAYS(CURRENT_DATE, -1)
ORDER BY ERROR_TIMESTAMP DESC;

4. Automatic Recovery

Retry Configuration

LTRC → Advanced Settings → Error Handling

Retry Strategy:
┌──────────────────────────────────────┐
│ Automatic Retry Settings            │
├──────────────────────────────────────┤
│ ☑ Enable Auto Retry                │
│                                      │
│ Max Retry Attempts: [3]             │
│ Initial Retry Delay: [30] seconds   │
│ Backoff Strategy: ● Exponential     │
│                    ○ Linear          │
│                    ○ Fixed           │
│                                      │
│ Retry Schedule:                      │
│ Attempt 1: After 30 seconds         │
│ Attempt 2: After 60 seconds         │
│ Attempt 3: After 120 seconds        │
│                                      │
│ After Max Retries:                  │
│ ● Move to error queue               │
│ ○ Skip record                       │
│ ○ Stop replication                  │
└──────────────────────────────────────┘
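
The retry schedule shown in the settings follows from the initial delay and the backoff strategy. Below is a small sketch of that arithmetic; the function name is hypothetical, and the linear and fixed formulas are assumptions about what those options mean (only the exponential schedule of 30, 60, and 120 seconds appears in the settings).

```python
def retry_delays(max_attempts=3, initial_delay=30, strategy="exponential"):
    """Return the wait in seconds before each retry attempt."""
    delays = []
    for attempt in range(1, max_attempts + 1):
        if strategy == "exponential":
            delay = initial_delay * 2 ** (attempt - 1)  # doubles each attempt
        elif strategy == "linear":
            delay = initial_delay * attempt  # assumed: grows by initial_delay
        elif strategy == "fixed":
            delay = initial_delay  # assumed: same wait every time
        else:
            raise ValueError(f"unknown backoff strategy: {strategy}")
        delays.append(delay)
    return delays


print(retry_delays())  # default: exponential, 3 attempts, 30 s initial delay
```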

Retry Workflow

sequenceDiagram
    participant Job as Replication Job
    participant Target as Target System
    participant Queue as Error Queue
    
    Job->>Target: Attempt to write
    Target-->>Job: ❌ Error
    Job->>Job: Wait 30s (Attempt 1)
    Job->>Target: Retry write
    Target-->>Job: ❌ Error
    Job->>Job: Wait 60s (Attempt 2)
    Job->>Target: Retry write
    Target-->>Job: ❌ Error
    Job->>Job: Wait 120s (Attempt 3)
    Job->>Target: Retry write
    Target-->>Job: ❌ Error
    Job->>Queue: Move to error queue
    Job->>Job: Alert administrator
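
The workflow in the diagram can be sketched as a retry loop: one initial write, up to three retries with growing waits, then a hand-off to the error queue and an alert. This is an illustration of the pattern rather than SLT's internal implementation; `write`, `error_queue`, and `alert` are hypothetical stand-ins for the target write, the error queue, and the administrator notification.

```python
import time


def write_with_retry(write, record, error_queue, delays=(30, 60, 120),
                     sleep=time.sleep, alert=print):
    """Call write(record); on failure, wait and retry once per entry in `delays`.

    Returns True on success. After the last retry fails, the record and the
    last error message are appended to `error_queue`, `alert` is called, and
    False is returned.
    """
    last_error = None
    for wait in (0, *delays):  # the initial attempt has no wait
        if wait:
            sleep(wait)
        try:
            write(record)
            return True
        except Exception as exc:  # any write failure triggers a retry
            last_error = exc
    error_queue.append((record, str(last_error)))
    alert(f"Record moved to error queue after {len(delays)} retries: {last_error}")
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.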

5. Manual Error Resolution

Step-by-Step Resolution

Step 1: Identify Error Pattern

-- Group errors by type
SELECT 
  ERROR_TYPE,
  COUNT(*) as ERROR_COUNT,
  MAX(ERROR_MESSAGE) as SAMPLE_MESSAGE
FROM "/DMIS/ERROR_QUEUE"
GROUP BY ERROR_TYPE
ORDER BY ERROR_COUNT DESC;

-- Result:
-- ERROR_TYPE      ERROR_COUNT  SAMPLE_MESSAGE
-- FK_VIOLATION    23           Parent record not found
-- LOCK_TIMEOUT    15           Table locked by session 12345
-- DUPLICATE_KEY   5            Unique constraint violated

Step 2: Fix Root Cause

FK Violations → Adjust load sequence or disable FK temporarily
Lock Timeouts → Identify and terminate blocking queries
Duplicate Keys → Check for duplicate records in source
Data Issues → Clean source data or add transformation
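
For exported error records, the same grouping and the mapping above can be combined into a small triage helper. This sketch assumes the error type codes from the Step 1 result; `DATA_TOO_LONG` is an assumed code for the "Data Issues" row, and the fallback text for unlisted types is also an assumption.

```python
from collections import Counter

# Recommended action per error type, taken from the mapping above
ACTIONS = {
    "FK_VIOLATION": "Adjust load sequence or disable FK temporarily",
    "LOCK_TIMEOUT": "Identify and terminate blocking queries",
    "DUPLICATE_KEY": "Check for duplicate records in source",
    "DATA_TOO_LONG": "Clean source data or add transformation",  # assumed code
}
FALLBACK_ACTION = "Investigate manually"  # assumption for unlisted types


def triage(error_types):
    """Return (error_type, count, action) tuples, most frequent type first."""
    counts = Counter(error_types)
    return [
        (error_type, count, ACTIONS.get(error_type, FALLBACK_ACTION))
        for error_type, count in counts.most_common()
    ]


sample = ["FK_VIOLATION"] * 3 + ["LOCK_TIMEOUT"] * 2 + ["DUPLICATE_KEY"]
for error_type, count, action in triage(sample):
    print(f"{error_type:<15} {count:>3}  {action}")
```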

Step 3: Retry Failed Records

LTRC → Errors Tab
Select errors to retry
Click [Retry Selected]
Monitor success rate
Repeat if needed

Step 4: Skip Unrecoverable Errors

For errors that cannot be resolved:
Document the issue
Export error details
Click [Skip]
Monitor for recurrence

6. Recovery Procedures

Scenario 1: Complete Replication Failure

Symptoms:

All replication jobs stopped
MT_ID status: ● Red (Error)
Thousands of pending records

Recovery:

Step 1: Stop all replication
  LTRC → [Stop Replication]

Step 2: Identify root cause
  - Check logs: SM21, SM37
  - Check connectivity: SM59
  - Check target: DBACOCKPIT

Step 3: Fix issue
  - Restart target database if needed
  - Fix network connectivity
  - Reset RFC connections

Step 4: Clear error queue
  LTRC → Errors → [Clear Resolved]

Step 5: Resume replication
  LTRC → [Resume Replication]

Step 6: Monitor catch-up
  Watch logging table size decrease
  Verify no new errors

Scenario 2: Logging Table Overflow

Symptoms:

Logging tables > 50 GB
Slow replication
Disk space warnings

Recovery:

-- Emergency cleanup
-- 1. Stop replication temporarily
--    LTRC → [Pause]

-- 2. Archive error records
CREATE TABLE "/DMIS/ERROR_ARCHIVE" AS
  (SELECT * FROM "/DMIS/ERROR_QUEUE");

-- 3. Delete processed records (repeat per logging table; SQL table names cannot contain wildcards)
DELETE FROM "/DMIS/LOG_<table>" WHERE PROCESSED = 'X';

-- 4. Delete old errors (> 7 days)
DELETE FROM "/DMIS/ERROR_QUEUE"
WHERE ERROR_TIMESTAMP < ADD_DAYS(CURRENT_DATE, -7);

-- 5. Reorganize tables (repeat per logging table)
MERGE DELTA OF "/DMIS/LOG_<table>";

-- 6. Resume replication
--    LTRC → [Resume]
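
Because SQL has no wildcard for table names, the delete and merge steps have to be issued once per logging table. A short script can generate those statements from a list of names. This is a sketch: the table names in the example are hypothetical and follow the naming used in this module, and the generated statements should be reviewed before they are run.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier so names containing '/' are accepted."""
    return '"' + name.replace('"', '""') + '"'


def cleanup_statements(log_tables):
    """Return the DELETE and MERGE DELTA statements for each logging table."""
    statements = []
    for table in log_tables:
        quoted = quote_ident(table)
        statements.append(f"DELETE FROM {quoted} WHERE PROCESSED = 'X';")
        statements.append(f"MERGE DELTA OF {quoted};")
    return statements


# Hypothetical logging table names
for statement in cleanup_statements(["/DMIS/LOG_VBAK", "/DMIS/LOG_VBAP"]):
    print(statement)
```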

Scenario 3: Data Corruption Detection

Detection:

-- Row count mismatch
Source: 1,000,000 rows
Target: 999,850 rows
Missing: 150 rows

-- Checksum validation
Source checksum: ABC123...
Target checksum: ABC456...
Result: MISMATCH ❌

Recovery:

Step 1: Stop delta replication
  Prevent further changes

Step 2: Identify missing/corrupt records
  Compare source vs target
  Use checksums or row counts

Step 3: Re-initialize table
  Option A: Full reload
    - Remove table from MT_ID
    - Re-add with initial load
  
  Option B: Partial reload
    - Load only missing date ranges
    - WHERE ERDAT >= '20260115'

Step 4: Validate consistency
  - Verify row counts match
  - Sample data comparison
  - Checksum validation

Step 5: Resume delta
  Restart triggers and replication
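
Steps 2 and 4 compare source and target by row count, key, and checksum. The sketch below shows one way to do that on rows already fetched into memory; the function names are hypothetical, the first column is assumed to be the primary key, and for large tables the comparison would need to run per key range or inside the database.

```python
import hashlib


def table_checksum(rows):
    """Order-independent checksum: hash each row, then hash the sorted row hashes."""
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(row_hashes).encode("ascii")).hexdigest()


def compare_tables(source_rows, target_rows, key=lambda row: row[0]):
    """Report missing, extra, and changed rows between source and target."""
    source = {key(row): tuple(row) for row in source_rows}
    target = {key(row): tuple(row) for row in target_rows}
    shared = source.keys() & target.keys()
    return {
        "row_count_match": len(source) == len(target),
        "checksum_match": table_checksum(source.values()) == table_checksum(target.values()),
        "missing": sorted(source.keys() - target.keys()),  # in source only
        "extra": sorted(target.keys() - source.keys()),    # in target only
        "changed": sorted(k for k in shared if source[k] != target[k]),
    }
```

The `missing` and `changed` keys are the candidates for the partial reload in Option B.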

7. Preventive Measures

Proactive Monitoring

# Automated health check script
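from datetime import datetime, timedelta

from hdbcli import dbapi  # SAP HANA client for Python (pip install hdbcli)


def send_alert(message):
    # Placeholder: replace with your e-mail, chat, or paging integration
    print(f'[SLT ALERT] {message}')


def check_slt_health(log_table, password):
    # log_table: one logging table name; SQL cannot expand '/DMIS/LOG_*'
    conn = dbapi.connect(address='hana', port=30015, user='SLTREPL', password=password)
    cur = conn.cursor()
    try:
        # Check 1: Logging table size (DISK_SIZE is reported in bytes)
        cur.execute("SELECT SUM(DISK_SIZE) FROM M_TABLE_PERSISTENCE_STATISTICS "
                    "WHERE TABLE_NAME LIKE '/DMIS/LOG%'")
        size = cur.fetchone()[0] or 0
        if size > 10 * 1024**3:  # 10 GB
            send_alert('Logging tables > 10 GB')

        # Check 2: Error rate
        cur.execute('SELECT COUNT(*) FROM "/DMIS/ERROR_QUEUE" WHERE ERROR_TIMESTAMP >= ?',
                    (datetime.now() - timedelta(hours=1),))
        errors = cur.fetchone()[0]
        if errors > 100:
            send_alert(f'High error rate: {errors} errors/hour')

        # Check 3: Latency (age in seconds of the oldest unprocessed record)
        cur.execute(f'''
            SELECT MAX(SECONDS_BETWEEN("TIMESTAMP", CURRENT_TIMESTAMP))
            FROM "{log_table}" WHERE PROCESSED = ''
        ''')
        latency = cur.fetchone()[0] or 0
        if latency > 300:  # 5 minutes
            send_alert(f'High latency: {latency} seconds')
    finally:
        cur.close()
        conn.close()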
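
The three thresholds used by the health check can be separated from the database access, which makes them testable without a HANA connection. This is a sketch under that assumption; `evaluate_health` and the `THRESHOLDS` dictionary are hypothetical names, and the limits are the ones used in the script above.

```python
THRESHOLDS = {
    "log_size_bytes": 10 * 1024**3,  # 10 GB
    "errors_last_hour": 100,
    "latency_seconds": 300,          # 5 minutes
}


def evaluate_health(log_size_bytes, errors_last_hour, latency_seconds,
                    thresholds=THRESHOLDS):
    """Return an alert message for every metric that exceeds its threshold."""
    alerts = []
    if log_size_bytes > thresholds["log_size_bytes"]:
        alerts.append("Logging tables > 10 GB")
    if errors_last_hour > thresholds["errors_last_hour"]:
        alerts.append(f"High error rate: {errors_last_hour} errors/hour")
    if latency_seconds > thresholds["latency_seconds"]:
        alerts.append(f"High latency: {latency_seconds} seconds")
    return alerts


print(evaluate_health(log_size_bytes=12 * 1024**3, errors_last_hour=4, latency_seconds=45))
```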

Best Practices

✅ Schedule daily health checks
✅ Set up real-time alerts for critical errors
✅ Maintain error rate < 0.1%
✅ Keep logging tables < 10 GB
✅ Test recovery procedures monthly
✅ Document all error resolutions
✅ Maintain error resolution runbook

8. Error Documentation Template

## Error Report Template

**Error ID:** ERR-2026-01-21-001
**Date/Time:** 2026-01-21 14:30:00
**Severity:** High
**MT_ID:** ECC_HANA_01
**Table:** VBAP

**Error Message:**
Foreign key constraint violation - parent record not found

**Impact:**
- 156 sales items failed to replicate
- Order 1234567 incomplete in target

**Root Cause:**
Sales header (VBAK) replication delayed due to network issue,
items (VBAP) attempted to replicate before header

**Resolution:**
1. Identified network latency spike at 14:25
2. Waited for VBAK replication to catch up
3. Retried VBAP records
4. All 156 records successfully replicated

**Time to Resolve:** 15 minutes
**Preventive Actions:**
- Increase retry delay for FK errors to 60s
- Add dependency checking before item replication
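
If reports are produced by a script, the header fields of the template can be filled from a small record type. This is a sketch only: the `ErrorReport` class and its fields are hypothetical, and the `ERR-YYYY-MM-DD-NNN` numbering is inferred from the sample Error ID in the template.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ErrorReport:
    sequence: int          # running number of the report on that day
    occurred_at: datetime
    severity: str
    mt_id: str
    table: str
    message: str

    @property
    def error_id(self) -> str:
        """Build the ID in the ERR-YYYY-MM-DD-NNN pattern of the template."""
        return f"ERR-{self.occurred_at:%Y-%m-%d}-{self.sequence:03d}"

    def header(self) -> str:
        """Render the header section of the report template."""
        return "\n".join([
            f"**Error ID:** {self.error_id}",
            f"**Date/Time:** {self.occurred_at:%Y-%m-%d %H:%M:%S}",
            f"**Severity:** {self.severity}",
            f"**MT_ID:** {self.mt_id}",
            f"**Table:** {self.table}",
            "",
            "**Error Message:**",
            self.message,
        ])


report = ErrorReport(1, datetime(2026, 1, 21, 14, 30), "High", "ECC_HANA_01", "VBAP",
                     "Foreign key constraint violation - parent record not found")
print(report.header())
```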

Summary

✅ Error types and classification
✅ Common errors and resolutions
✅ Error queue management
✅ Automatic retry configuration
✅ Manual resolution procedures
✅ Recovery scenarios (failure, overflow, corruption)
✅ Preventive monitoring
✅ Error documentation

Next: Module 9 - Performance Tuning
