Python for SAP Data Science: HANA Connectivity with HDBCLI and PyHDB

Python has become the de facto language for data science, while SAP HANA holds massive amounts of enterprise data. Connecting the two unlocks powerful possibilities for predictive analytics, machine learning, and advanced reporting on SAP data.

This guide covers everything you need to know about connecting Python to SAP HANA using hdbcli (official) and pyhdb (community) libraries, with practical data science examples.

Why Python + SAP HANA?

The Power Combination

| Python Strength | SAP HANA Strength | Combined Benefit |
|---|---|---|
| scikit-learn ML algorithms | In-memory computation | Fast feature engineering |
| pandas DataFrames | Massive datasets (TB scale) | Efficient data manipulation |
| matplotlib/seaborn visualization | Real-time business data | Interactive dashboards |
| Deep learning (TensorFlow) | Integrated PAL/APL | Production ML models |
| Jupyter notebooks | SAP Calculation Views | Exploratory analysis |

Library Comparison: hdbcli vs pyhdb

| Feature | hdbcli (Official) | pyhdb (Community) |
|---|---|---|
| Maintenance | SAP-supported | Community (less active) |
| Installation | pip install hdbcli | pip install pyhdb |
| Performance | Optimized (C extensions) | Pure Python (slower) |
| Features | Full HANA feature support | Basic SQL only |
| Python versions | 3.7 - 3.12 | 2.7 - 3.9 |
| Connection pooling | Yes | No |
| Stored procedures | Full support | Limited |
| Recommendation | ✅ Use for production | ⚠️ Legacy systems only |
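
If you do have to work with pyhdb on a legacy system, the connection API is similar to hdbcli's. A minimal sketch (assumes a reachable HANA host; pyhdb supports basic SQL only):

import pyhdb

# Community library - pure Python, no encrypt/SSL parameters like hdbcli
connection = pyhdb.connect(
    host='hana-server.company.com',
    port=30015,
    user='DATAUSER',
    password='Password123'
)

cursor = connection.cursor()
cursor.execute("SELECT CURRENT_DATE FROM DUMMY")  # DUMMY is HANA's one-row system table
print(cursor.fetchone())

cursor.close()
connection.close()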

Setting Up hdbcli (Recommended)

Installation

# Install hdbcli
pip install hdbcli

# Verify installation
python -c "from hdbcli import dbapi; print('hdbcli import OK')"

# Install data science stack
pip install pandas numpy matplotlib seaborn scikit-learn

# Optional: Jupyter for notebooks
pip install jupyter ipython

Basic Connection

from hdbcli import dbapi

# Connect to HANA
connection = dbapi.connect(
    address='hana-server.company.com',  # HANA host
    port=30015,                         # SQL port (3XX15 for tenant DB)
    user='DATA_SCIENTIST',             # Database user
    password='SecurePassword123',       # Password
    encrypt=True,                       # Use SSL/TLS
    sslValidateCertificate=False        # For self-signed certs (dev only)
)

# Create cursor for executing queries
cursor = connection.cursor()

# Execute simple query
cursor.execute("SELECT COUNT(*) FROM SALES_DATA")
result = cursor.fetchone()
print(f"Total records: {result[0]}")

# Clean up
cursor.close()
connection.close()

Connection with Context Manager

from hdbcli import dbapi
from contextlib import closing

# Best practice: Use context manager for automatic cleanup
def query_hana(sql):
    with closing(dbapi.connect(
        address='hana-server.company.com',
        port=30015,
        user='DATAUSER',
        password='Password123'
    )) as conn:
        with closing(conn.cursor()) as cursor:
            cursor.execute(sql)
            return cursor.fetchall()

# Usage
results = query_hana("SELECT * FROM CUSTOMERS LIMIT 10")
for row in results:
    print(row)
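
A note on query construction: query_hana() above takes a raw SQL string, which is fine for trusted input, but any user-supplied value should go through bind parameters rather than string formatting. hdbcli follows the DB-API 'qmark' paramstyle, as in this sketch:

from hdbcli import dbapi
from contextlib import closing

def query_customer(customer_id):
    with closing(dbapi.connect(
        address='hana-server.company.com',
        port=30015,
        user='DATAUSER',
        password='Password123'
    )) as conn:
        with closing(conn.cursor()) as cursor:
            # '?' placeholders let the driver escape values safely
            cursor.execute(
                "SELECT * FROM CUSTOMERS WHERE CUSTOMER_ID = ?",
                (customer_id,)
            )
            return cursor.fetchall()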

Working with Pandas

Loading HANA Data into DataFrame

import pandas as pd
from hdbcli import dbapi

def hana_to_dataframe(sql, connection_params):
    """
    Execute SQL query and return pandas DataFrame
    """
    conn = dbapi.connect(**connection_params)
    
    try:
        # pandas emits a warning for non-SQLAlchemy connections, but
        # read_sql works with a DB-API connection such as hdbcli's
        df = pd.read_sql(sql, conn)
        return df
    
    finally:
        conn.close()

# Connection parameters
conn_params = {
    'address': 'hana-server.company.com',
    'port': 30015,
    'user': 'DATAUSER',
    'password': 'Password123'
}

# Load sales data
sql = """
    SELECT 
        CUSTOMER_ID,
        PRODUCT_ID,
        ORDER_DATE,
        QUANTITY,
        REVENUE,
        PROFIT_MARGIN
    FROM SALES_FACT
    WHERE ORDER_DATE >= '2025-01-01'
"""

df_sales = hana_to_dataframe(sql, conn_params)

# Now it's a pandas DataFrame!
print(df_sales.head())
print(df_sales.describe())
print(df_sales.dtypes)

Writing DataFrame to HANA

import pandas as pd
from hdbcli import dbapi

def dataframe_to_hana(df, table_name, connection_params, if_exists='append'):
    """
    Write pandas DataFrame to an existing HANA table.

    Note: df.to_sql() only accepts SQLAlchemy engines (or sqlite3
    connections), so with a plain hdbcli connection we batch-insert
    via cursor.executemany() instead.

    Args:
        df: pandas DataFrame
        table_name: Target table name (SCHEMA.TABLE)
        connection_params: HANA connection dictionary
        if_exists: 'append' (insert) or 'replace' (truncate, then insert)
    """
    conn = dbapi.connect(**connection_params)
    cursor = conn.cursor()

    try:
        if if_exists == 'replace':
            cursor.execute(f"TRUNCATE TABLE {table_name}")

        columns = ', '.join(df.columns)
        placeholders = ', '.join(['?'] * len(df.columns))
        insert_sql = f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"

        # executemany sends all rows as one batch - much faster than row-by-row
        cursor.executemany(insert_sql, df.values.tolist())
        conn.commit()
        print(f"Successfully wrote {len(df)} rows to {table_name}")

    except Exception as e:
        conn.rollback()
        print(f"Error writing to HANA: {e}")
        raise

    finally:
        cursor.close()
        conn.close()

# Example: Write predictions back to HANA
df_predictions = pd.DataFrame({
    'CUSTOMER_ID': [1001, 1002, 1003],
    'PREDICTED_REVENUE': [45000, 67000, 32000],
    'PREDICTION_DATE': pd.Timestamp.now(),
    'MODEL_VERSION': '1.2.0'
})

dataframe_to_hana(
    df_predictions,
    'ML_SCHEMA.CUSTOMER_PREDICTIONS',
    conn_params,
    if_exists='append'
)
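
If you prefer the pandas to_sql() API, the community sqlalchemy-hana dialect (pip install sqlalchemy-hana) wraps hdbcli in an SQLAlchemy engine. A sketch, assuming that dialect is installed:

import pandas as pd
from sqlalchemy import create_engine

# sqlalchemy-hana registers the 'hana://' URL scheme on top of hdbcli
engine = create_engine('hana://DATAUSER:Password123@hana-server.company.com:30015')

df_predictions.to_sql(
    'CUSTOMER_PREDICTIONS',
    engine,
    schema='ML_SCHEMA',   # pass the schema separately, not in the table name
    if_exists='append',
    index=False
)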

Real-World Data Science Use Cases

Use Case 1: Customer Churn Prediction

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from hdbcli import dbapi

# Step 1: Extract features from HANA
sql = """
    SELECT 
        c.CUSTOMER_ID,
        c.CUSTOMER_SINCE_DAYS,
        c.TOTAL_ORDERS,
        c.AVG_ORDER_VALUE,
        c.DAYS_SINCE_LAST_ORDER,
        c.SUPPORT_TICKETS,
        c.PAYMENT_DELAYS,
        c.CHURNED  -- Target variable (0 or 1)
    FROM CUSTOMER_360_VIEW c
    WHERE c.CUSTOMER_SINCE_DAYS >= 90  -- At least 3 months history
"""

conn = dbapi.connect(**conn_params)
df = pd.read_sql(sql, conn)
conn.close()

print(f"Loaded {len(df)} customers")
print(f"Churn rate: {df['CHURNED'].mean():.2%}")

# Step 2: Prepare data
X = df.drop(['CUSTOMER_ID', 'CHURNED'], axis=1)
y = df['CHURNED']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train, y_train)

# Step 4: Evaluate
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")

# Step 5: Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 5 Churn Predictors:")
print(feature_importance.head())

# Step 6: Score all active customers and save to HANA
sql_active = """
    SELECT 
        c.CUSTOMER_ID,
        c.CUSTOMER_SINCE_DAYS,
        c.TOTAL_ORDERS,
        c.AVG_ORDER_VALUE,
        c.DAYS_SINCE_LAST_ORDER,
        c.SUPPORT_TICKETS,
        c.PAYMENT_DELAYS
    FROM CUSTOMER_360_VIEW c
    WHERE c.CHURNED = 0  -- Active customers only
"""

conn = dbapi.connect(**conn_params)
df_active = pd.read_sql(sql_active, conn)
conn.close()

# Predict churn probability
df_active['CHURN_PROBABILITY'] = model.predict_proba(
    df_active.drop('CUSTOMER_ID', axis=1)
)[:, 1]

df_active['RISK_CATEGORY'] = pd.cut(
    df_active['CHURN_PROBABILITY'],
    bins=[0, 0.3, 0.7, 1.0],
    labels=['LOW', 'MEDIUM', 'HIGH']
)

df_active['PREDICTION_DATE'] = pd.Timestamp.now()

# Write back to HANA (pandas to_sql needs SQLAlchemy, so reuse dataframe_to_hana)
dataframe_to_hana(
    df_active[['CUSTOMER_ID', 'CHURN_PROBABILITY', 'RISK_CATEGORY', 'PREDICTION_DATE']],
    'ML_SCHEMA.CHURN_PREDICTIONS',
    conn_params,
    if_exists='replace'
)
print(f"Saved predictions for {len(df_active)} customers to HANA")

Use Case 2: Sales Forecasting

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from hdbcli import dbapi

# Extract historical sales with date features
sql = """
    SELECT 
        ORDER_DATE,
        PRODUCT_ID,
        SUM(QUANTITY) as TOTAL_QUANTITY,
        SUM(REVENUE) as TOTAL_REVENUE
    FROM SALES_FACT
    WHERE ORDER_DATE >= ADD_YEARS(CURRENT_DATE, -2)
    GROUP BY ORDER_DATE, PRODUCT_ID
    ORDER BY ORDER_DATE
"""

conn = dbapi.connect(**conn_params)
df = pd.read_sql(sql, conn)
conn.close()

# Feature engineering
df['ORDER_DATE'] = pd.to_datetime(df['ORDER_DATE'])
df['YEAR'] = df['ORDER_DATE'].dt.year
df['MONTH'] = df['ORDER_DATE'].dt.month
df['DAY_OF_WEEK'] = df['ORDER_DATE'].dt.dayofweek
df['QUARTER'] = df['ORDER_DATE'].dt.quarter
df['WEEK_OF_YEAR'] = df['ORDER_DATE'].dt.isocalendar().week

# Add lag features (assumes one row per product per day)
df = df.sort_values('ORDER_DATE')
df['REVENUE_LAG_7'] = df.groupby('PRODUCT_ID')['TOTAL_REVENUE'].shift(7)
df['REVENUE_LAG_14'] = df.groupby('PRODUCT_ID')['TOTAL_REVENUE'].shift(14)

# Rolling averages, shifted by one row so the window never includes the
# value being predicted (otherwise the target leaks into the features)
df['REVENUE_MA_7'] = df.groupby('PRODUCT_ID')['TOTAL_REVENUE'].transform(lambda s: s.shift(1).rolling(7).mean())
df['REVENUE_MA_30'] = df.groupby('PRODUCT_ID')['TOTAL_REVENUE'].transform(lambda s: s.shift(1).rolling(30).mean())

# Drop NaN from lag features
df = df.dropna()

# Train/test split (temporal - last 30 days as test)
split_date = df['ORDER_DATE'].max() - pd.Timedelta(days=30)
train = df[df['ORDER_DATE'] <= split_date]
test = df[df['ORDER_DATE'] > split_date]

feature_cols = ['YEAR', 'MONTH', 'DAY_OF_WEEK', 'QUARTER', 'WEEK_OF_YEAR',
                'REVENUE_LAG_7', 'REVENUE_LAG_14', 'REVENUE_MA_7', 'REVENUE_MA_30']

X_train = train[feature_cols]
y_train = train['TOTAL_REVENUE']
X_test = test[feature_cols]
y_test = test['TOTAL_REVENUE']

# Train model
model = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae:,.2f}")
print(f"R² Score: {r2:.3f}")

# Visualize
plt.figure(figsize=(12, 6))
plt.plot(test['ORDER_DATE'], y_test, label='Actual', marker='o')
plt.plot(test['ORDER_DATE'], y_pred, label='Predicted', marker='x')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.title('Sales Forecast: Actual vs Predicted')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('sales_forecast.png')
plt.show()

Use Case 3: Anomaly Detection in Financial Transactions

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from hdbcli import dbapi

# Extract transaction data
sql = """
    SELECT 
        TRANSACTION_ID,
        CUSTOMER_ID,
        TRANSACTION_DATE,
        AMOUNT,
        TRANSACTION_TYPE,
        MERCHANT_CATEGORY,
        DISTANCE_FROM_HOME_KM
    FROM TRANSACTIONS
    WHERE TRANSACTION_DATE >= ADD_DAYS(CURRENT_DATE, -90)
"""

conn = dbapi.connect(**conn_params)
df = pd.read_sql(sql, conn)
conn.close()

# Feature engineering
df['HOUR_OF_DAY'] = pd.to_datetime(df['TRANSACTION_DATE']).dt.hour
df['IS_WEEKEND'] = pd.to_datetime(df['TRANSACTION_DATE']).dt.dayofweek.isin([5, 6]).astype(int)

# Calculate customer-specific features
customer_stats = df.groupby('CUSTOMER_ID').agg({
    'AMOUNT': ['mean', 'std', 'max'],
    'TRANSACTION_ID': 'count'
}).reset_index()
customer_stats.columns = ['CUSTOMER_ID', 'AVG_AMOUNT', 'STD_AMOUNT', 'MAX_AMOUNT', 'TRANSACTION_COUNT']

df = df.merge(customer_stats, on='CUSTOMER_ID')

# Calculate deviation from customer's normal behavior
df['AMOUNT_Z_SCORE'] = (df['AMOUNT'] - df['AVG_AMOUNT']) / (df['STD_AMOUNT'] + 1e-6)

# One-hot encode categorical features
df = pd.get_dummies(df, columns=['TRANSACTION_TYPE', 'MERCHANT_CATEGORY'])

# Select features for anomaly detection
feature_cols = [col for col in df.columns if col not in [
    'TRANSACTION_ID', 'CUSTOMER_ID', 'TRANSACTION_DATE', 
    'AVG_AMOUNT', 'STD_AMOUNT', 'MAX_AMOUNT'
]]

X = df[feature_cols]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train Isolation Forest
iso_forest = IsolationForest(
    contamination=0.01,  # Expect 1% anomalies
    random_state=42,
    n_estimators=100
)
df['ANOMALY'] = iso_forest.fit_predict(X_scaled)
df['ANOMALY_SCORE'] = iso_forest.score_samples(X_scaled)

# Mark anomalies (Isolation Forest returns -1 for anomalies, 1 for normal)
df['IS_ANOMALY'] = (df['ANOMALY'] == -1).astype(int)

# Get top anomalies (lowest score = most anomalous, so sort ascending)
anomalies = df[df['IS_ANOMALY'] == 1].sort_values('ANOMALY_SCORE')

print(f"Total transactions: {len(df)}")
print(f"Detected anomalies: {len(anomalies)} ({len(anomalies)/len(df)*100:.2f}%)")
print("\nTop 10 Most Anomalous Transactions:")
print(anomalies[['TRANSACTION_ID', 'CUSTOMER_ID', 'AMOUNT', 'DISTANCE_FROM_HOME_KM', 'ANOMALY_SCORE']].head(10))

# Write anomalies back to HANA for investigation (reuse dataframe_to_hana)
dataframe_to_hana(
    anomalies[['TRANSACTION_ID', 'ANOMALY_SCORE', 'TRANSACTION_DATE']],
    'FRAUD_SCHEMA.ANOMALOUS_TRANSACTIONS',
    conn_params,
    if_exists='append'
)

print("\nAnomalies written to HANA for review")

Advanced Patterns

Connection Pooling for Production

from hdbcli import dbapi
from contextlib import contextmanager
import threading

class HANAConnectionPool:
    """
    Simple connection pool for HANA
    """
    def __init__(self, connection_params, pool_size=5):
        self.connection_params = connection_params
        self.pool_size = pool_size
        self.pool = []
        self.lock = threading.Lock()
        
        # Initialize pool
        for _ in range(pool_size):
            conn = dbapi.connect(**connection_params)
            self.pool.append(conn)
    
    @contextmanager
    def get_connection(self):
        """
        Get connection from pool (context manager)
        """
        with self.lock:
            if not self.pool:
                # Fail fast when exhausted (see the blocking variant below)
                raise RuntimeError("Connection pool exhausted")
            conn = self.pool.pop()
        
        try:
            yield conn
        finally:
            with self.lock:
                self.pool.append(conn)
    
    def close_all(self):
        """
        Close all connections in pool
        """
        with self.lock:
            for conn in self.pool:
                conn.close()
            self.pool.clear()

# Usage
pool = HANAConnectionPool(conn_params, pool_size=10)

def query_data(sql):
    with pool.get_connection() as conn:
        df = pd.read_sql(sql, conn)
        return df

# Use in multiple threads
import concurrent.futures

queries = [
    "SELECT * FROM TABLE1",
    "SELECT * FROM TABLE2",
    "SELECT * FROM TABLE3"
]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(query_data, queries))

# Clean up
pool.close_all()
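
One limitation of the pool above is that it fails fast when every connection is in use. In production you usually want callers to wait briefly instead, and queue.Queue gives that behavior almost for free. A minimal sketch of that variant:

import queue
from contextlib import contextmanager
from hdbcli import dbapi

class BlockingHANAPool:
    """Variant that blocks (with a timeout) instead of raising when exhausted."""

    def __init__(self, connection_params, pool_size=5):
        self._pool = queue.Queue(maxsize=pool_size)
        for _ in range(pool_size):
            self._pool.put(dbapi.connect(**connection_params))

    @contextmanager
    def get_connection(self, timeout=30):
        # Waits up to `timeout` seconds, then raises queue.Empty
        conn = self._pool.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._pool.put(conn)

    def close_all(self):
        while not self._pool.empty():
            self._pool.get_nowait().close()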

Calling HANA Stored Procedures

from hdbcli import dbapi

def call_hana_procedure(procedure_name, params, connection_params):
    """
    Call HANA stored procedure
    
    Args:
        procedure_name: Fully qualified procedure name (SCHEMA.PROCEDURE)
        params: Dictionary of input parameters
        connection_params: HANA connection parameters
    
    Returns:
        List of result sets
    """
    conn = dbapi.connect(**connection_params)
    cursor = conn.cursor()
    
    try:
        # Build CALL statement
        param_placeholders = ', '.join(['?' for _ in params.values()])
        call_sql = f"CALL {procedure_name}({param_placeholders})"
        
        # Execute procedure
        cursor.execute(call_sql, list(params.values()))
        
        # Fetch results (procedures can return multiple result sets)
        results = []
        while True:
            try:
                result = cursor.fetchall()
                if result:
                    results.append(result)
                if not cursor.nextset():
                    break
            except dbapi.Error:
                # No more result sets to fetch
                break
        
        return results
    
    finally:
        cursor.close()
        conn.close()

# Example: Call predictive analysis procedure
results = call_hana_procedure(
    'ML_SCHEMA.PREDICT_CUSTOMER_VALUE',
    params={'CUSTOMER_ID': 1001, 'PREDICTION_MONTHS': 12},
    connection_params=conn_params
)

print(f"Procedure returned {len(results)} result sets")
for i, result_set in enumerate(results):
    print(f"Result set {i+1}: {len(result_set)} rows")

Batch Processing with Progress Tracking

import pandas as pd
from hdbcli import dbapi
from tqdm import tqdm  # Progress bar library

def process_customers_in_batches(batch_size=1000):
    """
    Process large customer dataset in batches with progress tracking
    """
    # Get total count
    conn = dbapi.connect(**conn_params)
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM CUSTOMERS")
    total_customers = cursor.fetchone()[0]
    cursor.close()
    conn.close()
    
    print(f"Processing {total_customers} customers in batches of {batch_size}")
    
    # Process in batches over a single connection
    offset = 0
    results = []
    conn = dbapi.connect(**conn_params)
    
    with tqdm(total=total_customers) as pbar:
        while offset < total_customers:
            # Query the next batch (OFFSET pagination is simple but slows
            # down on very deep offsets; keyset pagination scales better)
            sql = f"""
                SELECT * FROM CUSTOMERS
                ORDER BY CUSTOMER_ID
                LIMIT {batch_size} OFFSET {offset}
            """
            df_batch = pd.read_sql(sql, conn)
            
            # Process batch (e.g., ML prediction)
            df_batch['PREDICTED_VALUE'] = predict_customer_value(df_batch)
            results.append(df_batch)
            
            # Update progress
            offset += batch_size
            pbar.update(len(df_batch))
    
    conn.close()
    
    # Combine all batches and write back to HANA
    df_final = pd.concat(results, ignore_index=True)
    dataframe_to_hana(df_final, 'ML_SCHEMA.CUSTOMER_PREDICTIONS', conn_params, if_exists='replace')
    
    print(f"Processed and saved {len(df_final)} customer predictions")

def predict_customer_value(df):
    """Dummy prediction function"""
    return df['TOTAL_ORDERS'] * df['AVG_ORDER_VALUE'] * 1.2

# Run batch processing
process_customers_in_batches(batch_size=5000)

Performance Best Practices

| Practice | Description | Impact |
|---|---|---|
| Use SQL filtering | Filter in HANA, not in Python | High - reduces data transfer |
| Select only needed columns | Avoid SELECT * | Medium - reduces memory |
| Use connection pooling | Reuse connections | High - reduces overhead |
| Batch writes | Use cursor.executemany() for bulk inserts | High - faster inserts |
| Leverage HANA views | Query Calculation Views | High - pushes computation to HANA |
| Use pandas chunking | Process large datasets in chunks (see sketch below) | Critical - prevents out-of-memory errors |
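
For the chunking row above: passing chunksize to pd.read_sql() returns an iterator of DataFrames, so arbitrarily large result sets can be aggregated without ever being fully materialized in memory. A sketch against the SALES_FACT table used earlier:

import pandas as pd
from hdbcli import dbapi

conn = dbapi.connect(**conn_params)

# Each iteration yields a DataFrame of up to 50,000 rows
total_revenue = 0.0
for chunk in pd.read_sql("SELECT REVENUE FROM SALES_FACT", conn, chunksize=50000):
    total_revenue += chunk['REVENUE'].sum()  # aggregate incrementally

conn.close()
print(f"Total revenue: {total_revenue:,.2f}")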

Security Considerations

Secure Credential Management

import os
from dotenv import load_dotenv

# Store credentials in .env file (never commit to Git!)
# .env file:
# HANA_HOST=hana-server.company.com
# HANA_PORT=30015
# HANA_USER=DATAUSER
# HANA_PASSWORD=SecurePassword123

# Load from environment
load_dotenv()

conn_params = {
    'address': os.getenv('HANA_HOST'),
    'port': int(os.getenv('HANA_PORT')),
    'user': os.getenv('HANA_USER'),
    'password': os.getenv('HANA_PASSWORD'),
    'encrypt': True,
    'sslValidateCertificate': True
}

# Alternative: use the keyring library (store the password once with
# keyring.set_password('hana', 'DATAUSER', '<password>'))
import keyring
password = keyring.get_password('hana', 'DATAUSER')

Troubleshooting

| Error | Cause | Solution |
|---|---|---|
| Connection timeout | Firewall/network | Check port 3XX15, VPN |
| Authentication failed | Wrong credentials | Verify user/password in HANA Studio |
| Memory error | Dataset too large | Use chunking or batch processing |
| Slow queries | No indexes/filters | Add WHERE clauses, create indexes |
| SSL certificate error | Self-signed certificate | Set sslValidateCertificate=False (dev only) |

Conclusion

Connecting Python to SAP HANA with hdbcli unlocks powerful data science capabilities on enterprise SAP data. Key takeaways:

  • ✅ Use hdbcli (official, supported, performant)
  • ✅ Leverage pandas for data manipulation
  • ✅ Push computation to HANA (filters, aggregations)
  • ✅ Use connection pooling for production workloads
  • ✅ Implement batch processing for large datasets
  • ✅ Secure credentials with environment variables

Python + SAP HANA = Enterprise data meets world-class data science tools.

About the Author: Yogesh Pandey is a passionate developer and consultant specializing in SAP technologies and full-stack development.