Python for SAP Data Science: HANA Connectivity with HDBCLI and PyHDB
Python has become the de facto language for data science, while SAP HANA holds massive amounts of enterprise data. Connecting the two unlocks powerful possibilities for predictive analytics, machine learning, and advanced reporting on SAP data.
This guide walks through connecting Python to SAP HANA using the hdbcli (official) and pyhdb (community) libraries, with practical data science examples.
Why Python + SAP HANA?
The Power Combination
| Python Strength | SAP HANA Strength | Combined Benefit |
|---|---|---|
| scikit-learn ML algorithms | In-memory computation | Fast feature engineering |
| pandas DataFrames | Massive datasets (TB scale) | Efficient data manipulation |
| matplotlib/seaborn viz | Real-time business data | Interactive dashboards |
| Deep learning (TensorFlow) | Integrated PAL/APL | Production ML models |
| Jupyter notebooks | SAP Calculation Views | Exploratory analysis |
Library Comparison: hdbcli vs pyhdb
| Feature | hdbcli (Official) | pyhdb (Community) |
|---|---|---|
| Maintenance | SAP-supported | Community (less active) |
| Installation | pip install hdbcli | pip install pyhdb |
| Performance | Optimized (C extensions) | Pure Python (slower) |
| Features | Full HANA feature support | Basic SQL only |
| Python Versions | 3.7 - 3.12 | 2.7 - 3.9 |
| Connection Pooling | Yes | No |
| Stored Procedures | Full support | Limited |
| Recommendation | ✅ Use for production | ⚠️ Legacy systems only |
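If you are stuck with a legacy environment where hdbcli is not an option, pyhdb exposes the same DB-API 2.0 interface; essentially only the connect call differs (pyhdb takes host= where hdbcli takes address=). A minimal sketch, using the pyhdb API as documented in its README:
import pyhdb
# pyhdb follows the same cursor/fetch pattern as hdbcli
connection = pyhdb.connect(
    host='hana-server.company.com',
    port=30015,
    user='DATAUSER',
    password='Password123'
)
cursor = connection.cursor()
cursor.execute("SELECT 'Hello from pyhdb' FROM DUMMY")
print(cursor.fetchone())
connection.close()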
Setting Up hdbcli (Recommended)
Installation
# Install hdbcli
pip install hdbcli

# Verify installation
python -c "import hdbcli; print(hdbcli.__version__)"

# Install data science stack
pip install pandas numpy matplotlib seaborn scikit-learn

# Optional: Jupyter for notebooks
pip install jupyter ipython
Basic Connection
from hdbcli import dbapi
# Connect to HANA
connection = dbapi.connect(
address='hana-server.company.com', # HANA host
    port=30015,                       # SQL port (3<instance>15; tenant DBs often use 3<instance>41+)
    user='DATASCIENTIST',             # Database user
password='SecurePassword123', # Password
encrypt=True, # Use SSL/TLS
sslValidateCertificate=False # For self-signed certs (dev only)
)
# Create cursor for executing queries
cursor = connection.cursor()
# Execute simple query
cursor.execute("SELECT COUNT(*) FROM SALES_DATA")
result = cursor.fetchone()
print(f"Total records: {result[0]}")
# Clean up
cursor.close()
connection.close()
Connection with Context Manager
from hdbcli import dbapi
from contextlib import closing
# Best practice: Use context manager for automatic cleanup
def query_hana(sql):
with closing(dbapi.connect(
address='hana-server.company.com',
port=30015,
user='DATAUSER',
password='Password123'
)) as conn:
with closing(conn.cursor()) as cursor:
cursor.execute(sql)
return cursor.fetchall()
# Usage
results = query_hana("SELECT * FROM CUSTOMERS LIMIT 10")
for row in results:
print(row)
Working with Pandas
Loading HANA Data into DataFrame
import pandas as pd
from hdbcli import dbapi
def hana_to_dataframe(sql, connection_params):
"""
Execute SQL query and return pandas DataFrame
"""
conn = dbapi.connect(**connection_params)
try:
        # pandas can read straight from a DB-API connection; recent pandas
        # versions emit a UserWarning recommending an SQLAlchemy connectable
        df = pd.read_sql(sql, conn)
return df
finally:
conn.close()
# Connection parameters
conn_params = {
'address': 'hana-server.company.com',
'port': 30015,
'user': 'DATAUSER',
'password': 'Password123'
}
# Load sales data
sql = """
SELECT
CUSTOMER_ID,
PRODUCT_ID,
ORDER_DATE,
QUANTITY,
REVENUE,
PROFIT_MARGIN
FROM SALES_FACT
WHERE ORDER_DATE >= '2025-01-01'
"""
df_sales = hana_to_dataframe(sql, conn_params)
# Now it's a pandas DataFrame!
print(df_sales.head())
print(df_sales.describe())
print(df_sales.dtypes)
Writing DataFrame to HANA
import pandas as pd
from hdbcli import dbapi
def dataframe_to_hana(df, table_name, connection_params, if_exists='append'):
    """
    Write pandas DataFrame to a HANA table.

    Note: pandas to_sql() only accepts an SQLAlchemy connectable (e.g. via the
    sqlalchemy-hana dialect) or sqlite3, not a raw hdbcli connection, so this
    helper inserts rows with cursor.executemany() instead.

    Args:
        df: pandas DataFrame
        table_name: Target table name (SCHEMA.TABLE); the table must already exist
        connection_params: HANA connection dictionary
        if_exists: 'append' (insert rows) or 'replace' (delete existing rows first)
    """
    conn = dbapi.connect(**connection_params)
    conn.setautocommit(False)
    cursor = conn.cursor()
    try:
        if if_exists == 'replace':
            cursor.execute(f"DELETE FROM {table_name}")
        columns = ', '.join(df.columns)
        placeholders = ', '.join(['?'] * len(df.columns))
        insert_sql = f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"
        # Convert numpy scalars to native Python types (hdbcli rejects numpy types)
        rows = [tuple(v.item() if hasattr(v, 'item') else v for v in row)
                for row in df.itertuples(index=False, name=None)]
        cursor.executemany(insert_sql, rows)  # batch insert
        conn.commit()
        print(f"Successfully wrote {len(df)} rows to {table_name}")
    except Exception as e:
        conn.rollback()
        print(f"Error writing to HANA: {e}")
        raise
    finally:
        cursor.close()
        conn.close()
# Example: Write predictions back to HANA
df_predictions = pd.DataFrame({
'CUSTOMER_ID': [1001, 1002, 1003],
'PREDICTED_REVENUE': [45000, 67000, 32000],
'PREDICTION_DATE': pd.Timestamp.now(),
'MODEL_VERSION': '1.2.0'
})
dataframe_to_hana(
df_predictions,
'ML_SCHEMA.CUSTOMER_PREDICTIONS',
conn_params,
if_exists='append'
)
Real-World Data Science Use Cases
Use Case 1: Customer Churn Prediction
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from hdbcli import dbapi
# Step 1: Extract features from HANA
sql = """
SELECT
c.CUSTOMER_ID,
c.CUSTOMER_SINCE_DAYS,
c.TOTAL_ORDERS,
c.AVG_ORDER_VALUE,
c.DAYS_SINCE_LAST_ORDER,
c.SUPPORT_TICKETS,
c.PAYMENT_DELAYS,
c.CHURNED -- Target variable (0 or 1)
FROM CUSTOMER_360_VIEW c
WHERE c.CUSTOMER_SINCE_DAYS >= 90 -- At least 3 months history
"""
conn = dbapi.connect(**conn_params)
df = pd.read_sql(sql, conn)
conn.close()
print(f"Loaded {len(df)} customers")
print(f"Churn rate: {df['CHURNED'].mean():.2%}")
# Step 2: Prepare data
X = df.drop(['CUSTOMER_ID', 'CHURNED'], axis=1)
y = df['CHURNED']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 3: Train model
model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42
)
model.fit(X_train, y_train)
# Step 4: Evaluate
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")
# Step 5: Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 5 Churn Predictors:")
print(feature_importance.head())
# Step 6: Score all active customers and save to HANA
sql_active = """
SELECT
c.CUSTOMER_ID,
c.CUSTOMER_SINCE_DAYS,
c.TOTAL_ORDERS,
c.AVG_ORDER_VALUE,
c.DAYS_SINCE_LAST_ORDER,
c.SUPPORT_TICKETS,
c.PAYMENT_DELAYS
FROM CUSTOMER_360_VIEW c
WHERE c.CHURNED = 0 -- Active customers only
"""
conn = dbapi.connect(**conn_params)
df_active = pd.read_sql(sql_active, conn)
conn.close()
# Predict churn probability
df_active['CHURN_PROBABILITY'] = model.predict_proba(
df_active.drop('CUSTOMER_ID', axis=1)
)[:, 1]
df_active['RISK_CATEGORY'] = pd.cut(
df_active['CHURN_PROBABILITY'],
bins=[0, 0.3, 0.7, 1.0],
labels=['LOW', 'MEDIUM', 'HIGH']
)
df_active['PREDICTION_DATE'] = pd.Timestamp.now()
# Write back to HANA via the dataframe_to_hana helper defined earlier
# (pandas to_sql needs an SQLAlchemy engine, not a raw hdbcli connection)
dataframe_to_hana(
    df_active[['CUSTOMER_ID', 'CHURN_PROBABILITY', 'RISK_CATEGORY', 'PREDICTION_DATE']],
    'ML_SCHEMA.CHURN_PREDICTIONS',
    conn_params,
    if_exists='replace'
)
print(f"Saved predictions for {len(df_active)} customers to HANA")Use Case 2: Sales Forecasting
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
# Extract historical sales with date features
sql = """
SELECT
ORDER_DATE,
PRODUCT_ID,
SUM(QUANTITY) as TOTAL_QUANTITY,
SUM(REVENUE) as TOTAL_REVENUE
FROM SALES_FACT
WHERE ORDER_DATE >= ADD_YEARS(CURRENT_DATE, -2)
GROUP BY ORDER_DATE, PRODUCT_ID
ORDER BY ORDER_DATE
"""
conn = dbapi.connect(**conn_params)
df = pd.read_sql(sql, conn)
conn.close()
# Feature engineering
df['ORDER_DATE'] = pd.to_datetime(df['ORDER_DATE'])
df['YEAR'] = df['ORDER_DATE'].dt.year
df['MONTH'] = df['ORDER_DATE'].dt.month
df['DAY_OF_WEEK'] = df['ORDER_DATE'].dt.dayofweek
df['QUARTER'] = df['ORDER_DATE'].dt.quarter
df['WEEK_OF_YEAR'] = df['ORDER_DATE'].dt.isocalendar().week
# Add lag features (previous week's sales)
df = df.sort_values('ORDER_DATE')
df['REVENUE_LAG_7'] = df.groupby('PRODUCT_ID')['TOTAL_REVENUE'].shift(7)
df['REVENUE_LAG_14'] = df.groupby('PRODUCT_ID')['TOTAL_REVENUE'].shift(14)
# Rolling averages
df['REVENUE_MA_7'] = df.groupby('PRODUCT_ID')['TOTAL_REVENUE'].rolling(7).mean().reset_index(0, drop=True)
df['REVENUE_MA_30'] = df.groupby('PRODUCT_ID')['TOTAL_REVENUE'].rolling(30).mean().reset_index(0, drop=True)
# Drop NaN from lag features
df = df.dropna()
# Train/test split (temporal - last 30 days as test)
split_date = df['ORDER_DATE'].max() - pd.Timedelta(days=30)
train = df[df['ORDER_DATE'] <= split_date]
test = df[df['ORDER_DATE'] > split_date]
feature_cols = ['YEAR', 'MONTH', 'DAY_OF_WEEK', 'QUARTER', 'WEEK_OF_YEAR',
'REVENUE_LAG_7', 'REVENUE_LAG_14', 'REVENUE_MA_7', 'REVENUE_MA_30']
X_train = train[feature_cols]
y_train = train['TOTAL_REVENUE']
X_test = test[feature_cols]
y_test = test['TOTAL_REVENUE']
# Train model
model = GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.1,
max_depth=5,
random_state=42
)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: {mae:,.2f}")
print(f"R² Score: {r2:.3f}")
# Visualize
plt.figure(figsize=(12, 6))
plt.plot(test['ORDER_DATE'], y_test, label='Actual', marker='o')
plt.plot(test['ORDER_DATE'], y_pred, label='Predicted', marker='x')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.title('Sales Forecast: Actual vs Predicted')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('sales_forecast.png')
plt.show()
Use Case 3: Anomaly Detection in Financial Transactions
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
# Extract transaction data
sql = """
SELECT
TRANSACTION_ID,
CUSTOMER_ID,
TRANSACTION_DATE,
AMOUNT,
TRANSACTION_TYPE,
MERCHANT_CATEGORY,
DISTANCE_FROM_HOME_KM
FROM TRANSACTIONS
WHERE TRANSACTION_DATE >= ADD_DAYS(CURRENT_DATE, -90)
"""
conn = dbapi.connect(**conn_params)
df = pd.read_sql(sql, conn)
conn.close()
# Feature engineering
df['HOUR_OF_DAY'] = pd.to_datetime(df['TRANSACTION_DATE']).dt.hour
df['IS_WEEKEND'] = pd.to_datetime(df['TRANSACTION_DATE']).dt.dayofweek.isin([5, 6]).astype(int)
# Calculate customer-specific features
customer_stats = df.groupby('CUSTOMER_ID').agg({
'AMOUNT': ['mean', 'std', 'max'],
'TRANSACTION_ID': 'count'
}).reset_index()
customer_stats.columns = ['CUSTOMER_ID', 'AVG_AMOUNT', 'STD_AMOUNT', 'MAX_AMOUNT', 'TRANSACTION_COUNT']
df = df.merge(customer_stats, on='CUSTOMER_ID')
# Calculate deviation from customer's normal behavior
df['AMOUNT_Z_SCORE'] = (df['AMOUNT'] - df['AVG_AMOUNT']) / (df['STD_AMOUNT'] + 1e-6)
# One-hot encode categorical features
df = pd.get_dummies(df, columns=['TRANSACTION_TYPE', 'MERCHANT_CATEGORY'])
# Select features for anomaly detection
feature_cols = [col for col in df.columns if col not in [
'TRANSACTION_ID', 'CUSTOMER_ID', 'TRANSACTION_DATE',
'AVG_AMOUNT', 'STD_AMOUNT', 'MAX_AMOUNT'
]]
X = df[feature_cols]
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train Isolation Forest
iso_forest = IsolationForest(
contamination=0.01, # Expect 1% anomalies
random_state=42,
n_estimators=100
)
df['ANOMALY'] = iso_forest.fit_predict(X_scaled)
df['ANOMALY_SCORE'] = iso_forest.score_samples(X_scaled)
# Mark anomalies (Isolation Forest returns -1 for anomalies, 1 for normal)
df['IS_ANOMALY'] = (df['ANOMALY'] == -1).astype(int)
# Get top anomalies
anomalies = df[df['IS_ANOMALY'] == 1].sort_values('ANOMALY_SCORE')
print(f"Total transactions: {len(df)}")
print(f"Detected anomalies: {len(anomalies)} ({len(anomalies)/len(df)*100:.2f}%)")
print("\nTop 10 Most Anomalous Transactions:")
print(anomalies[['TRANSACTION_ID', 'CUSTOMER_ID', 'AMOUNT', 'DISTANCE_FROM_HOME_KM', 'ANOMALY_SCORE']].head(10))
# Write anomalies back to HANA for investigation (via the dataframe_to_hana helper defined earlier)
dataframe_to_hana(
    anomalies[['TRANSACTION_ID', 'ANOMALY_SCORE', 'TRANSACTION_DATE']],
    'FRAUD_SCHEMA.ANOMALOUS_TRANSACTIONS',
    conn_params,
    if_exists='append'
)
print("\nAnomalies written to HANA for review")Advanced Patterns
Connection Pooling for Production
from hdbcli import dbapi
from contextlib import contextmanager
import threading
class HANAConnectionPool:
"""
Simple connection pool for HANA
"""
def __init__(self, connection_params, pool_size=5):
self.connection_params = connection_params
self.pool_size = pool_size
self.pool = []
self.lock = threading.Lock()
# Initialize pool
for _ in range(pool_size):
conn = dbapi.connect(**connection_params)
self.pool.append(conn)
@contextmanager
def get_connection(self):
"""
Get connection from pool (context manager)
"""
with self.lock:
if not self.pool:
raise Exception("Connection pool exhausted")
conn = self.pool.pop()
try:
yield conn
finally:
with self.lock:
self.pool.append(conn)
def close_all(self):
"""
Close all connections in pool
"""
with self.lock:
for conn in self.pool:
conn.close()
self.pool.clear()
# Usage
pool = HANAConnectionPool(conn_params, pool_size=10)
def query_data(sql):
with pool.get_connection() as conn:
df = pd.read_sql(sql, conn)
return df
# Use in multiple threads
import concurrent.futures
queries = [
"SELECT * FROM TABLE1",
"SELECT * FROM TABLE2",
"SELECT * FROM TABLE3"
]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
results = list(executor.map(query_data, queries))
# Clean up
pool.close_all()
Calling HANA Stored Procedures
from hdbcli import dbapi
def call_hana_procedure(procedure_name, params, connection_params):
"""
Call HANA stored procedure
Args:
procedure_name: Fully qualified procedure name (SCHEMA.PROCEDURE)
params: Dictionary of input parameters
connection_params: HANA connection parameters
Returns:
List of result sets
"""
conn = dbapi.connect(**connection_params)
cursor = conn.cursor()
try:
# Build CALL statement
param_placeholders = ', '.join(['?' for _ in params.values()])
call_sql = f"CALL {procedure_name}({param_placeholders})"
# Execute procedure
cursor.execute(call_sql, list(params.values()))
# Fetch results (procedures can return multiple result sets)
results = []
while True:
            try:
                result = cursor.fetchall()
                if result:
                    results.append(result)
                if not cursor.nextset():
                    break
            except dbapi.ProgrammingError:
                # Raised when the procedure returns no (further) result set
                break
return results
finally:
cursor.close()
conn.close()
# Example: Call predictive analysis procedure
results = call_hana_procedure(
'ML_SCHEMA.PREDICT_CUSTOMER_VALUE',
params={'CUSTOMER_ID': 1001, 'PREDICTION_MONTHS': 12},
connection_params=conn_params
)
print(f"Procedure returned {len(results)} result sets")
for i, result_set in enumerate(results):
print(f"Result set {i+1}: {len(result_set)} rows")Batch Processing with Progress Tracking
import pandas as pd
from tqdm import tqdm # Progress bar library
def process_customers_in_batches(batch_size=1000):
"""
Process large customer dataset in batches with progress tracking
"""
# Get total count
conn = dbapi.connect(**conn_params)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM CUSTOMERS")
total_customers = cursor.fetchone()[0]
cursor.close()
conn.close()
print(f"Processing {total_customers} customers in batches of {batch_size}")
# Process in batches
offset = 0
results = []
with tqdm(total=total_customers) as pbar:
while offset < total_customers:
# Query batch
sql = f"""
SELECT * FROM CUSTOMERS
ORDER BY CUSTOMER_ID
LIMIT {batch_size} OFFSET {offset}
"""
conn = dbapi.connect(**conn_params)
df_batch = pd.read_sql(sql, conn)
conn.close()
# Process batch (e.g., ML prediction)
df_batch['PREDICTED_VALUE'] = predict_customer_value(df_batch)
results.append(df_batch)
# Update progress
offset += batch_size
pbar.update(len(df_batch))
# Combine all batches
df_final = pd.concat(results, ignore_index=True)
    # Write back to HANA (via the dataframe_to_hana helper defined earlier)
    dataframe_to_hana(df_final, 'ML_SCHEMA.CUSTOMER_PREDICTIONS', conn_params, if_exists='replace')
print(f"Processed and saved {len(df_final)} customer predictions")
def predict_customer_value(df):
"""Dummy prediction function"""
return df['TOTAL_ORDERS'] * df['AVG_ORDER_VALUE'] * 1.2
# Run batch processing
process_customers_in_batches(batch_size=5000)
Performance Best Practices
| Practice | Description | Impact |
|---|---|---|
| Use SQL filtering | Filter in HANA, not in Python | High - reduces data transfer |
| Select only needed columns | Avoid SELECT * | Medium - reduces memory |
| Use connection pooling | Reuse connections | High - reduces overhead |
| Batch writes | Use cursor.executemany() (or to_sql with an SQLAlchemy engine) | High - faster inserts |
| Leverage HANA views | Query Calculation Views | High - push computation to HANA |
| Use pandas chunking | Process large datasets in chunks | Critical - prevents OOM |
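Several of these practices combine naturally. The sketch below pushes filtering and aggregation into HANA and streams the result in chunks via pandas' chunksize option; it reuses the conn_params dictionary and SALES_FACT table from earlier examples, so treat it as a pattern rather than a drop-in script.
import pandas as pd
from hdbcli import dbapi

# Aggregate and filter inside HANA so only a small result set crosses the network
sql = """
SELECT PRODUCT_ID, SUM(REVENUE) AS TOTAL_REVENUE
FROM SALES_FACT
WHERE ORDER_DATE >= ADD_DAYS(CURRENT_DATE, -365)
GROUP BY PRODUCT_ID
"""

conn = dbapi.connect(**conn_params)
try:
    chunks = []
    # chunksize turns read_sql into an iterator of DataFrames, keeping memory bounded
    for chunk in pd.read_sql(sql, conn, chunksize=50000):
        chunks.append(chunk)  # or process and discard each chunk
    df = pd.concat(chunks, ignore_index=True)
finally:
    conn.close()

# The same pattern applies to Calculation Views; repository-based views are exposed
# under the _SYS_BIC schema (hypothetical view name shown):
# pd.read_sql('SELECT * FROM "_SYS_BIC"."sales.models/CV_SALES_OVERVIEW"', conn)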
Security Considerations
Secure Credential Management
import os
from dotenv import load_dotenv
# Store credentials in .env file (never commit to Git!)
# .env file:
# HANA_HOST=hana-server.company.com
# HANA_PORT=30015
# HANA_USER=DATAUSER
# HANA_PASSWORD=SecurePassword123
# Load from environment
load_dotenv()
conn_params = {
'address': os.getenv('HANA_HOST'),
'port': int(os.getenv('HANA_PORT')),
'user': os.getenv('HANA_USER'),
'password': os.getenv('HANA_PASSWORD'),
'encrypt': True,
'sslValidateCertificate': True
}
# Alternative: Use keyring library
import keyring
password = keyring.get_password('hana', 'DATAUSER')
Troubleshooting
| Error | Cause | Solution |
|---|---|---|
| Connection timeout | Firewall/network | Check SQL port (3<instance>15, or 3<instance>41+ for tenant DBs), VPN |
| Authentication failed | Wrong credentials | Verify user/password in HANA Studio |
| Memory error | Large dataset | Use chunking or batch processing |
| Slow queries | No indexes/filters | Add WHERE clauses, create indexes |
| SSL certificate error | Self-signed cert | Set sslValidateCertificate=False (dev only) |
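When a connection fails, it helps to isolate the driver from the rest of the pipeline with a quick smoke test. A minimal sketch, assuming the conn_params dictionary defined earlier:
from hdbcli import dbapi

def check_hana_connection(params):
    """Open a connection and run a trivial query, reporting any driver error."""
    try:
        conn = dbapi.connect(**params)
    except dbapi.Error as e:
        print(f"Connection failed: {e}")  # e.g. timeout, authentication, SSL errors
        return False
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT CURRENT_USER, CURRENT_SCHEMA FROM DUMMY")
        user, schema = cursor.fetchone()
        print(f"Connected as {user} (default schema {schema})")
        cursor.close()
        return True
    finally:
        conn.close()

check_hana_connection(conn_params)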
Conclusion
Connecting Python to SAP HANA with hdbcli unlocks powerful data science capabilities on enterprise SAP data. Key takeaways:
- ✅ Use hdbcli (official, supported, performant)
- ✅ Leverage pandas for data manipulation
- ✅ Push computation to HANA (filters, aggregations)
- ✅ Use connection pooling for production workloads
- ✅ Implement batch processing for large datasets
- ✅ Secure credentials with environment variables
Python + SAP HANA = Enterprise data meets world-class data science tools.
