---
title: "Appendix B — Datasets Used in This Book"
---
# Appendix B — Datasets Used in This Book
This appendix documents every dataset used in "AI-Powered Business Analytics." Each entry lists the chapters where the dataset appears, its source, and its contents. R and Python scripts that generate simplified versions of the first three datasets (core variables only) are reproduced at the end of this appendix.
## Dataset Index and Sourcing Guide
### Part I: Foundations (Chapters 1-11)
#### 1. Nigerian Bank Loan Portfolio
| Property | Details |
|----------|---------|
| **Chapters** | 5 (Statistical Foundations), 6 (ML Basics), 15 (Causal Inference) |
| **Dataset Name** | `nigerian_bank_loans.csv` |
| **Rows** | 8,000 loan accounts |
| **Columns** | 28 variables |
| **Source** | Synthetic — generated from CBN credit statistics and typical bank risk models |
| **Description** | Loan portfolio from a hypothetical Nigerian commercial bank. Includes borrower demographics (age, income, education level, employment tenure), loan characteristics (amount, term, purpose, security type), repayment history (months on books, number of defaults, current status), and credit bureau scores. |
| **Key Variables** | `loan_id`, `borrower_age`, `annual_income_naira`, `loan_amount_naira`, `loan_term_months`, `interest_rate`, `employment_sector`, `credit_score`, `default_flag`, `months_delinquent`, `collateral_type`, `collateral_value_naira` |
| **Nigerian/African Relevance** | Reflects credit structures common to Nigerian retail banks; employment sectors include agriculture, oil & gas, manufacturing, services; income ranges typical of Nigerian formal employment. |
| **License** | CC0 1.0 (Public Domain) |
| **Generation Script** | See R and Python code at end of this appendix |
#### 2. Nigerian Household Expenditure Survey (NBS)
| Property | Details |
|----------|---------|
| **Chapters** | 3 (Business Metrics), 5 (Statistical Tests) |
| **Dataset Name** | `nbs_household_survey.csv` |
| **Rows** | 3,000 households |
| **Columns** | 35 variables |
| **Source** | Inspired by Nigeria's National Bureau of Statistics Household Expenditure Survey (anonymised subset) |
| **Description** | Survey data from 3,000 Nigerian households across 6 geopolitical zones. Captures household income sources (formal employment, informal business, agriculture, remittances), expenditure categories (food, housing, utilities, transport, health, education, entertainment), household demographics (composition, size, head education), and housing characteristics (ownership, type, utilities access). |
| **Key Variables** | `household_id`, `zone`, `urban_rural`, `household_size`, `head_age`, `head_education`, `monthly_income_naira`, `food_spending`, `housing_spending`, `utilities_spending`, `education_spending`, `health_spending` |
| **Nigerian/African Relevance** | Based on true patterns from NBS surveys; includes rural-urban disparities, regional variations, informal economy activity. |
| **License** | CC BY 4.0 (with source attribution) |
| **Access** | Full dataset available at [nigerianstat.gov.ng](https://nigerianstat.gov.ng) — anonymised version used for teaching |
#### 3. Mobile Money Transactions (Airtime & Data)
| Property | Details |
|----------|---------|
| **Chapters** | 4 (Data Collection), 9 (Time Series), 22 (Network Analytics) |
| **Dataset Name** | `mobile_money_transactions.csv` |
| **Rows** | 50,000 transactions |
| **Columns** | 18 variables |
| **Source** | Synthetic — modelled on transaction patterns from Nigerian telcos (MTN, Airtel, Glo, 9mobile) |
| **Description** | 12 months of daily mobile money transactions from 5,000 unique customers across Nigeria. Each record represents one transaction (airtime purchase, data bundle, mobile banking transfer, or utility payment). Includes timestamp, customer ID, transaction type, amount, channel (USSD, app, agent, retailer), operator, and region. |
| **Key Variables** | `transaction_id`, `customer_id`, `transaction_date`, `transaction_type`, `amount_naira`, `channel`, `operator`, `region`, `state`, `customer_tenure_months`, `transaction_hour`, `day_of_week` |
| **Nigerian/African Relevance** | USSD and agent-based channels reflect Nigeria's financial inclusion reality; operators and states are authentic; seasonal patterns match Nigerian holiday calendar. |
| **License** | CC0 1.0 |
| **Generation Script** | See R and Python code at end of this appendix |
#### 4. Nigerian Retail Store Data (Omnichannel)
| Property | Details |
|----------|---------|
| **Chapters** | 11 (Network & Location Analytics), 22 (Social Networks) |
| **Dataset Name** | `retail_store_network.csv`, `store_transactions.csv` |
| **Rows** | Store network: 50 stores; Transactions: 2,500 |
| **Columns** | Network: 8; Transactions: 12 |
| **Source** | Synthetic — based on typical multi-branch retail chains in Nigeria |
| **Description** | Network graph of 50 retail store locations across Nigeria with geographic coordinates, transaction volume, sales, and operational metrics. A linked transaction dataset records customer purchases across stores. |
| **Key Variables** | Network: `store_id`, `store_name`, `latitude`, `longitude`, `state`, `region`, `monthly_sales_naira`, `monthly_customers`, `store_type`. Transactions: `transaction_id`, `store_id`, `product_category`, `quantity`, `revenue_naira`, `transaction_date` |
| **Nigerian/African Relevance** | Store locations in major Nigerian cities (Lagos, Abuja, Kano, Port Harcourt, Ibadan); product categories reflect FMCG retail. |
| **License** | CC0 1.0 |
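
Because both files carry `store_id`, transaction records can be joined to store attributes. The following is a minimal pandas sketch of that join. The rows are made up for illustration (including the `store_type` labels) and are not taken from the dataset; in practice the two frames would be read from `retail_store_network.csv` and `store_transactions.csv`.

```python
import pandas as pd

# Illustrative rows only; values and store_type labels are hypothetical.
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "state": ["Lagos", "Kano", "Oyo"],
    "store_type": ["Flagship", "Standard", "Standard"],
})
transactions = pd.DataFrame({
    "transaction_id": [101, 102, 103, 104],
    "store_id": [1, 1, 2, 3],
    "product_category": ["Beverages", "Snacks", "Beverages", "Household"],
    "quantity": [2, 1, 5, 3],
    "revenue_naira": [3000.0, 1200.0, 7500.0, 4500.0],
})

# validate="many_to_one" raises an error if store_id is duplicated in the store table.
joined = transactions.merge(stores, on="store_id", how="left", validate="many_to_one")

# Total revenue per state, highest first.
revenue_by_state = (
    joined.groupby("state")["revenue_naira"].sum().sort_values(ascending=False)
)
print(revenue_by_state)
```

A left join keeps every transaction even when its store is missing from the network file, which makes orphaned `store_id` values easy to spot as rows with empty store attributes.
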
### Part II: Advanced Methods (Chapters 12-25)
#### 5. E-commerce Click-stream (Jumia-like Marketplace)
| Property | Details |
|----------|---------|
| **Chapters** | 17 (Data Mining), 20 (Image Analytics), 38 (Web Analytics) |
| **Dataset Name** | `ecommerce_clickstream.csv` |
| **Rows** | 100,000 sessions |
| **Columns** | 22 variables |
| **Source** | Synthetic — inspired by African e-commerce platforms (Jumia, Konga, Jiji) |
| **Description** | Session-level click-stream data from an online marketplace. Each row represents one user session with browsing behaviour (pages visited, time spent, search terms, filters applied), products viewed, cart additions, checkout status, purchase outcome, and device/browser info. |
| **Key Variables** | `session_id`, `user_id`, `session_date`, `pages_visited`, `time_on_site_minutes`, `search_term`, `product_category`, `product_price_naira`, `added_to_cart`, `checkout_initiated`, `purchase_completed`, `device_type`, `browser`, `traffic_source` |
| **Nigerian/African Relevance** | Products sold via Nigerian e-commerce; payment methods include card, bank transfer, cash-on-delivery; traffic sources include local social media. |
| **License** | CC0 1.0 |
#### 6. Customer Survey (FMCG Sector)
| Property | Details |
|----------|---------|
| **Chapters** | 19 (Text Analytics), 35 (Customer Satisfaction), 36 (Brand Analytics) |
| **Dataset Name** | `customer_survey_responses.csv` |
| **Rows** | 5,000 respondents |
| **Columns** | 28 variables |
| **Source** | Synthetic — based on FMCG brand tracking studies in Nigeria |
| **Description** | Customer satisfaction survey with Likert-scale questions on brand perception, product quality, price, availability, customer service, likelihood to recommend (NPS), and open-ended feedback on product improvements and purchase drivers. |
| **Key Variables** | `respondent_id`, `brand`, `product_category`, `age_group`, `income_level`, `region`, `q1_quality_rating`, `q2_value_rating`, `q3_availability_rating`, `q4_service_rating`, `nps_score`, `open_feedback_text`, `purchase_frequency`, `spend_monthly_naira` |
| **Nigerian/African Relevance** | Brands and product categories typical of Nigerian FMCG; responses in English reflecting Nigerian market norms. |
| **License** | CC0 1.0 |
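
The Net Promoter Score is conventionally the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6). The sketch below applies that convention per brand. It assumes `nps_score` holds each respondent's raw 0 to 10 likelihood-to-recommend rating, which the data dictionary does not state explicitly; the helper name `net_promoter_score` and the sample rows are illustrative.

```python
import pandas as pd

def net_promoter_score(ratings: pd.Series) -> float:
    """Return NPS (-100 to 100) from 0-10 likelihood-to-recommend ratings."""
    ratings = ratings.dropna()
    promoters = (ratings >= 9).mean()   # share of respondents rating 9 or 10
    detractors = (ratings <= 6).mean()  # share of respondents rating 0 to 6
    return float(100 * (promoters - detractors))

# Made-up responses; ASSUMPTION: nps_score is the raw 0-10 rating per respondent.
survey = pd.DataFrame({
    "brand": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "nps_score": [10, 9, 7, 3, 6, 5, 9, 8],
})

nps_by_brand = survey.groupby("brand")["nps_score"].apply(net_promoter_score)
print(nps_by_brand)
```
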
#### 7. Call Centre Data (Telecom)
| Property | Details |
|----------|---------|
| **Chapters** | 20 (Speech Analytics), 39 (Social Media & Sentiment), 46 (Employee Performance) |
| **Dataset Name** | `call_centre_logs.csv` |
| **Rows** | 10,000 calls |
| **Columns** | 24 variables |
| **Source** | Synthetic — modelled on Nigerian telecom customer service operations |
| **Description** | Call centre log including call metadata (date, time, duration, agent, customer), issue category (billing, technical, complaints, sales), customer demographics, agent performance, call sentiment/tonality classification, resolution status, and repeat call flag. |
| **Key Variables** | `call_id`, `call_date`, `call_duration_minutes`, `agent_id`, `customer_id`, `issue_category`, `issue_subcategory`, `customer_satisfaction_rating`, `call_sentiment`, `call_resolution_achieved`, `repeat_contact_within_7days`, `agent_tenure_months`, `call_handle_time_minutes`, `hold_time_minutes` |
| **Nigerian/African Relevance** | Issue categories reflect telecom customer concerns in Nigeria; sentiment analysis includes Nigerian English variations. |
| **License** | CC0 1.0 |
#### 8. Agricultural Commodity Prices (Nigeria & East Africa)
| Property | Details |
|----------|---------|
| **Chapters** | 9 (Time Series), 31 (Demand Forecasting) |
| **Dataset Name** | `agric_commodity_prices.csv` |
| **Rows** | 408 monthly observations (2010–2024 for 12 commodities) |
| **Columns** | 15 variables |
| **Source** | Publicly available from Nigeria's National Bureau of Statistics (NBS), FAO, and World Bank |
| **Description** | Monthly prices of key agricultural commodities traded in Nigeria and East Africa: maize, rice, beans, cassava, palm oil, groundnuts, sorghum, cocoa, livestock feed, fertiliser. Includes wholesale prices, farmer prices, international parity prices, and volume traded. |
| **Key Variables** | `commodity_id`, `commodity_name`, `country`, `market_location`, `year_month`, `price_naira_per_kg`, `volume_traded_tons`, `wholesale_price`, `farmer_price`, `international_parity_price`, `seasonality_index` |
| **Nigerian/African Relevance** | Commodities and markets reflect West and East African agricultural economies; seasonality patterns authentic. |
| **License** | Open Data Commons (PDDL) — public domain |
| **Access** | [NBS data portal](https://nigerianstat.gov.ng), [FAO GIEWS](http://www.fao.org/giews) |
#### 9. Energy Consumption (PHCN/Distribution Companies)
| Property | Details |
|----------|---------|
| **Chapters** | 9 (Time Series), 32 (Inventory Analytics), 44 (Supply Chain Optimization) |
| **Dataset Name** | `energy_consumption_daily.csv` |
| **Rows** | 2,555 daily readings (7 years); 2,000+ meters |
| **Columns** | 14 variables |
| **Source** | Synthetic — modelled on Nigerian distribution company (DISCO) network |
| **Description** | Daily electricity consumption readings from 2,000+ meters across a Nigerian distribution network. Includes consumption volume (kWh), billing period, customer type (residential, commercial, industrial), meter status, loss estimates, weather conditions (temperature, cloud cover), and technical characteristics. |
| **Key Variables** | `meter_id`, `reading_date`, `consumption_kwh`, `billing_period`, `customer_type`, `meter_status`, `technical_losses_percent`, `commercial_losses_percent`, `temperature_celsius`, `peak_demand_hour`, `region`, `voltage_class` |
| **Nigerian/African Relevance** | Based on PHCN/DISCO operational structures; customer types and loss factors reflect Nigerian grid realities. |
| **License** | CC0 1.0 |
### Part III: Predictive Analytics (Chapters 26-36)
#### 10. Insurance Claims Dataset
| Property | Details |
|----------|---------|
| **Chapters** | 29 (Churn & Fraud), 36 (Churn Prediction), 41 (Fraud Detection) |
| **Dataset Name** | `insurance_claims.csv` |
| **Rows** | 12,000 claims |
| **Columns** | 32 variables |
| **Source** | Synthetic — based on typical Nigerian insurance claims patterns |
| **Description** | Insurance claims database covering auto, health, and property claims. Includes policyholder demographics, risk factors, claim characteristics (date, type, amount, outcome), claim processing (days to settlement, fraud flags, investigator notes), policy history, and previous claims. |
| **Key Variables** | `claim_id`, `policy_id`, `policyholder_age`, `policy_type`, `premium_naira`, `claim_date`, `claim_type`, `claim_amount_naira`, `days_to_settlement`, `fraud_flag`, `fraud_probability_score`, `claim_outcome`, `investigator_flagged`, `claim_history_count` |
| **Nigerian/African Relevance** | Policy types and claim amounts reflect Nigerian insurance market; fraud patterns based on local risk profiles. |
| **License** | CC0 1.0 |
#### 11. Manufacturing Defects (Quality Control)
| Property | Details |
|----------|---------|
| **Chapters** | 41 (Quality Analytics & Six Sigma) |
| **Dataset Name** | `manufacturing_defects.csv` |
| **Rows** | 36 months × multiple plants; 5,000+ records |
| **Columns** | 18 variables |
| **Source** | Synthetic — inspired by Nigerian and African manufacturing operations |
| **Description** | Manufacturing quality control data covering 36 months across multiple plants, tracking defect rates, types (dimensional, material, surface finish, assembly), root causes, corrective actions, production volume, and process parameters (temperature, humidity, equipment age). |
| **Key Variables** | `plant_id`, `production_month`, `product_line`, `total_units_produced`, `defects_found`, `defect_rate_percent`, `defect_type`, `defect_cause`, `corrective_action`, `days_to_resolve`, `equipment_age_months`, `process_capability_index` |
| **Nigerian/African Relevance** | Plant locations in Nigeria; products typical of African manufacturing. |
| **License** | CC0 1.0 |
#### 12. Sales Performance (Multi-tier Distribution)
| Property | Details |
|----------|---------|
| **Chapters** | 27 (Lead Scoring), 33 (Predictive Sales Analytics), 46 (Employee Performance) |
| **Dataset Name** | `sales_team_performance.csv` |
| **Rows** | 100 salespeople × 24 months; 2,400 records |
| **Columns** | 28 variables |
| **Source** | Synthetic — based on multi-tier sales distribution in Nigeria |
| **Description** | Monthly sales performance data for 100 salespeople across territory tiers (A, B, C), tracking quota, sales, units, customer acquisitions, pipeline value, close rate, average deal size, activity metrics (calls, meetings, proposals), territory characteristics, and compensation realised. |
| **Key Variables** | `salesperson_id`, `territory`, `manager_id`, `month_year`, `quota_naira`, `sales_naira`, `units_sold`, `customer_acquisitions`, `pipeline_naira`, `close_rate_percent`, `average_deal_size_naira`, `calls_made`, `meetings_held`, `proposals_sent`, `comp_naira` |
| **Nigerian/African Relevance** | Territory structures and sales compensation reflect Nigerian sales organisations. |
| **License** | CC0 1.0 |
#### 13. Credit Bureau Data (Anonymised)
| Property | Details |
|----------|---------|
| **Chapters** | 29 (Churn & Fraud), 48 (Financial Risk) |
| **Dataset Name** | `credit_bureau_sample.csv` |
| **Rows** | 4,000 borrowers |
| **Columns** | 26 variables |
| **Source** | Anonymised sample inspired by CBN credit reporting framework |
| **Description** | Borrower credit history including active credit facilities (loans, credit cards, lines), payment performance (payments on time, months in arrears, defaults), credit utilisation, inquiry frequency, credit mix, and borrower attributes (type, credit score). |
| **Key Variables** | `borrower_id`, `num_active_accounts`, `total_credit_limit_naira`, `total_outstanding_naira`, `credit_utilisation_percent`, `months_in_arrears`, `num_defaults`, `payment_performance_score`, `inquiry_count_6months`, `oldest_account_age_months`, `account_age_range`, `credit_score` |
| **Nigerian/African Relevance** | Based on CBN credit reporting standards and Nigerian lending practices. |
| **License** | Restricted — public version with anonymisation |
#### 14. Hospitality Guest Reviews
| Property | Details |
|----------|---------|
| **Chapters** | 19 (Text Analytics), 35 (Customer Satisfaction) |
| **Dataset Name** | `hotel_guest_reviews.csv` |
| **Rows** | 8,000 reviews |
| **Columns** | 14 variables |
| **Source** | Synthetic — modelled on reviews from Nigerian hospitality sector |
| **Description** | Guest reviews from Nigerian hotels and hospitality establishments. Each review includes: guest demographics (nationality, repeat status), stay details (duration, room type, rate paid), detailed ratings (cleanliness, staff, food, value, WiFi, safety), likelihood to recommend, and open-ended review text. |
| **Key Variables** | `review_id`, `hotel_id`, `guest_nationality`, `repeat_guest_flag`, `stay_duration_nights`, `room_type`, `rating_overall`, `rating_cleanliness`, `rating_staff`, `rating_food`, `rating_value`, `nps_score`, `review_text`, `review_sentiment` |
| **Nigerian/African Relevance** | Hotels in major Nigerian cities; review sentiment includes African English patterns. |
| **License** | CC0 1.0 |
### Part IV: Optimisation & Applications (Chapters 37-50)
#### 15. Digital Ad Spend & Attribution
| Property | Details |
|----------|---------|
| **Chapters** | 37 (Marketing Mix Modelling) |
| **Dataset Name** | `digital_marketing_attribution.csv` |
| **Rows** | 24 months × multiple channels; 100+ records |
| **Columns** | 24 variables |
| **Source** | Synthetic — based on Nigerian digital advertising ecosystem |
| **Description** | Monthly advertising spend and response data across channels: Google Search, Facebook, Instagram, TikTok, YouTube, Email. Includes spend amount, impressions, clicks, conversions, CPC, CPL, ROAS, channel mix, brand lift measurements, and sales outcome. |
| **Key Variables** | `month_year`, `channel`, `spend_naira`, `impressions`, `clicks`, `ctr_percent`, `conversions`, `cpl_naira`, `roas_ratio`, `brand_lift_percent`, `awareness_lift`, `total_sales_naira`, `attributed_sales_naira` |
| **Nigerian/African Relevance** | Channels reflect Nigerian digital advertising landscape; spend amounts typical of local markets. |
| **License** | CC0 1.0 |
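
The ratio columns in this dataset can be recomputed from the raw counts using their standard definitions: click-through rate is clicks divided by impressions, ROAS is attributed sales divided by spend, and cost per click is spend divided by clicks. The sketch below uses made-up figures. Whether the shipped `ctr_percent` and `roas_ratio` columns were derived exactly this way is an assumption, and `cpc_naira` is an illustrative column name that does not appear in the key-variable list.

```python
import pandas as pd

# Made-up monthly figures for three of the channels listed above.
ads = pd.DataFrame({
    "channel": ["Google Search", "Facebook", "Email"],
    "spend_naira": [2_000_000.0, 1_500_000.0, 200_000.0],
    "impressions": [400_000, 600_000, 50_000],
    "clicks": [20_000, 12_000, 2_500],
    "attributed_sales_naira": [8_000_000.0, 3_000_000.0, 1_000_000.0],
})

# Standard definitions; ASSUMPTION: the dataset's own columns follow them.
ads["ctr_percent"] = 100 * ads["clicks"] / ads["impressions"]
ads["roas_ratio"] = ads["attributed_sales_naira"] / ads["spend_naira"]
ads["cpc_naira"] = ads["spend_naira"] / ads["clicks"]  # hypothetical column name

print(ads[["channel", "ctr_percent", "roas_ratio", "cpc_naira"]])
```
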
#### 16. Inventory at Regional Warehouses
| Property | Details |
|----------|---------|
| **Chapters** | 32 (Inventory Analytics), 45 (Demand-Supply Planning) |
| **Dataset Name** | `warehouse_inventory_daily.csv` |
| **Rows** | 5 warehouses × 365 days; 1,825 records |
| **Columns** | 20 variables |
| **Source** | Synthetic — modelled on Nigerian FMCG distribution centres |
| **Description** | Daily inventory levels for 5 regional distribution centres (North, South, East, West, Lagos) tracking: stock on hand by SKU, inbound/outbound volumes, stockouts, safety stock levels, warehouse utilisation, shrinkage, and demand forecast accuracy. |
| **Key Variables** | `warehouse_id`, `sku_id`, `date`, `opening_stock_units`, `inbound_units`, `outbound_units`, `closing_stock_units`, `reorder_point_units`, `safety_stock_units`, `days_supply_on_hand`, `stockout_flag`, `shrinkage_units`, `warehouse_utilisation_percent` |
| **Nigerian/African Relevance** | Regional warehouse structure typical of Nigerian FMCG distribution. |
| **License** | CC0 1.0 |
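
A useful sanity check on daily inventory records is a stock balance test. The sketch below assumes that closing stock equals opening stock plus inbound units, minus outbound units and shrinkage. The data dictionary does not state this identity, so treat it as an assumption to confirm against the actual file. The warehouse and SKU identifiers are placeholders, and the third row is deliberately inconsistent to show a failing case.

```python
import pandas as pd

# Made-up rows for one SKU at one warehouse; IDs are placeholders.
inventory = pd.DataFrame({
    "warehouse_id": ["LAG"] * 3,
    "sku_id": ["SKU-001"] * 3,
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "opening_stock_units": [500, 420, 400],
    "inbound_units": [0, 100, 0],
    "outbound_units": [78, 120, 150],
    "shrinkage_units": [2, 0, 5],
    "closing_stock_units": [420, 400, 250],  # last value is off by 5 on purpose
    "reorder_point_units": [300, 300, 300],
})

# ASSUMPTION: closing = opening + inbound - outbound - shrinkage.
expected_closing = (
    inventory["opening_stock_units"]
    + inventory["inbound_units"]
    - inventory["outbound_units"]
    - inventory["shrinkage_units"]
)
inventory["balance_ok"] = expected_closing == inventory["closing_stock_units"]

# Flag days on which stock ended below the reorder point.
inventory["below_reorder_point"] = (
    inventory["closing_stock_units"] < inventory["reorder_point_units"]
)

print(inventory[["date", "closing_stock_units", "balance_ok", "below_reorder_point"]])
```
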
#### 17. Employee Records (HR)
| Property | Details |
|----------|---------|
| **Chapters** | 46 (Employee Analytics), 47 (Attrition Prediction) |
| **Dataset Name** | `employee_records.csv` |
| **Rows** | 1,200 employees |
| **Columns** | 36 variables |
| **Source** | Synthetic — modelled on Nigerian corporate HR systems |
| **Description** | Employee master data including demographics, hire/separation dates, departments, positions, compensation (salary, bonuses, benefits), performance ratings (5 years), training records, promotion history, engagement survey responses, and exit interview data for separated employees. |
| **Key Variables** | `employee_id`, `hire_date`, `separation_date`, `department`, `job_title`, `manager_id`, `salary_naira`, `bonus_naira`, `benefits_value_naira`, `base_location`, `years_tenure`, `performance_rating_avg`, `promotion_count`, `training_hours_annual`, `engagement_score`, `separation_reason`, `regrettable_separation_flag` |
| **Nigerian/African Relevance** | Department and role structures typical of Nigerian organisations; compensation norms reflect local market. |
| **License** | Restricted — anonymised version only |
#### 18. Supply Chain Routing & Optimisation
| Property | Details |
|----------|---------|
| **Chapters** | 44 (Supply Chain Optimization) |
| **Dataset Name** | `supply_chain_routes.csv` |
| **Rows** | 500 delivery zones; 5,000 historical routes |
| **Columns** | 22 variables |
| **Source** | Synthetic — based on Nigerian last-mile logistics |
| **Description** | Distribution route planning data including: origin/destination locations, distance, time, vehicle type, fuel cost, toll costs, urban/rural classification, congestion patterns, actual vs. planned time, on-time delivery flag, and customer satisfaction. |
| **Key Variables** | `route_id`, `origin_location`, `destination_location`, `distance_km`, `planned_duration_hours`, `actual_duration_hours`, `vehicle_type`, `fuel_cost_naira`, `toll_cost_naira`, `traffic_condition`, `on_time_delivery_flag`, `urban_rural_classification`, `delivery_date` |
| **Nigerian/African Relevance** | Routes across major Nigerian cities; traffic and fuel costs authentic to Nigerian logistics. |
| **License** | CC0 1.0 |
#### 19. Stock Returns (Nigerian Stock Exchange)
| Property | Details |
|----------|---------|
| **Chapters** | 48 (Financial Risk Analytics) |
| **Dataset Name** | `nse_stock_returns.csv` |
| **Rows** | 504 daily observations (2 years) for select equities |
| **Columns** | 12 variables |
| **Source** | Public data from Nigerian Stock Exchange (NSE) |
| **Description** | Daily stock price data for selected NSE-listed companies covering: opening/closing prices, high/low, volume traded, returns, volatility, and market-wide indices (All-Share Index, sector indices). |
| **Key Variables** | `date`, `symbol`, `open_price_naira`, `close_price_naira`, `high_price_naira`, `low_price_naira`, `volume_traded`, `daily_return_percent`, `log_return`, `market_cap_naira`, `pe_ratio`, `sector` |
| **Nigerian/African Relevance** | Real NSE equity data; sector classification based on Nigerian market. |
| **License** | Open Data — publicly available |
| **Access** | [NSE Market Data](https://www.nse.com.ng) |
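
Simple and log returns can be derived from closing prices with the usual definitions: the simple return is the price ratio minus one, and the log return is the natural log of the price ratio. The sketch below computes both per symbol. The tickers and prices are placeholders, and it is an assumption that the dataset's `daily_return_percent` and `log_return` columns follow these close-to-close definitions.

```python
import numpy as np
import pandas as pd

# Placeholder tickers and prices, not real market data.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05"] * 2),
    "symbol": ["AAA"] * 3 + ["BBB"] * 3,
    "close_price_naira": [100.0, 110.0, 99.0, 50.0, 50.0, 55.0],
})
prices = prices.sort_values(["symbol", "date"]).reset_index(drop=True)

# Previous close within each symbol; the first day of each symbol has no predecessor.
prev_close = prices.groupby("symbol")["close_price_naira"].shift(1)
price_ratio = prices["close_price_naira"] / prev_close

prices["daily_return_percent"] = (price_ratio - 1) * 100  # simple return, in percent
prices["log_return"] = np.log(price_ratio)                # continuously compounded return

print(prices)
```
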
---
## Comprehensive Data Generation Scripts
### R: Generate All Synthetic Datasets
Save as `generate_data.R`. The script generates datasets 1 to 3 with a core subset of their variables:
```r
library(tidyverse)
library(lubridate)

# Set random seed for reproducibility
set.seed(42)

# ===== 1. NIGERIAN BANK LOAN PORTFOLIO =====
generate_loan_data <- function(n = 8000) {
  tibble(
    loan_id = 1:n,
    borrower_age = sample(25:65, n, replace = TRUE),
    annual_income_naira = rnorm(n, mean = 3e6, sd = 1.5e6) |> pmax(500000),
    loan_amount_naira = rnorm(n, mean = 5e6, sd = 2e6) |> pmax(100000),
    loan_term_months = sample(c(12, 24, 36, 48, 60), n, replace = TRUE),
    interest_rate = rnorm(n, mean = 12, sd = 3) |> pmax(5),
    employment_sector = sample(
      c("Manufacturing", "Services", "Agriculture", "Oil & Gas", "Finance"),
      n, replace = TRUE
    ),
    credit_score = rnorm(n, mean = 650, sd = 80) |> pmax(300) |> pmin(850),
    default_flag = rbinom(n, 1, 0.05),
    # About 10% of loans draw a Poisson(3) delinquency count; the rest are 0
    months_delinquent = rpois(n, lambda = ifelse(rbinom(n, 1, 0.1) == 1, 3, 0)),
    collateral_type = sample(
      c("Real Estate", "Vehicle", "Securities", "Personal Guarantee", "None"),
      n, replace = TRUE
    ),
    collateral_value_naira = rnorm(n, mean = 4e6, sd = 2e6) |> pmax(0)
  )
}

# ===== 2. NIGERIAN HOUSEHOLD EXPENDITURE SURVEY =====
generate_household_survey <- function(n = 3000) {
  tibble(
    household_id = 1:n,
    zone = sample(
      c("North Central", "North East", "North West", "South East", "South South", "South West"),
      n, replace = TRUE
    ),
    urban_rural = sample(c("Urban", "Rural"), n, replace = TRUE, prob = c(0.6, 0.4)),
    household_size = rpois(n, lambda = 5) |> pmax(1),
    head_age = sample(25:75, n, replace = TRUE),
    head_education = sample(
      c("No Formal", "Primary", "Secondary", "Tertiary"),
      n, replace = TRUE, prob = c(0.15, 0.25, 0.35, 0.25)
    ),
    monthly_income_naira = rnorm(n, mean = 150000, sd = 80000) |> pmax(10000),
    food_spending = rnorm(n, mean = 50000, sd = 20000) |> pmax(0),
    housing_spending = rnorm(n, mean = 20000, sd = 15000) |> pmax(0),
    utilities_spending = rnorm(n, mean = 10000, sd = 5000) |> pmax(0),
    education_spending = rnorm(n, mean = 15000, sd = 12000) |> pmax(0),
    health_spending = rnorm(n, mean = 8000, sd = 5000) |> pmax(0),
    transport_spending = rnorm(n, mean = 12000, sd = 8000) |> pmax(0),
    entertainment_spending = rnorm(n, mean = 5000, sd = 4000) |> pmax(0)
  )
}

# ===== 3. MOBILE MONEY TRANSACTIONS =====
generate_mobile_money <- function(n = 50000) {
  tibble(
    transaction_id = 1:n,
    customer_id = sample(1:5000, n, replace = TRUE),
    transaction_date = sample(seq(ymd("2023-01-01"), ymd("2023-12-31"), by = "day"), n, replace = TRUE),
    transaction_type = sample(c("Airtime", "Data", "Transfer", "Utility Payment"), n, replace = TRUE),
    amount_naira = rnorm(n, mean = 2500, sd = 2000) |> pmax(100),
    channel = sample(c("USSD", "Mobile App", "Agent", "Retailer"), n, replace = TRUE),
    operator = sample(c("MTN", "Airtel", "Glo", "9mobile"), n, replace = TRUE),
    region = sample(
      c("North Central", "North East", "North West", "South East", "South South", "South West"),
      n, replace = TRUE
    ),
    customer_tenure_months = sample(1:60, n, replace = TRUE),
    transaction_hour = sample(0:23, n, replace = TRUE),
    # abbr = FALSE writes full day names ("Monday"), matching the Python script
    day_of_week = wday(transaction_date, label = TRUE, abbr = FALSE)
  )
}

# ===== GENERATE AND SAVE ALL DATASETS =====
loans <- generate_loan_data()
households <- generate_household_survey()
mobile_money <- generate_mobile_money()

write_csv(loans, "nigerian_bank_loans.csv")
write_csv(households, "nbs_household_survey.csv")
write_csv(mobile_money, "mobile_money_transactions.csv")

cat("All datasets generated successfully!\n")
cat(" - nigerian_bank_loans.csv (", nrow(loans), " rows)\n", sep = "")
cat(" - nbs_household_survey.csv (", nrow(households), " rows)\n", sep = "")
cat(" - mobile_money_transactions.csv (", nrow(mobile_money), " rows)\n", sep = "")
```
### Python: Generate All Synthetic Datasets
Save as `generate_data.py`. The script generates datasets 1 to 3 with a core subset of their variables:
```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Set random seed for reproducibility
np.random.seed(42)

# ===== 1. NIGERIAN BANK LOAN PORTFOLIO =====
def generate_loan_data(n=8000):
    return pd.DataFrame({
        'loan_id': range(1, n + 1),
        # randint excludes the upper bound, so 66 yields ages 25-65 as in the R script
        'borrower_age': np.random.randint(25, 66, n),
        'annual_income_naira': np.random.normal(3e6, 1.5e6, n).clip(500000),
        'loan_amount_naira': np.random.normal(5e6, 2e6, n).clip(100000),
        'loan_term_months': np.random.choice([12, 24, 36, 48, 60], n),
        'interest_rate': np.random.normal(12, 3, n).clip(5),
        'employment_sector': np.random.choice(
            ['Manufacturing', 'Services', 'Agriculture', 'Oil & Gas', 'Finance'], n
        ),
        'credit_score': np.random.normal(650, 80, n).clip(300, 850).astype(int),
        'default_flag': np.random.binomial(1, 0.05, n),
        # About 10% of loans draw a Poisson(3) delinquency count; the rest are 0
        'months_delinquent': [np.random.poisson(3) if np.random.rand() < 0.1 else 0 for _ in range(n)],
        'collateral_type': np.random.choice(
            ['Real Estate', 'Vehicle', 'Securities', 'Personal Guarantee', 'None'], n
        ),
        'collateral_value_naira': np.random.normal(4e6, 2e6, n).clip(0)
    })

# ===== 2. NIGERIAN HOUSEHOLD EXPENDITURE SURVEY =====
def generate_household_survey(n=3000):
    return pd.DataFrame({
        'household_id': range(1, n + 1),
        'zone': np.random.choice(
            ['North Central', 'North East', 'North West', 'South East', 'South South', 'South West'], n
        ),
        'urban_rural': np.random.choice(['Urban', 'Rural'], n, p=[0.6, 0.4]),
        'household_size': np.random.poisson(5, n).clip(1),
        # Upper bound is exclusive, so 76 yields ages 25-75 as in the R script
        'head_age': np.random.randint(25, 76, n),
        'head_education': np.random.choice(
            ['No Formal', 'Primary', 'Secondary', 'Tertiary'], n, p=[0.15, 0.25, 0.35, 0.25]
        ),
        'monthly_income_naira': np.random.normal(150000, 80000, n).clip(10000),
        'food_spending': np.random.normal(50000, 20000, n).clip(0),
        'housing_spending': np.random.normal(20000, 15000, n).clip(0),
        'utilities_spending': np.random.normal(10000, 5000, n).clip(0),
        'education_spending': np.random.normal(15000, 12000, n).clip(0),
        'health_spending': np.random.normal(8000, 5000, n).clip(0),
        'transport_spending': np.random.normal(12000, 8000, n).clip(0),
        'entertainment_spending': np.random.normal(5000, 4000, n).clip(0)
    })

# ===== 3. MOBILE MONEY TRANSACTIONS =====
def generate_mobile_money(n=50000):
    start_date = datetime(2023, 1, 1)
    # Offsets 0-364 cover every day of 2023; int() keeps timedelta happy with NumPy integers
    dates = [start_date + timedelta(days=int(np.random.randint(0, 365))) for _ in range(n)]
    return pd.DataFrame({
        'transaction_id': range(1, n + 1),
        'customer_id': np.random.randint(1, 5001, n),
        'transaction_date': dates,
        'transaction_type': np.random.choice(['Airtime', 'Data', 'Transfer', 'Utility Payment'], n),
        'amount_naira': np.random.normal(2500, 2000, n).clip(100),
        'channel': np.random.choice(['USSD', 'Mobile App', 'Agent', 'Retailer'], n),
        'operator': np.random.choice(['MTN', 'Airtel', 'Glo', '9mobile'], n),
        'region': np.random.choice(
            ['North Central', 'North East', 'North West', 'South East', 'South South', 'South West'], n
        ),
        'customer_tenure_months': np.random.randint(1, 61, n),
        'transaction_hour': np.random.randint(0, 24, n),
        'day_of_week': [d.strftime('%A') for d in dates]
    })

# ===== GENERATE AND SAVE ALL DATASETS =====
loans = generate_loan_data()
households = generate_household_survey()
mobile_money = generate_mobile_money()

loans.to_csv('nigerian_bank_loans.csv', index=False)
households.to_csv('nbs_household_survey.csv', index=False)
mobile_money.to_csv('mobile_money_transactions.csv', index=False)

print("All datasets generated successfully!")
print(f" - nigerian_bank_loans.csv ({len(loans)} rows)")
print(f" - nbs_household_survey.csv ({len(households)} rows)")
print(f" - mobile_money_transactions.csv ({len(mobile_money)} rows)")
```
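
After generating or downloading a file, a quick check against the row counts and key variables documented in this appendix can catch a truncated or outdated copy. The sketch below uses a hypothetical helper, `check_dataset`, demonstrated on a small stand-in frame so that it runs without any files on disk. In practice the frame would come from `pd.read_csv("nigerian_bank_loans.csv")` or similar.

```python
import pandas as pd

def check_dataset(df, expected_rows, required_columns):
    """Return a list of problems; an empty list means the frame passed."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    return problems

# Stand-in for a loaded CSV; three rows and two of the documented loan columns.
demo = pd.DataFrame({"loan_id": [1, 2, 3], "default_flag": [0, 1, 0]})

# Passes: row count and columns both match what we ask for.
passing = check_dataset(demo, expected_rows=3, required_columns=["loan_id", "default_flag"])

# Fails: the documented loan file has 8,000 rows and a credit_score column.
failing = check_dataset(demo, expected_rows=8000, required_columns=["loan_id", "credit_score"])

print(passing)
print(failing)
```
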
---
## How to Access Datasets
1. **Synthetic Datasets**: Run the R or Python generation scripts above (they cover datasets 1 to 3)
2. **Public Datasets**:
   - NBS Household Survey: [nigerianstat.gov.ng](https://nigerianstat.gov.ng)
   - CBN Data: [cbn.gov.ng/cbndb](https://cbn.gov.ng/cbndb)
   - NSE Stock Data: [nse.com.ng](https://nse.com.ng)
   - World Bank Nigeria Data: [data.worldbank.org](https://data.worldbank.org)
3. **Book Repository**: All datasets are available in the book's GitHub repository under `/data`
---
*All datasets are provided as-is for educational purposes. See individual dataset entries for licensing and attribution requirements.*