Data Literacy: What Is Data and Why Does It Matter?

Learning Objectives

📘 What You’ll Learn in This Chapter

By the end of this chapter, you will be able to:

Define data, information, insight, and decision, and explain how they relate to one another
Distinguish between raw data, processed data, and derived data
Classify data by type (structured, unstructured, semi-structured) and recognize examples from Nigerian and African businesses
Understand scales of measurement (nominal, ordinal, interval, ratio) and why they matter for analysis
Explain the five Vs of big data (volume, velocity, variety, veracity, value) and recognize them in real systems
Navigate the analytics value chain from raw data to business value
Distinguish between Key Performance Indicators (KPIs) and Key Predictive Indicators (KPIs)
Understand data ethics, privacy, bias, and governance in the African context, including Nigeria’s NDPR

What Is Data? Definitions and the Analytics Value Chain

Before we can analyze, we must understand what we are analyzing. Let’s start with the fundamental question: what is data?

In the broadest sense, data is recorded information about the world. It’s the stock prices from the Nigerian Exchange, the transaction logs from a bank’s ATM network, the temperatures recorded by weather stations in Lagos, the text of customer complaints written to a utility company. Data is everywhere.

But there’s a critical distinction we must make early. Most people use the words “data,” “information,” “insight,” and “decision” interchangeably, but they are not the same thing. Understanding the differences is the foundation of data literacy.

Raw data is unprocessed, unorganized fact: a list of numbers with no context, millions of database records in their native format, sensor readings streaming in without interpretation. Raw data is often messy, redundant, and incomplete. Imagine a CSV file with 2 million rows of bank transactions—that’s raw data. By itself, it tells you nothing.

Information is processed, organized data with context. When you take that CSV file of 2 million transactions, clean it, organize it, and compute summary statistics—“Total transaction volume this month: 500 billion Naira”—you have information. Information has been refined to answer a specific question.

Insight is interpretation and understanding derived from information. Insight sees patterns, asks “why?”, and draws conclusions. An insight might be: “Our transaction volume grows 5% every month, but fraud attempts grow 8% monthly—we need to strengthen our verification systems.” Insight connects dots that raw information leaves disconnected.

Decision is action based on insight. A decision is choosing between alternatives using what you’ve learned. “We will implement biometric authentication on all ATM withdrawals above 1 million Naira.” A good decision is informed by good insight, which comes from good information, which comes from good data.

🔑 The Analytics Value Chain

Data → Information → Insight → Decision → Value

Each step adds human understanding and judgment. Data without insight is just noise. Insight without decision is just interesting conversation. Decision without execution is just hope. The full chain creates business value.

This progression is so important that we’ll return to it throughout this book. For now, understand that your job as an analyst is to move stakeholders up this chain: from raw data to actionable insight.

Raw Data vs. Derived Data

Data comes in layers of processing:

Raw data: Unprocessed observations directly from a source (sensor readings, transaction logs, survey responses as entered)
Cleaned data: Raw data with errors removed, missing values handled, and outliers addressed
Processed data: Cleaned data organized, formatted, and structured for analysis
Derived data: New data computed from processed data (ratios, rankings, aggregations, predictions)

As you move down this list, you’re adding value but also adding assumptions and potential for error. A derived variable (like “customer lifetime value”) is only as good as the underlying raw data and the logic used to compute it.
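The four layers can be sketched in a few lines of Python. This is a minimal illustration, not a recommended pipeline: the records, the field names, and the cleaning rules are all invented for the example.

```python
# Raw data: hypothetical transaction rows exactly as they might arrive.
raw_rows = [
    {"customer": "A001", "amount": "15000"},
    {"customer": "A001", "amount": "15000"},   # exact duplicate
    {"customer": "A002", "amount": ""},        # missing amount
    {"customer": "A002", "amount": "42,500"},  # thousands separator
    {"customer": "A003", "amount": "8000"},
]


def clean(rows):
    """Cleaned data: drop exact duplicates and rows with a missing amount."""
    seen, cleaned = set(), []
    for row in rows:
        key = (row["customer"], row["amount"])
        if row["amount"] == "" or key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def process(rows):
    """Processed data: convert amounts from text to integers (Naira)."""
    return [
        {"customer": r["customer"], "amount": int(r["amount"].replace(",", ""))}
        for r in rows
    ]


def derive(rows):
    """Derived data: total spend per customer, computed from processed rows."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals


cleaned_rows = clean(raw_rows)
processed_rows = process(cleaned_rows)
spend_per_customer = derive(processed_rows)
print(spend_per_customer)
```

Notice that every step embeds a judgment call. Dropping the second A001 row assumes it is a duplicate rather than a genuine second transfer of the same amount, which is exactly the kind of assumption that travels silently into derived data.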

The Practical Challenge

Here’s where it gets real: in practice, you almost never see truly raw data. By the time data reaches an analyst, someone else has usually done preliminary processing. A database schema has imposed structure. A CSV export has formatted numbers and dates in specific ways. Column names have been chosen (sometimes poorly).

Your job includes learning to see through these layers, understand what transformations have already happened, and know what quality issues might remain.

Types of Data: Structured, Unstructured, and Semi-Structured

Data comes in three primary forms, and your analytical approach depends on which type you’re working with.

Structured Data

Structured data is organized into predefined categories, formats, and relationships. Think of a table or spreadsheet: rows are observations, columns are variables, every cell follows a clear format.

Nigerian examples:
- Bank account records: account number (column 1), customer name (column 2), balance (column 3), account type (column 4)
- NBS Consumer Price Index: date, product category, price index value
- Stock exchange data: ticker symbol, closing price, trading volume, timestamp
- University enrollment: student ID, programme, admission year, GPA
- Mobile money transactions: sender phone number, amount transferred, timestamp, receiver

Structured data is the easiest to analyze. It fits neatly into tables, databases, and statistical software. Most of the analysis you’ll do early in your career will use structured data.

📝 Review Questions: Structured Data

Why is structured data easier to analyze than unstructured data?
Name two sources of structured data from a Nigerian bank or fintech company.
What problems might arise if a database uses inconsistent formats for dates (e.g., some cells “01/02/2024” and others “2024-02-01”)?

Unstructured Data

Unstructured data has no predefined format or organization. Text, images, audio, video—these are unstructured. The data exists, but it’s not organized into rows and columns.

Nigerian examples:
- WhatsApp messages between a business and its customers
- Customer service complaints posted on Twitter or written in emails
- Medical records and doctor’s notes in a hospital system (often written as free text)
- Photographs of damaged goods submitted in insurance claims
- Audio recordings of customer service calls at a call center
- News articles published by Nigerian newspapers online

Unstructured data is harder to analyze because it requires additional processing to extract meaning. You can’t just put an email message into a spreadsheet and compute its mean. You need specialized tools: natural language processing to extract meaning from text, computer vision to understand images, speech recognition for audio.

However, unstructured data often contains rich information. A customer’s email complaint contains their frustration, the specific product they’re upset about, and clues about what went wrong. Standard statistical analysis would miss all of this.

Semi-Structured Data

Semi-structured data has some organization but doesn’t fit neatly into tables. The classic example is JSON (JavaScript Object Notation) or XML (Extensible Markup Language).

Nigerian examples:
- XML form submissions from government agencies (e.g., tax forms with nested sections)
- JSON API responses from financial platforms (e.g., mobile money providers returning user transaction history)
- HTML pages from e-commerce sites (product data mixed with presentation formatting)
- Email messages (structured headers like “From:” and “Date:”, but unstructured body text)
- Log files from web servers (timestamps and codes are structured, but error messages are free text)

Semi-structured data requires specialized parsing tools—libraries that understand JSON or XML—but less heavy processing than fully unstructured data.
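To see what “specialized parsing” looks like, here is a small sketch using Python’s built-in json module. The response below is a made-up example of what a mobile money API might return; the field names are assumptions, not any real provider’s format.

```python
import json

# Hypothetical API response; the structure and field names are invented.
response_text = """
{
  "user": {"id": "U-102", "phone": "+2348012345678"},
  "transactions": [
    {"ref": "T1", "amount": 5000, "meta": {"channel": "USSD"}},
    {"ref": "T2", "amount": 12500, "meta": {"channel": "app", "device": "android"}}
  ]
}
"""

payload = json.loads(response_text)

# Flatten each nested transaction into one table-like row.
rows = []
for txn in payload["transactions"]:
    rows.append({
        "user_id": payload["user"]["id"],
        "ref": txn["ref"],
        "amount": txn["amount"],
        "channel": txn["meta"]["channel"],
        # "device" is optional in this sample, so fall back to None.
        "device": txn["meta"].get("device"),
    })

for row in rows:
    print(row)
```

The optional "device" field shows why this data is only semi-structured: records share a general shape, but not every record carries every field, so the analyst has to decide how to represent the gaps when flattening to a table.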

Section Review: Data Types

📝 Review Questions: Data Types and Sources

You’re working at an e-commerce company in Lagos. The company’s database contains product records (product ID, name, price, stock quantity) and also stores customer reviews as free-text comments. Which data is structured and which is unstructured?
An African payment processor receives data from member banks. The data arrives as JSON with nested fields for different transaction types. Is this structured or semi-structured data? Why?
A health insurance company needs to analyze both policy records (structured: policy number, premium amount, coverage type) and doctor’s notes from claims (unstructured: free text). What additional tools or skills would be needed to analyze the doctor’s notes compared to the policy records?
Give an example of semi-structured data you encounter in your daily life, and explain why it’s not fully structured.
A Nigerian bank wants to analyze customer sentiment from WhatsApp customer service messages. What type of data is this, and what challenges might an analyst face?

Scales of Measurement: Why Numbers Aren’t All the Same

Here’s a truth that seems obvious but whose implications are profound: not all numbers are the same. The number 5 means very different things depending on context.

Scales of measurement are categories that describe what numbers represent. The scale you’re working with determines which statistical operations make sense.

There are four scales: nominal, ordinal, interval, and ratio. Let’s explore each, with Nigerian business examples.

Nominal Scale: Categories Without Order

Nominal data describes categories with no inherent order or ranking. The categories are mutually exclusive—something is either in one category or another, not in between.

What you can do: Count, find mode (most common), test for association between categories.

What you cannot do: Compute a mean, median, or standard deviation. Ranking doesn’t make sense.

Nigerian examples:

Bank account type: Savings, Checking, Business, Student. A savings account is not “greater than” or “less than” a checking account—they’re just different. If you were to code these as numbers (1 = Savings, 2 = Checking, 3 = Business, 4 = Student), computing the mean (2.5) would be meaningless.
Mobile network: MTN, Airtel, Glo, 9mobile. These are categories. An MTN customer is not higher or lower than an Airtel customer. If a telecom analyst wanted to know the most popular network, they’d report the mode (most frequent category), not the average.
Product category in a retail store: Groceries, Electronics, Clothing, Household. These are distinct types, nothing more.
Gender: As typically recorded (Male, Female). Two categories, neither ordered.
State in Nigeria: Lagos, Kano, Rivers, Kaduna, etc. These are categories; Kano is not “greater than” Lagos.

📝 Review Questions: Nominal Scale

A bank records the “department” of each employee (Sales, Operations, HR, IT). What scale is this, and why can’t you compute the average department?
Why would converting nominal categories to numbers (1, 2, 3, 4) and computing a mean give you meaningless results?

Ordinal Scale: Categories With Order, But Unequal Gaps

Ordinal data describes categories with a natural order, but the distances between categories are not equal or meaningful.

What you can do: Count, find mode, find median, rank, compute rank-based correlations (such as Spearman’s).

What you cannot do: Assume equal intervals between ranks. Computing a mean might be misleading.

Nigerian examples:

Net Promoter Score (NPS): A question like “How likely are you to recommend this service to a friend?” answered on a scale of 0-10. You know that 8 > 5 > 2, but is the gap from 2 to 5 the same as the gap from 5 to 8? Probably not. Someone rating 2 versus 5 is a big difference in sentiment; someone rating 5 versus 8 is a smaller difference. The gaps are unequal.
Educational attainment: Primary, Secondary, Tertiary, Postgraduate. There’s an order (Postgraduate > Tertiary > Secondary > Primary), but the “distance” between Primary and Secondary (typically 6 years) is different from Secondary to Tertiary (typically 4 years).
Customer satisfaction survey: Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied. There’s an order, but is the gap from Dissatisfied to Neutral the same as from Neutral to Satisfied? It depends on how people interpret the scale.
Employee performance rating: Below Expectations, Meets Expectations, Exceeds Expectations, Far Exceeds Expectations. Again, there’s an order, but equal intervals between ratings are assumed, not proven.
Income bracket for survey respondents: ₦0–₦50,000/month, ₦50,000–₦100,000/month, ₦100,000–₦250,000/month, ₦250,000+/month. While these have ordering, the intervals differ greatly (the first two are ₦50k wide; the last two are ₦150k and open-ended).

With ordinal data, you can sensibly compute a median (the middle value) but not a mean (which assumes equal intervals). You can say “50% of customers rate us 7 or higher” but not “the average rating is 6.2” without being careful about interpretation.
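The sketch below applies this to the five satisfaction labels from the examples above. The survey responses are invented, and the numeric ranks (0 to 4) are only positions in the ordering, not measurements.

```python
from statistics import mean, median_low

# Ordered categories, lowest to highest.
LEVELS = ["Very Dissatisfied", "Dissatisfied", "Neutral",
          "Satisfied", "Very Satisfied"]
rank = {label: position for position, label in enumerate(LEVELS)}

# Hypothetical survey responses.
responses = ["Satisfied", "Neutral", "Very Satisfied", "Dissatisfied",
             "Satisfied", "Satisfied", "Very Dissatisfied"]

ranks = [rank[r] for r in responses]

# median_low always returns a value that occurs in the data,
# so it maps back to a real category label.
median_label = LEVELS[median_low(ranks)]

# A share at or above a threshold is also safe for ordinal data.
share_satisfied = sum(r >= rank["Satisfied"] for r in ranks) / len(ranks)

# The mean of the ranks can be computed, but its value depends on the
# arbitrary choice of codes 0..4 and assumes the gaps are equal.
mean_rank = mean(ranks)

print(median_label, round(share_satisfied, 2), round(mean_rank, 2))
```

The median and the share translate directly into statements a manager can act on (“the typical customer is Satisfied”), while the mean rank needs the careful interpretation described above.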

📝 Review Questions: Ordinal Scale

An e-commerce platform asks customers to rate product quality on a scale of 1 (Poor) to 5 (Excellent). Is the difference in sentiment between a 1 and 2 rating the same as between a 4 and 5? Why does this matter?
Why is the median more appropriate than the mean for ordinal data? (Hint: think about the assumption that mean makes about equal intervals.)

Interval Scale: Ordered With Equal Intervals, But No True Zero

Interval data has ordered categories with equal intervals between values. Critically, there is no true zero—zero doesn’t mean “the absence of the quantity.”

What you can do: Everything from ordinal scale, plus compute mean, standard deviation, and correlation. Use most common statistical tests.

What you cannot do: Form meaningful ratios between values. “30°C is twice as hot as 15°C” is false because 0°C is an arbitrary reference point, not the absence of heat.

Nigerian examples:

Temperature in Lagos: A weather station records 28°C one day and 14°C another. The difference is 14 degrees, and differences are meaningful. But 28°C is not twice as hot as 14°C (that only makes sense on the Kelvin scale, which has a true zero). You can compute the average temperature over a week: (28 + 26 + 24 + 22 + 25 + 27 + 26) / 7 = 25.4°C. This is meaningful.
Year (as a number): The year 2024 and the year 2000 are 24 years apart. The difference is meaningful. But the ratio 2024 / 2000 = 1.012 means nothing, because the calendar’s starting point is arbitrary. The Gregorian calendar does not even have a year zero: 1 BC is followed directly by AD 1.
Test scores (with a defined scale): On a standardized test, a score of 0 means “no correct answers,” not “the complete absence of knowledge,” so the zero point is not a true zero and the scores are treated as interval. Most school tests work this way.
Time of day: 14:00 (2 PM) and 10:00 (10 AM) are 4 hours apart. You can average times: the average of 10:00 and 14:00 is 12:00. But 14:00 is not “1.4 times” 10:00.

A Nigerian weather application reports temperatures, which are interval data, so averaging them makes sense. Nigerian exam boards publish test scores on a 0-100 scale (interval), so you can meaningfully discuss average performance.
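The Lagos temperature example can be checked directly. The sketch below reproduces the weekly average and then shows why the “twice as hot” claim fails: once the same readings are converted to Kelvin, which does have a true zero, the ratio is nowhere near 2.

```python
from statistics import mean

# The seven daily readings from the Lagos example, in degrees Celsius.
week_celsius = [28, 26, 24, 22, 25, 27, 26]

average_celsius = mean(week_celsius)
print(round(average_celsius, 1))  # 25.4


def to_kelvin(celsius):
    """Convert Celsius to Kelvin, a scale with a true zero."""
    return celsius + 273.15


# Dividing Celsius values gives a number, but not a meaningful one.
naive_ratio = 28 / 14

# The physically meaningful ratio uses the true-zero scale.
kelvin_ratio = to_kelvin(28) / to_kelvin(14)

print(naive_ratio, round(kelvin_ratio, 3))
```

Differences survive the change of scale (28°C and 14°C are still 14 units apart in Kelvin), but ratios do not. That is the practical test for telling interval data from ratio data.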

Ratio Scale: The Gold Standard

Ratio data has all the properties of interval data plus a true, meaningful zero. Zero means the absence of the quantity.

What you can do: Everything. All statistical operations, multiplication, and division are meaningful.

Nigerian examples:

Annual revenue: A small business earns ₦5 million; a larger one earns ₦10 million. The larger business earns twice as much (10 / 5 = 2). Zero revenue means no money came in—that’s a real, meaningful state.
Number of employees: A startup has 5 employees; a mature company has 15. The mature company has three times as many (15 / 5 = 3). Zero employees means no one works there.
Transaction amount in Naira: A customer transfers ₦50,000; another transfers ₦100,000. The second transfer is twice as large. ₦0 means no money moved.
Age in years: A person is 25 years old; another is 50. The second person is twice as old. Age zero (birth) is a real event.
Loan default rate: Of 1,000 loans, 50 defaulted. The default rate is 5%. A default rate of 0% means no loans defaulted—that’s a real state.

With ratio data, you can meaningfully compute ratios (hence the name). You can say “our 2024 revenue is 1.3 times our 2023 revenue” and that statement carries real weight.

Why Scales of Measurement Matter: The Statistical Consequence

The scale of measurement determines which analyses are valid. Using the wrong analysis because you misunderstood your data’s scale leads to nonsense results.

🔑 Statistical Operations by Scale

Operation	Nominal	Ordinal	Interval	Ratio
Count frequency	✓	✓	✓	✓
Mode (most common)	✓	✓	✓	✓
Median	✗	✓	✓	✓
Mean	✗	⚠	✓	✓
Standard deviation	✗	⚠	✓	✓
Percentages	✓	✓	✓	✓
Ratio (A/B)	✗	✗	✗	✓
t-test, ANOVA	✗	⚠	✓	✓

Note: ✓ = Valid and meaningful; ⚠ = Possible but requires caution; ✗ = Invalid.
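One way to make the table usable in practice is to encode it as a lookup that an analysis script can consult before computing a statistic. The sketch below is simply a transcription of the table; the names OPERATIONS and check are invented for this illustration.

```python
VALID, CAUTION, INVALID = "valid", "caution", "invalid"

# Column order matches the table: nominal, ordinal, interval, ratio.
SCALES = ("nominal", "ordinal", "interval", "ratio")

OPERATIONS = {
    "count frequency":    (VALID,   VALID,   VALID,   VALID),
    "mode":               (VALID,   VALID,   VALID,   VALID),
    "median":             (INVALID, VALID,   VALID,   VALID),
    "mean":               (INVALID, CAUTION, VALID,   VALID),
    "standard deviation": (INVALID, CAUTION, VALID,   VALID),
    "percentages":        (VALID,   VALID,   VALID,   VALID),
    "ratio":              (INVALID, INVALID, INVALID, VALID),
    "t-test, ANOVA":      (INVALID, CAUTION, VALID,   VALID),
}


def check(operation, scale):
    """Return 'valid', 'caution', or 'invalid' for an operation on a scale."""
    return OPERATIONS[operation][SCALES.index(scale)]


print(check("mean", "nominal"))   # invalid
print(check("mean", "ordinal"))   # caution
print(check("ratio", "interval")) # invalid
```

A lookup like this cannot decide which scale a variable belongs to; that still requires understanding what the numbers represent. It only makes the consequences of that decision explicit.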

Example: A Nigerian bank classifies customer account types as Savings (coded 1), Checking (coded 2), Business (coded 3), and Student (coded 4). These are nominal data. A careless analyst might compute the average account type as (1 + 2 + 3 + 4) / 4 = 2.5 and conclude that the “average account type is 2.5.” This is meaningless gibberish. You cannot average nominal categories.

The correct analysis would be to report the distribution: “40% of our customers have Savings accounts, 35% have Checking, 20% have Business, and 5% have Student accounts.”
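In code, reporting a distribution is a matter of counting. The sample below is hypothetical: 20 customers chosen so that the shares match the percentages quoted above.

```python
from collections import Counter

# Hypothetical sample of 20 customers.
accounts = (["Savings"] * 8 + ["Checking"] * 7
            + ["Business"] * 4 + ["Student"] * 1)

counts = Counter(accounts)

# Percentage of customers holding each account type.
distribution = {kind: 100 * n / len(accounts) for kind, n in counts.items()}

# The mode is the only "typical value" that makes sense for nominal data.
most_common_type, _ = counts.most_common(1)[0]

print(distribution)
print(most_common_type)
```

Counts, percentages, and the mode all work on the category labels themselves, so there is no need to assign numeric codes at all, and no temptation to average them.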

Section Review: Scales of Measurement

📝 Review Questions: Scales of Measurement

A hospital in Kano records patient blood pressure readings. Blood pressure is measured in mmHg (millimeters of mercury). Is this nominal, ordinal, interval, or ratio data? Justify your answer. What statistical operations can the hospital safely perform?
A retail chain categorizes customers as “Regular,” “Occasional,” or “First-time.” What scale is this? Why can the company find the mode of this variable but not the mean? What would the mean tell you (if you computed it anyway)?
A telecommunications company records the following for each customer: (a) mobile network (MTN, Airtel, Glo), (b) monthly data usage in GB, (c) satisfaction rating on a 1-5 scale, (d) years as a customer. Identify the scale for each variable.
Why is it incorrect to say that a customer satisfaction rating of 8 is twice as satisfied as a rating of 4? What scale is satisfaction rating, and what would be the correct interpretation of 8 vs. 4?
An analyst is building a machine learning model to predict customer churn. Why does it matter to distinguish between the scales of the input variables?

Big Data Characteristics: The Five Vs

In the modern era, “big data” is a term you’ll hear constantly. But what makes data “big”? Size alone (number of gigabytes) is only one dimension. Academics and practitioners have identified five key characteristics—the five Vs of big data.

Volume: Sheer Quantity

Volume refers to the amount of data. We’re talking terabytes, petabytes—data at a scale that traditional databases struggle with.

Nigerian context: The NIBSS (Nigeria Inter-Bank Settlement System) processes millions of electronic payment transactions daily. A single large bank in Nigeria might process 10 million+ transactions daily, each transaction a data record with timestamp, amount, sender, receiver, type. Over a year, that’s roughly 3.65 billion transaction records (10 million × 365 days), which is genuine big data in volume.

Mobile operators like MTN Nigeria and Airtel Africa each track billions of call detail records monthly—when every customer made a call, to whom, for how long. Combined, this is volume at an astonishing scale.
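A quick back-of-the-envelope calculation shows how daily volume turns into storage. The 10 million transactions per day comes from the bank example above; the 200 bytes per record is purely an assumption for illustration, since real record sizes vary widely.

```python
transactions_per_day = 10_000_000
days_per_year = 365

records_per_year = transactions_per_day * days_per_year
print(f"{records_per_year:,} records per year")

# ASSUMPTION: 200 bytes per record (timestamp, amount, sender, receiver, type).
bytes_per_record = 200

bytes_per_year = records_per_year * bytes_per_record
gigabytes_per_year = bytes_per_year / 1_000_000_000  # decimal gigabytes
print(f"about {gigabytes_per_year:,.0f} GB per year")
```

Under this assumption a single bank generates hundreds of gigabytes of raw transaction records each year, before any indexes, backups, or derived tables are counted.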

Velocity: Speed of Generation

Velocity refers to how fast data is generated and how fast it must be processed. Some data is static (census data collected once every 10 years). Other data flows continuously.

Nigerian context: Stock market data from the Nigerian Exchange streams in real-time during trading hours. Cryptocurrency exchanges operating in Nigeria (though in a regulatory gray zone) generate tick-by-tick price updates. Payment platforms like Flutterwave and Remita process transactions second-by-second.

For a bank’s fraud detection system, processing velocity matters: a fraudulent transaction detected in milliseconds can be blocked; detected in hours, it’s already done damage. Velocity demands real-time or near-real-time processing.

Variety: Different Types and Sources

Variety refers to heterogeneous data types and sources. Not all data is numbers in a table. You’ve got structured data (databases), unstructured data (text, images), semi-structured data (JSON APIs), and more.

Nigerian context: A fintech company in Lagos might combine:
- Structured: customer account records from their database
- Semi-structured: JSON responses from payment gateway APIs
- Unstructured: customer feedback texts from WhatsApp, email complaints
- Images: photos of identity documents for KYC (Know Your Customer)
- Geographic: GPS coordinates of ATM locations
- Behavioral: clickstream data from their mobile app

Handling this variety requires tools and expertise beyond traditional SQL databases.

Veracity: Quality and Trustworthiness

Veracity refers to data quality—accuracy, consistency, completeness. Much real-world data is messy: missing values, typos, duplicates, contradictions.

Nigerian context: In a survey across rural Nigeria, phone numbers might be recorded inconsistently (with or without country code, spaces in different places). In a government database integrating data from multiple agencies, the same person might appear under different IDs if their name was spelled differently. In IoT sensor data from weather stations, a malfunctioning sensor might report impossible values (like -500°C).

Veracity is why data cleaning (discussed in later chapters) is often said to take up as much as 80% of an analyst’s time.
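Two of the problems above, inconsistent phone numbers and impossible sensor values, can be caught with very simple checks. The sketch below makes a simplifying assumption that every valid number is a Nigerian mobile number written either as 0 plus ten digits or as 234 plus ten digits; the sample numbers and readings are invented.

```python
import re

ABSOLUTE_ZERO_C = -273.15  # no physical temperature can be lower than this


def normalize_phone(raw):
    """Return the number as +234XXXXXXXXXX, or None if it does not fit.

    ASSUMPTION: valid inputs are '0' + 10 digits or '234' + 10 digits,
    with any mix of spaces, dashes, or a leading '+'.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if digits.startswith("234") and len(digits) == 13:
        national = digits[3:]
    elif digits.startswith("0") and len(digits) == 11:
        national = digits[1:]
    else:
        return None
    return "+234" + national


def is_possible_temperature(celsius):
    """Flag readings below absolute zero as physically impossible."""
    return celsius >= ABSOLUTE_ZERO_C


phones = ["0801 234 5678", "+234 801 234 5678", "2348012345678", "801234"]
normalized = [normalize_phone(p) for p in phones]
print(normalized)

readings = [27.5, 31.0, -500.0, 29.2]
suspect = [r for r in readings if not is_possible_temperature(r)]
print(suspect)
```

The first three phone entries collapse to the same normalized value, which is what lets you recognize them as one respondent. The fourth is returned as None rather than guessed at, so a person can review it.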

Value: What It’s Worth

Value refers to the utility—the potential for the data to drive insight and decision. Terabytes of data are worthless if you can’t extract value from them.

Nigerian context: A bank’s transaction data has high value: it directly informs fraud detection, credit risk assessment, and customer behavior understanding. A photo archive of employee IDs has low value for analytical purposes. Data exhaust (incidental byproducts of systems) like logs of how long customers spend on each page of a website might have value for UX improvement but needs processing to extract it.

Value depends on how the data connects to business questions.

📝 Review Questions: Big Data Five Vs

NIBSS transaction data (Nigeria’s inter-bank payment system) processes millions of transactions daily. Which of the five Vs apply? Explain.
A hospital in Abuja collects patient data: structured (demographics, lab results), semi-structured (XML-formatted doctor notes), and unstructured (patient narratives). Which V does this exemplify? Why does managing this variety pose challenges?
A company has been collecting data for 5 years but has never extracted meaningful business insights from it. What might be the problem from the perspective of the five Vs?
Describe a high-volume, high-velocity, high-variety dataset you might encounter in a Nigerian e-commerce company. What would be the veracity challenges?

The Analytics Value Chain: From Data to Value

Let’s return to the idea we introduced at the start of this chapter: the analytics value chain. Now that you understand data types, scales, and characteristics, we can explore this chain in depth.

The chain is: Data → Information → Insight → Decision → Value

Each stage builds on the previous. Let’s walk through a concrete example from a Nigerian bank.

Stage 1: Data

A bank’s loan department has given you a dataset of 10,000 loans issued over the past two years. Each loan record contains:
- Loan ID
- Customer ID
- Loan amount (Naira)
- Interest rate (%)
- Loan term (months)
- Customer age
- Customer job title
- Loan purpose (personal, home, auto)
- Default status (Yes/No)
- Months since issue

This is raw data: unprocessed, unorganized. A 10,000-row spreadsheet tells you nothing by itself.

Stage 2: Information

You process this data:

Cleaning: Remove duplicate records, handle missing values, correct typos in job titles
Organizing: Sort by issue date, categorize customers by age groups
Aggregating: Compute totals, averages, counts

Now you have information: structured facts with context.

Examples of information derived from the raw loan data:
- “Total loans issued: 10,000; total amount: ₦15 billion”
- “Average loan amount: ₦1.5 million; median: ₦1.2 million”
- “Default rate: 8% (800 loans of 10,000)”
- “By loan purpose: Personal (45%), Home (35%), Auto (20%)”
- “Default rates by age group: Under 30 (12%), 30-40 (7%), Over 40 (5%)”
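The same kinds of summaries can be computed in a few lines. The ten loan records below are invented stand-ins for the 10,000-row dataset, so the resulting figures will not match the ones quoted above; the point is the shape of the calculation.

```python
from statistics import mean, median

# Ten invented loan records (amounts in Naira).
loans = [
    {"amount": 500_000,   "age": 24, "defaulted": True},
    {"amount": 1_200_000, "age": 27, "defaulted": False},
    {"amount": 800_000,   "age": 29, "defaulted": False},
    {"amount": 2_500_000, "age": 33, "defaulted": False},
    {"amount": 1_000_000, "age": 36, "defaulted": True},
    {"amount": 1_500_000, "age": 38, "defaulted": False},
    {"amount": 3_000_000, "age": 41, "defaulted": False},
    {"amount": 1_800_000, "age": 45, "defaulted": False},
    {"amount": 700_000,   "age": 52, "defaulted": False},
    {"amount": 2_000_000, "age": 58, "defaulted": False},
]


def age_group(age):
    """Assign the age bands used in this chapter's example."""
    if age < 30:
        return "Under 30"
    if age <= 40:
        return "30-40"
    return "Over 40"


amounts = [loan["amount"] for loan in loans]
total_amount = sum(amounts)
average_amount = mean(amounts)
median_amount = median(amounts)
default_rate = sum(loan["defaulted"] for loan in loans) / len(loans)

# Default rate within each age group.
grouped = {}
for loan in loans:
    grouped.setdefault(age_group(loan["age"]), []).append(loan["defaulted"])
rate_by_group = {group: sum(flags) / len(flags) for group, flags in grouped.items()}

print(total_amount, average_amount, median_amount, default_rate)
print(rate_by_group)
```

Each output answers a specific question about the portfolio, which is what separates information from the raw rows it was computed from.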

Stage 3: Insight

Now you interpret the information. You look for patterns, ask why, and draw conclusions. This is where analytics truly begins.

From the processed information above, you might notice:
- “Younger customers (under 30) default at 2.4 times the rate of customers over 40”
- “Personal loans have a 10% default rate, but auto loans have only 5%”
- “Default rate has increased from 6% in the first year to 10% in the second year”

These are insights: they suggest causes and implications. Younger customers might have less stable income. Auto loans might be secured (the bank can repossess the car). The increasing default rate might reflect economic downturn.
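The “2.4 times” figure is a ratio of two default rates, which is valid because a default rate is ratio-scale data. A minimal sketch using the age-group rates quoted in Stage 2:

```python
# Default rates by age group, as quoted in Stage 2 of this example.
default_rate_by_age = {"Under 30": 0.12, "30-40": 0.07, "Over 40": 0.05}

# Use the lowest-risk group as the baseline for comparison.
baseline = default_rate_by_age["Over 40"]

relative_rate = {
    group: round(rate / baseline, 1)
    for group, rate in default_rate_by_age.items()
}
print(relative_rate)
```

The calculation tells you how large the gap is, not why it exists. Explanations such as less stable income remain hypotheses until they are tested against further data.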

Stage 4: Decision

Insight informs decision. A bank executive asks: “Given these insights, what should we do?”

Possible decisions:
- Tighten lending standards for customers under 30: Require higher income verification, lower loan-to-value ratios, or higher interest rates to compensate for risk
- Reduce personal loan limits: Personal loans are riskier; cap them at lower amounts
- Adjust interest rates: Increase rates for higher-risk customer segments to compensate for expected defaults
- Enhance collections processes: Implement earlier intervention when customers miss first payment

Stage 5: Value

Now the bank executes these decisions, and value is realized:
- Risk reduction: By making smarter lending decisions, the bank reduces future losses to defaults
- Revenue optimization: By charging higher rates for higher-risk loans, the bank recovers more of the risk premium
- Efficiency: By intervening earlier in delinquencies, collections costs decrease
- Reputation: By lending more carefully, the bank avoids negative publicity from aggressive collections

Value is measurable: reduced loss rates, higher net interest margin, lower collection costs, or improved customer sentiment.

🔑 The Analytics Value Chain in Practice

Stage	What Happens	Example Output
Data	Raw, unprocessed observation	“10,000 loan records in a CSV file”
Information	Processed, organized facts	“Default rate is 8%; average loan is ₦1.5M”
Insight	Interpreted patterns and understanding	“Younger customers default at 2.4x the rate of older customers”
Decision	Choice informed by insight	“Increase interest rates for under-30 borrowers by 2%”
Value	Tangible business improvement	“Expected loan loss reduction of ₦50M annually”

Section Review: Analytics Value Chain

📝 Review Questions: Analytics Value Chain

A Nigerian e-commerce company collects data on every product purchase: what was bought, when, by whom, at what price, using which payment method. This is raw data. Walk through an example of how this could become information, then insight, then a business decision.
Why is it insufficient for an analyst to simply report “The data says X is true”? What additional thinking must happen to convert data into actionable insight?
Consider a telecommunications company with data on customer monthly phone bills, churn (whether they switched to a competitor), and customer service call frequency. Describe one insight this data might reveal, and one decision it could inform.
What can go wrong if a company makes decisions based on information but without generating genuine insight?

Key Performance Indicators vs. Key Predictive Indicators

Businesses are obsessed with metrics—numbers that measure performance. But not all metrics are created equal. Two important categories are often confused: Key Performance Indicators (KPIs) and Key Predictive Indicators (KPIs). (Confusingly, both are abbreviated KPI; context determines which is meant.)

Key Performance Indicators: What’s Happening?

A Key Performance Indicator (KPI) is a metric that measures current performance against a strategic objective. KPIs answer: “How are we doing right now?”

KPIs are lagging indicators—they measure results that have already happened. By the time you see a KPI, the underlying events are in the past.

Nigerian examples:

Customer acquisition cost (CAC): How much, on average, does it cost to bring in one new customer? A fintech company tracks: “Our CAC is ₦2,500 per customer.” This is a KPI—it measures a realized cost that has already occurred. It’s useful for budgeting and evaluating marketing efficiency, but it doesn’t predict future success.
Monthly revenue: A SaaS company in Lagos targets ₦100 million in monthly revenue. They track actual revenue each month. This is a KPI. By the time they know their revenue, the month is over.
Customer satisfaction (via NPS): A bank surveys customers: “Would you recommend us to a friend?” Net Promoter Score is a KPI. It measures past customer satisfaction, not future behavior.
Employee turnover rate: HR department tracks: “We lost 10% of our workforce this year.” This is a lagging indicator—employees have already left.
Return on Marketing Investment (ROMI): “For every ₦1 spent on ads, we generated ₦4 in revenue.” A realized metric, measured after the fact.

KPIs are essential for accountability and strategic tracking, but they’re not predictive. A high NPS doesn’t guarantee customers won’t churn next month. Low employee turnover doesn’t predict future retention.
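Most of the KPIs listed above are simple ratios of totals that are already known. The sketch below uses invented inputs chosen to reproduce the figures in the examples (a CAC of ₦2,500, ₦4 of revenue per ₦1 of ad spend, and 10% turnover). Note that some organizations define ROMI net of spend; here we follow this chapter’s “revenue per Naira spent” wording.

```python
def customer_acquisition_cost(marketing_spend, new_customers):
    """Average spend needed to acquire one new customer."""
    return marketing_spend / new_customers


def revenue_per_naira_spent(revenue_from_ads, ad_spend):
    """Revenue generated for every Naira of advertising."""
    return revenue_from_ads / ad_spend


def turnover_rate(employees_who_left, average_headcount):
    """Share of the workforce that left during the period, as a percentage."""
    return 100 * employees_who_left / average_headcount


# Hypothetical totals for a period that has already ended.
cac = customer_acquisition_cost(marketing_spend=5_000_000, new_customers=2_000)
romi = revenue_per_naira_spent(revenue_from_ads=20_000_000, ad_spend=5_000_000)
turnover = turnover_rate(employees_who_left=10, average_headcount=100)

print(cac, romi, turnover)
```

Every input is a total for a period that is already over, which is what makes these indicators lagging: nothing in the formulas looks forward.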

Key Predictive Indicators: What Will Happen?

A Key Predictive Indicator (also abbreviated KPI, and sometimes called a leading indicator) is a metric that forecasts future performance. These are forward-looking.

Nigerian examples:

Employee engagement scores: An employee’s score on an engagement survey—asking about job satisfaction, opportunities for growth, relationship with manager—predicts whether they’ll leave. A low engagement score today predicts likely turnover within 6 months.
Sales pipeline value: A sales team tracks deals in their pipeline by stage (prospect, proposal, negotiation, close). The value of deals in “negotiation” stage predicts revenue in the coming months. This makes pipeline value a predictive indicator: if the pipeline drops, revenue will likely drop later.
Website traffic and conversion funnel: A trend in website traffic (increasing or decreasing) predicts future e-commerce sales. Rising traffic suggests growing future orders; falling traffic predicts declining revenue unless reversed.
Customer product usage: For a SaaS company, early usage of a product feature by new customers predicts whether they’ll retain and expand. High usage = likely to stay; low usage = likely to churn.
Loan application approval rate: If a bank’s approval rate drops sharply, it signals tightened credit conditions, which predicts lower loan volume and revenue in coming months.
Customer support response time and resolution rate: Poor support metrics predict future churn. Customers with unresolved issues leave.
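A predictive indicator is most useful when someone is watching its trend. The sketch below flags a sharp month-over-month fall in sales pipeline value. The monthly figures and the 10% alert threshold are both hypothetical; a real threshold would have to be tuned to the business.

```python
# Hypothetical value of deals in the pipeline at the end of each month (Naira).
pipeline_by_month = {
    "Jan": 120_000_000,
    "Feb": 125_000_000,
    "Mar": 100_000_000,
}

ALERT_THRESHOLD = -10.0  # ASSUMPTION: alert on a fall of 10% or more


def pct_change(previous, current):
    """Percentage change from the previous value to the current value."""
    return (current - previous) / previous * 100


months = list(pipeline_by_month)
alerts = []
for prev_month, cur_month in zip(months, months[1:]):
    change = pct_change(pipeline_by_month[prev_month],
                        pipeline_by_month[cur_month])
    if change <= ALERT_THRESHOLD:
        alerts.append((cur_month, round(change, 1)))

print(alerts)
```

The alert fires in March, before any revenue figure has moved. That lead time is what a lagging revenue KPI cannot provide.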

The Strategic Implication

Here’s why this distinction matters: KPIs tell you what happened. Predictive indicators tell you what will happen if you don’t act.

An excellent manager monitors both:
- KPIs to know the current state and hold teams accountable
- Predictive indicators to spot trends early and intervene before problems become crises

A bank that tracks lagging customer satisfaction KPIs might discover they’ve lost market share—too late to prevent it. A bank that tracks predictive indicators (like early warning signs in customer behavior) can intervene before satisfaction drops.

🔑 KPIs vs. Predictive Indicators

Aspect	Key Performance Indicator	Key Predictive Indicator
Timing	Measures past/current state	Forecasts future state
Information	What has happened	What will likely happen
Lag	Lagging indicator	Leading indicator
Value	Accountability, tracking	Early warning, intervention
Example	Monthly revenue, NPS score	Pipeline value, website traffic trend

Section Review: KPIs and Prediction

📝 Review Questions: KPIs and Predictive Indicators

A Nigerian bank measures its “non-performing loan ratio” (loans in default divided by total loans). Is this a KPI or a predictive indicator? Why? What predictive indicator might help the bank anticipate future NPLs?
You’re building a dashboard for a retail company. You suggest including “employee engagement score” alongside “sales per employee.” Justify why both matter, and explain which is predictive and which is lagging.
An online lending platform tracks “approval rate” (percent of loan applications approved). Why might a sudden drop in approval rate be a predictive indicator of future problems?

Data Ethics and Governance in the African Context

As you become skilled with data, you gain power. Power without ethical grounding becomes dangerous. The final section of this chapter addresses the human and societal dimensions of data work.

Bias and Fairness

Data reflects the world—including the world’s injustices. If your training data contains historical bias, your models will perpetuate it.

Nigerian example: A loan approval algorithm trained on historical lending data might learn that women receive loans less often than men (a historical bias reflecting discrimination). If an algorithm is built on this data without correction, it might deny loans to women at higher rates. The algorithm isn’t intentionally sexist, but it reflects and amplifies historical bias.

Similar issues arise around:

Geographic bias: Data from urban centers (Lagos, Abuja) is often richer than from rural areas, leading to models that work poorly in underrepresented regions
Language bias: Most NLP (natural language processing) models are trained on English data and perform poorly on Nigerian Pidgin or local languages
Age bias: Algorithms trained on working-age people might not generalize to elderly customers
Income bias: Data from wealthy segments is often overrepresented; models may not work for low-income populations

Your responsibility as an analyst is to:

Audit data for known biases
Test models across subgroups to ensure fairness
Document limitations explicitly
Recommend collecting more representative data
Challenge assumptions that perpetuate bias
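Testing a model across subgroups can start with something as simple as comparing approval rates and accuracy per group. The sketch below assumes a small labelled test set in which each record holds a group label, the model's prediction, and the true outcome. The records and the function name are hypothetical, invented for illustration.

```python
from collections import defaultdict

def rates_by_group(records):
    """Return {group: (approval_rate, accuracy)} for a list of prediction records.

    Each record is (group, predicted_approve, actually_repaid), where the last
    two are booleans. A prediction counts as correct when it matches the outcome.
    """
    counts = defaultdict(lambda: {"n": 0, "approved": 0, "correct": 0})
    for group, predicted, actual in records:
        c = counts[group]
        c["n"] += 1
        c["approved"] += predicted          # True counts as 1
        c["correct"] += (predicted == actual)
    return {
        g: (c["approved"] / c["n"], c["correct"] / c["n"])
        for g, c in counts.items()
    }

# Hypothetical model outputs, invented for illustration.
records = [
    ("men", True, True), ("men", True, True),
    ("men", True, False), ("men", False, False),
    ("women", False, True), ("women", True, True),
    ("women", False, True), ("women", False, False),
]

group_rates = rates_by_group(records)
for group, (approval_rate, accuracy) in group_rates.items():
    print(f"{group}: approval rate {approval_rate:.0%}, accuracy {accuracy:.0%}")
```

A large gap between groups on either number does not by itself prove unfairness, but it is exactly the kind of result that should be documented and investigated before the model is used.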

Data Sovereignty

In the context of African nations, data sovereignty is increasingly important. Data sovereignty means a nation’s right to govern the collection, storage, and use of data within its borders.

Why it matters:

Foreign companies have sometimes extracted African data (user information, behavioral data, location data) and used it without clear local benefit
African data infrastructure is sometimes weak, making local populations dependent on foreign cloud services
Regulatory control over data supports accountability

Nigerian regulators and other African governments are asserting sovereignty, for example by requiring that sensitive data be stored within the country and by giving Africans a say in how data about Africans is used.

As an analyst working in Nigeria or Africa, be aware of:

Where data is stored (locally vs. international servers)
Regulatory requirements for data residency
Your organization’s commitments to local data benefit

Transparency and Explainability

When you build a model or algorithm that affects people’s lives—approving loans, recommending hire/fire decisions, flagging fraud—people deserve to understand how it works.

Nigerian example: If a bank uses a machine learning model to approve or deny loans, customers who are denied deserve to know why. A “black box” that says “denied” is unethical and likely violates NDPR. You should be able to explain which factors the model considered and why it reached its decision.

As you progress to building models (later chapters), remember: interpretability matters. Simpler models that humans can understand are often preferable to complex black-box models, even if the black-box model is slightly more accurate.
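To see what an explainable decision can look like, consider a simple points-based scorecard that records which criteria an applicant failed. This is a minimal sketch, not a real credit policy: the rules, point values, threshold, and field names are all hypothetical.

```python
# Hypothetical scorecard: each rule is (description, test, points). Invented for illustration.
RULES = [
    ("At least 12 months of account history",
     lambda a: a["months_with_bank"] >= 12, 30),
    ("Monthly income at least 3x the instalment",
     lambda a: a["monthly_income"] >= 3 * a["instalment"], 40),
    ("No missed repayments in the past year",
     lambda a: a["missed_repayments"] == 0, 30),
]
APPROVAL_THRESHOLD = 70  # assumed cut-off

def decide(applicant):
    """Return (approved, score, unmet_criteria) so every decision can be explained."""
    score = 0
    unmet = []
    for description, test, points in RULES:
        if test(applicant):
            score += points
        else:
            unmet.append(description)
    return score >= APPROVAL_THRESHOLD, score, unmet

applicant = {
    "months_with_bank": 8,
    "monthly_income": 150_000,
    "instalment": 60_000,
    "missed_repayments": 0,
}
approved, score, reasons = decide(applicant)
print("Approved" if approved else "Denied", f"(score {score})")
for reason in reasons:
    print("Unmet criterion:", reason)
```

Because the model is just a list of rules, a denied customer can be told exactly which criteria were not met, something a black-box model cannot offer without extra tooling.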

Section Review: Data Ethics and Governance

📝 Review Questions: Data Ethics and Governance

A Nigerian bank collects data on customers’ mobile money transfers, SMS text messages with the bank, and financial website browsing history. Under the NDPR, what must the bank do before collecting this data? What limitations must it observe in how it uses this data?
A machine learning model for loan approval is trained on 10 years of historical data from a bank. The data shows that applicants from certain regions were approved less often than applicants from Lagos. How might this bias manifest in the model, and what should the analyst do?
A multinational fintech company is considering building a fraud detection system that serves customers across Africa. What data sovereignty concerns should they consider for Nigerian customers?
An organization wants to use customer data for “improving our services,” but this is vague. Why is this problematic under NDPR, and what clearer purposes should be documented?
Compare and contrast data privacy (NDPR compliance) and data fairness (avoiding bias). Can a system be compliant with privacy law but still unfair?

Chapter Summary: Key Takeaways

Data is everywhere, but context is everything. Raw data becomes information only when processed and organized. Information becomes insight only when interpreted. Insight becomes value only when acted upon.
Types of data determine methods. Structured data fits in tables; unstructured data (text, images) requires specialized tools; semi-structured data (JSON, XML) sits between.
Scales of measurement are not just academic. Knowing whether your data is nominal, ordinal, interval, or ratio determines which statistical operations are valid. Treating nominal data as numeric leads to absurd conclusions.
Big data is about more than size. Volume, velocity, variety, veracity, and value—the five Vs—characterize modern datasets. You must manage all five.
The analytics value chain has five stages: data, information, insight, decision, and value. Your job is moving organizations up this chain.
Lead and lag. Key Performance Indicators measure what has happened. Key Predictive Indicators forecast what will happen. Excellent organizations track both.
Data has ethical weight. As you gain skill with data, commit to privacy (NDPR), fairness (reducing bias), sovereignty (local benefit), and transparency (explainability). These aren’t optional—they’re central to legitimate, sustainable data work.

Exercises

Chapter 2 Exercises

1. Defining Terms (Recall) For each of the following, state whether it is data, information, insight, or decision:
a) A spreadsheet of 50,000 customer transactions with no summary or analysis
b) “Our average customer spent ₦45,000 this year, up from ₦38,000 last year”
c) “Given that spending is up 18% but customer acquisition cost is also up 15%, our real revenue growth is only 3%, and this unsustainable acquisition strategy must change”
d) “We will reduce customer acquisition spending in digital channels by 30% and shift funds to referral incentives”

2. Data Types in Practice (Comprehension) A Nigerian healthcare startup collects the following information for each patient:

Date of birth (Interval)
Medical history (Unstructured)
Blood type (Nominal)
Severity of current condition: mild, moderate, severe (Ordinal)
Blood pressure reading in mmHg (Ratio)
Number of medications currently taking (Ratio)
Preferred hospital location (from a dropdown list) (Nominal)

Classify each as structured, unstructured, or semi-structured. Then identify its scale of measurement.

3. Scales of Measurement Matter (Comprehension) An analyst at a retail company codes customer segments as: Regular (1), Occasional (2), New (3). They then compute the “average segment” as 2.0 and declare that “the typical customer is Occasional.”

Explain why this analysis is wrong. What scale is customer segment? What would be the correct analysis?

4. Five Vs in Your Life (Application) Pick a real-world data source you encounter (social media, banking app, e-commerce site, etc.). Analyze it through the lens of the five Vs. Which Vs apply, and which don’t? How do high volume and velocity affect how that organization must manage data?

5. Analytics Value Chain (Application) You’re an analyst for a Nigerian fast-food chain with 50 locations. You notice that locations in low-income neighborhoods have higher food waste as a percentage of sales than locations in high-income neighborhoods. Walk this through the analytics value chain:
a) What is the raw data here?
b) What is the information (processed facts)?
c) What insight might you draw?
d) What decisions might this insight support?
e) What value might be realized?

6. KPIs and Predictive Indicators (Analysis) For a ride-hailing platform like Uber or Bolt operating in Lagos, identify:
a) Two Key Performance Indicators (lagging metrics that measure current performance)
b) Two Key Predictive Indicators (leading metrics that forecast future performance)
c) For each predictive indicator, explain why it predicts future performance

7. Bias in Data (Analysis) A bank built a credit risk model on 5 years of historical lending data. The model performs well on the overall test set (85% accuracy) but performs worse on female applicants (78% accuracy). How might this have happened? What does “worse” mean for real applicants? What should the bank do?

8. Privacy and Ethics (Comprehension) Under Nigeria’s NDPR, explain what an organization must do before:
a) Collecting location data from users’ mobile phones
b) Using customer email addresses to send marketing messages
c) Analyzing customer purchase history to understand buying patterns

9. Sovereignty and Scale (Synthesis) A multinational SaaS company with customers across Africa wants to consolidate all data in a single cloud region (US-based) for cost efficiency. A Nigerian regulator argues this violates data sovereignty principles.
a) What is the regulator’s concern?
b) What are the company’s practical constraints?
c) How might a compromise be structured?

10. Ethics and Profit (Analysis) A credit scoring algorithm used by Nigerian banks improves profitability by 5% but is found to systematically approve loans at lower rates for applicants from certain ethnic regions (based on names). The differences are statistically small and not the algorithm’s explicit purpose; they are a learned pattern in the data.
a) Is this a technical problem or an ethical problem?
b) Should the bank use the algorithm?
c) What should the bank disclose to regulators and customers?

Chapter Appendix: Formal Definitions and Measurement Theory

Appendix 2.A: Stevens’ Scale of Measurement

The classification of nominal, ordinal, interval, and ratio scales comes from S. S. Stevens’ 1946 paper “On the Theory of Scales of Measurement.” Stevens defined measurement as “the assignment of numerals to objects or events according to rules.”

Formally:

Nominal scale: Numbers are assigned purely as labels with no order. Only equality is defined: two observations either share a label (A = B) or they do not (A ≠ B). Allowed statistics: mode, frequency, association measures.
Ordinal scale: Numbers convey order. The order relations hold: A > B > C. But the magnitude of differences is unknown. Allowed statistics: median, percentiles, monotonic correlation.
Interval scale: Numbers convey order and equal intervals. If the difference A – B equals the difference C – D, then the intervals are equal. Zero is arbitrary (not the absence of the quantity). Allowed statistics: mean, standard deviation, Pearson correlation, t-tests.
Ratio scale: Numbers convey order, equal intervals, and a meaningful zero. If A is twice B, that’s a meaningful statement. All operations allowed.
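The cumulative structure of the four scales, where each scale permits everything the previous one does plus something new, can be sketched as a small lookup. The statistic names below are a simplified subset chosen for illustration, and the function names are hypothetical.

```python
# Scales in increasing order of structure; each inherits the statistics of those before it.
SCALE_ORDER = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that first become meaningful at each scale (simplified subset).
NEW_STATISTICS = {
    "nominal": {"mode", "frequency"},
    "ordinal": {"median", "percentile"},
    "interval": {"mean", "standard deviation"},
    "ratio": {"ratio", "coefficient of variation"},
}

def permitted_statistics(scale):
    """Return every statistic meaningful at this scale, including inherited ones."""
    allowed = set()
    for s in SCALE_ORDER[: SCALE_ORDER.index(scale) + 1]:
        allowed |= NEW_STATISTICS[s]
    return allowed

def is_permitted(statistic, scale):
    """True if the statistic is meaningful for data measured on the given scale."""
    return statistic in permitted_statistics(scale)

print(sorted(permitted_statistics("ordinal")))
print(is_permitted("mean", "ordinal"))
```

A check like this could run before an analysis pipeline computes summaries, so that a mean is never silently taken over a nominal or ordinal column.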

Appendix 2.B: The Problem of Treating Ordinal as Interval

A common mistake in analysis is treating ordinal data (like survey scales 1-5) as if they were interval. This leads to incorrect conclusions.

Why it’s problematic: The mean of ordinal data assumes equal intervals. But ordinal scales don’t guarantee this.

If you ask 100 people “How satisfied are you?” on a 1-5 scale and 80 people answer “1” while 20 answer “5”, the mean is (80×1 + 20×5)/100 = 1.8. But this doesn’t mean the “average satisfaction is 1.8.” It means most people are very dissatisfied, with a small group very satisfied. The median (1) better represents the central tendency.
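The arithmetic in this survey example is easy to verify with Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median

# 80 respondents answered 1 and 20 answered 5 on the 1-5 satisfaction scale.
responses = [1] * 80 + [5] * 20

average = mean(responses)    # pulled upward by the small, very satisfied group
middle = median(responses)   # the answer given by the respondent in the middle

print(f"mean = {average}, median = {middle}")
```

No respondent actually chose a value near the mean, whereas the median matches the answer most people gave.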

However, in practice, many analysts do treat ordinal data as interval (especially Likert scales), accepting a small inaccuracy in exchange for analytical power. This is pragmatic but should be done with awareness of the limitation.

Appendix 2.C: Big Data and the Unreasonable Effectiveness of Data

There’s a phenomenon in machine learning where with enough data, even very simple models work surprisingly well. This is captured in the paper “The Unreasonable Effectiveness of Data” (Halevy, Norvig, Pereira, 2009).

The insight: with sufficient volume and variety, you don’t need perfect models. A simple statistical model on a billion observations might outperform a sophisticated model on a million observations.

This has implications for Nigerian and African data science:

Investment in data collection infrastructure is often more valuable than sophisticated algorithms
More data reduces random sampling error, but it does not remove systematic bias: a very large dataset drawn mostly from urban customers is still unrepresentative of rural ones
Privacy and governance concerns become more acute with larger data

Appendix 2.D: The DIKW Hierarchy and Beyond

The progression Data → Information → Knowledge → Wisdom (commonly abbreviated DIKW) is a classical framework in knowledge management.

Data: Raw facts
Information: Data with context
Knowledge: Actionable understanding
Wisdom: Judgment about when and how to apply knowledge

As an analyst, you typically work in the Data-Information-Insight realm, but you should recognize that decision-makers need wisdom: judgment about whether the insight applies in their context, considering factors (relationships, trust, organizational dynamics) that aren’t in the data.

--- title: "Data Literacy: What Is Data and Why Does It Matter?" number-sections: false --- ## Learning Objectives ::: {.callout-note icon="false"} ## 📘 What You'll Learn in This Chapter By the end of this chapter, you will be able to: - Define data, information, insight, and decision, and explain how they relate to one another - Distinguish between raw data, processed data, and derived data - Classify data by type (structured, unstructured, semi-structured) and recognize examples from Nigerian and African businesses - Understand scales of measurement (nominal, ordinal, interval, ratio) and why they matter for analysis - Explain the five Vs of big data (volume, velocity, variety, veracity, value) and recognize them in real systems - Navigate the analytics value chain from raw data to business value - Distinguish between Key Performance Indicators (KPIs) and Key Predictive Indicators (KPIs) - Understand data ethics, privacy, bias, and governance in the African context, including Nigeria's NDPR ::: --- ## What Is Data? Definitions and the Analytics Value Chain Before we can analyze, we must understand what we are analyzing. Let's start with the fundamental question: **what is data?** In the broadest sense, **data** is recorded information about the world. It's the stock prices from the Nigerian Exchange, the transaction logs from a bank's ATM network, the temperatures recorded by weather stations in Lagos, the text of customer complaints written to a utility company. Data is everywhere. But there's a critical distinction we must make early. Most people use the words "data," "information," "insight," and "decision" interchangeably, but they are not the same thing. Understanding the differences is the foundation of data literacy. **Raw data** is unprocessed, unorganized fact: a list of numbers with no context, millions of database records in their native format, sensor readings streaming in without interpretation. Raw data is often messy, redundant, and incomplete. 
Imagine a CSV file with 2 million rows of bank transactions—that's raw data. By itself, it tells you nothing. **Information** is processed, organized data with context. When you take that CSV file of 2 million transactions, clean it, organize it, and compute summary statistics—"Total transaction volume this month: 500 billion Naira"—you have information. Information has been refined to answer a specific question. **Insight** is interpretation and understanding derived from information. Insight sees patterns, asks "why?", and draws conclusions. An insight might be: "Our transaction volume grows 5% every month, but fraud attempts grow 8% monthly—we need to strengthen our verification systems." Insight connects dots that raw information leaves disconnected. **Decision** is action based on insight. A decision is choosing between alternatives using what you've learned. "We will implement biometric authentication on all ATM withdrawals above 1 million Naira." A good decision is informed by good insight, which comes from good information, which comes from good data. ::: {.callout-tip icon="false"} ## 🔑 The Analytics Value Chain Data → Information → Insight → Decision → Value Each step adds human understanding and judgment. Data without insight is just noise. Insight without decision is just interesting conversation. Decision without execution is just hope. The full chain creates business value. ::: This progression is so important that we'll return to it throughout this book. For now, understand that your job as an analyst is to move stakeholders up this chain: from raw data to actionable insight. ### Raw Data vs. 
Derived Data Data comes in layers of processing: - **Raw data:** Unprocessed observations directly from a source (sensor readings, transaction logs, survey responses as entered) - **Cleaned data:** Raw data with errors removed, missing values handled, and outliers addressed - **Processed data:** Cleaned data organized, formatted, and structured for analysis - **Derived data:** New data computed from processed data (ratios, rankings, aggregations, predictions) As you move down this list, you're adding value but also adding assumptions and potential for error. A derived variable (like "customer lifetime value") is only as good as the underlying raw data and the logic used to compute it. ### The Practical Challenge Here's where it gets real: in the world, you almost never see truly raw data. By the time data reaches an analyst, someone else has usually done preliminary processing. A database schema has imposed structure. A CSV export has formatted numbers and dates in specific ways. Column names have been chosen (sometimes poorly). Your job includes learning to see through these layers, understand what transformations have already happened, and know what quality issues might remain. --- ## Types of Data: Structured, Unstructured, and Semi-Structured Data comes in three primary forms, and your analytical approach depends on which type you're working with. ### Structured Data **Structured data** is organized into predefined categories, formats, and relationships. Think of a table or spreadsheet: rows are observations, columns are variables, every cell follows a clear format. 
**Nigerian examples:** - Bank account records: account number (column 1), customer name (column 2), balance (column 3), account type (column 4) - NBS Consumer Price Index: date, product category, price index value - Stock exchange data: ticker symbol, closing price, trading volume, timestamp - University enrollment: student ID, programme, admission year, GPA - Mobile money transactions: sender phone number, amount transferred, timestamp, receiver Structured data is the easiest to analyze. It fits neatly into tables, databases, and statistical software. Most of the analysis you'll do in your first career will use structured data. ::: {.callout-caution icon="false"} ## 📝 Review Questions: Structured Data 1. Why is structured data easier to analyze than unstructured data? 2. Name two sources of structured data from a Nigerian bank or fintech company. 3. What problems might arise if a database uses inconsistent formats for dates (e.g., some cells "01/02/2024" and others "2024-02-01")? ::: ### Unstructured Data **Unstructured data** has no predefined format or organization. Text, images, audio, video—these are unstructured. The data exists, but it's not organized into rows and columns. **Nigerian examples:** - WhatsApp messages between a business and its customers - Customer service complaints posted on Twitter or written in emails - Medical records and doctor's notes in a hospital system (often written as free text) - Photographs of damaged goods submitted in insurance claims - Audio recordings of customer service calls at a call center - News articles published by Nigerian newspapers online Unstructured data is harder to analyze because it requires additional processing to extract meaning. You can't just put an email message into a spreadsheet and compute its mean. You need specialized tools: natural language processing to extract meaning from text, computer vision to understand images, speech recognition for audio. 
However, unstructured data often contains rich information. A customer's email complaint contains their frustration, the specific product they're upset about, and clues about what went wrong. Standard statistical analysis would miss all of this. ### Semi-Structured Data **Semi-structured data** has some organization but doesn't fit neatly into tables. The classic example is JSON (JavaScript Object Notation) or XML (Extensible Markup Language). **Nigerian examples:** - XML form submissions from government agencies (e.g., tax forms with nested sections) - JSON API responses from financial platforms (e.g., mobile money providers returning user transaction history) - HTML pages from e-commerce sites (product data mixed with presentation formatting) - Email messages (structured headers like "From:" and "Date:", but unstructured body text) - Log files from web servers (timestamps and codes are structured, but error messages are free text) Semi-structured data requires specialized parsing tools—libraries that understand JSON or XML—but less heavy processing than fully unstructured data. --- ## Section Review: Data Types ::: {.callout-caution icon="false"} ## 📝 Review Questions: Data Types and Sources 1. You're working at an e-commerce company in Lagos. The company's database contains product records (product ID, name, price, stock quantity) and also stores customer reviews as free-text comments. Which data is structured and which is unstructured? 2. An African payment processor receives data from member banks. The data arrives as JSON with nested fields for different transaction types. Is this structured or semi-structured data? Why? 3. A health insurance company needs to analyze both policy records (structured: policy number, premium amount, coverage type) and doctor's notes from claims (unstructured: free text). What additional tools or skills would be needed to analyze the doctor's notes compared to the policy records? 4. 
Give an example of semi-structured data you encounter in your daily life, and explain why it's not fully structured. 5. A Nigerian bank wants to analyze customer sentiment from WhatsApp customer service messages. What type of data is this, and what challenges might an analyst face? ::: --- ## Scales of Measurement: Why Numbers Aren't All the Same Here's a truth that seems obvious but whose implications are profound: not all numbers are the same. The number 5 means very different things depending on context. **Scales of measurement** are categories that describe what numbers represent. The scale you're working with determines which statistical operations make sense. There are four scales: **nominal**, **ordinal**, **interval**, and **ratio**. Let's explore each, with Nigerian business examples. ### Nominal Scale: Categories Without Order **Nominal** data describes categories with no inherent order or ranking. The categories are mutually exclusive—something is either in one category or another, not in between. **What you can do:** Count, find mode (most common), test for association between categories. **What you cannot do:** Compute a mean, median, or standard deviation. Ranking doesn't make sense. **Nigerian examples:** 1. **Bank account type:** Savings, Checking, Business, Student. A savings account is not "greater than" or "less than" a checking account—they're just different. If you were to code these as numbers (1 = Savings, 2 = Checking, 3 = Business, 4 = Student), computing the mean (2.5) would be meaningless. 2. **Mobile network:** MTN, Airtel, Glo, 9mobile. These are categories. An MTN customer is not higher or lower than an Airtel customer. If a telecom analyst wanted to know the most popular network, they'd report the mode (most frequent category), not the average. 3. **Product category in a retail store:** Groceries, Electronics, Clothing, Household. These are distinct types, nothing more. 4. **Gender:** As typically recorded (Male, Female). 
Two categories, neither ordered. 5. **State in Nigeria:** Lagos, Kano, Rivers, Kaduna, etc. These are categories; Kano is not "greater than" Lagos. ::: {.callout-caution icon="false"} ## 📝 Review Questions: Nominal Scale 1. A bank records the "department" of each employee (Sales, Operations, HR, IT). What scale is this, and why can't you compute the average department? 2. Why would converting nominal categories to numbers (1, 2, 3, 4) and computing a mean give you meaningless results? ::: ### Ordinal Scale: Categories With Order, But Unequal Gaps **Ordinal** data describes categories with a natural order, but the distances between categories are not equal or meaningful. **What you can do:** Count, find mode, find median, rank, compute correlations. **What you cannot do:** Assume equal intervals between ranks. Computing a mean might be misleading. **Nigerian examples:** 1. **Net Promoter Score (NPS):** A question like "How likely are you to recommend this service to a friend?" answered on a scale of 0-10. You know that 8 > 5 > 2, but is the gap from 2 to 5 the same as the gap from 5 to 8? Probably not. Someone rating 2 versus 5 is a big difference in sentiment; someone rating 5 versus 8 is a smaller difference. The gaps are unequal. 2. **Educational attainment:** Primary, Secondary, Tertiary, Postgraduate. There's an order (Postgraduate > Tertiary > Secondary > Primary), but the "distance" between Primary and Secondary (typically 6 years) is different from Secondary to Tertiary (typically 4 years). 3. **Customer satisfaction survey:** Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied. There's an order, but is the gap from Dissatisfied to Neutral the same as from Neutral to Satisfied? It depends on how people interpret the scale. 4. **Employee performance rating:** Below Expectations, Meets Expectations, Exceeds Expectations, Far Exceeds Expectations. Again, there's an order, but equal intervals between ratings are assumed, not proven. 5. 
**Income bracket for survey respondents:** ₦0–₦50,000/month, ₦50,000–₦100,000/month, ₦100,000–₦250,000/month, ₦250,000+/month. While these have ordering, the intervals differ greatly (the first two are ₦50k wide; the last two are ₦150k and open-ended). With ordinal data, you can sensibly compute a **median** (the middle value) but not a **mean** (which assumes equal intervals). You can say "50% of customers rate us 7 or higher" but not "the average rating is 6.2" without being careful about interpretation. ::: {.callout-caution icon="false"} ## 📝 Review Questions: Ordinal Scale 1. An e-commerce platform asks customers to rate product quality on a scale of 1 (Poor) to 5 (Excellent). Is the difference in sentiment between a 1 and 2 rating the same as between a 4 and 5? Why does this matter? 2. Why is the median more appropriate than the mean for ordinal data? (Hint: think about the assumption that mean makes about equal intervals.) ::: ### Interval Scale: Ordered With Equal Intervals, But No True Zero **Interval** data has ordered categories with equal intervals between values. Critically, there is no true zero—zero doesn't mean "the absence of the quantity." **What you can do:** Everything from ordinal scale, plus compute mean, standard deviation, and correlation. Use most common statistical tests. **What you cannot do:** Divide or multiply meaningfully. "30°C is twice as hot as 15°C" is false because Celsius has no true zero. **Nigerian examples:** 1. **Temperature in Lagos:** A weather station records 28°C one day and 14°C another. The difference is 14 degrees, and differences are meaningful. But 28°C is not twice as hot as 14°C (that only makes sense on the Kelvin scale, which has a true zero). You can compute the average temperature over a week: (28 + 26 + 24 + 22 + 25 + 27 + 26) / 7 = 25.4°C. This is meaningful. 2. **Year (as a number):** The year 2024 and the year 2000 are 24 years apart. The difference is meaningful. 
But the year 2024 is not 1.12 times the year 2000. There's no such thing as "year zero" in this calendar system (year 1 came after year 0, or rather, year 1 BC preceded year 1 AD). 3. **Test scores (with a defined scale):** If a standardized test is designed such that 0 means "no correct answers" but is still not interpreted as "the absence of knowledge," it's interval. Most school tests work this way. 4. **Time of day:** 14:00 (2 PM) and 10:00 (10 AM) are 4 hours apart. You can average times: the average of 10:00 and 14:00 is 12:00. But 14:00 is not "1.4 times" 10:00. The Nigerian weather application reports temperatures—interval data. Averaging them makes sense. Nigerian exam boards publish test scores on a 0-100 scale (interval)—you can meaningfully discuss average performance. ### Ratio Scale: The Gold Standard **Ratio** data has all the properties of interval data plus a true, meaningful zero. Zero means the absence of the quantity. **What you can do:** Everything. All statistical operations, multiplication, and division are meaningful. **Nigerian examples:** 1. **Annual revenue:** A small business earns ₦5 million; a larger one earns ₦10 million. The larger business earns twice as much (10 / 5 = 2). Zero revenue means no money came in—that's a real, meaningful state. 2. **Number of employees:** A startup has 5 employees; a mature company has 15. The mature company has three times as many (15 / 5 = 3). Zero employees means no one works there. 3. **Transaction amount in Naira:** A customer transfers ₦50,000; another transfers ₦100,000. The second transfer is twice as large. ₦0 means no money moved. 4. **Age in years:** A person is 25 years old; another is 50. The second person is twice as old. Age zero (birth) is a real event. 5. **Loan default rate:** Of 1,000 loans, 50 defaulted. The default rate is 5%. A default rate of 0% means no loans defaulted—that's a real state. With ratio data, you can meaningfully compute ratios (hence the name). 
You can say "our 2024 revenue is 1.3 times our 2023 revenue" and that statement carries real weight. --- ## Why Scales of Measurement Matter: The Statistical Consequence The scale of measurement determines which analyses are valid. Using the wrong analysis because you misunderstood your data's scale leads to nonsense results. ::: {.callout-tip icon="false"} ## 🔑 Statistical Operations by Scale | Operation | Nominal | Ordinal | Interval | Ratio | |-----------|---------|---------|----------|-------| | Count frequency | ✓ | ✓ | ✓ | ✓ | | Mode (most common) | ✓ | ✓ | ✓ | ✓ | | Median | ✗ | ✓ | ✓ | ✓ | | Mean | ✗ | ⚠ | ✓ | ✓ | | Standard deviation | ✗ | ⚠ | ✓ | ✓ | | Percentages | ✓ | ✓ | ✓ | ✓ | | Ratio (A/B) | ✗ | ✗ | ✗ | ✓ | | t-test, ANOVA | ✗ | ⚠ | ✓ | ✓ | Note: ✓ = Valid and meaningful; ⚠ = Possible but requires caution; ✗ = Invalid. ::: **Example:** A Nigerian bank classifies customer account types as Savings (coded 1), Checking (coded 2), Business (coded 3), and Student (coded 4). These are nominal data. A careless analyst might compute the average account type as (1 + 2 + 3 + 4) / 4 = 2.5 and conclude that the "average account type is 2.5." This is meaningless gibberish. You cannot average nominal categories. The correct analysis would be to report the distribution: "40% of our customers have Savings accounts, 35% have Checking, 20% have Business, and 5% have Student accounts." --- ## Section Review: Scales of Measurement ::: {.callout-caution icon="false"} ## 📝 Review Questions: Scales of Measurement 1. A hospital in Kano records patient blood pressure readings. Blood pressure is measured in mmHg (millimeters of mercury). Is this nominal, ordinal, interval, or ratio data? Justify your answer. What statistical operations can the hospital safely perform? 2. A retail chain categorizes customers as "Regular," "Occasional," or "First-time." What scale is this? Why can the company find the mode of this variable but not the mean? 
What would the mean tell you (if you computed it anyway)?
3. A telecommunications company records the following for each customer: (a) mobile network (MTN, Airtel, Glo), (b) monthly data usage in GB, (c) satisfaction rating on a 1-5 scale, (d) years as a customer. Identify the scale for each variable.
4. Why is it incorrect to say that a customer who gives a satisfaction rating of 8 is twice as satisfied as one who gives a rating of 4? What scale is satisfaction rating, and what would be the correct interpretation of 8 vs. 4?
5. An analyst is building a machine learning model to predict customer churn. Why does it matter to distinguish between the scales of the input variables?
:::

---

## Big Data Characteristics: The Five Vs

In the modern era, "big data" is a term you'll hear constantly. But what makes data "big"? Size alone (number of gigabytes) is only one dimension. Academics and practitioners have identified five key characteristics, the **five Vs** of big data.

### Volume: Sheer Quantity

**Volume** refers to the amount of data. We're talking terabytes and petabytes: data at a scale that traditional databases struggle with.

**Nigerian context:** The **NIBSS (Nigeria Inter-Bank Settlement System)** processes millions of electronic payment transactions daily. A single large bank in Nigeria might process 10 million+ transactions daily, each transaction a data record with timestamp, amount, sender, receiver, and type. Over a year, that's roughly 3.65 billion transaction records: genuine big data in volume.

Mobile operators like MTN Nigeria and Airtel Africa each track billions of call detail records monthly (when every customer made a call, to whom, and for how long). Combined, this is volume at an astonishing scale.

### Velocity: Speed of Generation

**Velocity** refers to how fast data is generated and how fast it must be processed. Some data is static (census data collected once every 10 years). Other data flows continuously.

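
To connect volume and velocity, it can help to do the arithmetic on the bank example above. The sketch below takes the 10 million transactions per day quoted earlier and, purely for illustration, assumes a hypothetical average record size of 200 bytes; neither figure describes a real bank's systems.

```python
# Back-of-envelope sizing for the single-bank example.
# ASSUMPTION: 200 bytes per record is a made-up illustrative figure.
TRANSACTIONS_PER_DAY = 10_000_000
BYTES_PER_RECORD = 200
SECONDS_PER_DAY = 24 * 60 * 60

# Volume: how many records accumulate over a year.
records_per_year = TRANSACTIONS_PER_DAY * 365

# Velocity: the average arrival rate implied by the same daily figure.
avg_per_second = TRANSACTIONS_PER_DAY / SECONDS_PER_DAY

# Rough raw storage, ignoring indexes, backups, and compression.
gigabytes_per_year = records_per_year * BYTES_PER_RECORD / 1e9

print(f"Records per year: {records_per_year:,}")
print(f"Average transactions per second: {avg_per_second:.1f}")
print(f"Raw storage per year: {gigabytes_per_year:,.0f} GB")
```

The per-second figure is only an average. Real traffic tends to arrive in bursts, so a system usually has to be sized for peak load, not for the mean.
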
**Nigerian context:** Stock market data from the **Nigerian Exchange** streams in real time during trading hours. Cryptocurrency exchanges operating in Nigeria (though in a regulatory gray zone) generate tick-by-tick price updates. Payment platforms like Flutterwave and Remita process transactions second by second.

For a bank's fraud detection system, processing velocity matters: a fraudulent transaction detected in milliseconds can be blocked; detected in hours, it has already done damage. Velocity demands real-time or near-real-time processing.

### Variety: Different Types and Sources

**Variety** refers to heterogeneous data types and sources. Not all data is numbers in a table. You've got structured data (databases), unstructured data (text, images), semi-structured data (JSON APIs), and more.

**Nigerian context:** A fintech company in Lagos might combine:

- Structured: customer account records from their database
- Semi-structured: JSON responses from payment gateway APIs
- Unstructured: customer feedback texts from WhatsApp, email complaints
- Images: photos of identity documents for KYC (Know Your Customer)
- Geographic: GPS coordinates of ATM locations
- Behavioral: clickstream data from their mobile app

Handling this variety requires tools and expertise beyond traditional SQL databases.

### Veracity: Quality and Trustworthiness

**Veracity** refers to data quality: accuracy, consistency, completeness. Much real-world data is messy: missing values, typos, duplicates, contradictions.

**Nigerian context:** In a survey across rural Nigeria, phone numbers might be recorded inconsistently (with or without country code, spaces in different places). In a government database integrating data from multiple agencies, the same person might appear under different IDs if their name was spelled differently. In IoT sensor data from weather stations, a malfunctioning sensor might report impossible values (like -500°C).

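
Two of the veracity problems above (inconsistent phone formats and impossible sensor values) can be caught with simple validation rules. The sketch below is a minimal illustration, not a production cleaner. `normalize_ng_phone` and `is_plausible_temperature` are hypothetical helper names, the phone rule assumes numbers written either as an 11-digit national number starting with 0 or with the 234 country code, and the temperature bounds and sample values are made up for the example.

```python
import re

def normalize_ng_phone(raw):
    """Return the number as +234XXXXXXXXXX, or None if it does not fit the assumed formats."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, and plus signs
    if digits.startswith("234") and len(digits) == 13:
        national = digits[3:]        # strip the country code
    elif digits.startswith("0") and len(digits) == 11:
        national = digits[1:]        # strip the leading trunk zero
    else:
        return None                  # unrecognized: flag for manual review
    return "+234" + national

def is_plausible_temperature(celsius, low=-10.0, high=60.0):
    """ASSUMPTION: -10 to 60 degrees C is an illustrative plausible range."""
    return low <= celsius <= high

# Made-up sample values showing the same number written three ways.
phones = ["0803 123 4567", "+234 803 123 4567", "234-803-1234567", "12345"]
print([normalize_ng_phone(p) for p in phones])

# Made-up readings, including the impossible value from the text.
readings = [31.5, 28.0, -500.0, 33.2]
suspect = [r for r in readings if not is_plausible_temperature(r)]
print(suspect)
```

Returning `None` instead of guessing keeps unparseable values visible, so they can be reviewed instead of silently "fixed".
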
Veracity is why data cleaning (discussed in later chapters) is often said to consume most of an analyst's time; 80% is a commonly quoted figure.

### Value: What It's Worth

**Value** refers to utility: the potential for the data to drive insight and decisions. Terabytes of data are worthless if you can't extract value from them.

**Nigerian context:** A bank's transaction data has high value: it directly informs fraud detection, credit risk assessment, and customer behavior understanding. A photo archive of employee IDs has low value for analytical purposes. Data exhaust (incidental byproducts of systems), like logs of how long customers spend on each page of a website, might have value for UX improvement but needs processing to extract it.

Value depends on how the data connects to business questions.

::: {.callout-caution icon="false"}
## 📝 Review Questions: Big Data Five Vs

1. NIBSS (Nigeria's inter-bank payment system) processes millions of transactions daily. Which of the five Vs apply to its transaction data? Explain.
2. A hospital in Abuja collects patient data: structured (demographics, lab results), semi-structured (XML-formatted doctor notes), and unstructured (patient narratives). Which V does this exemplify? Why does managing this variety pose challenges?
3. A company has been collecting data for 5 years but has never extracted meaningful business insights from it. What might be the problem from the perspective of the five Vs?
4. Describe a high-volume, high-velocity, high-variety dataset you might encounter in a Nigerian e-commerce company. What would be the veracity challenges?
:::

---

## The Analytics Value Chain: From Data to Value

Let's return to the idea we introduced at the start of this chapter: the **analytics value chain**. Now that you understand data types, scales, and characteristics, we can explore this chain in depth.

The chain is:

**Data → Information → Insight → Decision → Value**

Each stage builds on the previous. Let's walk through a concrete example from a Nigerian bank.


### Stage 1: Data

A bank's loan department has given you a dataset of 10,000 loans issued over the past two years. Each loan record contains:

- Loan ID
- Customer ID
- Loan amount (Naira)
- Interest rate (%)
- Loan term (months)
- Customer age
- Customer job title
- Loan purpose (personal, home, auto)
- Default status (Yes/No)
- Months since issue

This is raw data: unprocessed, unorganized. A 10,000-row spreadsheet tells you nothing by itself.

### Stage 2: Information

You process this data:

- **Cleaning:** Remove duplicate records, handle missing values, correct typos in job titles
- **Organizing:** Sort by issue date, categorize customers by age groups
- **Aggregating:** Compute totals, averages, counts

Now you have **information**: structured facts with context.

Examples of information derived from the raw loan data:

- "Total loans issued: 10,000; total amount: ₦15 billion"
- "Average loan amount: ₦1.5 million; median: ₦1.2 million"
- "Default rate: 8% (800 loans of 10,000)"
- "By loan purpose: Personal (45%), Home (35%), Auto (20%)"
- "Default rates by age group: Under 30 (12%), 30-40 (7%), Over 40 (5%)"

### Stage 3: Insight

Now you interpret the information. You look for patterns, ask why, and draw conclusions. This is where analytics truly begins.

From the processed information above, you might notice:

- "Younger customers (under 30) default at 2.4 times the rate of customers over 40"
- "Personal loans have a 10% default rate, but auto loans have only 5%"
- "Default rate has increased from 6% in the first year to 10% in the second year"

These are insights: they suggest causes and implications. Younger customers might have less stable income. Auto loans might be secured (the bank can repossess the car). The increasing default rate might reflect an economic downturn.

### Stage 4: Decision

Insight informs decision. A bank executive asks: "Given these insights, what should we do?"

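
Before turning to the options, it may help to see how figures like those in Stages 2 and 3 could be produced. The sketch below uses plain Python on a handful of made-up loan records. The field names `age`, `purpose`, and `defaulted` are hypothetical stand-ins for the columns listed in Stage 1, and the sample is far too small to reproduce the percentages quoted in the example; it only illustrates the aggregation step (information) and the comparison step (insight).

```python
from collections import defaultdict

# Made-up sample records for illustration only.
loans = [
    {"age": 24, "purpose": "personal", "defaulted": True},
    {"age": 27, "purpose": "auto",     "defaulted": False},
    {"age": 29, "purpose": "personal", "defaulted": True},
    {"age": 35, "purpose": "home",     "defaulted": False},
    {"age": 38, "purpose": "personal", "defaulted": True},
    {"age": 33, "purpose": "auto",     "defaulted": False},
    {"age": 45, "purpose": "home",     "defaulted": False},
    {"age": 52, "purpose": "personal", "defaulted": False},
]

def age_group(age):
    """Bucket an age into the three groups used in the chapter's example."""
    if age < 30:
        return "Under 30"
    if age <= 40:
        return "30-40"
    return "Over 40"

def default_rate_by(records, key_func):
    """Information: share of defaulted loans within each group."""
    totals, defaults = defaultdict(int), defaultdict(int)
    for record in records:
        group = key_func(record)
        totals[group] += 1
        defaults[group] += int(record["defaulted"])
    return {group: defaults[group] / totals[group] for group in totals}

rates = default_rate_by(loans, lambda r: age_group(r["age"]))
for group, rate in rates.items():
    print(f"{group}: {rate:.0%}")

# Insight: compare groups instead of just listing them.
ratio = rates["Under 30"] / rates["30-40"]
print(f"Under-30 borrowers default at {ratio:.1f}x the 30-40 rate in this sample")
```

Passing a different `key_func`, for example `lambda r: r["purpose"]`, gives the breakdown by loan purpose with the same function.
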
Possible decisions:

- **Tighten lending standards for customers under 30:** Require higher income verification, lower loan-to-value ratios, or higher interest rates to compensate for risk
- **Reduce personal loan limits:** Personal loans are riskier; cap them at lower amounts
- **Adjust interest rates:** Increase rates for higher-risk customer segments to compensate for expected defaults
- **Enhance collections processes:** Implement earlier intervention when customers miss their first payment

### Stage 5: Value

Now the bank executes these decisions, and value is realized:

- **Risk reduction:** By making smarter lending decisions, the bank reduces future losses to defaults
- **Revenue optimization:** By charging higher rates for higher-risk loans, the bank recovers more of the risk premium
- **Efficiency:** By intervening earlier in delinquencies, collections costs decrease
- **Reputation:** By lending more carefully, the bank avoids negative publicity from aggressive collections

Value is measurable: reduced loss rates, higher net interest margin, lower collection costs, or improved customer sentiment.

::: {.callout-tip icon="false"}
## 🔑 The Analytics Value Chain in Practice

| Stage | What Happens | Example Output |
|-------|--------------|----------------|
| **Data** | Raw, unprocessed observation | "10,000 loan records in a CSV file" |
| **Information** | Processed, organized facts | "Default rate is 8%; average loan is ₦1.5M" |
| **Insight** | Interpreted patterns and understanding | "Younger customers default at 2.4x the rate of older customers" |
| **Decision** | Choice informed by insight | "Increase interest rates for under-30 borrowers by 2 percentage points" |
| **Value** | Tangible business improvement | "Expected loan loss reduction of ₦50M annually" |
:::

---

## Section Review: Analytics Value Chain

::: {.callout-caution icon="false"}
## 📝 Review Questions: Analytics Value Chain

1. A Nigerian e-commerce company collects data on every product purchase: what was bought, when, by whom, at what price, using which payment method. This is raw data. Walk through an example of how this could become information, then insight, then a business decision.
2. Why is it insufficient for an analyst to simply report "The data says X is true"? What additional thinking must happen to convert data into actionable insight?
3. Consider a telecommunications company with data on customer monthly phone bills, churn (whether they switched to a competitor), and customer service call frequency. Describe one insight this data might reveal, and one decision it could inform.
4. What can go wrong if a company makes decisions based on information but without generating genuine insight?
:::

---

## Key Performance Indicators vs. Key Predictive Indicators

Businesses are obsessed with metrics: numbers that measure performance. But not all metrics are created equal. Two important categories are often confused: **Key Performance Indicators (KPIs)** and **Key Predictive Indicators (KPIs)**. (Confusingly, both are abbreviated KPI; context determines which is meant.)

### Key Performance Indicators: What's Happening?

A **Key Performance Indicator** (KPI) is a metric that measures current performance against a strategic objective. KPIs answer: "How are we doing *right now*?"

KPIs are **lagging indicators**: they measure results that have already happened. By the time you see a KPI, the underlying events are in the past.

**Nigerian examples:**

1. **Customer acquisition cost (CAC):** How much, on average, does it cost to bring in one new customer? A fintech company tracks: "Our CAC is ₦2,500 per customer." This is a KPI: it measures a realized cost that has already occurred. It's useful for budgeting and evaluating marketing efficiency, but it doesn't predict future success.
2. **Monthly revenue:** A SaaS company in Lagos targets ₦100 million in monthly revenue.
They track actual revenue each month. This is a KPI. By the time they know their revenue, the month is over.
3. **Customer satisfaction (via NPS):** A bank surveys customers: "Would you recommend us to a friend?" Net Promoter Score is a KPI. It measures past customer satisfaction, not future behavior.
4. **Employee turnover rate:** The HR department tracks: "We lost 10% of our workforce this year." This is a lagging indicator; the employees have already left.
5. **Return on Marketing Investment (ROMI):** "For every ₦1 spent on ads, we generated ₦4 in revenue." A realized metric, measured after the fact.

KPIs are essential for accountability and strategic tracking, but they're **not predictive**. A high NPS doesn't guarantee customers won't churn next month. Low employee turnover doesn't predict future retention.

### Key Predictive Indicators: What Will Happen?

A **Key Predictive Indicator** (also abbreviated KPI, and sometimes called a **leading indicator**) is a metric that forecasts future performance. These are **forward-looking**.

**Nigerian examples:**

1. **Employee engagement scores:** An employee's score on an engagement survey (asking about job satisfaction, opportunities for growth, and relationship with their manager) predicts whether they'll leave. A low engagement score today predicts likely turnover within 6 months.
2. **Sales pipeline value:** A sales team tracks deals in their pipeline by stage (prospect, proposal, negotiation, close). The value of deals in the "negotiation" stage predicts revenue in the coming months. It's predictive: if the pipeline shrinks, revenue will likely drop later.
3. **Website traffic and conversion funnel:** A trend in website traffic (increasing or decreasing) predicts future e-commerce sales. Rising traffic suggests growing future orders; falling traffic predicts declining revenue unless reversed.
4. **Customer product usage:** For a SaaS company, early usage of a product feature by new customers predicts whether they'll retain and expand.
High usage = likely to stay; low usage = likely to churn.
5. **Loan application approval rate:** If a bank's approval rate drops sharply, it signals tightened credit conditions, which predicts lower loan volume and revenue in coming months.
6. **Customer support response time and resolution rate:** Poor support metrics predict future churn. Customers with unresolved issues leave.

### The Strategic Implication

Here's why this distinction matters: **KPIs tell you *what* happened. Predictive indicators tell you what will happen if you don't act.**

An excellent manager monitors both:

- KPIs to know the current state and hold teams accountable
- Predictive indicators to spot trends early and intervene before problems become crises

A bank that tracks lagging customer satisfaction KPIs might discover they've lost market share, too late to prevent it. A bank that tracks predictive indicators (like early warning signs in customer behavior) can intervene before satisfaction drops.

::: {.callout-tip icon="false"}
## 🔑 KPIs vs. Predictive Indicators

| Aspect | Key Performance Indicator | Key Predictive Indicator |
|--------|---------------------------|--------------------------|
| **Timing** | Measures past/current state | Forecasts future state |
| **Information** | What has happened | What will likely happen |
| **Lag** | Lagging indicator | Leading indicator |
| **Value** | Accountability, tracking | Early warning, intervention |
| **Example** | Monthly revenue, NPS score | Pipeline value, website traffic trend |
:::

---

## Section Review: KPIs and Prediction

::: {.callout-caution icon="false"}
## 📝 Review Questions: KPIs and Predictive Indicators

1. A Nigerian bank measures its "non-performing loan ratio" (loans in default divided by total loans). Is this a KPI or a predictive indicator? Why? What predictive indicator might help the bank anticipate future NPLs?
2. You're building a dashboard for a retail company.
You suggest including "employee engagement score" alongside "sales per employee." Justify why both matter, and explain which is predictive and which is lagging.
3. An online lending platform tracks "approval rate" (percent of loan applications approved). Why might a sudden drop in approval rate be a predictive indicator of future problems?
:::

---

## Data Ethics and Governance in the African Context

As you become skilled with data, you gain power. Power without ethical grounding becomes dangerous. The final section of this chapter addresses the human and societal dimensions of data work.

### Privacy and Consent

When you collect data about people, you're collecting information about their lives, choices, and identities. Ethical data work respects privacy: people should know their data is being collected and should be able to opt out.

**Nigerian context:** The **Nigeria Data Protection Regulation (NDPR)**, which came into effect in 2019, establishes rules for data collection, storage, and processing. Key principles:

- **Consent:** Organizations must get explicit consent before collecting personal data
- **Purpose limitation:** Data collected for one purpose cannot be used for another without consent
- **Data minimization:** Collect only what you need
- **Accuracy:** Keep data accurate and up-to-date
- **Security:** Protect data from unauthorized access
- **Right to access:** People can ask to see what data an organization has about them
- **Right to be forgotten:** People can request deletion of their data

As an analyst, you must:

- Understand what data is personal and sensitive (health, financial, biometric, etc.)
- Never use data beyond its stated purpose
- Anonymize data when analyzing (remove names, IDs, identifying information)
- Follow your organization's data governance policies

### Bias and Fairness

Data reflects the world, including the world's injustices. If your training data contains historical bias, your models will perpetuate it.

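
One simple first check for historical bias is to compare outcome rates across groups before any model is trained. The sketch below does this on a few made-up historical loan decisions. The `gender` and `approved` field names and the records themselves are hypothetical, and a gap in rates is a prompt for investigation, not proof of discrimination on its own.

```python
from collections import Counter

# Made-up historical decisions for illustration only.
history = [
    {"gender": "F", "approved": False},
    {"gender": "F", "approved": True},
    {"gender": "F", "approved": False},
    {"gender": "F", "approved": False},
    {"gender": "M", "approved": True},
    {"gender": "M", "approved": True},
    {"gender": "M", "approved": False},
    {"gender": "M", "approved": True},
]

def approval_rates(records, group_key):
    """Approval rate per group: approved count divided by group size."""
    totals, approved = Counter(), Counter()
    for record in records:
        group = record[group_key]
        totals[group] += 1
        approved[group] += int(record["approved"])
    return {group: approved[group] / totals[group] for group in totals}

group_rates = approval_rates(history, "gender")
gap = group_rates["M"] - group_rates["F"]

print(group_rates)
print(f"Approval-rate gap (M minus F): {gap:.0%}")
```

The same function can be pointed at any grouping column (region, age band, income band) by changing `group_key`, which makes it a cheap audit to run before and after training.
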
**Nigerian example:** A loan approval algorithm trained on historical lending data might learn that women receive loans less often than men (a historical bias reflecting discrimination). If an algorithm is built on this data without correction, it might deny loans to women at higher rates. The algorithm isn't intentionally sexist, but it reflects and amplifies historical bias.

Similar issues arise around:

- **Geographic bias:** Data from urban centers (Lagos, Abuja) is often richer than from rural areas, leading to models that work poorly in underrepresented regions
- **Language bias:** Most NLP (natural language processing) models are trained on English data and perform poorly on Nigerian Pidgin or local languages
- **Age bias:** Algorithms trained on working-age people might not generalize to elderly customers
- **Income bias:** Data from wealthy segments is often overrepresented; models may not work for low-income populations

Your responsibility as an analyst is to:

- Audit data for known biases
- Test models across subgroups to ensure fairness
- Document limitations explicitly
- Recommend collecting more representative data
- Challenge assumptions that perpetuate bias

### Data Sovereignty

In the context of African nations, **data sovereignty** is increasingly important. Data sovereignty means a nation's right to govern the collection, storage, and use of data within its borders.

**Why it matters:**

- Foreign companies have sometimes extracted African data (user information, behavioral data, location data) and used it without clear local benefit
- African data infrastructure is sometimes weak, making local populations dependent on foreign cloud services
- Regulatory control over data supports accountability

Nigerian regulators and other African governments are asserting sovereignty: requiring that sensitive data be stored within the country, and ensuring Africans have a say in how data about Africans is used.

As an analyst working in Nigeria or Africa, be aware of:

- Where data is stored (locally vs. international servers)
- Regulatory requirements for data residency
- Your organization's commitments to local data benefit

### Transparency and Explainability

When you build a model or algorithm that affects people's lives (approving loans, recommending hiring or firing decisions, flagging fraud), people deserve to understand how it works.

**Nigerian example:** If a bank uses a machine learning model to approve or deny loans, customers who are denied deserve to know *why*. A "black box" that says "denied" is unethical and likely violates NDPR. You should be able to explain which factors the model considered and why it reached its decision.

As you progress to building models (later chapters), remember: **interpretability matters**. Simpler models that humans can understand are often preferable to complex black-box models, even if the black-box model is slightly more accurate.

---

## Section Review: Data Ethics and Governance

::: {.callout-caution icon="false"}
## 📝 Review Questions: Data Ethics and Governance

1. A Nigerian bank collects data on customers' mobile money transfers, SMS text messages with the bank, and financial website browsing history. Under the NDPR, what must the bank do before collecting this data? What limitations must it observe in how it uses this data?
2. A machine learning model for loan approval is trained on 10 years of historical data from a bank. The data shows that applicants from certain regions were approved less often than applicants from Lagos. How might this bias manifest in the model, and what should the analyst do?
3. A multinational fintech company is considering building a fraud detection system that serves customers across Africa. What data sovereignty concerns should they consider for Nigerian customers?
4. An organization wants to use customer data for "improving our services," but this is vague.
Why is this problematic under NDPR, and what clearer purposes should be documented?
5. Compare and contrast data privacy (NDPR compliance) and data fairness (avoiding bias). Can a system be compliant with privacy law but still unfair?
:::

---

## Chapter Summary: Key Takeaways

- **Data is everywhere, but context is everything.** Raw data becomes information only when processed and organized. Information becomes insight only when interpreted. Insight becomes value only when acted upon.
- **Types of data determine methods.** Structured data fits in tables; unstructured data (text, images) requires specialized tools; semi-structured data (JSON, XML) sits between.
- **Scales of measurement are not just academic.** Knowing whether your data is nominal, ordinal, interval, or ratio determines which statistical operations are valid. Treating nominal data as numeric leads to absurd conclusions.
- **Big data is about more than size.** Volume, velocity, variety, veracity, and value (the five Vs) characterize modern datasets. You must manage all five.
- **The analytics value chain has five stages:** data, information, insight, decision, and value. Your job is moving organizations up this chain.
- **Lead and lag.** Key Performance Indicators measure what has happened. Key Predictive Indicators forecast what will happen. Excellent organizations track both.
- **Data has ethical weight.** As you gain skill with data, commit to privacy (NDPR), fairness (reducing bias), sovereignty (local benefit), and transparency (explainability). These aren't optional; they're central to legitimate, sustainable data work.

---

## Exercises

::: {.exercises}
#### Chapter 2 Exercises

**1. Defining Terms (Recall)**

For each of the following, state whether it is data, information, insight, or decision:

a) A spreadsheet of 50,000 customer transactions with no summary or analysis
b) "Our average customer spent ₦45,000 this year, up from ₦38,000 last year"
c) "Given that spending is up 18% but customer acquisition cost is also up 15%, our real revenue growth is only 3%, and this unsustainable acquisition strategy must change"
d) "We will reduce customer acquisition spending in digital channels by 30% and shift funds to referral incentives"

**2. Data Types in Practice (Comprehension)**

A Nigerian healthcare startup collects the following information for each patient:

- Date of birth
- Medical history (free-text notes)
- Blood type
- Severity of current condition: mild, moderate, severe
- Blood pressure reading in mmHg
- Number of medications currently taking
- Preferred hospital location (from a dropdown list)

Classify each as structured, unstructured, or semi-structured. Then identify its scale of measurement.

**3. Scales of Measurement Matter (Comprehension)**

An analyst at a retail company codes customer segments as: Regular (1), Occasional (2), New (3). They then compute the "average segment" as 2.0 and declare that "the typical customer is Occasional." Explain why this analysis is wrong. What scale is customer segment? What would be the correct analysis?

**4. Five Vs in Your Life (Application)**

Pick a real-world data source you encounter (social media, banking app, e-commerce site, etc.). Analyze it through the lens of the five Vs. Which Vs apply, and which don't? How do high volume and velocity affect how that organization must manage data?

**5. Analytics Value Chain (Application)**

You're an analyst for a Nigerian fast-food chain with 50 locations. You notice that locations in low-income neighborhoods have higher food waste as a percentage of sales than locations in high-income neighborhoods.
Walk this through the analytics value chain:

a) What is the raw data here?
b) What is the information (processed facts)?
c) What insight might you draw?
d) What decisions might this insight support?
e) What value might be realized?

**6. KPIs and Predictive Indicators (Analysis)**

For a ride-hailing platform like Uber or Bolt operating in Lagos, identify:

a) Two Key Performance Indicators (lagging metrics that measure current performance)
b) Two Key Predictive Indicators (leading metrics that forecast future performance)
c) For each predictive indicator, explain why it predicts future performance

**7. Bias in Data (Analysis)**

A bank built a credit risk model on 5 years of historical lending data. The model performs well on the overall test set (85% accuracy) but performs worse on female applicants (78% accuracy). How might this have happened? What does "worse" mean for real applicants? What should the bank do?

**8. Privacy and Ethics (Comprehension)**

Under Nigeria's NDPR, explain what an organization must do before:

a) Collecting location data from users' mobile phones
b) Using customer email addresses to send marketing messages
c) Analyzing customer purchase history to understand buying patterns

**9. Sovereignty and Scale (Synthesis)**

A multinational SaaS company with customers across Africa wants to consolidate all data in a single cloud region (US-based) for cost efficiency. A Nigerian regulator argues this violates data sovereignty principles.

a) What is the regulator's concern?
b) What are the company's practical constraints?
c) How might a compromise be structured?

**10. Ethics and Profit (Analysis)**

A credit scoring algorithm used by Nigerian banks improves profitability by 5% but is found to systematically approve loans at lower rates for applicants from certain ethnic regions (based on names). The differences are statistically small and not the algorithm's explicit purpose, just a learned pattern in the data.


a) Is this a technical problem or an ethical problem?
b) Should the bank use the algorithm?
c) What should the bank disclose to regulators and customers?
:::

---

## Further Reading

- **Wickham, H., & Grolemund, G. (2017).** *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data.* Chapter 1 covers the data science process. https://r4ds.had.co.nz/
- **Tukey, J. W. (1977).** *Exploratory Data Analysis.* Addison-Wesley. The foundational text on understanding data through visualization and simple statistics.
- **Royston, P., Altman, D. G., & Sauerbrei, W. (2006).** "Dichotomizing continuous predictors in multiple regression: a bad idea." *Statistics in Medicine*, 25(1), 127-141. A technical paper on why treating scales incorrectly is harmful.
- **Nigeria Data Protection Regulation (NDPR) Official Text.** https://ndpr.nitda.gov.ng/. The actual regulation you'll work under in Nigeria.
- **O'Neil, C. (2016).** *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.* Crown Publishers. Excellent on how data systems can perpetuate bias and harm.

---

## Chapter Appendix: Formal Definitions and Measurement Theory

### Appendix 2.A: Stevens' Scale of Measurement

The classification of nominal, ordinal, interval, and ratio scales comes from **S. S. Stevens' 1946 paper "On the Theory of Scales of Measurement."** Stevens defined measurement as "the assignment of numerals to objects or events according to rules."

**Formally:**

- **Nominal scale:** Numbers are assigned purely as labels with no order. Only the identity relation holds: A = A, and A ≠ B for distinct categories. Allowed statistics: mode, frequency, association measures.
- **Ordinal scale:** Numbers convey order. The order relations hold: A > B > C. But the magnitude of differences is unknown. Allowed statistics: median, percentiles, monotonic correlation.
- **Interval scale:** Numbers convey order and equal intervals.
If the difference A - B equals the difference C - D, then the intervals are equal. Zero is arbitrary (not the absence of the quantity). Allowed statistics: mean, standard deviation, Pearson correlation, t-tests.
- **Ratio scale:** Numbers convey order, equal intervals, and a meaningful zero. If A is twice B, that's a meaningful statement. All operations allowed.

### Appendix 2.B: The Problem of Treating Ordinal as Interval

A common mistake in analysis is treating ordinal data (like survey scales 1-5) as if they were interval. This leads to incorrect conclusions.

**Why it's problematic:** The mean of ordinal data assumes equal intervals. But ordinal scales don't guarantee this. If you ask 100 people "How satisfied are you?" on a 1-5 scale and 80 people answer "1" while 20 answer "5", the mean is (80×1 + 20×5)/100 = 1.8. But this doesn't mean the "average satisfaction is 1.8." It means most people are very dissatisfied, with a small group very satisfied. The median (1) better represents the central tendency.

However, in practice, many analysts do treat ordinal data as interval (especially Likert scales), accepting a small inaccuracy in exchange for analytical power. This is pragmatic but should be done with awareness of the limitation.

### Appendix 2.C: Big Data and the Unreasonable Effectiveness of Data

There's a phenomenon in machine learning where, with enough data, even very simple models work surprisingly well. This is captured in the paper **"The Unreasonable Effectiveness of Data" (Halevy, Norvig, Pereira, 2009)**.

The insight: with sufficient volume and variety, you don't need perfect models. A simple statistical model on a billion observations might outperform a sophisticated model on a million observations.

This has implications for Nigerian and African data science:

- Investment in data collection infrastructure is often more valuable than sophisticated algorithms
- Large datasets can compensate for simpler models (patterns emerge despite imperfect models), although more data does not by itself remove bias in how that data was collected
- Privacy and governance concerns become more acute with larger data

### Appendix 2.D: The DIKW Hierarchy and Beyond

The progression Data → Information → Knowledge → Wisdom (often called DIKW) is a classical framework in knowledge management.

- **Data:** Raw facts
- **Information:** Data with context
- **Knowledge:** Actionable understanding
- **Wisdom:** Judgment about *when and how* to apply knowledge

As an analyst, you typically work in the Data-Information-Insight realm, but you should recognize that decision-makers need wisdom: judgment about whether the insight applies in *their* context, considering factors (relationships, trust, organizational dynamics) that aren't in the data.