Data Literacy: What Is Data and Why Does It Matter?
Learning Objectives
What Is Data? Definitions and the Analytics Value Chain
Before we can analyze, we must understand what we are analyzing. Let’s start with the fundamental question: what is data?
In the broadest sense, data is recorded information about the world. It’s the stock prices from the Nigerian Exchange, the transaction logs from a bank’s ATM network, the temperatures recorded by weather stations in Lagos, the text of customer complaints written to a utility company. Data is everywhere.
But there’s a critical distinction we must make early. Most people use the words “data,” “information,” “insight,” and “decision” interchangeably, but they are not the same thing. Understanding the differences is the foundation of data literacy.
Raw data is unprocessed, unorganized fact: a list of numbers with no context, millions of database records in their native format, sensor readings streaming in without interpretation. Raw data is often messy, redundant, and incomplete. Imagine a CSV file with 2 million rows of bank transactions—that’s raw data. By itself, it tells you nothing.
Information is processed, organized data with context. When you take that CSV file of 2 million transactions, clean it, organize it, and compute summary statistics—“Total transaction volume this month: 500 billion Naira”—you have information. Information has been refined to answer a specific question.
Insight is interpretation and understanding derived from information. Insight sees patterns, asks “why?”, and draws conclusions. An insight might be: “Our transaction volume grows 5% every month, but fraud attempts grow 8% monthly—we need to strengthen our verification systems.” Insight connects dots that raw information leaves disconnected.
Decision is action based on insight. A decision is choosing between alternatives using what you’ve learned. “We will implement biometric authentication on all ATM withdrawals above 1 million Naira.” A good decision is informed by good insight, which comes from good information, which comes from good data.
This progression is so important that we’ll return to it throughout this book. For now, understand that your job as an analyst is to move stakeholders up this chain: from raw data to actionable insight.
Raw Data vs. Derived Data
Data comes in layers of processing:
- Raw data: Unprocessed observations directly from a source (sensor readings, transaction logs, survey responses as entered)
- Cleaned data: Raw data with errors removed, missing values handled, and outliers addressed
- Processed data: Cleaned data organized, formatted, and structured for analysis
- Derived data: New data computed from processed data (ratios, rankings, aggregations, predictions)
As you move down this list, you’re adding value but also adding assumptions and potential for error. A derived variable (like “customer lifetime value”) is only as good as the underlying raw data and the logic used to compute it.
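The four layers can be made concrete with a small sketch. The records, field names, and cleaning rule (drop rows whose amount is missing or not positive) are hypothetical, chosen only for illustration.

```python
# Hypothetical raw transaction records, as they might arrive from a source system.
raw = [
    {"customer": "A001", "amount": "50000"},
    {"customer": "A001", "amount": "25000"},
    {"customer": "B002", "amount": ""},        # missing value
    {"customer": "B002", "amount": "100000"},
    {"customer": "C003", "amount": "-1"},      # impossible value
]

# Cleaned: drop records whose amount is missing or not positive (an assumed rule).
cleaned = [r for r in raw if r["amount"] and int(r["amount"]) > 0]

# Processed: convert text to numbers so the data is ready for analysis.
processed = [{"customer": r["customer"], "amount": int(r["amount"])} for r in cleaned]

# Derived: a new variable computed from processed data (total spend per customer).
total_spend = {}
for r in processed:
    total_spend[r["customer"]] = total_spend.get(r["customer"], 0) + r["amount"]

print(total_spend)
```

Notice that customer C003 disappears from the derived result entirely. That is a direct consequence of the cleaning rule, which is exactly the kind of assumption a derived variable silently inherits.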
The Practical Challenge
Here’s where it gets real: in the real world, you almost never see truly raw data. By the time data reaches an analyst, someone else has usually done preliminary processing. A database schema has imposed structure. A CSV export has formatted numbers and dates in specific ways. Column names have been chosen (sometimes poorly).
Your job includes learning to see through these layers, understand what transformations have already happened, and know what quality issues might remain.
Types of Data: Structured, Unstructured, and Semi-Structured
Data comes in three primary forms, and your analytical approach depends on which type you’re working with.
Structured Data
Structured data is organized into predefined categories, formats, and relationships. Think of a table or spreadsheet: rows are observations, columns are variables, every cell follows a clear format.
Nigerian examples:
- Bank account records: account number (column 1), customer name (column 2), balance (column 3), account type (column 4)
- NBS Consumer Price Index: date, product category, price index value
- Stock exchange data: ticker symbol, closing price, trading volume, timestamp
- University enrollment: student ID, programme, admission year, GPA
- Mobile money transactions: sender phone number, amount transferred, timestamp, receiver
Structured data is the easiest to analyze. It fits neatly into tables, databases, and statistical software. Most of the analysis you’ll do early in your career will use structured data.
Unstructured Data
Unstructured data has no predefined format or organization. Text, images, audio, video—these are unstructured. The data exists, but it’s not organized into rows and columns.
Nigerian examples:
- WhatsApp messages between a business and its customers
- Customer service complaints posted on Twitter or written in emails
- Medical records and doctor’s notes in a hospital system (often written as free text)
- Photographs of damaged goods submitted in insurance claims
- Audio recordings of customer service calls at a call center
- News articles published by Nigerian newspapers online
Unstructured data is harder to analyze because it requires additional processing to extract meaning. You can’t just put an email message into a spreadsheet and compute its mean. You need specialized tools: natural language processing to extract meaning from text, computer vision to understand images, speech recognition for audio.
However, unstructured data often contains rich information. A customer’s email complaint contains their frustration, the specific product they’re upset about, and clues about what went wrong. Standard statistical analysis would miss all of this.
Semi-Structured Data
Semi-structured data has some organization but doesn’t fit neatly into tables. The classic example is JSON (JavaScript Object Notation) or XML (Extensible Markup Language).
Nigerian examples:
- XML form submissions from government agencies (e.g., tax forms with nested sections)
- JSON API responses from financial platforms (e.g., mobile money providers returning user transaction history)
- HTML pages from e-commerce sites (product data mixed with presentation formatting)
- Email messages (structured headers like “From:” and “Date:”, but unstructured body text)
- Log files from web servers (timestamps and codes are structured, but error messages are free text)
Semi-structured data requires specialized parsing tools—libraries that understand JSON or XML—but less heavy processing than fully unstructured data.
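As a rough illustration of what such parsing looks like, the sketch below uses Python’s standard `json` module to flatten a nested response into table-like rows. The response and its field names are invented for this example and do not come from any real provider’s API.

```python
import json

# Hypothetical API response; the field names are illustrative only.
response_text = """
{
  "user": {"id": "U123", "phone": "+2348012345678"},
  "transactions": [
    {"id": "T1", "amount": 5000, "meta": {"channel": "app"}},
    {"id": "T2", "amount": 12000}
  ]
}
"""

data = json.loads(response_text)

# Flatten the nested structure into one row per transaction.
rows = [
    {
        "user_id": data["user"]["id"],
        "txn_id": t["id"],
        "amount": t["amount"],
        # Optional nested fields may be absent, which is typical of semi-structured data.
        "channel": t.get("meta", {}).get("channel"),
    }
    for t in data["transactions"]
]

for row in rows:
    print(row)
```

The second transaction has no `meta` block, so its `channel` comes out as `None`. Deciding how to treat such gaps is part of turning semi-structured data into structured data.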
Section Review: Data Types
Scales of Measurement: Why Numbers Aren’t All the Same
Here’s a truth that seems obvious but whose implications are profound: not all numbers are the same. The number 5 means very different things depending on context.
Scales of measurement are categories that describe what numbers represent. The scale you’re working with determines which statistical operations make sense.
There are four scales: nominal, ordinal, interval, and ratio. Let’s explore each, with Nigerian business examples.
Nominal Scale: Categories Without Order
Nominal data describes categories with no inherent order or ranking. The categories are mutually exclusive—something is either in one category or another, not in between.
What you can do: Count, find mode (most common), test for association between categories.
What you cannot do: Compute a mean, median, or standard deviation. Ranking doesn’t make sense.
Nigerian examples:
Bank account type: Savings, Checking, Business, Student. A savings account is not “greater than” or “less than” a checking account—they’re just different. If you were to code these as numbers (1 = Savings, 2 = Checking, 3 = Business, 4 = Student), computing the mean (2.5) would be meaningless.
Mobile network: MTN, Airtel, Glo, 9mobile. These are categories. An MTN customer is not higher or lower than an Airtel customer. If a telecom analyst wanted to know the most popular network, they’d report the mode (most frequent category), not the average.
Product category in a retail store: Groceries, Electronics, Clothing, Household. These are distinct types, nothing more.
Gender: As typically recorded (Male, Female). Two categories, neither ordered.
State in Nigeria: Lagos, Kano, Rivers, Kaduna, etc. These are categories; Kano is not “greater than” Lagos.
Ordinal Scale: Categories With Order, But Unequal Gaps
Ordinal data describes categories with a natural order, but the distances between categories are not equal or meaningful.
What you can do: Count, find mode, find median, rank, compute rank correlations (such as Spearman’s).
What you cannot do: Assume equal intervals between ranks. Computing a mean might be misleading.
Nigerian examples:
Net Promoter Score (NPS) rating: A question like “How likely are you to recommend this service to a friend?” answered on a scale of 0-10. You know that 8 > 5 > 2, but is the gap from 2 to 5 the same as the gap from 5 to 8? Not necessarily. The move from 2 to 5 may reflect a much larger shift in sentiment than the move from 5 to 8, so the gaps cannot be assumed equal.
Educational attainment: Primary, Secondary, Tertiary, Postgraduate. There’s an order (Postgraduate > Tertiary > Secondary > Primary), but the “distance” between Primary and Secondary (typically 6 years) is different from Secondary to Tertiary (typically 4 years).
Customer satisfaction survey: Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied. There’s an order, but is the gap from Dissatisfied to Neutral the same as from Neutral to Satisfied? It depends on how people interpret the scale.
Employee performance rating: Below Expectations, Meets Expectations, Exceeds Expectations, Far Exceeds Expectations. Again, there’s an order, but equal intervals between ratings are assumed, not proven.
Income bracket for survey respondents: ₦0–₦50,000/month, ₦50,000–₦100,000/month, ₦100,000–₦250,000/month, ₦250,000+/month. While these have ordering, the intervals differ greatly (the first two are ₦50k wide; the last two are ₦150k and open-ended).
With ordinal data, you can sensibly compute a median (the middle value) but not a mean (which assumes equal intervals). You can say “50% of customers rate us 7 or higher” but not “the average rating is 6.2” without being careful about interpretation.
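A minimal sketch of ordinal-friendly summaries follows. The survey responses are made up, and `median_low` from Python’s `statistics` module is used so that the median is always one of the observed categories.

```python
from statistics import median_low, mode

# Ordered categories; a category's position records rank only, not distance.
levels = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]

# Hypothetical survey responses.
responses = ["Satisfied", "Neutral", "Very Satisfied", "Satisfied",
             "Dissatisfied", "Satisfied", "Neutral"]

# Convert labels to ranks, take the middle rank, and convert back to a label.
ranks = [levels.index(r) for r in responses]
median_label = levels[median_low(ranks)]

# The mode works directly on the labels.
mode_label = mode(responses)

# A share statement ("X% rate us Satisfied or better") needs only the ordering.
top_two = {"Satisfied", "Very Satisfied"}
share_satisfied = sum(r in top_two for r in responses) / len(responses)

print(median_label, mode_label, round(share_satisfied, 2))
```

None of these summaries depends on the numeric codes being evenly spaced, which is why they remain safe for ordinal data.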
Interval Scale: Ordered With Equal Intervals, But No True Zero
Interval data has ordered categories with equal intervals between values. Critically, there is no true zero—zero doesn’t mean “the absence of the quantity.”
What you can do: Everything from ordinal scale, plus compute mean, standard deviation, and correlation. Use most common statistical tests.
What you cannot do: Form meaningful ratios. “30°C is twice as hot as 15°C” is false because zero on the Celsius scale is an arbitrary reference point, not the absence of heat.
Nigerian examples:
Temperature in Lagos: A weather station records 28°C one day and 14°C another. The difference is 14 degrees, and differences are meaningful. But 28°C is not twice as hot as 14°C (that only makes sense on the Kelvin scale, which has a true zero). You can compute the average temperature over a week: (28 + 26 + 24 + 22 + 25 + 27 + 26) / 7 = 25.4°C. This is meaningful.
Year (as a number): The year 2024 and the year 2000 are 24 years apart. The difference is meaningful. But the year 2024 is not 1.012 times the year 2000 in any meaningful sense, because the calendar’s starting point is an arbitrary convention. (The calendar does not even contain a year zero: 1 BC is followed directly by AD 1.)
Test scores (with a defined scale): On many standardized tests, a score of 0 means “no correct answers on this paper,” not “no knowledge at all,” so the zero point is arbitrary and scores are usually treated as interval data. Most school tests work this way.
Time of day: 14:00 (2 PM) and 10:00 (10 AM) are 4 hours apart. You can average times: the average of 10:00 and 14:00 is 12:00. But 14:00 is not “1.4 times” 10:00.
A Nigerian weather application reports temperatures, which are interval data, so averaging them makes sense. Nigerian exam boards publish test scores on a 0-100 scale (treated as interval), so you can meaningfully discuss average performance.
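The temperature example can be checked with a few lines of code. This is a small sketch using the week of Lagos readings quoted earlier; the helper `to_kelvin` is introduced here only for illustration.

```python
from statistics import mean

# The week of Lagos readings used in the temperature example, in degrees Celsius.
lagos_week_c = [28, 26, 24, 22, 25, 27, 26]

# Means and differences are valid on an interval scale.
avg_c = mean(lagos_week_c)
difference = 28 - 14

def to_kelvin(celsius):
    """Convert Celsius to Kelvin, a scale with a true zero."""
    return celsius + 273.15

# A ratio of Celsius values is arithmetically possible but not meaningful.
naive_ratio = 28 / 14
# On the Kelvin scale the ratio is meaningful, and far from 2.
kelvin_ratio = to_kelvin(28) / to_kelvin(14)

print(round(avg_c, 1), difference, naive_ratio, round(kelvin_ratio, 3))
```

The Celsius ratio comes out as 2, while the Kelvin ratio is only about 1.05, which is why “twice as hot” claims about Celsius readings do not hold up.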
Ratio Scale: The Gold Standard
Ratio data has all the properties of interval data plus a true, meaningful zero. Zero means the absence of the quantity.
What you can do: Everything. All statistical operations, multiplication, and division are meaningful.
Nigerian examples:
Annual revenue: A small business earns ₦5 million; a larger one earns ₦10 million. The larger business earns twice as much (10 / 5 = 2). Zero revenue means no money came in—that’s a real, meaningful state.
Number of employees: A startup has 5 employees; a mature company has 15. The mature company has three times as many (15 / 5 = 3). Zero employees means no one works there.
Transaction amount in Naira: A customer transfers ₦50,000; another transfers ₦100,000. The second transfer is twice as large. ₦0 means no money moved.
Age in years: A person is 25 years old; another is 50. The second person is twice as old. Age zero (birth) is a real event.
Loan default rate: Of 1,000 loans, 50 defaulted. The default rate is 5%. A default rate of 0% means no loans defaulted—that’s a real state.
With ratio data, you can meaningfully compute ratios (hence the name). You can say “our 2024 revenue is 1.3 times our 2023 revenue” and that statement carries real weight.
Why Scales of Measurement Matter: The Statistical Consequence
The scale of measurement determines which analyses are valid. Using the wrong analysis because you misunderstood your data’s scale leads to nonsense results.
Example: A Nigerian bank classifies customer account types as Savings (coded 1), Checking (coded 2), Business (coded 3), and Student (coded 4). These are nominal data. A careless analyst might compute the average account type as (1 + 2 + 3 + 4) / 4 = 2.5 and conclude that the “average account type is 2.5.” This is meaningless gibberish. You cannot average nominal categories.
The correct analysis would be to report the distribution: “40% of our customers have Savings accounts, 35% have Checking, 20% have Business, and 5% have Student accounts.”
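In code, the correct analysis is a frequency count rather than an average. The sketch below uses a made-up sample of 20 customers, sized so that the shares match the percentages quoted above.

```python
from collections import Counter

# Hypothetical sample of 20 customers, chosen to reproduce the shares in the text.
accounts = ["Savings"] * 8 + ["Checking"] * 7 + ["Business"] * 4 + ["Student"] * 1

counts = Counter(accounts)
total = len(accounts)

# Percentage of customers in each category.
distribution = {kind: 100 * n / total for kind, n in counts.items()}

# The mode is the only "typical value" that makes sense for nominal data.
most_common_type = counts.most_common(1)[0][0]

print(distribution)
print(most_common_type)
```

No numeric codes appear anywhere in this analysis, so there is nothing to average by mistake.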
Section Review: Scales of Measurement
Big Data Characteristics: The Five Vs
In the modern era, “big data” is a term you’ll hear constantly. But what makes data “big”? Size alone (number of gigabytes) is only one dimension. Academics and practitioners have identified five key characteristics—the five Vs of big data.
Volume: Sheer Quantity
Volume refers to the amount of data. We’re talking terabytes, petabytes—data at a scale that traditional databases struggle with.
Nigerian context: The NIBSS (Nigeria Inter-Bank Settlement System) processes millions of electronic payment transactions daily. A single large bank in Nigeria might process 10 million+ transactions daily, each transaction a data record with timestamp, amount, sender, receiver, type. Over a year, that’s about 3.65 billion transaction records (10 million × 365), which is genuine big data in volume.
Mobile operators like MTN Nigeria and Airtel Africa each track billions of call detail records monthly—when every customer made a call, to whom, for how long. Combined, this is volume at an astonishing scale.
Velocity: Speed of Generation
Velocity refers to how fast data is generated and how fast it must be processed. Some data is static (census data collected once every 10 years). Other data flows continuously.
Nigerian context: Stock market data from the Nigerian Exchange streams in real-time during trading hours. Cryptocurrency exchanges operating in Nigeria (though in a regulatory gray zone) generate tick-by-tick price updates. Payment platforms like Flutterwave and Remita process transactions second-by-second.
For a bank’s fraud detection system, processing velocity matters: a fraudulent transaction detected in milliseconds can be blocked; detected in hours, it’s already done damage. Velocity demands real-time or near-real-time processing.
Variety: Different Types and Sources
Variety refers to heterogeneous data types and sources. Not all data is numbers in a table. You’ve got structured data (databases), unstructured data (text, images), semi-structured data (JSON APIs), and more.
Nigerian context: A fintech company in Lagos might combine:
- Structured: customer account records from their database
- Semi-structured: JSON responses from payment gateway APIs
- Unstructured: customer feedback texts from WhatsApp, email complaints
- Images: photos of identity documents for KYC (Know Your Customer)
- Geographic: GPS coordinates of ATM locations
- Behavioral: clickstream data from their mobile app
Handling this variety requires tools and expertise beyond traditional SQL databases.
Veracity: Quality and Trustworthiness
Veracity refers to data quality—accuracy, consistency, completeness. Much real-world data is messy: missing values, typos, duplicates, contradictions.
Nigerian context: In a survey across rural Nigeria, phone numbers might be recorded inconsistently (with or without country code, spaces in different places). In a government database integrating data from multiple agencies, the same person might appear under different IDs if their name was spelled differently. In IoT sensor data from weather stations, a malfunctioning sensor might report impossible values (like -500°C).
Veracity is why data cleaning (discussed in later chapters) is often said to take up as much as 80% of an analyst’s time.
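Two of the veracity problems mentioned above, inconsistent phone formats and impossible sensor values, can be caught with simple checks. The sketch below is illustrative only: the function names, the assumed phone pattern (11 digits starting with 0, or 13 digits starting with 234), and the plausible temperature range are all assumptions made for this example.

```python
import re

def normalize_ng_phone(raw):
    """Return a number as +234XXXXXXXXXX, or None if it does not fit the assumed pattern."""
    digits = re.sub(r"\D", "", raw)  # strip spaces, dashes, plus signs
    if digits.startswith("234") and len(digits) == 13:
        return "+" + digits
    if digits.startswith("0") and len(digits) == 11:
        return "+234" + digits[1:]
    return None  # leave unrecognized formats for manual review

def plausible_celsius(value, low=-90.0, high=60.0):
    """Flag air-temperature readings outside an assumed plausible range."""
    return low <= value <= high

phones = ["0803 123 4567", "+234-803-123-4567", "12345"]
readings = [28.0, 31.5, -500.0]

print([normalize_ng_phone(p) for p in phones])
print([r for r in readings if not plausible_celsius(r)])
```

The first two phone entries normalize to the same value, which is what lets duplicates be detected. The third is returned as `None` rather than guessed at.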
Value: What It’s Worth
Value refers to the utility—the potential for the data to drive insight and decision. Terabytes of data are worthless if you can’t extract value from them.
Nigerian context: A bank’s transaction data has high value: it directly informs fraud detection, credit risk assessment, and customer behavior understanding. A photo archive of employee IDs has low value for analytical purposes. Data exhaust (incidental byproducts of systems) like logs of how long customers spend on each page of a website might have value for UX improvement but needs processing to extract it.
Value depends on how the data connects to business questions.
The Analytics Value Chain: From Data to Value
Let’s return to the idea we introduced at the start of this chapter: the analytics value chain. Now that you understand data types, scales, and characteristics, we can explore this chain in depth.
The chain is: Data → Information → Insight → Decision → Value
Each stage builds on the previous. Let’s walk through a concrete example from a Nigerian bank.
Stage 1: Data
A bank’s loan department has given you a dataset of 10,000 loans issued over the past two years. Each loan record contains:
- Loan ID
- Customer ID
- Loan amount (Naira)
- Interest rate (%)
- Loan term (months)
- Customer age
- Customer job title
- Loan purpose (personal, home, auto)
- Default status (Yes/No)
- Months since issue
This is raw data: unprocessed, unorganized. A 10,000-row spreadsheet tells you nothing by itself.
Stage 2: Information
You process this data:
- Cleaning: Remove duplicate records, handle missing values, correct typos in job titles
- Organizing: Sort by issue date, categorize customers by age groups
- Aggregating: Compute totals, averages, counts
Now you have information: structured facts with context.
Examples of information derived from the raw loan data:
- “Total loans issued: 10,000; total amount: ₦15 billion”
- “Average loan amount: ₦1.5 million; median: ₦1.2 million”
- “Default rate: 8% (800 loans of 10,000)”
- “By loan purpose: Personal (45%), Home (35%), Auto (20%)”
- “Default rates by age group: Under 30 (12%), 30-40 (7%), Over 40 (5%)”
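The same kinds of summaries can be sketched in a few lines. The six loans below are invented and far too few to be realistic; they only show the mechanics of aggregating and grouping. The field names and the `age_group` helper are assumptions for this example.

```python
from statistics import mean, median

# Hypothetical miniature loan book; amounts are in Naira.
loans = [
    {"amount": 500_000,   "age": 24, "purpose": "personal", "defaulted": True},
    {"amount": 1_200_000, "age": 35, "purpose": "auto",     "defaulted": False},
    {"amount": 3_000_000, "age": 45, "purpose": "home",     "defaulted": False},
    {"amount": 800_000,   "age": 28, "purpose": "personal", "defaulted": False},
    {"amount": 2_000_000, "age": 52, "purpose": "home",     "defaulted": False},
    {"amount": 1_500_000, "age": 31, "purpose": "personal", "defaulted": True},
]

def age_group(age):
    """Assign a customer to one of the three age bands used in the text."""
    if age < 30:
        return "Under 30"
    if age <= 40:
        return "30-40"
    return "Over 40"

# Aggregating: totals, averages, and an overall rate.
amounts = [loan["amount"] for loan in loans]
total_amount = sum(amounts)
average_amount = mean(amounts)
median_amount = median(amounts)
default_rate = sum(loan["defaulted"] for loan in loans) / len(loans)

# Organizing: default rate within each age group.
groups = {}
for loan in loans:
    g = groups.setdefault(age_group(loan["age"]), {"loans": 0, "defaults": 0})
    g["loans"] += 1
    g["defaults"] += int(loan["defaulted"])
rate_by_group = {name: g["defaults"] / g["loans"] for name, g in groups.items()}

print(total_amount, average_amount, median_amount, round(default_rate, 2))
print(rate_by_group)
```

With only two loans per group, the group rates here are meaningless as statistics. On a real 10,000-row dataset the same code would produce figures like those quoted above.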
Stage 3: Insight
Now you interpret the information. You look for patterns, ask why, and draw conclusions. This is where analytics truly begins.
From the processed information above, you might notice:
- “Younger customers (under 30) default at 2.4 times the rate of customers over 40”
- “Personal loans have a 10% default rate, but auto loans have only 5%”
- “Default rate has increased from 6% in the first year to 10% in the second year”
These are insights: they suggest causes and implications. Younger customers might have less stable income. Auto loans might be secured (the bank can repossess the car). The increasing default rate might reflect economic downturn.
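The comparisons behind these insights are simple arithmetic, sketched below with the percentages quoted in this example. The implied home-loan rate is not a figure given in the text. It is only what the other stated numbers would require if they are all mutually consistent.

```python
# Default rates in percent, taken from the figures quoted in this example.
rate_under_30 = 12
rate_over_40 = 5
relative_rate = rate_under_30 / rate_over_40  # how many times higher

# Share of loans by purpose, and the two purpose-level default rates that are stated.
share = {"personal": 0.45, "home": 0.35, "auto": 0.20}
known_rate = {"personal": 10, "auto": 5}
overall_rate = 8

# The overall rate is a share-weighted average of the purpose-level rates, so the
# unstated home-loan rate can be backed out (an inference, not a quoted figure).
explained = sum(share[p] * known_rate[p] for p in known_rate)
implied_home_rate = (overall_rate - explained) / share["home"]

print(relative_rate, round(implied_home_rate, 1))
```

Cross-checks like this are a cheap way to test whether a set of reported percentages hangs together before building conclusions on them.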
Stage 4: Decision
Insight informs decision. A bank executive asks: “Given these insights, what should we do?”
Possible decisions:
- Tighten lending standards for customers under 30: Require higher income verification, lower loan-to-value ratios, or higher interest rates to compensate for risk
- Reduce personal loan limits: Personal loans are riskier; cap them at lower amounts
- Adjust interest rates: Increase rates for higher-risk customer segments to compensate for expected defaults
- Enhance collections processes: Implement earlier intervention when customers miss first payment
Stage 5: Value
Now the bank executes these decisions, and value is realized:
- Risk reduction: By making smarter lending decisions, the bank reduces future losses to defaults
- Revenue optimization: By charging higher rates for higher-risk loans, the bank recovers more of the risk premium
- Efficiency: By intervening earlier in delinquencies, collections costs decrease
- Reputation: By lending more carefully, the bank avoids negative publicity from aggressive collections
Value is measurable: reduced loss rates, higher net interest margin, lower collection costs, or improved customer sentiment.
Section Review: Analytics Value Chain
Key Performance Indicators vs. Key Predictive Indicators
Businesses are obsessed with metrics—numbers that measure performance. But not all metrics are created equal. Two important categories are often confused: Key Performance Indicators (KPIs) and Key Predictive Indicators (KPIs). (Confusingly, both are abbreviated KPI; context determines which is meant.)
Key Performance Indicators: What’s Happening?
A Key Performance Indicator (KPI) is a metric that measures current performance against a strategic objective. KPIs answer: “How are we doing right now?”
KPIs are lagging indicators—they measure results that have already happened. By the time you see a KPI, the underlying events are in the past.
Nigerian examples:
Customer acquisition cost (CAC): How much, on average, does it cost to bring in one new customer? A fintech company tracks: “Our CAC is ₦2,500 per customer.” This is a KPI—it measures a realized cost that has already occurred. It’s useful for budgeting and evaluating marketing efficiency, but it doesn’t predict future success.
Monthly revenue: A SaaS company in Lagos targets ₦100 million in monthly revenue. They track actual revenue each month. This is a KPI. By the time they know their revenue, the month is over.
Customer satisfaction (via NPS): A bank surveys customers: “Would you recommend us to a friend?” Net Promoter Score is a KPI. It measures past customer satisfaction, not future behavior.
Employee turnover rate: HR department tracks: “We lost 10% of our workforce this year.” This is a lagging indicator—employees have already left.
Return on Marketing Investment (ROMI): “For every ₦1 spent on ads, we generated ₦4 in revenue.” A realized metric, measured after the fact.
KPIs are essential for accountability and strategic tracking, but they’re not predictive. A high NPS doesn’t guarantee customers won’t churn next month. Low employee turnover doesn’t predict future retention.
Key Predictive Indicators: What Will Happen?
A Key Predictive Indicator (also abbreviated KPI, and often called a leading indicator) is a metric that forecasts future performance. These are forward-looking.
Nigerian examples:
Employee engagement scores: An employee’s score on an engagement survey—asking about job satisfaction, opportunities for growth, relationship with manager—predicts whether they’ll leave. A low engagement score today predicts likely turnover within 6 months.
Sales pipeline value: A sales team tracks deals in their pipeline by stage (prospect, proposal, negotiation, close). The value of deals in the “negotiation” stage predicts revenue in the coming months. This makes pipeline value a predictive indicator: if the pipeline drops, revenue will likely drop later.
Website traffic and conversion funnel: A trend in website traffic (increasing or decreasing) predicts future e-commerce sales. Rising traffic suggests growing future orders; falling traffic predicts declining revenue unless reversed.
Customer product usage: For a SaaS company, early usage of a product feature by new customers predicts whether they’ll retain and expand. High usage = likely to stay; low usage = likely to churn.
Loan application approval rate: If a bank’s approval rate drops sharply, it signals tightened credit conditions, which predicts lower loan volume and revenue in coming months.
Customer support response time and resolution rate: Poor support metrics predict future churn. Customers with unresolved issues leave.
The Strategic Implication
Here’s why this distinction matters: KPIs tell you what happened. Predictive indicators tell you what will happen if you don’t act.
An excellent manager monitors both:
- KPIs to know the current state and hold teams accountable
- Predictive indicators to spot trends early and intervene before problems become crises
A bank that tracks lagging customer satisfaction KPIs might discover they’ve lost market share—too late to prevent it. A bank that tracks predictive indicators (like early warning signs in customer behavior) can intervene before satisfaction drops.
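One rough way to check whether a metric behaves like a leading indicator is to compare how strongly it tracks the outcome in the same period versus a later period. The sketch below does this for pipeline value and revenue using invented monthly figures and a hand-written Pearson correlation. A real analysis would need far more data and care about coincidental patterns.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical monthly series, in millions of Naira.
pipeline = [40, 55, 50, 70, 65, 80]
revenue = [22, 20, 28, 25, 35, 33]

# Same-month relationship: pipeline in month t versus revenue in month t.
same_month = pearson(pipeline, revenue)

# One-month lead: pipeline in month t versus revenue in month t + 1.
one_month_lead = pearson(pipeline[:-1], revenue[1:])

print(round(same_month, 2), round(one_month_lead, 2))
```

In this made-up data the lagged relationship is much stronger than the same-month one, which is the pattern you would expect from a genuine leading indicator.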
Section Review: KPIs and Prediction
Data Ethics and Governance in the African Context
As you become skilled with data, you gain power. Power without ethical grounding becomes dangerous. The final section of this chapter addresses the human and societal dimensions of data work.
Privacy and Consent
When you collect data about people, you’re collecting information about their lives, choices, and identities. Ethical data work respects privacy: people should know their data is being collected and should be able to opt out.
Nigerian context: The Nigeria Data Protection Regulation (NDPR), which came into effect in 2019, establishes rules for data collection, storage, and processing. Key principles:
- Consent: Organizations must get explicit consent before collecting personal data
- Purpose limitation: Data collected for one purpose cannot be used for another without consent
- Data minimization: Collect only what you need
- Accuracy: Keep data accurate and up-to-date
- Security: Protect data from unauthorized access
- Right to access: People can ask to see what data an organization has about them
- Right to be forgotten: People can request deletion of their data
As an analyst, you must:
- Understand what data is personal and sensitive (health, financial, biometric, etc.)
- Never use data beyond its stated purpose
- Anonymize data when analyzing (remove names, IDs, identifying information)
- Follow your organization’s data governance policies
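As one possible way to strip direct identifiers before analysis, the sketch below drops names and replaces account numbers with a keyed hash, using Python’s standard `hmac` and `hashlib` modules. The records, key, and field names are hypothetical. Strictly speaking this is pseudonymization rather than full anonymization, because anyone holding the key could re-link the records, so it should be treated as a first step and not a guarantee.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-dataset"

def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a keyed hash so records can still be linked."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical customer records containing direct identifiers.
records = [
    {"name": "Ada Obi", "account": "0123456789", "amount": 50_000},
    {"name": "Ada Obi", "account": "0123456789", "amount": 20_000},
    {"name": "Musa Bello", "account": "9876543210", "amount": 75_000},
]

# Keep only what the analysis needs: drop the name, replace the account number.
analysis_rows = [
    {"customer_key": pseudonymize(r["account"]), "amount": r["amount"]}
    for r in records
]

for row in analysis_rows:
    print(row)
```

The two rows belonging to the same customer share a key, so per-customer totals can still be computed without the analyst ever seeing a name or account number.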
Bias and Fairness
Data reflects the world—including the world’s injustices. If your training data contains historical bias, your models will perpetuate it.
Nigerian example: A loan approval algorithm trained on historical lending data might learn that women receive loans less often than men (a historical bias reflecting discrimination). If an algorithm is built on this data without correction, it might deny loans to women at higher rates. The algorithm isn’t intentionally sexist, but it reflects and amplifies historical bias.
Similar issues arise around:
- Geographic bias: Data from urban centers (Lagos, Abuja) is often richer than from rural areas, leading to models that work poorly in underrepresented regions
- Language bias: Most NLP (natural language processing) models are trained on English data and perform poorly on Nigerian Pidgin or local languages
- Age bias: Algorithms trained on working-age people might not generalize to elderly customers
- Income bias: Data from wealthy segments is often overrepresented; models may not work for low-income populations
Your responsibility as an analyst is to:
- Audit data for known biases
- Test models across subgroups to ensure fairness
- Document limitations explicitly
- Recommend collecting more representative data
- Challenge assumptions that perpetuate bias
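Testing a model across subgroups can start with something as simple as computing accuracy separately for each group. The sketch below assumes you already have, for each applicant, a group label, the actual outcome, and the model’s prediction. The data and the function name `accuracy_by_group` are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_group(rows):
    """Return the share of correct predictions within each group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, actual, predicted in rows:
        totals[group] += 1
        hits[group] += int(actual == predicted)
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical (group, actual outcome, model prediction) triples.
results = [
    ("male", "repaid", "repaid"),
    ("male", "repaid", "repaid"),
    ("male", "default", "default"),
    ("male", "repaid", "default"),
    ("male", "repaid", "repaid"),
    ("female", "repaid", "repaid"),
    ("female", "repaid", "default"),
    ("female", "default", "default"),
    ("female", "repaid", "default"),
    ("female", "repaid", "repaid"),
]

overall = sum(actual == predicted for _, actual, predicted in results) / len(results)
per_group = accuracy_by_group(results)

print(overall, per_group)
```

An overall figure can hide a gap like the one here. Accuracy alone also does not show which kind of error each group suffers, so a fuller audit would compare wrongful denials and wrongful approvals separately.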
Data Sovereignty
In the context of African nations, data sovereignty is increasingly important. Data sovereignty means a nation’s right to govern the collection, storage, and use of data within its borders.
Why it matters:
- Foreign companies have sometimes extracted African data (user information, behavioral data, location data) and used it without clear local benefit
- African data infrastructure is sometimes weak, making local populations dependent on foreign cloud services
- Regulatory control over data supports accountability
Nigerian regulators and other African governments are asserting sovereignty: requiring that sensitive data be stored within the country, ensuring Africans have a say in how data about Africans is used.
As an analyst working in Nigeria or Africa, be aware of:
- Where data is stored (locally vs. international servers)
- Regulatory requirements for data residency
- Your organization’s commitments to local data benefit
Transparency and Explainability
When you build a model or algorithm that affects people’s lives (approving loans, recommending hiring or firing decisions, flagging fraud), people deserve to understand how it works.
Nigerian example: If a bank uses a machine learning model to approve or deny loans, customers who are denied deserve to know why. A “black box” that says “denied” is unethical and likely violates NDPR. You should be able to explain which factors the model considered and why it reached its decision.
As you progress to building models (later chapters), remember: interpretability matters. Simpler models that humans can understand are often preferable to complex black-box models, even if the black-box model is slightly more accurate.
Section Review: Data Ethics and Governance
Chapter Summary: Key Takeaways
Data is everywhere, but context is everything. Raw data becomes information only when processed and organized. Information becomes insight only when interpreted. Insight becomes value only when acted upon.
Types of data determine methods. Structured data fits in tables; unstructured data (text, images) requires specialized tools; semi-structured data (JSON, XML) sits between.
Scales of measurement are not just academic. Knowing whether your data is nominal, ordinal, interval, or ratio determines which statistical operations are valid. Treating nominal data as numeric leads to absurd conclusions.
Big data is about more than size. Volume, velocity, variety, veracity, and value—the five Vs—characterize modern datasets. You must manage all five.
The analytics value chain has five stages: data, information, insight, decision, and value. Your job is moving organizations up this chain.
Lead and lag. Key Performance Indicators measure what has happened. Key Predictive Indicators forecast what will happen. Excellent organizations track both.
Data has ethical weight. As you gain skill with data, commit to privacy (NDPR), fairness (reducing bias), sovereignty (local benefit), and transparency (explainability). These aren’t optional—they’re central to legitimate, sustainable data work.
Exercises
Chapter 2 Exercises
1. Defining Terms (Recall) For each of the following, state whether it is data, information, insight, or decision:
a) A spreadsheet of 50,000 customer transactions with no summary or analysis
b) “Our average customer spent ₦45,000 this year, up from ₦38,000 last year”
c) “Given that spending is up 18% but customer acquisition cost is also up 15%, our real revenue growth is only 3%, and this unsustainable acquisition strategy must change”
d) “We will reduce customer acquisition spending in digital channels by 30% and shift funds to referral incentives”
2. Data Types in Practice (Comprehension) A Nigerian healthcare startup collects the following information for each patient:
- Date of birth
- Medical history
- Blood type
- Severity of current condition: mild, moderate, severe
- Blood pressure reading in mmHg
- Number of medications currently taking
- Preferred hospital location (from a dropdown list)
Classify each as structured, unstructured, or semi-structured. Then identify its scale of measurement.
3. Scales of Measurement Matter (Comprehension) An analyst at a retail company codes customer segments as: Regular (1), Occasional (2), New (3). They then compute the “average segment” as 2.0 and declare that “the typical customer is Occasional.”
Explain why this analysis is wrong. What scale is customer segment? What would be the correct analysis?
4. Five Vs in Your Life (Application) Pick a real-world data source you encounter (social media, banking app, e-commerce site, etc.). Analyze it through the lens of the five Vs. Which Vs apply, and which don’t? How do high volume and velocity affect how that organization must manage data?
5. Analytics Value Chain (Application) You’re an analyst for a Nigerian fast-food chain with 50 locations. You notice that locations in low-income neighborhoods have higher food waste as a percentage of sales than locations in high-income neighborhoods. Walk this through the analytics value chain:
a) What is the raw data here?
b) What is the information (processed facts)?
c) What insight might you draw?
d) What decisions might this insight support?
e) What value might be realized?
6. KPIs and Predictive Indicators (Analysis) For a ride-hailing platform like Uber or Bolt operating in Lagos, identify:
a) Two Key Performance Indicators (lagging metrics that measure current performance)
b) Two Key Predictive Indicators (leading metrics that forecast future performance)
c) For each predictive indicator, explain why it predicts future performance
7. Bias in Data (Analysis) A bank built a credit risk model on 5 years of historical lending data. The model performs well on the overall test set (85% accuracy) but performs worse on female applicants (78% accuracy). How might this have happened? What does “worse” mean for real applicants? What should the bank do?
8. Privacy and Ethics (Comprehension) Under Nigeria’s NDPR, explain what an organization must do before:
a) Collecting location data from users’ mobile phones
b) Using customer email addresses to send marketing messages
c) Analyzing customer purchase history to understand buying patterns
9. Sovereignty and Scale (Synthesis) A multinational SaaS company with customers across Africa wants to consolidate all data in a single cloud region (US-based) for cost efficiency. A Nigerian regulator argues this violates data sovereignty principles.
a) What is the regulator’s concern?
b) What are the company’s practical constraints?
c) How might a compromise be structured?
10. Ethics and Profit (Analysis) A credit scoring algorithm used by Nigerian banks improves profitability by 5% but is found to systematically approve loans at lower rates for applicants from certain ethnic regions (based on names). The differences are statistically small and not the algorithm’s explicit purpose; they are a pattern learned from the data.
a) Is this a technical problem or an ethical problem?
b) Should the bank use the algorithm?
c) What should the bank disclose to regulators and customers?
Further Reading
- Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Chapter 1 covers the data science process. https://r4ds.had.co.nz/
- Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley. The foundational text on understanding data through visualization and simple statistics.
- Royston, P., Altman, D. G., & Sauerbrei, W. (2006). “Dichotomizing continuous predictors in multiple regression: a bad idea.” Statistics in Medicine, 25(1), 127-141. A technical paper on how collapsing a continuous variable into two categories throws away information.
- Nigeria Data Protection Regulation (NDPR) Official Text. https://ndpr.nitda.gov.ng/. The actual regulation you’ll work under in Nigeria.
- O’Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishers. Excellent on how data systems can perpetuate bias and harm.
Chapter Appendix: Formal Definitions and Measurement Theory
Appendix 2.A: Stevens’ Scale of Measurement
The classification of nominal, ordinal, interval, and ratio scales comes from S. S. Stevens’ 1946 paper “On the Theory of Scales of Measurement.” Stevens defined measurement as “the assignment of numerals to objects or events according to rules.”
Formally:
Nominal scale: Numbers are assigned purely as labels with no order. Only equality and inequality are meaningful: two observations either belong to the same category or they do not. Allowed statistics: mode, frequency, association measures.
Ordinal scale: Numbers convey order. The order relations hold: A > B > C. But the magnitude of differences is unknown. Allowed statistics: median, percentiles, rank (monotonic) correlation.
Interval scale: Numbers convey order and equal intervals. Equal differences between numbers represent equal differences in the quantity, so if A – B equals C – D, the two gaps are the same size. Zero is arbitrary (not the absence of the quantity). Allowed statistics: mean, standard deviation, Pearson correlation, t-tests.
Ratio scale: Numbers convey order, equal intervals, and a meaningful zero. If A is twice B, that’s a meaningful statement. All arithmetic operations are allowed, including ratios.
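The cumulative nature of these scales can be made concrete with a small lookup. The following is a minimal sketch, not a standard library or an established API: the names `Scale`, `NEW_AT_SCALE`, `allowed_statistics`, and `is_valid` are hypothetical, and the ratio-level entries are examples chosen for illustration. It encodes the idea that each scale inherits every statistic permitted at the scales below it.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Stevens' four scales, ordered from least to most informative."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Statistics that first become meaningful at each scale, following the
# definitions above. The RATIO entries are illustrative assumptions.
NEW_AT_SCALE = {
    Scale.NOMINAL: {"mode", "frequency"},
    Scale.ORDINAL: {"median", "percentiles", "rank correlation"},
    Scale.INTERVAL: {"mean", "standard deviation", "Pearson correlation"},
    Scale.RATIO: {"ratios", "coefficient of variation"},
}


def allowed_statistics(scale):
    """Return every statistic valid at `scale`, including inherited ones."""
    allowed = set()
    for level in Scale:
        if level <= scale:
            allowed |= NEW_AT_SCALE[level]
    return allowed


def is_valid(statistic, scale):
    """Check whether a named statistic is meaningful for data on `scale`."""
    return statistic in allowed_statistics(scale)


# Customer segments coded 1, 2, 3 are nominal, so a mean is not meaningful.
print(is_valid("mean", Scale.NOMINAL))
# A median is fine for ordinal data such as mild / moderate / severe.
print(is_valid("median", Scale.ORDINAL))
```

A check like this could sit in front of a reporting function so that an inappropriate summary is flagged before it reaches a dashboard.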
Appendix 2.B: The Problem of Treating Ordinal as Interval
A common mistake in analysis is treating ordinal data (like survey scales 1-5) as if it were interval data. This leads to incorrect conclusions.
Why it’s problematic: Computing a mean assumes that the gap between 1 and 2 is the same size as the gap between 4 and 5. Ordinal scales don’t guarantee this.
If you ask 100 people “How satisfied are you?” on a 1-5 scale and 80 people answer “1” while 20 answer “5”, the mean is (80×1 + 20×5)/100 = 1.8. But 1.8 is not a meaningful “average satisfaction”: nobody gave an answer near it, and the number depends on assuming that every step on the scale is the same size. What the data actually shows is that most people are very dissatisfied, with a small group very satisfied. The median (1) better represents the central tendency.
However, in practice, many analysts do treat ordinal data as interval (especially Likert scales), accepting a small inaccuracy in exchange for analytical power. This is pragmatic but should be done with awareness of the limitation.
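The satisfaction example can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative. It uses `median_low` rather than `median` so that the result is always one of the actual answer categories, which suits ordinal data.

```python
from collections import Counter
from statistics import mean, median_low, mode

# 100 survey responses on a 1-5 scale: 80 people answer 1, 20 answer 5.
responses = [1] * 80 + [5] * 20

average = mean(responses)          # treats the scale as interval
middle = median_low(responses)     # needs only the ordering of answers
most_common = mode(responses)      # needs only category labels
distribution = Counter(responses)  # how many people chose each answer

print("mean:", average)
print("median:", middle)
print("mode:", most_common)
print("distribution:", dict(distribution))
```

Reporting the distribution alongside any single summary number makes the split between the two groups visible, which the mean alone hides.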
Appendix 2.C: Big Data and the Unreasonable Effectiveness of Data
There’s a phenomenon in machine learning where with enough data, even very simple models work surprisingly well. This is captured in the paper “The Unreasonable Effectiveness of Data” (Halevy, Norvig, Pereira, 2009).
The insight: with sufficient volume and variety, you don’t need perfect models. A simple statistical model on a billion observations might outperform a sophisticated model on a million observations.
This has implications for Nigerian and African data science:
- Investment in data collection infrastructure is often more valuable than sophisticated algorithms
- Large datasets can compensate for simple models, because patterns emerge despite imperfect models. Volume alone does not correct a biased sample, however
- Privacy and governance concerns become more acute with larger data
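The following simulation does not reproduce the result of the Halevy, Norvig, and Pereira paper. It illustrates a narrower statistical fact that sits behind it: the error of even the simplest estimator, the sample mean, shrinks as the number of observations grows. All values here (a true mean of 50, a spread of 10, the sample sizes, and the function name `mean_abs_error`) are assumptions chosen for illustration, and the data is synthetic.

```python
import random


def mean_abs_error(sample_size, trials=200, true_mean=50.0, spread=10.0, seed=0):
    """Average absolute error of the sample mean over repeated random samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_error = 0.0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, spread) for _ in range(sample_size)]
        estimate = sum(sample) / len(sample)
        total_error += abs(estimate - true_mean)
    return total_error / trials


error_small = mean_abs_error(sample_size=10)
error_large = mean_abs_error(sample_size=1000)

print("typical error with 10 observations:  ", round(error_small, 3))
print("typical error with 1000 observations:", round(error_large, 3))
```

The simulation draws every sample fairly from the same population. If the collection process itself were skewed, for example by surveying only urban customers, adding more observations would shrink the noise but leave the skew in place.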
Appendix 2.D: The DIKW Hierarchy and Beyond
The progression Data → Information → Knowledge → Wisdom (usually abbreviated DIKW) is a classical framework in knowledge management. The value chain used in this chapter places insight roughly where DIKW places knowledge.
- Data: Raw facts
- Information: Data with context
- Knowledge: Actionable understanding
- Wisdom: Judgment about when and how to apply knowledge
As an analyst, you typically work in the Data-Information-Insight realm, but you should recognize that decision-makers need wisdom: judgment about whether the insight applies in their context, considering factors (relationships, trust, organizational dynamics) that aren’t in the data.