Understanding Data Types
In this module, we’re continuing our data science journey by exploring data types. In our previous session, we learned about tidy data organization. Now we’ll focus on the different kinds of information that fill those neatly organized tables. Understanding data types is essential for working with data effectively, regardless of which programming language you eventually use.
What Are Data Types?
Every piece of information in your dataset has a specific type, just as the physical world has different kinds of objects such as books, cups, and phones. In data science, we categorize information into specific “types” that tell us:
- What kind of information we’re working with
- What operations we can perform on that information
- How to store and process the information efficiently
Think of data types as containers designed specifically for different kinds of information.
Understanding Common Data Types
When you begin working with data, you’ll quickly notice it comes in different flavors. The first step in any analysis is understanding what kind of data you have. Broadly, data falls into two main families: numerical and categorical.
Numerical Data
Numerical data represents measurable quantities—anything you can perform mathematical operations on, like addition or multiplication.
- Integers: These are your classic whole numbers, both positive and negative, without any decimal points. They’re perfect for counting things, like the number of items in a shopping cart, assigning user IDs, or listing someone’s age.
- Examples: 1, 42, -7, 0
- Floating-Point Numbers (Floats): These are numbers that include a decimal point, giving you more precision. You’ll use floats for measurements, percentages, or any value where partial numbers are important.
- Examples: 3.14, -0.5, 98.6
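As a small illustration, here is how one language, Python, tells the two numerical types apart. Python is only an example choice, the variable names are made up for this sketch, and other languages use different type names.

```python
# Hypothetical values, chosen only for illustration.
items_in_cart = 42        # a whole number: an integer
body_temperature = 98.6   # has a decimal point: a float

print(type(items_in_cart).__name__)     # int
print(type(body_temperature).__name__)  # float

# Arithmetic works on both numerical types.
total = items_in_cart + 1      # integer plus integer stays an integer
average = items_in_cart / 4    # in Python, "/" always produces a float
print(total, average)          # 43 10.5
```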
Categorical Data
Categorical data represents qualities or labels used to group information. You can’t perform math on these, but they are essential for classifying and organizing your data.
- Strings: A string is simply text. It can be a single character, a word, or an entire sentence. Strings are used for names, addresses, product descriptions, and any other text-based information.
- Examples: "Hello", "Jane Smith", "42 Main Street"
- Heads up! Even if a string contains only digits (like a zip code “90210”), it’s treated as text, not a number you can add or subtract.
- Booleans: Booleans are simple yet powerful, with only two possible values: True or False. Think of them as the answer to any yes/no question, such as “Is the user subscribed?” or “Is the item in stock?”
- Examples: True, False
- Factors: These represent values drawn from a limited set of categories. This type is especially important and comes in two forms:
- Nominal (Unordered): The categories have no intrinsic order or rank. For example, the colors “Red,” “Blue,” and “Green” are distinct categories, but none is inherently greater than another.
- Examples: Blood types (A, B, AB, O), Genres (Fiction, Non-fiction).
- Ordinal (Ordered): These categories have a meaningful sequence. For instance, education levels have a clear hierarchy, and a survey response of “Excellent” is definitively better than “Poor.” This order allows for comparisons like “greater than” or “less than.”
- Examples: Education (High School, Bachelor’s, Master’s), Ratings (Poor, Fair, Good, Excellent).
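The categorical types above can be sketched in Python using only built-in features. This is one possible illustration, not a standard approach: the names `RATING_LEVELS` and `rating_rank` are hypothetical, and dedicated data tools usually offer their own factor or category type.

```python
# A ZIP code made only of digits is still text.
zip_code = "90210"
print(zip_code + "1")      # joins the text: 902101
print(int(zip_code) + 1)   # converts to a number first: 90211

# A boolean answers a yes/no question.
is_subscribed = True

# Ordinal categories: the position in this list defines the rank.
RATING_LEVELS = ["Poor", "Fair", "Good", "Excellent"]

def rating_rank(rating: str) -> int:
    """Return the position of a rating in the ordered list of levels."""
    return RATING_LEVELS.index(rating)

print(rating_rank("Excellent") > rating_rank("Poor"))  # True

# Nominal categories: only "same or different" is meaningful.
blood_type = "AB"
print(blood_type == "O")   # False
```

Representing the order as a list keeps the ranking explicit: changing the scale means editing one line rather than every comparison.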
A Special Case: Missing Data
Sometimes, a piece of information is simply not there. We represent this absence with a special value, often called NULL or NA. This isn’t the same as zero or an empty string; it’s a placeholder that specifically means the value is unknown, was never recorded, or does not apply. It can appear in a column of any data type.
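Mishandling this concept can have unexpected real-world consequences. One security researcher discovered this after getting a vanity license plate that read ‘NULL’: he reportedly began receiving parking tickets that had been issued to other vehicles with no plate number on record.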
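To make the difference between "missing" and "zero" concrete, here is a minimal Python sketch. It assumes Python's `None` as the missing-value marker and uses invented ratings; other tools spell the marker NULL, NA, or NaN.

```python
# Hypothetical ratings: None means "not recorded", while 0.0 is a real rating.
ratings = [4.5, None, 0.0, 3.5]

# A missing value is neither zero nor an empty string.
print(None == 0, None == "")   # False False

# Keep only the values that were actually recorded before averaging.
recorded = [r for r in ratings if r is not None]
average = sum(recorded) / len(recorded)

print(len(ratings), len(recorded))  # 4 3
print(average)
```

Note that the zero rating stays in the calculation while the missing one is skipped; treating the two as the same thing would change the average.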
Example: A Library Book in a Dataset
To see these data types in action, let’s imagine the data recorded for a single book in a library’s catalog. Each piece of information about the book would be a different data type.
For the book “To Kill a Mockingbird”:
- Author: “Harper Lee” is a string, as it’s text.
- Copies in Stock: 3 is an integer, since you can’t have a fraction of a book.
- User Rating: 4.7 (out of 5) is a float, as it requires a decimal for precision.
- Genre: “Fiction” is a nominal categorical value, classifying the book without ranking it.
- Condition: “Good” would be an ordinal categorical value, as it comes from an ordered set (like Poor, Good, Excellent).
- Is Currently Available?: True is a boolean value, giving a clear yes/no answer.
- Subtitle: NULL. A book like “To Kill a Mockingbird” doesn’t have a subtitle. For any book that only has a main title, this field would be NULL because it simply doesn’t apply.
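The same record could be written down in Python roughly as follows. This is a sketch under assumptions: the `Book` class and its field names are invented for illustration, and a real library catalog would likely use a database or data frame instead.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Book:
    title: str
    author: str
    copies_in_stock: int
    user_rating: float
    genre: str               # nominal category
    condition: str           # ordinal category (Poor < Good < Excellent)
    is_available: bool
    subtitle: Optional[str]  # None when the book has no subtitle

book = Book(
    title="To Kill a Mockingbird",
    author="Harper Lee",
    copies_in_stock=3,
    user_rating=4.7,
    genre="Fiction",
    condition="Good",
    is_available=True,
    subtitle=None,
)

# Print each field next to the type Python stores it as.
for name, value in vars(book).items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```

Plain strings cannot tell a nominal field from an ordinal one, which is why the distinction is only noted in comments here.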
Why Data Types Matter
Understanding data types is crucial because:
- Appropriate Operations: Different types support different operations.
- You can average ages (numbers) but not names (strings).
- You can alphabetize names but not add them together.
- Storage Efficiency: Proper typing helps store data compactly.
- Storing 1234567890 as a 4-byte integer typically takes less space than storing it as ten characters of text, and the difference adds up over millions of rows.
- Analysis Accuracy: Mistaken types lead to incorrect analysis.
- If ages are accidentally stored as text, calculating the average age will fail or, in some tools, silently produce a wrong result.
- Data Validation: Types help catch errors.
- If an age field contains “apple,” we know there’s an error because “apple” isn’t a valid integer representing someone’s age in number of years.
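A short Python sketch shows two of these points: averaging works on numbers but not on text, and a type check can flag an invalid age. The helper names `average` and `is_valid_age` are hypothetical, and the rule "a valid age is a non-negative whole number" is an assumption made for this example.

```python
ages_as_numbers = [34, 27, 45]
ages_as_text = ["34", "27", "45"]

def average(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

print(average(ages_as_numbers))

# The same calculation on text raises an error in Python.
try:
    average(ages_as_text)
    text_average_worked = True
except TypeError:
    text_average_worked = False
print(text_average_worked)  # False

def is_valid_age(raw: str) -> bool:
    """True if the text can be read as a non-negative whole number."""
    try:
        return int(raw) >= 0
    except ValueError:
        return False

print(is_valid_age("34"), is_valid_age("apple"))  # True False
```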
Some common type confusions and pitfalls to be mindful of include:
- Numbers stored as text:
- Problem: “42” (string) vs. 42 (integer). Note how the string is in quotes while the integer is not.
- Impact: Can’t perform calculations like adding, subtracting, multiplying, dividing, or finding averages
- Missing values:
- Problem: Empty cells, “N/A”, “NULL”, or other placeholders
- Impact: Can skew calculations or analysis if not handled properly
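Both pitfalls can be handled in one cleaning step, sketched below in Python. The set of placeholder spellings in `MISSING_MARKERS` and the function name `parse_int` are assumptions for this example; a real dataset may use other markers, so check what yours contains first.

```python
from typing import Optional

# Placeholder spellings treated as "missing" in this sketch.
MISSING_MARKERS = {"", "N/A", "NULL", "NA"}

def parse_int(raw: str) -> Optional[int]:
    """Convert text to an integer, or return None for a missing value."""
    cleaned = raw.strip()
    if cleaned.upper() in MISSING_MARKERS:
        return None
    return int(cleaned)

raw_column = ["42", " 7", "N/A", "", "NULL", "13"]
parsed = [parse_int(cell) for cell in raw_column]
print(parsed)  # [42, 7, None, None, None, 13]

# With real integers and explicit None values, the average is well defined.
present = [value for value in parsed if value is not None]
print(sum(present) / len(present))
```

Any text that is neither a number nor a known placeholder still raises an error here, which is deliberate: it surfaces unexpected values instead of hiding them.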
Topic Recap
Congratulations! You’ve completed the language-agnostic data science foundations modules. You now understand what data science is, how to organize data in a tidy format, and the fundamental data types you’ll work with.