Data Cleaning and Data Wrangling

Introduction

Data Looks Fine… Until It’s Not!

Your dataset might look clean at first glance, but hidden issues can cause serious problems during analysis. Missing values, duplicates, and incorrect formats are some of the most common culprits. If left unchecked, they can distort insights, lead to incorrect conclusions, or even break an entire workflow.

This section will help you identify these issues, understand why they matter, and explore ways to detect them efficiently.

Ctrl+F Won’t Save You: Simple Tricks for Cleaning Data

Handling Numerical Data

Missing Values: The Gaps That Can Skew Analysis

Missing data occurs when certain values in a dataset are left blank, whether due to collection errors, system limitations, or human mistakes. Ignoring them can lead to biased results, especially in statistical analysis or machine learning models.

How to Spot Missing Data

  • Some datasets explicitly mark missing values as NA, NULL, or -999. Others just leave cells blank.
  • A quick scan using filters in Excel or checking for empty cells in Google Sheets can help.
  • In Python, missing values can be detected using: df.isnull().sum()
  • In R, a simple check looks like: sum(is.na(df))
  • A more detailed implementation is available in the GitHub link: Python, R.
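
A minimal Python sketch of these checks, assuming a pandas DataFrame named df (the column names and values below are made up for illustration):

    import pandas as pd

    # Small example with deliberate gaps
    df = pd.DataFrame({"age": [25, None, 40], "city": ["Pune", "Delhi", None]})

    print(df.isnull().sum())            # Missing values per column
    print(df.isnull().mean() * 100)     # Percentage missing per column
    print(df[df.isnull().any(axis=1)])  # Rows that contain at least one gap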

Duplicate Entries: When Data Repeats Itself

Duplicates can occur when datasets are merged, when data is collected multiple times, or when human errors lead to repeated entries. This artificially inflates counts and distorts averages, leading to misleading insights.

How to Detect Duplicates

  • In Excel, using Remove Duplicates in the Data tab can help.
  • Sorting or filtering a dataset can highlight repeated values.
  • In Python, duplicate rows can be counted using: df.duplicated().sum()
  • In R, a similar check is done with:  sum(duplicated(df))
  • A detailed breakdown, including handling partial duplicates, is available in the GitHub link: Python, R.
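
A minimal Python sketch of these checks (pandas assumed; the column names and values are illustrative):

    import pandas as pd

    df = pd.DataFrame({"id": [1, 1, 2], "name": ["Ana", "Ana", "Raj"]})

    print(df.duplicated().sum())      # Count fully duplicated rows
    df_unique = df.drop_duplicates()  # Drop exact duplicates
    # Partial duplicates: rows count as duplicates when only selected columns match
    df_by_name = df.drop_duplicates(subset=["name"], keep="first")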


Incorrect Formats: The Silent Troublemakers

Even when data is present, it may not be in the right format. These inconsistencies cause major issues in calculations and visualizations.

Common Formatting Problems

  • Numbers stored as text: A column meant for numerical values may have entries like $1,200 instead of 1200.
  • Inconsistent date formats: Some dates might be stored as MM/DD/YYYY, while others follow YYYY-MM-DD, causing sorting errors.
  • Category inconsistencies: Variations in spelling or capitalization (e.g., "Male" vs. "male") can create separate categories.

How to Detect Format Issues

  • Sorting data often exposes irregularities.
  • Applying data validation rules in Excel helps catch incorrect entries.
  • In Python, checking data types and converting formats can prevent issues: df.dtypes # Check column types
  • In R, a similar approach can be taken: str(df)  # Inspect data structure
  • More details and transformation techniques are available in the GitHub link: Python, R.
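
A rough Python illustration of catching and fixing all three problems (the column names and values are assumptions, not taken from any real dataset):

    import pandas as pd

    df = pd.DataFrame({"price": ["$1,200", "$950"],
                       "date": ["03/14/2024", "2024-03-15"],
                       "gender": ["Male", "male"]})

    print(df.dtypes)  # Check column types first

    # Numbers stored as text: strip the symbols, then convert
    df["price"] = pd.to_numeric(df["price"].str.replace(r"[$,]", "", regex=True))

    # Inconsistent dates: entries pandas cannot parse become NaT, which makes them easy to spot
    df["date"] = pd.to_datetime(df["date"], errors="coerce")

    # Category inconsistencies: normalize capitalization
    df["gender"] = df["gender"].str.lower()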

Dealing with Outliers: The Unseen Extremes

Outliers are data points significantly different from other observations. They can distort statistical analyses, skew visualizations, and lead to incorrect model predictions. Identifying and handling outliers is crucial to maintain data quality and accuracy.

Key Fixes for Outlier Issues

  • Detection: Use statistical methods to identify unusually high or low values.

  • Treatment: Depending on the context, you can remove, transform, or cap outliers.

  • Transformation: Apply log or square root transformations to reduce the impact.

  • Capping: Limit extreme values within a reasonable range (e.g., 1st and 99th percentiles).

Code Snippets in the GitHub link: Python, R.
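
One common recipe, sketched below in Python, combines the IQR rule for detection with a log transformation and percentile capping (the data and thresholds are illustrative, not requirements):

    import numpy as np
    import pandas as pd

    s = pd.Series([10, 12, 11, 13, 250])  # 250 is an obvious extreme

    # Detection: the interquartile range (IQR) rule
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

    # Transformation: compress the scale instead of dropping values
    log_scaled = np.log1p(s)

    # Capping: limit extremes to the 1st and 99th percentiles
    capped = s.clip(lower=s.quantile(0.01), upper=s.quantile(0.99))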

Rounding Values: Keeping Precision Consistent

Numerical data may contain excessive decimal places that are not meaningful or practical. Rounding values helps maintain uniformity, reduces noise, and makes reports more interpretable.

Key Fixes for Rounding Issues

  • Rounding to Desired Precision: Limit decimal places to a fixed number.

  • Avoiding Floating-Point Errors: Round during calculations to maintain accuracy.

  • Contextual Rounding: Round percentages to 2 decimal places but financial data to 4.

  • Truncating vs. Rounding: Choose rounding if precision is critical; truncate when only the integer part matters.

Code Snippets in the GitHub link: Python, R.
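
A small Python sketch of these ideas (the precision choices and column names are illustrative):

    import pandas as pd
    from decimal import Decimal

    df = pd.DataFrame({"pct": [12.34567, 98.76543], "rate": [0.0412345, 0.0312345]})

    df["pct"] = df["pct"].round(2)    # Percentages to 2 decimal places
    df["rate"] = df["rate"].round(4)  # Finer precision where it matters

    # Truncating vs. rounding: int() drops the fractional part, round() rounds it
    print(int(7.9), round(7.9))  # 7 vs. 8

    # Avoiding floating-point surprises in exact arithmetic
    print(Decimal("0.1") + Decimal("0.2"))  # exactly 0.3, unlike 0.1 + 0.2 with floats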

Handling Text Data

Invisible Gaps: The Silent Text Disturbance

Leading or trailing spaces in text fields can cause hidden errors—two values that look identical may be treated as different. Similarly, inconsistent capitalization can affect groupings in analysis.

Key Fixes for Text Issues

  • Trimming Spaces: Prevents mismatches caused by accidental spaces.
  • Standardizing Case: Ensures uniformity in categories like country names or product labels.
  • Removing Special Characters: Gets rid of unwanted symbols that interfere with analysis.

Using Code to Standardize Text

  • Python: Trim spaces in a column: df["column"] = df["column"].str.strip()
  • R: Convert text to lowercase and trim spaces: df$column <- trimws(tolower(df$column))
  • Additional transformations, such as removing punctuation or converting formats, are detailed in the GitHub link: Python, R.
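
Putting the three fixes together, a small Python sketch (the column name and values are illustrative):

    import pandas as pd

    df = pd.DataFrame({"country": ["  India", "india ", "IN-DIA!"]})

    df["country"] = (df["country"]
                     .str.strip()                                  # Trim spaces
                     .str.lower()                                  # Standardize case
                     .str.replace(r"[^a-z0-9 ]", "", regex=True))  # Remove special characters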

Encoding Errors: The Hidden Text Breakers

When datasets from various sources merge, inconsistent character encodings can lead to corrupted or unreadable text—often showing up as odd symbols or question marks.

Key Fixes for Encoding Issues

  • Convert to UTF-8: Ensure a common encoding format across your data.

  • Identify Corrupt Characters: Use libraries to detect and fix encoding errors.

  • Replace Invalid Characters: Swap out unreadable symbols with understandable alternatives.

Using Code to Fix Encoding Problems

  • Python: df["text"] = df["text"].str.encode("utf-8", errors="replace").str.decode("utf-8")
  • R: df$text <- iconv(df$text, from="ISO-8859-1", to="UTF-8")
  • Detailed instructions available on GitHub: Python, R.
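
To detect an unknown encoding before converting, one option is the third-party chardet package, sketched here (the package choice and file name are assumptions):

    import chardet  # third-party encoding detector

    # Read the raw bytes first; "data.csv" is a placeholder file name
    with open("data.csv", "rb") as f:
        raw = f.read()

    guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.9, ...}
    text = raw.decode(guess["encoding"], errors="replace")  # Unreadable characters become replacement symbols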

Tokenization & Normalization: Clarifying Text One Word at a Time

Text data often needs simplification to analyze effectively. Tokenization breaks text into individual words, while normalization converts those words into a standardized, base form.

Key Fixes for Text Clarity

  • Tokenization: Split sentences into words or phrases for easier analysis.

  • Lemmatization: Convert words to their dictionary form (e.g., "running" → "run").

  • Stop Word Removal: Remove common filler words that add no analytical value.

See the GitHub link for the code snippet: Python, R.
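
As a rough sketch, here is one way to do this in Python with the NLTK library (an assumption on our part; spaCy and similar libraries work along the same lines):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time downloads of the tokenizer model, stop word list, and lemma dictionary
    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")

    text = "The runners were running through the parks"
    tokens = nltk.word_tokenize(text.lower())                            # Tokenization
    tokens = [t for t in tokens if t not in stopwords.words("english")]  # Stop word removal

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("running", pos="v"))  # "run" (treated as a verb)
    print(lemmatizer.lemmatize("runners"))           # "runner" (noun is the default)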

Emojis & Special Characters: Cleaning Up the Textual Noise

Emojis, symbols, and punctuation can introduce unnecessary complexity in text analysis, making it difficult to accurately interpret data.

Key Fixes for Unwanted Characters

  • Emoji Removal: Strip emojis to simplify text.

  • Replacing Emojis with Words: Translate emojis to text equivalents.

  • Special Character Filtering: Eliminate unnecessary punctuation or symbols.

Code Snippets in the GitHub link: Python, R.
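
A short Python sketch of both approaches (regular expressions handle the filtering; the third-party emoji package is an assumption):

    import re
    import emoji  # third-party package for emoji-to-text conversion

    text = "Great product 👍🔥!!!"

    # Special character filtering: keep only letters, digits, and spaces
    cleaned = re.sub(r"[^A-Za-z0-9 ]", "", text)  # emojis and punctuation are stripped

    # Replacing emojis with words: each emoji becomes a :name: placeholder
    described = emoji.demojize(text)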

Standardizing Text Length: Keeping Your Data Consistent

When working with textual data, differences in text length can cause analytical and computational issues, especially in machine learning applications. Standardizing length helps maintain consistency.

Key Fixes for Text Length Issues

  • Truncation: Cut long texts to a uniform size.

  • Padding: Add extra spaces or characters to shorter texts.

  • Consistent Length for NLP Models: Improve analytical models with uniform input.

Code Snippets in the GitHub link: Python, R.
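
A small Python sketch of truncation and padding with pandas (the column name and length limit are illustrative):

    import pandas as pd

    df = pd.DataFrame({"review": ["Great", "Absolutely fantastic value for money"]})

    MAX_LEN = 20  # Illustrative target length
    df["clipped"] = df["review"].str.slice(0, MAX_LEN)                         # Truncation
    df["padded"] = df["clipped"].str.pad(MAX_LEN, side="right", fillchar=" ")  # Padding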

Mixed Content in Text Fields: Untangling Data Types

Sometimes fields expected to contain only text are mixed with numbers, dates, or other categorical data. This creates analytical confusion and incorrect results.

Key Fixes for Mixed Content Issues

  • Extract Numeric Information: Pull numeric values out of text fields.

  • Convert Data Types: Ensure numerical text is properly stored as numbers.

  • Categorical Encoding: Map categorical text to standardized numerical codes when necessary.

Code Snippets in the GitHub link: Python, R.
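
A brief Python sketch of these three steps (the column names and values are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"info": ["120 units", "45 units", "n/a"],
                       "size": ["Small", "Large", "Small"]})

    # Extract numeric information hidden inside a text field
    df["units"] = df["info"].str.extract(r"(\d+)", expand=False)

    # Convert data types; entries without a number become NaN instead of raising an error
    df["units"] = pd.to_numeric(df["units"], errors="coerce")

    # Categorical encoding: map text categories to integer codes
    df["size_code"] = df["size"].astype("category").cat.codes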

Conclusion

By cleaning our data, we avoid:

  • Inaccurate insights that lead us down the wrong path
  • Misleading trends that skew our decisions
  • Flawed predictions that waste resources
  • Wasted time fixing avoidable errors
  • Data confusion that slows us down
  • Unreliable conclusions that hurt our credibility