Data Cleaning and Data Wrangling

Introduction

Taming the Beast: How to Wrangle Your Data

Once your data is clean, the next challenge is structuring it for analysis. Raw data isn’t always in the right shape—it may need sorting, filtering, or merging with other datasets to make sense. This process, called data wrangling, transforms scattered information into a structured format ready for meaningful insights.

This section covers essential data wrangling techniques, including sorting, filtering, and combining datasets.

Sorting Data

Chaos to Clarity: The Sorting Shift

Sorting helps arrange data logically, making it easier to analyze trends. You might want to:

  • Sort by date to identify patterns over time.
  • Sort by amounts to find the highest or lowest values.
  • Sort by categories for structured reporting.

Ways to Sort Data

  • Excel & Google Sheets: Use the Sort & Filter tool.
  • Python: Sort a dataset by a column:
    df.sort_values("column_name", ascending=True, inplace=True)
  • R: Sort data using dplyr:
    df <- arrange(df, column_name)
  • A detailed guide with multiple sorting criteria is in the GitHub link: Python, R.

Filtering Data

Focus Lens: Zeroing in on What Count

Filtering allows you to extract only the relevant information from a dataset. This is useful when:

  • Working with large datasets and needing to analyze specific subsets.
  • Identifying outliers or values within a particular range.
  • Removing irrelevant records before further processing.

How to Filter Data

  • Excel: Use the Filter option to display specific rows.
  • Python: Extract rows where a condition is met:
    df_filtered = df[df["column_name"] > 50]
  • R: Subset data based on a condition:
    df_filtered <- df[df$column_name > 50, ]
  • More complex filtering examples are in the GitHub link: Python, R.

Merging Datasets

Data Synergy: Uniting the Pieces for the Full Picture

Often, the data you need is spread across multiple files. Merging (or joining) datasets allows you to combine information from different sources.

Common Use Cases for Merging

  • Combining survey results with participant details.
  • Merging sales data with customer records.
  • Integrating multiple years of data into a single dataset.

How to Merge Datasets

  • Excel: Use VLOOKUP or XLOOKUP to pull values from another table.
  • Python: Merge two datasets on a common column: 
    df_combined = df1.merge(df2, on="common_column", how="inner")
  • R: Perform a dataset join using merge():
    df_combined <- merge(df1, df2, by="common_column")
  • Additional join types (left, right, full) are detailed in the GitHub link: Python, R.

Data Transformation

Shape Shift: Turning Raw Data into Insight

Data transformation involves converting data from its raw form into a format that’s more useful for analysis. This process helps you clean, structure, and refine your data to fit specific business goals or analytical needs.

Common Use Cases for Data Transformation

  • Normalizing data: Converting all values to a common scale (e.g., standardizing measurements).
  • Pivoting data: Reorganizing data into a more readable or useful structure, like turning rows into columns.
  • Aggregating data: Summarizing detailed data, such as calculating the average sales per month.

How to Transform Data

  • Excel: Use Power Query to reshape your data, or use formulas like TEXT, CONCATENATE, and SUMIF.

  • Python: Use pandas to manipulate and transform data, such as:

df_transformed = df.groupby("Category").agg({"Sales": "sum", "Revenue": "mean"})

  • R: Apply functions like mutate(), spread(), or gather() from the tidyverse package.
    df_transformed <- df %>% group_by(Category) %>% summarise(Sales = sum(Sales), Revenue = mean(Revenue))

Transformation Types

  • Normalization: Scaling data to a standard range.

  • Pivoting: Switching rows to columns (or vice versa) for a more insightful structure.
  • Aggregation: Summarizing large datasets into key metrics.

Code Snippets on GitHub Link: Python, R.

Conclusion

Why This Matters

Data wrangling ensures that your dataset is structured, relevant, and ready for analysis. Without proper sorting, filtering, and merging, insights can be misleading or incomplete.

By using these techniques, you can organize, refine, and integrate data efficiently—eliminating clutter and focusing on meaningful information.