Data Cleaning and Data Wrangling

Data Dictionary

Data Lingo 101: The Must-Know Terms

1. Imputation

The process of filling missing or incomplete values in a dataset using predefined methods such as mean, median, or mode substitution, or more advanced algorithms.
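For example, mean imputation in pandas, sketched with made-up values:

```python
import pandas as pd

# Hypothetical column with gaps; fill them with the column mean (mean imputation)
s = pd.Series([10.0, None, 30.0, None, 50.0])
filled = s.fillna(s.mean())  # mean of the observed values (10, 30, 50) is 30.0
```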


2. DataFrame

A two-dimensional, tabular data structure with labeled axes (rows and columns), provided by libraries such as pandas (Python); in R, the analogous structures are the built-in data.frame and the tibble used by dplyr.
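A minimal pandas construction, using invented data:

```python
import pandas as pd

# A small DataFrame: labeled columns, default integer row index
df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [95, 88]})
```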


3. NA (Not Available) / NULL

A placeholder value representing missing or undefined data in a dataset, commonly used in both R and Python.


4. dropna()

A pandas (Python) method used to remove rows or columns that contain missing (null) values from a DataFrame.
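A quick sketch with made-up values, dropping by row and by column:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3], "b": [4, 5, None]})
rows_kept = df.dropna()        # drop rows containing any missing value
cols_kept = df.dropna(axis=1)  # drop columns containing any missing value
```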


5. fillna()

A method used in pandas (Python) to replace missing values with a specific value or strategy (e.g., forward fill, backward fill, or mean).
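Two common strategies on an invented series (a constant and a forward fill):

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
const = s.fillna(0)  # replace missing values with a constant
fwd = s.ffill()      # forward fill: carry the last valid value ahead
```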


6. str.strip()

A Python string method (applied column-wise in pandas through the .str accessor) that removes leading and trailing whitespace from text values, often used to clean up inconsistent spacing issues.
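In pandas, with hypothetical messy names:

```python
import pandas as pd

s = pd.Series(["  alice ", "bob  ", " carol"])
clean = s.str.strip()  # removes leading and trailing whitespace in each value
```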


7. Standardization

Converting data into a consistent format, such as ensuring uniform capitalization or applying consistent naming conventions across a dataset.


8. dplyr

An R package that provides a set of tools for data manipulation, including filtering, selecting, and transforming data, often used for wrangling.


9. merge()

A function used to combine two datasets based on a common column or key, similar to a SQL join operation. It is available in both R and Python (pandas).


10. Pivot Table

A data summarization tool that is often used in Excel or pandas (Python) to reorganize and aggregate data based on a set of parameters or criteria.
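A small pandas sketch with invented sales data:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 200, 250],
})
# One row per region, one column per quarter, summing revenue per cell
table = sales.pivot_table(index="region", columns="quarter",
                          values="revenue", aggfunc="sum")
```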


11. Normalization

The process of scaling data values to fit a particular range, often between 0 and 1, to ensure fair comparisons between variables in machine learning models.
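Min-max scaling to the [0, 1] range, on made-up values:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
# Min-max normalization: (x - min) / (max - min)
normalized = (s - s.min()) / (s.max() - s.min())
```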


12. Grouping

The process of combining data into groups based on shared attributes or columns, which is commonly followed by aggregate functions like sum, mean, or count.
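Group-then-aggregate in pandas, with a hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B", "B"], "points": [3, 5, 2, 6]})
# Group rows by team, then sum points within each group
totals = df.groupby("team")["points"].sum()
```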


13. One-Hot Encoding

A method of representing categorical variables as binary (0/1) vectors. It converts categorical features into a format that can be used by machine learning models.
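One way to do this in pandas, on an invented categorical column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# Each category becomes its own 0/1 (or boolean) indicator column
encoded = pd.get_dummies(df, columns=["color"])
```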


14. Join

The process of combining two or more datasets based on a shared column, similar to SQL joins (inner join, outer join, etc.).


15. Column Type Conversion

The process of changing the data type of a column in a dataset, such as converting text to numeric values, or dates to datetime objects.
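Both conversions in pandas, on hypothetical string columns:

```python
import pandas as pd

df = pd.DataFrame({"amount": ["1", "2", "3"],
                   "when": ["2024-01-01", "2024-01-02", "2024-01-03"]})
df["amount"] = df["amount"].astype(int)   # text -> integer
df["when"] = pd.to_datetime(df["when"])   # text -> datetime
```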


16. Data Validation

The process of ensuring that the data entered or imported meets predefined rules or constraints, such as acceptable ranges, formats, or types.


17. trimws()

A base R function used to remove leading and trailing whitespace from strings in a dataset.


18. drop_duplicates()

A method in pandas (Python) used to remove duplicate rows from a DataFrame, ensuring that only unique rows remain.
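A minimal sketch with a repeated row:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "val": ["x", "x", "y"]})
unique_rows = df.drop_duplicates()  # keeps the first occurrence of each row
```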


19. Outlier Detection

The process of identifying and handling data points that deviate significantly from other observations, which could skew analysis or predictions.


20. VLOOKUP / XLOOKUP

Excel functions used to look up and retrieve data from another table based on a specific key or matching criterion.


21. Data Transformation

The process of converting data into a format that is more appropriate for analysis or modeling, including actions like normalization, encoding, or aggregation.


22. Data Aggregation

The process of summarizing or combining multiple data points into a single value, often using functions like sum, mean, or count, typically applied to grouped data.


23. Regex (Regular Expressions)

A method for pattern matching within text, often used to clean or extract specific information from strings in a dataset.
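A small extraction example using Python's re module, on an invented string:

```python
import re

text = "Order #123 shipped on 2024-05-01 to ZIP 90210"
order_id = re.search(r"#(\d+)", text).group(1)      # capture digits after '#'
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)      # all ISO-style dates
```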


24. Cross-Validation

A technique in data analysis and machine learning used to assess the performance of a model by splitting the data into multiple subsets, training on some and testing on others.


25. Indexing

The method of selecting and accessing specific rows or columns in a DataFrame or dataset, often done using labels or position.


26. Outlier

An observation in a dataset that significantly differs from other data points, which can distort statistical analyses and lead to incorrect conclusions.


27. Scaling

The process of standardizing the range of data features in a dataset, often by transforming data so that it falls within a specific range (e.g., 0 to 1).


28. Z-score

A statistical measure that describes a data point's relationship to the mean of a dataset, commonly used in outlier detection.
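Flagging outliers by z-score, with made-up values (the cutoff of 2 is a common rule of thumb, not a fixed standard):

```python
import pandas as pd

s = pd.Series([10, 11, 12, 10, 11, 12, 10, 11, 12, 50])
z = (s - s.mean()) / s.std()  # distance from the mean in standard deviations
outliers = s[z.abs() > 2]     # |z| > 2 (or 3) is a common outlier threshold
```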


29. Imbalanced Data

A situation where one class in a dataset is overrepresented compared to others, which can affect the performance of machine learning models.


30. Data Munging

Another term for data wrangling or cleaning, referring to the process of transforming and mapping raw data into a more useful format.


31. Data Transformation

The process of applying functions or operations to raw data to convert it into a format that is more appropriate for analysis or modeling.


32. Box Plot

A graphical representation of data that shows the distribution through quartiles, highlighting the median, upper and lower quartiles, and potential outliers.


33. Feature Engineering

The process of creating new input variables from existing data, aimed at improving the performance of machine learning algorithms.


34. Missing Completely at Random (MCAR)

A situation where the likelihood of data being missing is unrelated to the data itself or any other variables.


35. Missing at Random (MAR)

A situation where the probability of a value being missing depends on other observed variables, but not the missing value itself.


36. Missing Not at Random (MNAR)

A situation where the probability of a value being missing is related to the unobserved value itself, introducing a potential bias in analysis.


37. Subset

A smaller portion of a larger dataset that is selected based on specific criteria or conditions for further analysis.


38. Long Format

A data format in which each row holds a single observation, typically one measurement per row, with columns identifying the entity, the variable, and its value. It is often used for time-series or panel data.


39. Wide Format

A data format in which each row represents one entity and each column holds a separate variable or repeated measurement. It is often used for cross-sectional data.
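Converting between the two in pandas, with a hypothetical table of quarterly scores:

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "q1": [10, 30], "q2": [20, 40]})
# Wide -> long: one row per (id, quarter) observation
long = wide.melt(id_vars="id", var_name="quarter", value_name="score")
# Long -> wide again: ids back to rows, quarters back to columns
back = long.pivot(index="id", columns="quarter", values="score")
```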


40. Lag Variables

Variables in time-series data that represent a previous time point's value to model delayed effects or temporal relationships.
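In pandas, shift() creates a lagged copy (illustrated with invented sales figures):

```python
import pandas as pd

sales = pd.Series([100, 110, 120, 130])
lagged = sales.shift(1)  # previous period's value; the first entry becomes NaN
```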


41. Data Type Casting

The process of converting one data type into another (e.g., converting a string into an integer or a float into a string).


42. Quantile

A statistical term for the cut points that divide ordered data into equal-sized groups (e.g., quartiles or percentiles), often used for identifying outliers or splitting data for analysis.


43. Binning

The process of grouping continuous data into discrete bins or intervals, often used to simplify or categorize data for analysis.
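Binning ages into labeled intervals with pd.cut, on made-up values (the bin edges and labels are arbitrary choices for illustration):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 40, 70])
# Bins are (0, 18], (18, 65], (65, 120]
bins = pd.cut(ages, bins=[0, 18, 65, 120], labels=["minor", "adult", "senior"])
```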


44. Join Types

Various methods of combining datasets, including:

  • Inner Join: keeps only rows whose key values match in both datasets.
  • Left Join: keeps all rows from the left dataset, plus matching rows from the right.
  • Right Join: keeps all rows from the right dataset, plus matching rows from the left.
  • Outer Join: keeps all rows from both datasets, filling unmatched columns with null values.
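The differences are easy to see on two tiny invented tables:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})
inner = left.merge(right, on="key", how="inner")      # keys 2, 3
left_join = left.merge(right, on="key", how="left")   # keys 1, 2, 3
outer = left.merge(right, on="key", how="outer")      # keys 1, 2, 3, 4
```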

45. DateTime Format

A format used to represent and manipulate date and time data, such as "YYYY-MM-DD" or "MM/DD/YYYY", which is crucial for time-based analysis.


46. Datetime Indexing

The process of setting a date or time column as an index in a dataset, often used in time-series analysis.


47. Data Redundancy

The repetition of data, usually as a result of poor data management or the merging of multiple datasets, which can increase storage costs and reduce analysis accuracy.


48. Hashing

The process of converting data into a fixed-size string or value, commonly used in data de-identification or indexing to ensure privacy.


49. Categorical Variable

A variable that contains labels or categories (e.g., "Gender", "Color") rather than numerical values. These may be nominal or ordinal in nature.


50. Transformation Pipeline

A sequence of operations or functions applied to data in a set order to clean, preprocess, or transform it for modeling.
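One lightweight way to express such a pipeline is pandas' pipe(), chaining cleaning steps in order (the step functions here are hypothetical examples; libraries like scikit-learn provide a more formal Pipeline object):

```python
import pandas as pd

def strip_names(df):
    # Remove stray whitespace from the name column
    return df.assign(name=df["name"].str.strip())

def drop_missing(df):
    # Remove rows with any missing values
    return df.dropna()

raw = pd.DataFrame({"name": [" Ada ", "Grace", None], "score": [95, 88, 70]})
clean = raw.pipe(strip_names).pipe(drop_missing)  # steps run left to right
```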