Data Lingo 101: The Must-Know Terms
The process of filling missing or incomplete values in a dataset using predefined methods such as mean, median, or mode substitution, or more advanced algorithms.
A two-dimensional, tabular data structure with labeled axes (rows and columns), such as a pandas DataFrame in Python or a data.frame in R (the structure that dplyr operates on).
A placeholder value representing missing or undefined data in a dataset, commonly used in both R and Python.
A method used to remove missing or null values from a DataFrame or dataset, available in libraries like pandas (Python).
A method used in pandas (Python) to replace missing values with a specific value or strategy (e.g., forward fill, backward fill, or mean).
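As a minimal sketch of the missing-value methods described above, the snippet below uses pandas on a made-up four-row table (the column names and values are illustrative assumptions, not from any real dataset).

```python
import pandas as pd

# Hypothetical toy data: two of the four ages are missing.
df = pd.DataFrame({"name": ["Ana", "Ben", "Cy", "Di"],
                   "age": [30.0, None, 50.0, None]})

# dropna: discard rows whose "age" is missing.
dropped = df.dropna(subset=["age"])

# fillna with the column mean (simple mean imputation).
filled_mean = df["age"].fillna(df["age"].mean())

# Forward fill: carry the last observed value down into the gap.
filled_ffill = df["age"].ffill()
```

Dropping is simplest but discards rows; filling keeps every row at the cost of introducing estimated values.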
A method used to remove leading and trailing whitespace from text strings in a dataset, often used to clean up inconsistent spacing.
Converting data into a consistent format, such as ensuring uniform capitalization or applying consistent naming conventions across a dataset.
An R package that provides a set of tools for data manipulation, including filtering, selecting, and transforming data, often used for wrangling.
A function used to combine two datasets based on a common column or key, similar to a SQL join operation. It is available in both R and Python (pandas).
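A merge on a shared key might look like the following pandas sketch. The two tables and the `customer_id` key are hypothetical, chosen only to show how the join type changes the result.

```python
import pandas as pd

# Hypothetical tables sharing a "customer_id" key.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [20, 35, 15, 50]})

# Inner join: keep only keys present in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left join: keep every customer; customers without orders get NaN amounts.
left = customers.merge(orders, on="customer_id", how="left")
```

Here the inner join yields three rows, while the left join yields four, one of them with a missing amount for the customer who has no orders.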
A data summarization tool that is often used in Excel or pandas (Python) to reorganize and aggregate data based on a set of parameters or criteria.
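For illustration, a pivot table in pandas could be built roughly as follows. The sales figures are invented, and `fill_value=0` is one possible choice for region/quarter combinations that have no data.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({"region": ["N", "N", "S", "S"],
                      "quarter": ["Q1", "Q2", "Q1", "Q1"],
                      "revenue": [100, 150, 80, 20]})

# Rows become regions, columns become quarters, cells hold summed revenue.
table = sales.pivot_table(index="region", columns="quarter",
                          values="revenue", aggfunc="sum", fill_value=0)
```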
The process of scaling data values to fit a particular range, often between 0 and 1, to ensure fair comparisons between variables in machine learning models.
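One common way to scale values into the 0 to 1 range is min-max scaling. The helper below is a small standard-library sketch; the function name `min_max_scale` and the handling of constant input are assumptions made for this example.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so return zeros (a convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 50])
```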
The process of combining data into groups based on shared attributes or columns, which is commonly followed by aggregate functions like sum, mean, or count.
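Grouping followed by aggregation can be sketched in pandas like this; the `team` and `score` columns are hypothetical.

```python
import pandas as pd

# Hypothetical scores for two teams.
df = pd.DataFrame({"team": ["a", "a", "b", "b", "b"],
                   "score": [1, 3, 2, 4, 6]})

# Group rows by team, then compute several aggregates per group.
summary = df.groupby("team")["score"].agg(["sum", "mean", "count"])
```

The result has one row per team and one column per aggregate function.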
A method of representing categorical variables as binary (0/1) vectors. It converts categorical features into a format that can be used by machine learning models.
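A short sketch of this encoding using pandas `get_dummies` is shown below. The `color` column is made up, and `dtype=int` is passed so the output is 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One new 0/1 column per distinct category.
encoded = pd.get_dummies(df["color"], prefix="color", dtype=int)
```

Each row has exactly one 1 across the new columns, marking which category it belongs to.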
The process of combining two or more datasets based on a shared column, similar to SQL joins (inner join, outer join, etc.).
The process of changing the data type of a column in a dataset, such as converting text to numeric values, or dates to datetime objects.
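As an illustration, text columns can be converted to numeric and datetime types in pandas roughly as follows. The data is invented, and `errors="coerce"` is one possible policy that turns unparseable entries into missing values instead of raising an error.

```python
import pandas as pd

# Hypothetical columns that arrived as text.
df = pd.DataFrame({"price": ["1.50", "2", "n/a"],
                   "date": ["2024-01-05", "2024-02-10", "2024-03-15"]})

# Text to numbers; "n/a" cannot be parsed and becomes NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Text to datetime objects, with the expected format stated explicitly.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
```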
The process of ensuring that the data entered or imported meets predefined rules or constraints, such as acceptable ranges, formats, or types.
A function in R used to remove leading and trailing whitespace from a string in a dataset.
A method in pandas (Python) used to remove duplicate rows from a DataFrame, ensuring that only unique rows remain.
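A brief sketch of duplicate removal in pandas follows; the table is hypothetical. It shows both whole-row deduplication and deduplication on a single key column.

```python
import pandas as pd

# Hypothetical table with one fully duplicated row and one repeated id.
df = pd.DataFrame({"id": [1, 1, 2, 2],
                   "city": ["Oslo", "Oslo", "Rome", "Pisa"]})

# Remove rows that are identical in every column.
unique_rows = df.drop_duplicates()

# Keep only the first row seen for each id.
unique_ids = df.drop_duplicates(subset="id", keep="first")
```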
The process of identifying and handling data points that deviate significantly from other observations, which could skew analysis or predictions.
Excel functions used to look up and retrieve data from another table based on a specific key or matching criterion.
The process of converting data into a format that is more appropriate for analysis or modeling, including actions like normalization, encoding, or aggregation.
The process of summarizing or combining multiple data points into a single value, often using functions like sum, mean, or count, typically applied to grouped data.
A method for pattern matching within text, often used to clean or extract specific information from strings in a dataset.
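The sketch below uses Python's built-in `re` module for both extraction and cleanup. The email pattern is deliberately simplified for illustration and is not a full validator; the sample text is made up.

```python
import re

text = "Contact: ana@example.com, ben@example.org; phone 555-0100"

# Simplified email pattern (an assumption for this example, not a complete validator).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
emails = EMAIL.findall(text)

# Cleaning example: collapse any run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "a   b\t c")
```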
A technique in data analysis and machine learning used to assess the performance of a model by splitting the data into multiple subsets, training on some and testing on others.
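The splitting step can be sketched in plain Python as below. The helper name `k_fold_indices` is hypothetical, and the folds are contiguous and unshuffled for simplicity; real workflows usually shuffle first or use a library routine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(6, 3))
```

Each sample lands in the test set exactly once, and a model would be trained and scored once per fold.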
The method of selecting and accessing specific rows or columns in a DataFrame or dataset, often done using labels or position.
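In pandas, label-based and position-based selection are typically done with `.loc` and `.iloc`. The small table here is invented for illustration.

```python
import pandas as pd

# Hypothetical table with string row labels.
df = pd.DataFrame({"age": [30, 41, 25],
                   "city": ["Oslo", "Rome", "Pisa"]},
                  index=["ana", "ben", "cy"])

by_label = df.loc["ben", "city"]           # select by row and column labels
by_position = df.iloc[0, 0]                # select by integer position
older = df.loc[df["age"] > 28, ["age"]]    # boolean mask plus column list
```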
An observation in a dataset that significantly differs from other data points, which can distort statistical analyses and lead to incorrect conclusions.
The process of standardizing the range of data features in a dataset, often by transforming data so that it falls within a specific range (e.g., 0 to 1).
A statistical measure that expresses how many standard deviations a data point lies above or below the mean of a dataset, commonly used in outlier detection.
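As a worked example, this measure can be computed with the standard library as follows. The sample values are made up, and the population standard deviation (`pstdev`) is an assumption; a sample standard deviation would give slightly different numbers.

```python
from statistics import mean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)        # centre of the data
sigma = pstdev(values)   # population standard deviation

# Distance from the mean, measured in standard deviations.
z_scores = [(v - mu) / sigma for v in values]
```

A common rule of thumb flags points whose absolute score exceeds about 3 as potential outliers, though the threshold is a judgment call.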
A situation where one class in a dataset is overrepresented compared to others, which can affect the performance of machine learning models.
Another term for data wrangling or cleaning, referring to the process of transforming and mapping raw data into a more useful format.
The process of applying functions or operations to raw data to convert it into a format that is more appropriate for analysis or modeling.
A graphical representation of data that shows the distribution through quartiles, highlighting the median, upper and lower quartiles, and potential outliers.
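The numbers behind such a plot can be sketched without any plotting library. The example below computes quartiles with the standard library and applies the common 1.5 × IQR rule for flagging outliers; the data and the `"inclusive"` quantile method are assumptions, and other quantile conventions give slightly different quartiles.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Three cut points that split the sorted data into four parts.
q1, median, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                  # interquartile range (height of the box)
lower_fence = q1 - 1.5 * iqr   # points below this are flagged
upper_fence = q3 + 1.5 * iqr   # points above this are flagged

outliers = [x for x in data if x < lower_fence or x > upper_fence]
```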
The process of creating new input variables from existing data, aimed at improving the performance of machine learning algorithms.
A situation where the likelihood of data being missing is unrelated to the data itself or any other variables.
A situation where the probability of a value being missing depends on other observed variables, but not the missing value itself.
A situation where the probability of a value being missing is related to the unobserved value itself, introducing a potential bias in analysis.
A smaller portion of a larger dataset that is selected based on specific criteria or conditions for further analysis.
A data format where each row holds a single measurement for one subject and one variable or time point, so the same subject can span several rows. It is often used for time-series or panel data.
A data format where each subject occupies one row and each variable or repeated measurement has its own column. It is often used for cross-sectional data.
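Converting between the two layouts can be sketched in pandas with `melt` and `pivot`. The table and the `y2023`/`y2024` column names are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per id, one column per year.
wide = pd.DataFrame({"id": [1, 2],
                     "y2023": [10, 20],
                     "y2024": [11, 21]})

# Wide to long: one row per (id, year) pair.
long = wide.melt(id_vars="id", var_name="year", value_name="value")

# Long back to wide: years become columns again.
back = long.pivot(index="id", columns="year", values="value")
```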
Variables in time-series data that represent a previous time point's value to model delayed effects or temporal relationships.
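A one-step lag can be created in pandas with `shift`, as in this small sketch; the daily sales numbers are invented.

```python
import pandas as pd

# Hypothetical daily sales.
df = pd.DataFrame({"day": [1, 2, 3, 4],
                   "sales": [10, 12, 9, 15]})

# Each row receives the previous day's sales; the first row has no predecessor.
df["sales_lag1"] = df["sales"].shift(1)
```

The first lagged value is missing by construction, so models using lag features usually drop or impute the leading rows.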
The process of converting one data type into another (e.g., converting a string into an integer or a float into a string).
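A statistical term for the cut points that divide sorted data into groups containing equal numbers of observations (for example, quartiles split the data into four equal-sized parts), often used for identifying outliers or splitting data for analysis.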
The process of grouping continuous data into discrete bins or intervals, often used to simplify or categorize data for analysis.
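Binning a continuous column could look like the following pandas sketch. The bin edges and labels are arbitrary choices for illustration, and `right=False` makes each bin include its left edge and exclude its right edge.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 40, 70])

# Bins are [0, 18), [18, 65), [65, 120) because right=False.
groups = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=["child", "adult", "senior"], right=False)
```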
Various methods of combining datasets, including:
A format used to represent and manipulate date and time data, such as "YYYY-MM-DD" or "MM/DD/YYYY", which is crucial for time-based analysis.
The process of setting a date or time column as an index in a dataset, often used in time-series analysis.
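A short pandas sketch of parsing dates and using them as an index follows. The data is hypothetical, and selecting a whole month with a partial date string is one convenience this setup enables.

```python
import pandas as pd

# Hypothetical records with dates stored as text.
df = pd.DataFrame({"date": ["2024-01-01", "2024-01-02", "2024-02-01"],
                   "value": [1, 2, 3]})

# Parse the text into datetime objects, then make them the row index.
df["date"] = pd.to_datetime(df["date"])
ts = df.set_index("date")

# With a datetime index, a partial date string selects every row in that month.
january = ts.loc["2024-01"]
```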
The repetition of data, usually as a result of poor data management or the merging of multiple datasets, which can increase storage costs and reduce analysis accuracy.
The process of converting data into a fixed-size string or value, commonly used for indexing and fast lookups, or in data de-identification to help protect privacy.
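For de-identification, a salted SHA-256 digest from Python's `hashlib` is one possible approach, sketched below. The function name `pseudonymize` and the salt value are hypothetical, and this alone is not a complete privacy solution, since low-entropy inputs can still be guessed.

```python
import hashlib

def pseudonymize(value, salt):
    """Return a fixed-length hex digest standing in for the original value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# The same input and salt always give the same 64-character token.
token = pseudonymize("ana@example.com", salt="demo-salt")
other = pseudonymize("ben@example.org", salt="demo-salt")
```

Because the mapping is deterministic, the tokens can still be used as join keys across tables without exposing the raw identifier.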
A variable that contains labels or categories (e.g., "Gender", "Color") rather than numerical values. These may be nominal or ordinal in nature.
A sequence of operations or functions applied to data in a set order to clean, preprocess, or transform it for modeling.
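The idea can be sketched in plain Python as a list of functions applied in order. The helper `run_pipeline` and the three cleaning steps are illustrative assumptions, not a specific library's API.

```python
def run_pipeline(data, steps):
    """Apply each step to the output of the previous one, in order."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical cleaning steps: trim, lowercase, then deduplicate and sort.
steps = [
    lambda rows: [r.strip() for r in rows],
    lambda rows: [r.lower() for r in rows],
    lambda rows: sorted(set(rows)),
]

result = run_pipeline(["  Oslo", "oslo ", "ROME"], steps)
```

Order matters: deduplicating before trimming and lowercasing would leave "  Oslo" and "oslo " as separate entries.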