• Preprocessing is any manipulation of the dataset before running it through the model.
  • Motivations for preprocessing data include:
    • ensuring compatibility with Python ML/AI libraries.
    • accounting for differences in order of magnitude across features.
    • generalization, which allows us to reuse models on new data.

Standardization (Feature Scaling)

  • The process of transforming data into a standard scale.
  • This is commonly done with this equation:
\[ x_{\text{standardized}} = \frac{x - \mu}{\sigma} \]
  • x = original variable
  • µ = mean of the original variable
  • σ = standard deviation of the original variable
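
As a quick illustration, here is a minimal sketch of the formula above in Python, assuming NumPy and scikit-learn (the notes don't prescribe a library; `StandardScaler` is just one common choice):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two features on very different orders of magnitude.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Manual standardization: (x - mu) / sigma, column by column.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation via scikit-learn.
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))  # True: zero mean, unit variance either way
```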

Other Methods of Standardization include:

  1. L2-norm normalization – rescales each sample so that its feature vector has unit Euclidean length.
  2. PCA (Principal Components Analysis) – a dimensionality reduction technique that combines correlated variables into a smaller number of uncorrelated (latent) components.
  3. Whitening – often performed after PCA. Removes most of the correlation between features and rescales them to unit variance (a short code sketch of these methods follows the list).
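
A minimal sketch of L2 normalization and PCA with whitening, assuming scikit-learn (an assumed library choice; the random data is invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# L2-norm: rescale each sample (row) to unit Euclidean length.
X_l2 = normalize(X, norm="l2")
print(np.linalg.norm(X_l2, axis=1)[:3])  # all ~1.0

# PCA with whitening: project onto the principal components,
# then rescale each component to unit variance.
X_white = PCA(n_components=2, whiten=True).fit_transform(X)
print(np.cov(X_white, rowvar=False).round(2))  # ~identity: decorrelated, unit variance
```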

Preprocessing Categorical Data

Binary Encoding

Binary encoding maps each category to an integer and writes that integer in binary across a few columns. It is an improvement over raw categorical data, but can prove problematic because the encoding imposes an artificial ordering on the categories. It is most useful when there are many categories, since it needs far fewer columns than one-hot encoding.
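
A minimal sketch of the idea, assuming pandas and NumPy (the `color` column and `bit_` column names are made up for illustration):

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "green", "blue", "yellow", "green"])

# Step 1: map each category to an integer code (this is where the
# artificial ordering comes from).
codes = colors.astype("category").cat.codes.to_numpy()

# Step 2: write each code in binary. 4 categories fit in ceil(log2(4)) = 2
# columns, instead of the 4 columns one-hot encoding would need.
n_bits = int(np.ceil(np.log2(colors.nunique())))
bits = {f"bit_{i}": (codes >> i) & 1 for i in range(n_bits)}
print(pd.concat([colors.rename("color"), pd.DataFrame(bits)], axis=1))
```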

One-Hot Encoding

One-hot encoding creates as many columns as there are possible values, with a single 1 marking each row's category. Unfortunately, this encoding creates many columns of data, so it is best suited to cases where only a few categories are present.
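
A minimal sketch, assuming pandas (`get_dummies` is one common way to do this; the color data is invented):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One column per distinct value; each row has a single 1.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```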

Preprocessing – Shuffle and Batch

  • Shuffling – keeping the same information but in a different order.
  • Buffer size – the number of samples held in the shuffle buffer at a time. This matters because shuffling an entire large dataset in memory at once can crash some machines.
  • A buffer size of 1 results in no shuffling.
  • If the buffer size >= the number of samples, the whole dataset is shuffled at once (a uniform shuffle).
  • Batch size = 1 → stochastic gradient descent (SGD).
  • Batch size = number of samples → a single batch (standard gradient descent, GD).
  • 1 < batch size < number of samples → mini-batch GD (a short code sketch follows the list).
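
The buffer-size terminology above matches TensorFlow's `tf.data` API; assuming that is the intended toolkit, a minimal sketch:

```python
import tensorflow as tf

# Toy dataset of 100 samples.
dataset = tf.data.Dataset.range(100)

BUFFER_SIZE = 100  # >= number of samples, so the whole dataset is shuffled uniformly
BATCH_SIZE = 10    # 1 < 10 < 100, so training on this would be mini-batch GD

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

for batch in dataset.take(2):
    print(batch.numpy())  # two shuffled batches of 10 samples each
```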

Balancing the Dataset

Balancing a dataset in AI refers to equalizing the representation of the different classes or categories within it: the class distribution is adjusted so that each class has a similar number of instances, or some desired proportion.

Balancing a dataset is particularly important when dealing with imbalanced datasets, where one or more classes have significantly fewer instances than others. Imbalanced datasets can pose challenges during training, as models may become biased towards the majority class and perform poorly on the minority class(es).

Here are some common techniques used to balance a dataset in AI (a combined code sketch follows the list):

  1. Undersampling:
    • Undersampling involves reducing the number of instances from the majority class to match the number of instances in the minority class(es).
    • Random undersampling randomly selects a subset of instances from the majority class.
    • Undersampling can be effective when the majority class has a large number of redundant instances.
  2. Oversampling:
    • Oversampling involves increasing the number of instances in the minority class(es) to match the number of instances in the majority class.
    • Random oversampling replicates instances from the minority class randomly to increase their representation.
    • Synthetic oversampling techniques, such as SMOTE (Synthetic Minority Over-sampling Technique), generate synthetic instances by interpolating between existing minority class instances.
    • Oversampling can be effective when the minority class has limited instances and requires more representation.
  3. Class Weighting:
    • Class weighting assigns different weights to each class during training to account for the class imbalance.
    • Models are penalized more for misclassifying instances from the minority class by assigning higher weights to those instances.
    • Class weighting can be incorporated into the loss function or optimization algorithm to guide the model’s learning process.
  4. Data Augmentation:
    • Data augmentation techniques create new training instances by applying transformations, such as rotation, translation, scaling, or adding noise, to the existing instances.
    • Data augmentation helps increase the effective size of the minority class by generating diverse and realistic variations of the existing instances.
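
As referenced above, here is a combined sketch of techniques 1–4, assuming the third-party imbalanced-learn and scikit-learn libraries (the notes don't name any tooling, and the 90/10 toy dataset is invented for illustration):

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced dataset: 90 samples of class 0, 10 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

# 1. Undersampling: shrink the majority class to match the minority.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))                   # 10 of each class

# 2. Oversampling: replicate minority samples, or synthesize new ones (SMOTE).
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_over), Counter(y_smote))  # 90 of each class in both cases

# 3. Class weighting: leave the data alone and reweight the loss instead.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))         # the minority class gets the larger weight

# 4. Data augmentation (a tabular analogue of the image transforms above):
#    create extra minority samples by jittering existing ones with noise.
X_minority = X[y == 1]
X_augmented = X_minority + rng.normal(scale=0.05, size=X_minority.shape)
```

The weights from step 3 can then be passed to an estimator (for example via the `class_weight` parameter that many scikit-learn models accept), so the loss penalizes minority-class mistakes more heavily.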

The choice of balancing technique depends on the specific dataset, the degree of imbalance, and the characteristics of the problem at hand. It is important to consider the potential impact of balancing on the overall performance and generalization of the model. Additionally, evaluation metrics and techniques should be chosen carefully to account for the imbalanced nature of the dataset.