Last Updated : 06 Aug, 2024
Z-score normalization, also known as standardization, is a crucial data preprocessing technique in machine learning and statistics. It is used to transform data into a standard normal distribution, ensuring that all features are on the same scale. This process helps to avoid the dominance of certain features over others due to differences in their scales, which can significantly impact the performance of machine learning models. In this article, we will delve into the definition and examples of z-score normalization, highlighting its importance and practical applications.
Table of Contents
- What is Z-Score Normalization?
- Why Is Z-Score Normalization Necessary?
- Calculating Z-Score Normalization: Step-by-Step Calculation
- Practical Example: Z-Score Normalization in Python
- Step 1: Importing the Required Libraries
- Step 2: Calculate the Mean
- Step 3: Calculate the Standard Deviation
- Step 4: Perform Z-Score Normalization
- Step 5: Outlier Detection
- Applications of Z-Score Normalization
- Advantages and Disadvantages of Z-Score Normalization
What is Z-Score Normalization?
Z-score normalization, also called standardization, transforms data so that it has a mean (average) of 0 and a standard deviation of 1. This process adjusts data values based on how far they deviate from the mean, measured in units of standard deviation.
The formula is:

Z = (X - μ) / σ

where:
- Z is the Z-score.
- X is the value of the data point.
- μ is the mean of the dataset.
- σ is the standard deviation of the dataset.
Why Is Z-Score Normalization Necessary?
Machine learning algorithms often struggle with datasets where features are on drastically different scales. For instance, in a dataset of houses, the number of rooms and the age of the house are measured in different units. If these features are not normalized, the algorithm may give more importance to the feature with the larger scale, leading to inaccurate predictions.
- Normalization is necessary when dealing with features on different scales. Without normalization, features with larger scales can dominate those with smaller scales, leading to biased results in machine learning models.
- Z-score normalization addresses this issue by scaling the data based on its statistical properties, making it particularly useful for algorithms that rely on distance calculations, such as K-nearest neighbors and clustering algorithms.
- Z-score normalization ensures that all features are treated equally, enabling the algorithm to identify meaningful patterns and relationships more effectively.
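As a quick illustration (a made-up two-feature example, not taken from the article's dataset), the sketch below shows how one large-scale feature can dominate Euclidean distance, and how standardizing each feature restores balance:

```python
import numpy as np

# Hypothetical houses: [number of rooms, price in dollars].
# Price is on a vastly larger scale than room count.
a = np.array([3.0, 300000.0])
b = np.array([4.0, 310000.0])
c = np.array([8.0, 301000.0])

# Raw Euclidean distances: the price differences swamp the room differences,
# so b (similar rooms, slightly pricier) looks far more distant from a than c.
print(np.linalg.norm(a - b))  # ~10000.0
print(np.linalg.norm(a - c))  # ~1000.0

# Standardize each feature (column) with its own mean and standard deviation.
X = np.vstack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# After z-score normalization both features contribute comparably,
# and the two distances end up on the same order of magnitude.
print(np.linalg.norm(Z[0] - Z[1]))
print(np.linalg.norm(Z[0] - Z[2]))
```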
Calculating Z-Score Normalization: Step-by-Step Calculation
- Calculate the Mean and Standard Deviation: Compute the mean (μ) and standard deviation (σ) of the dataset.
- Apply the Z-Score Formula: For each data point X, apply the formula Z = (X - μ) / σ.

Consider the dataset [70, 80, 90, 100, 110] (the same test scores used in the Python example below).

1. Calculate the Mean: μ = (70 + 80 + 90 + 100 + 110) / 5 = 90.
2. Calculate the Standard Deviation: σ = √(((-20)² + (-10)² + 0² + 10² + 20²) / 5) = √200 ≈ 14.14.
3. Normalize the First Value: Z = (70 - 90) / 14.14 ≈ -1.41.

Repeating this for all values in the dataset will yield a new dataset with a mean of 0 and a standard deviation of 1.
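These hand calculations can be checked with a few lines of NumPy, using the same five test scores:

```python
import numpy as np

# The dataset from the worked example above.
data = np.array([70, 80, 90, 100, 110])

mean = data.mean()     # (70 + 80 + 90 + 100 + 110) / 5 = 90.0
std_dev = data.std()   # sqrt(200) ~ 14.14 (population standard deviation)

# Z-score of the first value: (70 - 90) / 14.14 ~ -1.41
z_first = (data[0] - mean) / std_dev
print(mean, std_dev, z_first)

# The full normalized dataset has mean ~0 and standard deviation ~1.
z = (data - mean) / std_dev
print(z.mean(), z.std())
```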
Practical Example: Z-Score Normalization in Python
Here is a simple example of how to perform Z-score normalization using Python:
Step 1: Importing the Required Libraries
- import numpy as np: This imports the NumPy library and gives it the alias np, which is a common convention.
- import matplotlib.pyplot as plt: This imports the pyplot module from the Matplotlib library (used for plotting in Step 5) and gives it the alias plt.
- data = np.array([70, 80, 90, 100, 110]): This creates a NumPy array named data containing the sample test scores.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data: test scores
data = np.array([70, 80, 90, 100, 110])
```
Step 2: Calculate the Mean
mean = np.mean(data): This calculates the mean (average) of the data array using the mean function from NumPy and stores it in the variable mean.
```python
# Calculate the mean
mean = np.mean(data)
```
Step 3: Calculate the Standard Deviation
std_dev = np.std(data): This calculates the standard deviation of the data array using the std function from NumPy and stores it in the variable std_dev.
```python
# Calculate the standard deviation
std_dev = np.std(data)
```
Step 4: Perform Z-Score Normalization
z_scores = (data - mean) / std_dev: This applies the Z-score normalization formula to each element in the data array. It subtracts the mean from each data point and divides the result by the standard deviation. The resulting Z-scores are stored in the array z_scores.
```python
# Perform Z-score normalization
z_scores = (data - mean) / std_dev

# Print the results
print("Original data:", data)
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Z-scores:", z_scores)
```
Output:
```
Original data: [ 70  80  90 100 110]
Mean: 90.0
Standard Deviation: 14.142135623730951
Z-scores: [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
```
Step 5: Outlier Detection
- outliers = np.where(np.abs(z_scores) > 3): This identifies the indices of data points whose absolute Z-scores are greater than 3, indicating they are outliers.
- print("Outliers:", data[outliers]): This prints the outlier data points to the console.
- plt.figure(figsize=(10, 6)): This creates a new figure with a specified size of 10×6 inches.
- plt.plot(data, 'bo-', label='Original Data'): This plots the original data points with blue circles connected by lines.
- plt.plot(outliers[0], data[outliers], 'ro', label='Outliers'): This plots the outliers with red circles.
- plt.axhline(mean, color='g', linestyle='--', label='Mean'): This adds a horizontal dashed green line at the mean value.
- plt.xlabel('Index'): This labels the x-axis as 'Index'.
- plt.ylabel('Value'): This labels the y-axis as 'Value'.
- plt.title('Data with Outliers Detected'): This sets the title of the plot.
- plt.legend(): This adds a legend to the plot.
- plt.show(): This displays the plot.
```python
# Outlier detection: Z-scores beyond ±3 are considered outliers
outliers = np.where(np.abs(z_scores) > 3)
print("Outliers:", data[outliers])

# Visualizing the data
plt.figure(figsize=(10, 6))
plt.plot(data, 'bo-', label='Original Data')
plt.plot(outliers[0], data[outliers], 'ro', label='Outliers')
plt.axhline(mean, color='g', linestyle='--', label='Mean')
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Data with Outliers Detected')
plt.legend()
plt.show()
```
Output:
Outliers: []

(The resulting plot shows the data points in blue with a dashed green line at the mean; no points are marked as outliers, since every Z-score lies well within ±3.)
Applications of Z-Score Normalization
1. In Machine Learning
- Feature Scaling: Ensures that all features contribute equally to the result, improving the performance of algorithms like SVM, K-means clustering, and neural networks.
- Outlier Detection: Z-scores help identify outliers, which are data points with Z-scores beyond a certain threshold (usually ±3).
2. Statistical Analysis
- Hypothesis Testing: Z-scores are used to determine the significance of results by comparing observed data to a standard normal distribution.
- Quality Control: In manufacturing, Z-scores monitor process performance and product quality by comparing measurements to specifications.
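As a rough sketch of the hypothesis-testing use (all numbers here are invented for illustration), a one-sample z-test computes a Z-score for a sample mean and converts it to a p-value via the standard normal CDF:

```python
import math

# Illustrative one-sample z-test (made-up numbers):
# population mean 100, population std 15, a sample of 36 with mean 106.
pop_mean, pop_std, n, sample_mean = 100.0, 15.0, 36, 106.0

# Z-score of the sample mean under the null hypothesis.
z = (sample_mean - pop_mean) / (pop_std / math.sqrt(n))  # 6 / 2.5 = 2.4

# Standard normal CDF expressed via the error function.
def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Two-sided p-value: probability of observing a |Z| at least this extreme.
p_value = 2.0 * (1.0 - phi(abs(z)))
print(z, p_value)  # z = 2.4, p ~ 0.016, significant at the 5% level
```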
Advantages and Disadvantages of Z-Score Normalization
Advantages
- Improved Model Performance: By scaling features to a common range, Z-score normalization enhances the accuracy and efficiency of machine learning models.
- Handling Outliers: Unlike min-max normalization, whose output range is determined entirely by the extreme values, Z-score normalization uses the mean and standard deviation of all data points, making it less sensitive to a single outlier and more robust in real-world applications.
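A small comparison (with a made-up outlier appended to the test scores) illustrates this difference between min-max scaling and Z-scores:

```python
import numpy as np

# The article's test scores with one extreme outlier appended.
data = np.array([70.0, 80.0, 90.0, 100.0, 110.0, 500.0])

# Min-max scaling: the single outlier defines the top of the range,
# squeezing all five normal points into a narrow band near 0.
min_max = (data - data.min()) / (data.max() - data.min())
print(min_max)  # first five values all land below 0.1

# Z-scores: the normal points keep clearly distinct values,
# and the outlier stands out with a large positive score.
z = (data - data.mean()) / data.std()
print(z)
```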
Disadvantages
- Assumption of Normal Distribution: Z-score normalization assumes that the data follows a normal distribution. If the data is highly skewed, this method may not be appropriate.
- Loss of Interpretability: The transformed data loses its original units, which can make interpretation more challenging.
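A quick sketch (with made-up skewed values) of why standardization does not fix skew: Z-scores only shift and rescale the data, preserving its shape.

```python
import numpy as np

# Made-up, heavily right-skewed data.
skewed = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

z = (skewed - skewed.mean()) / skewed.std()
print(z)

# Mean is ~0 and std is ~1, but the shape is unchanged: four values still
# cluster below the mean while one sits far above, exactly as in the raw data.
print(z.mean(), z.std())
```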
Conclusion
Z-score normalization is a powerful technique for standardizing data, making it essential for various data science and machine learning applications. By transforming data to have a mean of zero and a standard deviation of one, it ensures that features are on a common scale, improving model performance and robustness. While it has its limitations, understanding when and how to apply Z-score normalization can significantly enhance the quality of data analysis and modeling efforts.