This folder contains various techniques for encoding categorical data (nominal and ordinal) into numerical formats suitable for machine learning algorithms.
Most machine learning algorithms require numerical input, but real-world data often contains categorical variables (text labels, categories, etc.). Encoding techniques transform these categorical variables into numerical representations that algorithms can process.
- File: `Data_Encoding_nominal.ipynb`
- Method: `sklearn.preprocessing.OneHotEncoder`
- Description: Creates a binary column for each unique category in the data; each category gets its own column of 1/0 values. For example, the colors [red, blue, green] become three columns: color_red, color_blue, color_green.
- Use Case: Nominal categorical data where there's no inherent order between categories
- When to Use:
- Few unique categories (< 10-15)
- Nominal data (no order)
- When you want to avoid implying order
- Advantages:
- No ordinal relationship assumed
- Works well with linear models
- Disadvantages:
- Can create many columns (curse of dimensionality)
- Not suitable for high-cardinality features
- File: `label_encoding.ipynb`
- Method: `sklearn.preprocessing.LabelEncoder`
- Description: Assigns a unique integer label (0, 1, 2, ...) to each category. Labels are assigned in sorted order of the class names, and the resulting integers carry no semantic meaning.
- Use Case: Simple encoding for categorical variables where the algorithm can handle numerical labels
- When to Use:
- Tree-based algorithms (Random Forest, XGBoost)
- When order doesn't matter
- Quick encoding for exploration
- Advantages:
- Simple and fast
- Doesn't increase dimensionality
- Disadvantages:
- May imply false ordinal relationships
- Not suitable for linear models
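A minimal sketch on toy labels, showing that the integer mapping follows sorted class order (note that scikit-learn intends `LabelEncoder` for target labels; for feature columns the docs recommend `OrdinalEncoder`):

```python
from sklearn.preprocessing import LabelEncoder

# Classes are sorted alphabetically before integers are assigned
encoder = LabelEncoder()
encoded = encoder.fit_transform(["red", "blue", "green", "blue"])

print(list(encoder.classes_))  # ['blue', 'green', 'red'] -> 0, 1, 2
print(list(encoded))           # [2, 0, 1, 0]

# The mapping is invertible, which helps with interpretation later
print(list(encoder.inverse_transform(encoded)))  # ['red', 'blue', 'green', 'blue']
```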
- File: `label_encoding.ipynb`
- Method: `sklearn.preprocessing.OrdinalEncoder`
- Description: Encodes categorical variables with an inherent order (e.g., small, medium, large) into numerical values while preserving that order. You specify the category order explicitly.
- Use Case: Ordinal categorical data where the relationship between categories matters
- When to Use:
- Ordinal data with clear hierarchy
- Size, rating, or ranking categories
- When order is meaningful
- Advantages:
- Preserves ordinal relationships
- Single column output
- Works well with algorithms that understand order
- Disadvantages:
- Requires domain knowledge to specify order
- Assumes equal spacing between categories
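A minimal sketch with an assumed toy `size` column, showing how the explicit `categories` list fixes the order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy ordinal data; without an explicit order, sklearn would sort alphabetically
data = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# The categories list encodes the domain knowledge: small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(data[["size"]])

print(encoded.ravel())  # [1. 0. 2. 0.] -- order preserved, single column
```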
- File: `target_encoding.ipynb`
- Method: Group-based mean encoding using `pandas.groupby()` and `map()`
- Description: Encodes categorical variables using the mean (or another statistic) of the target variable for each category. For example, if category "A" has a mean target value of 0.7, every instance of "A" is encoded as 0.7.
- Use Case: High-cardinality categorical features where one-hot encoding would create too many features
- When to Use:
- High-cardinality features (many unique categories)
- When one-hot encoding is impractical
- When category-target relationship is important
- Advantages:
- Captures target relationship
- Single column output
- Handles high-cardinality well
- Disadvantages:
- Risk of overfitting
- Requires careful cross-validation
- Can leak target information
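One common way to mitigate the leakage risk is out-of-fold encoding: each row is encoded with means computed only on the other folds. A sketch under assumed toy data (the helper name, fold count, and fallback choice are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=3, seed=0):
    """Out-of-fold mean encoding: each row's value comes from the other folds."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        # Means computed on the training part only -- the row being encoded
        # never contributes to its own encoding
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(fold_means).values
    # Categories unseen in a fold's training part fall back to the global mean
    return encoded.fillna(global_mean)

data = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "target": [1, 0, 1, 0, 1, 1],
})
data["city_encoded"] = oof_target_encode(data, "city", "target")
```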
- `Data_Encoding_nominal.ipynb`: One-Hot Encoding implementation with examples
- `label_encoding.ipynb`: Label Encoding and Ordinal Encoding implementations
- `target_encoding.ipynb`: Target Encoding implementation with group-based mean encoding
One-Hot Encoding:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[['color']])
```

Label Encoding:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(data['color'])
```

Ordinal Encoding:

```python
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(data[['size']])
```

Target Encoding:

```python
mean_encoding = data.groupby('city')['target'].mean().to_dict()
data['city_encoded'] = data['city'].map(mean_encoding)
```
- Choose Based on Data Type:
  - Nominal → One-Hot or Target Encoding
  - Ordinal → Ordinal Encoding
  - High-cardinality → Target Encoding
- Consider the Algorithm:
  - Linear models → One-Hot Encoding
  - Tree-based → Label Encoding often works
  - Neural networks → One-Hot or Embeddings
- Handle High Cardinality: Use Target Encoding or feature hashing for features with many categories
- Avoid Data Leakage: In Target Encoding, use cross-validation to prevent leakage
- Document Encodings: Keep track of encoding mappings for model interpretation
- pandas
- numpy
- scikit-learn
- Feature scaling may be needed after encoding
- Consider feature selection after one-hot encoding
- Target encoding requires careful validation strategy