
Data Encoding

This folder contains various techniques for encoding categorical data (both nominal and ordinal) into numerical formats suitable for machine learning algorithms.

Overview

Most machine learning algorithms require numerical input, but real-world data often contains categorical variables (text labels, categories, etc.). Encoding techniques transform these categorical variables into numerical representations that algorithms can process.

Techniques Covered

1. One-Hot Encoding

  • File: Data_Encoding_nominal.ipynb
  • Method: sklearn.preprocessing.OneHotEncoder
  • Description: Creates binary columns for each unique category in the data. Each category gets its own column with 1 or 0 values. For example, if you have colors [red, blue, green], it creates three columns: color_red, color_blue, color_green.
  • Use Case: Nominal categorical data where there's no inherent order between categories
  • When to Use:
    • Few unique categories (< 10-15)
    • Nominal data (no order)
    • When you want to avoid implying order
  • Advantages:
    • No ordinal relationship assumed
    • Works well with linear models
  • Disadvantages:
    • Can create many columns (curse of dimensionality)
    • Not suitable for high-cardinality features

2. Label Encoding

  • File: label_encoding.ipynb
  • Method: sklearn.preprocessing.LabelEncoder
  • Description: Assigns a unique integer to each category (0, 1, 2, ...), in sorted (alphabetical) order. The numbering is arbitrary with respect to meaning and does not preserve any real-world order. Note that scikit-learn intends LabelEncoder for target labels; for feature columns, OrdinalEncoder is generally recommended.
  • Use Case: Simple encoding for categorical variables where the algorithm can handle numerical labels
  • When to Use:
    • Tree-based algorithms (Random Forest, XGBoost)
    • When order doesn't matter
    • Quick encoding for exploration
  • Advantages:
    • Simple and fast
    • Doesn't increase dimensionality
  • Disadvantages:
    • May imply false ordinal relationships
    • Not suitable for linear models
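
The arbitrary integer mapping, and the false order it can imply, can be seen in a short sketch (the labels here are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels; integers are assigned in sorted (alphabetical) order.
labels = ['red', 'blue', 'green', 'blue']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)  # ['blue' 'green' 'red'] -> mapped to 0, 1, 2
print(encoded)           # [2 0 1 0]
```

The resulting codes suggest blue < green < red, which is meaningless for colors. This is exactly why linear models can be misled by label encoding while tree-based models, which only split on thresholds, usually tolerate it.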

3. Ordinal Encoding

  • File: label_encoding.ipynb
  • Method: sklearn.preprocessing.OrdinalEncoder
  • Description: Encodes categorical variables with an inherent order (e.g., small, medium, large) into numerical values while preserving the order. You specify the order of categories explicitly.
  • Use Case: Ordinal categorical data where the relationship between categories matters
  • When to Use:
    • Ordinal data with clear hierarchy
    • Size, rating, or ranking categories
    • When order is meaningful
  • Advantages:
    • Preserves ordinal relationships
    • Single column output
    • Works well with algorithms that understand order
  • Disadvantages:
    • Requires domain knowledge to specify order
    • Assumes equal spacing between categories

4. Target Encoding

  • File: target_encoding.ipynb
  • Method: Group-based mean encoding using pandas.groupby() and map()
  • Description: Encodes categorical variables based on the mean (or other statistics) of the target variable for each category. For example, if category "A" has a mean target value of 0.7, all instances of "A" get encoded as 0.7.
  • Use Case: High-cardinality categorical features where one-hot encoding would create too many features
  • When to Use:
    • High-cardinality features (many unique categories)
    • When one-hot encoding is impractical
    • When category-target relationship is important
  • Advantages:
    • Captures target relationship
    • Single column output
    • Handles high-cardinality well
  • Disadvantages:
    • Risk of overfitting
    • Requires careful cross-validation
    • Can leak target information

Files

  • Data_Encoding_nominal.ipynb: One-Hot Encoding implementation with examples
  • label_encoding.ipynb: Label Encoding and Ordinal Encoding implementations
  • target_encoding.ipynb: Target Encoding implementation with group-based mean encoding

Implementation Examples

One-Hot Encoding

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(data[['color']])  # double brackets: 2-D input required

Label Encoding

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()  # expects a 1-D array of labels
encoded = encoder.fit_transform(data['color'])  # integers assigned in sorted category order

Ordinal Encoding

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(data[['size']])

Target Encoding

# Simple version: means computed on the full dataset, so it is prone to target leakage
mean_encoding = data.groupby('city')['target'].mean().to_dict()
data['city_encoded'] = data['city'].map(mean_encoding)

Best Practices

  1. Choose Based on Data Type:

    • Nominal → One-Hot or Target Encoding
    • Ordinal → Ordinal Encoding
    • High-cardinality → Target Encoding
  2. Consider Algorithm:

    • Linear models → One-Hot Encoding
    • Tree-based → Label Encoding often works
    • Neural networks → One-Hot or Embeddings
  3. Handle High-Cardinality: Use Target Encoding or feature hashing for many categories

  4. Avoid Data Leakage: In Target Encoding, use cross-validation to prevent leakage

  5. Document Encoding: Keep track of encoding mappings for model interpretation
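
The leakage-avoidance advice in point 4 can be sketched as out-of-fold target encoding: each row's encoding is computed from fold means that exclude that row. This is a minimal sketch; the `city` and `target` column names and the toy data are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame; 'city' and 'target' are illustrative names.
data = pd.DataFrame({
    'city':   ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'target': [1,   0,   1,   1,   1,   0,   0,   1],
})

global_mean = data['target'].mean()
data['city_encoded'] = global_mean  # fallback for categories unseen in a training fold

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(data):
    # Means are computed only on the training fold and applied to the held-out fold,
    # so no row's own target value contributes to its encoding.
    fold_means = data.iloc[train_idx].groupby('city')['target'].mean()
    encoded = data['city'].iloc[val_idx].map(fold_means).fillna(global_mean)
    data.loc[data.index[val_idx], 'city_encoded'] = encoded
```

At inference time, the encoding fitted on the full training set is applied to new data; the out-of-fold scheme is only needed when producing encodings for the training rows themselves.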

Dependencies

  • pandas
  • numpy
  • scikit-learn

Related Topics

  • Feature scaling may be needed after encoding
  • Consider feature selection after one-hot encoding
  • Target encoding requires careful validation strategy