
Data Encoding

This folder contains various techniques for encoding categorical data (both nominal and ordinal) into numerical formats suitable for machine learning algorithms.

Overview

Most machine learning algorithms require numerical input, but real-world data often contains categorical variables (text labels, categories, etc.). Encoding techniques transform these categorical variables into numerical representations that algorithms can process.

Techniques Covered

1. One-Hot Encoding

  • File: Data_Encoding_nominal.ipynb
  • Method: sklearn.preprocessing.OneHotEncoder
  • Description: Creates binary columns for each unique category in the data. Each category gets its own column with 1 or 0 values. For example, if you have colors [red, blue, green], it creates three columns: color_red, color_blue, color_green.
  • Use Case: Nominal categorical data where there's no inherent order between categories
  • When to Use:
    • Few unique categories (< 10-15)
    • Nominal data (no order)
    • When you want to avoid implying order
  • Advantages:
    • No ordinal relationship assumed
    • Works well with linear models
  • Disadvantages:
    • Can create many columns (curse of dimensionality)
    • Not suitable for high-cardinality features

2. Label Encoding

  • File: label_encoding.ipynb
  • Method: sklearn.preprocessing.LabelEncoder
  • Description: Assigns a unique integer to each category (0, 1, 2, ...), in sorted (alphabetical) order. The numbering is arbitrary with respect to meaning and does not preserve any real-world order. Note that scikit-learn intends LabelEncoder for target labels; for feature columns, OrdinalEncoder is generally recommended.
  • Use Case: Simple encoding for categorical variables where the algorithm can handle numerical labels
  • When to Use:
    • Tree-based algorithms (Random Forest, XGBoost)
    • When order doesn't matter
    • Quick encoding for exploration
  • Advantages:
    • Simple and fast
    • Doesn't increase dimensionality
  • Disadvantages:
    • May imply false ordinal relationships
    • Not suitable for linear models
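
The arbitrary integer mapping, and the false order it can imply, can be seen in a short sketch (the labels here are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels; integers are assigned in sorted (alphabetical) order.
labels = ['red', 'blue', 'green', 'blue']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)  # ['blue' 'green' 'red'] -> mapped to 0, 1, 2
print(encoded)           # [2 0 1 0]
```

The resulting codes suggest blue < green < red, which is meaningless for colors. This is exactly why linear models can be misled by label encoding while tree-based models, which only split on thresholds, usually tolerate it.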

3. Ordinal Encoding

  • File: label_encoding.ipynb
  • Method: sklearn.preprocessing.OrdinalEncoder
  • Description: Encodes categorical variables with an inherent order (e.g., small, medium, large) into numerical values while preserving the order. You specify the order of categories explicitly.
  • Use Case: Ordinal categorical data where the relationship between categories matters
  • When to Use:
    • Ordinal data with clear hierarchy
    • Size, rating, or ranking categories
    • When order is meaningful
  • Advantages:
    • Preserves ordinal relationships
    • Single column output
    • Works well with algorithms that understand order
  • Disadvantages:
    • Requires domain knowledge to specify order
    • Assumes equal spacing between categories

4. Target Encoding

  • File: target_encoding.ipynb
  • Method: Group-based mean encoding using pandas.groupby() and map()
  • Description: Encodes categorical variables based on the mean (or other statistics) of the target variable for each category. For example, if category "A" has a mean target value of 0.7, all instances of "A" get encoded as 0.7.
  • Use Case: High-cardinality categorical features where one-hot encoding would create too many features
  • When to Use:
    • High-cardinality features (many unique categories)
    • When one-hot encoding is impractical
    • When category-target relationship is important
  • Advantages:
    • Captures target relationship
    • Single column output
    • Handles high-cardinality well
  • Disadvantages:
    • Risk of overfitting
    • Requires careful cross-validation
    • Can leak target information

Files

  • Data_Encoding_nominal.ipynb: One-Hot Encoding implementation with examples
  • label_encoding.ipynb: Label Encoding and Ordinal Encoding implementations
  • target_encoding.ipynb: Target Encoding implementation with group-based mean encoding

Implementation Examples

One-Hot Encoding

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(data[['color']])  # double brackets: 2-D input required

Label Encoding

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()  # expects a 1-D array of labels
encoded = encoder.fit_transform(data['color'])  # integers assigned in sorted category order

Ordinal Encoding

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(data[['size']])

Target Encoding

# Simple version: means computed on the full dataset, so it is prone to target leakage
mean_encoding = data.groupby('city')['target'].mean().to_dict()
data['city_encoded'] = data['city'].map(mean_encoding)

Best Practices

  1. Choose Based on Data Type:

    • Nominal → One-Hot or Target Encoding
    • Ordinal → Ordinal Encoding
    • High-cardinality → Target Encoding
  2. Consider Algorithm:

    • Linear models → One-Hot Encoding
    • Tree-based → Label Encoding often works
    • Neural networks → One-Hot or Embeddings
  3. Handle High-Cardinality: Use Target Encoding or feature hashing for many categories

  4. Avoid Data Leakage: In Target Encoding, use cross-validation to prevent leakage

  5. Document Encoding: Keep track of encoding mappings for model interpretation
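
The leakage-avoidance advice in point 4 can be sketched as out-of-fold target encoding: each row's encoding is computed from fold means that exclude that row. This is a minimal sketch; the `city` and `target` column names and the toy data are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame; 'city' and 'target' are illustrative names.
data = pd.DataFrame({
    'city':   ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'target': [1,   0,   1,   1,   1,   0,   0,   1],
})

global_mean = data['target'].mean()
data['city_encoded'] = global_mean  # fallback for categories unseen in a training fold

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(data):
    # Means are computed only on the training fold and applied to the held-out fold,
    # so no row's own target value contributes to its encoding.
    fold_means = data.iloc[train_idx].groupby('city')['target'].mean()
    encoded = data['city'].iloc[val_idx].map(fold_means).fillna(global_mean)
    data.loc[data.index[val_idx], 'city_encoded'] = encoded
```

At inference time, the encoding fitted on the full training set is applied to new data; the out-of-fold scheme is only needed when producing encodings for the training rows themselves.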

Dependencies

  • pandas
  • numpy
  • scikit-learn

Related Topics

  • Feature scaling may be needed after encoding
  • Consider feature selection after one-hot encoding
  • Target encoding requires careful validation strategy