OHE: User can select which categories to encoded for selected variables#667

Open

Morgan-Sell wants to merge 10 commits intofeature-engine:mainfrom

Morgan-Sell:ohe_cstm_ctgrs

Collaborator

Morgan-Sell commented Apr 28, 2023

Closes #303.

Objective:
Create functionality so a user can encode certain categories that may not be the most frequent. The functionality is explained in this thread.

Will create a new init param called custom_categories that accepts a dictionary with selected features as keys and lists of the desired categories as values.

Both top_categories and user_categories cannot be used at the same time.

Morgan-Sell added 8 commits

April 28, 2023 10:52


          initial commit

02a695d


          create custom_categories param and init checks

ed3adf8


          add init checks

781b312


          expand fit

21b3b57


          create test_encode_custom_categories(). pass all tests.

c6b478b


          fix formatting errors

353c6f3


          update docstring

00b4216


          fix typecheck errors

43bc63d

Collaborator Author

Morgan-Sell commented Apr 29, 2023

I think we're getting close on this one.

Are there any other unit tests that we should implement? According to the coverage report, OHE has 100% coverage.

I'll dig into the type error.

Collaborator Author

Morgan-Sell commented Apr 30, 2023

Disregard my comment about code coverage. It seems that the transformer is very poorly "covered". I'll work on this.

solegalli reviewed

View reviewed changes

Collaborator

solegalli left a comment

Hey @Morgan-Sell

I had a quick look at the logic of the new functionality. There are some things that we need to give more thought to. Would be great it you could have a look and tell me what you think?

Thank you!

feature_engine/encoding/one_hot.py Outdated Show resolved Hide resolved

feature_engine/encoding/one_hot.py

+                      The dicitonary values are lists of the categories for each selected variable
+                      that are to be one-hot encoded.
+                      If `custom_categories` is being used, `top_categories` must equal None.

Collaborator

solegalli May 3, 2023

A few things to think:

I think custom_categories should override top_categories. So no matter what the user enters, custom_categories takes precedence. It is easier to use like this imo. So here we should say "when custom_categories is entered, top_categories will be ignored.

The other thing to consider is, how do we want custom_categories to interplay with variables. Are we going to allow the user to specify categories for 3 variables for example, while encoding 6? What did you have in mind for this?

For examples vars a,b,c,d,e are categorical, if variables=None, all will be encoded. If user passes a dictionary with keys, b and c, will variables a,d and e also be encoded? or are we making custom_categories take precedence over variables?

Collaborator Author

Morgan-Sell May 26, 2023

Oh, Dr. Che, you never let me get away easy. No wonder I enjoy working w/ you ;)

I'm going to suggest that variables is superior. I envision scenarios in which users would like to encode all variables but would like to select which categories are encoded in certain variables.

Consequently, this transformer will require a check when variables does not equal None and the user utilizes custom_categories. The transformer should confirm that the variables included in custom_categories are a subset of the variables inputted into the variables param.

What do you think? ;)

Collaborator

solegalli Jun 2, 2023

Sounds good.

feature_engine/encoding/one_hot.py

                       drop_last_binary: bool = False,
-                      variables: Union[None, int, str, List[Union[str, int]]] = None,
+                      variables: Union[
+                          None, int, str, List[Union[str, int]], Iterable[Union[str, int]]

Collaborator

solegalli May 3, 2023

why did you add iterable?

Collaborator Author

Morgan-Sell May 26, 2023

Before refactoring the code, MyPy raised the following errors:

feature_engine/encoding/one_hot.py:221: error: Argument 1 to "sorted" has incompatible type "Union[int, str, List[Union[str, int]], None]"; expected "Iterable[Union[str, int]]"
feature_engine/encoding/one_hot.py:223: error: Argument 1 to "sorted" has incompatible type "Union[int, str, List[Union[str, int]], None]"; expected "Iterable[Union[str, int]]"

feature_engine/encoding/one_hot.py Outdated Show resolved Hide resolved

feature_engine/encoding/one_hot.py

+                                  f"its values. Got {custom_categories} instead."
+                              )
+                          # check that custom_categories variable match variables

Collaborator

solegalli May 3, 2023

if custom_categories should exactly match variables, I think we should just make this parameter take precendence over variables, and ignore variables altogether.

The alternative is, if we allow the user to specify None or a group of variables in variables, while custom_categories contains only a subset of these. Could be practical for the user, but a headache for us. Need to think / have a look at how much code we should change.

Collaborator Author

Morgan-Sell May 26, 2023

I'm reading this comment after responding to your prior question.

I agree that it will be a headache for us and useful for the user. I'll think about this more too.

Collaborator

solegalli Jun 2, 2023

Alright then, let me know if you find an elegant solution :)

feature_engine/encoding/one_hot.py

+                      """
+                      for var, categories in self.custom_categories.items():
+                          unique_values = set(X[var].unique())
+                          if not set(categories).issubset(unique_values):

Collaborator

solegalli May 3, 2023

what if, the user wants a dummy for a category that is not present in the train set? should we contemplate this or is too unlikely?

Collaborator Author

Morgan-Sell May 26, 2023

When someone is OHE-ing a category that is not present in the training set, are they putting it into a "catch-all" bucket, e.g. unlearned_categories? Is this common?

I think it's unlikely and defeats the purpose of the custom_categories param. I see custom_categories being used to protect against unlearned categories.

Collaborator

solegalli Jun 2, 2023

Here is an example:

I work in a car insurance company. My historical data has 3 types of fuel: gas, petrol, diesel. But, I know that electric vehicles are coming. So I want an additional category: electric.

In the above scenario, it might help if we allow the additional bespoke category.

On the other hand, a model can't be train in a category that does not exist, so the results would be unreliable. Maybe we don't allow them for now?

Morgan-Sell added 2 commits

May 26, 2023 19:46


          revise docstring

cebc04a


          refactor code

9d98b36

Collaborator Author

Morgan-Sell commented May 27, 2023

When you're back from vacay, lmk what you think.

Also, I don't know why I said that OHE has a low code coverage score. I just ran the coverage report - the transformer received 100% coverage!

Collaborator

solegalli commented Jun 2, 2023

Hey @Morgan-Sell

I disappeared for a while, so I am a bit lost. I guess you were waiting for my input here, no? You let me know when I need to take another look? I am away from next Thursday, for a month, ehem, (red face)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet