OHE: User can select which categories to encoded for selected variables#667
OHE: User can select which categories to encoded for selected variables#667Morgan-Sell wants to merge 10 commits intofeature-engine:mainfrom
Conversation
|
hi @solegalli, I think we're getting close on this one. Are there any other unit tests that we should implement? According to the coverage report, OHE has 100% coverage. I'll dig into the type error. |
|
Disregard my comment about code coverage. It seems that the transformer is very poorly "covered". I'll work on this. |
solegalli
left a comment
There was a problem hiding this comment.
Hey @Morgan-Sell
I had a quick look at the logic of the new functionality. There are some things that we need to give more thought to. Would be great it you could have a look and tell me what you think?
Thank you!
| The dicitonary values are lists of the categories for each selected variable | ||
| that are to be one-hot encoded. | ||
|
|
||
| If `custom_categories` is being used, `top_categories` must equal None. |
There was a problem hiding this comment.
A few things to think:
- I think custom_categories should override top_categories. So no matter what the user enters, custom_categories takes precedence. It is easier to use like this imo. So here we should say "when custom_categories is entered, top_categories will be ignored.
The other thing to consider is, how do we want custom_categories to interplay with variables. Are we going to allow the user to specify categories for 3 variables for example, while encoding 6? What did you have in mind for this?
For examples vars a,b,c,d,e are categorical, if variables=None, all will be encoded. If user passes a dictionary with keys, b and c, will variables a,d and e also be encoded? or are we making custom_categories take precedence over variables?
There was a problem hiding this comment.
Oh, Dr. Che, you never let me get away easy. No wonder I enjoy working w/ you ;)
I'm going to suggest that variables is superior. I envision scenarios in which users would like to encode all variables but would like to select which categories are encoded in certain variables.
Consequently, this transformer will require a check when variables does not equal None and the user utilizes custom_categories. The transformer should confirm that the variables included in custom_categories are a subset of the variables inputted into the variables param.
What do you think? ;)
| drop_last_binary: bool = False, | ||
| variables: Union[None, int, str, List[Union[str, int]]] = None, | ||
| variables: Union[ | ||
| None, int, str, List[Union[str, int]], Iterable[Union[str, int]] |
There was a problem hiding this comment.
why did you add iterable?
There was a problem hiding this comment.
Before refactoring the code, MyPy raised the following errors:
feature_engine/encoding/one_hot.py:221: error: Argument 1 to "sorted" has incompatible type "Union[int, str, List[Union[str, int]], None]"; expected "Iterable[Union[str, int]]"
feature_engine/encoding/one_hot.py:223: error: Argument 1 to "sorted" has incompatible type "Union[int, str, List[Union[str, int]], None]"; expected "Iterable[Union[str, int]]"
| f"its values. Got {custom_categories} instead." | ||
| ) | ||
|
|
||
| # check that custom_categories variable match variables |
There was a problem hiding this comment.
if custom_categories should exactly match variables, I think we should just make this parameter take precendence over variables, and ignore variables altogether.
The alternative is, if we allow the user to specify None or a group of variables in variables, while custom_categories contains only a subset of these. Could be practical for the user, but a headache for us. Need to think / have a look at how much code we should change.
There was a problem hiding this comment.
I'm reading this comment after responding to your prior question.
I agree that it will be a headache for us and useful for the user. I'll think about this more too.
There was a problem hiding this comment.
Alright then, let me know if you find an elegant solution :)
| """ | ||
| for var, categories in self.custom_categories.items(): | ||
| unique_values = set(X[var].unique()) | ||
| if not set(categories).issubset(unique_values): |
There was a problem hiding this comment.
what if, the user wants a dummy for a category that is not present in the train set? should we contemplate this or is too unlikely?
There was a problem hiding this comment.
When someone is OHE-ing a category that is not present in the training set, are they putting it into a "catch-all" bucket, e.g. unlearned_categories? Is this common?
I think it's unlikely and defeats the purpose of the custom_categories param. I see custom_categories being used to protect against unlearned categories.
There was a problem hiding this comment.
Here is an example:
I work in a car insurance company. My historical data has 3 types of fuel: gas, petrol, diesel. But, I know that electric vehicles are coming. So I want an additional category: electric.
In the above scenario, it might help if we allow the additional bespoke category.
On the other hand, a model can't be train in a category that does not exist, so the results would be unreliable. Maybe we don't allow them for now?
|
Hi @solegalli, When you're back from vacay, lmk what you think. Also, I don't know why I said that OHE has a low code coverage score. I just ran the coverage report - the transformer received 100% coverage! |
|
Hey @Morgan-Sell I disappeared for a while, so I am a bit lost. I guess you were waiting for my input here, no? You let me know when I need to take another look? I am away from next Thursday, for a month, ehem, (red face) |
Closes #303.
Objective:
Create functionality so a user can encode certain categories that may not be the most frequent. The functionality is explained in this thread.
Will create a new init param called
custom_categoriesthat accepts a dictionary with selected features as keys and lists of the desired categories as values.Both
top_categoriesanduser_categoriescannot be used at the same time.