This repository was archived by the owner on Jun 17, 2024. It is now read-only.
Describe the bug
After special characters in the column names of the input data are corrected, there is no process to restore the original column names.
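The symptom can be sketched in isolation. The snippet below reuses the symbol pattern from the generated code quoted later in this report; the two-row dataframe is illustrative only and is not the reproduction data.

```python
import re

import pandas as pd

# Same pattern as in the generated pipeline quoted in this report.
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")

# Illustrative data only: one plain column and one with special characters.
df = pd.DataFrame({"a": [1, 2], "[y]": [1, 0]})
renamed = df.rename(columns=lambda col: inhibited_symbol_pattern.sub("", col))

print(list(df.columns))       # ['a', '[y]']
print(list(renamed.columns))  # ['a', 'y']
```

Nothing in the renamed dataframe records that 'y' was originally '[y]', so any later output carries the modified name.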
To Reproduce
Steps to reproduce the behavior:
Show your code calling generate_code().
script
import numpy as np
import pandas as pd
from sapientml import SapientML

df = pd.DataFrame({'a': [1, 2] * 10, 'b': ["moji"] * 20, '[y]': [1, 0] * 10})
cls_ = SapientML(
    target_columns=['[y]'],
    task_type='classification',
)
cls_.fit(
    training_data=df,
)
Attach the datasets or dataframes input to generate_code() if possible.
Show the generated code such as 1_default.py when it was generated.
generated code
# GENERATED PIPELINE
# LOAD DATA
import pandas as pd
train_dataset = pd.read_pickle(r"C:\work\workspace\sapientml\outputs\training.pkl")
# TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
def split_dataset(dataset, train_size=0.75, random_state=17):
    train_dataset, test_dataset = train_test_split(dataset, train_size=train_size, random_state=random_state)
    return train_dataset, test_dataset
train_dataset, test_dataset = split_dataset(train_dataset)
train_dataset, validation_dataset = split_dataset(train_dataset)
# SUBSAMPLE
# If the number of rows of train_dataset is larger than sample_size, sample rows to sample_size for speedup.
from lib.sample_dataset import sample_dataset
train_dataset = sample_dataset(
    dataframe=train_dataset,
    sample_size=100000,
    target_columns=['[y]'],
    task_type='classification'
)
test_dataset = validation_dataset
# Remove special symbols that interfere with visualization and model training
import re
cols_has_symbols = ['[y]']
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")
train_dataset = train_dataset.rename(columns=lambda col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col)
test_dataset = test_dataset.rename(columns=lambda col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col)
# DISCARD COLUMNS WITH ONE VALUE ONLY
cols_one_value_only = ['b']
train_dataset = train_dataset.drop(cols_one_value_only, axis=1, errors="ignore")
test_dataset = test_dataset.drop(cols_one_value_only, axis=1, errors="ignore")
# DETACH TARGET
TARGET_COLUMNS = ['y']
feature_train = train_dataset.drop(TARGET_COLUMNS, axis=1)
target_train = train_dataset[TARGET_COLUMNS].copy()
feature_test = test_dataset.drop(TARGET_COLUMNS, axis=1)
target_test = test_dataset[TARGET_COLUMNS].copy()
# MODEL
import numpy as np
from xgboost import XGBClassifier
random_state_model = 42
model = XGBClassifier(random_state=random_state_model)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
target_train = pd.DataFrame(label_encoder.fit_transform(target_train), columns=TARGET_COLUMNS)
model.fit(feature_train, target_train.values.ravel())
y_pred = model.predict(feature_test)
y_pred = label_encoder.inverse_transform(y_pred).reshape(-1, 1)
# EVALUATION
from sklearn import metrics
f1 = metrics.f1_score(target_test, y_pred, average='macro')
print('RESULT: F1 Score: ' + str(f1))
Expected behavior
File output should use the original column name ('[y]'). Currently, file output processing is performed with the modified column name ('y').
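One possible way to get this behavior is to record a mapping from modified to original names at the point of renaming, then apply it in reverse just before output. The sketch below is not SapientML code: `sanitize_columns`, `restore_columns`, and `restore_map` are hypothetical names introduced here. It assumes string column names and that no two columns collapse to the same sanitized name.

```python
import re

import pandas as pd

# Same pattern as in the generated pipeline quoted in this report.
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")


def sanitize_columns(df):
    """Hypothetical helper: rename columns and remember the originals.

    Returns the renamed dataframe and a dict mapping each sanitized name
    back to its original name. Collisions between sanitized names are
    not handled in this sketch.
    """
    restore_map = {}
    for col in df.columns:
        if not isinstance(col, str):
            continue  # assumption: only string names need sanitizing
        new_name = inhibited_symbol_pattern.sub("", col)
        if new_name != col:
            restore_map[new_name] = col
    forward_map = {orig: new for new, orig in restore_map.items()}
    return df.rename(columns=forward_map), restore_map


def restore_columns(df, restore_map):
    """Hypothetical helper: map sanitized names back to the originals."""
    return df.rename(columns=restore_map)


# Illustrative data only.
df = pd.DataFrame({"a": [1, 2], "[y]": [1, 0]})
sanitized, restore_map = sanitize_columns(df)

# Stand-in for a prediction frame produced with the sanitized target name.
predictions = sanitized[["y"]]
restored = restore_columns(predictions, restore_map)

print(list(sanitized.columns))  # ['a', 'y']
print(list(restored.columns))   # ['[y]']
```

Keeping the mapping as plain data means it could be applied at whichever step writes files, without changing the training steps that rely on the sanitized names.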
Environment (please complete the following information):
OS: [e.g. Ubuntu 20.04]
SapientML Version: [e.g. 2.3.4]