DATA TRANSFORMATION
Transformation Pipeline
Raw numerical features from validation step
KNNImputer (n_neighbors=3, weights='uniform')
Replace -1 with 0 in target column for binary classification
Convert to numpy arrays for model training
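To make the imputation step concrete, here is a minimal sketch on made-up numbers (not project data). It uses the same KNNImputer settings as the pipeline and contrasts the result with a plain column mean. SimpleImputer and the toy matrix are assumptions added only for comparison.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy feature matrix (illustrative values only); row 1 is missing its second feature.
X = np.array([
    [0.0, 0.0],
    [0.0, np.nan],
    [1.0, 2.0],
    [1.0, 4.0],
    [10.0, 100.0],
])

# Same settings as the pipeline: 3 neighbours, uniform weights.
knn_filled = KNNImputer(
    missing_values=np.nan, n_neighbors=3, weights="uniform"
).fit_transform(X)

# Column-mean imputation, shown only as a baseline for comparison.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# KNN averages the three rows closest on the observed feature: (0 + 2 + 4) / 3.
knn_value = knn_filled[1, 1]
# The column mean also includes the distant row: (0 + 2 + 4 + 100) / 4.
mean_value = mean_filled[1, 1]

print(knn_value, mean_value)
```

In this toy case the KNN estimate ignores the outlying row, while the column mean is pulled toward it.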
Preprocessor Serialization
Save as preprocessor.pkl for inference consistency
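A rough sketch of the save-and-reload round trip follows. The project's own save_object helper is not shown in the source, so save_object_sketch and load_object_sketch below are hypothetical stand-ins built on pickle, and the data is made up.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline


def save_object_sketch(file_path, obj):
    """Hypothetical stand-in for the project's save_object helper."""
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(file_path, "wb") as file_obj:
        pickle.dump(obj, file_obj)


def load_object_sketch(file_path):
    """Hypothetical loader used at inference time."""
    with open(file_path, "rb") as file_obj:
        return pickle.load(file_obj)


# Made-up training data and one unseen row with a missing value.
X_train = np.array([[0.0, 0.0], [1.0, 2.0], [1.0, 4.0], [10.0, 100.0]])
X_new = np.array([[0.0, np.nan]])

fitted = Pipeline([("imputer", KNNImputer(n_neighbors=3))]).fit(X_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_model", "preprocessor.pkl")
    save_object_sketch(path, fitted)
    restored = load_object_sketch(path)

# The reloaded preprocessor should impute exactly as the in-memory one does.
same_output = np.allclose(fitted.transform(X_new), restored.transform(X_new))
print(same_output)
```

Pickled scikit-learn objects are generally only safe to reload with the same library version that produced them.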
Implementation
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

# TARGET_COLUMN, save_object and DataTransformationArtifact come from the project's own modules.
DATA_TRANSFORMATION_IMPUTER_PARAMS: dict = {
    "missing_values": np.nan,
    "n_neighbors": 3,
    "weights": "uniform",
}

class DataTransformation:
    # Excerpt: __init__ and read_data are not shown.

    @classmethod
    def get_data_transformer_object(cls) -> Pipeline:
        # Single-step pipeline: KNN imputation of missing values
        imputer: KNNImputer = KNNImputer(**DATA_TRANSFORMATION_IMPUTER_PARAMS)
        processor: Pipeline = Pipeline([("imputer", imputer)])
        return processor

    def initiate_data_transformation(self) -> DataTransformationArtifact:
        # Load validated data
        train_df = DataTransformation.read_data(self.data_validation_artifact.valid_train_file_path)
        test_df = DataTransformation.read_data(self.data_validation_artifact.valid_test_file_path)
        # Separate features and target
        input_feature_train_df = train_df.drop(columns=[TARGET_COLUMN])
        target_feature_train_df = train_df[TARGET_COLUMN]
        target_feature_train_df = target_feature_train_df.replace(-1, 0)  # Binary classification
        # Fit the preprocessor on the training features only
        preprocessor = self.get_data_transformer_object()
        preprocessor_object = preprocessor.fit(input_feature_train_df)
        # Transform the features and append the target as the last column
        transformed_input_train = preprocessor_object.transform(input_feature_train_df)
        train_arr = np.c_[transformed_input_train, np.array(target_feature_train_df)]
        # Save preprocessor for inference
        save_object("final_model/preprocessor.pkl", preprocessor_object)
        # Test-set transformation and the returned DataTransformationArtifact are not shown in this excerpt.
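The excerpt loads test_df but does not show how it is transformed. One plausible approach, sketched here on made-up data, is to reuse the preprocessor fitted on the training features rather than refitting on the test set. The column name "Result", the helper split_features_target and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

TARGET_COLUMN = "Result"  # assumed name; the real constant lives in the project

# Made-up train and test frames (illustrative values only).
train_df = pd.DataFrame({
    "f1": [0.0, 1.0, 1.0, 10.0],
    "f2": [0.0, 2.0, 4.0, 100.0],
    TARGET_COLUMN: [1, -1, 1, -1],
})
test_df = pd.DataFrame({
    "f1": [0.0, 10.0],
    "f2": [np.nan, 100.0],
    TARGET_COLUMN: [-1, 1],
})


def split_features_target(df):
    """Hypothetical helper: split off the target and replace -1 with 0."""
    features = df.drop(columns=[TARGET_COLUMN])
    target = df[TARGET_COLUMN].replace(-1, 0)
    return features, target


X_train, y_train = split_features_target(train_df)
X_test, y_test = split_features_target(test_df)

# Fit on training features only, then reuse the fitted object for both sets.
preprocessor = Pipeline([("imputer", KNNImputer(n_neighbors=3))]).fit(X_train)

train_arr = np.c_[preprocessor.transform(X_train), y_train.to_numpy()]
test_arr = np.c_[preprocessor.transform(X_test), y_test.to_numpy()]

print(test_arr)
```

Fitting only on the training split means the test row's missing value is filled from training neighbours, so no test-set information leaks into the preprocessor.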
Key Features Utilized
having_IP_Address
URL_Length
Shortining_Service
having_At_Symbol
double_slash_redirecting
Prefix_Suffix
having_Sub_Domain
SSLfinal_State
Domain_registeration_length
Favicon
port
HTTPS_token
Request_URL
URL_of_Anchor
+16 more features
Reproducibility
Serialized preprocessor ensures consistent transformations across training and inference
Intelligent Imputation
KNN-based imputation fills each missing value from the 3 nearest rows, which tends to preserve relationships between features better than a single column-wide statistic such as the mean
Pipeline Architecture
Modular design allows easy extension with additional transformers if needed
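As an illustration of that extensibility, the sketch below appends a scaling step after the imputer. StandardScaler is a hypothetical addition chosen for the example; the source pipeline contains only the imputer, and the data is made up.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imputer configured as in the source; the scaler is an assumed extra step.
extended = Pipeline([
    ("imputer", KNNImputer(missing_values=np.nan, n_neighbors=3, weights="uniform")),
    ("scaler", StandardScaler()),
])

# Made-up feature matrix with one missing value.
X = np.array([
    [0.0, 0.0],
    [0.0, np.nan],
    [1.0, 2.0],
    [1.0, 4.0],
    [10.0, 100.0],
])

# Steps run in order: missing values are imputed first, then each column is standardized.
X_out = extended.fit_transform(X)

print(X_out.mean(axis=0), X_out.std(axis=0))
```

Because the whole Pipeline is one object, serializing it would carry any added step along with the imputer.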