DATA TRANSFORMATION

Transformation Pipeline

Input Features

Raw numerical features from validation step

Missing Value Imputation

KNNImputer(n_neighbors=3, weights='uniform')
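
The imputer can be exercised standalone to see what those settings do; a minimal sketch with a toy matrix (the data values here are illustrative, not from the project):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Same settings as the pipeline: 3 neighbors, uniform weights
imputer = KNNImputer(missing_values=np.nan, n_neighbors=3, weights="uniform")

X = np.array([
    [1.0, 2.0],
    [1.0, np.nan],  # missing value to fill
    [1.0, 4.0],
    [1.0, 6.0],
    [9.0, 9.0],     # distant row, excluded from the 3 nearest neighbors
])

# Row 1's nearest neighbors (by the present feature) are rows 0, 2, 3,
# so the gap is filled with mean(2.0, 4.0, 6.0) = 4.0
X_filled = imputer.fit_transform(X)
```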

Target Transformation

Replace -1 with 0 in target column for binary classification
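
The remapping is a one-liner with pandas; a sketch using a hypothetical target column (the dataset encodes the negative class as -1):

```python
import pandas as pd

# Hypothetical label series; real labels come from the validated DataFrame
target = pd.Series([-1, 1, -1, 1], name="Result")

# Map {-1, 1} to {0, 1} so standard binary classifiers accept the labels
target = target.replace(-1, 0)
```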

Array Construction

Convert to numpy arrays for model training
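
The conversion uses `np.c_` to append the target as the last column, producing a single training array; a small sketch:

```python
import numpy as np

features = np.array([[0.1, 0.2],
                     [0.3, 0.4]])
labels = np.array([0, 1])

# np.c_ stacks column-wise: features first, labels as the final column
train_arr = np.c_[features, labels]
# train_arr has shape (2, 3); train_arr[:, -1] recovers the labels
```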

Preprocessor Serialization

Save as preprocessor.pkl for inference consistency

Implementation

# KNN Imputer Configuration
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

DATA_TRANSFORMATION_IMPUTER_PARAMS: dict = {
    "missing_values": np.nan,
    "n_neighbors": 3,
    "weights": "uniform",
}

# Preprocessor Pipeline Creation
def get_data_transformer_object(cls) -> Pipeline:
    imputer: KNNImputer = KNNImputer(**DATA_TRANSFORMATION_IMPUTER_PARAMS)
    processor: Pipeline = Pipeline([("imputer", imputer)])
    return processor
# Core Transformation Logic
def initiate_data_transformation(self) -> DataTransformationArtifact:
    # Load validated data
    train_df = DataTransformation.read_data(self.data_validation_artifact.valid_train_file_path)
    test_df = DataTransformation.read_data(self.data_validation_artifact.valid_test_file_path)

    # Separate features and target
    input_feature_train_df = train_df.drop(columns=[TARGET_COLUMN])
    target_feature_train_df = train_df[TARGET_COLUMN]
    target_feature_train_df = target_feature_train_df.replace(-1, 0)  # Binary classification

    # Apply transformations
    preprocessor = self.get_data_transformer_object()
    preprocessor_object = preprocessor.fit(input_feature_train_df)

    # Transform and save data
    transformed_input_train = preprocessor_object.transform(input_feature_train_df)
    train_arr = np.c_[transformed_input_train, np.array(target_feature_train_df)]

    # Save preprocessor for inference
    save_object("final_model/preprocessor.pkl", preprocessor_object)

Key Features Utilized

having_IP_Address, URL_Length, Shortining_Service, having_At_Symbol, double_slash_redirecting, Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Domain_registeration_length, Favicon, port, HTTPS_token, Request_URL, URL_of_Anchor, +16 more features

Reproducibility

Serialized preprocessor ensures consistent transformations across training and inference
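
The guarantee is that inference deserializes the exact fitted object rather than refitting; a self-contained sketch of that round trip (using an in-memory pickle in place of preprocessor.pkl on disk):

```python
import pickle
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

# Fit once at training time...
preprocessor = Pipeline([("imputer", KNNImputer(n_neighbors=3))])
preprocessor.fit(np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0], [3.0, 3.0]]))
blob = pickle.dumps(preprocessor)  # stands in for preprocessor.pkl

# ...reload at inference time: identical transform, no refitting
restored = pickle.loads(blob)
row = restored.transform(np.array([[1.0, np.nan]]))
```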

Intelligent Imputation

KNN-based missing-value handling preserves relationships between features, rather than relying on simple summary statistics such as the column mean or median
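
A toy comparison makes the difference concrete (illustrative values, not project data): mean imputation pulls the fill toward distant outlier rows, while KNN fills from the rows most similar to the one with the gap.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([
    [1.0,   10.0],
    [1.0,   np.nan],  # similar rows suggest a fill near 11
    [1.0,   12.0],
    [1.0,   11.0],
    [100.0, 500.0],   # outlier cluster drags the column mean upward
    [100.0, 520.0],
])

mean_fill = SimpleImputer(strategy="mean").fit_transform(X)[1, 1]  # 210.6
knn_fill = KNNImputer(n_neighbors=3).fit_transform(X)[1, 1]        # 11.0
```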

Pipeline Architecture

Modular design allows easy extension with additional transformers if needed
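
A hypothetical extension, sketched under the assumption that a scaling step is wanted: appending another transformer is just one more `(name, transformer)` entry in the Pipeline steps list.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical extended pipeline: impute first, then standardize
processor = Pipeline([
    ("imputer", KNNImputer(n_neighbors=3)),
    ("scaler", StandardScaler()),
])

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
X_t = processor.fit_transform(X)  # imputed, then zero-mean per column
```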