DATA TRANSFORMATION

Transformation Pipeline

Input Features

Raw numerical features from validation step

Missing Value Imputation

KNNImputer(n_neighbors=3, weights='uniform')
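
The imputer can be exercised standalone to see what those settings do; a minimal sketch with a toy matrix (the data values here are illustrative, not from the project):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Same settings as the pipeline: 3 neighbors, uniform weights
imputer = KNNImputer(missing_values=np.nan, n_neighbors=3, weights="uniform")

X = np.array([
    [1.0, 2.0],
    [1.0, np.nan],  # missing value to fill
    [1.0, 4.0],
    [1.0, 6.0],
    [9.0, 9.0],     # distant row, excluded from the 3 nearest neighbors
])

# Row 1's nearest neighbors (by the present feature) are rows 0, 2, 3,
# so the gap is filled with mean(2.0, 4.0, 6.0) = 4.0
X_filled = imputer.fit_transform(X)
```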

Target Transformation

Replace -1 with 0 in target column for binary classification
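
The remapping is a one-liner with pandas; a sketch using a hypothetical target column (the dataset encodes the negative class as -1):

```python
import pandas as pd

# Hypothetical label series; real labels come from the validated DataFrame
target = pd.Series([-1, 1, -1, 1], name="Result")

# Map {-1, 1} to {0, 1} so standard binary classifiers accept the labels
target = target.replace(-1, 0)
```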

Array Construction

Convert to numpy arrays for model training
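
The conversion uses `np.c_` to append the target as the last column, producing a single training array; a small sketch:

```python
import numpy as np

features = np.array([[0.1, 0.2],
                     [0.3, 0.4]])
labels = np.array([0, 1])

# np.c_ stacks column-wise: features first, labels as the final column
train_arr = np.c_[features, labels]
# train_arr has shape (2, 3); train_arr[:, -1] recovers the labels
```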

Preprocessor Serialization

Save as preprocessor.pkl for inference consistency

Implementation

# KNN Imputer Configuration
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

DATA_TRANSFORMATION_IMPUTER_PARAMS: dict = {
    "missing_values": np.nan,
    "n_neighbors": 3,
    "weights": "uniform",
}

# Preprocessor Pipeline Creation
def get_data_transformer_object(cls) -> Pipeline:
    imputer: KNNImputer = KNNImputer(**DATA_TRANSFORMATION_IMPUTER_PARAMS)
    processor: Pipeline = Pipeline([("imputer", imputer)])
    return processor
# Core Transformation Logic
def initiate_data_transformation(self) -> DataTransformationArtifact:
    # Load validated data
    train_df = DataTransformation.read_data(self.data_validation_artifact.valid_train_file_path)
    test_df = DataTransformation.read_data(self.data_validation_artifact.valid_test_file_path)

    # Separate features and target
    input_feature_train_df = train_df.drop(columns=[TARGET_COLUMN])
    target_feature_train_df = train_df[TARGET_COLUMN]
    target_feature_train_df = target_feature_train_df.replace(-1, 0)  # Binary classification

    # Apply transformations
    preprocessor = self.get_data_transformer_object()
    preprocessor_object = preprocessor.fit(input_feature_train_df)

    # Transform and save data
    transformed_input_train = preprocessor_object.transform(input_feature_train_df)
    train_arr = np.c_[transformed_input_train, np.array(target_feature_train_df)]

    # Save preprocessor for inference
    save_object("final_model/preprocessor.pkl", preprocessor_object)

Key Features Utilized

having_IP_Address, URL_Length, Shortining_Service, having_At_Symbol, double_slash_redirecting, Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Domain_registeration_length, Favicon, port, HTTPS_token, Request_URL, URL_of_Anchor, +16 more features

Reproducibility

Serialized preprocessor ensures consistent transformations across training and inference
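
The guarantee is that inference deserializes the exact fitted object rather than refitting; a self-contained sketch of that round trip (using an in-memory pickle in place of preprocessor.pkl on disk):

```python
import pickle
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

# Fit once at training time...
preprocessor = Pipeline([("imputer", KNNImputer(n_neighbors=3))])
preprocessor.fit(np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0], [3.0, 3.0]]))
blob = pickle.dumps(preprocessor)  # stands in for preprocessor.pkl

# ...reload at inference time: identical transform, no refitting
restored = pickle.loads(blob)
row = restored.transform(np.array([[1.0, np.nan]]))
```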

Intelligent Imputation

KNN-based missing-value handling preserves relationships between features, rather than relying on simple summary statistics such as the column mean or median
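
A toy comparison makes the difference concrete (illustrative values, not project data): mean imputation pulls the fill toward distant outlier rows, while KNN fills from the rows most similar to the one with the gap.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([
    [1.0,   10.0],
    [1.0,   np.nan],  # similar rows suggest a fill near 11
    [1.0,   12.0],
    [1.0,   11.0],
    [100.0, 500.0],   # outlier cluster drags the column mean upward
    [100.0, 520.0],
])

mean_fill = SimpleImputer(strategy="mean").fit_transform(X)[1, 1]  # 210.6
knn_fill = KNNImputer(n_neighbors=3).fit_transform(X)[1, 1]        # 11.0
```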

Pipeline Architecture

Modular design allows easy extension with additional transformers if needed
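
A hypothetical extension, sketched under the assumption that a scaling step is wanted: appending another transformer is just one more `(name, transformer)` entry in the Pipeline steps list.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical extended pipeline: impute first, then standardize
processor = Pipeline([
    ("imputer", KNNImputer(n_neighbors=3)),
    ("scaler", StandardScaler()),
])

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
X_t = processor.fit_transform(X)  # imputed, then zero-mean per column
```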