DATA VALIDATION & SCHEMA

Schema Validation

Column Validation

Checks that the DataFrame contains exactly the number of columns defined in the schema configuration (name- and dtype-level checks are handled separately)

# Column validation method
def validate_number_of_columns(self, dataframe: pd.DataFrame) -> bool:
    number_of_columns = len(self._schema_config)
    logging.info(f"Required columns: {number_of_columns}")
    logging.info(f"DataFrame columns: {len(dataframe.columns)}")
    if len(dataframe.columns) == number_of_columns:
        return True
    return False
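A count check alone will not catch a renamed column or a wrong dtype. A minimal sketch (not from the source) of a stricter validator that compares column names and dtypes against a schema mapping:

```python
import pandas as pd

def validate_columns(dataframe: pd.DataFrame, schema: dict) -> list:
    """Return a list of problems; an empty list means the frame matches the schema."""
    problems = []
    for name, expected_dtype in schema.items():
        if name not in dataframe.columns:
            problems.append(f"missing column: {name}")
        elif str(dataframe[name].dtype) != expected_dtype:
            problems.append(f"{name}: expected {expected_dtype}, got {dataframe[name].dtype}")
    # Flag columns present in the data but absent from the schema
    for name in sorted(set(dataframe.columns) - set(schema)):
        problems.append(f"unexpected column: {name}")
    return problems

df = pd.DataFrame({"having_IP_Address": [1, -1], "URL_Length": [0.5, 1.0]})
schema = {"having_IP_Address": "int64", "URL_Length": "int64"}
print(validate_columns(df, schema))  # reports the URL_Length dtype mismatch
```

Returning a list of problems rather than a bare boolean makes validation failures easy to log and debug.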

Schema Definition

YAML-based schema with 30 features (URL & website attributes) + target

# schema.yaml excerpt
columns:
  - having_IP_Address: int64
  - URL_Length: int64
  - Shortining_Service: int64
  - having_At_Symbol: int64
  - double_slash_redirecting: int64
  - Prefix_Suffix: int64
  - having_Sub_Domain: int64
  - SSLfinal_State: int64
  - Domain_registeration_length: int64
  - Favicon: int64
  - port: int64
  # ...19 additional features
  - Result: int64  # Target variable

numerical_columns:
  - having_IP_Address
  - URL_Length
  # ...28 additional features
  - Result
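A schema file in this shape can be loaded with PyYAML and flattened into a single name-to-dtype mapping. A sketch (the `load_schema` helper is hypothetical, assuming the list-of-single-key-dicts layout shown above):

```python
import yaml

def load_schema(path: str) -> dict:
    """Flatten the `columns` list of {name: dtype} entries into one dict."""
    with open(path) as f:
        config = yaml.safe_load(f)
    schema = {}
    for entry in config["columns"]:
        schema.update(entry)  # each entry is a one-key dict, e.g. {"URL_Length": "int64"}
    return schema
```

The flattened dict then drives both the column-count check (`len(schema)`) and any per-column name/dtype checks.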

Drift Detection

Statistical Drift Testing

Uses the two-sample Kolmogorov-Smirnov test to detect distribution shifts between training and test data

# Drift detection with KS test
def detect_dataset_drift(self, base_df, current_df, threshold=0.05) -> bool:
    status = True
    report = {}
    for column in base_df.columns:
        d1 = base_df[column]
        d2 = current_df[column]
        is_same_dist = ks_2samp(d1, d2)
        if threshold <= is_same_dist.pvalue:
            is_found = False
        else:
            is_found = True
            status = False
        report.update({
            column: {
                "p_value": float(is_same_dist.pvalue),
                "drift_status": is_found,
            }
        })
    # Write report to YAML
    write_yaml_file(file_path=drift_report_file_path, content=report)
    return status
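To see how the KS test behaves, a standalone sketch on synthetic data (not from the source): a sample drawn from the same distribution yields a large p-value (no drift), while a mean-shifted sample yields a tiny one (drift).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
base = rng.normal(loc=0.0, scale=1.0, size=1000)     # reference sample
same = rng.normal(loc=0.0, scale=1.0, size=1000)     # same distribution
shifted = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean-shifted distribution

p_same = ks_2samp(base, same).pvalue        # large: fail to reject "same distribution"
p_shifted = ks_2samp(base, shifted).pvalue  # tiny: reject -> drift detected

print(f"same: p={p_same:.3f}, shifted: p={p_shifted:.2e}")
```

With a threshold of 0.05, only the shifted sample would be flagged, mirroring the per-column logic in the method above.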

Drift Report Sample

Sample report showing per-feature p-values and drift status (threshold = 0.05)

having_IP_Address: p-value=0.89 (No Drift)
URL_Length: p-value=0.76 (No Drift)
Shortining_Service: p-value=0.03 (Drift Detected)
having_At_Symbol: p-value=0.68 (No Drift)
double_slash_redirecting: p-value=0.91 (No Drift)
Domain_registeration_length: p-value=0.04 (Drift Detected)
HTTPS_token: p-value=0.56 (No Drift)
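Once loaded back from YAML, a report in this shape (the key names match those written by the drift code above) is easy to filter for the drifted features. A small sketch with hypothetical values:

```python
# Drift report as a dict, mirroring the YAML structure written by detect_dataset_drift
report = {
    "having_IP_Address": {"p_value": 0.89, "drift_status": False},
    "Shortining_Service": {"p_value": 0.03, "drift_status": True},
    "Domain_registeration_length": {"p_value": 0.04, "drift_status": True},
}

# Collect only the columns flagged as drifted
drifted = [col for col, info in report.items() if info["drift_status"]]
print(drifted)  # ['Shortining_Service', 'Domain_registeration_length']
```

A summary like this is what an alerting step would act on before allowing training to proceed.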

Validation Pipeline Benefits

Data Quality Assurance

Prevents garbage-in, garbage-out by ensuring all data meets expected structure before training

Early Warning System

Identifies distribution shifts that could impact model performance before deployment

Pipeline Robustness

Creates an audit trail of validation reports for tracing and reproducing model performance