DATA VALIDATION & SCHEMA

Schema Validation

Column Validation

Checks that the DataFrame contains exactly the number of columns defined in the schema configuration (name- and dtype-level checks are handled separately)

# Column validation method
def validate_number_of_columns(self, dataframe: pd.DataFrame) -> bool:
    number_of_columns = len(self._schema_config)
    logging.info(f"Required columns: {number_of_columns}")
    logging.info(f"DataFrame columns: {len(dataframe.columns)}")
    if len(dataframe.columns) == number_of_columns:
        return True
    return False
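A count check alone will not catch a renamed column or a wrong dtype. A minimal sketch (not from the source) of a stricter validator that compares column names and dtypes against a schema mapping:

```python
import pandas as pd

def validate_columns(dataframe: pd.DataFrame, schema: dict) -> list:
    """Return a list of problems; an empty list means the frame matches the schema."""
    problems = []
    for name, expected_dtype in schema.items():
        if name not in dataframe.columns:
            problems.append(f"missing column: {name}")
        elif str(dataframe[name].dtype) != expected_dtype:
            problems.append(f"{name}: expected {expected_dtype}, got {dataframe[name].dtype}")
    # Flag columns present in the data but absent from the schema
    for name in sorted(set(dataframe.columns) - set(schema)):
        problems.append(f"unexpected column: {name}")
    return problems

df = pd.DataFrame({"having_IP_Address": [1, -1], "URL_Length": [0.5, 1.0]})
schema = {"having_IP_Address": "int64", "URL_Length": "int64"}
print(validate_columns(df, schema))  # reports the URL_Length dtype mismatch
```

Returning a list of problems rather than a bare boolean makes validation failures easy to log and debug.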

Schema Definition

YAML-based schema with 30 features (URL & website attributes) + target

# schema.yaml excerpt
columns:
  - having_IP_Address: int64
  - URL_Length: int64
  - Shortining_Service: int64
  - having_At_Symbol: int64
  - double_slash_redirecting: int64
  - Prefix_Suffix: int64
  - having_Sub_Domain: int64
  - SSLfinal_State: int64
  - Domain_registeration_length: int64
  - Favicon: int64
  - port: int64
  # ...19 additional features
  - Result: int64  # Target variable

numerical_columns:
  - having_IP_Address
  - URL_Length
  # ...28 additional features
  - Result
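A schema file in this shape can be loaded with PyYAML and flattened into a single name-to-dtype mapping. A sketch (the `load_schema` helper is hypothetical, assuming the list-of-single-key-dicts layout shown above):

```python
import yaml

def load_schema(path: str) -> dict:
    """Flatten the `columns` list of {name: dtype} entries into one dict."""
    with open(path) as f:
        config = yaml.safe_load(f)
    schema = {}
    for entry in config["columns"]:
        schema.update(entry)  # each entry is a one-key dict, e.g. {"URL_Length": "int64"}
    return schema
```

The flattened dict then drives both the column-count check (`len(schema)`) and any per-column name/dtype checks.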

Drift Detection

Statistical Drift Testing

Uses the two-sample Kolmogorov-Smirnov test to detect distribution shifts between training and test data

# Drift detection with KS test
def detect_dataset_drift(self, base_df, current_df, threshold=0.05) -> bool:
    status = True
    report = {}
    for column in base_df.columns:
        d1 = base_df[column]
        d2 = current_df[column]
        is_same_dist = ks_2samp(d1, d2)
        if threshold <= is_same_dist.pvalue:
            is_found = False
        else:
            is_found = True
            status = False
        report.update({
            column: {
                "p_value": float(is_same_dist.pvalue),
                "drift_status": is_found,
            }
        })
    # Write report to YAML
    write_yaml_file(file_path=drift_report_file_path, content=report)
    return status
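To see how the KS test behaves, a standalone sketch on synthetic data (not from the source): a sample drawn from the same distribution yields a large p-value (no drift), while a mean-shifted sample yields a tiny one (drift).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
base = rng.normal(loc=0.0, scale=1.0, size=1000)     # reference sample
same = rng.normal(loc=0.0, scale=1.0, size=1000)     # same distribution
shifted = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean-shifted distribution

p_same = ks_2samp(base, same).pvalue        # large: fail to reject "same distribution"
p_shifted = ks_2samp(base, shifted).pvalue  # tiny: reject -> drift detected

print(f"same: p={p_same:.3f}, shifted: p={p_shifted:.2e}")
```

With a threshold of 0.05, only the shifted sample would be flagged, mirroring the per-column logic in the method above.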

Drift Report Sample

Sample report showing per-feature p-values and drift status (threshold = 0.05)

having_IP_Address: p-value=0.89 (No Drift)
URL_Length: p-value=0.76 (No Drift)
Shortining_Service: p-value=0.03 (Drift Detected)
having_At_Symbol: p-value=0.68 (No Drift)
double_slash_redirecting: p-value=0.91 (No Drift)
Domain_registeration_length: p-value=0.04 (Drift Detected)
HTTPS_token: p-value=0.56 (No Drift)
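Once loaded back from YAML, a report in this shape (the key names match those written by the drift code above) is easy to filter for the drifted features. A small sketch with hypothetical values:

```python
# Drift report as a dict, mirroring the YAML structure written by detect_dataset_drift
report = {
    "having_IP_Address": {"p_value": 0.89, "drift_status": False},
    "Shortining_Service": {"p_value": 0.03, "drift_status": True},
    "Domain_registeration_length": {"p_value": 0.04, "drift_status": True},
}

# Collect only the columns flagged as drifted
drifted = [col for col, info in report.items() if info["drift_status"]]
print(drifted)  # ['Shortining_Service', 'Domain_registeration_length']
```

A summary like this is what an alerting step would act on before allowing training to proceed.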

Validation Pipeline Benefits

Data Quality Assurance

Prevents garbage-in, garbage-out by ensuring all data meets expected structure before training

Early Warning System

Identifies distribution shifts that could impact model performance before deployment

Pipeline Robustness

Creates an audit trail of validation reports for tracing and reproducing model performance