Project Overview
The objective of this project was to design, train, and evaluate a machine learning model that classifies websites as phishing or legitimate based on extracted features.
Phishing attacks are among the most widespread and damaging threats to online security. Automating their detection can help prevent fraud, protect user data, and improve trust in online services.
Dataset
We used the Phishing Websites Dataset from the UCI Machine Learning Repository, containing 11,055 records and 31 features describing website characteristics, including:
- URL length and structure
- Use of HTTPS protocol
- Presence of suspicious symbols
- Domain registration length
- Page rank and traffic statistics
The target variable is binary:
- `1` → Phishing website
- `-1` → Legitimate website
Preprocessing
The preprocessing pipeline included:
- Handling missing values.
- Encoding categorical variables into numerical format.
- Normalizing numerical features.
- Splitting the dataset into training (80%) and testing (20%) sets.
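The steps above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the project's actual pipeline: the synthetic data, the `most_frequent` imputation strategy, the choice of `MinMaxScaler`, and the helper names `make_synthetic_data` and `preprocess` are all assumptions. The features are assumed to be numeric already, so the encoding step is omitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def make_synthetic_data(n_rows=200, n_features=10, seed=0):
    """Hypothetical stand-in for the real dataset: features in {-1, 0, 1}, labels in {-1, 1}."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 0.0, 1.0], size=(n_rows, n_features))
    y = rng.choice([-1, 1], size=n_rows)
    # Blank out a few cells to mimic missing values.
    X[rng.random(X.shape) < 0.02] = np.nan
    return X, y


def preprocess(X, y, seed=42):
    # 80/20 split, stratified so both sets keep the same class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # assumed strategy
        ("scale", MinMaxScaler()),                            # assumed scaler
    ])
    # Fit on the training split only, then apply the same transform to the test split.
    X_train = pipeline.fit_transform(X_train)
    X_test = pipeline.transform(X_test)
    return X_train, X_test, y_train, y_test


X, y = make_synthetic_data()
X_train, X_test, y_train, y_test = preprocess(X, y)
print(X_train.shape, X_test.shape)
```

Fitting the imputer and scaler on the training split alone keeps statistics from the test set out of the model inputs, which is the usual way to avoid leakage.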
Models and Evaluation
We trained and evaluated multiple supervised classification algorithms:
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
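A comparison loop over these five algorithms might look like the sketch below. It assumes scikit-learn with mostly default hyperparameters and uses `make_classification` to generate placeholder data, so the printed accuracies say nothing about the phishing dataset; the settings the team actually used are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data as a placeholder for the phishing features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Default hyperparameters are an assumption made for this sketch.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

# Train each model on the same split and record its test accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f}")
```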
Evaluation Metrics:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
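To make the metric definitions concrete, the sketch below derives accuracy, precision, recall, and F1-score from the four confusion-matrix counts in plain Python. It treats `1` (phishing) as the positive class, following the label convention in the Dataset section; the function names and the toy labels are illustrative, and any averaging scheme behind the reported scores is not assumed.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 1 = phishing, -1 = legitimate.
y_true = [1, 1, 1, -1, -1, -1, 1, -1]
y_pred = [1, 1, -1, -1, 1, -1, 1, -1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out to 0.75.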
Results
The Random Forest Classifier delivered the highest performance across all metrics:
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Decision Tree | 0.951 | 0.95 | 0.95 | 0.95 |
| Random Forest | 0.977 | 0.98 | 0.98 | 0.98 |
| Logistic Regression | 0.926 | 0.93 | 0.93 | 0.93 |
| SVM | 0.961 | 0.96 | 0.96 | 0.96 |
| KNN | 0.934 | 0.93 | 0.93 | 0.93 |
The Random Forest model generalized well to unseen data, reaching 0.977 accuracy with precision, recall, and F1-score of 0.98, and is the recommended approach for deployment.
Conclusions
- Random Forest proved to be a highly effective algorithm for phishing detection, offering strong performance without extensive parameter tuning.
- This model can be integrated into a real-time phishing detection API to flag suspicious URLs before users access them.
- Future work could explore deep learning models or dynamic feature extraction from live webpage content.
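As a rough idea of how the integration mentioned above could look, the sketch below wraps any classifier exposing a scikit-learn style `predict` method in a single decision function. Everything here is hypothetical: `should_block`, the `Classifier` protocol, and the `FirstFeatureStub` toy model are illustrative names, and feature extraction from a live URL is left out entirely.

```python
from typing import Protocol, Sequence


class Classifier(Protocol):
    """Anything with a scikit-learn style predict method."""
    def predict(self, X: Sequence[Sequence[float]]) -> Sequence[int]: ...


PHISHING_LABEL = 1  # label convention from the Dataset section


def should_block(model: Classifier, features: Sequence[float]) -> bool:
    """Return True when the model labels the feature vector as phishing."""
    prediction = model.predict([list(features)])[0]
    return int(prediction) == PHISHING_LABEL


class FirstFeatureStub:
    """Toy stand-in for a trained model: flags rows whose first feature is -1."""
    def predict(self, X):
        return [1 if row[0] == -1 else -1 for row in X]


model = FirstFeatureStub()
print(should_block(model, [-1, 0, 1]))  # flagged as phishing
print(should_block(model, [1, 0, 1]))   # treated as legitimate
```

A trained Random Forest could be passed in place of the stub, since the function only relies on `predict`.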
Team & Acknowledgments
This was a group project completed as part of the Fundamentos de Aprendizaje de Máquina course at PUCP, with:
- Carlos Alberto Varas Tello
- Lucia Dayhana Garcia Murguia
- Paul Oliver Mateo Rodulfo
GitHub Repository: https://github.com/milkreator/deteccion_phishing_web