IT Project Plan
Academic project for IT Projects course — Chernivtsi National University
🔍 Fraud Detection System
Project Proposal submitted to Mr. Marku
Reference: IT Projects Course
Assignment
Use Case: Retrain Fraud Model on New Patterns
Aim: To reduce the time required to update the system with new data, ensuring high accuracy for legitimate users while minimizing downtime.
Overview: This process retrains the classifier to identify fresh fraudulent transaction types that the current production model misses. Automating it significantly reduces financial losses by cutting the delay between spotting new fraud patterns and deploying updated models.
Actors
- Project Manager — Accountable for approving and deploying models to production
- ML Engineer — Responsible for training the Challenger model and comparing it against the Production model
- MLOps Engineer — Manages model registration, deployment, performance monitoring, and rollbacks
- Fraud Analyst (Subject Matter Expert) — Reviews and labels fraud cases, and conducts bias/fairness audits
- Tech Lead — Responsible for triggering the retrain & evaluate pipeline
- Product Owner — Accountable for the final performance report and business value
- Data Engineer — Pulls the latest datasets from the Feature Store
- QA Engineer — Generates performance reports and validates model quality
- Automated Pipeline (System) — Orchestrates training, evaluation, and deployment processes
Pre-conditions
- New labeled data: The Fraud Analyst has already tagged recent suspicious transactions as "confirmed fraud"
- Dataset availability: This updated data is ready in the Feature Store
Scenario
- First, the Fraud Analyst finishes reviewing the fraud cases missed yesterday.
- The MLOps Engineer then triggers the "Retrain & Evaluate" pipeline.
- The System pulls the latest dataset from the Feature Store.
- A "Challenger" model is trained by the System to recognize these new patterns.
- The System compares this Challenger model against the active Production model (checking metrics like Recall and False Positive Rate).
- A report is generated showing that the new model detects the fraud without flagging legitimate users.
- Finally, the System registers the Challenger model as a release candidate.
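The compare-and-register gate in the last three steps can be sketched in a few lines of Python. This is a minimal sketch: the metric names and the 1% FPR ceiling are illustrative assumptions, not the project's final configuration.

```python
# Minimal sketch of the Challenger-vs-Production promotion gate.
# Thresholds and metric names are illustrative assumptions.

def promote_challenger(prod: dict, challenger: dict,
                       max_fpr: float = 0.01) -> bool:
    """Register the Challenger as a release candidate only if it
    improves Recall without pushing the False Positive Rate past
    the agreed ceiling (assumed 1% here)."""
    better_recall = challenger["recall"] > prod["recall"]
    acceptable_fpr = challenger["fpr"] <= max_fpr
    return better_recall and acceptable_fpr

prod_metrics = {"recall": 0.90, "fpr": 0.008}
challenger_metrics = {"recall": 0.93, "fpr": 0.009}
print(promote_challenger(prod_metrics, challenger_metrics))  # True
```

In the real pipeline this decision would run automatically after evaluation and, on success, trigger model registration.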
Post-conditions
A new, more accurate model version is registered and staged, ready for zero-downtime deployment. The system maintains high availability while incorporating improved fraud detection capabilities.
Unit Testing Strategy
The unit tests verifying this scenario will be implemented in the next development phase. The full source code, including tests for model comparison and registration logic, will be hosted on GitHub. The repository link will be shared for code review once implementation is finalized.
Planned test coverage includes:
- Data pipeline validation and Feature Store integration
- Model training and evaluation metrics verification
- Challenger vs Production model comparison logic
- Model registry and versioning functionality
- Deployment rollback mechanisms
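As a preview of that coverage, a pytest-style test of the comparison logic might look like the sketch below. `is_release_candidate` is a hypothetical stand-in; the real function and its tests will live in the GitHub repository.

```python
# Illustrative unit test for the Challenger-vs-Production comparison
# logic. The function under test is a stand-in for the real one.

def is_release_candidate(prod_recall: float, chal_recall: float,
                         chal_fpr: float, fpr_limit: float = 0.01) -> bool:
    """Challenger qualifies only if Recall improves and FPR stays in budget."""
    return chal_recall > prod_recall and chal_fpr <= fpr_limit

def test_challenger_wins_with_better_recall():
    assert is_release_candidate(0.90, 0.93, 0.009)

def test_challenger_rejected_on_high_fpr():
    assert not is_release_candidate(0.90, 0.95, 0.02)

if __name__ == "__main__":
    test_challenger_wins_with_better_recall()
    test_challenger_rejected_on_high_fpr()
    print("all tests passed")
```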
Yours sincerely,
Andrii Vlonha
Retrain Pipeline — Flow Diagram
Visual representation of the automated fraud model retraining and deployment pipeline.
Project Kanban Board
Current sprint status for the Fraud Detection System — Retrain Pipeline & Production Rollout.
To Do: 5 · Doing: 3 · Done: 6
Key Elements of IT Project Planning — Applied to Fraud Detection System
Each element below is tailored to the Fraud Detection System project (retrain pipeline & production rollout).
| Element | Owner | Purpose | Outputs | IT/Fraud Examples/Notes | Category |
|---|---|---|---|---|---|
| Identify & Analyze Stakeholders | Project Manager | Map everyone who influences or is impacted by the project to ensure proper engagement and avoid surprises. | | | Foundation |
| Define Roles, Responsibilities & RACI | PM + Tech Lead | Eliminate confusion — clearly define who owns what to streamline collaboration in fast-paced IT environments. | | | Foundation |
| Hold Kickoff Meeting | Project Manager | Align team on vision, scope, and processes to kickstart execution. | | | Launch |
| Define Scope, Budget & Timeline | PM + PO | Set firm boundaries to manage expectations and prevent overruns. | | | Core |
| Deliverables & Acceptance Criteria | PO + Tech Lead | Make success tangible by specifying outputs and how to verify them. | | | Core |
| Create Schedule & Milestones | Project Manager | Break down work into actionable steps with timelines. | | | Execution |
| Plan Resources & Team Capacity | PM + Tech Lead | Ensure availability of resources to avoid bottlenecks. | | | Execution |
| Risk Assessment & Mitigation | PM + Security | Identify and mitigate threats early to protect project outcomes. | | | Control |
| Quality & Success Metrics | Tech Lead + QA | Establish benchmarks to ensure the system meets high standards. | | | Control |
| Communication Plan | Project Manager | Maintain transparency and quick issue resolution. | | | Control |
RACI Matrix for Fraud Detection System
| Task | Project Manager | ML Engineer | MLOps Engineer | Fraud Analyst | Tech Lead | Product Owner | Data Engineer | QA Engineer |
|---|---|---|---|---|---|---|---|---|
| Review and label fraud cases | I | C | I | R A | C | C | C | I |
| Trigger retrain & evaluate pipeline | C | R | R A | I | C | I | I | C |
| Pull latest dataset from Feature Store | I | R | C | I | I | I | R A | |
| Train Challenger model | I | R A | C | C | C | I | I | I |
| Compare Challenger vs Production model | C | R | C | C | A | C | I | R |
| Generate performance report | C | C | C | C | C | A | I | R |
| Register Challenger model | I | R | R A | I | C | I | C | |
| Approve and deploy to production | C | C | R | I | R A | A | I | R |
| Monitor production performance | C | C | R A | R | C | C | I | C |
| Handle deployment rollbacks if needed | I | C | R A | I | R | C | I | R |
Project Priorities (Iron Triangle)
Primary Driver: Quality
In ML Fraud Detection, false negatives mean lost money, and false positives block real users. Scope & Quality are non-negotiable for a passing grade and business value.
Secondary Constraint: Deadline
The project is bound by the university academic calendar. The defense date is fixed, meaning timeline extensions are impossible.
Scope Actions (45%)
- Train Challenger model with 95%+ Precision/Recall.
- Build automated MLflow Retrain Pipeline.
- Implement Zero-Downtime Blue/Green deploy.
Time Actions (35%)
- Strict 4-Sprint lifecycle (2 weeks each).
- Deliver Core Pipeline MVP by Sprint 2.
- Final freeze 1 week before presentation.
Cost Actions (20%)
- Cap AWS/GCP usage at $500/month.
- Use Spot Instances for model training.
- Utilize open-source tools (Grafana, MLflow).
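The 95%+ Precision/Recall target under Scope Actions can be checked directly from confusion-matrix counts. The counts below are made-up example numbers, not project results.

```python
# Precision/Recall from confusion-matrix counts (illustrative numbers).

def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """tp: correctly flagged fraud, fp: legitimate users wrongly flagged,
    fn: fraud the model missed."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

p, r = precision_recall(tp=950, fp=40, fn=50)
print(round(p, 3), round(r, 3))  # 0.96 0.95
```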
Risk Assessment Template
| What are the hazards? | Who might be harmed and how? | What are you already doing to control the risks? | What further action do you need to take to control the risks? | Who needs to carry out the action? | When is the action needed by? | Status |
|---|---|---|---|---|---|---|
| Customer's insolvency (Funding falls through) | Development Team & Agency: Loss of expected revenue, unpaid working hours, and abrupt project cancellation. | We hold regular monthly syncs with the client to assess their business health and project satisfaction. | Require a 30% upfront advance payment before commencing the next project phase; pause execution if invoices are >15 days late. | Project Manager / Finance | Project start | Done |
| Data Drift degrading model accuracy | Business: Missed fraudulent transactions leading to direct financial loss. Users: Increased false positives. | Data Scientists manually evaluate batch transaction data from the previous week to check for statistical deviations. | Implement automated concept drift detection (e.g., evidentlyAI) within the MLflow pipeline to trigger auto-retraining alerts. | MLOps Engineer | Sprint 2 | In Process |
| Cloud Compute (GPU) Budget Overrun | Company Financials: Exceeding the strict $500/month budget reduces overall project profitability. | Basic AWS/GCP billing alerts are configured to trigger emails at 80% and 100% of the budget threshold. | Transition training workloads exclusively to Spot Instances and enforce strict auto-shutdown policies for idle GPU servers. | DevOps Engineer | Sprint 1 | Done |
| Critical Spike in False Positive Rate (>1%) | Legitimate Customers: Payment rejections, account lockouts, and severe UX degradation leading to churn. | Evaluating the Challenger model using standard train/test split metrics on historical static datasets. | Setup Grafana real-time alerts for live FPR metrics and mandate a Shadow A/B testing phase before full traffic routing. | QA / ML Engineer | Sprint 3 | In Process |
| Production Downtime during deployment | E-commerce Platforms & Users: Unable to process real-time checkouts during the API outage window. | Manual deployments are scheduled exclusively during low-traffic night hours (3:00 AM) with manual rollback plans. | Architect and test a Kubernetes-based Blue-Green deployment strategy ensuring 100% zero-downtime updates. | Tech Lead / DevOps | Sprint 4 | Future |
| GDPR / PII Privacy Violation | Company: Heavy regulatory fines, legal action, and massive reputational damage. | Raw transaction data access is restricted exclusively to authorized senior Database Administrators. | Implement automated data masking and hashing pipelines in the Feature Store before data reaches the ML training environment. | Data Engineer / SecOps | Sprint 2 | In Process |
| API Inference Latency >50ms | End-users: Frustratingly slow checkout process leading to cart abandonment and lower conversion rates. | Utilizing a simplified baseline model architecture (e.g., XGBoost) to keep prediction times naturally low. | Optimize the final serialized deep learning model using ONNX Runtime or TensorRT to guarantee sub-50ms execution. | ML Engineer | Sprint 4 | Future |
| Unexpected departure of Key Team Member | Project Timeline & Team: Severe delays in pipeline delivery and loss of critical domain/architectural knowledge. | Conducting daily stand-up meetings to share current context, tasks, and immediate blockages across the team. | Enforce a strict "Bus Factor" policy by requiring detailed Runbooks, Architectural Decision Records (ADRs), and mandatory code reviews. | Project Manager | Sprint 1 | Done |
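To illustrate the automated drift check from the risk table: in the project it would be handled by evidentlyAI inside the MLflow pipeline, but the underlying idea can be sketched with a plain Population Stability Index (PSI). The bucket count and the common 0.2 alert threshold are rules of thumb, not project decisions.

```python
# Hedged sketch of a drift check via Population Stability Index (PSI).
# A PSI above ~0.2 is a common rule-of-thumb signal of significant drift.
import math

def psi(expected: list, actual: list, buckets: int = 10) -> float:
    """Compare the distribution of a feature between a reference window
    (expected) and the current window (actual)."""
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / buckets
    value = 0.0
    for i in range(buckets):
        a = lo + i * step
        b = hi + 1e-9 if i == buckets - 1 else lo + (i + 1) * step
        # Bucket shares; tiny floor avoids log(0) for empty buckets.
        e = sum(a <= x < b for x in expected) / len(expected) or 1e-6
        c = sum(a <= x < b for x in actual) / len(actual) or 1e-6
        value += (c - e) * math.log(c / e)
    return value

reference = [float(x) for x in range(100)]
shifted = [float(x + 50) for x in range(100)]
print(psi(reference, reference) < 0.2, psi(reference, shifted) > 0.2)  # True True
```

When the index crosses the threshold, the pipeline would raise the auto-retraining alert described in the mitigation column.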
What the IT Project Should Produce — Expected Outcomes
Concrete deliverables and measurable results the Fraud Detection System project will produce upon successful completion.
Automated Retraining Pipeline
A fully automated CI/CD pipeline (Airflow + MLflow) that retrains the fraud classifier on new labeled data, evaluates the Challenger model, and registers it — with zero manual intervention.
⚡ 50% faster model updates
Model Versioning & Registry
Every trained model is versioned and tracked in MLflow with metadata (metrics, parameters, artifacts). Rollback to any previous version is possible within minutes.
🔒 Full audit trail
Real-time Monitoring Dashboard
Grafana dashboard tracking live model performance: Recall, False Positive Rate, prediction latency, and data drift alerts — visible to all stakeholders.
🎯 99.9% uptime SLA
Zero-Downtime Deployment
Blue-green or canary deployment strategy on Kubernetes ensures the production system stays online during model updates. Automatic rollback triggered if FPR degrades.
⏱️ <50ms prediction latency
Governance & Documentation
Complete project documentation: RACI matrix, risk register, communication plan, incident runbooks, GDPR compliance checklist, and unit-tested source code on GitHub.
✅ GDPR compliant
Improved Fraud Detection Accuracy
The new model detects 95%+ of fraud cases including previously missed patterns, while keeping False Positive Rate below 1% — protecting legitimate users from being blocked.
📈 95% accuracy achieved
Agile Methodologies
How Scrum & Kanban work together in the Fraud Detection System project
🔄 SCRUM — How We Use It in This Project
Scrum gives the team a structured, time-boxed lifecycle to build, evaluate, and ship the fraud model in predictable increments.
Product Backlog
All desired features and tasks for the fraud system, owned & prioritised by the Product Owner. Items are ordered by fraud risk impact and technical dependency.
Sprint Planning
Tech Lead (Scrum Master) + Product Owner + Dev Team select which backlog items to commit to for the next 2 weeks. The Sprint Goal is defined. Tasks are estimated in story points.
Sprint Backlog
The committed subset of tasks for this Sprint. Each item becomes a Kanban card on the board with an assigned owner, tag (MLOps / Data / QA / PM) and progress tracker.
- Fraud Analyst labels new fraud cases (Data)
- Data Engineer updates Feature Store (Data)
- ML Engineer trains Challenger model (XGBoost) (MLOps)
- evidentlyAI drift detection integrated (MLOps)
- GDPR masking pipeline verified (Data + SecOps)
Sprint 2 weeks
The team executes — cards move across the Kanban board every day. The Tech Lead runs a 15-minute Daily Standup every morning to surface blockers before they stall progress.
"Trained XGBoost Challenger — Recall 0.93 on test set"
"Compare Challenger vs Production in MLflow, generate report"
"GPU quota exceeded — DevOps escalation needed"
Potentially Shippable Product Increment
At the end of each Sprint, the team delivers a working, tested, "Done" increment. The Definition of Done is enforced strictly.
Sprint Review
Last day of Sprint · ~2 hours · All stakeholders attend
The team demos the working increment to stakeholders. The Product Backlog is updated based on feedback received during the demo.
- Live demo of Challenger model metrics on Grafana dashboard
- Fraud Analyst confirms new fraud patterns are correctly detected
- Product Owner formally accepts or rejects the increment
- Backlog reprioritised (e.g. A/B shadow testing moved up if FPR risk found)
- Next Sprint scope agreed with all stakeholders
Sprint Retrospective
After Review · ~1.5 hours · Team only (no stakeholders)
The team reflects on how they worked, not what they built. Three questions drive continuous process improvement every sprint.
"Automated pipeline trigger saved 3h of manual work in Sprint 2"
"GPU quota blocker was discovered mid-sprint, too late"
"Add GPU usage check to Sprint Planning checklist — action: DevOps"
Sprint Timeline — 4 × 2 Weeks
Full project lifecycle — goals, key deliverables and milestones per Sprint
Sprint 1
- Stakeholder register & RACI matrix defined
- Feature Store schema designed
- MLflow model registry configured
- Retrain pipeline trigger implemented
- AWS Spot Instances & budget alerts set up
- Bus Factor runbooks started (ADRs)
Sprint 2
- Fraud Analyst labels new fraud cases
- Feature Store updated with labeled data
- Challenger model trained (XGBoost)
- evidentlyAI drift detection integrated
- GDPR data masking pipeline built
- Challenger vs Production comparison logic
Sprint 3
- Challenger vs Production final comparison
- A/B shadow testing on live traffic
- Grafana FPR real-time alerts configured
- QA performance evaluation report
- Bias & fairness audit by Fraud Analyst
- QA sign-off: Precision 0.95 / Recall 0.92
Sprint 4
- Blue-green deployment to Kubernetes
- Grafana dashboard live for all stakeholders
- Incident runbooks written & reviewed
- ONNX model optimised (<50ms latency)
- Final documentation & ADRs completed
- 1-week code freeze before presentation
📋 KANBAN — How We Use It in This Project
Kanban runs inside every Sprint — it visualises and controls the daily task flow so the team always knows what to work on next
👁️ Visualise Every Task
Every item from the Sprint Backlog becomes a card on the Kanban board. Nothing is hidden — if it's not on the board, it's not being worked on. The MLOps Engineer, ML Engineer, Fraud Analyst and QA all update their cards after each Daily Standup.
🚦 WIP Limits Prevent Overload
Max 3 cards "In Progress" at once. This stops the ML Engineer from training three models simultaneously while completing none. When a card is blocked (e.g. GPU quota exceeded), it is flagged red and escalated at the next Daily Standup immediately.
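The WIP rule above amounts to a one-line guard before pulling a new card. The board structure and card names below are illustrative, not the project's actual board.

```python
# Sketch of the "max 3 cards In Progress" WIP-limit rule.
WIP_LIMIT = 3

def can_start(board: dict) -> bool:
    """Allow pulling a new card only while the In Progress column
    is below the WIP limit."""
    return len(board["In Progress"]) < WIP_LIMIT

board = {"To Do": ["drift alert", "QA report"],
         "In Progress": ["train challenger", "GDPR masking"],
         "Review": [], "Done": []}
print(can_start(board))  # True
```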
⚡ Continuous Delivery After Sprint 4
Once the initial 4 Scrum sprints are complete, the fraud detection system switches to pure Kanban mode for ongoing operations. When evidentlyAI detects data drift → the retrain pipeline triggers automatically → the new model flows through the board → deployed without waiting for a sprint boundary.
Backlog: All tasks not yet in a sprint. Fed from the Product Backlog. Prioritised by the PO based on fraud risk impact and business value.
To Do: Sprint-committed items ready to be picked up. All dependencies are met — labeled data is available, access is granted.
In Progress: Actively being worked on. Owner assigned, progress bar tracked. Blocked cards flagged red and escalated at the next Standup.
Review: Built but awaiting verification — code review, model metric check by the QA Engineer, or peer test of the pipeline logic.
Done: Meets the Definition of Done: tested, merged, documented, and either deployed or staged and ready for zero-downtime production deploy.
Certifications & Achievements
Professional certifications and completed courses in MLOps, Cloud Computing, and Software Engineering.
- MLOps Specialization — DeepLearning.AI (Expected: 2025)
- AWS Solutions Architect — Amazon Web Services (Expected: 2025)
- Kubernetes Administrator (CKA) — Cloud Native Computing Foundation (Expected: 2025)
- Docker & Containerization — Docker Inc. (Expected: 2025)
- Python for Data Science — DataCamp / Coursera (Expected: 2025)
- Machine Learning Engineering — DeepLearning.AI (Expected: 2025)
- Terraform & Infrastructure as Code — HashiCorp (Expected: 2025)
- CI/CD & DevOps Fundamentals — GitHub / GitLab (Expected: 2025)