Fraud Detection System
Automated ML retraining pipeline with zero-downtime deployment, real-time monitoring, and GDPR compliance — built with Scrum + Kanban hybrid methodology.
Project Proposal submitted to Mr. Marku
Reference: IT Projects Course Assignment
Use Case: Retrain Fraud Model on New Patterns
Aim: To reduce the time required to update the system with new data, ensuring high accuracy for legitimate users while minimizing downtime.
Overview: This process involves retraining the classifier to identify fresh fraudulent transaction types that the current production model misses. The automation of this process significantly reduces financial losses by cutting down the delay between spotting new fraud patterns and deploying updated models.
Roles & Responsibilities
- Project Manager — Accountable for approving and deploying models to production
- ML Engineer — Responsible for training the Challenger model and comparing it vs Production
- MLOps Engineer — Manages model registration, deployment, performance monitoring, and rollbacks
- Fraud Analyst (Subject Matter Expert) — Reviews and labels fraud cases, and conducts bias/fairness audits
- Tech Lead — Responsible for triggering the retrain & evaluate pipeline
- Product Owner — Accountable for the final performance report and business value
- Data Engineer — Pulls the latest datasets from the Feature Store
- QA Engineer — Generates performance reports and validates model quality
- Automated Pipeline (System) — Orchestrates training, evaluation, and deployment processes
Pre-conditions
- New labeled data: The Fraud Analyst has already tagged recent suspicious transactions as "confirmed fraud"
- Dataset availability: This updated data is ready in the Feature Store
Scenario
- First, the Fraud Analyst finishes reviewing the fraud cases missed yesterday.
- The MLOps Engineer then triggers the "Retrain & Evaluate" pipeline.
- The System pulls the latest dataset from the Feature Store.
- A "Challenger" model is trained by the System to recognize these new patterns.
- The System compares this Challenger model against the active Production model (checking metrics like Recall and False Positive Rate).
- A report is generated showing that the new model detects the fraud without flagging legitimate users.
- Finally, the System registers the Challenger model as a release candidate.
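The promotion decision in steps 5–7 can be sketched as a simple gate. This is an illustrative sketch, not the project's final implementation: the metric names and the 1% FPR cap are assumptions taken from elsewhere in this proposal.

```python
# Hypothetical sketch of the Challenger-vs-Production gate described above.
# Metric names and thresholds are illustrative, not final project values.

def should_promote(challenger: dict, production: dict,
                   max_fpr: float = 0.01) -> bool:
    """Promote the Challenger only if it matches or improves Recall
    without pushing the False Positive Rate above the cap."""
    better_recall = challenger["recall"] >= production["recall"]
    fpr_ok = challenger["false_positive_rate"] <= max_fpr
    return better_recall and fpr_ok

# Example comparison run
challenger = {"recall": 0.96, "false_positive_rate": 0.008}
production = {"recall": 0.91, "false_positive_rate": 0.006}

if should_promote(challenger, production):
    print("Registering Challenger as release candidate")
```

In the real pipeline this gate would run automatically after evaluation, and only a passing Challenger would be registered in the model registry.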
Post-conditions
A new, more accurate model version is registered and staged, ready for zero-downtime deployment. The system maintains high availability while incorporating improved fraud detection capabilities.
Unit Testing Strategy
The unit tests verifying this scenario will be implemented in the next development phase. The full source code, including tests for model comparison and registration logic, will be hosted on GitHub. The repository link will be shared for code review once implementation is finalized.
Planned test coverage includes:
- Data pipeline validation and Feature Store integration
- Model training and evaluation metrics verification
- Challenger vs Production model comparison logic
- Model registry and versioning functionality
- Deployment rollback mechanisms
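As a preview of the planned comparison-logic tests, the sketch below shows the intended shape. The function under test is a stand-in; the real implementation and its test suite will live in the GitHub repository once finalized.

```python
# Illustrative unit tests for the Challenger-vs-Production comparison logic.
# `is_better` is a hypothetical stand-in for the real comparison function.

def is_better(challenger_recall: float, prod_recall: float,
              challenger_fpr: float, fpr_cap: float = 0.01) -> bool:
    return challenger_recall >= prod_recall and challenger_fpr <= fpr_cap

def test_promotes_on_higher_recall_and_low_fpr():
    assert is_better(0.96, 0.91, 0.008)

def test_rejects_on_high_false_positive_rate():
    # Even with better recall, a 2% FPR must block promotion.
    assert not is_better(0.97, 0.91, 0.02)

if __name__ == "__main__":
    test_promotes_on_higher_recall_and_low_fpr()
    test_rejects_on_high_false_positive_rate()
    print("all tests passed")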
Yours sincerely,
Andrii Vlonha
Retrain Pipeline — Flow Diagram
Visual representation of the automated fraud model retraining and deployment pipeline.
Project Kanban Board
Current sprint status for the Fraud Detection System — Retrain Pipeline & Production Rollout.
To Do: 5 · Doing: 3 · Done: 6
Key Elements of IT Project Planning — Applied to Fraud Detection System
Each element below is tailored to the Fraud Detection System project (retrain pipeline & production rollout).
| Element | Owner | Purpose | Category |
|---|---|---|---|
| Identify & Analyze Stakeholders | Project Manager | Map everyone who influences or is impacted by the project to ensure proper engagement and avoid surprises. | Foundation |
| Define Roles, Responsibilities & RACI | PM + Tech Lead | Eliminate confusion — clearly define who owns what to streamline collaboration in fast-paced IT environments. | Foundation |
| Hold Kickoff Meeting | Project Manager | Align team on vision, scope, and processes to kickstart execution. | Launch |
| Define Scope, Budget & Timeline | PM + PO | Set firm boundaries to manage expectations and prevent overruns. | Core |
| Deliverables & Acceptance Criteria | PO + Tech Lead | Make success tangible by specifying outputs and how to verify them. | Core |
| Create Schedule & Milestones | Project Manager | Break down work into actionable steps with timelines. | Execution |
| Plan Resources & Team Capacity | PM + Tech Lead | Ensure availability of resources to avoid bottlenecks. | Execution |
| Risk Assessment & Mitigation | PM + Security | Identify and mitigate threats early to protect project outcomes. | Control |
| Quality & Success Metrics | Tech Lead + QA | Establish benchmarks to ensure the system meets high standards. | Control |
| Communication Plan | Project Manager | Maintain transparency and quick issue resolution. | Control |
RACI Matrix for Fraud Detection System
| Task | Project Manager | ML Engineer | MLOps Engineer | Fraud Analyst | Tech Lead | Product Owner | Data Engineer | QA Engineer |
|---|---|---|---|---|---|---|---|---|
| Review and label fraud cases | I | C | I | R A | C | C | C | I |
| Trigger retrain & evaluate pipeline | C | R | R A | I | C | I | I | C |
| Pull latest dataset from Feature Store | I | R | C | I | I | I | R A | I |
| Train Challenger model | I | R A | C | C | C | I | I | I |
| Compare Challenger vs Production model | C | R | C | C | A | C | I | R |
| Generate performance report | C | C | C | C | C | A | I | R |
| Register Challenger model | I | R | R A | I | C | I | C | I |
| Approve and deploy to production | A | C | R | I | R | C | I | R |
| Monitor production performance | C | C | R A | R | C | C | I | C |
| Handle deployment rollbacks if needed | I | C | R A | I | R | C | I | R |
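A RACI matrix is only useful if each task has exactly one Accountable role and at least one Responsible role. The sketch below is a hypothetical consistency check for rows like the ones above; role names and codes are illustrative.

```python
# Hypothetical sanity check for a RACI row: exactly one Accountable (A),
# at least one Responsible (R). Cell values like "R A" mean both codes.

def validate_raci(row: dict) -> list:
    """Return a list of issues for one task's role assignments."""
    issues = []
    accountable = [role for role, code in row.items() if "A" in code.split()]
    responsible = [role for role, code in row.items() if "R" in code.split()]
    if len(accountable) != 1:
        issues.append(f"expected exactly one Accountable, got {accountable}")
    if not responsible:
        issues.append("no Responsible role assigned")
    return issues

# Example: the "Train Challenger model" row passes the check
row = {"PM": "I", "ML Engineer": "R A", "MLOps Engineer": "C", "QA": "I"}
assert validate_raci(row) == []
```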
Project Priorities (Iron Triangle)
Primary Driver: Quality
In ML Fraud Detection, false negatives mean lost money, and false positives block real users. Scope & Quality are non-negotiable for a passing grade and business value.
Secondary Constraint: Deadline
The project is bound by the university academic calendar. The defense date is fixed, meaning timeline extensions are impossible.
Scope Actions (45%)
- Train Challenger model with 95%+ Precision/Recall.
- Build automated MLflow Retrain Pipeline.
- Implement Zero-Downtime Blue/Green deploy.
Time Actions (35%)
- Strict 4-Sprint lifecycle (2 weeks each).
- Deliver Core Pipeline MVP by Sprint 2.
- Final freeze 1 week before presentation.
Cost Actions (20%)
- Cap AWS/GCP usage at $500/month.
- Use Spot Instances for model training.
- Utilize open-source tools (Grafana, MLflow).
💰 Project Budget Breakdown
Monthly cost allocation for the Fraud Detection System — optimised for academic constraints with a strict $500/month cap.
Compute: GPU instances for model training and inference serving.
- AWS Spot Instances for training (60% savings)
- Auto-shutdown policies for idle servers
- Budget alerts at 80% and 100% thresholds
Storage: Feature Store, model artifacts, and training datasets.
- S3/MinIO for model artifact storage
- PostgreSQL for Feature Store metadata
- Lifecycle policies for old training data
Platform & Tooling: MLflow, Airflow, Kubernetes cluster, and monitoring stack.
- Open-source tools: MLflow, Grafana, Prometheus
- Kubernetes namespace on shared cluster
- Airflow scheduler (small EC2 instance)
Security & Compliance: GDPR data masking, encryption, and audit logging.
- Data masking/hashing pipelines
- Encryption at rest (KMS)
- Audit log retention (90 days)
Contingency: Buffer for unexpected GPU spikes, additional training runs, or scaling needs.
- 6% reserve of total budget
- Approval required from PM before use
- Rolled over if unused in a sprint
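The 80%/100% alert rule above can be expressed as a small check. This is a minimal sketch: the $500 cap comes from this proposal, while the function name and alert wording are illustrative (a real setup would use the cloud provider's native billing alerts).

```python
# Minimal sketch of the 80% / 100% budget-alert rule from the proposal.
BUDGET_CAP = 500.0  # USD per month, per the proposal's cost constraint

def budget_alerts(spend: float, cap: float = BUDGET_CAP) -> list:
    """Return the alerts that the current month-to-date spend triggers."""
    alerts = []
    if spend >= 0.8 * cap:
        alerts.append("WARNING: 80% of monthly budget consumed")
    if spend >= cap:
        alerts.append("CRITICAL: monthly budget cap reached")
    return alerts

print(budget_alerts(410))  # crosses the 80% threshold only
```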
📅 Project Timeline & Schedule
8-week project schedule aligned with 4 Scrum sprints — each activity mapped to its execution window with clear milestones.
Scheduled activities (week-by-week bars are shown in the visual Gantt chart):
- 🏗️ Project Setup & Stakeholders
- ⚙️ MLflow & Pipeline Setup
- 📊 Data Labeling & Feature Store
- 🤖 Train Challenger Model
- 🔍 A/B Testing & QA Evaluation
- 📈 Grafana Monitoring Setup
- 🚀 Blue-Green Deploy to K8s
- 📋 Documentation & Runbooks
- 🧊 Code Freeze & Presentation
Risk Assessment Template
| What are the hazards? | Who might be harmed and how? | What are you already doing to control the risks? | What further action do you need to take to control the risks? | Who needs to carry out the action? | When is the action needed by? | Status |
|---|---|---|---|---|---|---|
| Customer's insolvency (Funding falls through) | Development Team & Agency: Loss of expected revenue, unpaid working hours, and abrupt project cancellation. | We hold regular monthly syncs with the client to assess their business health and project satisfaction. | Require a 30% upfront advance payment before commencing the next project phase; pause execution if invoices are >15 days late. | Project Manager / Finance | Project start | Done |
| Data Drift degrading model accuracy | Business: Missed fraudulent transactions leading to direct financial loss. Users: Increased false positives. | Data Scientists manually evaluate batch transaction data from the previous week to check for statistical deviations. | Implement automated concept drift detection (e.g., Evidently AI) within the MLflow pipeline to trigger auto-retraining alerts. | MLOps Engineer | Sprint 2 | In Progress |
| Cloud Compute (GPU) Budget Overrun | Company Financials: Exceeding the strict $500/month budget reduces overall project profitability. | Basic AWS/GCP billing alerts are configured to trigger emails at 80% and 100% of the budget threshold. | Transition training workloads exclusively to Spot Instances and enforce strict auto-shutdown policies for idle GPU servers. | DevOps Engineer | Sprint 1 | Done |
| Critical Spike in False Positive Rate (>1%) | Legitimate Customers: Payment rejections, account lockouts, and severe UX degradation leading to churn. | Evaluating the Challenger model using standard train/test split metrics on historical static datasets. | Set up Grafana real-time alerts for live FPR metrics and mandate a Shadow A/B testing phase before full traffic routing. | QA / ML Engineer | Sprint 3 | In Progress |
| Production Downtime during deployment | E-commerce Platforms & Users: Unable to process real-time checkouts during the API outage window. | Manual deployments are scheduled exclusively during low-traffic night hours (3:00 AM) with manual rollback plans. | Architect and test a Kubernetes-based Blue-Green deployment strategy ensuring 100% zero-downtime updates. | Tech Lead / DevOps | Sprint 4 | Future |
| GDPR / PII Privacy Violation | Company: Heavy regulatory fines, legal action, and massive reputational damage. | Raw transaction data access is restricted exclusively to authorized senior Database Administrators. | Implement automated data masking and hashing pipelines in the Feature Store before data reaches the ML training environment. | Data Engineer / SecOps | Sprint 2 | In Progress |
| API Inference Latency >50ms | End-users: Frustratingly slow checkout process leading to cart abandonment and lower conversion rates. | Utilizing a simplified baseline model architecture (e.g., XGBoost) to keep prediction times naturally low. | Optimize the final serialized deep learning model using ONNX Runtime or TensorRT to guarantee sub-50ms execution. | ML Engineer | Sprint 4 | Future |
| Unexpected departure of Key Team Member | Project Timeline & Team: Severe delays in pipeline delivery and loss of critical domain/architectural knowledge. | Conducting daily stand-up meetings to share current context, tasks, and immediate blockages across the team. | Enforce a strict "Bus Factor" policy by requiring detailed Runbooks, Architectural Decision Records (ADRs), and mandatory code reviews. | Project Manager | Sprint 1 | Done |
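The drift-detection mitigation planned for Sprint 2 can be illustrated with a Population Stability Index (PSI) over binned feature distributions. This is a hedged sketch of the underlying idea only; the real pipeline would use a library such as Evidently AI, and the bin proportions and 0.2 threshold below are made-up illustrative values.

```python
# Sketch of a data-drift check via Population Stability Index (PSI).
# Inputs are bin proportions of a feature (e.g., transaction amount)
# for a baseline window vs. the current window.
import math

def psi(expected: list, actual: list) -> float:
    """PSI between two binned distributions (lists of bin proportions)."""
    eps = 1e-6  # avoid log(0) on empty bins
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]  # last week's distribution, binned
current = [0.10, 0.20, 0.30, 0.40]   # this week's distribution

score = psi(baseline, current)
if score > 0.2:  # common rule of thumb: PSI > 0.2 suggests drift
    print(f"Drift detected (PSI={score:.3f}) — trigger retrain alert")
```

A production version would run this per feature on a schedule and feed the alert into the MLflow retrain trigger described in the scenario.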
What the IT Project Should Produce — Expected Outcomes
Concrete deliverables and measurable results the Fraud Detection System project will produce upon successful completion.
Automated Retraining Pipeline
A fully automated CI/CD pipeline (Airflow + MLflow) that retrains the fraud classifier on new labeled data, evaluates the Challenger model, and registers it — with zero manual intervention.
⚡ 50% faster model updates
Model Versioning & Registry
Every trained model is versioned and tracked in MLflow with metadata (metrics, parameters, artifacts). Rollback to any previous version is possible within minutes.
🔒 Full audit trail
Real-time Monitoring Dashboard
Grafana dashboard tracking live model performance: Recall, False Positive Rate, prediction latency, and data drift alerts — visible to all stakeholders.
🎯 99.9% uptime SLA
Zero-Downtime Deployment
Blue-green or canary deployment strategy on Kubernetes ensures the production system stays online during model updates. Automatic rollback triggered if FPR degrades.
⏱️ <50ms prediction latency
Governance & Documentation
Complete project documentation: RACI matrix, risk register, communication plan, incident runbooks, GDPR compliance checklist, and unit-tested source code on GitHub.
✅ GDPR compliant
Improved Fraud Detection Accuracy
The new model detects 95%+ of fraud cases including previously missed patterns, while keeping False Positive Rate below 1% — protecting legitimate users from being blocked.
📈 95% accuracy achieved
Agile Hybrid (Scrum + Kanban)
How we combine Scrum's structure for academic milestones with Kanban's flow for ML model training.
🔄 Scrum Lifecycle
Scrum gives the team a structured, time-boxed approach to build and evaluate the fraud model in predictable increments.
Product Backlog
All desired features and tasks for the fraud system, prioritised by the Product Owner based on risk impact and technical dependency.
Sprint Planning
Scrum Master + PO + Dev Team commit to backlog items for the next 2 weeks. The Sprint Goal is defined.
Sprint Backlog (Kanban)
The committed subset of tasks for this Sprint. Each item becomes a card on the Kanban board with an assigned owner and progress tracker.
Sprint Execution & Daily Standup (2 weeks)
The team executes. A 15-minute Daily Standup runs every morning to surface blockers before they stall progress.
Potentially Shippable Increment
At the end of each Sprint, the team delivers a working, tested increment that meets the Definition of Done.
Sprint Ceremonies
Structured check-ins to ensure alignment, demonstrate progress, and continuously improve processes.
Sprint Review
Last day of Sprint · ~2 hrs
The team demos the working increment to stakeholders. The backlog is updated based on feedback.
- Live demo of Challenger model metrics
- Showcase automated Retrain Pipeline
Sprint Retrospective
After Review · ~1 hr
The team reflects on process, tools, and relationships. Actionable improvements are identified for the next sprint.
Kanban Board & Principles
How we manage daily task flow, visualize bottlenecks, and limit Work In Progress (WIP).
Visualise the Workflow
Every task (Data Labeling, Model Training, API Deploy) is tracked on a shared board so anyone can instantly see the state of the system.
Limit Work In Progress (WIP)
Strict limits on active tasks prevent context switching. For example, WIP is capped at 3 in the "In Progress" column. If full, team swarms to finish tasks before starting new ones.
Manage Flow
By monitoring how long cards stay in a column (Lead Time / Cycle Time), we identify bottlenecks (e.g., waiting for QA) and adjust resources accordingly.
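The flow metrics above (Lead Time / Cycle Time) are simple to compute from card timestamps. The sketch below uses made-up card names and dates; a real board tool like Jira reports these metrics natively.

```python
# Illustrative cycle-time calculation for Kanban flow monitoring.
# Card names and dates are invented for the example.
from datetime import date

cards = [
    {"name": "Data Labeling", "started": date(2025, 3, 3), "done": date(2025, 3, 5)},
    {"name": "Model Training", "started": date(2025, 3, 4), "done": date(2025, 3, 10)},
]

# Cycle time = days a card spent between "In Progress" and "Done"
cycle_times = [(c["done"] - c["started"]).days for c in cards]
avg = sum(cycle_times) / len(cycle_times)
print(f"Average cycle time: {avg:.1f} days")  # long outliers flag bottlenecks
```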
Live Board Columns (Jira):
- Backlog: Prioritised User Stories, tech debt, and bugs ready to be pulled into a Sprint.
- To Do: Tasks selected for the current Sprint, not yet started. Clear acceptance criteria defined.
- In Progress: Actively being worked on by an assigned owner. Blocked cards flagged red for Standup.
- Review: Awaiting verification — code review, test passing, or metric check by QA.
- Done: Meets Definition of Done. Either deployed or ready for zero-downtime release.
Delegation Worksheet
A preparation tool for project leaders and managers to set people up for success.
I am assigning: Olexiy (MLOps / Tech Lead)
the responsibility of: Developing and configuring the Real-time Monitoring Dashboard in Grafana for the Fraud Model.
Begin at the end: What outcomes are you looking for? What would success look like? How will you make the implicit explicit?
Success is a fully functional Grafana dashboard tracking real-time model metrics (Recall, False Positive Rate, Data Drift). The team receives timely Slack alerts in case of model degradation without false positives.
Why is this task important? Why X [name of staff person]? Why this? Why now?
Why important: In ML Fraud Detection, false negatives equal financial losses. Quality is our Primary Driver.
Why Olexiy: He has the deepest expertise in building ML monitoring and infrastructure.
Why now: This is a critical prerequisite (blocker) to deploying the Automated Retraining Pipeline.
When does it need to be completed by? What are benchmarks along the way?
Completed by: End of the current sprint (Friday, 15:00).
Benchmarks:
- Tuesday: Data sources and logging connected (Prometheus/MLflow).
- Thursday: Alerting rules and Data Drift simulations tested.
Where else can they go for resources, examples, or advice?
- Documentation: The "Governance & Documentation" section.
- Access: Request cluster permissions from the DevOps Engineer.
- References: Existing Grafana dashboard templates in the corporate knowledge base.
Who else should be involved? The MOCHA for this task is:
- Manager: Me (Project Manager)
- Owner: Olexiy (MLOps Engineer)
- Consulted: Data Scientist (for setting Data Drift thresholds)
- Helper(s): DevOps (deployment support and access)
- Approver: Product Owner (accepting the result)
Are any specific approaches (mindsets, values, etc.) needed for this assignment? Remember to distinguish requirements from preferences or traditions.
Requirements: Configurations saved as Infrastructure as Code (IaC); all releases via Zero-Downtime Deployment.
Mindset: Focus on Reliability (alert reliability and speed are more important than visual aesthetics).
How will you seek their perspective and adapt to input?
I will ask: "What specific metrics can we add to better track this exact type of fraud based on your experience?" I'm open to replacing default approaches if he suggests a more effective solution.
How will you make sure you and your staff member are aligned on key points and next steps?
[✔] Verbal repeat-back (I will ask him to briefly describe how we will simulate Data Drift to test the alerts)
What specific products or activities will you want to review or see in action to monitor progress?
When and how will you debrief how things went? What questions will you ask? What feedback will you seek or offer?
When/How: At the next 1-on-1 after task delivery, or during the Sprint Retrospective.
Debriefing questions:
1. What was the biggest challenge in setting up Data Drift monitoring?
2. Did you receive prompt support from DevOps during configuration?
3. What should we change or automate for similar monitoring in our next ML models?
Given the difficulty and importance of the task and my staff member’s will and skill for this project or assignment, my approach should generally be:
[✔] In the mix (Despite high skills, the critical business value and strict deadline dictate regular checkpoints to avoid quality risks)
🧪 Software Testing Strategy
V-Model + Non-Functional + ML-Specific disciplines applied to the Fraud Detection System — 7 categories radially distributed from a central hub, 22 test types total.
Certifications & Achievements
Professional certifications and completed courses in MLOps, Cloud Computing, and Software Engineering.
- MLOps Specialization — DeepLearning.AI (Expected: 2025)
- AWS Solutions Architect — Amazon Web Services (Expected: 2025)
- Kubernetes Administrator (CKA) — Cloud Native Computing Foundation (Expected: 2025)
- Docker & Containerization — Docker Inc. (Expected: 2025)
- Python for Data Science — DataCamp / Coursera (Expected: 2025)
- Machine Learning Engineering — DeepLearning.AI (Expected: 2025)
- Terraform & Infrastructure as Code — HashiCorp (Expected: 2025)
- CI/CD & DevOps Fundamentals — GitHub / GitLab (Expected: 2025)