150 Million Files Moved in One Week: Half the Time the Customer Expected
A voice AI startup had built its infrastructure on Azure — and was hitting the limits. They needed better GPU capacity for their ML models, higher reliability, stronger file-security posture, and a clean migration of 150 million files without disrupting their customers. We designed the AWS architecture and completed the full migration in twelve weeks — including moving every one of those 150 million files inside a single week.
Voice AI on Azure, and a Data Store That Had to Move Without Disrupting the Product
Voice AI doesn’t forgive infrastructure drift. Model inference runs on GPUs around the clock. Customers expect real-time translation quality. And when a rapidly growing AI startup built on Azure realized they needed more GPU firepower for their ML models, better reliability for customer workloads, and a stronger security posture over user voice data, the decision to move to AWS wasn’t controversial.
The execution was. Their platform supported an accent-translation service that lets users shape how their voice sounds: real customer audio, sensitive by nature, 150 million files sitting behind Microsoft SQL Server and PostgreSQL databases. None of it could be tampered with. Almost none of it could be offline.
The customer identified AWS as the right target. They needed partners who could design the new environment, harden the security model, and move every last one of those 150 million files without their customers noticing.
Twelve Weeks, Two SOWs, One Cutover
The work ran as a fixed-scope AWS Migration Acceleration Program (MAP) engagement across two sequential statements of work (SOWs). Eligibility for AWS funding programs is entirely AWS’s call; this engagement qualified.
SOW 1 — Migration and Implementation (10 weeks). A four-phase lifecycle.
- Phase 1 — Discovery & Planning. Inventory the Azure footprint (hundreds of resources across multiple subscriptions), map dependencies, define the AWS target architecture, agree the cutover sequence.
- Phase 2 — Design. AWS landing zone, account strategy, networking, IAM, tagging for MAP compliance, security guardrails, and the specific target patterns for the GPU fleet, Kubernetes workloads, and data stores.
- Phase 3 — Build. Stand up the AWS environment as infrastructure-as-code. Move compute. Migrate databases. Transfer the 150 million files. Validate at every step.
- Phase 4 — Go-Live & Handoff. Cutover, monitoring dialed in, runbooks delivered, environment handed to the customer’s team.
SOW 2 — Web App Development (2 weeks). A parallel workstream covering web application work adjacent to the migration.
Twelve weeks elapsed. The data cutover — the part everyone was most worried about — ran inside one of those weeks.
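The MAP tagging called out in the Design phase lends itself to an automated check. The sketch below is illustrative only: it assumes the `map-migrated` tag key commonly used for MAP 2.0 (confirm the exact key and value format against your own MAP agreement), and the resource IDs and tag values are hypothetical.

```python
from typing import Dict, List

# Assumption: MAP 2.0 commonly uses the "map-migrated" tag key. Confirm the
# exact key and value format against your own MAP agreement.
MAP_TAG_KEY = "map-migrated"


def untagged_resources(resources: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the IDs of resources with a missing or empty MAP tag, sorted."""
    return sorted(
        rid
        for rid, tags in resources.items()
        if not tags.get(MAP_TAG_KEY, "").strip()
    )


if __name__ == "__main__":
    # Hypothetical inventory: resource ID -> tags.
    inventory = {
        "i-0aaa": {"map-migrated": "migEXAMPLE", "team": "ml"},
        "i-0bbb": {"team": "ml"},
        "db-ccc": {"map-migrated": ""},
    }
    print(untagged_resources(inventory))
```

Run against an exported tag inventory, a check like this flags resources that would otherwise fall outside MAP cost tracking.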
An AWS-Native Target for GPU-Heavy AI Workloads
The AWS design was not a lift-and-shift of the Azure shape. It was a redesign around what the customer actually needed: reliable GPU inference, predictable scaling, and a security posture that held up against the company’s own “customer voice files must never be tampered with” requirement.
Terraform as the source of truth. Every piece of the target environment was defined in Terraform — the landing zone, networking, security groups, IAM roles, ECS cluster, databases, and supporting services. The environment is reproducible and auditable, not click-ops.
ECS for application containers. Amazon Elastic Container Service hosted the application containers that front the Microsoft SQL Server and PostgreSQL databases behind the voice platform. The cluster simplified deployments and scaling compared to the AKS-based Azure origin, and integrated natively with AWS Lambda for the serverless compute path.
Lambda for event-driven compute. The Azure Functions running accent translation, audio processing, scoring, and data utilities moved to AWS Lambda. Runtime changed; the event-driven architecture stayed.
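To make the event-driven shape concrete, here is a minimal sketch of a Python Lambda handler. It assumes, purely for illustration, that a function is triggered by an S3 upload notification; the case study does not say which triggers the customer's functions used, and the processing step itself is left as a placeholder.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def extract_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    pairs: List[Tuple[str, str]] = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            # S3 URL-encodes object keys in notifications (spaces arrive as "+").
            pairs.append((bucket, unquote_plus(key)))
    return pairs


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point. The real audio-processing step is omitted."""
    objects = extract_objects(event)
    for bucket, key in objects:
        print(f"would process s3://{bucket}/{key}")  # placeholder only
    return {"processed": len(objects)}
```

Keeping the event parsing separate from the handler makes the function easy to unit test without deploying it.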
Elastic Beanstalk orchestrating file saves. Beanstalk orchestrated the file-save path into Amazon S3 — a production-grade saving pattern designed so customer voice files land safely and consistently.
Databases into a managed AWS pattern. SQL Server and PostgreSQL workloads moved to the AWS managed database family — RDS for SQL Server and Amazon Aurora PostgreSQL for the Postgres side — with encryption at rest, automated backups, and multi-AZ where warranted.
GPU fleet sized for real traffic. The GPU workloads deployed onto Amazon EC2 across instance families matched to the customer’s mix of training and inference: P3 for training-capable workloads and G4 for inference-optimized serving, with memory-optimized R5 instances (which carry no GPU) alongside for memory-bound work.
Security as a first-class requirement. Security groups, network ACLs, IAM roles and policies, AWS Secrets Manager for credentials, and a bastion-server + internet-gateway pattern for controlled access to the private network. Amazon Route 53 for DNS. KMS encryption consistent across data at rest. The customer’s “voice files must never be tampered with” requirement was translated into specific controls, not a wish.
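One way a "never tampered with" requirement can become a checkable control is to compare content digests recorded before and after a move. The case study does not describe how integrity was verified, so the sketch below is an assumption: a plain SHA-256 manifest comparison, with hypothetical function names.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def diff_manifests(
    source: Dict[str, str], target: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare {object key: digest} manifests from source and target stores."""
    return {
        # Present at the source but never arrived.
        "missing": sorted(k for k in source if k not in target),
        # Arrived, but the bytes differ from the source.
        "changed": sorted(
            k for k in source if k in target and source[k] != target[k]
        ),
        # Present at the target with no source counterpart.
        "unexpected": sorted(k for k in target if k not in source),
    }
```

An empty result in all three lists is the signal that every file arrived byte-for-byte intact.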
Reliability through multi-AZ and a CDN. Amazon CloudFront, which serves traffic from AWS edge locations, fronted a platform deployed across multiple Availability Zones. The platform was architected so that if one zone had issues, customers could be rerouted to another without interruption. Amazon CloudWatch monitored compute resources and application performance, with dashboards and alerts to catch issues early.
How We Moved 150 Million Files in a Week Without Breaking the Service
The part of the engagement that kept leadership up at night was the data.
150 million files sat in Azure Blob Storage, integrated with the voice platform’s SQL Server and PostgreSQL databases. Moving them naively, as one enormous transfer job, risked either a multi-week outage or an unusable target environment at the other end. The customer expected this phase to take two weeks. We promised better.
We started with a pilot. A subset of the files — a few million — was transferred using AWS DataSync to validate the approach, confirm throughput, and understand how file-size distribution affected transfer performance. The pilot revealed the right strategy: transfer at the folder level, in smaller subsets, in parallel, rather than one monolithic job.
Four concurrent DataSync agents. We provisioned four AWS DataSync agents, each handling different web-service-backed folder trees, all running concurrently. Each agent ran its own transfer, in its own sequence, with its own validation. The parallelism cut total transfer time significantly without saturating the network path at either end.
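The folder-level split can be sketched as a simple balancing problem. The code below is not the planning logic used on the engagement, which the case study does not describe; it is a greedy heuristic that assumes per-folder file counts are known up front, and the folder names in the usage note are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_folders(folder_counts: Dict[str, int], agents: int = 4) -> List[List[str]]:
    """Spread folders over agents so file counts stay roughly balanced.

    Greedy heuristic: take folders largest-first and hand each one to the
    agent with the smallest running total.
    """
    if agents < 1:
        raise ValueError("agents must be at least 1")
    plan: List[List[str]] = [[] for _ in range(agents)]
    heap = [(0, i) for i in range(agents)]  # (files assigned so far, agent index)
    heapq.heapify(heap)
    ordered = sorted(folder_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for folder, count in ordered:
        total, idx = heapq.heappop(heap)
        plan[idx].append(folder)
        heapq.heappush(heap, (total + count, idx))
    return plan
```

Calling `assign_folders({"audio/a": 50, "audio/b": 40, "audio/c": 30}, agents=4)` returns one folder list per agent, which could then be turned into per-agent transfer jobs. Balancing on file count rather than bytes is a deliberate simplification here, since many-small-file transfers tend to be bound by per-file overhead.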
One-week completion. Every one of the 150 million files landed in Amazon S3 inside one calendar week. That was half the two weeks the customer had planned for, and it happened without a customer-visible disruption to the translation service.
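For a sense of scale, the headline numbers imply a sustained transfer rate. The arithmetic below is a back-of-the-envelope estimate, not a measured figure: it assumes the transfer ran evenly around the clock for seven days and was split evenly across the four agents, and it treats the "10+ TB" storage figure quoted elsewhere in this write-up as a lower bound on the data moved.

```python
FILES = 150_000_000
AGENTS = 4
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800

overall_rate = FILES / SECONDS_PER_WEEK   # files per second, all agents combined
per_agent_rate = overall_rate / AGENTS    # files per second, per agent

# "10+ TB" is a lower bound, so this average is a floor, not a measurement.
STORAGE_BYTES = 10 * 10**12
avg_file_bytes = STORAGE_BYTES / FILES

print(f"{overall_rate:.0f} files/s overall")                         # 248 files/s
print(f"{per_agent_rate:.0f} files/s per agent")                     # 62 files/s
print(f"{avg_file_bytes / 1000:.1f} kB average file size at 10 TB")  # 66.7 kB
```

Roughly 250 small files per second, sustained for a week, is the kind of workload where per-file overhead matters more than raw bandwidth, which is consistent with the decision to parallelize at the folder level.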
Nerd Talk: Architecture & AWS Service Footprint
Compute: EC2 across multiple instance families for general workloads and the GPU fleet — P3 (training), G4 (inference), R5 (memory-bound). Autoscaling groups sized to real traffic rather than provisioned-for-peak Azure patterns.
Containers: Amazon ECS cluster hosting application containers fronting the SQL Server and PostgreSQL databases, integrated with Lambda for serverless compute. Amazon ECR for container images.
Serverless: AWS Lambda handling event-driven functions — accent translation, audio processing, scoring, data utilities — replacing the customer’s Azure Functions footprint. Amazon SQS for queue-based messaging between app components.
Data platform: Amazon RDS for SQL Server and Amazon Aurora PostgreSQL as the managed databases, with encryption at rest, automated backups, and multi-AZ where warranted. Amazon S3 as the durable object tier (10+ TB active storage). AWS Elastic Beanstalk orchestrating file saves into S3.
Migration mechanism: AWS DataSync for the 150M-file Azure-to-AWS transfer. Four concurrent DataSync agents, folder-level subset batching, one-week completion.
Networking & access: VPCs with public/private subnet split, NAT gateways, bastion server for controlled private-network access, Amazon Route 53 for DNS. VPN connectivity back to customer environments where needed.
Security: IAM roles and policies with least-privilege defaults. AWS Secrets Manager for credentials. KMS-based encryption at rest. Security groups and NACLs at the network layer. Application-level security testing during QA.
Reliability: Amazon CloudFront CDN in front of a platform deployed across multiple Availability Zones. Amazon CloudWatch metrics and alarms for compute and application performance.
IaC: Terraform as the source of truth for the full environment.
MAP program: Workloads tagged for Migration Acceleration Program compliance and cost tracking. Eligibility for AWS funding programs is entirely AWS’s call; this engagement qualified.
Azure footprint consolidated into AWS: 776+ Azure resource objects across multiple subscriptions, including VMs, AKS clusters (two — one hosting a Kubeflow stack, one hosting a speaker-certification pipeline), Azure Functions, Azure SQL Server, Azure PostgreSQL, Azure Blob Storage, and Azure Cognitive Services (replaced in the target architecture). Additional GCP compute consolidated into the same AWS environment.
Faster GPUs, Stronger Security, and a Cutover That Didn't Feel Like One
150 million files moved in a single week. The piece of the migration that carried the most risk ran inside one calendar week using AWS DataSync with four concurrent agents. That was half the customer’s own planned timeline, with no visible disruption to the translation service.
GPU capacity matched to the ML workload. AWS’s GPU instance families gave the customer the compute profile their machine-learning models had been starved for on Azure. Inference workloads got the throughput they needed, and the fleet could be sized to actual traffic patterns instead of being provisioned for peak.
Security posture strengthened. Secrets Manager, IAM least-privilege, KMS encryption, security groups, and a bastion-based private-network access model delivered the customer’s “voice files must never be tampered with” requirement as concrete controls, not aspirations.
Reliability built in. A CloudFront CDN in front of a multi-AZ platform, CloudWatch monitoring and alerts, and a multi-AZ database pattern gave the customer the uptime their growing customer base demanded.
A consolidated operating model. One cloud, one set of IAM primitives, one observability stack, one deploy path. The operational tax of running across Azure and GCP disappeared.
The Client
An AI startup offering accent-translation technology that lets users shape how their voice sounds, improves fluency between languages, and reduces communication friction for global teams. Heavy GPU-inference workloads, machine-learning model pipelines, and large voice-file data stores.
Details have been anonymized at our client's request. The technical substance of the engagement is preserved.