Case Study

From GCP to a Production AWS HPC Platform in 10 Weeks

A chip-design company moved its compute-intensive EDA platform from GCP to AWS, with the full migration delivered in ten weeks on AWS ParallelCluster with FSx for high-performance storage. The environment is defined as infrastructure as code (IaC), and the engagement qualified for AWS Migration Acceleration Program (MAP) funding.

Semiconductor & EDA
DevOps, Modernization, Migrations
10-week migration; full engagement covered assessment through post-cutover support

The Situation

HPC on GCP, and a Roadmap That Needed AWS

Chip design is compute-intensive. Runs of Cadence tools such as Innovus through Slurm-orchestrated parallel clusters can spike to hundreds of cores on demand, sit idle for hours between jobs, and resume at full throttle when a designer submits the next job. The cost curve for that workload shape matters more than almost anything else in the infrastructure budget.

This customer had built that workload on Google Cloud Platform. It ran. But as their roadmap grew, AWS-native HPC services became more attractive: AWS ParallelCluster, FSx for high-performance storage, EC2 c7a compute, and MAP funding to offset the transition. They wanted to know what a clean GCP-to-AWS migration would actually cost, how it would perform, and how long it would take to execute.

They wanted the numbers and a working landing zone.

The Assessment

What We Found in the First Three Weeks

We started with a fixed-scope, three-week AWS Migration Acceleration Program (MAP) assessment: migration readiness, Well-Architected review, and a cost analysis with the right target architecture for the workload shape.

The assessment produced three things the customer’s leadership actually needed:

  • A Migration Readiness Assessment naming the workloads in scope and the sequencing risk, account by account.
  • A Well-Architected findings report across security, reliability, performance, cost, operational excellence, and sustainability.
  • A cost model comparing GCP run-rate to a modeled AWS run-rate with specific instance families (c7a.large compute), Spot-instance headroom, FSx + EFS + S3 storage sizing, and networking costs priced in (NAT, Route 53, VPN).
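The shape of a run-rate model like the one in the last bullet can be sketched in a few lines of Python. This is only an illustration: the class, the function names, and every rate and quantity are hypothetical placeholders, not figures from the engagement or from AWS pricing.

```python
from dataclasses import dataclass


@dataclass
class ComputeProfile:
    """Hypothetical inputs for one month of cluster compute."""
    node_hours_per_month: float  # instance-hours consumed by the job queue
    on_demand_rate: float        # USD per instance-hour (placeholder)
    spot_rate: float             # USD per instance-hour (placeholder)
    spot_fraction: float         # share of hours that tolerate interruption


def monthly_compute_cost(p: ComputeProfile) -> float:
    """Blend Spot and on-demand hours into one monthly compute figure."""
    if not 0.0 <= p.spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    spot_hours = p.node_hours_per_month * p.spot_fraction
    on_demand_hours = p.node_hours_per_month - spot_hours
    return spot_hours * p.spot_rate + on_demand_hours * p.on_demand_rate


def monthly_run_rate(compute: ComputeProfile, fixed_costs: dict[str, float]) -> float:
    """Add storage and networking line items to the blended compute cost."""
    return monthly_compute_cost(compute) + sum(fixed_costs.values())
```

Running the same function once with GCP inputs and once with modeled AWS inputs gives the side-by-side comparison; the Spot fraction is the lever that moves the AWS number most for bursty, restart-tolerant work.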

With the numbers and the plan in hand, they greenlit the next phase.

How We Engaged

From Assessment to Production

The engagement ran as a sequence of fixed-scope phases — assessment, proof of concept, migration, pilot expansion, and post-cutover support — each with its own scope and success criteria. The migration itself executed in ten weeks; the surrounding work shaped the architecture going in and stabilized it after.

Assessment (3 weeks). MAP business case, readiness report, Well-Architected review, cost analysis.

Proof of Concept (2 weeks). A working AWS environment for the HPC workload: Cadence License Server stood up, RES stack configured, ParallelCluster (Slurm) wired in, Innovus and Exostellar Infrastructure Optimizer overlay validated. End-state: a running cluster and a concrete cost breakdown.

Migration (10 weeks). Four sub-phases — Discovery & Planning, Design, Build (Infrastructure as Code), Go-Live & Handoff. The landing zone, the production-ready infrastructure, and the documentation handed over to the customer’s engineering team.

Pilot Expansion (4 weeks). Scaling to production with up to 20 workloads migrated, Terraform and CloudFormation for IaC, ParallelCluster provisioning extended, Slurm database in place.

Post-Cutover Stabilization (6 weeks). The real-world issues that only surface after cutover: Exostellar compatibility with a new software update, Spot-instance migration problems, and ParallelCluster reconfiguration. Each was addressed and the fix folded back into the IaC.

What We Did

AWS-Native HPC, Engineered for the Workload Shape

The target was not a lift-and-shift. It was an AWS-native architecture designed around the specific behavior of EDA workloads: bursty compute, heavy-IO shared storage, license-server dependencies, and engineer access from distributed locations.

Compute via AWS ParallelCluster and Slurm. The scheduler stayed Slurm — familiar to the engineering team — but underneath, ParallelCluster managed EC2 provisioning and autoscaling against the job queue. Jobs pulled c7a.large instances on demand; the cluster scaled down when the queue emptied. Spot instances added cost headroom where job restart tolerance allowed.
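As a rough sketch of how that compute layout might be expressed, the Python below builds a ParallelCluster v3-style cluster definition with one on-demand queue and one Spot queue, both scaling from zero. The key names follow the ParallelCluster v3 configuration schema as we understand it; the function name, queue names, OS choice, and node ceiling are assumptions for illustration and not the customer's actual configuration.

```python
import json


def build_cluster_config(region: str, head_subnet_id: str,
                         compute_subnet_id: str, max_nodes: int) -> dict:
    """Return a ParallelCluster v3-style config as a plain dict (illustrative)."""

    def queue(name: str, capacity_type: str) -> dict:
        # MinCount 0 lets the queue scale to nothing when no jobs are pending.
        return {
            "Name": name,
            "CapacityType": capacity_type,  # "ONDEMAND" or "SPOT"
            "Networking": {"SubnetIds": [compute_subnet_id]},
            "ComputeResources": [{
                "Name": f"{name}-c7a",
                "InstanceType": "c7a.large",
                "MinCount": 0,
                "MaxCount": max_nodes,  # hypothetical ceiling
            }],
        }

    return {
        "Region": region,
        "Image": {"Os": "alinux2"},  # assumption: OS is not stated in the case study
        "HeadNode": {
            "InstanceType": "c7a.large",
            "Networking": {"SubnetId": head_subnet_id},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [queue("ondemand", "ONDEMAND"), queue("spot", "SPOT")],
        },
    }


if __name__ == "__main__":
    # Placeholder region and subnet IDs; substitute real values before use.
    print(json.dumps(build_cluster_config("us-east-1", "subnet-head",
                                          "subnet-compute", 50), indent=2))
```

Splitting Spot and on-demand into separate Slurm queues keeps the restart-tolerance decision with the person submitting the job: work that can be requeued goes to the Spot partition, everything else to on-demand.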

Shared high-performance storage on FSx and EFS. Innovus and other EDA tools lean hard on shared filesystem semantics. We used FSx for OpenZFS where performance and feature parity mattered, and EFS where POSIX semantics and elastic scaling fit better. S3 Standard served as the durable object tier behind both.

RES for engineer desktops. Research and Engineering Studio (RES) on AWS gave distributed designers secure, authenticated access to the Linux environment where the EDA tools ran, with no VPN gymnastics and no per-desktop SSH sprawl.

License servers on EC2 with Secrets Manager. Cadence and related licensing moved to dedicated EC2 instances with Secrets Manager holding the Slurm munge keys and other sensitive material. No credentials in code, no shared SSH keys.
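One way a node could retrieve the munge key at boot is sketched below. It relies only on the Secrets Manager `get_secret_value` call; the function name and secret name are hypothetical, and the handling of text secrets assumes the key was stored base64-encoded, which may not match how this environment stored it.

```python
import base64


def fetch_munge_key(client, secret_id: str) -> bytes:
    """Return the raw munge key bytes stored under secret_id.

    `client` is expected to behave like boto3.client("secretsmanager").
    """
    response = client.get_secret_value(SecretId=secret_id)
    binary = response.get("SecretBinary")
    if binary is not None:
        # Binary secrets come back as bytes.
        return bytes(binary)
    # Assumption: a text secret holds the key base64-encoded.
    return base64.b64decode(response["SecretString"])


# Usage sketch (requires boto3 and AWS credentials; secret name is a placeholder):
#   import boto3
#   key = fetch_munge_key(boto3.client("secretsmanager"), "hpc/slurm/munge-key")
```

Passing the client in, rather than creating it inside the function, keeps the IAM role and region decisions with the caller and makes the function easy to exercise with a stub.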

A custom VPC with proper zoning. Public and private subnets, NAT gateway, Route 53 for DNS, AWS Organizations separating dev, staging, and production. Systems Manager Session Manager replaced SSH for administrative access.

Exostellar Infrastructure Optimizer for cost tuning. The Exostellar overlay handled workload-aware cost optimization on top of the AWS primitives — moving work between Spot and on-demand capacity without the engineering team managing it by hand.

Nerd Talk: Architecture Details

Compute: EC2 c7a.large for head-node and compute nodes. Spot instances for restart-tolerant work. AWS ParallelCluster managing Slurm-driven autoscaling.
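The arithmetic behind queue-driven scaling can be illustrated with a small Python sketch. This is not ParallelCluster's actual scaling logic, only the basic sizing calculation; the function and parameter names are hypothetical, and the default of 2 vCPUs per node reflects the c7a.large instance size.

```python
import math


def nodes_to_request(pending_vcpus: int, running_nodes: int,
                     max_nodes: int, vcpus_per_node: int = 2) -> int:
    """How many extra nodes pending work would need, capped by the cluster limit."""
    if pending_vcpus <= 0:
        return 0
    needed = math.ceil(pending_vcpus / vcpus_per_node)
    headroom = max(max_nodes - running_nodes, 0)
    return min(needed, headroom)
```

A burst that asks for 300 vCPUs on an empty cluster capped at 200 nodes would request 150 nodes under this sketch; once the queue drains, the same calculation returns zero and idle nodes can be released.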

Storage: FSx for OpenZFS as the primary shared filesystem for EDA tool shared libraries and project directories. EFS for elastic shared POSIX storage. S3 Standard as the durable object tier with lifecycle policies for archival.
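A lifecycle policy of the kind mentioned above might look like the following Python sketch, which builds the rule document in the shape accepted by the S3 `put_bucket_lifecycle_configuration` API. The prefix, day counts, rule ID, and choice of the GLACIER storage class are assumptions for illustration; the case study does not state the actual archival thresholds.

```python
from typing import Optional


def archival_lifecycle(prefix: str, days_to_archive: int,
                       days_to_expire: Optional[int] = None) -> dict:
    """Build an S3 lifecycle configuration that archives objects under a prefix."""
    if days_to_expire is not None and days_to_expire <= days_to_archive:
        raise ValueError("expiration must come after the archive transition")
    rule = {
        "ID": f"archive-{prefix.strip('/') or 'all'}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": days_to_archive, "StorageClass": "GLACIER"}],
    }
    if days_to_expire is not None:
        rule["Expiration"] = {"Days": days_to_expire}
    return {"Rules": [rule]}


# Usage sketch (requires boto3; bucket name is a placeholder):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-eda-archive",
#       LifecycleConfiguration=archival_lifecycle("tapeouts/", 90),
#   )
```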

Networking: Custom VPC with public/private subnet split, NAT gateway, VPN connectivity, Route 53 for DNS. Subnet CIDR plans aligned with the target multi-account structure.
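To show what a subnet CIDR plan involves, here is a minimal Python sketch that carves a VPC range into equal public and private blocks, one pair per Availability Zone. The VPC range, prefix length, and naming scheme are hypothetical; the real plan depended on the customer's account structure.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr: str, az_count: int, new_prefix: int) -> dict[str, str]:
    """Allocate one public and one private subnet per AZ from a VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    needed = az_count * 2
    # Take only as many blocks as required; subnets() yields them in address order.
    blocks = list(islice(vpc.subnets(new_prefix=new_prefix), needed))
    if len(blocks) < needed:
        raise ValueError("VPC range is too small for the requested layout")
    plan = {}
    for i in range(az_count):
        plan[f"public-{i}"] = str(blocks[i])
        plan[f"private-{i}"] = str(blocks[az_count + i])
    return plan
```

For example, `plan_subnets("10.0.0.0/16", 2, 20)` assigns `10.0.0.0/20` and `10.0.16.0/20` to the public subnets and `10.0.32.0/20` and `10.0.48.0/20` to the private ones, leaving the rest of the range free for later accounts or environments.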

Identity & access: AWS Organizations with dev/staging/prod accounts. IAM roles for service accounts. Systems Manager Session Manager for admin access (no SSH keys to rotate).

Licensing & secrets: Cadence License Server on EC2. AWS Secrets Manager for Slurm munge keys and other sensitive material.

Observability: CloudWatch metrics and alarms. AWS Config for compliance posture.

Cost optimization overlay: Exostellar Infrastructure Optimizer for Spot / on-demand workload routing.

MAP program tagging: Workloads tagged for Migration Acceleration Program compliance and cost tracking.
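A tagging strategy is only useful if gaps are caught early. The Python sketch below flags resources missing a MAP tag, working on plain dictionaries such as an exported inventory. The tag key `map-migrated` is the one we understand MAP to use, but treat both the key and the value format as assumptions to confirm against the MAP agreement; the resource records and function name are hypothetical.

```python
MAP_TAG_KEY = "map-migrated"  # assumption: confirm the key with your MAP agreement


def missing_map_tag(resources: list[dict], tag_key: str = MAP_TAG_KEY) -> list[str]:
    """Return the IDs of resources whose tags lack a non-empty MAP tag."""
    return [
        r["id"]
        for r in resources
        if not r.get("tags", {}).get(tag_key)
    ]


if __name__ == "__main__":
    inventory = [
        {"id": "i-aaa", "tags": {MAP_TAG_KEY: "placeholder-value"}},
        {"id": "i-bbb", "tags": {"env": "prod"}},
        {"id": "vol-ccc"},
    ]
    print(missing_map_tag(inventory))
```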

IaC: Terraform and CloudFormation for reproducible infrastructure, deployed from version-controlled pipelines. The environment is reconstructable from code — not click-ops.

Projected production run-rate: Modeled at $239K/month ($2.87M/year) at full scale across EC2, FSx, EFS, S3, NAT, Route 53, and networking. The assessment cost model was the basis for the business-case approval to migrate.
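The annual figure follows directly from the monthly one. As a quick check in Python (the helper name is ours; the only input taken from the model is the $239K monthly figure):

```python
def annualize(monthly_usd: int) -> int:
    """Convert a monthly run-rate to a 12-month figure."""
    return monthly_usd * 12


annual = annualize(239_000)
print(f"${annual:,} per year, about ${annual / 1e6:.2f}M")
```

That works out to $2,868,000, which rounds to the $2.87M quoted above.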

The Results

A Production AWS HPC Platform, Operated by the Customer

A production-ready environment, delivered as IaC. The full landing zone, networking, identity, HPC cluster, shared storage, licensing, and observability — all deployed from Terraform and CloudFormation. The customer’s engineering team can reproduce, modify, and extend the environment without re-engaging us.

Up to twenty workloads migrated in pilot expansion, ready for scale-out. The pilot phase moved up to twenty workloads onto the new AWS platform, with the ParallelCluster + Slurm setup validated under real jobs.

Well-Architected-compliant by design. The environment was designed against the six Well-Architected pillars from the assessment forward — not retrofitted under audit pressure later.

MAP-funded. The program qualified for MAP funding with the right tagging strategy in place for cost tracking. Eligibility for AWS funding programs is entirely AWS’s call; this engagement qualified.

Post-cutover issues, addressed. When a third-party optimizer update broke compatibility and Spot-instance migration hit problems, we came back in to update the IaC, reconfigure the cluster, and return the environment to steady state.

The Client

A chip-design company running compute-intensive Electronic Design Automation (EDA) workloads — Cadence, Innovus, Slurm-based parallel clusters — with multi-environment infrastructure and strict data durability requirements.

Details have been anonymized at our client's request. The technical substance of the engagement is preserved.

Key Results
10 weeks: Migration delivered, kickoff to cutover
Up to 20: HPC workloads migrated in pilot expansion
MAP: Migration Acceleration Program qualified
