AI Infrastructure Engineer (Lab & Benchmarks): Onsite, Nashua, NH

Overview

This is a full-time, onsite position based in Nashua, NH (3–5 days per week). In this role, you’ll work with our lab team to build and scale the AI development, test, and validation environment. Responsibilities include rack-and-stack, high-speed networking, optical interconnects, bare-metal bring-up, and reproducible LLM/GPU benchmarking to support Engineering, QA, and Marketing. This is a hands-on, fast-paced startup role where you’ll take on multiple responsibilities, collaborate closely across teams, and contribute directly to advancing our AI infrastructure.

Key Responsibilities

Lab operations: Assist with racking servers/GPUs/switches/PDUs; neat LC/MPO fiber work; labeling, inventory, ESD/safety; maintain as-built diagrams and runbooks.
Platform bring-up: Partner on BIOS/BMC configuration; firmware updates (GPU/NIC/storage); NUMA tuning; validate PCIe topology and links.
OS provisioning & automation: Contribute to automated bare-metal installs (PXE/iPXE, kickstart, cloud-init) and improve Ansible playbooks.
VM provisioning & orchestration: Support creation, management, and benchmarking of virtualized environments (OpenStack, VMware, KVM); validate GPU passthrough, SR-IOV, and network/storage integration.
AI stack setup: Install/verify CUDA/NCCL (and/or ROCm), PyTorch, vLLM with Ray; optional Slurm/Kubernetes for multi-node runs.
Benchmarking & validation: Co-design and run repeatable tests (throughput, tokens/s, latency, utilization, power/thermals) on single- and multi-GPU nodes; track and visualize results.
Troubleshooting: With the team, triage performance issues across GPU/PCIe/NVLink/RoCE/Ethernet, GPUDirect Storage, and CPU/NUMA; document fixes and prevention steps.
Reporting & enablement: Produce concise reports and reproducible configs for Dev/QA/Marketing (methods, graphs, exact steps); help prep demo/POC images and scripts.

Required Qualifications

3–7+ years in lab/validation/SRE/solutions roles supporting Linux on bare metal.
Hands-on with NVIDIA GPUs (driver, CUDA, NCCL) and PyTorch; able to stand up an LLM serving stack with team guidance.
Solid Linux administration and networking fundamentals (VLANs, LACP, MTU/jumbo frames, routing basics).
Automation with Bash and/or Python; config management with Ansible.
Familiar with BMC/IPMI/Redfish, firmware lifecycles, and disciplined documentation.
Able to lift/move ~40–50 lbs; onsite lab presence required.

Preferred Skills

Fabrics: 100–400G Ethernet/InfiniBand (ConnectX-6/7), RoCE/IB, SR-IOV; iperf experience.
Storage: GPUDirect Storage, NVMe/NVMe-oF, fio profiling, latency tuning.
Optics/Photonics: MPO/LC cleaning/inspection, basic optical power checks; exposure to co-packaged optics or photonic links is a plus.
PCIe & switches: Practical experience with PCIe topology and hot-plug; familiarity with PCIe switches is a bonus.
Scheduling/containers: Slurm, Kubernetes GPU operators, Helm.
Observability: Prometheus/Grafana, basic perf tracing.
Other: AMD GPUs/ROCm, NVIDIA MIG/MPS; CI for benchmarks.
