Professional Summary
Senior AI Engineer specializing in the architecture and deployment of production-grade Agentic AI systems. I combine hands-on development of multi-agent RAG workflows (LangGraph, Python) with deep expertise in optimizing distributed inference infrastructure (H100/H200, vLLM). Proven track record of building self-correcting retrieval systems, implementing multimodal ingestion pipelines (OCR/Vision), and architecting scalable orchestration layers (Ray, LeaderWorkerSet) for Tier-1 AI workloads.
Professional Experience
AI/ML Cloud Engineer
Leveraging deep systems knowledge to build and scale next-generation AI applications for the world's leading GPU cloud.
- Architected Agentic RAG Platform (LangGraph): Designed and deployed a modular agentic retrieval system to automate technical solution discovery, replacing monolithic code with LangGraph workflows that support conditional routing, self-correction, and state management.
- Multimodal Data Integration: Built ingestion pipelines using OCR and vision models to extract text from technical diagrams and screenshots, surfacing previously inaccessible visual data into the LLM's context window.
- Advanced Reasoning: Implemented Adaptive Multi-Query workflows with LLM-driven gap analysis, enabling the system to detect missing information, generate targeted follow-up queries, and merge results from vector stores and knowledge graphs (Dgraph).
- Hybrid Retrieval Pipeline: Built a retrieval engine combining semantic search (vector) and entity relationships (graph), using document expansion algorithms to improve context relevance fivefold.
- High-Performance Inference Optimization: Engineered a high-throughput inference engine using vLLM and SGLang, achieving a 19x throughput increase over standard baselines. Deployed custom H200 remote reranking APIs with optimized KV-cache management and continuous batching.
- Distributed Workload Orchestration: Architected scalable compute clusters using Ray and Kubernetes LeaderWorkerSet (LWS) to manage distributed training and inference lifecycles. Resolved critical multi-node bottlenecks by tuning NCCL topology awareness and Slurm scheduling parameters for massive-scale GPU fleets.
- Distributed Model Caching Architecture: Designed a transparent caching proxy for the Hugging Face Hub using Olah and JuiceFS backed by S3. Enabled block-level caching and LRU eviction for partial model downloads, significantly reducing startup latency and egress costs.
- DevPod Solution Architecture: Designed and deployed a scalable Golden Path development environment on Kubernetes, enabling researchers to spin up consistent, GPU-accelerated workspaces instantly. Authored the public reference architecture driving enterprise adoption.
Senior Linux Programmer / Administrator
- Data Pipeline Engineering: Engineered real-time data integration pipelines using Python and Bash, ensuring data accuracy across a distributed enterprise Linux environment (2,000+ endpoints).
- Infrastructure Reliability: Managed the stability of critical Point-of-Sale (POS) infrastructure, implementing automated remediation scripts that significantly reduced operational downtime.
- API Integration: Developed custom API integrations to streamline workflows and architected standardized VMware vSphere deployment processes.
Senior Linux Developer / Administrator
- Multi-Cloud Architecture: Managed complex infrastructure spanning AWS, Azure, and Google Cloud, migrating legacy on-prem workloads to hybrid cloud environments.
- Kubernetes Operations: Designed and managed high-availability Kubernetes clusters, implementing full-stack observability with Prometheus and Grafana before such tooling became the industry standard.
- Full-Stack Development: Developed and maintained performant web applications (PHP/Perl/JS), optimizing code for scalability under high traffic loads.