How to Scale AI Application Inference 100x ft. Fireworks’ Lin Qiao
Inference is a three-dimensional optimization problem across quality, speed, and cost, and solving it can unlock 10-100x improvements in AI app economics.
Post methodology: Dust custom assistant @AIAscentEssay using Claude 3.7 with the system prompt: "Please take the supplied transcript text and write a substack-style post about the key themes in the talk at AI Ascent 2025. Style notes: After the initial mention of a person, use their first name for all subsequent mentions; Do not use a first person POV in the posts." Light editing and formatting for the Substack platform.
Lin Qiao of Fireworks AI talked about the future state of inference, highlighting the critical challenges and opportunities that lie ahead for AI application developers. Her perspective offers a fresh lens on how the next generation of AI products will need to bridge the gap between model capabilities and real-world application needs.
The Alignment Challenge
Lin frames the development of AI applications as a massive alignment process. From ideation to scaling, developers must align their products along two dimensions:
Aligning product design with desired user behaviors
Aligning AI models with product requirements
While the first type of alignment has been extensively studied with established tools for product analytics and user behavior tracking, the second type represents uncharted territory. Most developers still rely on off-the-shelf models with minimal customization. "How to infuse your product knowledge into your model is a new area that most people don't know how to do," Lin explains, noting that prompt engineering is currently the predominant approach to model steering.
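To make that second kind of alignment concrete, here is a minimal sketch of the prompt-engineering baseline Lin describes: product knowledge injected into the system prompt rather than trained into the model's weights. The endpoint URL, model id, and product details below are illustrative assumptions, not specifics from the talk.

```python
# A minimal sketch of prompt-based model steering: product knowledge lives
# in the system prompt, not in the weights. Endpoint, model id, and the
# knowledge snippet are all hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

PRODUCT_KNOWLEDGE = """\
You are the support assistant for AcmeShop (a hypothetical product).
- Refunds are allowed within 30 days of purchase.
- Orders ship within 2 business days.
Answer only from this knowledge; escalate anything else to a human.
"""

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # illustrative model id
    messages=[
        {"role": "system", "content": PRODUCT_KNOWLEDGE},
        {"role": "user", "content": "Can I return an item I bought last week?"},
    ],
)
print(response.choices[0].message.content)
```

The limitation is visible in the sketch itself: every request re-sends the same knowledge as tokens, which is exactly the kind of overhead that post-training methods aim to fold into the model.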
The Three-Dimensional Optimization Problem
According to Lin, the future of inference involves scaling across three critical dimensions:
Quality: Ensuring model outputs meet application-specific standards
Speed: Delivering responses quickly enough for practical use
Cost/Concurrency: Supporting multiple users simultaneously at a sustainable price point
This creates what Lin describes as a multidimensional optimization problem. The challenge is that businesses typically want "OpenAI quality, lightspeed, and high concurrency as if they are running fraud detection" – essentially demanding excellence across all three dimensions simultaneously.
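One way to see the trade-off is to treat each deployment option as a point in (quality, speed, cost) space and select under application-specific constraints. The sketch below uses invented candidate numbers purely for illustration; it is not Fireworks' selection logic.

```python
# A sketch of the three-way trade-off: pick the cheapest configuration that
# still meets the application's quality and latency floors. All numbers
# are invented for illustration.
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    quality: float     # task eval score, 0-1 (hypothetical)
    latency_ms: float  # p95 response latency (hypothetical)
    cost: float        # $ per 1M tokens (hypothetical)

candidates = [
    Config("large-fp16",       0.93, 900, 8.00),
    Config("large-fp8",        0.92, 550, 4.50),
    Config("small-fine-tuned", 0.90, 120, 0.60),
    Config("small-base",       0.81, 110, 0.40),
]

def pick(candidates, min_quality, max_latency_ms):
    feasible = [c for c in candidates
                if c.quality >= min_quality and c.latency_ms <= max_latency_ms]
    return min(feasible, key=lambda c: c.cost) if feasible else None

# A latency-sensitive app that can tolerate a small quality drop:
print(pick(candidates, min_quality=0.88, max_latency_ms=300))
# -> small-fine-tuned: tuning closed the quality gap at a fraction of the cost
```

The point of the toy example is that "best" has no single answer: move the quality floor or the latency ceiling and a different configuration wins, which is why the same model can be deployed very differently for different products.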
The Cost Iceberg
Lin uses a powerful metaphor to describe the current state of inference economics: "We are seeing an iceberg situation where the water line of inference cost is very high right now." The goal, she explains, is to drive down these costs by 10-100x. When that happens, the range of applications that can achieve product-market fit and scale into sustainable businesses will expand dramatically.
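A back-of-envelope calculation shows why the waterline matters. The usage, pricing, and revenue figures below are hypothetical, chosen only to illustrate how a 10-100x cost reduction can flip a product from underwater to viable.

```python
# Back-of-envelope unit economics behind the "iceberg": how a 10-100x drop
# in inference cost changes per-user margins. All inputs are hypothetical.
tokens_per_user_per_day = 200_000     # assumed heavy, agent-style usage
price_per_1m_tokens = 5.00            # assumed baseline $/1M tokens
revenue_per_user_per_month = 20.00    # assumed subscription price

for reduction in (1, 10, 100):
    monthly_tokens = tokens_per_user_per_day * 30
    cost = monthly_tokens * (price_per_1m_tokens / reduction) / 1_000_000
    status = "viable" if cost < revenue_per_user_per_month else "underwater"
    print(f"{reduction:>3}x cheaper: ${cost:6.2f}/user/month -> {status}")
# ->   1x cheaper: $ 30.00/user/month -> underwater
# ->  10x cheaper: $  3.00/user/month -> viable
# -> 100x cheaper: $  0.30/user/month -> viable
```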
The Path Forward: Co-Optimization
The solution, according to Lin, lies not in looking at inference in isolation but in combining post-training optimization with inference optimization. This approach opens up a vast space of possibilities:
Predicting multiple tokens at once rather than one at a time
Aligning numeric precision with application data distributions
Matching hardware selection to specific workloads
Sharding models based on application needs
Implementing cross-host distributed inference
Selecting optimized kernels for specific applications
Applying various quality tuning mechanisms
The combinatorial explosion of these options yields more than 100,000 possible configurations, a daunting challenge that Lin says Fireworks AI is tackling head-on.
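The explosion is easy to reproduce: even a modest handful of choices per lever multiplies past the 100,000 mark. The specific levers below mirror the list above, but the option counts and names are illustrative assumptions, not Fireworks' actual search space.

```python
# A sketch of why co-optimization explodes combinatorially: a few options
# per lever multiply into six-figure configuration counts. Option names
# and counts are illustrative assumptions.
from itertools import product
from math import prod

levers = {
    "speculative decoding": ["off", "draft-1", "draft-2", "draft-4", "draft-8", "multi-head"],
    "numeric precision":    ["fp16", "fp8", "int8", "int4"],
    "hardware":             ["H100", "A100", "L40S", "MI300X", "TPU", "other"],
    "tensor parallelism":   [1, 2, 4, 8, 16],
    "cross-host pipeline":  [1, 2, 4, 8],
    "attention kernel":     [f"kernel-{i}" for i in range(8)],
    "quality tuning":       ["none", "SFT", "DPO", "RLHF", "distillation"],
}

total = prod(len(options) for options in levers.values())
print(f"{total:,} candidate configurations")  # 115,200 with these counts

# Exhaustive benchmarking is hopeless at this scale; the value of a serving
# platform is pruning the space using the application's own production data.
for i, combo in enumerate(product(*levers.values())):
    if i >= 3:  # show just the first few points in the space
        break
    print(dict(zip(levers.keys(), combo)))
```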
Real-World Impact
The results of proper inference optimization can be dramatic. Lin shared examples of customers who have rapidly scaled their AI features after finding the right optimization strategy:
A food chain company scaled its AI app from a single shop to a thousand shops in just three months
A software development company expanded its AI feature from 100,000 developers to 25 million developers in three months
These success stories highlight the transformative potential of well-optimized inference systems tailored to specific application needs.
The Future Lies in Customization
The overarching message of Lin's talk is clear: the future of inference is not one-size-fits-all but rather heavy customization for specific applications. By incorporating production data to drive optimizations and finding the right balance between quality, speed, and cost, developers can unlock new possibilities for AI applications at scale.
As AI continues to mature as a technology, the focus is shifting from raw model capabilities to the nuanced art of aligning those capabilities with real-world application needs. For developers looking to build sustainable AI-powered businesses, mastering this alignment will be the key to success.