Fine-tuning Agents with Reverse-Engineered Training Data
May 17, 2026
Flow generation through natural language: An agentic modeling approach

We fine-tuned Qwen3-32B into a tool-calling agent that generates Flow automations from natural language. It is faster, cheaper, and more accurate than the frontier model it replaced, with a weekly retraining flywheel built on real merchant data.

Engineering at Shopify

If you're building AI products on top of closed models, anyone with an API key can get similar capabilities. Lasting differentiation comes from proprietary data, the training recipe, the infrastructure, and the speed of iteration. Shopify has something most companies don't: a product surface where millions of merchant interactions directly signal whether the model's output is any good. That feedback loop is the foundation, but only if you keep learning from it.

We fine-tuned a tool-calling agent to turn natural language into a Shopify Flow for Sidekick, our AI commerce assistant. It's 2.2x faster, 68% cheaper, and outperforms closed models.

Along the way, we found lessons no paper warned us about. Data preprocessing decisions, from representation design to formatting details, that compound to swing accuracy by double digits. Silent infrastructure failures that degrade your model with zero warnings and take days to trace. Benchmark parity that masks a 35% gap once real users show up. This post covers the problems we faced, how we fixed them, and what to look for if you're doing the same.

[Figure: Flywheel]

Building the training dataset

Shopify Flow is an automation platform where store owners build workflows from triggers, conditions, and actions. For store owners who aren't engineers, building the right workflow from a blank canvas is daunting. Sidekick generates it from plain English.

The cold start problem

Fine-tuning required training data, but since the feature hadn't been deployed yet, there were no production conversations to learn from.
We reverse-engineered user intent from existing production workflows. Thousands of anonymized store owners had already built workflows manually in Flow. We sampled those and filtered for quality: workflows that had run at least once in the last seven days, from merchants with two or more qualifying workflows, with one example per descriptor to ensure diversity across workflow types.

With a set of validated workflows, we worked backwards:

1. Sample a workflow. Pick a popular, validated workflow from production.
2. Generate a user query. Use a stronger LLM to produce a plausible natural-language request that would lead to this workflow.
3. Construct the tool trajectory. Build the full multi-turn sequence of tool calls that an ideal agent would execute to arrive at this workflow. This was the bulk of the engineering effort.

We fine-tuned Qwen3-32B on this synthetic dataset and evaluated it against a benchmark of 300 hand-crafted examples covering the breadth of expected Flow usage. An LLM evaluation framework compares the generated workflow against the expected one for semantic correctness, and validates syntactic correctness programmatically. We looked at three metrics:

- Semantic correctness: Does the generated workflow do what it's supposed to? An LLM judge compares the output against the expected workflow.
- Syntactic correctness: Are there errors that would cause it to fail? Malformed conditions, incorrect references, invalid configurations. Checked programmatically.
- Latency: Time from request to workflow delivery.

If you're building an agent without interaction data, start with the output artifacts your users already produce and work backwards from them. That is often the right first step before your metrics have caught up.

As shown in the table above, there is still a meaningful gap to close. Our second lesson, which we discuss below, is that teaching the model to generate Flows in Python can help narrow that gap further.
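To make the quality filter from this section concrete, here is a minimal sketch of the three sampling rules (ran in the last seven days, merchant has two or more qualifying workflows, one example per descriptor). The `Workflow` record, its field names, and the "first seen wins" tie-break are assumptions made for illustration; the post does not describe the actual schema or how the single example per descriptor was chosen.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Workflow:
    # Hypothetical record; every field name here is an assumption.
    workflow_id: str
    merchant_id: str
    descriptor: str        # coarse label for the workflow type
    last_run_at: datetime  # most recent execution


def select_training_workflows(workflows, now, window_days=7, min_per_merchant=2):
    """Apply the three quality filters in order and return the survivors."""
    cutoff = now - timedelta(days=window_days)

    # 1. Keep workflows that ran at least once inside the window.
    recent = [w for w in workflows if w.last_run_at >= cutoff]

    # 2. Keep merchants with at least `min_per_merchant` qualifying workflows.
    by_merchant = defaultdict(list)
    for w in recent:
        by_merchant[w.merchant_id].append(w)
    qualifying = [
        w
        for group in by_merchant.values()
        if len(group) >= min_per_merchant
        for w in group
    ]

    # 3. Keep one example per descriptor for diversity.
    #    Assumption: the first workflow seen for a descriptor wins.
    chosen = {}
    for w in qualifying:
        chosen.setdefault(w.descriptor, w)
    return list(chosen.values())
```

A real pipeline would likely rank candidates (for example by popularity) before the last step rather than take the first one seen; the sketch only shows how the three filters compose.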
Training in-distribution: the Python DSL

Shopify Flow workflows are represented internally in a JSON-based domain-specific language (DSL) designed for backend parsing, validation, and execution. That format is ideal for production systems, but it's a poor fit for LLMs. Conditional, program-like logic that would normally appear as code is embedded in deeply nested JSON, a pattern that's rare in pretraining data.

Rather than forcing the model to learn Flow's native format from scratch, we reformulated the task in a representation closer to the model's training distribution. Workflows are programs, so we taught the model to write them as Python. A transpiler converts the JSON DSL into semantically equivalent Python:

Same workflow, same semantics, but the model now generates Python instead of a data format. Python is far closer to code and logical reasoning, and it makes up a large share of pretraining data. The fine-tuned model draws on familiar patterns: decorators, if/else logic, variables, for loops, and function calls.

With the same training data, switching from the JSON DSL to the Python DSL improved syntactic correctness by 22 points and semantic correctness by 13 points. Moving the target format from out-of-distribution to in-distribution turned the problem from "learn a new language and the task" into "learn the task."

Making this work required building a round-trip transpiler between Python and Flow's JSON representation to handle the full complexity of Flow logic without losing meaning in either direction. We backed its reliability with extensive tests. We round-trip tested every workflow merchants created through Sidekick in production: converting from JSON to Python and back to JSON, then verifying the output matched the original exactly. Any mismatch was caught before it could reach training data. This process ran continuously across all production workflows, giving us confidence the transpiler handled the full range of real-world patterns.
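The round-trip check described above can be sketched with a toy transpiler pair. Everything about the workflow shape here is hypothetical: the JSON keys, the `@trigger` decorator, and the example names are stand-ins, not Flow's real DSL, and the real transpiler handles far more than one condition and one action. The sketch assumes Python 3.9 or later, where `ast.Subscript.slice` holds the index expression directly.

```python
import ast

# Comparison operators the toy format supports (an assumption, not Flow's list).
_OPS = {ast.Gt: ">", ast.Lt: "<", ast.Eq: "=="}


def to_python(wf: dict) -> str:
    """Toy JSON -> Python: one trigger, one condition, one action."""
    cond = wf["condition"]
    return (
        f"@trigger({wf['trigger']!r})\n"
        f"def workflow(event):\n"
        f"    if event[{cond['field']!r}] {cond['op']} {cond['value']!r}:\n"
        f"        {wf['action']}(event)\n"
    )


def to_json(source: str) -> dict:
    """Toy Python -> JSON: read the same structure back from the syntax tree."""
    fn = ast.parse(source).body[0]   # the decorated function definition
    branch = fn.body[0]              # the single `if` statement
    test = branch.test               # the comparison inside the `if`
    return {
        "trigger": fn.decorator_list[0].args[0].value,
        "condition": {
            "field": test.left.slice.value,
            "op": _OPS[type(test.ops[0])],
            "value": test.comparators[0].value,
        },
        "action": branch.body[0].value.func.id,
    }


def round_trip_mismatches(workflows):
    """Return every workflow that does not survive JSON -> Python -> JSON."""
    return [wf for wf in workflows if to_json(to_python(wf)) != wf]
```

The source is only parsed, never executed, so the undefined `trigger` decorator and action name are harmless. The toy also shows why exhaustive round-trip testing matters: a negative literal such as `-5` parses as a unary operation rather than a constant, so `to_json` would raise on it, which is the kind of edge case that running the check over every production workflow is meant to surface.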
At inference time, the model writes Python. The transpiler converts it to JSON for the Flow backend. Store owners never see Python, and the backend never has to understand it. Python is the model's internal language.

Prior work has explored Python as an intermediate representation (SPEAC, LLMLift, WorkflowLLM), but via prompting or without a round-trip transpiler. What distinguishes this approach is the full loop: fine-tuning on Python combined with a transpiler back to the production DSL, without changing any downstream systems.

If you're training a model on a custom DSL, consider translating it into a language the model already knows. This helps separate learning the format from learning the task.

As the results show, the gap narrows, but there is still room for improvement. At that point, the next step is to bring the system into production, learn from real usage, and incorporate real user feedback.

Mirroring the production environment

Representation was one half of the data problem. The other half was making sure the model's training data matched exactly what it would see in production.

We knew training data should match production. What we didn't expect was how sensitive the model is to the degree of match. Every difference we closed, no matter how minor, improved eval scores:

- Tool naming and ordering: Training data used the full prefixed name flow_app_agent_task_search. At inference, the same tool was called task_search. Functionally identical, but the model treated them as different tools. Removing the prefix from training data to match inference improved accuracy. The order in which the tools appeared in the system prompt also mattered. Shuffle the order between training and serving, and performance drops.
- Tool response format: Tool responses return JSON objects with multiple fields. In the training data, we sorted keys alphabetically. If production returned them in a different order, or included an extra field, the model noticed. Any drift between what the training data showed and what production APIs actually returned degraded accuracy.
- System prompt and tool descriptions: Tool descriptions in production changed frequently as the product team iterated on behavior. Every update had to be reflected in the training data, or the model's behavior drifted. Keeping both in sync was an ongoing process, not a one-time fix.

None of these are about the logic of the task. They are formatting details. The model treats every token as a signal, whether you intended it or not.

Optimizing the tool-calling stack

When an agent calls tools, every response becomes part of the context. Context grows, latency grows, cost grows. Worse, irrelevant context dilutes the signal. The model reasons less accurately when it's processing information it won't use.

We restructured our tool interfaces to minimize context at each step. Instead of returning full details for every result upfront, tools return lightweight summaries first. The model scans the summaries, selects what it needs, then retrieves full details only for the items it selected. Two cheap calls instead of one expensive one.

For example, Flow has hundreds of available triggers, conditions, and actions. A search might return 100 matches. Rather than loading the full configuration schema for each one, task_search returns just names and descriptions. The model picks the 2-3 it actually needs, then calls task_configuration to get the full schema only for those. The context stays small, the reasoning stays focused.

[Figure: Shopify Flow workflow created]

Making training fast

As our data pipeline grew, so did a tension: more training data improved accuracy but slowed each run. Slower runs meant fewer iterations, and fewer iterations meant slower improvement. We needed a way to use all the data and still retrain weekly.
We built the infrastructure to make both possible. Qwen3-32B trains on two nodes of H200 GPUs with Fully Sharded Data Parallel (FSDP). A full train