YouTube video: How To Approach Your AI Evals
Channel: Hamel Husain
Published: 2026-06-16
URL: https://www.youtube.com/watch?v=DZxaPNYi_k0

Description: Join the AI Evals September 2026 cohort: https://maven.com/parlance-labs/evals?promoCode=yt-2026

If you're new to my channel, my name is Hamel Husain. I spent 20 years in data and machine learning at banks, startups, Airbnb, GitHub, before most people knew what any of that meant. In 2023 I started advising AI companies and noticed the same gap everywhere: nobody had a real way to test, measure, and improve their AI products. They were shipping on vibes.

How I got here...

2003: Built credit-ri

Transcript: The most important thing to keep in mind, or develop a skill for, when we talk about evals is really looking at data. What that means is, when you're building your AI application, you want to log your traces and your user interactions in a very specific way. And then you want to systematically analyze that data. There's a bit of qualitative and quantitative analysis you want to do on that data. A lot of people skip the qualitative aspect. But really, in its simplest form, what you can do is set up a nice workflow for yourself. You can even code your own data annotation tools using Claude. And you can just annotate some data, put your product hat on, and understand what is going wrong from a user's perspective. Is there anything that's not ideal? Putting your eyes on some data and annotating that data, with your product hat on, is the most powerful foundation for evals, because it grounds you in what is actually broken in your application.
So, the industry at large will steer you towards, "Hey, buy my eval tool. We're going to throw up a dashboard with a bunch of scores: hallucination score, helpfulness score, whatever." That doesn't help you. It gives you the illusion that you're doing evals. So what you need to do is start by looking at data and doing some data analysis. When I say data analysis, you can just start with counting. Counting is the most powerful form of data analysis as a baseline. I mean, you can do other things, but start there. And you can quickly see what is actually broken in your application, what's not ideal. That data analysis will help ground you in what you should measure and what you should focus on. It will tell you, "Okay, there are certain things you should just go fix in your application."
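As a rough illustration of "start with counting", here is a minimal Python sketch. It assumes each reviewed trace has already been annotated by hand with a short failure-mode note; the record fields (`trace_id`, `failure_mode`) and the example labels are hypothetical and not taken from the video.

```python
from collections import Counter

# Hypothetical annotations: one record per reviewed trace. The field names
# and failure-mode labels are invented for illustration only.
annotations = [
    {"trace_id": "t1", "failure_mode": "ignored user constraint"},
    {"trace_id": "t2", "failure_mode": None},  # None means the trace looked fine
    {"trace_id": "t3", "failure_mode": "wrong tone"},
    {"trace_id": "t4", "failure_mode": "ignored user constraint"},
    {"trace_id": "t5", "failure_mode": None},
    {"trace_id": "t6", "failure_mode": "ignored user constraint"},
]


def count_failure_modes(records):
    """Count how often each failure mode was noted, most common first."""
    counts = Counter(r["failure_mode"] for r in records if r["failure_mode"])
    return counts.most_common()


failure_counts = count_failure_modes(annotations)
for mode, n in failure_counts:
    print(f"{mode}: {n} of {len(annotations)} traces")
```

Even a tally this simple makes it visible which problem shows up most often, which is the kind of grounding the transcript describes before choosing what to measure.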
And there are other things that maybe you don't know how to fix. Those are where evals are the most helpful, because they give you a kind of harness you can use to improve. These tests and these metrics help you more when you need to hill climb against a hard problem, especially if it's an expensive eval. So there are different kinds of evals. There are code-based evals, which are like your traditional software tests, where you do assertions and deterministic tests. And then there are more subjective tests, like LLM as a judge. An LLM judge is more costly. So you have to think about what you want to test and how you want to test it, think about what your budget is, and, given your data analysis, think about what your test suite should be.
The unique part about AI evals is this: people have been doing software testing for a long time, but this subjective nature of testing is a little bit different. So people use LLM as a judge. It's getting one LLM to measure another LLM. As soon as you hear that, you might wonder, why is that trustworthy? Why should I rely on another LLM to tell me if my product is good? What if it is wrong? And it's often the case that it is wrong. So what you can do is measure it. You can measure how good your LLM as a judge is. At baseline, an LLM judge can be treated like a classifier, like a black-box classifier. It's telling you something's good or bad. You don't know exactly what the internal process of that classifier is. It's just giving you a prediction. So what you need to do is measure your LLM judges against human labels. So you need to collect human labels. There are efficient ways of doing that, but you need to have those so you can trust your judge.
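Treating the judge as a black-box classifier suggests scoring it the way any binary classifier is scored against ground truth. The sketch below is one possible way to do that, not a method shown in the video: it assumes pass/fail labels (`True` = pass) from a human reviewer and from the judge on the same traces, and the function name `judge_agreement` and the sample labels are hypothetical.

```python
def judge_agreement(human_labels, judge_labels):
    """Compare binary pass/fail labels (True = pass) from a human and an LLM judge."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)          # both say pass
    tn = sum(1 for h, j in pairs if not h and not j)  # both say fail
    fp = sum(1 for h, j in pairs if not h and j)      # judge passes a human fail
    fn = sum(1 for h, j in pairs if h and not j)      # judge fails a human pass
    return {
        "agreement": (tp + tn) / len(pairs) if pairs else None,
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
    }


# Made-up labels for eight traces, purely for illustration.
human = [True, True, False, False, True, False, True, True]
judge = [True, True, True, False, True, False, False, True]
metrics = judge_agreement(human, judge)
print(metrics)
```

Reporting the true positive and true negative rates separately, rather than raw agreement alone, is a deliberate choice here: when most traces pass, a judge that almost always says "pass" can show high agreement while missing the failures that matter.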
And the reason why I'm telling you this is, when you go out into the world and work at companies working on AI, you'll see LLM judges a lot. You'll see some metric like helpfulness. And you might ask somebody, "How did you calculate this helpfulness?" And they will tell you, "Oh, this is an LLM judge." Or they might tell you it's some off-the-shelf metric. But you should always think to yourself, well, how do I trust that? Do you have human labels? How did you calibrate that? Can I trust this? You don't want to get into a situation where you're just using these metrics, because if there's a drift between your evals and the actual user experience, like if your evals say it's good and the user experience is really bad, no one's going to trust your evals. And if no one trusts your evals, they don't trust you. So you want to avoid that.