Spend the first twenty minutes on one live test: send a real lead and watch what happens, end to end, without anyone touching the keyboard. What the system does in those twenty minutes tells you more than the rest of the deck.
Then probe the seams: ask where context comes from, what happens when the model is wrong, and how a correction propagates. Vague answers to those questions are reason enough to end the call early.