OpenAI just released a 34-page practical guide to building agents.
Here are 10 things it teaches us:
1➜ agents are different from workflows: they are fully autonomous systems that perform tasks on your behalf. many applications use LLMs in workflows, but that alone doesn't make them agents.
2➜ use them for tricky stuff: complex decision making, dynamic rules, unstructured data
3➜ core recipe: each agent has three main components: Model (the brain), Tools, and Instructions on how to behave (see the minimal sketch after this list)
4➜ choose the right brain: set up evals to get a performance baseline, use the smartest model to see what's possible, then gradually downgrade to smaller models for cost and speed
5➜ tools are key: use well-defined, well-tested tools. an agent needs tools to retrieve data and context, and to take actions.
6➜ instructions matter A LOT: be super clear when telling the agent its goals, steps, and rules. Vague instructions = unpredictable agent. Be explicit.
7➜ start simple, then scale: often a single agent with several tools is ok. don't jump to complex multi-agent systems immediately.
8➜ if you go multi-agent: you can have a "manager" agent directing traffic to specialist agents, or have agents hand off tasks to each other.
9➜ guardrails are a MUST: check user input for weird stuff, make sure the agent isn't about to do something risky, filter out private info, block harmful content. Don't let it run wild.
10➜ build iteratively and plan for humans: start small, test, improve. always have a path for human intervention when the agent gets stuck or is about to do something high-risk.
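To make points 3, 5, 8 and 9 concrete, here is a minimal sketch (my own, not taken from the guide) of a tiny support agent built with the OpenAI Agents SDK. The lookup_order tool, the refund wording, and the crude pre-check are all made up for illustration:

```python
# pip install openai-agents
from agents import Agent, Runner, function_tool

@function_tool
def lookup_order(order_id: str) -> str:
    """Fetch an order's status from a (hypothetical) backend."""
    return f"Order {order_id}: shipped"

# a specialist agent that the manager can hand off to (point 8)
refund_agent = Agent(
    name="Refund specialist",
    instructions="Handle refund requests. Ask for an order id, then explain the refund policy.",
)

# model + tools + instructions: the core recipe (points 3, 5, 6)
support_agent = Agent(
    name="Support manager",
    instructions=(
        "You answer customer-support questions. "
        "Use lookup_order for order status. "
        "Hand off refund requests to the refund specialist. "
        "Never promise anything outside the stated policy."
    ),
    model="gpt-4.1",  # start with a smart model, downgrade once evals pass (point 4)
    tools=[lookup_order],
    handoffs=[refund_agent],
)

def run(user_input: str) -> str:
    # crude guardrail in the spirit of point 9: refuse obviously risky input
    if "credit card number" in user_input.lower():
        return "Sorry, I can't handle raw card numbers."
    result = Runner.run_sync(support_agent, user_input)
    return result.final_output

print(run("Where is order 1234?"))
```

The same structure scales: keep the single agent until its tool list or instructions get unwieldy, then split into specialists behind the manager (point 7).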
GPT-4.1 dropped this week - and it puts OpenAI back in the race for coding & agentic leadership.
⚙️ API only - no ChatGPT toggle for this.
💻 Coding performance is back on par with Claude 3.7 Sonnet & Gemini 2.5 Pro (though Gemini still leads).
💸 Pricing:
• Full: $3.50 / 1M tokens
• Mini: $0.70 / 1M
• Nano: $0.17 / 1M
👉 Gemini 2.5 Pro = best price/perf ($3.44 / 1M)
😵 Claude 3.5 Sonnet = $6 / 1M (!)
🧠 Not a "thinking" model.
📊 Mini shines on general reasoning tasks (e.g. GPQA), but only the full model holds up on SWE-bench Verified (GitHub issue solving).
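Since it's API-only, a minimal sketch of calling it from Python (assumes the official openai package and an OPENAI_API_KEY in the environment; swap the model string for the mini or nano variants to trade quality for price):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4.1",  # or "gpt-4.1-mini" / "gpt-4.1-nano"
    input="Write a Python function that parses an ISO 8601 date string.",
)
print(response.output_text)
```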
A team from NUS and Microsoft just released an agent that can act on any UI (Desktop, Android, Web) without needing additional text information. It works extremely well: they applied their method to a tiny Qwen2-VL-2B and managed to beat methods that use much more powerful vision models (like GPT-4V), without relying on any additional info (e.g. the DOM of a webpage) like previous methods did! 👏👏
They started from the idea that most existing methods rely heavily on text, which makes them less generalizable, while setting aside the rich visual structure that users actually rely on when navigating these interfaces.
⚙️ They put several good ideas to work:
💡 Simplify screenshots to the max: They aggressively prune the heavy visual content of UI screenshots by removing cloned image patches (e.g. any large area of a single color gets reduced to a few patches, while keeping positional embeddings), then group patches from the same GUI element together to simplify even further (a toy sketch of the idea appears after this post)
💡 Build a truly generalist dataset: To train a general UI agent, you need trajectories from every kind of UI, expressed in a common language. The authors merge datasets like OmniACT for desktop, Mind2Web for websites, and AMEX for Android to create a high-quality, diverse dataset.
➡️ Nice results ensued: They fine-tune a tiny Qwen2-VL-2B with their method, and it reaches SOTA on several tasks (element identification, web navigation), even beating methods that either use additional info from the DOM or use much bigger VLMs like GPT-4V! 🏆
And performance could certainly jump with a slightly bigger vision model. Let's hope the community builds this soon! 🚀
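The paper's actual token selection is more sophisticated than this, but as a toy illustration (entirely my own, in numpy) of the "drop cloned patches, keep positions" idea from the first point above: split the screenshot into patches, keep one representative per run of identical flat-color patches, and remember each kept patch's grid coordinates so positional embeddings still apply.

```python
# Toy sketch (not the paper's code): prune near-duplicate flat patches from a
# screenshot while keeping each surviving patch's grid position.
import numpy as np

def prune_patches(image: np.ndarray, patch: int = 14, tol: float = 1e-3):
    """Split an HxWx3 image into patch x patch tiles and drop repeated flat tiles.

    Returns the kept tiles and their (row, col) grid positions, so a model could
    still attach the original positional embeddings to each one.
    """
    h, w, _ = image.shape
    rows, cols = h // patch, w // patch
    kept, positions, seen_flat = [], [], set()
    for r in range(rows):
        for c in range(cols):
            tile = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            if tile.std() < tol:  # near-uniform tile, e.g. blank background
                key = tuple(np.round(tile.mean(axis=(0, 1)), 2))
                if key in seen_flat:
                    continue  # clone of a flat tile we already kept once
                seen_flat.add(key)
            kept.append(tile)
            positions.append((r, c))
    return kept, positions

# A fake 224x224 screenshot: blank white page with a 28px "toolbar" of real content.
screenshot = np.ones((224, 224, 3), dtype=np.float32)
screenshot[:28, :, :] = np.random.rand(28, 224, 3)
tiles, positions = prune_patches(screenshot)
print(f"kept {len(tiles)} of {(224 // 14) ** 2} patches")  # 33 of 256
```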
Details
I am still rigorously testing different hyperparameters and comparing the impact of each one to find the best workflow. So far I have done 16 different full trainings and am completing 8 more at the moment. I am using my poor, overfit 15-image dataset for experimentation (4th image). I have already proven that when I use a better dataset it becomes many times better and generates expressions perfectly. Here is an example case: https://www.reddit.com/r/FluxAI/comments/1ffz9uc/tried_expressions_with_flux_lora_training_with_my/

Conclusions
When the results are analyzed, fine-tuning is far less overfit, more generalized, and better quality. In the first 2 images, it is able to change hair color and add a beard much better, meaning less overfit. In the third image, you will notice that the armor is much better, thus less overfit. I noticed that the environment and clothing are much less overfit and better quality.

Disadvantages
Kohya still doesn't have FP8 training, so 24 GB GPUs get a huge speed drop. Moreover, 48 GB GPUs have to use the Fused Back Pass optimization, so they also take some speed drop. 16 GB GPUs get a far more aggressive speed drop due to the lack of FP8. Clip-L and T5 training are still not supported.

Speeds
Rank 1 Fast Config — uses 27.5 GB VRAM, 6.28 seconds / it (LoRA is 4.85 seconds / it)
Rank 1 Slower Config — uses 23.1 GB VRAM, 14.12 seconds / it (LoRA is 4.85 seconds / it)
Rank 1 Slowest Config — uses 15.5 GB VRAM, 39 seconds / it (LoRA is 6.05 seconds / it)

Final Info
Saved checkpoints are FP16 and thus 23.8 GB (no Clip-L or T5 trained). According to Kohya, the applied optimizations don't change quality, so all configs are ranked as Rank 1 at the moment. I am still testing whether these optimizations make any impact on quality or not.