It's known that Language Models memorize data that can be extracted via prompting.
In this paper, the authors investigate this aspect:
- using open models, where prompting can be fully customized by the user, including special tokens;
- focusing on open-source models like OLMo, where the full training data is available.
📤 How do they extract data?
During post-training (like SFT), new tokens such as <|user|> are introduced.
The authors hypothesize that prompting the model with these tokens can make it output its alignment data (remember Magpie?).
For example, for SFT, their extraction prompt is <|endoftext|><|user|>.
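As a rough sketch of that setup (the checkpoint name and sampling parameters below are my assumptions, not the paper's exact configuration), the extraction loop could look like this:

```python
# Rough sketch of the extraction setup; the checkpoint name and sampling
# parameters are assumptions, not the paper's exact configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B-SFT"  # any open SFT checkpoint with known special tokens
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# The extraction prompt is just the special tokens introduced during post-training.
prompt = "<|endoftext|><|user|>"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# Sample many continuations; each one is a candidate (near-)memorized training example.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    max_new_tokens=512,
    num_return_sequences=8,
)
candidates = [tokenizer.decode(o, skip_special_tokens=False) for o in outputs]
```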
📏 Evaluating memorization
The authors compare each sampled example with the original data using vector search with embedding similarity.
They find that many outputs are semantically very similar to the original data, even if the exact words differ.
Traditional string-matching algorithms underestimate memorization by 10x.
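A minimal sketch of that comparison, assuming a generic sentence embedder and an illustrative similarity threshold (neither is necessarily what the paper used):

```python
# Sketch of the similarity-based check: embed generated samples and the original
# data, then flag near-duplicates by cosine similarity instead of exact string match.
# The embedding model and threshold are illustrative choices.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

training_examples = ["...original alignment examples..."]
generated_samples = ["...outputs sampled from the extraction prompt..."]

train_emb = embedder.encode(training_examples, convert_to_tensor=True, normalize_embeddings=True)
gen_emb = embedder.encode(generated_samples, convert_to_tensor=True, normalize_embeddings=True)

# For each generated sample, retrieve its nearest training example.
hits = util.semantic_search(gen_emb, train_emb, top_k=1)
THRESHOLD = 0.9  # illustrative cutoff for "semantically very similar"
memorized = sum(1 for h in hits if h and h[0]["score"] >= THRESHOLD)
print(f"{memorized}/{len(generated_samples)} samples flagged as near-duplicates")
```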
🔁 What about RL?
Surprisingly, the same technique works to extract data from Reinforcement Learning (PPO/GRPO) phases.
This is counter-intuitive because the RL objective is not designed to increase sequence likelihoods (unlike SFT).
Practical limitation: in this case, extraction relies on using the initial part of the training prompt, which is not generally public.
📈 Are the extracted data effective for post-training?
In both SFT and RL, the extracted data can be used to fine-tune models to performance similar to that of the originals.
The authors suggest that model distillation, where a stronger model is used to drive the training of a weaker one, may be a form of indirect training on the original dataset.
Introducing the Japanese honorifics dataset: a dataset with 137 sentences covering the three main keigo forms: 尊敬語 (Sonkeigo), 謙譲語 (Kenjōgo), and 丁寧語 (Teineigo). Each entry includes the base form, all three honorific transformations, and English translations for essential phrases in Japanese. This dataset is perfect for training and evaluating the Japanese skill level of LLMs.
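To take a quick look with the datasets library, something like the sketch below should work; the repo id is a guess based on the author's username, and the fields listed in the comment come from the description above:

```python
# The repo id below is a guess based on the author's username; adjust to the actual dataset name.
from datasets import load_dataset

ds = load_dataset("ronantakizawa/japanese-honorifics", split="train")  # hypothetical repo id
print(ds[0])
# expected fields per the description: base form, 尊敬語, 謙譲語, 丁寧語, English translation
```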
How to Use the AI Image Generator
https://miragic.ai/products/image-generator
1. Describe Your Vision: Enter a text prompt describing what you want to create. Example: "A futuristic city skyline at sunset with glowing airships."
2. Select a Style: Choose an art style—Realistic, Anime, Painterly, Surreal, or Minimalist—to match your idea.
3. Generate and Refine: Click "Generate Image" and let the AI do its magic. Want to tweak it? Refine your prompt or try a new style.
4. Download and Share: Save your creation in high resolution or share it directly on social media.
Building Smarter AI Agents: A Tool-Based Architecture for Modularity and Trust
Over the past year, our AI engineering team at GoDaddy has been rethinking how to make agent systems more modular, transparent, and production-ready. Instead of viewing an AI agent as a monolithic process, we’ve decomposed it into four core tools that separate decision-making from execution — a design that’s proving critical for scale and observability:
🧩 MemoryTool – maintains persistent context and user continuity
✅ CompletionTool – determines when a task is truly complete
💬 UserInteractionTool – manages clarifications, approvals, and confirmations
🔁 DelegationTool – enables agents to hand off tasks to other agents or humans
This approach makes every step of an agent’s workflow explicit, testable, and auditable, allowing us to scale AI systems in production with higher confidence. We see this as a step toward a more open, composable agent ecosystem — one where frameworks can interoperate and agents can build trust through transparency and version control.
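As a purely illustrative sketch (my own interfaces, not GoDaddy's actual code), the four-tool decomposition might look something like this:

```python
# Hypothetical sketch of the four-tool decomposition described above;
# the interfaces are an illustration, not GoDaddy's actual implementation.
from abc import ABC, abstractmethod
from typing import Any

class AgentTool(ABC):
    """Common interface so each step of the workflow is explicit and testable."""
    @abstractmethod
    def run(self, state: dict[str, Any]) -> dict[str, Any]: ...

class MemoryTool(AgentTool):
    def run(self, state):
        # persist conversation context so the next turn has user continuity
        state.setdefault("memory", {}).update(state.get("new_facts", {}))
        return state

class CompletionTool(AgentTool):
    def run(self, state):
        # decide whether the task's acceptance criteria are actually met
        state["done"] = bool(state.get("result"))
        return state

class UserInteractionTool(AgentTool):
    def run(self, state):
        # surface clarifications or approvals back to the user
        if state.get("needs_approval"):
            state["awaiting_user"] = True
        return state

class DelegationTool(AgentTool):
    def run(self, state):
        # hand the task off to another agent or a human queue
        if not state.get("done"):
            state["delegate_to"] = "specialist-agent"  # placeholder target
        return state
```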
We are excited to share our Atom V1 4B Preview model! This fine-tuned Gemma 3 4B variant has a distinct, friendly, and exploratory persona - designed to help the user think and reflect.
Atom is trained to ask questions, use approachable, yet relatable analogies in ELI5-style explanations, and engage in deep, reflective conversation.
We plan to scale Atom's persona to larger architectures, and this iteration was created as part of that R&D.
Any and all feedback is always welcome as we continue to refine our approach.
If you haven't played around with the Hugging Face Hub API yet, you should! I recently built this custom analytics dashboard to track how our models perform over time.
I'm hoping this tool will help us understand how the community uses the models, which types of models people prefer over others, and insights about what's maybe working and what isn't.
This is the second project I've built with the Hub API, and I can't wait to do more!
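For anyone curious, a minimal version of the stats-pulling part can be done with huggingface_hub; the "your-org" author name below is a placeholder:

```python
# Minimal sketch of pulling per-model stats from the Hub API; "your-org" is a placeholder.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(author="your-org", sort="downloads", direction=-1, limit=20):
    info = api.model_info(model.id)  # ModelInfo includes download and like counts
    print(f"{info.id}: {info.downloads} downloads, {info.likes} likes")
```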
Do you remember https://thispersondoesnotexist.com/ ? It was one of the first cases where the future of generative media really hit us. Humans are incredibly good at recognizing and analyzing faces, so they are a very good litmus test for any generative image model.
But none of the current benchmarks measure the ability of models to generate humans independently. So we built our own. We measure the models' ability to generate a diverse set of human faces and, using over 20,000 human annotations, we ranked all of the major models on their ability to generate faces. Find the full ranking here: https://app.rapidata.ai/mri/benchmarks/68af24ae74482280b62f7596
measuring the information content of a reasoning trace seems like a straightforward reasoning LLM KPI, but how can we achieve this?
what if we keep it simple: gzip the resulting text and take the length of the compressed stream... "compressed bytes of information per output token" becomes the KPI
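a minimal sketch of that KPI (using a naive whitespace split as a stand-in for the model's real tokenizer):

```python
# minimal sketch: gzip the reasoning trace and report compressed bytes per
# output token; a whitespace split stands in for the model's real tokenizer
import gzip

def compressed_bytes_per_token(trace: str) -> float:
    compressed = gzip.compress(trace.encode("utf-8"))
    n_tokens = max(len(trace.split()), 1)  # placeholder tokenizer
    return len(compressed) / n_tokens

repetitive = "the answer is 42. " * 50
dense = "compute 17*23 = 391, subtract 19 to get 372, then halve it: 186."
print(compressed_bytes_per_token(repetitive))  # low: little new information per token
print(compressed_bytes_per_token(dense))       # higher: more information per token
```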
if we split across correct answers vs incorrect answers vs truncated answers and group by difficulty, a whole new world of analysis becomes not just possible but visually intuitive and almost trivial:
1) what is the model's overall reasoning efficiency? this is the slope of the scatterplot curve segments (there may be more than one..)
2) is the model able to apply more test-time compute towards more difficult variations of the task? the two on the left are not, the two on the right are.
3) when applying more test-time compute, is that compute useful? this is the curvature of the scatterplot trends - the two in the middle are 'losing their mojo': as answers get longer, the information content falls off
4) is the model applying multiple approaches to the task? (right) do those approaches change with difficulty?
5) are truncations because we don't have enough context budget (left), or because the model has lost its mind and gone into a repeat loop (middle two)? and does this happen across the board (middle left) or only when the problem is more difficult (middle right)?
would love to hear you guys' feedback on this kind of analysis. is anyone doing similar work?
this approach generates 12 plots per model (one for each task) so quite a bit of data and i've been hesitant to publish it so far, consider this post a toe tip.
In November, together with Robonine, we are starting work on the next version of the SO ARM 102 manipulator. The version will be open source and agreed upon with @therobotbuilder, the creator of the original manipulator.
We are planning to:
- increase positioning accuracy by approximately 2x using Feetech STS 3250 motors
- increase working payload from 200 g to 300 g
- increase rigidity using parametric design optimization and stiffer plastic
- increase length to 550 mm
- increase folding angles
- use the ISO 9409-1-50-4-M6 mounting standard for the gripper
- use a parallel gripper in the default version
- update the mounting plate for different camera types: M3 grid with 12.5 mm pitch
- add table mounting standard 80x80 M8
The number of degrees of freedom and basic kinematics will remain the same.
Are there other things missing for working with SO ARM 100?
- Any standard inputs/outputs, for example?
- Status indicators?
- Perhaps some types of mounting for third-party grippers would be preferable?
- Anything else?
We are excited to share the first iteration of our dataset focused on human-AI collaborative tasks!
This dataset contains 3,050 lines of warm, collaborative, and natural conversational examples designed to teach the model how to effectively and efficiently problem solve back and forth with a human.
Additionally, the examples include <think> tags, showing the model proper internal reasoning.
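To give a sense of the format, here is a made-up example in the spirit of the description above; the field names and content are illustrative, not the dataset's actual schema:

```python
# Made-up example entry; field names and content are illustrative only.
example = {
    "messages": [
        {
            "role": "user",
            "content": "I'm torn between two apartment layouts. Can you help me think it through?",
        },
        {
            "role": "assistant",
            "content": (
                "<think>The user wants to problem-solve collaboratively, not get a verdict. "
                "Ask about their day-to-day priorities before comparing layouts.</think>\n"
                "Happy to work through it together! What matters most to you day to day: "
                "natural light, storage, or a dedicated workspace?"
            ),
        },
    ]
}
```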
VANTA Research is committed to contributing back to the open source community in order to make AI development more accessible, transparent, and beneficial for all.
There is no anxiety quite like powering up 2KW of basement compute after rewiring it all. Small bit of trouble with the horizontal 3090 because I misread my motherboard manual, but otherwise so far so good.. Next we see if I've built up enough cooling to hit my target TDP on those 3-slot nvlinked cards especially. The 4-slot bridges are much easier to work with but their prices went bananas and I couldn't acquire a second, so gotta get a little creative with intakes.
SpeedPaint is an AI Speed Painting software that simulates how an artist paints — step by step, layer by layer — but at machine speed. Instead of generating a finished image instantly, it paints in motion, giving users a live-brush experience.
I just got asked about the differences between Blackwell systems and Grace Blackwell systems. What's the difference and how much of a performance gap is there between them?
Here's a summary of the key points from the article:
- GB200 (Grace Blackwell) is a Superchip: it integrates a Grace CPU and two Blackwell GPUs into a single package.
- B200 is a GPU-only module: it's designed to be paired with x86 or ARM CPUs in more traditional server setups.
Performance and Efficiency:
Based on MLPerf Training v5.0 benchmarks, the article concludes:
GB200 systems are approximately 42% more efficient than B200 systems on average. This is especially true in large-scale deployments (100+ GPUs), where the GB200's integrated design and high-speed NVLink interconnect provide a significant advantage.
In smaller, single-node systems (e.g., 8 GPUs), the performance difference is much smaller, around 10-15%.
Use Cases:
Choose GB200 for large-scale AI clusters, training massive models, and when maximum efficiency is the top priority.
Choose B200 for smaller deployments, when you need the flexibility to choose your own CPU, or for mixed AI and HPC workloads.
Load test conducted on the Feetech STS3250 servo motor.
With a 2 kg load on a 100 mm arm, the motor operated near its limit. At higher acceleration settings, lifting performance decreased noticeably. The temperature increased from 40 °C to 70 °C within 8 minutes. The test highlights the torque and thermal constraints under sustained load conditions.
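For reference, a back-of-the-envelope calculation of the static torque this setup demands (gravity only, arm mass ignored):

```python
# Static torque required to hold a 2 kg load at the end of a 100 mm arm.
load_kg = 2.0
arm_m = 0.100
g = 9.81

torque_nm = load_kg * g * arm_m        # ~1.96 N*m
torque_kgcm = load_kg * arm_m * 100.0  # ~20 kg*cm, the unit servo datasheets usually quote
print(f"{torque_nm:.2f} N*m (~{torque_kgcm:.0f} kg*cm)")
```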
A totally free image generator with the best quality.
Transform your imagination into breathtaking visuals with our advanced AI technology. No skills required—just describe your vision and watch the magic happen
On this day in 2019, OpenAI released the final GPT-2 model as part of their staged release. I still remember that November well - so much was happening, but GPT-2's release felt like a watershed moment for the field. It showed us what was possible with carefully trained language models.
To recreate some of that GPT-2 magic, I recently tackled an interesting challenge: can you pretrain a language model with just 1 billion tokens - roughly 1/10th of what GPT-2 used - and still get comparable performance? After 50+ systematic experiments testing different dataset mixtures, the answer is yes.
The result is codelion/gpt-2-70m, which achieves over 90% of GPT-2's benchmark performance despite being trained on 10x less data. The key was finding the optimal dataset composition: 50% high-quality textbook PDFs, 30% filtered web content, and 20% educational resources. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).
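As a sketch of how such a mixture could be assembled with the datasets library (the dataset names below are placeholders, not the actual sources used in the experiments):

```python
# Sketch of a 50/30/20 pretraining mixture with the datasets library.
# The three dataset names are placeholders for the textbook / web / educational sources.
from datasets import load_dataset, interleave_datasets

textbooks = load_dataset("your-org/textbook-corpus", split="train", streaming=True)     # placeholder
web = load_dataset("your-org/filtered-web-corpus", split="train", streaming=True)       # placeholder
edu = load_dataset("your-org/educational-resources", split="train", streaming=True)     # placeholder

mixture = interleave_datasets(
    [textbooks, web, edu],
    probabilities=[0.5, 0.3, 0.2],  # the composition reported above
    seed=42,
)

# iterate a few examples to sanity-check the blend
for i, ex in enumerate(mixture):
    if i >= 3:
        break
    print(ex)
```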