The A.I Megathread (Large Language Models / LLM's, ChatGPT, Development)

ujol · Oct 3, 2025

1/7
@AIfeedFyi
Major developments in AI last week.

1. Grok Imagine with voice input.
2. ChatGPT introduces branching.
3. Google drops EmbeddingGemma.
4. Kimi K2 update.
5. Alibaba Qwen3-Max-Preview.

Full breakdown of the AI feed below ↓

2/7
@AIfeedFyi
1. Elon Musk's xAI announces Grok Imagine now accepts speech input.

Users can now generate animated clips directly from voice prompts.

[Quoted tweet]
Grok video now has speech.

Also, major upgrade to image/video generation in training. Should be ready in ~2 weeks.

3/7
@AIfeedFyi
2. ChatGPT adds the ability to branch a conversation: you can spin off new threads without losing the original.

Nice feature for testing different directions in parallel.

[Quoted tweet]
By popular request: you can now branch conversations in ChatGPT, letting you more easily explore different directions without losing your original thread.

Available now to logged-in users on web.

4/7
@AIfeedFyi
3. Google introduces EmbeddingGemma.

→ 308M parameter embedding model built for on-device AI.

→ Delivers SOTA performance while being small and efficient enough to run anywhere.

https://video.twimg.com/amplify_video/1963634357878333440/vid/avc1/1920x1080/8j44A5DJKjMWnGIA.mp4

5/7
@AIfeedFyi
4. Moonshot AI releases the Kimi K2-0905 update.

→ Better coding (front-end & tool use).
→ 256k token context window.

[Quoted tweet]
Kimi K2-0905 update

- Enhanced coding capabilities, esp. front-end & tool-calling
- Context length extended to 256k tokens
- Improved integration with various agent scaffolds (e.g., Claude Code, Roo Code, etc)

Weights & code: huggingface.co/moonshotai/Ki…

Chat with new Kimi K2 on: kimi.com

For 60–100 TPS + guaranteed 100% tool-call accuracy, try our turbo API: platform.moonshot.ai

6/7
@AIfeedFyi
5. Alibaba rolls out Qwen3-Max-Preview.

→ Biggest model yet, with over 1 trillion parameters.

→ Better at reasoning, code generation, and conversation than past Qwen releases.

[Quoted tweet]
Big news: Introducing Qwen3-Max-Preview (Instruct) — our biggest model yet, with over 1 trillion parameters!

Now available via Qwen Chat & Alibaba Cloud API.

Benchmarks show it beats our previous best, Qwen3-235B-A22B-2507. Internal tests + early user feedback confirm: stronger performance, broader knowledge, better at conversations, agentic tasks & instruction following.

Scaling works — and the official release will surprise you even more. Stay tuned!

Qwen Chat: chat.qwen.ai/
Alibaba Cloud API: modelstudio.console.alibabac…

7/7
@AIfeedFyi
Explore http://aifeed.fyi and follow us @AIfeedFyi for full AI signals, breakdowns, and everything happening in AI right now.

#AIfeed

ujol · Oct 3, 2025

1/17
@ArtificialAnlys
IBM has launched Granite 4.0 - a new family of open weights language models ranging in size from 3B to 32B. Artificial Analysis was provided pre-release access, and our benchmarking shows Granite 4.0 H Small (32B/9B total/active parameters) scoring an Intelligence Index of 23, with a particular strength in token efficiency

Today IBM released four new models: Granite 4.0 H Small (32B/9B total/active parameters), Granite 4.0 H Tiny (7B/1B), Granite 4.0 H Micro (3B/3B) and Granite 4.0 Micro (3B/3B). We evaluated Granite 4.0 H Small (in non-reasoning mode) and Granite 4.0 Micro using the Artificial Analysis Intelligence Index. Granite 4.0 models combine a small number of standard transformer-style attention layers with a majority of Mamba layers, which IBM claims reduces memory requirements without impacting performance.

Key benchmarking takeaways:
➤ Granite 4.0 H Small Intelligence: In non-reasoning mode, Granite 4.0 H Small scores 23 on the Artificial Analysis Intelligence Index - a jump of +8 points on the Index compared to IBM Granite 3.3 8B (Non-Reasoning). Granite 4.0 H Small places ahead of Gemma 3 27B (22) but behind Mistral Small 3.2 (29), EXAONE 4.0 32B (Non-Reasoning, 30) and Qwen3 30B A3B 2507 (Non-Reasoning, 37) in intelligence
➤ Granite 4.0 Micro Intelligence: On the Artificial Analysis Intelligence Index, Granite 4.0 Micro scores 16. It places ahead of Gemma 3 4B (15) and LFM 2 2.6B (12).
➤ Token efficiency: Granite 4.0 H Small and Micro demonstrate impressive token efficiency - Granite 4.0 H Small uses 5.2M tokens, while Granite 4.0 Micro uses 6.7M tokens, to run the Artificial Analysis Intelligence Index. Both models use fewer tokens than Granite 3.3 8B (Non-Reasoning) and most other open weights non-reasoning models smaller than 40B total parameters (except Qwen3 0.6B, which uses 1.9M output tokens)

Key model details:
➤ Availability: All four models are available on Hugging Face. Granite 4.0 H Small is available on Replicate and is priced at $0.06/$0.25 per 1M input/output tokens
➤ Context Window: 128K tokens
➤ Licensing: The Granite 4.0 models are available under the Apache 2.0 license
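To put the token-efficiency and pricing figures above together, here is a rough sketch of the output-token cost of one Index run for Granite 4.0 H Small. It assumes the 5.2M tokens quoted in the thread are all billed as output tokens at Replicate's $0.25 per 1M rate and ignores input-token cost entirely; `output_cost_usd` is a hypothetical helper, not part of any provider SDK.

```python
def output_cost_usd(output_tokens: float, price_per_million: float) -> float:
    """Cost in USD of generating `output_tokens` at a per-1M-token output price."""
    return output_tokens / 1_000_000 * price_per_million


# Figures quoted in the thread above.
GRANITE_H_SMALL_TOKENS = 5.2e6   # tokens used to run the Intelligence Index
REPLICATE_OUTPUT_PRICE = 0.25    # USD per 1M output tokens

# Assumption: every one of those tokens is billed at the output rate.
granite_h_small_cost = output_cost_usd(GRANITE_H_SMALL_TOKENS, REPLICATE_OUTPUT_PRICE)
print(f"Estimated output-token cost: ${granite_h_small_cost:.2f}")
```

This is an estimate of output spend only, not the "Cost to Run" figure Artificial Analysis reports, which may include input tokens.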

2/17
@ArtificialAnlys
Granite 4.0 H Small’s (Non-Reasoning) output token efficiency and per token pricing offer a compelling tradeoff between intelligence and Cost to Run Artificial Analysis Intelligence Index

3/17
@ArtificialAnlys
In the category of Open Weights Non-Reasoning models smaller than 40B total parameters, Granite 4.0 H Small is on the frontier of the tradeoff between intelligence and Output Tokens Used in Artificial Analysis Intelligence Index

4/17
@ArtificialAnlys
In the category of Open Weights Non-Reasoning models smaller than 4B total parameters, Granite 4.0 Micro is on the frontier of the tradeoff between intelligence and Output Tokens Used in Artificial Analysis Intelligence Index

5/17
@ArtificialAnlys
Compare how the Granite 4.0 models perform relative to other models you are using or considering at: http://artificialanalysis.ai/models/granite-4-0-h-small

6/17
@BrandGrowthOS
Intelligence Index of 23 - how does that translate to latency and cost in production though? Benchmarks matter less than real-world performance.

7/17
@IBMwatsonx
It was great to work with the @ArtificialAnlys team on this and looking forward to many more collaborations to come! 

8/17
@RilaGlobal
The hybrid use of transformer and Mamba layers suggests we’re entering a phase where architecture innovation, not just parameter count, defines model competitiveness.

9/17
@IBMDeveloper

10/17
@liputanai
@PingThread unroll

11/17
@sealcoin_ai
Granite 4.0’s efficiency is exciting, especially for agent transactions like those with Sealcoin. With satellite connectivity and post-quantum security, it could play a key role in streamlining autonomous transactions. Looking forward to seeing its potential in action.

12/17
@liputanai
@threadreaderapp unroll

13/17
@Marktechpost

[Quoted tweet]
IBM Released new Granite 4.0 Models with a Novel Hybrid Mamba-2/Transformer Architecture: Drastically Reducing Memory Use without Sacrificing Performance

IBM’s Granite 4.0 is an open-weights LLM family that swaps a monolithic Transformer for a hybrid Mamba-2/Transformer stack, cutting serving memory (IBM reports 70% reduction in long-context, concurrent inference) while maintaining instruction-following and tool-use quality. The lineup spans ~3B (Micro/H-Micro), ~7B total/~1B active (H-Tiny), and ~32B total/~9B active (H-Small) with BF16 checkpoints and official GGUF conversions for local runtimes. Models are Apache-2.0 licensed, cryptographically signed, and—per IBM—covered by an accredited ISO/IEC 42001 AI management system certification; distribution includes watsonx.ai, Hugging Face, Docker, LM Studio, NVIDIA NIM, Ollama, and Replicate. Benchmarks and specs are detailed in IBM’s launch notes and model cards.

full analysis: marktechpost.com/2025/10/02/…

model series on hugging face: huggingface.co/collections/i…

technical details: ibm.com/new/announcements/ib…

@IBM @IBMwatsonx @IBMResearch @IBMData @IBMcloud @IBMNews

14/17
@PaceRubie10352
Interesting to see IBM's new open models. Makes me wonder what @APhill_Mark thinks about these smaller, efficient models for local deployment.

15/17
@Aoomsn
Noise

16/17
@VibeCodeTeddy
Granite 4.0 H Small really pushes the boundaries for non-reasoning models. That increased token efficiency is a game changer for deployment costs. How will these compare in real-world applications?

17/17
@sir4K_zen
Impressive token efficiency, though I wonder how well it scales in real-world applications.

ujol · Oct 3, 2025

1/13
@ArtificialAnlys
Kling 2.5 Turbo takes the top spot in both Text to Video and Image to Video in the Artificial Analysis Video Arena, surpassing Hailuo 02 Pro, Google’s Veo 3, and Luma Labs’ Ray 3!

Kling 2.5 Turbo is the latest release from @Kling_ai , representing a significant leap from Kling 2.1. The model supports 5s and 10s video generations at resolutions up to 1080p.

It's available directly on the Kling AI app at 25 Credits for a 5s video, with videos costing approximately 15c each on the highest "Ultra" plan. The model is also accessible via API on major API platforms.

At $4.20 per minute of video on @fal , Kling 2.5 Turbo is slightly cheaper than its primary competitors - Hailuo 02 Pro at $4.90 per minute and Seedance 1.0 at approximately $7.32 per minute - while delivering superior quality.
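As a quick sanity check on the per-minute prices above, the sketch below converts them to a per-clip cost for the 5s and 10s lengths Kling 2.5 Turbo supports. It assumes billing scales linearly with clip duration, which the thread does not confirm, and `clip_cost_usd` is a hypothetical helper.

```python
def clip_cost_usd(price_per_minute: float, clip_seconds: float) -> float:
    """Cost in USD of one clip, assuming price scales linearly with duration."""
    return price_per_minute / 60 * clip_seconds


# Per-minute prices quoted in the thread above (USD).
PRICES_PER_MINUTE = {
    "Kling 2.5 Turbo": 4.20,
    "Hailuo 02 Pro": 4.90,
    "Seedance 1.0": 7.32,
}

for model, price in PRICES_PER_MINUTE.items():
    for seconds in (5, 10):
        print(f"{model}: {seconds}s clip costs about ${clip_cost_usd(price, seconds):.2f}")
```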

See below for comparisons between Kling 2.5 Turbo and other leading models in our Artificial Analysis Video Arena

2/13
@ArtificialAnlys
[Prompt 1/5] The skater does a few tricks in a skate park. Halfway through the video, the video transitions into colour.

https://video.twimg.com/amplify_video/1973570064059342853/vid/avc1/1920x1080/_uioIr0qPChYLlZk.mp4

3/13
@ArtificialAnlys
[Prompt 2/5] The skier does a sick trick!

https://video.twimg.com/amplify_video/1973569949559037953/vid/avc1/1920x1080/8TcArA5UYNXojvkO.mp4

4/13
@ArtificialAnlys
[Prompt 3/5] The tennis player completes the serve motion, following through with the racket as the ball launches powerfully across the court.

https://video.twimg.com/amplify_video/1973570160134135808/vid/avc1/1920x1080/7Ykl8jXeK3SJNVAb.mp4

5/13
@ArtificialAnlys
[Prompt 4/5] A 3D animation showing a futuristic submarine navigating through an underwater canyon, its lights illuminating the dark, mysterious depths. As the camera follows the submarine, strange and bioluminescent sea creatures dart past.

https://video.twimg.com/amplify_video/1973556294549381120/vid/avc1/1920x1080/jpGkXxfj85qzjp71.mp4

6/13
@ArtificialAnlys
[Prompt 5/5] A cinematic slow-motion shot of a tiger stalking through the jungle. Its muscles ripple beneath its striped fur as it moves with calculated grace through the undergrowth, eyes locked on its prey just ahead.

https://video.twimg.com/amplify_video/1973556317219594246/vid/avc1/1920x1080/RgtwPKU7_2-HTTLO.mp4

7/13
@ArtificialAnlys
See Kling 2.5 Turbo for yourself in the Artificial Analysis Video Arena! https://artificialanalysis.ai/text-to-video/arena

8/13
@Kling_ai
Excited to announce that our 2.5 Turbo (1080p) model takes the top spot in both Text to Video and Image to Video in the Artificial Analysis Video Arena!

9/13
@EdDiberd
Interesting, it doesn't have native audio gen though

10/13
@altryne
when will you add sora2?

11/13
@galaxyai__
AI video space turning into mario kart and kling got the blue shell

12/13
@XtremeNodes
This is good

13/13
@_YoungTurbo_
This was probably made before Sora 2 was out

ujol · Oct 3, 2025

1/12
@ArtificialAnlys
Reve V1 debuts at #3 in the Artificial Analysis Image Editing Leaderboard, trailing only Gemini 2.5 Flash (Nano-Banana) and Seedream 4.0!

Reve V1 is the first image editing model from Reve AI, and is built on their latest text to image model. The Reve V1 model supports both single and multi-image edits, with the ability to combine multiple reference images into a single output image.

The model is available via the Reve web app, which offers free access with a daily usage limit, or expanded usage through their Pro plan at $20/month.

Reve V1 is also accessible via the Reve API Beta priced at $40/1k images, similar to competitors like Gemini 2.5 Flash ($39/1k) and Seedream 4.0 ($30/1k).

See the Reve V1 Image Editing model for yourself in the thread below

2/12
@ArtificialAnlys
[Prompt 1/5] Change the sign to state "SCHOOL Zone Ahead"

3/12
@ArtificialAnlys
[Prompt 2/5] Change the stroke to freestyle

4/12
@ArtificialAnlys
[Prompt 3/5] Have the kid dunking over a player at a professional nba game!

5/12
@ArtificialAnlys
[Prompt 4/5] Place these shoes on the feet of an athlete running a marathon

6/12
@ArtificialAnlys
[Prompt 5/5] Turn it into anime-style and give the cat blue eyes

7/12
@ArtificialAnlys
See Reve V1 in the Artificial Analysis Image Arena for yourself! https://artificialanalysis.ai/text-to-image/arena

8/12
@jeongmin1604
Did you release for new version of 'Qwen Edit'? Just wonder it.

9/12
@XtremeNodes

10/12
@liputanai
@threadreaderapp unroll

11/12
@liputanai
@PingThread unroll

12/12
@sir4K_zen
Multi-image edits and API pricing look solid, wonder how it really compares in quality?

ujol · Oct 3, 2025

1/13
@ArtificialAnlys
DeepSeek has launched V3.2 Exp with their new DeepSeek Sparse Attention (DSA) architecture that claims to reduce the impact of the quadratic scaling of compute with context length

We’ve independently benchmarked V3.2 Exp as achieving similar intelligence to DeepSeek V3.1 Terminus; DeepSeek have switched to using V3.2 for their main API endpoint and have reduced API pricing by >50%. With DeepSeek’s updated first party API pricing, cost to run Artificial Analysis Intelligence Index falls from $114 to $41.

DeepSeek claims to have “deliberately aligned” the training configurations of V3.1 Terminus and V3.2 Exp. Matching V3.1 Terminus’ performance appears to demonstrate that the performance benefits of the DeepSeek Sparse Attention architecture do not come at a cost to intelligence.
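For readers unfamiliar with the "quadratic scaling" mentioned above, here is a minimal sketch comparing how many query-key pairs get scored under dense causal attention versus a generic sparse pattern in which each token attends to at most `k` earlier tokens. This is not DeepSeek's DSA implementation: the value of `K_SELECTED` is a hypothetical choice for illustration, and the cost of deciding which `k` tokens to keep is not modeled.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense causal attention.

    Token i (1-indexed) attends to itself and all i - 1 earlier tokens,
    so the total is 1 + 2 + ... + n, which grows quadratically in n.
    """
    return n_tokens * (n_tokens + 1) // 2


def topk_sparse_attention_pairs(n_tokens: int, k: int) -> int:
    """Query-key pairs scored when each token attends to at most k tokens.

    The first k tokens behave like dense attention; every later token
    scores exactly k pairs, so growth is linear in n once n exceeds k.
    """
    if n_tokens <= k:
        return dense_attention_pairs(n_tokens)
    return dense_attention_pairs(k) + (n_tokens - k) * k


K_SELECTED = 2048  # hypothetical number of tokens kept per query

for n in (8_192, 32_768, 131_072):
    dense = dense_attention_pairs(n)
    sparse = topk_sparse_attention_pairs(n, K_SELECTED)
    print(f"{n:>7} tokens: dense={dense:.3e}  sparse={sparse:.3e}  ratio={dense / sparse:.1f}x")
```

The gap between the two counts widens as context grows, which is the general reason a sparse scheme can make long-context inference cheaper.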

Key benchmarking takeaways:
➤ No change in aggregate intelligence: In reasoning mode, DeepSeek V3.2 Exp scores 57 on the Artificial Analysis Intelligence Index. We see this as equivalent in intelligence to DeepSeek V3.1 Terminus (Reasoning)
➤ No decline in long context reasoning: Despite DeepSeek’s architecture changes, V3.2 Exp (Reasoning) appears not to exhibit any decline in long context reasoning - scoring a slight uplift in AA-LCR.
➤ Non-reasoning performance: In non-reasoning mode, DeepSeek V3.2 Exp shows no degradation in intelligence, matching DeepSeek V3.1 Terminus with a score of 46 on the Artificial Analysis Intelligence Index
➤ Token efficiency: For DeepSeek V3.2 Exp (Reasoning), token usage to run the Artificial Analysis Intelligence Index decreases slightly from 67M to 62M compared to V3.1 Terminus. Token usage remains unchanged for the non-reasoning variant
➤ Pricing: DeepSeek has significantly reduced the per token pricing for their first-party API from $0.56/$1.68 to $0.28/$0.42 per 1M input/output tokens - a 50% and 75% reduction in pricing of input and output tokens respectively.
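The percentage reductions quoted above can be checked directly from the before and after prices. A small sketch follows; `percent_reduction` is a hypothetical helper, and the Index cost figures ($114 to $41) are taken from earlier in the same post.

```python
def percent_reduction(old: float, new: float) -> float:
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


# Per-1M-token prices quoted in the thread above (USD).
input_cut = percent_reduction(0.56, 0.28)
output_cut = percent_reduction(1.68, 0.42)

# Cost to run the Artificial Analysis Intelligence Index (USD).
index_cut = percent_reduction(114, 41)

print(f"Input price cut:  {input_cut:.0f}%")
print(f"Output price cut: {output_cut:.0f}%")
print(f"Index cost cut:   {index_cut:.0f}%")
```

The Index cost falls by more than the input price cut and less than the output price cut, as expected for a workload that mixes both token types with slightly lower token usage.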

Other model details:
➤ Licensing: DeepSeek V3.2 Exp is available under the MIT License
➤ Availability: DeepSeek V3.2 Exp is available via DeepSeek API, which has replaced DeepSeek V3.1 Terminus. Users can still access DeepSeek V3.1 Terminus via a temporary DeepSeek API until 15th October
➤ Size: DeepSeek V3.2 Exp has 671B total parameters and 37B active parameters. This is the same as all previous models in the DeepSeek V3 and R1 series

2/13
@ArtificialAnlys
DeepSeek V3.2 Exp is cheaper than DeepSeek V3.1 Terminus via DeepSeek first party API, due to a reduction in per token pricing.

3/13
@ArtificialAnlys
Compare how DeepSeek V3.2 Exp performs relative to models you are using or considering at: https://artificialanalysis.ai/models/deepseek-v3-2-reasoning

4/13
@soltraveler_sri
@grok does the quadratic cost of context length derive from the fact that every new token needs to be stored with a relationship to all previous tokens (if incorrect, please correct)?

If so, for DeepSeek to have “solved” the quadratic cost of growing context, would this necessitate that they’ve found a way to remove this need for every token to store relationships with all previous tokens… or are there other approaches they may have taken to break this cost curve?

5/13
@_junaidkhalid1
The pricing drop is the real game-changer here.

Cutting costs from $114 to $41 for something like the Intelligence Index makes high-quality inference accessible to way more teams, not just the big players.

But the bigger question is how this ripples out: will we see more experimentation with long-context tasks now that the economics make sense?

6/13
@mehedi_u
DeepSeek’s V3.2 Exp looks like a meaningful step forward. Matching V3.1 Terminus in intelligence while cutting API costs by over 50% is a huge win for scalability. The MIT license and smooth transition path make adoption even more attractive

7/13
@calhim7
Visiting your website without dark mode

8/13
@AliAhmed4253
DeepSeek has not met expectations; my hope is now solely on Qwen as the number one open-source model.

9/13
@_dr5w
Great optimization. Hope to see optimization on end to end response times soon (one of the slowest reasoning models out there)

10/13
@aiComingFast
i hope they give v4 soon

11/13
@ilmPakistam
This analysis is shit. No way that stupid gptoss 120 B high is above qwen 3 max and R1. Seriously??

12/13
@Hyperstackcloud
Huge milestone

And exciting to see this open sourced for the community

13/13
@Ryzex_Dreemurr
A reminder that it's an experimental model

ujol · Oct 3, 2025

1/23
@ArtificialAnlys
ServiceNow has released Apriel-v1.5-15B-Thinker, a 15B open weights reasoning model that leads our Small Models category (<40B parameters)

Overview: Apriel-v1.5-15B-Thinker is a dense, 15B parameter open weights reasoning model. This is not the first model ServiceNow has released, but it is a substantial jump in intelligence compared to past releases

Intelligence: The model scores 52 in the Artificial Analysis Intelligence Index. This puts it on par with DeepSeek R1 0528, which has a much larger 685B parameter architecture. ServiceNow’s model scores particularly well on important behaviors for enterprise agents, such as instruction following (62% in IFBench, ahead of gpt-oss-20B, reasoning) and multi-turn conversations & tool use (68% in 𝜏²-Bench Telecom, ahead of gpt-oss-120B, reasoning). This makes it particularly well-suited to agentic use cases, which was likely a focus given ServiceNow is active in the enterprise agents space

Output tokens and verbosity: The model produces a large number of output tokens even among reasoning models - using ~110M combined reasoning and answer tokens to complete the Artificial Analysis Intelligence Index

Access: No serverless inference providers are yet serving the model, but it is available now on Hugging Face for local inference or self-deployment. The model has been released under an MIT license, supporting unrestricted commercial use

Context window: The model has a native context window of 128k tokens.

Congratulations to @ServiceNowRSRCH on this impressive result!


2/23
@ArtificialAnlys
Apriel-v1.5-15B-Thinker is the new most intelligent open weights Small Model (<40B parameters)


3/23
@ArtificialAnlys
Individual benchmark results. All benchmarks have been run like-for-like across the models and independently


4/23
@ArtificialAnlys
The model produces a large number of output tokens even among other reasoning models - using ~110M combined reasoning and answer tokens to complete the Artificial Analysis Intelligence Index


5/23
@ArtificialAnlys
Link to

HuggingFace repo: https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker

Further analysis on Artificial Analysis: https://artificialanalysis.ai/models/apriel-v1-5-15b-thinker


6/23
@ndimasTech
@ollama you there? :)

7/23
@Gopinath876
I am just testing with q8; let's see how it goes

8/23
@JacksonAtkinsX
A 15B parameter model beating Gemini 2.5 flash, GLM 4.5, and Kimi K2 is impressive.

It tied Deepseek R1.

At that size you can now run a model that equals DeepSeek-R1 intelligence on a home computer.
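To make the "home computer" point concrete, here is a back-of-the-envelope sketch of the memory needed just to hold model weights at a few precisions. The bytes-per-parameter values are idealized assumptions (real quantization formats add some metadata overhead), and KV cache and activation memory are not counted; `weight_memory_gb` is a hypothetical helper.

```python
# Idealized storage cost per parameter; real quantized formats are slightly larger.
BYTES_PER_PARAM = {"bf16": 2.0, "q8": 1.0, "q4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate memory in GB (10^9 bytes) to store the weights alone."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


# Parameter counts quoted in the thread above.
MODELS = {
    "Apriel-v1.5-15B-Thinker": 15e9,
    "DeepSeek R1 0528": 685e9,
}

for name, n_params in MODELS.items():
    for precision in BYTES_PER_PARAM:
        print(f"{name} @ {precision}: ~{weight_memory_gb(n_params, precision):.1f} GB")
```

Under these assumptions the 15B model fits on a single consumer GPU or in ordinary system RAM at q8 or q4, while the 685B model does not at any of the listed precisions.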

9/23
@AI_PlanetX
Impressive release from ServiceNow! Deep intelligence in a compact model.

10/23
@_junaidkhalid1
The efficiency here is what stands out to me.

A 15B model punching at the level of something 40 times its size in enterprise tasks like instruction following and tool use suggests ServiceNow has cracked some real optimization nuts.

The question is how much of that comes from architecture versus the training data. Either way, it's a blueprint for what smaller teams can realistically deploy without massive infra.

11/23
@007musk
I don't like the industry direction of "small, but fast" models.

We need companies to develop the most capable models first.

Speed and energy can be optimized later.

12/23
@alkimiadev
Positives:
1. it is mit licensed.
2. it seems to do well with tool calling
3. it seems to obsessively follow its instructions

Negatives:
1. it seems to obsessively follow its instructions

I got many denials on really basic things


13/23
@altiamkabir
Exciting advancements! This could reshape enterprise AI.

14/23
@ArielFrischer
Fire whoever named this model and anyone that approved it.

15/23
@AliAhmed4253
Very strong score compared to the model size

16/23
@Hiswordxray
This is definitely an agentic coding model, a 15B coding beast

17/23
@aiComingFast
there are all good models nowadays

18/23
@leo_trd1
What about hallucinations?

19/23
@anirudhasah
omg what

20/23
@DougChampion
Whoa 15b at a 52? That’s nuts! This is big news!

21/23
@foedusxyz
Holy benchmaxx

22/23
@Hyperstackcloud
Impressive release, super impactful for enterprise AI adoption

23/23
@VibeCodeTeddy
Impressive jump in capability. 15B parameters showing serious potential in enterprise use cases. Just need those inference providers to catch up.

ujol · Oct 3, 2025

1/9
@ArtificialAnlys
Luma Labs' Ray 3 ranks #2 in Text to Video in the Artificial Analysis Video Arena, trailing only Google's Veo 3!

@LumaLabsAI has launched Ray 3, a major upgrade to their Ray 2 model, coming in at #2 in Text to Video and #7 in Image to Video on the Artificial Analysis Video Leaderboards.

Using a chain-of-thought approach, Ray 3 iterates on video generations, analyzing them at each step to ensure quality and prompt adherence. The model supports both T2V and I2V generations with up to 10-second videos at 1080p resolution.

Beyond typical SDR generations, Ray 3 introduces 16-bit HDR support - the first in the industry. This allows HDR video generation from SDR images and even conversion of existing SDR videos to HDR.

Currently, Ray 3 is only available on Luma Dream Machine, with API access not yet available.

See Ray 3 for yourself in the Artificial Analysis Video Arena

2/9
@ArtificialAnlys
[Prompt 1/5] A cinematic view of a dragon soaring over a medieval castle at sunset. Its wings flap powerfully as it breathes fire, with the glowing embers lighting up the sky as knights below prepare for battle.

https://video.twimg.com/amplify_video/1973077392186023938/vid/avc1/1920x1080/RJha-oL5U7K-A-Sw.mp4

3/9
@ArtificialAnlys
[Prompt 2/5] A drone swarm forms the words "Artificial Analysis" against a night sky. Each drone is a point of light, creating a constellation effect. The formation morphs smoothly between different fonts before settling on bold block letters. City lights twinkle in the background.

https://video.twimg.com/amplify_video/1973077434414350336/vid/avc1/1920x1080/TOscrfypi6LPD9Ot.mp4

4/9
@ArtificialAnlys
[Prompt 3/5] A kangaroo bounces past bioluminescent glow worms illuminating a dark cave, their blue light reflecting off the water.

https://video.twimg.com/amplify_video/1973077466056105985/vid/avc1/1920x1080/wTqTvUmXwUY324vT.mp4

5/9
@ArtificialAnlys
[Prompt 4/5] A retro, 70s Urban Grit style of a walk through Times Square, New York City, with neon signs flashing, taxis honking, and people crowding the sidewalks. The gritty, muted tones give the scene an edgy feel, capturing the raw energy of the city.

https://video.twimg.com/amplify_video/1973077500273315840/vid/avc1/1920x1080/38qN1aluepNJ47Bi.mp4

6/9
@ArtificialAnlys
[Prompt 5/5] An astronaut floats weightlessly through a dimly lit, abandoned space station. Beams of light from their helmet illuminate dust particles and forgotten objects drifting in the zero-gravity environment. Suddenly, a shadow moves in the background.

https://video.twimg.com/amplify_video/1973077548688125952/vid/avc1/1920x1080/AL5BU8rsR9DLKsW-.mp4

7/9
@ArtificialAnlys
See Ray 3 for yourself in the Artificial Analysis Video Arena! https://artificialanalysis.ai/text-to-video/arena

8/9
@EdDiberd
Excited to see Sora 2 on this list

9/9
@calhim7
Every time I visit http://artificialanalysis.ai

ujol · Oct 3, 2025

1/23
@ArtificialAnlys
Alibaba’s updated Qwen3 Max is the most intelligent non-reasoning model, placing ahead of Kimi K2 0905!

Key takeaways:
➤ Intelligence uplift: Intelligence increased by +6 points to 55 in our Artificial Analysis Intelligence Index. Qwen3 Max is currently the most intelligent non-reasoning model. The previous most intelligent non-reasoning model, Kimi K2 0905, scored 50 on the index
➤ GA: Alibaba’s upgraded Qwen3 Max model is now in GA; the prior version was in Preview
➤ Broader capability improvements: Improvements across agentic tool use (𝜏²-Bench Telecom scores increased from 33% to 74%), coding (LiveCodeBench from 65% to 77%), and long context reasoning (AA-LCR from 40% to 47%).
➤ Higher token usage: Running the Artificial Analysis Intelligence Index required ~21M output tokens, ~7M more than Qwen3 Max (Preview). This continues the trend of non-reasoning models becoming more verbose, though there remains a distinction with Qwen3 Max still significantly below reasoning models

Key model information:
➤ Reasoning: Qwen3 Max is a non-reasoning model. Alibaba has indicated that a reasoning version, Qwen3-Max-Thinking, is under active training.
➤ Proprietary: Like the Preview version, Qwen3 Max is proprietary, since Alibaba has not released the weights.
➤ Context window: The model supports a 256k-token context window.
➤ Multimodality: Qwen3 Max is text-only, with no multimodal inputs or outputs.
➤ Pricing: The model is priced at $1.2/$6 per 1M input/output tokens

Qwen3 Max is currently available in Qwen Chat and via Alibaba Cloud.

2/23
@ArtificialAnlys
Qwen3 Max is very well positioned when considering its intelligence relative to output token efficiency, especially compared to reasoning models. However, the model does continue the trend of non-reasoning models becoming more verbose, and it occupies a middle ground between more ‘classic’ non-reasoning models and reasoning models

3/23
@ArtificialAnlys
Compare Qwen3 Max to models you are using or considering at: https://artificialanalysis.ai/models/qwen3-max

4/23
@reallygodwin
Grok 4 fast is $0.50 (currently free on openrouter).

Literally no point in using any other model.

5/23
@oscarle_x
Pricing of $1.2/$6 is quite close to price of frontier models of OpenAI and GG. It is hard for them to get international customers with that price.

Lowering to around $3-4 will be a lot more attractive.

6/23
@JacksonAtkinsX
I don’t see a market fit here at 6 dollars.

Grok-4-Fast and the latest Sept release of Gemini 2.5 flash are much cheaper (0.50 and 2.50). They are smarter and have 1M+ context windows.

Plus both have token per second output about 10x faster than this model.

7/23
@KuittinenPetri
Alibaba Qwen3-Max is definitely a strong model. Its multi-lingual support is among the best ones and it knows a lot of niche knowledge. Its main weakness is creative writing. I would rank several models, including Kimi K2, above Qwen3-Max when it comes to writing.

8/23
@ovsky4
You should start using qwen logo on those charts

9/23
@zhanghao869179
What about QWEN3 VL?

10/23
@mehedi_u
Impressive progress from Alibaba—Qwen3 Max jumping +6 points on the Artificial Analysis Intelligence Index and overtaking Kimi K2 0905 is a clear signal of rapid iteration. The leap in agentic tool use and coding performance is especially noteworthy

11/23
@HekmetSeve90200
That's impressive! Reminds me of @jordanMaxwell37's breakdown of how fast these AI models are evolving. It's hard to keep up.

12/23
@KevTei
Qwen3 Max thinking will be in the top 3 podium

13/23
@nummanthinks
Really excited to see where the reasoning model will land on the chart

Also, what the F is Claude 4.5!?

Haven't we all been waiting too long!

14/23
@ethan_tan
What about Grok 4 Fast (non reasoning)?

15/23
@_shanytc
I'll be the judge on that!

16/23
@_junaidkhalid1
The jump in agentic tool use from 33% to 74% on that telecom bench is impressive, but it raises a question about real-world deployment.

Non-reasoning models like this are getting better at following instructions, yet they still lack the introspection to adapt when tools fail unexpectedly.

Teams building on Qwen3 Max will need robust fallback layers to bridge that gap.

17/23
@thegenioo
the chart is looking a bit different to me did you guys change it

18/23
@suhrabautomates
Qwen3 Max’s upgrade shows non-reasoning models can still push intelligence benchmarks. Agentic tool use and coding gains are especially impressive given it’s text-only.

19/23
@JudithS13111
Incredible leap forward!

20/23
@LSKMSun
@OpenAI opensi is dying ... game over no matter how much you lie

21/23
@9arkalc
- No open weights
- Only 256K context window
- Too many unnecessary emojis
- Unstable latency
- Censorship based on Chinese regulation

I see no reason to use this model; Grok 4 Fast does better at a lower price

22/23
@MindColliers
alibaba's qwen3 max claiming the title of most intelligent non-reasoning model is intriguing. a 6-point uplift is impressive but let’s see if it translates to real-world applications. higher token usage and verbosity are trends to watch, but true reasoning is where the real challenge lies.

23/23
@vmontalvan23
WTF, don't lie. Max is at a far higher level than Sonnet 4. Why do you always make it look small? Hahaha

ujol · Oct 3, 2025

1/37
@ArtificialAnlys
Anthropic’s new Claude 4.5 Sonnet is now the #4 most intelligent model, beats 4.1 Opus, and places Anthropic in the top 3 in the race for frontier intelligence

Claude 4.5 Sonnet offers a clear upgrade for Claude 4.1 Opus and Claude 4 Sonnet users, with greater intelligence at the same price and token efficiency as Claude 4 Sonnet. Claude 4.5 Sonnet’s token efficiency, even in its maximum reasoning mode, makes it cheaper to use for many tasks than GPT-5, Grok 4 or Gemini 2.5 Pro.

Key benchmarking takeaways:
➤ Anthropic’s most intelligent model: In reasoning mode, Claude 4.5 Sonnet scores 61 on the Artificial Analysis Intelligence Index. This is a jump of +4 points from Claude 4 Sonnet (Thinking) which was released in May 2025, and +2 points from Claude 4.1 Opus (Thinking). Claude 4.5 Sonnet (Thinking) now places ahead of Gemini 2.5 Pro (60) and Grok 4 Fast (60), but behind GPT-5 (high, 68) and Grok 4 (65).
➤ Largest increases: we see the biggest uplifts in individual evaluation scores in 𝜏²-Bench Telecom (+13 p.p.), Humanity's Last Exam (+14 p.p.) and Humanity's Last Exam (+7 p.p.). Claude 4.5 Sonnet achieves Anthropic’s best score yet on TerminalBench-Hard, but only gains +1 p.p. compared to Claude 4.1 Opus and remains behind Grok 4 and GPT-5 Codex (High). Interestingly, Claude 4.5 Sonnet does not yet achieve the highest score in any individual evaluation across the 10 evaluations in the Artificial Analysis Intelligence Index.
➤ Non-reasoning performance: In non-reasoning mode, Claude 4.5 Sonnet jumped from 44 to 49 on the Artificial Analysis Intelligence Index. We see the largest improvement in Agentic Tool Use (increase in 𝜏²-Bench Telecom score from 52% to 71%) with smaller improvements across other evals.
➤ Token efficiency: Anthropic have increased Claude’s evaluation scores without increasing output token usage and the Claude models continue to be more token efficient than all other reasoning models. For Claude 4.5 Sonnet (Thinking) - evaluated with a maximum reasoning budget of 64k tokens - we see a slight decrease in token usage to run the Artificial Analysis Intelligence Index from 43M to 42M, compared to Claude 4 Sonnet. This is different to other model upgrades we have seen, where an increase in intelligence is often correlated with an increase in output token usage.
➤ Pricing: Claude 4.5 Sonnet is priced the same as Claude 4 Sonnet at $3/$15 per 1M input/output tokens. This represents a more compelling option, compared to Claude 4.1 Opus, offering higher intelligence in thinking mode at 1/5th the blended price (3:1 input to output token ratio).
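
The "1/5th the blended price" figure can be checked with a small calculation. The sketch below is only an illustration: the helper name `blended_price` is made up here, and the Claude 4.1 Opus price of $15/$75 per 1M input/output tokens is an assumption taken from Anthropic's public price list, not from the post above.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Weighted-average USD price per 1M tokens for a given input:output ratio."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Claude 4.5 Sonnet: $3/$15 per 1M input/output tokens (stated in the post).
sonnet_45 = blended_price(3.0, 15.0)

# Claude 4.1 Opus: $15/$75 per 1M tokens (ASSUMPTION, not stated in the post).
opus_41 = blended_price(15.0, 75.0)

ratio = sonnet_45 / opus_41
print(f"Sonnet 4.5 blended: ${sonnet_45:.2f} per 1M tokens")
print(f"Opus 4.1 blended:   ${opus_41:.2f} per 1M tokens")
print(f"Sonnet / Opus:      {ratio:.2f}")
```

Under those assumptions the blended prices come out at $6 and $30 per 1M tokens, which matches the one-fifth ratio quoted in the post.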

Key model details:
➤ Context window: 200K tokens
➤ Max output tokens: 64K tokens
➤ Availability: Claude 4.5 Sonnet is available via Anthropic’s API, Google Vertex and Amazon Bedrock. Claude 4.5 Sonnet is also available via Claude, and Claude Code (v2 of which has also been released today)

2/37
@ArtificialAnlys
A key differentiator for the Claude models remains that they are substantially more token efficient than all other reasoning models. Claude 4.5 Sonnet further increased in intelligence without increasing in output tokens used, differing substantially from other model families which have used greater reasoning at inference time (more output tokens) to achieve greater intelligence

3/37
@ArtificialAnlys
This output token efficiency contributes to Claude 4.5 Sonnet (in Thinking mode) offering a better tradeoff between intelligence and Cost to Run Artificial Analysis Intelligence Index than Gemini 2.5 Pro and Claude 4.1 Opus (Thinking)

4/37
@ArtificialAnlys
Individual results across all benchmarks in our Artificial Analysis Intelligence Index. We have run all these benchmarks independently and like-for-like across all models

5/37
@ArtificialAnlys
Compare Claude 4.5 Sonnet to other models on Artificial Analysis:
https://artificialanalysis.ai/models/claude-4-5-sonnet-thinking

6/37
@newsaturnstar
Does it pass the vibe test though?

7/37
@YifanBTH
given how close the intelligence level is for the top models, we should focus on the intelligence per dollar.

kinda crazy how far grok 4 fast still leads there

8/37
@SurKopu
Smarter and cheaper per task is where adoption really accelerates.

9/37
@cloutman_
the people wanna know GPT-5 Pro's score pls :')

10/37
@emrahdma
Grok 4 is #3?

11/37
@Valuable
These benchmarks are too easy to game imo

I use various models extensively and imo it’s not even close, Grok is 35% ahead of everyone, at a minimum.

12/37
@avaausmamdani
Add glm 4.6

13/37
@MrMSpencer
Slightly insane it's getting beaten by grok 4.

14/37
@sealcoin_ai
It's exciting to see advancements like Claude 4.5 Sonnet in AI. As we push forward, integrating AI with quantum tech and secure satellite connectivity can pave the way for safer, autonomous transactions. It’s all about making systems smarter and safer.

15/37
@_junaidkhalid1
It's interesting that Claude 4.5 Sonnet doesn't top any single eval, yet it climbs the overall index.

This highlights how frontier intelligence is becoming a systems game, balancing breadth over isolated peaks, rather than chasing one-off wins.

16/37
@mehedi_u
Claude 4.5 Sonnet’s leap in reasoning scores, token efficiency, and pricing makes this a significant milestone for Anthropic. Beating Opus while holding the same cost structure positions it as a highly competitive option in enterprise AI. The frontier model race just got tighter.

17/37
@kisalay_Cool95
The AI leaderboard is starting to look like a sprint, not a marathon. Every release redraws the map. Claude 4.5 Sonnet’s rise shows how quickly momentum can shift. Today’s fourth place could be tomorrow’s first if this pace continues.

18/37
@kisalay_Cool95
It’s crazy how fast the rankings evolve. One release and the landscape tilts. Claude 4.5 Sonnet climbing into the top 4 isn’t just impressive, it’s a statement. Anthropic isn’t playing catch-up anymore. They’re in the race to lead.

19/37
@Gdgtify
good for Anthropic but still amazes me how fast Grok has risen and the next version probably will be #1 comfortably.

20/37
@kisalay_Cool95
Look at that chart carefully. A few months back, Claude wasn’t even close to this level. Now they’ve leapfrogged models people thought were untouchable. This isn’t luck. It’s relentless iteration. And in AI, iteration speed is everything.

21/37
@jannotjohnn
Grok 4 is still there and it's getting better every day

22/37
@kisalay_Cool95
The race for intelligence is shifting faster than anyone expected. Just months ago, the leaderboard felt stable. Now Claude 4.5 Sonnet jumps to #4, overtakes 4.1 Opus, and suddenly Anthropic is sitting in the top tier. Each upgrade isn’t just a model bump. It’s a quiet power move in a race where every point changes perception.

23/37
@BHFinanceHub
That's insane! Claude 4.5 Sonnet barely tops Grok 4 Fast by 1 index point, costs 15x more ($3 vs $0.20/M input tokens), AND Grok's blazing fast, with no 1-2 min waits like Claude. My productivity's skyrocketed using Grok 4 Fast. Why pay for slow? #AI

24/37
@aveer30
The most useless benchmark, which has gpt oss at 58

25/37
@chiboy4PF
I have a feeling this is just the beginning for @claudeai

26/37
@suhrabautomates
Claude 4.5 Sonnet combines top-tier reasoning with token efficiency, a strong upgrade that places Anthropic firmly in the frontier intelligence race.

27/37
@_AustinO1
The fact that Grok is so high means something is fundamentally broken with at least some of the benchmarks. In production it is atrocious.

28/37
@onlylastlaugh
Do deepseek V3.2

29/37
@andrewbaisden
Cool benchmarks, keen to know how the model performs in real world tests too. Cooking.

30/37
@DaniAcostaAI
Interesting. Why do they say it is the best coding model, then?

31/37
@amorenew
Where is the DeepSeek v3.2 analysis?

32/37
@kipa2ski
No way Grok is better than Anthropic

33/37
@02_soham
Do these benchmarks actually mean anything?
@grok can you summarise all the llms benchmark scores divided by api costs and latency and tell me which is the best bet for coding tasks

34/37
@afridi_aka1
Exciting progress in AI development! Claude 4.5 Sonnet climbing to #4 shows how competitive this space is becoming. Looking forward to seeing more benchmarks and real-world applications!

35/37
@Tom00561419
GLM4.6 is more powerful and cheaper.

36/37
@Hyperstackcloud
Exciting to see Anthropic pushing the benchmarks forward

37/37
@MinChonChiSF
Interesting to see Claude 4.5 Sonnet jump in the rankings.

ujol · Oct 3, 2025

1/24
@ArtificialAnlys
Google shared pre-release access for the new Gemini 2.5 Flash & Flash-Lite Preview 09-2025 models. We’ve independently benchmarked gains in intelligence (particularly for Flash-Lite), output speed and token efficiency compared to predecessors

Key takeaways from our intelligence and performance benchmarking:
➤ Gemini 2.5 Flash Preview 09-2025 scores 54 in reasoning mode on the Artificial Analysis Intelligence Index, and 47 in the non-reasoning mode, representing a 3 point and 8 point jump respectively compared to Gemini 2.5 Flash released in May 2025.
➤ Gemini 2.5 Flash-Lite Preview 09-2025 scores 48 in reasoning mode on the Artificial Analysis Intelligence Index, representing an 8 point uplift compared to Gemini 2.5 Flash-Lite (Reasoning) released in June 2025. In non-reasoning, Gemini 2.5 Flash-Lite Preview 09-2025 scores 42, a 12 point uplift compared to the July version.
➤ In reasoning mode, Gemini 2.5 Flash and Flash-Lite Preview 09-2025 are more token-efficient, using fewer output tokens than their predecessors to run the Artificial Analysis Intelligence Index. Gemini 2.5 Flash-Lite Preview 09-2025 uses 50% fewer output tokens than its predecessor, while Gemini 2.5 Flash Preview 09-2025 uses 24% fewer output tokens.
➤ Google Gemini 2.5 Flash-Lite Preview 09-2025 (Reasoning) is ~40% faster than the prior July release, delivering ~887 output tokens/s on Google AI Studio in our API endpoint performance benchmarking. This makes the new Gemini 2.5 Flash-Lite the fastest proprietary model we have benchmarked on the Artificial Analysis website.
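
To put the speed claim in concrete terms, here is a rough sketch. It assumes "~40% faster" means output speed is 1.4x the prior release (the post does not spell this out), and the 10,000-token response length is a hypothetical example, not a benchmark figure.

```python
NEW_SPEED_TPS = 887.0   # ~887 output tokens/s, as reported in the post
SPEEDUP = 0.40          # ASSUMPTION: "~40% faster" read as 1.4x output speed

# Implied output speed of the prior release under that reading.
prior_speed_tps = NEW_SPEED_TPS / (1 + SPEEDUP)


def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` output tokens, ignoring time to first token."""
    return tokens / tokens_per_second


RESPONSE_TOKENS = 10_000  # HYPOTHETICAL response length for illustration

new_seconds = seconds_to_generate(RESPONSE_TOKENS, NEW_SPEED_TPS)
old_seconds = seconds_to_generate(RESPONSE_TOKENS, prior_speed_tps)

print(f"Implied prior speed: {prior_speed_tps:.0f} tokens/s")
print(f"10k tokens: {old_seconds:.1f}s before vs {new_seconds:.1f}s now")
```

Under this reading the prior release would have run at roughly 630 tokens/s. The figures are approximate, since both inputs are rounded in the post.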

Key model information:
➤ Hybrid reasoning/non-reasoning modes with variable thinking budget
➤ Tool support (e.g. Google Search, code execution)
➤ 1M token context window
➤ Multimodal input (text, audio, image and video) and text only output
➤ Gemini 2.5 Flash-Lite 09-2025 is priced at $0.1/$0.4 per 1M input/output tokens and Gemini 2.5 Flash 09-2025 is priced at $0.3/$2.5 per 1M input/output tokens

2/24
@ArtificialAnlys
The new Gemini 2.5 Flash models are cheaper to run as they are available at the same per token pricing as their predecessors, but use fewer output tokens. The exception is Gemini 2.5 Flash Preview 09-2025 in non-reasoning mode, which uses more output tokens than its predecessor.
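
Because per-token pricing is unchanged, the output-token saving in reasoning mode translates directly into a lower output cost. The sketch below illustrates this with the prices and reductions quoted in the thread; the 100M-token baseline is a hypothetical workload chosen for round numbers, not a figure from Artificial Analysis.

```python
# Output-token prices in USD per 1M tokens, as quoted in the thread.
OUTPUT_PRICE = {"flash": 2.5, "flash_lite": 0.4}

# Reported reduction in output tokens (reasoning mode) versus the predecessor.
TOKEN_REDUCTION = {"flash": 0.24, "flash_lite": 0.50}

# HYPOTHETICAL baseline: output tokens the predecessor needs for some workload.
BASELINE_OUTPUT_TOKENS = 100_000_000


def output_cost(tokens: float, price_per_million: float) -> float:
    """USD cost of `tokens` output tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_million


def cost_before_after(model: str) -> tuple[float, float]:
    """Output cost for the predecessor and for the 09-2025 preview."""
    price = OUTPUT_PRICE[model]
    new_tokens = BASELINE_OUTPUT_TOKENS * (1 - TOKEN_REDUCTION[model])
    return output_cost(BASELINE_OUTPUT_TOKENS, price), output_cost(new_tokens, price)


for name in OUTPUT_PRICE:
    before, after = cost_before_after(name)
    print(f"{name}: ${before:.2f} -> ${after:.2f} output cost")
```

Input-token cost is left out on purpose: it depends on the prompts, which do not change between model versions.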

3/24
@ArtificialAnlys
Google 2.5 Flash and Flash-Lite offer some of the fastest output speeds, compared to models of equivalent intelligence

4/24
@ArtificialAnlys
Compare the new Google Gemini 2.5 Flash & Flash-Lite Preview 09-2025 models to others you are using or considering at: https://artificialanalysis.ai/models/gemini-2-5-flash-preview-09-2025-reasoning

5/24
@gilhildebrand
These are some of my absolute favorite models

6/24
@JacksonAtkinsX
Gemini-2.5-flash-lite at ~900 tokens per second. Wow.

7/24
@qji1249634
What is the test result of Qwen3-MAX

8/24
@soltraveler_sri
@grok how does grok 4 fast compare with the latest Gemini 2.5 Flash 09-25 on cost per intelligence?

please measure both cost per intelligence index score… and also cost per HLE score

ensure you’re comparing the reasoning versions of each model

9/24
@MKulria
So there is no gemini 3?

10/24
@valn1x
shittest benchmark known to man

11/24
@bueno_gmb
do you also publish/share score volatility?
i assume you run the benchmark many times per model, and take average (or median or smth).

for many of my use cases that info (low vol) is very important (note: this is diff than temp=0, as the input itself is variable)

12/24
@aimlapi
8 point reasoning boost, 12 point non-reasoning, that's massive for a lite model. Google's making their small models scary smart

13/24
@Indy_triguy
What use cases are people using it for? Looks like just a good benchmark improvement that doesn’t matter? It needs to be at 55 or higher and have great instruction following to be useful for the cases I see. And past Flash versions did not follow instructions, so they were only good for simple questions that you wouldn’t really take the time to switch the model selector for, or build an agent around. But there must be something people use it for?

14/24
@Reddy2399
Sounds like Google's finally figured out how to make a computer that doesn't fall asleep on me during a decent YouTube video

15/24
@PhilFrancoAI
Flash and Flash-Lite getting smarter and faster is huge.

16/24
@alejandroatall
@grok why aren't other AIs, such as Copilot and Perplexity, in this analysis?

17/24
@jehrjd45963
the benchmarks are so math/coding heavy. y’all need more benchmarks for real world talk

18/24
@harriettsolid
How is gpt oss 120b better than deepseek?

19/24
@hektor_wav
@grok summarize in Turkish

20/24
@HarshithLucky3
Gemini 2.5 flash lite
~900 tokens/sec

21/24
@anonymo04353998
@grok is this promising for Gemini 3 ?

22/24
@suhrabautomates
The uplift in reasoning + massive token efficiency gains make Gemini 2.5 Flash-Lite a serious contender. The fastest proprietary model benchmarked so far is a milestone worth noting!

23/24
@matthew_me12974
Sounds promising!

How much faster is it, though?

24/24
@PMC_Ponasenkov
Can you change the parameter count to the GB that models consume? P count does not mean anything when you're looking for the best OS model

ujol · Oct 3, 2025

1/12
@ArtificialAnlys
HunyuanImage 2.1 is the new leading open weights text to image model from @TencentHunyuan , surpassing HiDream-I1-Dev and Qwen-Image in the Artificial Analysis Image Arena!

HunyuanImage 2.1 is the latest release from Tencent - a 17B DiT text-to-image model natively supporting 2048x2048 outputs. The model features bilingual support (Chinese/English) and delivers notably improved text generation capabilities compared to HunyuanImage 2.0

While the model is open weights, it is released under the "Tencent Community License", which prohibits use in products or services that exceed 100 million monthly active users, bars use in the EU, the UK, or South Korea, and disallows using outputs to improve non-Hunyuan models.

HunyuanImage 2.1 is currently available on Hunyuan AI Studio in Mainland China and is preparing for launch on Tencent Cloud. Internationally, it is currently available as a @huggingface demo and on @fal, priced at $100 per 1k images.

See below for comparisons between HunyuanImage 2.1 and other leading open weights models in our arena

2/12
@ArtificialAnlys
[Prompt 1/5] A realistic portrait of a surfer stepping out of the ocean, droplets of saltwater trailing down his wetsuit. The setting sun paints the sky and waves in warm orange hues.

3/12
@ArtificialAnlys
[Prompt 2/5] A mid-century modern furniture catalog cover, digitally rendered: sleek chairs, geometric patterns, and muted pastel hues in a clean layout.

4/12
@ArtificialAnlys
[Prompt 3/5] A massive colony ship launching from a futuristic Earth spaceport. The rocket dwarfs the surrounding buildings. Its reflection is visible in a nearby body of water. Multiple stages ignite simultaneously, creating a spectacular light show.

5/12
@ArtificialAnlys
[Prompt 4/5] A group of diverse astronauts planting their country's flags on Mars, with Earth visible in the starry sky and a futuristic base in the background.

6/12
@ArtificialAnlys
[Prompt 5/5] A fantasy anime setting featuring a young sorcerer in dark robes standing before a glowing blue portal that swirls with magical energy. In one hand, he holds an ancient staff adorned with runes, and in the other, a mysterious glowing artifact.

7/12
@ArtificialAnlys
See HunyuanImage 2.1 for yourself on the Artificial Analysis Image Arena: https://artificialanalysis.ai/text-to-image/arena

8/12
@GeoffKwitko
Love your work

9/12
@ComoFeosseh
I'll just say wow

10/12
@BlackCorwin
Nice but 17B is a bit too big. Would rather have EMMA finished.. :/

11/12
@VibeCodeTeddy
Interesting model specs, but those licensing restrictions are a big drawback. Limits the potential user base a lot.

12/12
@audrey_har38284
That's an interesting perspective!

What specific features do you think set HunyuanImage 2.1 apart? I'd love to hear your thoughts!

ujol · Oct 3, 2025

1/8
@ArtificialAnlys
MoonValley has released Marey, a new generative video model trained exclusively on licensed HD footage. The model ranks #12 in our Text to Video Arena Leaderboard and is offered at a premium price point

Marey is MoonValley's first generative video model and supports both Text to Video and Image to Video generations. Marey is trained on exclusively licensed footage, with no scraped or even user submitted content.

Marey ranks #12 in Text to Video and #21 in Image to Video in the Artificial Analysis Video Arena, placing it alongside models like Seedance 1.0 Mini and PixVerse V4.5 for Text to Video, and Vidu Q1 and Leonardo AI's Motion 2.0 for Image to Video.

The model is available on @Moonvalley's website with plans starting at $14.99/month for 10 5-second generations (~$18/minute), and on @fal at the same $18/minute rate.

This price is more expensive than every other video model except Veo 3 No Audio ($30/minute), and likely reflects the high cost of using explicitly licensed training content. Models of similar quality are cheaper, such as LTXV v0.9.7 which costs just $1.20 per minute.
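
The per-minute figures above follow from simple arithmetic. Here is a minimal sketch using only the prices quoted in the post; the function name `price_per_minute` is made up for illustration.

```python
def price_per_minute(plan_price: float, generations: int, seconds_each: float) -> float:
    """USD per minute of generated video for a plan with a fixed clip allowance."""
    total_seconds = generations * seconds_each
    return plan_price / total_seconds * 60


# Marey entry plan: $14.99/month for 10 generations of 5 seconds each.
marey = price_per_minute(14.99, 10, 5)

# Comparison prices per minute, as quoted in the post.
VEO_3_NO_AUDIO = 30.00
LTXV_0_9_7 = 1.20

print(f"Marey: ${marey:.2f}/minute")
print(f"Marey vs LTXV v0.9.7: {marey / LTXV_0_9_7:.1f}x")
print(f"Marey vs Veo 3 No Audio: {marey / VEO_3_NO_AUDIO:.2f}x")
```

The entry plan works out to just under $18 per minute, roughly 15 times the quoted LTXV v0.9.7 rate and about 60% of the Veo 3 No Audio rate.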

See below for comparisons between Marey and other leading models in our Video Arena

2/8
@ArtificialAnlys
[Prompt 1/5] A cinematic aerial shot of a sleek bullet train racing through the countryside at high speed, the green fields and distant mountains blurred by its speed.

https://video.twimg.com/amplify_video/1967694124867612672/vid/avc1/1920x1080/2Yw3nMbrw6H59HJl.mp4

3/8
@ArtificialAnlys
[Prompt 2/5] A cross-section view of a volcano before, during, and after eruption. Magma chambers fill and pressure builds, shown by color intensity. The eruption begins with tremors, then a massive explosion. Lava flows down the sides, reshaping the landscape. Gradually, the volcano quiets, and new life begins to grow on the cooled lava.

https://video.twimg.com/amplify_video/1967694180010065920/vid/avc1/1920x1080/TbcKqZss7Kls62NJ.mp4

4/8
@ArtificialAnlys
[Prompt 3/5] A drone swarm forms the words "Artificial Analysis" against a night sky. Each drone is a point of light, creating a constellation effect. The formation morphs smoothly between different fonts before settling on bold block letters. City lights twinkle in the background.

https://video.twimg.com/amplify_video/1967694196523012096/vid/avc1/1920x1080/1tCNVCLdYOFOtv2l.mp4

5/8
@ArtificialAnlys
[Prompt 4/5] An aerial shot tracks a sleek red sports car as it soars between Miami's gleaming skyscrapers. Palm trees sway below while the setting sun glints off both the car's chrome and the city's glass facades.

https://video.twimg.com/amplify_video/1967694197034770432/vid/avc1/1920x1080/AmoHaSEaXQZq2EkA.mp4

6/8
@ArtificialAnlys
[Prompt 5/5] Lunar Skiing Adventure: An astronaut ski-jumps off a lunar crater rim, Earth hanging in the starry sky. Their specialized skis leave graceful dust trails in the low gravity. They glide past other space-suited figures enjoying the lunar resort, complete with dome habitats and shuttle terminals.

https://video.twimg.com/amplify_video/1967694219990257665/vid/avc1/1920x1080/ss2YIjWCYIZ_qMRI.mp4

7/8
@ArtificialAnlys
See Marey in the Artificial Analysis Video Arena for yourself: https://artificialanalysis.ai/text-to-video/arena

8/8
@jbettinaldi
This is a really interesting move by MoonValley. I appreciate the commitment to using exclusively licensed HD footage for training Marey. It's a crucial step for commercial use and respecting creators' rights. The price point is definitely on the higher side compared to competitors of similar quality, but I understand that licensing costs are a major factor. I'm curious to see how the market responds and if the quality and ethical training data will justify the premium for creators like me. Looking forward to seeing the comparison videos!!

ujol · Oct 4, 2025

[Books & Research] Closed Frontier vs Local Models

Posted on Sat Oct 4 19:35:21 2025 UTC

https://i.redd.it/6f8mtyf0d5tf1.jpeg

"A ton of attention over the years goes to plots comparing open to closed models.
The real trend that matters for AI impacts on society is the gap between closed frontier models and local consumer models.
Local models passing major milestones will have major repercussions." Nathan Lambert on X (@natolambert)

konceptjones · Oct 4, 2025

I uninstalled Copilot and shut off Gemini.

Fuck AI.

Freeman · Oct 4, 2025

konceptjones said:
I uninstalled Copilot and shut off Gemini.

Fuck AI.

dasmooth1 · Oct 5, 2025

Might be the most important thread on the forum. Hate it or love it, AI is here to dominate.

ujol · Oct 5, 2025

Polish scientists' startup Pathway announces AI reasoning breakthrough

Posted on Sun Oct 5 17:50:07 2025 UTC

www.polskieradio.pl

This startup has continuous backing from Lukasz Kaiser, co-inventor of the Transformer architecture.

Link to the paper introducing BDH (Baby Dragon Hatchling), a post-transformer reasoning architecture which purportedly opens the door to generalization over time, in real-time (continual learning?)

https://arxiv.org/abs/2509.26507

This is potentially huge?

From the article:

"Solving the "generalization over time" problem is among the "holy grails" of the AI world - a goal numerous top scientists around the world have unsuccessfully strived to reach for some time now. The new, groundbreaking AI architecture created by Poland's Pathway startup seems to have done just that - creating a digital structure similar to the neural network functioning in the brain, and allowing AI to learn and reason like a human."

RandomOne · Oct 5, 2025

How does a professor/teacher combat this in coding classes?

I would have used these like crazy in those computer science classes lol

Goldie · Oct 6, 2025

joelb · Oct 6, 2025

konceptjones said:
I uninstalled Copilot and shut off Gemini.

Fuck AI.

Gemini is trash. Claude code (sonnet 4.5) + Copilot (GPT 5 Codex) is what u need.
