
What just happened

1 Comment

The last month has transformed the state of AI, with the pace picking up dramatically in just the last week. AI labs have unleashed a flood of new products - some revolutionary, others incremental - making it hard for anyone to keep up. Several of these changes are, I believe, genuine breakthroughs that will reshape AI's (and maybe our) future. Here is where we now stand:

Smart AIs are now everywhere

At the end of last year, there was only one publicly available GPT-4/Gen2 class model, and that was GPT-4. Now there are between six and ten such models, and some of them are open weights, which means they are free for anyone to use or modify. From the US we have OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5, the open Llama 3.2 from Meta, Elon Musk’s Grok 2, and Amazon’s new Nova. Chinese companies have released three open multilingual models that appear to have GPT-4 class performance, notably Alibaba’s Qwen, DeepSeek’s R1, and 01.ai’s Yi. Europe has a lone entrant in the space, France’s Mistral. What this word salad of confusing names means is that building capable AIs did not depend on some magical formula only OpenAI had; it was within reach of any company with computer science talent and the ability to get the chips and power needed to train a model.

In fact, GPT-4 level artificial intelligence, so startling when it was released that it led to considerable anxiety about the future, can now be run on my home computer. Meta’s newest small model, released this month, named Llama 3.3, offers similar performance and can operate entirely offline on my gaming PC. And the new, tiny Phi 4 from Microsoft is GPT-4 level and can almost run on your phone, while its slightly less capable predecessor, Phi 3.5, certainly can. Intelligence, of a sort, is available on demand.

Llama 3.3, running on my home computer, passes the "rhyming poem involving cheese puns" benchmark with only a couple of strained puns.
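To make that concrete, here is a minimal sketch of running the same cheese-pun test against a local Llama 3.3 model through the Ollama Python client. It assumes the Ollama runtime is installed and that a llama3.3 build fitting your hardware has already been pulled; the model tag and prompt are illustrative, not a prescription.

import ollama  # pip install ollama; requires the local Ollama runtime to be running

# Ask the locally hosted model for a cheese-pun poem, entirely offline.
response = ollama.chat(
    model="llama3.3",  # assumes you have run `ollama pull llama3.3` first
    messages=[{"role": "user", "content": "Write a short rhyming poem built around cheese puns."}],
)
print(response["message"]["content"])

Once the weights are downloaded, nothing leaves the machine: no API key, no internet connection.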

And, as I have discussed (and will post about again soon), these ubiquitous AIs are now starting to power agents, autonomous AIs that can pursue their own goals. You can see what that means in this post, where I use early agents to do comparison shopping and monitor a construction site.

VERY smart AIs are now here

All of this means that even if GPT-4 level performance were the maximum an AI could achieve, we would likely still have five to ten years of continued change ahead as we got used to its capabilities. But there isn’t a sign that a major slowdown in AI development is imminent. We know this because the last month has had two other significant releases - the first sign of the Gen3 models (you can think of these as GPT-5 class models) and the release of the o1 models that can “think” before answering, effectively making them much better reasoners than other LLMs. We are in the early days of Gen3 releases, so I am not going to write about them too much in this post, but I do want to talk about o1.

I discussed the o1 release when it came out in early o1-preview form, but two more sophisticated variants, o1 and o1-pro, have since arrived with considerably more power. These models spend time invisibly “thinking” - mimicking human logical problem solving - before answering questions. This approach, called test-time compute, turns out to be a key to making models better at problem solving. In fact, these models are now smart enough to make meaningful contributions to research, in ways big and small.

As one fun example, I read an article about a recent social media panic - an academic paper suggested that black plastic utensils could poison you because they were partially made with recycled e-waste. A compound called BDE-209 could leach from these utensils at such a high rate, the paper suggested, that it would approach the safe levels of dosage established by the EPA. A lot of people threw away their spatulas, but McGill University’s Joe Schwarcz thought this didn’t make sense and identified a math error where the authors incorrectly multiplied the dosage of BDE-209 by a factor of 10 on the seventh page of the article - an error missed by the paper’s authors and peer reviewers. I was curious if o1 could spot this error. So, from my phone, I pasted in the text of the PDF and typed: “carefully check the math in this paper.” That was it. o1 spotted the error immediately (other AI models did not).
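If you want to try the same experiment yourself, here is a rough sketch using pypdf to extract the paper's text and OpenAI's Python SDK to pass it to o1. The filename is a placeholder, and whether you can call o1 (rather than o1-preview or o1-mini) through the API depends on your account; this illustrates the idea rather than the exact steps I took from my phone.

from pypdf import PdfReader   # pip install pypdf openai
from openai import OpenAI

# Pull the plain text out of a local copy of the paper (placeholder filename).
reader = PdfReader("black_plastic_paper.pdf")
paper_text = "\n".join(page.extract_text() or "" for page in reader.pages)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="o1",  # substitute "o1-preview" if the full model isn't available to you
    messages=[{"role": "user", "content": "carefully check the math in this paper\n\n" + paper_text}],
)
print(completion.choices[0].message.content)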

When models are capable enough to not just process an entire academic paper, but to understand the context in which “checking math” makes sense, and then actually check the results successfully, that radically changes what AIs can do. In fact, my experiment, along with others doing the same thing, helped inspire an effort to see how often o1 can find errors in the scientific literature. We don’t know how frequently o1 can pull off this sort of feat, but it seems important to find out, as it points to a new frontier of capabilities.

In fact, even the earlier version of o1, the preview model, seems to represent a leap in scientific ability. A bombshell of a medical working paper from Harvard, Stanford, and other researchers concluded that “o1-preview demonstrates superhuman performance [emphasis mine] in differential diagnosis, diagnostic clinical reasoning, and management reasoning, superior in multiple domains compared to prior model generations and human physicians." The paper has not been through peer review yet, and it does not suggest that AI can replace doctors, but it, along with the results above, does suggest a changing world where not using AI as a second opinion may soon be a mistake.

Potentially more significantly, I have increasingly been told by researchers that o1, and especially o1-pro, is generating novel ideas and solving unexpected problems in their field (here is one case). The issue is that only experts can now evaluate whether the AI is wrong or right. As an example, my very smart colleague at Wharton, Daniel Rock, asked me to give o1-pro a challenge: “ask it to prove, using a proof that isn’t in the literature, the universal function approximation theorem for neural networks without 1) assuming infinitely wide layers and 2) for more than 2 layers.” Here is what it wrote back:

Is this right? I have no idea. This is beyond my fields of expertise. Daniel and other experts who looked at it couldn’t tell whether it was right at first glance, either, but felt it was interesting enough to look into. It turns out the proof has errors (though it might be that more interactions with o1-pro could fix them). But the results still introduced some novel approaches that spurred further thinking. As Daniel noted to me, when used by researchers, o1 doesn’t need to be right to be useful: “Asking o1 to complete proofs in creative ways is effectively asking it to be a research colleague. The model doesn't have to get proofs right to be useful, it just has to help us be better researchers.”
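For readers who want the mathematical background: the result Daniel asked about is the universal approximation theorem. Its classical single-hidden-layer form, which the challenge explicitly asked o1-pro to go beyond, can be stated as follows (a standard textbook statement, not o1-pro's output):

$$\text{For } f \in C(K),\ K \subset \mathbb{R}^n \text{ compact, and } \varepsilon > 0,\ \exists\, N,\ \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n \ \text{such that}\ \sup_{x \in K}\Big|f(x) - \sum_{i=1}^{N} \alpha_i\,\sigma(w_i \cdot x + b_i)\Big| < \varepsilon,$$

where $\sigma$ is a fixed non-polynomial (for example sigmoidal) activation. The catch in Daniel's challenge is that this classical version leans on making the single hidden layer as wide as needed; he wanted a novel argument for bounded width and more than two layers.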

We now have an AI that seems to be able to address very hard, PhD-level problems, or at least work productively as a co-intelligence for researchers trying to solve them. Of course, the issue is that you don’t actually know whether these answers are right unless you have a PhD in the relevant field yourself, which creates a new set of challenges in AI evaluation. Further testing will be needed to understand how useful it is, and in what fields, but this new frontier in AI ability is worth watching.

AIs can watch and talk to you

We have had AI voice models for a few months, but the last week saw the introduction of a new capability - vision. Both ChatGPT and Gemini can now see live video and interact with voice simultaneously. For example, I can now share a live screen with Gemini’s new small Gen3 model, Gemini 2.0 Flash. You should watch it give me feedback on a draft of this post to see what this feels like:

Or even better, try it yourself for free. Seriously, it is worth experiencing what this system can do. Gemini 2.0 Flash is still a small model with a limited memory, but you start to see the point here. Models that can interact with humans in real time through the most common human senses - vision and voice - turn AI into present companions, in the room with you, rather than entities trapped in a chat box on your computer. The fact that ChatGPT Advanced Voice Mode can do the same thing from your phone means this capability is widely available to millions of users. The implications are going to be quite profound as AI becomes more present in our lives.
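If you want to script a taste of this rather than use the web demo, here is a minimal sketch that sends a single captured frame to Gemini through Google's generativeai Python SDK. This is not the live streaming interface shown above, just one screenshot and a text prompt, and the experimental model name is an assumption that may change.

import google.generativeai as genai   # pip install google-generativeai pillow
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")        # placeholder API key
frame = Image.open("draft_screenshot.png")     # placeholder: one frame of the shared screen

model = genai.GenerativeModel("gemini-2.0-flash-exp")  # assumed experimental model tag
response = model.generate_content([frame, "Give me feedback on this draft of my post."])
print(response.text)

The real-time back-and-forth in the video uses streaming audio and video rather than one-off calls, but even this single-frame version shows how little code sits between a model and what is on your screen.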

AI video suddenly got very good

AI image creation has become really impressive over the past year, with models that can run on my laptop producing images that are indistinguishable from real photographs. They have also become much easier to direct, responding appropriately to the prompts “otter on a plane using bluetooth” and “otter on a plane using wifi.” If you want to experiment yourself, Google’s ImageFX is a really easy interface for using the powerful Imagen 3 model, which was released in the last week.

But the real leap in the last week has come from AI text-to-video generators. Previously, AI models from Chinese companies generally represented the state-of-the-art in video generation, including impressive systems like Kling, as well as some open models. But the situation is changing rapidly. First, OpenAI released its powerful Sora tool and then Google, in what has become a theme of late, released its even more powerful Veo 2 video creator. You can play with Sora now if you subscribe to ChatGPT Plus, and it is worth doing, but I got early access to Veo 2 (coming in a month or two, apparently) and it is… astonishing.

It is always better to show than tell, so take a look at this compilation of 8-second clips (the current limit, though it can apparently do much longer movies). I provide the exact prompt in each clip, and the clips are selected only from the very first set of movies that Veo 2 made (it creates four clips at a time), so there is no cherry-picking from many examples. Pay attention to the apparent weight and heft of objects, shadows and reflections, the consistency across scenes as hair styles and details are maintained, and how close the scenes are to what I asked for (the red balloon is there, if you look for it). There are errors, but they are now much harder to spot at first glance (though it still struggles with gymnastics, which are very hard for video models). Really impressive.

What does this all mean?

I will save a more detailed reflection for a future post, but the lesson to take away from this is that, for better and for worse, we are far from seeing the end of AI advancement. What's remarkable isn't just the individual breakthroughs - AIs checking math papers, generating nearly cinema-quality video clips, or running on gaming PCs. It's the pace and breadth of change. A year ago, GPT-4 felt like a glimpse of the future. Now it's basically running on phones, while new models are catching errors that slip past academic peer review. This isn't steady progress - we're watching AI take uneven leaps past our ability to easily gauge its implications. And this suggests that the opportunity to shape how these technologies transform your field exists now, when the situation is fluid, and not after the transformation is complete.


jgbishop
2 days ago
Wow!
Durham, NC

A thread of some letters people wrote to each other on clay...

1 Comment
A thread of some letters people wrote to each other on clay tablets in Mesopotamia thousands of years ago. “I am the servant of my lord. May my lord not withhold a chariot from me.”


jgbishop
10 days ago
It turns out that people's conversations have always been pretty dull!
Durham, NC

Full-Face Masks to Frustrate Identification

3 Comments

This is going to be interesting.

It’s a video of someone trying on a variety of printed full-face masks. They won’t fool anyone for long, but will survive casual scrutiny. And they’re cheap and easy to swap.

jgbishop
11 days ago
Yeesh. Our race to the bottom continues.
Durham, NC
2 public comments
cjheinz
11 days ago
Holy crap!
Lexington, KY; Naples, FL
LinuxGeek
11 days ago
Full-Face masks look convincing for a static expression. Could effectively keep you anonymous from security and surveillance cameras. Probably not useful in defeating AI based age verification.

Making Tea

8 Comments and 12 Shares
No, of course we don't microwave the mug WITH the teabag in it. We microwave the teabag separately.
jgbishop
11 days ago
I'll admit to microwaving the mug and tea bag. It works well for me!
Durham, NC
7 public comments
Covarr
10 days ago
I put my strongest small ceramic bakeware in the toaster oven, filled with water. Sometimes you just gotta do things slow and appreciate life. Not like you'll be appreciating the tea; it's still not ready yet.
East Helena, MT
fxer
10 days ago
You can’t microwave water, it will be polluted with radiation! Do you really want your kids exposed to electromagnetic waves?
Bend, Oregon
sommerfeld
11 days ago
It's not that 110V kettles are less efficient at turning electricity to heat than 240V - they're just less powerful. UK kettles draw up to 3 kilowatts, while ones in the US max out at around half that.
zwol
10 days ago
And that's directly related to the voltage difference. In both countries, electric kettles have to be designed on the assumption that they can pull only 13 to 15 amps of load from the mains. This puts a hard limit on the wattage rating β€” but wattage is volts times amps, so the higher UK supply voltage makes higher power kettles possible. Microwave ovens, on the other hand, are typically powered by 20-amp dedicated circuits in the USA, so they can be higher power than kettles at the same supply voltage. I don't know how they're wired in the UK.
bcs
10 days ago
@zwol FWIW, I've never seen a microwave with a 20A plug.
zwol
10 days ago
@bcs I'm not sure about this but I have the impression that it's OK per US electrical code to use a NEMA 5-15 socket on a 20A circuit *as long as it's a dedicated circuit*, and this is one of the reasons why 20A plugs are so rare on US kitchen appliances. That said, something else is also going on, because I just checked and my microwave is rated at 1.7kW, which is 14.2 amps at 120V, but I can't find any electric kettle for sale that goes higher than 1.5kW (12.5A at 120V). Possibly the real concern here is that a kettle *can't* assume a dedicated circuit, so the designers have to leave some headroom in case there are lamps or something plugged into the same circuit.
bcs
10 days ago
@zwol you can 100% put a lower amp outlet on a higher amp circuit, and you don't need it to be dedicated. (It's the same as plugging an 8A lamp cord into a 15A socket; the load is responsible for protecting its own cord.) In fact, 20A wires and 15A sockets are very common. What you can't do is sell an appliance that draws more than 15A but plugs into a 15A socket.
PeterParslow
9 hours ago
Microwaves in the UK: all the ones I've seen (Brit living here 50+ years) are simply plugged into a 13 amp socket, like the kettle is. They're normally rated 1 kW, but some make it to 1.2 kW. Cookers (oven, hob) are usually wired into a separate 45 amp circuit.
rraszews
11 days ago
What's weird is when you get into the details. Apparently American electric kettles are much slower than British ones (British people keep telling me it takes 30 seconds to boil water in an electric kettle; mine takes 5 minutes) while American microwaves are much faster (Again, takes 90 seconds in mine; they claim it takes 10 minutes). (There is some truth here; electric kettles are less efficient using American 110 mains voltage, not sure why British microwaves are so weak though)
Columbia, MD
fallinghawks
11 days ago
Consider getting a newer kettle. I (US) bought a Krups 1L earlier this year. It takes 2.5 minutes to boil 2 cups of water, which gives my microwave a run for its money. It's probably also using less electricity too.
jakar
10 days ago
Haven't researched this, but I'm willing to bet that an industrial 240V kettle exists somewhere here in America, and that I could theoretically run a new circuit easily enough to accommodate it. However, I also don't care enough to actually make it happen.
bootsofdoom
11 days ago
Ah, Americans. Literally nobody "makes it in a kettle". You boil the water in a kettle and make the tea in a teapot. Obviously.
PeterParslow
9 hours ago
If we extend "kettle" to include saucepans, then the Indian approach is to put everything (tea, milk, sugar, some spices) into a pan and boil it for a while
jlvanderzwan
11 days ago
What about microwaving the crown jewels?
alt_text_bot
11 days ago
No, of course we don't microwave the mug WITH the teabag in it. We microwave the teabag separately.

Pearls Before Swine by Stephan Pastis for Sun, 08 Dec 2024

1 Comment

Pearls Before Swine by Stephan Pastis on Sun, 08 Dec 2024

Source - Patreon

jgbishop
12 days ago
The setup for these strips is always so crazy! Hahaha.
Durham, NC

When a Telescope Is a National-Security Risk. The Vera Rubin Observatory is...

1 Comment
When a Telescope Is a National-Security Risk. The Vera Rubin Observatory is a new telescope that the US built in Chile and they had to jump through some hoops to ensure it’s not going to see anything top secret (like US spy satellites).


jgbishop
18 days ago
This is so stupid. If your satellite is visible to my eyeball, that's your problem, not mine.
Durham, NC