
Can coding agents relicense open source through a “clean room” implementation of code?


Over the past few months it's become clear that coding agents are extraordinarily good at building a weird version of a "clean room" implementation of code.

The most famous version of this pattern is when Compaq created a clean-room clone of the IBM BIOS back in 1982. They had one team of engineers reverse engineer the BIOS to create a specification, then handed that specification to another team to build a new ground-up version.

This process used to take multiple teams of engineers weeks or months to complete. Coding agents can do a version of this in hours - I experimented with a variant of this pattern against JustHTML back in December.

There are a lot of open questions about this, both ethically and legally. These appear to be coming to a head in the venerable chardet Python library.

chardet was created by Mark Pilgrim back in 2006 and released under the LGPL. Mark retired from public internet life in 2011 and chardet's maintenance was taken over by others, most notably Dan Blanchard who has been responsible for every release since 1.1 in July 2012.

Two days ago Dan released chardet 7.0.0 with the following note in the release notes:

Ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x. Just way faster and more accurate!

Yesterday Mark Pilgrim opened #327: No right to relicense this project:

[...] First off, I would like to thank the current maintainers and everyone who has contributed to and improved this project over the years. Truly a Free Software success story.

However, it has been brought to my attention that, in the release 7.0.0, the maintainers claim to have the right to "relicense" the project. They have no such right; doing so is an explicit violation of the LGPL. Licensed code, when modified, must be released under the same LGPL license. Their claim that it is a "complete rewrite" is irrelevant, since they had ample exposure to the originally licensed code (i.e. this is not a "clean room" implementation). Adding a fancy code generator into the mix does not somehow grant them any additional rights.

Dan's lengthy reply included:

You're right that I have had extensive exposure to the original codebase: I've been maintaining it for over a decade. A traditional clean-room approach involves a strict separation between people with knowledge of the original and people writing the new implementation, and that separation did not exist here.

However, the purpose of clean-room methodology is to ensure the resulting code is not a derivative work of the original. It is a means to an end, not the end itself. In this case, I can demonstrate that the end result is the same — the new code is structurally independent of the old code — through direct measurement rather than process guarantees alone.

Dan goes on to present results from the JPlag tool - which describes itself as "State-of-the-Art Source Code Plagiarism & Collusion Detection" - showing that the new 7.0.0 release has a max similarity of 1.29% with the previous release and 0.64% with the 1.1 version. Other release versions had similarities more in the 80-93% range.
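To give a sense of what similarity tooling of this kind measures, here is a greatly simplified, illustrative stand-in: token-level comparison of two Python sources, with identifiers normalized so that mere renaming cannot hide structural overlap. JPlag itself does far more (language-aware parsing, greedy string tiling), and the sample snippets below are invented - this only sketches the general idea.

```python
# Crude, illustrative stand-in for token-based similarity measurement.
# Identifiers are collapsed to a generic NAME token so that renaming
# variables does not mask structural similarity; layout tokens are dropped.
import difflib
import io
import keyword
import tokenize

def token_stream(source: str) -> list[str]:
    """Reduce Python source to a stream of normalized tokens."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string not in keyword.kwlist:
            tokens.append("NAME")  # identifiers are interchangeable
        elif tok.type in (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.COMMENT):
            continue  # layout and comments carry no structure
        else:
            tokens.append(tok.string)
    return tokens

def similarity(a: str, b: str) -> float:
    """Percentage similarity of the two token streams."""
    matcher = difflib.SequenceMatcher(None, token_stream(a), token_stream(b))
    return 100 * matcher.ratio()

old = "def detect(data):\n    best = None\n    for p in probers:\n        best = p\n    return best\n"
new = "class Detector:\n    def feed(self, chunk):\n        self.buffer += chunk\n"
print(f"{similarity(old, new):.2f}% similar")
```

Real tools report much richer results than a single percentage, but the core intuition - low token-level overlap suggests structural independence - is the same one Dan is appealing to.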

He then shares critical details about his process:

For full transparency, here's how the rewrite was conducted. I used the superpowers brainstorming skill to create a design document specifying the architecture and approach I wanted based on the following requirements I had for the rewrite [...]

I then started in an empty repository with no access to the old source tree, and explicitly instructed Claude not to base anything on LGPL/GPL-licensed code. I then reviewed, tested, and iterated on every piece of the result using Claude. [...]

I understand this is a new and uncomfortable area, and that using AI tools in the rewrite of a long-standing open source project raises legitimate questions. But the evidence here is clear: 7.0 is an independent work, not a derivative of the LGPL-licensed codebase. The MIT license applies to it legitimately.

Since the rewrite was conducted using Claude Code there are a whole lot of interesting artifacts available in the repo. 2026-02-25-chardet-rewrite-plan.md is particularly detailed, stepping through each stage of the rewrite process in turn - starting with the tests, then fleshing out the planned replacement code.

There are several twists that make this case particularly hard to confidently resolve:

  • Dan has been immersed in chardet for over a decade, and has clearly been strongly influenced by the original codebase.
  • There is one example where Claude Code referenced parts of the codebase while it worked, as shown in the plan - it looked at metadata/charsets.py, a file that lists charsets and their properties expressed as a dictionary of dataclasses.
  • More complicated: Claude itself was very likely trained on chardet as part of its enormous quantity of training data - though we have no way of confirming this for sure. Can a model trained on a codebase produce a morally or legally defensible clean-room implementation?
  • As discussed in this issue from 2014 (where Dan first openly contemplated a license change) Mark Pilgrim's original code was a manual port from C to Python of Mozilla's MPL-licensed character detection library.
  • How significant is the fact that the new release of chardet used the same PyPI package name as the old one? Would a fresh release under a new name have been more defensible?

I have no idea how this one is going to play out. I'm personally leaning towards the idea that the rewrite is legitimate, but the arguments on both sides of this are entirely credible.

I see this as a microcosm of the larger question around coding agents for fresh implementations of existing, mature code. This question is hitting the open source world first, but I expect it will soon start showing up in Compaq-like scenarios in the commercial world.

Once commercial companies see that their closely held IP is under threat I expect we'll see some well-funded litigation.

Update 6th March 2026: A detail that's worth emphasizing is that Dan does not claim that the new implementation is a pure "clean room" rewrite. Quoting his comment again:

A traditional clean-room approach involves a strict separation between people with knowledge of the original and people writing the new implementation, and that separation did not exist here.

I can't find it now, but I saw a comment somewhere that pointed out the absurdity of Dan being blocked from working on a new implementation of character detection as a result of the volunteer effort he put into helping to maintain an existing open source library in that domain.

I enjoyed Armin's take on this situation in AI And The Ship of Theseus, in particular:

There are huge consequences to this. When the cost of generating code goes down that much, and we can re-implement it from test suites alone, what does that mean for the future of software? Will we see a lot of software re-emerging under more permissive licenses? Will we see a lot of proprietary software re-emerging as open source? Will we see a lot of software re-emerging as proprietary?

Tags: licensing, mark-pilgrim, open-source, ai, generative-ai, llms, ai-assisted-programming, ai-ethics, coding-agents

jgbishop (Raleigh, NC), 2 hours ago:
What an interesting dilemma (and more proof that software licenses are stupid; especially ones based on GPL).

Add + Discover Sites: YouTube, Reddit, podcasts, newsletters, and thousands of feeds to explore


NewsBlur has always been great at reading feeds. But finding new ones? That was mostly on you. The old “Add Site” dialog was a search box and not much else. If you already had a feed URL, it worked fine. If you were looking for something new to read, you were on your own.

The new Add + Discover Sites page changes that. It’s a full-page discovery experience with eight tabs covering YouTube channels, Reddit communities, podcasts, newsletters, Google News topics, trending sites, popular feeds, and of course the classic search-and-subscribe workflow. There are over 50,000 curated feeds to browse, all organized into dozens of categories and subcategories.

Eight ways to find feeds

The tab bar across the top gives you eight different lenses into the world of RSS:

  • Search — The classic search bar, now with semantic search and autocomplete. Type a topic or URL and get instant suggestions. Below the search results you’ll find trending feeds ranked by a hybrid algorithm that combines subscription velocity, read engagement, and subscriber counts.

  • Web Feed — Create RSS feeds from any website. This one gets its own blog post.

  • Popular Sites — Thousands of curated RSS feeds organized into categories like Technology, Science, News, and Business. Each category has subcategories for drilling down further.

  • YouTube — Over 2,000 verified YouTube channels converted to RSS feeds. Browse by category or search for specific channels. Subscribe and read YouTube in your feed reader the way it should be.

  • Reddit — Nearly 6,000 real subreddits across 47 categories. From r/programming to r/sourdough, you can subscribe to any subreddit as an RSS feed.

  • Newsletters — Newsletters from Substack, Medium, Ghost, Beehiiv, and other platforms. Platform pills let you filter by newsletter provider if you have a preference.

  • Podcasts — Popular podcasts organized by genre. Search for shows or browse the curated collection.

  • Google News — Eight preset topics (World, Business, Technology, Sports, and more) that create feeds from Google News. One click to subscribe.
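The "hybrid algorithm" mentioned under the Search tab could be sketched as a weighted score over the three signals named there. The weights and field names below are my own illustrative assumptions, not NewsBlur's actual code:

```python
# Illustrative sketch of a hybrid trending score combining subscription
# velocity, read engagement, and subscriber counts. Weights are made up.
from dataclasses import dataclass
import math

@dataclass
class FeedStats:
    new_subs_7d: int  # subscription velocity: new subscribers this week
    reads_7d: int     # read engagement: stories read this week
    subscribers: int  # total subscriber count

def trending_score(f: FeedStats) -> float:
    # Log-scale each signal so huge feeds don't drown out fast-growing
    # ones, then combine with (assumed) weights favoring recent momentum.
    velocity = math.log1p(f.new_subs_7d)
    engagement = math.log1p(f.reads_7d)
    popularity = math.log1p(f.subscribers)
    return 0.5 * velocity + 0.3 * engagement + 0.2 * popularity

feeds = [FeedStats(120, 900, 1_500), FeedStats(2, 40, 80_000)]
ranked = sorted(feeds, key=trending_score, reverse=True)
```

With this weighting, a small feed gaining 120 subscribers in a week outranks an 80,000-subscriber feed that has gone quiet - which is the behavior a "trending" list wants.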

Categories and subcategories

Most tabs are organized with a two-level taxonomy. Click a category pill at the top to filter, then drill into subcategories for more specific browsing. YouTube’s Technology category, for example, breaks down into Programming, AI & Machine Learning, Gadgets, and more.

The categories are consistent across tabs where it makes sense, so you can explore Technology feeds across YouTube, Reddit, Popular Sites, and Podcasts without having to rethink the navigation each time.

Grid view and list view

Every tab supports two viewing modes. Grid view shows feed cards with thumbnails, descriptions, subscriber counts, and freshness indicators. List view compresses things into a denser layout when you want to scan quickly.

A style popover in the top right lets you toggle between views. Your preference is saved per tab.

Try before you subscribe

Every feed card has a Try button that instantly fetches the feed and shows you the actual stories. No commitment, no subscribing. Just a quick look at what you’d get. If you like what you see, the subscribe button is right there with a folder picker.

A breadcrumb link at the top takes you back to where you were browsing when you’re done previewing.
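A preview like this is, at its core, a one-off fetch-and-parse with no subscription state. NewsBlur's server-side implementation isn't shown here, but a stdlib-only sketch of the idea looks like this:

```python
# Minimal sketch of a feed preview: parse an RSS 2.0 document and pull
# out the story titles a "Try" button might display. The sample feed is
# invented; a real preview would fetch the document over HTTP first.
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

def preview(rss_text: str, limit: int = 5) -> list[str]:
    """Return up to `limit` story titles from an RSS 2.0 document."""
    channel = ET.fromstring(rss_text).find("channel")
    return [item.findtext("title") for item in channel.findall("item")[:limit]]

print(preview(SAMPLE_RSS))  # story titles shown in the preview card
```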

The new Add Site popover

If you don’t need the full discovery page, the popover that appears when you click “+” in the sidebar has been redesigned too. It still has the quick URL input for when you have a feed address handy, but now it also shows freshness indicators and has buttons to jump into any of the discovery tabs.

The search tab uses Elasticsearch to find feeds by name with fuzzy matching. Type “cooking” and you’ll get cooking blogs, YouTube cooking channels, cooking subreddits, and cooking podcasts. It searches across all feed types, not just traditional RSS. If Elasticsearch doesn’t find anything, the search falls back to a database query so you’ll always get results.
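The fuzzy matching described above maps onto standard Elasticsearch query DSL. This sketch builds such a query body; the field name and result size are guesses, not NewsBlur's actual schema, though the `match` query with `fuzziness` is standard Elasticsearch:

```python
# Builds an Elasticsearch-style fuzzy search query body. The "title"
# field and size of 25 are illustrative assumptions; the match/fuzziness
# structure is the standard Elasticsearch query DSL.
import json

def feed_search_query(text: str) -> dict:
    return {
        "query": {
            "match": {
                "title": {                # hypothetical field name
                    "query": text,
                    "fuzziness": "AUTO",  # tolerate typos like "cookng"
                    "operator": "and",
                }
            }
        },
        "size": 25,
    }

body = json.dumps(feed_search_query("cooking"))
```

In a real deployment this body would be POSTed to the search endpoint, with the database fallback kicking in when Elasticsearch returns no hits.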

Where all these feeds came from

Building the discovery page meant curating a lot of feeds. I wrote management commands to discover and verify channels, subreddits, podcasts, and newsletters from real sources. The collection includes over 2,000 YouTube channels, 6,600 subreddits, 7,300 newsletters, 32,000 podcasts, and 14,000 RSS feeds. Over 63,000 feeds in total, all real, verified, and categorized.

The Add + Discover Sites page is available now on the web for all users. If you have feedback or ideas for new categories, platforms, or features, please share them on the NewsBlur forum.

jgbishop (Raleigh, NC), 2 days ago:
I've wanted this for so long!!!

Pearls Before Swine - 2026-03-01


Comic strip for 2026/03/01

jgbishop (Raleigh, NC), 6 days ago:
Haha!

GaryBIshop, 5 days ago:
Who's on first base.

Deep Blue


We coined a new term on the Oxide and Friends podcast last month (primary credit to Adam Leventhal) covering the sense of psychological ennui leading into existential dread that many software developers are feeling thanks to the encroachment of generative AI into their field of work.

We're calling it Deep Blue.

You can listen to it being coined in real time from 47:15 in the episode. I've included a transcript below.

Deep Blue is a very real issue.

Becoming a professional software engineer is hard. Getting good enough for people to pay you money to write software takes years of dedicated work. The rewards are significant: this is a well compensated career which opens up a lot of great opportunities.

It's also a career that's mostly free from gatekeepers and expensive prerequisites. You don't need an expensive degree or accreditation. A laptop, an internet connection and a lot of time and curiosity is enough to get you started.

And it rewards the nerds! Spending your teenage years tinkering with computers turned out to be a very smart investment in your future.

The idea that this could all be stripped away by a chatbot is deeply upsetting.

I've seen signs of Deep Blue in most of the online communities I spend time in. I've even faced accusations from my peers that I am actively harming their future careers through my work helping people understand how well AI-assisted programming can work.

I think this is an issue which is causing genuine mental anguish for a lot of people in our community. Giving it a name makes it easier for us to have conversations about it.

My experiences of Deep Blue

I distinctly remember my first experience of Deep Blue. For me it was triggered by ChatGPT Code Interpreter back in early 2023.

My primary project is Datasette, an ecosystem of open source tools for telling stories with data. I had dedicated myself to the challenge of helping people (initially focusing on journalists) clean up, analyze and find meaning in data, in all sorts of shapes and sizes.

I expected I would need to build a lot of software for this! It felt like a challenge that could keep me happily engaged for many years to come.

Then I tried uploading a CSV file of San Francisco Police Department Incident Reports - hundreds of thousands of rows - to ChatGPT Code Interpreter and... it did every piece of data cleanup and analysis I had on my napkin roadmap for the next few years with a couple of prompts.

It even converted the data into a neatly normalized SQLite database and let me download the result!

I remember having two competing thoughts in parallel.

On the one hand, as somebody who wants journalists to be able to do more with data, this felt like a huge breakthrough. Imagine giving every journalist in the world an on-demand analyst who could help them tackle any data question they could think of!

But on the other hand... what was I even for? My confidence in the value of my own projects took a painful hit. Was the path I'd chosen for myself suddenly a dead end?

I've had some further pangs of Deep Blue just in the past few weeks, thanks to the Claude Opus 4.5/4.6 and GPT-5.2/5.3 coding agent effect. As many other people are also observing, the latest generation of coding agents, given the right prompts, really can churn away for a few minutes to several hours and produce working, documented and fully tested software that exactly matches the criteria they were given.

"The code they write isn't any good" doesn't really cut it any more.

A lightly edited transcript

Bryan: I think that we're going to see a real problem with AI induced ennui where software engineers in particular get listless because the AI can do anything. Simon, what do you think about that?

Simon: Definitely. Anyone who's paying close attention to coding agents is feeling some of that already. There's an extent to which you sort of get over it when you realize that you're still useful, even though your ability to memorize the syntax of programming languages is completely irrelevant now.

Something I see a lot of is people out there who are having existential crises and are very, very unhappy because they're like, "I dedicated my career to learning this thing and now it just does it. What am I even for?". I will very happily try and convince those people that they are for a whole bunch of things and that none of that experience they've accumulated has gone to waste, but psychologically it's a difficult time for software engineers.

[...]

Bryan: Okay, so I'm going to predict that we name that. Whatever that is, we have a name for that kind of feeling and that kind of, whether you want to call it a blueness or a loss of purpose, and that we're kind of trying to address it collectively in a directed way.

Adam: Okay, this is your big moment. Pick the name. If you call your shot from here, this is you pointing to the stands. You know... like Deep Blue, you know.

Bryan: Yeah, deep blue. I like that. I like deep blue. Deep blue. Oh, did you walk me into that, you bastard? You just blew out the candles on my birthday cake.

It wasn't my big moment at all. That was your big moment. No, that is, Adam, that is very good. That is deep blue.

Simon: All of the chess players and the Go players went through this a decade ago and they have come out stronger.

Turns out it was more than a decade ago: Deep Blue defeated Garry Kasparov in 1997.

Tags: definitions, careers, ai, generative-ai, llms, ai-assisted-programming, oxide, bryan-cantrill, ai-ethics, coding-agents

jgbishop (Raleigh, NC), 20 days ago:
This is a terrible name, but it's definitely something worth talking about. AI is here to stay, like it or not.

Defeating a 40-year-old copy protection dongle


[Image: the parallel-port copy-protection dongle]

That’s right — this little device is what stood between me and the ability to run an even older piece of software that I recently unearthed during an expedition of software archaeology.

For a bit more background, I was recently involved in helping a friend’s accounting firm to move away from using an extremely legacy software package that they had locked themselves into using for the last four decades.

This software was built using a programming language called RPG (“Report Program Generator”), which is older than COBOL (!), and was used with IBM’s midrange computers such as the System/3, System/32, and all the way up to the AS/400. Apparently, RPG was subsequently ported to MS-DOS, so that the same software tools built with RPG could run on personal computers, which is how we ended up here.

This accounting firm was actually using a Windows 98 computer (yep, in 2026), running the RPG software inside a DOS console window. And it turned out that running this software requires a special hardware copy-protection dongle attached to the computer's parallel port! This was a relatively common practice in those days, particularly with "enterprise" software vendors who wanted to protect their very important™ software from unauthorized use.

[Image: the dongle's worn label]

Sadly, most of the text and markings on the dongle's label have been worn or scratched off, but we can make out several clues:

  • The words “Stamford, CT”, and what’s very likely the logo of a company called “Software Security Inc”. The only evidence for the existence of this company is this record of them exhibiting their wares at SIGGRAPH conferences in the early 1990s, as well as several patents issued to them, relating to software protection.
  • A word that seems to say “RUNTIME”, which will become clear in a bit.

My first course of action was to take a disk image of the Windows 98 PC that was running this software, and get it running in an emulator, so that we could see what the software actually does, and perhaps export the data from this software into a more modern format, to be used with modern accounting tools. But of course all of this requires the hardware dongle; none of the accounting tools seem to work without it plugged in.

Before doing anything, I looked through the disk image for any additional interesting clues, and found plenty of fascinating (and archaeologically significant?) stuff:

[Image: contents of the disk image]

  • We’ve got a compiler for the RPG II language (excellent!), made by a company called Software West Inc.
  • Even better, there are two versions of the RPG II compiler, released on various dates in the 1990s by Software West.
  • We’ve got the complete source code of the accounting software, written in RPG. It looks like the full accounting package consists of numerous RPG modules, with a gnarly combination of DOS batch files for orchestrating them, all set up as a “menu” system for the user to navigate using number combinations. Clearly the author of this accounting system was originally an IBM mainframe programmer, and insisted on bringing those skills over to DOS, with mixed results.

I began by playing around with the RPG compiler in isolation, and I learned very quickly that it’s the RPG compiler itself that requires the hardware dongle, and then the compiler automatically injects the same copy-protection logic into any executables it generates. This explains the text that seems to say “RUNTIME” on the dongle.

The compiler consists of a few executable files, notably RPGC.EXE, which is the compiler, and SEU.EXE, which is a source editor (“Source Entry Utility”). Here’s what we get when we launch SEU without the dongle, after a couple of seconds:

[Image: SEU's "No dongle, no edit" error message]

A bit rude, but this gives us an important clue: this program must be trying to communicate over the parallel port over the course of a few seconds (which could give us an opportunity to pause it for debugging, and see what it’s doing during that time), and then exits with a message (which we can now find in a disassembly of the program, and trace how it gets there).

A great tool for disassembling executables of this vintage is Reko. It understands 16-bit real mode executables, and even attempts to decompile them into readable C code that corresponds to the disassembly.

[Image: Reko's decompilation view]

And so, looking at the decompiled/disassembled code in Reko, I expected to find in and out instructions, which would be the telltale sign of the program trying to communicate with the parallel port through the PC’s I/O ports. However… I didn’t see an in or out instruction anywhere! But then I noticed something: Reko disassembled the executable into two “segments”: 0800 and 0809, and I was only looking at segment 0809.

[Image: Reko's listing of the two segments, 0800 and 0809]

If we look at segment 0800, we see the smoking gun: in and out instructions, meaning that the copy-protection routine is definitely here, and best of all, the entire code segment is a mere 0x90 bytes, which suggests that the entire routine should be pretty easy to unravel and understand. For some reason, Reko was not able to decompile this code into a C representation, but it still produced a disassembly, which will work just fine for our purposes. Maybe this was a primitive form of obfuscation from those early days, which is now confusing Reko and preventing it from associating this chunk of code with the rest of the program… who knows.

Here is a GitHub Gist with the disassembly of this code, along with my annotations and notes. My x86 assembly knowledge is a little rusty, but here is the gist of what this code does:

  • It’s definitely a single self-contained routine, intended to be called using a “far” CALL instruction, since it returns with a RETF instruction.
  • It begins by detecting the address of the parallel port, by reading the BIOS data area. If the computer has more than one parallel port, the dongle must be connected to the first parallel port (LPT1).
  • It performs a loop where it writes values to the data register of the parallel port, and then reads the status register, and accumulates responses in the BH and BL registers.
  • At the end of the routine, the “result” of the whole procedure is stored in the BX register (BH and BL together), which will presumably be “verified” by the caller of the routine.
  • Very importantly, there doesn’t seem to be any “input” into this routine. It doesn’t pop anything from the stack, nor does it care about any register values passed into it. Which can only mean that the result of this routine is completely constant! No matter what complicated back-and-forth it does with the dongle, the result of this routine should always be the same.

With the knowledge that this routine must exit with some magic value stored in BX, we can now patch the first few bytes of the routine to do just that! Not yet knowing which value to put in BX, let’s start with 1234:

BB 34 12       MOV BX, 1234h
CB             RETF

Only the first four bytes need patching — set BX to our desired value, and get out of there. Running the patched executable with these new bytes still fails (expectedly) with the same message of “No dongle, no edit”, but it fails immediately, instead of after several seconds of talking to the parallel port. Progress!

Stepping through the disassembly more closely, we get another major clue: The only value that BH can be at the end of the routine is 76h. So, our total value for the magic number in BX must be of the form 76xx. In other words, only the BL value remains unknown:

BB __ 76       MOV BX, 76__h
CB             RETF

Since BL is an 8-bit register, it can only have 256 possible values. And what do we do when we have 256 combinations to try? Brute force it! I whipped up a script that plugs a value into that particular byte (from 0 to 255) and programmatically launches the executable in DosBox, and observes the output. Lo and behold, it worked! The brute forcing didn’t take long at all, because the correct number turned out to be… 6. Meaning that the total magic number in BX should be 7606h:

BB 06 76       MOV BX, 7606h
CB             RETF
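The brute-force harness described above might have looked something like this reconstruction. The file names, patch offset, and DOSBox invocation are hypothetical; only the four patch bytes (MOV BX, 76xxh / RETF) come from the analysis above:

```python
# Hypothetical reconstruction of the brute-force harness: write one
# candidate BL byte into a copy of the executable, run it under DOSBox,
# and check whether the "no dongle" failure message still appears.
import subprocess

PATCH_OFFSET = 0x0100        # file offset of the routine's entry (made up)
FAILURE_TEXT = b"No dongle"  # start of the message printed on failure

def patch_bytes(image: bytes, offset: int, bl: int) -> bytes:
    """Overwrite 4 bytes at `offset` with MOV BX, 76<bl>h / RETF."""
    stub = bytes([0xBB, bl, 0x76, 0xCB])  # x86 is little-endian: BL first
    return image[:offset] + stub + image[offset + len(stub):]

def try_value(bl: int) -> bool:
    """Patch a copy of the EXE, run it under DOSBox, report success."""
    with open("SEU.EXE", "rb") as f:
        patched = patch_bytes(f.read(), PATCH_OFFSET, bl)
    with open("SEU_TRY.EXE", "wb") as f:
        f.write(patched)
    # Capture the program's text output to a file via the DOS shell's
    # redirection, then scan it for the failure message.
    subprocess.run(["dosbox", "-c", "mount c .", "-c", "c:",
                    "-c", "SEU_TRY.EXE > OUT.TXT", "-c", "exit"],
                   capture_output=True, timeout=120)
    with open("OUT.TXT", "rb") as f:
        return FAILURE_TEXT not in f.read()

def brute_force():
    """Try all 256 candidate BL values; needs DOSBox and SEU.EXE present."""
    for candidate in range(256):
        if try_value(candidate):
            return candidate
    return None
```

With only 256 candidates and a few seconds per DOSBox launch, the whole search completes in well under an hour even in the worst case.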

[Image: SEU launching successfully with the patched bytes]

Bingo!
And when I proceeded to examine the other executable files in the compiler suite, the parallel port routine turned out to be exactly the same in each one. All of the executables have the exact same copy protection logic, as if it was rubber-stamped onto them. In fact, when the compiler (RPGC.EXE) compiles some RPG source code, it seems to copy the parallel port routine from itself into the compiled program. That's right: the patched version of the compiler will produce executables with the same patched copy protection routine! Very convenient.

I must say, this copy protection mechanism seems a bit… simplistic? A hardware dongle that just passes back a constant number? Defeatable with a four-byte patch? Is this really worthy of a patent? But who am I to pass judgment. It’s possible that I haven’t fully understood the logic, and the copy protection will somehow re-surface in another way. It’s also possible that the creators of the RPG compiler (Software West, Inc) didn’t take proper advantage of the hardware dongle, and used it in a way that is so easily bypassed.

In any case, Software West’s RPG II compiler is now free from the constraint of the parallel port dongle! And at some point soon, I’ll work on purging any PII from the compiler directories, and make this compiler available as an artifact of computing history. It doesn’t seem to be available anywhere else on the web. If anyone reading this was associated with Software West Inc, feel free to get in touch — I have many questions!


jgbishop (Raleigh, NC), 34 days ago:
Cool story!

Buzz


Two sleepy kidney beans are awoken at 5:31 in the morning. Banging a pan with a spoon and jumping on the bed is their child, a coffee bean.

The post Buzz appeared first on The Perry Bible Fellowship.
