
Github Copilot has, by their own admission, been trained on mountains of GPL code, so I'm unclear on how it's not a form of laundering open source code into commercial works. The handwave of "it usually doesn't reproduce exact chunks" is not very satisfying.
Copyright does not only cover copying and pasting; it covers derivative works. Github Copilot was trained on open source code and the sum total of everything it knows was drawn from that code. There is no possible interpretation of "derivative" that does not include this.
I'm really tired of the tech industry treating neural networks like magic black boxes that spit out something completely novel, and taking free software for granted while paying out $150k salaries for writing ad delivery systems. The two have finally fused and it sucks.
Previous generations of "AI" were trained on public text and photos, which are harder to make copyright claims on, but this one is drawn from large bodies of work with very explicit, court-tested licenses, so I look forward to the inevitable massive class action suits over this.
"But eevee, humans also learn by reading open source code, so isn't that the same thing"
- No
- Humans are capable of abstract understanding and have a breadth of other knowledge to draw from
- Statistical models do not
- You have fallen for marketing
Even MIT code still requires attribution, and they don't even know who to attribute the output to.
I think what you're saying is that it could be argued that the code that Copilot produces is subject to GPL terms. So anyone using Copilot to produce non-GPL code ought to be asking their legal team where they stand on that.
Of course the same goes for MIT, ASL, Artistic license etc, but the implications with GPL are more fun IMO.
Worse (or better, depending on which side of the fence you are on): a system like Copilot is nothing without its training. It could be argued that Copilot itself is a derivative work of thousands of pieces of GPL code, so it too should be subject to GPL terms...
Copilot is not a derivative work but its output is a derivative work of every piece of source code that was fed into its training set. Since it was fed code licensed under multiple conflicting terms, what terms cover its output? All of them, which is to say, every snippet that it produces (that is long enough to be subject to copyright) is in violation of some license or another.
Copilot's model is a derivative of its training set--the weights are, at least. Thus I would argue Copilot itself is a derivative work, since the model is fundamental to its novelty.
I assumed "Copilot" was the piece of human-written code that reads a digested corpus and regurgitates code. If you instead think that "Copilot" is the name of the particular compiled training set, then you are correct.
They could have trained it on only public domain code and it would have still emitted code, but this conversation about it would be different.
This will be interesting.
Wait for them to try to invoke the GPL definition of “distribution” here as a way to end-run it, since you can make derivative works that are not “distributed” and just sit inside a private server somewhere.
The whole “derp, people do it too” crowd doesn't understand that humans learning from the code was one of the main drivers, if not THE main driver, of the original GPL.
You can copyright a dictionary, but the words in it are not subject to copyright individually. Even a whole definition can be used in another work under fair use. Fair use is a rat's nest, but as a layperson it seems like they are firmly on the right side of all four tests.
Can we do Audacity next?
This seems like something that grew out of trying to build an AI that could find fingerprints in code (source, compressed/obfuscated, or compiled) that map to a small number of suspects who probably wrote that code. And the gag github feature sounds like gratis "Mechanical Turk", not unlike the visual recognition tests in "prove you are not a bot" systems.
Long counterpoint that I agree with about why the free software community really, really, does not want to win this one.
https://juliareda.eu/2021/07/github-copilot-is-not-infringing-your-copyright
If reading and learning from code means you have to obey the copyright on that code, two examples that spring to mind:
1. AT&T own the Unix/Posix API
2. Oracle own the Java API
Copilot is just the 'reading' part, not the 'learning' part. No version of "AI" in 2021 is learning anything. This is still what we old timers called 'data mining' back in the day. It's just that we now have faster processors that can churn through much more data. But it is still data mining. It's just 1's and 0's, people.
I won't deign to satisfy Microsoft's algorithm and click on any marketing materials for this thing, but who exactly is this 'product' for? People too lazy to learn how to code? My god, is VB6 coming back in some worse form?
Much of what she says is correct, but she skips a few important things; it looks like she didn't look closely enough at what Github Copilot does before launching into her favourite topics (which I pretty much agree completely with).
1. Sure, copyleft does not benefit from stronger copyright laws. But copyleft is harmed by a project that encourages license laundering of copyleft code and specifically avoids proprietary code laundering. Everything you can purloin with Copilot is free/open source licensed. None of it is proprietary. Why would that be?
I would be a lot happier if the WINE developers could conveniently "autocomplete" Microsoft's Win32 implementation and get away with falsely claiming authorship.
2. There's not a problem with scraping. The problem is what can be done with the results of scraping. There would be no problem if Copilot searched for functions that do what you want and displayed them to you. It could even let you copy them into your code, provided you agreed to the associated licensing terms.
"I scraped it" does not give you carte blanche to do what you like.
3. Short snippets are also fine. However, producing the entire fast inverse square root from Quake on demand, along with wrong attribution, is not a "short snippet".
At best, Github Copilot is a giant liability for anyone who uses it - imagine if your favourite search engine said "yeah, go ahead and use this image for whatever you like, even your next million dollar ad campaign", for every image it had scraped, no matter the source or licensing. At worst, it's a system designed to be a fig leaf for laundering GPL code into proprietary software.
Thanks for the response and Quake example in particular.
Commercial software has always benefited from the inclusion of free/open source code, and free/open source programmers themselves have different comfort levels with code being "purloined" - even at the length of the Quake square root function, it wouldn't bother me for example.
As to why it was trained only on free/open source code: advocates have claimed from the beginning that such code is higher quality than proprietary code, so why is this surprising? MS are endorsing the quality of non-proprietary software.
I agree that scraping does not give you carte blanche. And I like the idea of reminding anyone who uses Copilot in current form of the liability they could be exposing themselves to.
But some of the reactions I'm seeing remind me of the music industry when Napster appeared. Or the hassles that YouTube content creators (including jwz himself IIRC) suffer from when fragments of their work resemble something that already exists.
I agree, it would not be a problem if Copilot was better about attribution.
I was just thinking, "Hmm, is Id Software big enough to take on Microsoft? Oh yeah, they're owned by Bethesda now so they might be able to. Wait, Microsoft bought Bethesda... Shit..."
I think there's already some case law which says that you can't copyright "an API", only an "implementation".
The whole copilot discussion reminds me of the "What colour are your bits?" essay: https://ansuz.sooke.bc.ca/entry/23