
GitHub Just Made Your Code Microsoft's Training Data. You Have 19 Days to Stop It.

· Don Ho · 5 min read

Last updated: April 5, 2026


On March 25, GitHub announced that starting April 24, 2026, all interaction data from Copilot Free, Pro, and Pro+ users will be used to train Microsoft’s AI models by default: the code snippets you send to Copilot, the suggestions you accept, and the file names, repository structures, and architectural patterns the tool touches during your session. You are opted in by default. If you do nothing, all of it becomes training data for Microsoft’s CoreAI strategy, creating IP exposure for every organization whose developers use personal-tier accounts on company code. Copilot Business and Enterprise users are excluded. Everyone else is fair game.

GitHub’s Chief Product Officer Mario Rodriguez framed the change as necessary to improve model performance. That may be true. It is also true that Microsoft just converted 40 million developers into an unpaid data pipeline, and the opt-out is buried in a settings page that isn’t even accessible from the mobile app.

What GitHub Is Actually Collecting

The scope is broader than most developers realize. When the training data setting is enabled (which it is, by default), GitHub collects:

- accepted or modified outputs from Copilot suggestions
- inputs and code snippets sent to Copilot
- code context surrounding the cursor position
- comments and documentation in the active file
- file names and repository structure
- navigation patterns within the project
- interactions with Copilot features, including chat and inline suggestions
- thumbs up/down feedback on suggestions

GitHub draws a distinction between code “at rest” (stored in your repository, which they say they don’t access for training) and code “in session” (actively sent to Copilot while you’re working). That distinction sounds reassuring until you think about what “in session” actually means. If you’re using Copilot regularly, every active file in every repository you work in during a Copilot session is potentially in scope. The model sees your proprietary architecture, your naming conventions, your domain-specific patterns, and your business logic.

The collected data may also be shared with “GitHub affiliates,” defined as companies in the same corporate family. That means Microsoft and its subsidiaries. Third-party model providers are excluded from receiving this data for their own training, but that limitation applies to external partners, not to Microsoft itself.

The IP Problem Nobody’s Talking About

Here is the issue that should keep every general counsel awake tonight.

Individual Copilot users within an organization typically do not have the authority to license their employer’s source code to a third party. That is a basic principle of IP ownership. If you write code as part of your employment, your employer owns that code (in most jurisdictions, under work-for-hire doctrine). You cannot unilaterally grant Microsoft a license to use it for training data.

Yet GitHub’s opt-out mechanism is enforced at the individual user level, not the organization level. A single developer on your team who uses a personal Copilot Free or Pro account and doesn’t toggle the setting off has potentially exposed your proprietary codebase to Microsoft’s training pipeline.

GitHub’s FAQ partially addresses this: interaction data from users whose accounts are members of a paid organization will be excluded from model training, and data from paid organization repositories is never used regardless of the user’s tier. That sounds comprehensive. But it depends on every developer using their org-linked account for all work, never working on company code from a personal account, and never contributing to a company project from a personal Copilot session. In practice, those boundaries are porous. Developers work from personal machines. They test code locally. They use personal accounts for side projects that overlap with work.

The Competitive Intelligence Angle

One Reddit commenter put it plainly: “When you use Copilot, you’re not just getting suggestions. You’re implicitly teaching the model what good code looks like in your domain. Your proprietary patterns, architecture decisions, domain-specific idioms, naming conventions, all get folded into a general model. That model then improves suggestions for everyone else, including your direct competitors who use the same tool.”

This is not paranoia. It is the business model working exactly as designed. Microsoft trains the model on your patterns, then sells better suggestions to your competitors. The data flows one direction: from your proprietary codebase into a general-purpose model that benefits everyone who pays for the subscription. The value extraction is the product.

GitHub acknowledges the dynamic indirectly by noting that Microsoft, Anthropic, and JetBrains take similar approaches to using interaction data for model training. The fact that the industry has converged on this practice does not make it acceptable. It makes it an industry-wide IP problem. Anthropic’s recent OAuth ban showed how quickly platform policy changes can cut off access — now imagine that dynamic applied to your training data. The Perplexity class action, which alleges that AI chat data was routed to Meta and Google without consent, shows the same default-data-grab pattern playing out on the user side.

GDPR and International Exposure

For companies with European operations or employees, the GDPR question is immediate. GitHub claims “legitimate interest” as its lawful basis for processing interaction data. Under GDPR Article 6(1)(f), legitimate interest requires a balancing test: the controller’s interest must not be overridden by the data subject’s rights and freedoms.

Training commercial AI models on developers’ proprietary code, without affirmative consent, to benefit the controller’s competitive position is a difficult legitimate interest argument. The Supreme Court’s AI copyright ruling adds another layer: if AI-generated outputs can’t be copyrighted, the IP question around training inputs becomes even more critical. The Article 29 Working Party (now the EDPB) has consistently held that legitimate interest does not apply when the processing is unexpected from the data subject’s perspective or when a less intrusive alternative exists. An opt-in model is clearly less intrusive. GitHub chose opt-out. That choice will be tested. The EU AI Act’s logging requirements add another dimension: if your AI-assisted code touches high-risk systems in Europe, the compliance obligations stack fast.

What to Do Before April 24

For individual developers: go to github.com/settings/copilot/features. Under the “Privacy” section, disable “Allow GitHub to use my data for AI model training.” Do this now. Do not wait.

For engineering leaders: audit every developer’s Copilot tier and account configuration across your organization. Ensure all developers are using org-linked accounts with Copilot Business or Enterprise tiers that are excluded from training data collection. If any developer is using a personal Free, Pro, or Pro+ account for any company work, that is an immediate policy gap.
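One way to start that audit is to compare your org’s membership roster against its org-assigned Copilot seats: any member without an org-managed seat who uses Copilot at all is, by elimination, on a personal tier. The sketch below assumes two GitHub REST endpoints as currently documented, `GET /orgs/{org}/members` and `GET /orgs/{org}/copilot/billing/seats` (the latter requires the `manage_billing:copilot` scope); verify both, and the response shapes, against GitHub’s API docs before relying on this, and note that pagination is omitted for brevity.

```python
"""Sketch: flag org members who hold no org-assigned Copilot seat.

Assumes GitHub REST endpoints GET /orgs/{org}/members and
GET /orgs/{org}/copilot/billing/seats; pagination omitted for brevity.
"""
import json
import os
import urllib.request

API = "https://api.github.com"


def _get(path: str, token: str):
    # Minimal authenticated GET against the GitHub REST API.
    req = urllib.request.Request(
        f"{API}{path}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def unmanaged_members(members: list, seat_holders: list) -> list:
    """Org members with no org-assigned Copilot seat.

    If these developers use Copilot at all, it is on a personal
    Free/Pro/Pro+ account, which is subject to training-data
    collection unless each of them opts out individually.
    """
    return sorted(set(members) - set(seat_holders))


def audit(org: str, token: str) -> list:
    members = [m["login"] for m in _get(f"/orgs/{org}/members", token)]
    seats = _get(f"/orgs/{org}/copilot/billing/seats", token)
    holders = [s["assignee"]["login"] for s in seats.get("seats", [])]
    return unmanaged_members(members, holders)


if __name__ == "__main__" and "GH_TOKEN" in os.environ:
    gaps = audit(os.environ["GH_ORG"], os.environ["GH_TOKEN"])
    print("No org-managed Copilot seat:", ", ".join(gaps) or "none")
```

This only closes half the gap: it finds who lacks an org-managed seat, not who is actively using a personal account on company code, so pair it with a policy requiring org-linked accounts for all company work.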

For general counsel: treat this as part of a structured AI compliance stack covering governance, vendor management, and documentation; that framework is how you manage these obligations across tools. Update your acceptable use policies for AI coding tools. Require org-managed accounts for all coding AI tools. Prohibit use of personal-tier AI coding assistants on company code. Add Copilot data training settings to your quarterly IT compliance audit. If your company has European operations, assess GDPR exposure from any developer who has already been using Copilot without opting out.

For procurement teams: if you’re evaluating or renewing Copilot licenses, the training data policy is now a negotiation point. Ask Microsoft directly: will our interaction data be used for model training under any tier, any circumstance, any edge case? Get the answer in writing. Put it in the contract.

Opt-out deadlines and default data grabs are the new normal. Take the ACRA to map your IP exposure across AI platforms.

The deadline is April 24. Nineteen days. Every day a developer on your team uses Copilot without opting out is another day of proprietary code flowing into Microsoft’s training pipeline. The setting takes 30 seconds to change. The IP exposure from not changing it could last years.


Your developers’ default settings shouldn’t be your IP strategy. Kaizen AI Lab builds AI governance policies that cover coding tools, vendor contracts, and the data grabs hiding in your stack. Lock it down.

