GitHub Multilingual Repositories Dataset for AI Research

GitHub Releases Open Multilingual Repositories Dataset to Power the Next Generation of AI

Software may be written in programming languages, but human language sits at the very heart of how developers collaborate. From README files that explain how a project works, to issues where developers ask for help, to pull requests where teams review and debate code, the written word is everywhere in open source development. While much of this collaboration has historically happened in English, the global developer community is far more linguistically diverse than that single language suggests. As artificial intelligence becomes an increasingly central part of how software gets built, that diversity matters more than ever.

Recognizing this gap, GitHub has published the GitHub Multilingual Repositories Dataset — a repository-level metadata resource specifically designed to help researchers and developers discover public GitHub repositories that contain evidence of non-English natural-language content. The dataset is freely available on GitHub under the CC0-1.0 license, making it accessible to anyone looking to advance the state of multilingual AI research and development.

Why Multilingual Developer Data Matters for AI

The rise of AI coding assistants, natural language code search, and automated documentation tools has transformed the software development landscape. But most of these tools have been trained predominantly on English-language data. That puts the many developers worldwide who communicate, collaborate, and build primarily in languages other than English at a real disadvantage.

When an AI model lacks exposure to how developers in Brazil describe a bug in Portuguese, or how a Korean developer explains an architectural decision in an issue thread, that model is simply less useful — or less accurate — for those communities. Closing this gap requires training data that reflects the true linguistic breadth of the global developer ecosystem, and that starts with knowing where multilingual content actually lives.

The GitHub Multilingual Repositories Dataset directly addresses this need. Rather than dumping raw repository content, it provides metadata that points researchers toward repositories showing evidence of non-English collaboration. This approach makes the dataset both lightweight and practical for discovery-focused research workflows.

What the Dataset Contains

The GitHub Multilingual Repositories Dataset is intentionally scoped as a metadata dataset, not a bulk content export. Instead of overwhelming researchers with raw text files, it surfaces repository-level indicators that non-English natural-language content is present in READMEs, issues, and pull requests.

One of the most interesting findings to emerge from building the dataset is that language distribution is not uniform across types of developer content. The same language can rank very differently depending on where it appears:

Korean is the most common non-English language found in issue text, reflecting a highly active Korean-speaking open source community that uses GitHub's issue tracker extensively for technical discussions.
However, Korean ranks only fifth when it comes to README files — a reminder that language use varies dramatically depending on the nature of the content.
Portuguese tops the non-English README list, with more than three million repositories containing Portuguese-language documentation. This reflects the enormous and rapidly growing Brazilian developer community, which is one of the most active open source contributor bases in the world.

These distinctions matter for AI researchers. A model being trained to understand developer intent from issue descriptions needs different training data than one being optimized for README comprehension or code review commentary. The dataset provides the metadata scaffolding to target data collection with this kind of precision.
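As a rough illustration of that kind of targeting, the sketch below filters repository metadata by content type and language. The dataset's actual file format and schema are not described here, so the column names (repository, surface, language), the language codes, and the sample rows are all hypothetical placeholders rather than the real layout.

```python
import csv
import io

# Invented sample rows. The real dataset's columns and values may differ.
SAMPLE_CSV = """repository,surface,language
example-org/app-one,issues,ko
example-org/app-one,readme,en
example-org/docs-two,readme,pt
example-org/lib-three,issues,pt
example-org/lib-three,readme,ko
"""


def repos_with_language(rows, surface, language):
    """Return sorted repository names where `surface` shows `language`."""
    return sorted({
        row["repository"]
        for row in rows
        if row["surface"] == surface and row["language"] == language
    })


# Parse the inline sample; a real workflow would open a downloaded file instead.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Candidates for an issue-understanding model versus a README-focused one.
korean_issue_repos = repos_with_language(rows, "issues", "ko")
portuguese_readme_repos = repos_with_language(rows, "readme", "pt")

print(korean_issue_repos)
print(portuguese_readme_repos)
```

Keeping the content type as an explicit filter argument means the same helper can build separate candidate lists for issues, READMEs, and pull requests.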

Open Access and Microsoft's European Digital Commitments

The release of this dataset is not an isolated product decision — it follows through on a commitment GitHub made in 2025 as part of Microsoft's European Digital Commitments. Those commitments include making multilingual data more accessible, with a specific focus on supporting open source AI developers who are working to build more inclusive, globally representative models.

By publishing the dataset under the CC0-1.0 license, GitHub waives its copyright and related rights in the metadata to the extent the law allows. Researchers at universities, independent AI developers, non-profit organizations, and commercial teams can all use, modify, and build on the data without asking permission or giving attribution. Because the dataset is metadata only, the repositories it points to remain under their own licenses. Open licensing of this kind supports the collaborative, reproducible research that drives progress in the field.

How Researchers and Developers Can Use This Dataset

The practical applications for the GitHub Multilingual Repositories Dataset are broad and span multiple disciplines within AI and software engineering research.

NLP and LLM training: Teams building or fine-tuning large language models can use the dataset to identify and source multilingual developer content that reflects real-world usage patterns across communities.
Code intelligence tools: Developers building AI-powered tools for code review, documentation generation, or bug triaging can improve multilingual support by grounding their systems in data that reflects non-English developer workflows.
Linguistic research: Linguists and computational researchers studying how technical language evolves across cultures can use the dataset to surface repositories for comparative analysis.
Inclusivity audits: Organizations assessing the global reach and accessibility of their developer tools can benchmark their multilingual performance against the language distribution patterns the dataset reveals.
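For the audit and research use cases above, a first step might be ranking languages per content type. The sketch below assumes the metadata can be reduced to (repository, surface, language) records. That record shape, the language codes, and the sample values are illustrative assumptions, not the dataset's documented schema.

```python
from collections import Counter, defaultdict

# Invented (repository, surface, language) records for illustration only.
RECORDS = [
    ("example-org/a", "readme", "pt"),
    ("example-org/b", "readme", "pt"),
    ("example-org/c", "readme", "ko"),
    ("example-org/a", "issues", "ko"),
    ("example-org/c", "issues", "ko"),
    ("example-org/d", "issues", "pt"),
    ("example-org/d", "readme", "en"),
]


def rank_languages(records, exclude=("en",)):
    """Map each surface to (language, repo_count) pairs, most common first."""
    # Collect distinct repositories per (surface, language) so that
    # duplicate records for one repository are counted once.
    repos = defaultdict(set)
    for repository, surface, language in records:
        if language not in exclude:
            repos[(surface, language)].add(repository)

    counts = defaultdict(Counter)
    for (surface, language), names in repos.items():
        counts[surface][language] = len(names)

    return {surface: counter.most_common() for surface, counter in counts.items()}


ranking = rank_languages(RECORDS)
for surface, languages in ranking.items():
    print(surface, languages)
```

A team could compare a ranking like this with the languages its own tooling supports to see which communities are underserved.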

A Step Toward a More Inclusive AI Ecosystem

The GitHub Multilingual Repositories Dataset represents something larger than a single data release. It is a signal that the infrastructure layer of AI development — the data, the tooling, the discoverability — needs to reflect the diversity of the global developer community just as much as the end products do.

For too long, the default assumption in AI development has been that English data is sufficient, or that multilingual capability is a secondary concern to be addressed after core functionality is established. The reality is that millions of people around the world build software, solve problems, and share knowledge in dozens of languages every single day. AI systems that cannot engage with that reality will always fall short of their potential.

By making this dataset freely available, GitHub is giving the research and developer community a map of where multilingual developer content lives, which it needs to close that gap one repository discovery at a time. For anyone working at the intersection of AI, natural language processing, and software development, the GitHub Multilingual Repositories Dataset is a resource worth exploring today.

The dataset is available now at github.com/github/multilingual-repositories under the CC0-1.0 license, free for anyone to access, use, and build upon.