GitHub Multilingual Repositories Dataset: What You Need to Know

GitHub Releases a Major Open Dataset for Multilingual AI Research

When most people think about software development, they picture lines of code. But behind every repository lies something just as important: human language. Developers write READMEs to explain their projects, open issues to ask for help, and engage in pull requests to debate and improve their work. While English has long been the dominant language of these conversations, the reality of global software development is far more multilingual than many assume. Now, GitHub is taking a significant step to reflect that reality with the release of the GitHub Multilingual Repositories Dataset — a free, open metadata dataset designed to help researchers and developers discover public repositories where non-English natural-language content is actively present.

Available under the CC0-1.0 license on GitHub, the dataset covers over 80 million classification rows spanning more than 40 million repositories, which makes it a large, openly available multilingual metadata resource for the developer ecosystem. Its release is a notable milestone for anyone working at the intersection of open source software and multilingual artificial intelligence.

Why Multilingual Data Matters for AI Development

As AI-powered tools become increasingly central to how developers build software, from code completion to automated documentation, the language diversity of training and research data matters more than ever. If AI models are trained predominantly on English-language developer content, they are likely to perform worse for the many developers who collaborate in Portuguese, Korean, Chinese, Spanish, Arabic, and dozens of other languages.

This gap has real-world consequences. A developer in São Paulo opening a GitHub issue in Portuguese, or a team in Seoul collaborating on a pull request in Korean, deserves AI tools that understand and assist them as effectively as their English-speaking counterparts. The GitHub Multilingual Repositories Dataset is designed to help narrow that gap: it does not supply repository content itself, but it shows researchers where non-English content is present, so they can study, benchmark, and improve multilingual AI systems for developer workflows.

What the Dataset Contains

It is worth emphasizing what this dataset is — and what it is not. GitHub has deliberately chosen not to release a raw dump of repository content. Instead, the dataset is a metadata resource that helps developers and researchers identify repositories where multilingual collaboration is likely occurring. This approach balances research utility with respect for the communities that generate the underlying content.

For each repository included, the dataset provides classification-level metadata indicating the presence of non-English natural language. This covers three key areas of developer communication:

READMEs — the primary documentation face of any repository, often the first thing a new contributor or user reads.
Issues — the conversational threads where bugs are reported, features are requested, and problems are solved.
Pull requests — the collaborative review spaces where code is discussed, debated, and merged.
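To make the idea of classification-level metadata concrete, here is a minimal Python sketch of how such rows might be filtered. The published schema is not described here, so the row shape (a repository identifier, a content type, and a language code), the field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class ClassificationRow:
    """One hypothetical classification row; real column names may differ."""
    repo: str          # assumed repository identifier such as "owner/name"
    content_type: str  # assumed values: "readme", "issue", "pull_request"
    language: str      # assumed language code such as "pt" or "ko"


def repos_with_language(
    rows: Iterable[ClassificationRow],
    language: str,
    content_type: Optional[str] = None,
) -> Set[str]:
    """Return repositories flagged for `language`, optionally in one content type."""
    return {
        row.repo
        for row in rows
        if row.language == language
        and (content_type is None or row.content_type == content_type)
    }


# Invented sample rows for illustration only; these are not real dataset entries.
SAMPLE_ROWS = [
    ClassificationRow("example/alpha", "readme", "pt"),
    ClassificationRow("example/alpha", "issue", "pt"),
    ClassificationRow("example/beta", "issue", "ko"),
    ClassificationRow("example/gamma", "pull_request", "ko"),
    ClassificationRow("example/delta", "readme", "pt"),
]

if __name__ == "__main__":
    # Repositories with Portuguese READMEs, then Korean in any content type.
    print(sorted(repos_with_language(SAMPLE_ROWS, "pt", "readme")))
    print(sorted(repos_with_language(SAMPLE_ROWS, "ko")))
```

Keeping the content type as an optional filter reflects the point that README, issue, and pull request signals are recorded separately and can be queried separately.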

One notable finding from the dataset is that language distribution is not uniform across these three content types. Korean is the most common non-English language found in issue text, which suggests an active Korean-speaking developer community doing collaborative problem-solving on GitHub. Yet Korean ranks only fifth in READMEs. The distinction matters for researchers: a ranking drawn from READMEs alone would understate how much Korean appears in issues.

At the top of the README rankings sits Portuguese, with more than 3 million repositories containing Portuguese-language README content. This makes Portuguese the most prevalent non-English README language in the dataset, a finding that underscores the scale and vibrancy of the Brazilian and broader Lusophone developer communities.
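A per-content-type ranking like the one described above could be computed along the following lines. This is a sketch under assumptions: the rows are modeled as hypothetical (repository, content type, language) tuples, and the sample data is invented, so its counts say nothing about the real distribution.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Each row is a hypothetical (repo, content_type, language) tuple.
Row = Tuple[str, str, str]


def rank_languages(rows: Iterable[Row]) -> Dict[str, List[Tuple[str, int]]]:
    """Rank languages per content type by number of distinct repositories."""
    repos: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for repo, content_type, language in rows:
        repos[(content_type, language)].add(repo)

    rankings: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for (content_type, language), repo_set in repos.items():
        rankings[content_type].append((language, len(repo_set)))

    # Sort by descending repository count, then by language code to break ties.
    for ranked in rankings.values():
        ranked.sort(key=lambda item: (-item[1], item[0]))
    return dict(rankings)


# Invented rows for illustration; not real dataset entries.
SAMPLE = [
    ("r1", "readme", "pt"),
    ("r2", "readme", "pt"),
    ("r3", "readme", "ko"),
    ("r3", "issue", "ko"),
    ("r4", "issue", "ko"),
    ("r1", "issue", "pt"),
    ("r1", "issue", "pt"),  # duplicate row: the repository is counted once
]

if __name__ == "__main__":
    for content_type, ranked in rank_languages(SAMPLE).items():
        print(content_type, ranked)
```

Counting distinct repositories rather than rows keeps a repository with many matching rows from inflating its language's rank.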

A Commitment to Open Multilingual AI Infrastructure

The release of this dataset is not an isolated product decision. It follows through on a commitment GitHub made in 2025 as part of Microsoft's European Digital Commitments — a set of pledges designed to ensure that AI development infrastructure is more open, accountable, and inclusive. One specific component of those commitments was to make multilingual data more accessible, particularly to open source AI developers who lack the resources to independently compile datasets of this scale.

By publishing under the CC0-1.0 license, which effectively dedicates the dataset to the public domain, GitHub has removed the copyright and attribution barriers that often slow research adoption. Any researcher, startup, university lab, or independent developer can use, modify, and build upon the dataset itself without restriction; the repositories it points to remain under their own licenses. That openness is critical to ensuring the benefits of this resource are widely distributed rather than concentrated among a handful of well-resourced organizations.

How Researchers and Developers Can Use This Dataset

The practical applications of the GitHub Multilingual Repositories Dataset are broad. For AI researchers, it provides a structured starting point for identifying multilingual corpora and studying language model performance on developer-specific tasks. For tool builders, it supports the creation of developer assistants and documentation generators that work well beyond English. For linguists and computational language researchers, it points to where natural language is used in technical collaboration, and in which languages.

Some of the most immediate use cases include:

Benchmarking multilingual code-related language models against real-world repository data.
Studying patterns in how non-English developer communities structure their documentation and issue reporting.
Building recommendation or discovery systems that surface multilingual repositories to global contributors.
Training classifiers to detect language presence in developer content with greater precision.
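As one illustration of the discovery use case, the sketch below scores repositories by how many content types contain a language the contributor reads. The scoring rule, the function name suggest_repos, and the (repository, content type, language) row shape are all assumptions made for this example, not part of the dataset or of any GitHub API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Each row is a hypothetical (repo, content_type, language) tuple.
Row = Tuple[str, str, str]


def suggest_repos(
    rows: Iterable[Row],
    languages: Set[str],
    limit: int = 10,
) -> List[Tuple[str, int]]:
    """Score each repository by how many content types match `languages`."""
    matched: Dict[str, Set[str]] = defaultdict(set)
    for repo, content_type, language in rows:
        if language in languages:
            matched[repo].add(content_type)

    scored = [(repo, len(types)) for repo, types in matched.items()]
    # Highest score first; repository name breaks ties for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:limit]


# Invented rows for illustration; not real dataset entries.
DISCOVERY_SAMPLE = [
    ("docs/guia", "readme", "pt"),
    ("docs/guia", "issue", "pt"),
    ("docs/guia", "pull_request", "pt"),
    ("lib/ferramenta", "readme", "pt"),
    ("lib/ferramenta", "issue", "es"),
    ("app/dogu", "issue", "ko"),
]

if __name__ == "__main__":
    print(suggest_repos(DISCOVERY_SAMPLE, {"pt", "es"}))
```

A real system would likely weigh other signals as well, such as activity or topic; this sketch only shows how the language metadata could serve as a first filter.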

The Bigger Picture: Toward Truly Global Developer AI

The release of the GitHub Multilingual Repositories Dataset is a reminder that building AI tools for developers is not just a technical challenge — it is a cultural and linguistic one. The open source ecosystem is genuinely global, and the conversations that drive it happen in many languages. Datasets like this one are essential infrastructure for anyone serious about building AI that serves all developers, not just those who happen to work in English.

As AI continues to reshape how software is written, reviewed, and maintained, the quality and diversity of the data underlying these systems will determine who benefits and who is left behind. GitHub's decision to open this resource to the world is a meaningful step toward a more inclusive future for developer tooling, and an invitation to the research community to build on it.

The GitHub Multilingual Repositories Dataset is available now on GitHub under CC0-1.0. Researchers and developers are encouraged to explore, use, and build upon it freely.