A 6 Billion-Record Breach: Anatomy of the Largest Known Chinese PII Data Leak

In January 2026, security researchers discovered an exposed Elasticsearch cluster on a bulletproof hosting server. As Cybernews reported, the cluster contained billions of recently imported records across a variety of differently formatted indexes, suggesting that the data was aggregated from various breached sources, likely with malicious intent.

Summary of the data leak

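SpyCloud Labs obtained a copy of the exposed dataset and, from the original 8.7 billion records, was able to parse out 6.38 billion unique records, including over 4.48 billion phone numbers, 3.61 billion full names, 2.55 billion Chinese national ID numbers, and 433.2 million passwords.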

This is the largest known leaked dataset of Chinese personally identifiable information (PII). Both domestic and international attackers can use this data to exploit Chinese citizens. The data can also be used to investigate and identify Chinese-language threat actors and associated threats.

How was the data stored in the exposed database?

The exposed PII was distributed across 39 different indices, which we’ve listed below with their original titles, number of records, and the types of records present, alongside the corresponding SpyCloud source catalog information showing how each index is listed in our data lake. Note that Cybernews reported 160 indices in the cluster; however, we found worthwhile quantities of PII in only 39 of them.

SpyCloud Breach ID | SpyCloud Breach Title | Original Exposed Index Title | Total Extracted Records | Data Asset Types
131582 | Exposed Chinese Dataset – Part 1 | bcsj | 2,557,712 | national_id, country, city, address_1, cc_bin, county, social_qq, birth_year, cc_last_four, country_code, full_name, phone, dob, bank_name, mobile_carrier, state, cc_number, postal_code, email, username
132818 | Exposed Chinese Dataset – Part 2 | company | 146,503,137 | country, city, address_1, county, industry, social_weibo, country_code, account_id, full_name, phone, company_name, mobile_carrier, state, postal_code, email
132630 | Exposed Chinese Dataset – Part 3 | ddcx | 62,259,049 | national_id, full_name, vehicle_identification_number, city, phone, address_1, dob, county, mobile_carrier, state, birth_year
131587 | Exposed Chinese Dataset – Part 4 | dksj | 4,846,837 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email
135839 | Exposed Chinese Dataset – Part 5 | erys | 1,107,186,697 | national_id, country, country_code, full_name, city, address_1, dob, county, state, birth_year
131588 | Exposed Chinese Dataset – Part 6 | fjgd | 618,834 | national_id, full_name, city, address_1, dob, county, state, postal_code, birth_year
132827 | Exposed Chinese Dataset – Part 7 | glsj | 1,223,748 | full_name, phone
132875 | Exposed Chinese Dataset – Part 8 | hdyxlt | 28,553,065 | national_id, password, city, phone, address_1, dob, county, mobile_carrier, state, email, birth_year
135840 | Exposed Chinese Dataset – Part 9 | imei | 866,017,458 | city, mobile_equipment_id, phone, mobile_carrier, state
131597 | Exposed Chinese Dataset – Part 10 | jxxy | 146,351 | national_id, educational_institution, full_name, city, phone, address_1, dob, county, state, birth_year
135118 | Exposed Chinese Dataset – Part 11 | jzsj | 854,918,746 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, password, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email
131600 | Exposed Chinese Dataset – Part 12 | kdsj | 1,020,948 | account_id, full_name, city, phone, mobile_carrier, state
132652 | Exposed Chinese Dataset – Part 13 | kdwm | 78,508,644 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email
132634 | Exposed Chinese Dataset – Part 14 | kfc | 30,686,774 | country, country_code, full_name, city, phone, address_1, county, mobile_carrier, state, postal_code, email
132863 | Exposed Chinese Dataset – Part 15 | kfsj | 25,311,858 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email
132923 | Exposed Chinese Dataset – Part 16 | ksdd | 16,198,222 | country, city, address_1, county, country_code, password, account_id, full_name, phone, mobile_carrier, state, postal_code, email
132921 | Exposed Chinese Dataset – Part 17 | lnsj | 1,947,336 | password, city, phone, mobile_carrier, state, email
132965 | Exposed Chinese Dataset – Part 18 | mmsj | 1,947,337 | password, city, phone, mobile_carrier, state, email
133367 | Exposed Chinese Dataset – Part 19 | ppx | 167,113,787 | password, city, phone, mobile_carrier, state, email
132784 | Exposed Chinese Dataset – Part 20 | qyjgtgs | 16,724,308 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, phone, dob, company_name, mobile_carrier, state, postal_code, email
135836 | Exposed Chinese Dataset – Part 21 | sanys | 355,999,499 | national_id, country, country_code, full_name, city, phone, address_1, dob, county, mobile_carrier, state, birth_year
132922 | Exposed Chinese Dataset – Part 22 | sfkd | 53,952,981 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email
133385 | Exposed Chinese Dataset – Part 23 | sgsj | 88,556,202 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email
133349 | Exposed Chinese Dataset – Part 24 | shssm | 43,713,638 | national_id, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, full_name, phone, dob, company_name, mobile_carrier, state
133124 | Exposed Chinese Dataset – Part 25 | tbsj | 22,234,473 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, password, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email, username
134960 | Exposed Chinese Dataset – Part 26 | txqq | 516,223,656 | city, phone, mobile_carrier, state, social_qq, email
134326 | Exposed Chinese Dataset – Part 27 | weixin | 302,205,360 | full_name, city, phone, social_wechat, mobile_carrier, account_image_url, state, email
134331 | Exposed Chinese Dataset – Part 28 | wyy | 253,554,717 | password, email, username
133364 | Exposed Chinese Dataset – Part 29 | xjsj | 155,912,346 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, student_id, ec_full_name, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, ec_phone, state, postal_code, email
133233 | Exposed Chinese Dataset – Part 30 | xmlt | 8,280,578 | password, salt, ip_addresses, email, username
135902 | Exposed Chinese Dataset – Part 31 | ylsj | 365,123,430 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, cc_bin, county, birth_year, cc_last_four, country_code, full_name, phone, dob, company_name, mobile_carrier, state, cc_number, email
133375 | Exposed Chinese Dataset – Part 32 | yxlm | 91,696,176 | account_id
133202 | Exposed Chinese Dataset – Part 33 | zaw | 5,024,324 | password, full_name, city, phone, address_1, county, mobile_carrier, state, postal_code, email, username
133590 | Exposed Chinese Dataset – Part 34 | zczgz | 620,480 | national_id, full_name, city, address_1, dob, county, state, birth_year
134961 | Exposed Chinese Dataset – Part 35 | zfb | 388,203,043 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, state, email
133231 | Exposed Chinese Dataset – Part 36 | zgsj | 11,192,757 | national_id, country, educational_institution, city, address_1, county, ec_full_name, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, ec_phone, state, postal_code, job_title, email
136235 | Exposed Chinese Dataset – Part 37 | zhys | 198,017,160 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, account_id, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email
133126 | Exposed Chinese Dataset – Part 38 | zlzp | 1,747,797 | national_id, country, city, address_1, county, birth_year, country_code, full_name, phone, dob, mobile_carrier, state, postal_code, email
133651 | Exposed Chinese Dataset – Part 39 | zyh | 104,215,621 | national_id, country, educational_institution, vehicle_identification_number, city, address_1, county, birth_year, country_code, full_name, phone, dob, company_name, mobile_carrier, state, postal_code, email

Decoding the database titles

As you can see, most of the original index titles appear to be abbreviations, although some are easier to decipher than others. Most look like abbreviations of the pinyin spellings of Chinese words. Many end with the letters “sj”, likely an abbreviation for 数据 (shùjù), the Chinese word for data. For example, the very first index title in the table is bcsj, likely 博彩数据 (bócǎi shùjù), or gambling data. The data does appear to contain financial PII, including bank and credit card information, matching what you might expect from data marketed as gambling data on Chinese cybercriminal forums. Other indexes that we can identify with moderate confidence as including the word shùjù are:

Some of the other indexes that we were able to identify matched well-known brand names:

A select few indexes appear to be English words or abbreviations. One of the indexes is simply labeled company. Another is imei, a common abbreviation for the unique international mobile equipment identity number assigned to most mobile devices. The records in this index all include these mobile device identifiers. There is also a good chance that the index labeled kfc actually does correspond to Kentucky Fried Chicken, which is massively popular in China and has over 11,000 Chinese locations.
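As a quick illustration of the “sj” suffix pattern described above, the following minimal Python sketch splits the 39 index titles from the table by whether they end in “sj”. The helper name split_by_suffix is hypothetical and ours, and the sketch checks only the suffix; it makes no claim about what any individual abbreviation stands for.

```python
# Index titles copied from the table of 39 indices above.
INDEX_TITLES = [
    "bcsj", "company", "ddcx", "dksj", "erys", "fjgd", "glsj", "hdyxlt",
    "imei", "jxxy", "jzsj", "kdsj", "kdwm", "kfc", "kfsj", "ksdd", "lnsj",
    "mmsj", "ppx", "qyjgtgs", "sanys", "sfkd", "sgsj", "shssm", "tbsj",
    "txqq", "weixin", "wyy", "xjsj", "xmlt", "ylsj", "yxlm", "zaw", "zczgz",
    "zfb", "zgsj", "zhys", "zlzp", "zyh",
]


def split_by_suffix(titles, suffix="sj"):
    """Return (matching, other): titles ending in `suffix`, and the rest."""
    matching = [t for t in titles if t.endswith(suffix)]
    other = [t for t in titles if not t.endswith(suffix)]
    return matching, other


if __name__ == "__main__":
    sj_titles, other_titles = split_by_suffix(INDEX_TITLES)
    print(len(sj_titles), "titles end in 'sj':", ", ".join(sj_titles))
```

Run against the table, 13 of the 39 titles end in “sj”. A suffix match is only a starting hint: each candidate still needs a human reading of the remaining letters and the index contents.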

Other strong hypotheses we have about database titles include:

Derived data

Most of the data itself was stored in direct key-value pairs that clearly labeled the data asset types in a mixture of English and Chinese. Indexes that contain idcard or phone data also appear to contain additional data labeled idcardInfo or phoneInfo, respectively. These ‘Info’ keys appear to contain additional data that was derived from the national ID or phone number, probably after the data was exfiltrated from its original source.
Example of full raw data structure (left), and just the _source key (right) showing idcard, idcardInfo, phone and phoneInfo.
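We do not know how the actor actually produced the idcardInfo values, but the kind of derivation described above is straightforward. The sketch below assumes the standard 18-character Chinese resident ID layout (6-digit region code, 8-digit birth date, 3-digit sequence whose last digit is odd for male and even for female, and a MOD 11-2 check character). The function name derive_id_info and the output keys are hypothetical, and the sample number is a commonly cited illustrative value, not a record from the leak.

```python
from datetime import date

# Positional weights and check-character table for the MOD 11-2 scheme
# used by 18-character Chinese resident ID numbers.
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CHARS = "10X98765432"


def derive_id_info(national_id: str) -> dict:
    """Derive region code, birth date, gender, and checksum validity."""
    nid = national_id.strip().upper()
    if len(nid) != 18 or not nid[:17].isdigit():
        raise ValueError("expected 17 digits plus a check character")
    # Weighted sum of the first 17 digits selects the expected check character.
    total = sum(int(d) * w for d, w in zip(nid[:17], WEIGHTS))
    # Characters 7-14 encode the birth date as YYYYMMDD.
    dob = date(int(nid[6:10]), int(nid[10:12]), int(nid[12:14]))
    return {
        "region_code": nid[:6],  # maps to province / city / county
        "dob": dob.isoformat(),
        "birth_year": dob.year,
        "gender": "male" if int(nid[16]) % 2 else "female",
        "checksum_valid": CHECK_CHARS[total % 11] == nid[17],
    }


if __name__ == "__main__":
    print(derive_id_info("11010519491231002X"))
```

Turning the region code into named state, city, and county values would additionally require an administrative-division lookup table, which is omitted here.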

What this database likely represents within the Chinese cybercrime ecosystem

Based on the data present, the fact that it appeared to have been recently compiled together from a variety of unrelated sources, and the fact that it was discovered on a bulletproof hosting service, we believe that this was most likely a collection of data assembled by a Chinese criminal actor who was planning to sell the data, either as a complete collection or as the backend for an illicit PII lookup service.

In the past, we have written about Chinese cybercriminal PII sellers and the lookup services they offer, called SGKs. SGKs, short for 社工库 (shègōng kù) or social engineering libraries, are repositories of leaked and stolen PII created by Chinese-language threat actors. They generally compile hacked and leaked databases so that PII on Chinese citizens and users can be queried easily.

SpyCloud tracks dozens of these SGKs, which often take the form of basic clearnet websites or Telegram bots. Often SGKs include basic PII lookup bots, which usually return data similar to the data contained in this leaked database, as well as “premium lookup” services, where customers can pay to have corrupt insiders who work in the government, telcos, or banks retrieve more detailed data from privileged sources. 

Sample results from a basic SGK query. This SGK interfaces with users via a Telegram bot.

Visualizing the data from this leak

We were able to extract 6.38 billion unique records from 39 indices in the database. Across these indices, city and state/province were the most common data asset types, likely because they appeared in indexes where they were derived from the national ID, in indexes where they were derived from a phone number (from the number’s area code), and in indexes that included a full address.

phone and mobile_carrier are the next most common; they are extremely close in number because mobile_carrier appears to have been looked up based on the phone value. These are followed by full_name, and then national_id, county, gender, and dob. These latter four data assets are also relatively close in amount because county, gender, and birthdate generally appear to have been derived from the national_id number.

Pie chart showing the frequency of data asset types that we parsed out of the dataset.

We also mapped out leaked users of the DiDi app, a Chinese rideshare and food delivery service. It’s not terribly surprising that Henan and Sichuan, the third and fifth most populous provinces respectively, are the most heavily represented in the data. It’s a little less expected that third on the list is Yunnan, which is the 12th most populous province in China. We aren’t totally sure why the data doesn’t roughly track with the populations of the provinces here. Unfortunately, it’s hard to speculate further without knowing more about this data’s provenance than is available.

Choropleth map showing number of DiDi app users present in the leaked ddcx index per province.

We also looked at the age distribution for the DiDi index. The average age of individuals in this dataset is 36. Interestingly, the data cuts off sharply below age 20: no individuals in the data have a birthdate later than 2006.

At first we thought that this might indicate that the data only included DiDi drivers, a hypothesis further bolstered by the presence of VINs in the index. However, DiDi drivers actually have a minimum age of 21, and only about 4,000 of the 62 million records in the dataset include VINs.

Instead, we think this cutoff is actually more likely to be an artifact of the date when the data was exfiltrated. DiDi imposes a minimum rider age of 16 (it used to be 18, but was decreased to 16 in 2019). Therefore, we can conclude with moderate confidence that the data in this index was likely exfiltrated from its original source around 4 years ago (in 2022) – when individuals born in 2006 were turning 16.
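The arithmetic behind that estimate can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that the youngest people in the data signed up as soon as they reached the minimum rider age, which is a simplification.

```python
def estimated_exfiltration_year(latest_birth_year: int, minimum_age: int) -> int:
    """Year in which the youngest cohort in the data first reached the minimum age."""
    return latest_birth_year + minimum_age


if __name__ == "__main__":
    year = estimated_exfiltration_year(latest_birth_year=2006, minimum_age=16)
    print(year)         # 2022
    print(2026 - year)  # 4 years before the cluster was discovered
```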

Bar graph showing the age distribution of individuals present in the leaked ddcx data.

Next, we took a look at the two main indexes with credit card numbers: bcsj (gambling data) and ylsj (UnionPay data). One notable difference is the lack of any JCB (Japan Credit Bureau) or Visa Electron (Visa Debit) cards in the gambling dataset, compared with a significant number of each in the UnionPay data. After further research, we found that while both of these financial institutions are based outside of China, UnionPay actually offers co-branded cards with both institutions.

Common credit card BINs in the gambling and UnionPay datasets.
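For readers who want to run a similar comparison on their own data, here is a minimal Python sketch that buckets card numbers by network using their leading digits. It is not necessarily how the chart above was produced. The prefix ranges are simplified versions of widely published issuer identification ranges (for example, it treats only the 62 prefix as UnionPay and does not separate Visa Electron from other Visa cards), co-branded cards are attributed purely by prefix, and the function names and sample numbers are hypothetical.

```python
from collections import Counter


def card_network(card_number: str) -> str:
    """Best-effort network guess from the leading digits (simplified ranges)."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    if len(digits) < 6:
        return "unknown"
    if digits.startswith("62"):
        return "UnionPay"
    if digits.startswith("4"):
        return "Visa"
    if 3528 <= int(digits[:4]) <= 3589:
        return "JCB"
    if 51 <= int(digits[:2]) <= 55 or 2221 <= int(digits[:4]) <= 2720:
        return "Mastercard"
    if digits[:2] in ("34", "37"):
        return "American Express"
    return "other"


def tally_networks(card_numbers) -> Counter:
    """Count how many card numbers fall into each network bucket."""
    return Counter(card_network(n) for n in card_numbers)


if __name__ == "__main__":
    # Synthetic placeholder numbers, not real cards and not from the leak.
    sample = ["6200 0000 0000 0000", "4000000000000000", "3528000000000000"]
    print(tally_networks(sample))
```

Comparing the resulting tallies for two indexes side by side is one way to surface differences like the JCB and Visa Electron gap described above.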

Finally, we decided to visualize both the Western and Chinese zodiac symbols in the data. Why? Because it’s not every day that we ingest data containing zodiac information.

Each zodiac system has 12 categories, with the Western zodiac rotating on a monthly basis over an annual cycle and the Chinese zodiac rotating on an annual basis over a 12-year cycle. As you can see, both zodiac systems appear to be relatively evenly distributed in the leaked data. Libra and Rabbit are the most common signs in their respective systems, each by a small margin. This roughly (but not exactly) even distribution is what we’d expect from real-world data.

Distribution of astrological signs across the entire leaked dataset.

Distribution of Chinese zodiac signs across the entire leaked dataset.
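We can’t say whether the zodiac values in the leak were collected at the source or computed later, but both signs can be computed from a birthdate, as the minimal Python sketch below shows. It rests on two simplifying assumptions: the Chinese zodiac is taken from the Gregorian birth year, ignoring the Lunar New Year boundary (so people born in January or early February may be assigned the following animal), and the Western sign boundaries are conventional dates that can differ by a day between sources. The function names are hypothetical.

```python
from datetime import date

ANIMALS = ["Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
           "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig"]

# (month, day) on which each Western sign starts, in calendar order.
WESTERN_STARTS = [
    ((1, 20), "Aquarius"), ((2, 19), "Pisces"), ((3, 21), "Aries"),
    ((4, 20), "Taurus"), ((5, 21), "Gemini"), ((6, 21), "Cancer"),
    ((7, 23), "Leo"), ((8, 23), "Virgo"), ((9, 23), "Libra"),
    ((10, 23), "Scorpio"), ((11, 22), "Sagittarius"), ((12, 22), "Capricorn"),
]


def chinese_zodiac(birth_year: int) -> str:
    """Animal for a Gregorian year (ignores the Lunar New Year boundary)."""
    return ANIMALS[(birth_year - 4) % 12]


def western_zodiac(dob: date) -> str:
    """Western sign whose start date most recently precedes the birthdate."""
    sign = "Capricorn"  # covers 1 January to 19 January
    for start, name in WESTERN_STARTS:
        if (dob.month, dob.day) >= start:
            sign = name
    return sign


if __name__ == "__main__":
    dob = date(1987, 10, 1)
    print(western_zodiac(dob), chinese_zodiac(dob.year))  # Libra Rabbit
```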

Who is exposed in this massive leak?

This is an extremely thorough dataset in terms of its coverage of the Chinese population, amounting to massive exposure for Chinese citizens.

A practical example of the scale of exposure

Let’s look at what this means in practice. Take mfa.gov.cn, the domain for the Chinese Ministry of Foreign Affairs (MFA). In this 6-billion-record leak alone, we can identify 133 unique @mfa.gov.cn email addresses across 207 total records, meaning we have an average of 1.6 distinct records per MFA employee email address.

If we pivot off of the national IDs and phone numbers included in this first set of results, we get 563 total records, for an average of 4.2 distinct records for each unique MFA employee email address we can find in the breach. At SpyCloud, we refer to this concept as holistic identity exposure: the idea that the exposure of an employee’s personal information also puts their employer’s organizational identity security at risk. Finally, if we tally up all of the assets across each of those 563 records, we get 3,469 matching data assets, for an average of 26 matching data assets per unique @mfa.gov.cn email address.
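The per-address averages quoted above follow directly from the counts. A short Python sketch of the arithmetic, with hypothetical variable and function names:

```python
# Counts quoted in the text above.
UNIQUE_EMAILS = 133
DIRECT_RECORDS = 207    # records containing a matching email address
PIVOTED_RECORDS = 563   # records after pivoting on national ID and phone
MATCHING_ASSETS = 3469  # data assets across the pivoted records


def per_email(total: int, emails: int = UNIQUE_EMAILS) -> float:
    """Average count per unique email address."""
    return total / emails


if __name__ == "__main__":
    print(round(per_email(DIRECT_RECORDS), 1))   # 1.6
    print(round(per_email(PIVOTED_RECORDS), 1))  # 4.2
    print(round(per_email(MATCHING_ASSETS)))     # 26
```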

Key takeaways from SpyCloud’s analysis of the leak

There is so much data in this leak, and many opportunities to slice and dice what’s inside, but our key takeaways include:

Scale and scope

This is the largest known leaked dataset of Chinese PII ever discovered, containing 6.38 billion unique records.

Data assets

The leak aggregates PII across 39 indices, including over 4.48 billion phone numbers, 3.61 billion full names, 2.55 billion Chinese national ID numbers, and 433.2 million passwords.

Source aggregation

The data was compiled from various breached sources, likely by a Chinese-language actor planning to sell it or use it as a backend for illicit PII lookup services known as SGKs.

Exposure quantity

The national ID records alone cover approximately 58% of China’s 1.4 billion population. On average, each unique national ID appears in 3.1 records, meaning the majority of Chinese citizens have personal information exposed from multiple distinct sources.

Breadth of exposures

The leak demonstrates holistic identity exposure, where pivoting off a single piece of PII (like a national ID or phone number) can yield multiple records and associated assets about an individual. For example, 133 unique @mfa.gov.cn email addresses were associated with an average of 4.2 distinct records and 26 matching data assets per employee.

This massive breach contains data about most Chinese people, exposing Chinese individuals, Chinese organizations, and multinational organizations alike. It also showcases the immense breadth and depth of breached personal data being collected and aggregated by Chinese cybercriminals.
