In January 2026, security researchers discovered an Elasticsearch cluster containing billions of records sitting exposed on a bulletproof hosting server. As Cybernews reported, the cluster contained billions of recently-imported records across a variety of differently-formatted indexes, suggesting that the data was aggregated from various breached sources, likely with malicious intent.
Summary of the data leak
SpyCloud Labs obtained a copy of the exposed dataset and, from the original 8.7 billion, were able to parse out 6.38 billion unique records including:
- 2,545,485,625 records containing a Chinese national ID number
- 4,483,730,340 records containing a phone number
- 3,612,140,888 records containing a full name
- 737,517,498 records containing a physical address
- 508,597,386 records containing an email address
- 433,168,228 records containing a password
This is the largest known leak of Chinese personally identifiable information (PII) to date. Attackers, both domestic and international, can use this data to exploit Chinese citizens. This data can also be used to investigate and identify Chinese-language threat actors and associated threats.
How was the data stored in the exposed database?
The exposed PII was distributed across 39 different indices, which we’ve listed below with their original titles, number of records, and the types of records present, alongside the corresponding SpyCloud source catalog information showing how each index is listed in our data lake. As a note, Cybernews reported there were 160 indices in the cluster; however, we only found meaningful quantities of PII in 39 of them.
| SpyCloud Breach ID | SpyCloud Breach Title | Original Exposed Index Title | Total Extracted Records | Data Asset Types |
| --- | --- | --- | --- | --- |
| 131582 | Exposed Chinese Dataset – Part 1 | bcsj | 2,557,712 | [‘national_id’, ‘country’, ‘city’, ‘address_1’, ‘cc_bin’, ‘county’, ‘social_qq’, ‘birth_year’, ‘cc_last_four’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘bank_name’, ‘mobile_carrier’, ‘state’, ‘cc_number’, ‘postal_code’, ‘email’, ‘username’] |
| 132818 | Exposed Chinese Dataset – Part 2 | company | 146,503,137 | [‘country’, ‘city’, ‘address_1’, ‘county’, ‘industry’, ‘social_weibo’, ‘country_code’, ‘account_id’, ‘full_name’, ‘phone’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 132630 | Exposed Chinese Dataset – Part 3 | ddcx | 62,259,049 | [‘national_id’, ‘full_name’, ‘vehicle_identification_number’, ‘city’, ‘phone’, ‘address_1’, ‘dob’, ‘county’, ‘mobile_carrier’, ‘state’, ‘birth_year’] |
| 131587 | Exposed Chinese Dataset – Part 4 | dksj | 4,846,837 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 135839 | Exposed Chinese Dataset – Part 5 | erys | 1,107,186,697 | [‘national_id’, ‘country’, ‘country_code’, ‘full_name’, ‘city’, ‘address_1’, ‘dob’, ‘county’, ‘state’, ‘birth_year’] |
| 131588 | Exposed Chinese Dataset – Part 6 | fjgd | 618,834 | [‘national_id’, ‘full_name’, ‘city’, ‘address_1’, ‘dob’, ‘county’, ‘state’, ‘postal_code’, ‘birth_year’] |
| 132827 | Exposed Chinese Dataset – Part 7 | glsj | 1,223,748 | [‘full_name’, ‘phone’] |
| 132875 | Exposed Chinese Dataset – Part 8 | hdyxlt | 28,553,065 | [‘national_id’, ‘password’, ‘city’, ‘phone’, ‘address_1’, ‘dob’, ‘county’, ‘mobile_carrier’, ‘state’, ‘email’, ‘birth_year’] |
| 135840 | Exposed Chinese Dataset – Part 9 | imei | 866,017,458 | [‘city’, ‘mobile_equipment_id’, ‘phone’, ‘mobile_carrier’, ‘state’] |
| 131597 | Exposed Chinese Dataset – Part 10 | jxxy | 146,351 | [‘national_id’, ‘educational_institution’, ‘full_name’, ‘city’, ‘phone’, ‘address_1’, ‘dob’, ‘county’, ‘state’, ‘birth_year’] |
| 135118 | Exposed Chinese Dataset – Part 11 | jzsj | 854,918,746 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘password’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 131600 | Exposed Chinese Dataset – Part 12 | kdsj | 1,020,948 | [‘account_id’, ‘full_name’, ‘city’, ‘phone’, ‘mobile_carrier’, ‘state’] |
| 132652 | Exposed Chinese Dataset – Part 13 | kdwm | 78,508,644 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 132634 | Exposed Chinese Dataset – Part 14 | kfc | 30,686,774 | [‘country’, ‘country_code’, ‘full_name’, ‘city’, ‘phone’, ‘address_1’, ‘county’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 132863 | Exposed Chinese Dataset – Part 15 | kfsj | 25,311,858 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 132923 | Exposed Chinese Dataset – Part 16 | ksdd | 16,198,222 | [‘country’, ‘city’, ‘address_1’, ‘county’, ‘country_code’, ‘password’, ‘account_id’, ‘full_name’, ‘phone’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 132921 | Exposed Chinese Dataset – Part 17 | lnsj | 1,947,336 | [‘password’, ‘city’, ‘phone’, ‘mobile_carrier’, ‘state’, ‘email’] |
| 132965 | Exposed Chinese Dataset – Part 18 | mmsj | 1,947,337 | [‘password’, ‘city’, ‘phone’, ‘mobile_carrier’, ‘state’, ‘email’] |
| 133367 | Exposed Chinese Dataset – Part 19 | ppx | 167,113,787 | [‘password’, ‘city’, ‘phone’, ‘mobile_carrier’, ‘state’, ‘email’] |
| 132784 | Exposed Chinese Dataset – Part 20 | qyjgtgs | 16,724,308 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 135836 | Exposed Chinese Dataset – Part 21 | sanys | 355,999,499 | [‘national_id’, ‘country’, ‘country_code’, ‘full_name’, ‘city’, ‘phone’, ‘address_1’, ‘dob’, ‘county’, ‘mobile_carrier’, ‘state’, ‘birth_year’] |
| 132922 | Exposed Chinese Dataset – Part 22 | sfkd | 53,952,981 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 133385 | Exposed Chinese Dataset – Part 23 | sgsj | 88,556,202 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 133349 | Exposed Chinese Dataset – Part 24 | shssm | 43,713,638 | [‘national_id’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’] |
| 133124 | Exposed Chinese Dataset – Part 25 | tbsj | 22,234,473 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘password’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’, ‘username’] |
| 134960 | Exposed Chinese Dataset – Part 26 | txqq | 516,223,656 | [‘city’, ‘phone’, ‘mobile_carrier’, ‘state’, ‘social_qq’, ‘email’] |
| 134326 | Exposed Chinese Dataset – Part 27 | weixin | 302,205,360 | [‘full_name’, ‘city’, ‘phone’, ‘social_wechat’, ‘mobile_carrier’, ‘account_image_url’, ‘state’, ‘email’] |
| 134331 | Exposed Chinese Dataset – Part 28 | wyy | 253,554,717 | [‘password’, ‘email’, ‘username’] |
| 133364 | Exposed Chinese Dataset – Part 29 | xjsj | 155,912,346 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘student_id’, ‘ec_full_name’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘ec_phone’, ‘state’, ‘postal_code’, ‘email’] |
| 133233 | Exposed Chinese Dataset – Part 30 | xmlt | 8,280,578 | [‘password’, ‘salt’, ‘ip_addresses’, ‘email’, ‘username’] |
| 135902 | Exposed Chinese Dataset – Part 31 | ylsj | 365,123,430 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘cc_bin’, ‘county’, ‘birth_year’, ‘cc_last_four’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘cc_number’, ‘email’] |
| 133375 | Exposed Chinese Dataset – Part 32 | yxlm | 91,696,176 | [‘account_id’] |
| 133202 | Exposed Chinese Dataset – Part 33 | zaw | 5,024,324 | [‘password’, ‘full_name’, ‘city’, ‘phone’, ‘address_1’, ‘county’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’, ‘username’] |
| 133590 | Exposed Chinese Dataset – Part 34 | zczgz | 620,480 | [‘national_id’, ‘full_name’, ‘city’, ‘address_1’, ‘dob’, ‘county’, ‘state’, ‘birth_year’] |
| 134961 | Exposed Chinese Dataset – Part 35 | zfb | 388,203,043 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘email’] |
| 133231 | Exposed Chinese Dataset – Part 36 | zgsj | 11,192,757 | [‘national_id’, ‘country’, ‘educational_institution’, ‘city’, ‘address_1’, ‘county’, ‘ec_full_name’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘ec_phone’, ‘state’, ‘postal_code’, ‘job_title’, ‘email’] |
| 136235 | Exposed Chinese Dataset – Part 37 | zhys | 198,017,160 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘account_id’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 133126 | Exposed Chinese Dataset – Part 38 | zlzp | 1,747,797 | [‘national_id’, ‘country’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
| 133651 | Exposed Chinese Dataset – Part 39 | zyh | 104,215,621 | [‘national_id’, ‘country’, ‘educational_institution’, ‘vehicle_identification_number’, ‘city’, ‘address_1’, ‘county’, ‘birth_year’, ‘country_code’, ‘full_name’, ‘phone’, ‘dob’, ‘company_name’, ‘mobile_carrier’, ‘state’, ‘postal_code’, ‘email’] |
Decoding the database titles
As you can see, the original index titles mostly appear to be abbreviations, although some are easier to decipher than others. Most look like abbreviations of the pinyin spellings of Chinese words. Many end with the letters “sj” – likely an abbreviation for 数据 (shùjù), the Chinese word for data. For example, the very first index title in the table is bcsj – likely 博彩数据 (bócǎi shùjù) or gambling data. The data does appear to contain financial PII, including bank and credit card information, matching what you might expect from data marketed as gambling data on Chinese cybercriminal forums. Other indexes that appear to include the word shùjù, and that we can identify with moderate confidence, are:
- dksj which likely stands for 贷款数据 (dàikuǎn shùjù) or loan data,
- tbsj which likely stands for 淘宝数据 (táobǎo shùjù) or Taobao Data – referring to the popular Chinese e-commerce app,
- mmsj which appears to stand for 密码数据 (mìmǎ shùjù) or password data – the majority of these records contain phone numbers and associated passwords, and
- ylsj which likely stands for 银联数据 (yínlián shùjù) or UnionPay data – referring to China’s largest payment card network. We analyzed the BINs of credit card numbers within this dataset, and the majority do appear to be China UnionPay issued cards or UnionPay and Discover co-branded cards.
Several other index titles appear to refer to specific Chinese companies and services:
- ddcx likely stands for 滴滴出行 (dīdī chūxíng) – the full name of the DiDi ride-hailing and food delivery app. About 4,000 of the records in this index include vehicle identification numbers (VINs), data a rideshare app would likely collect from its drivers.
- txqq is almost certainly 腾讯QQ (Tencent QQ) – the popular messaging service developed by Tencent.
- weixin is unique in that it includes the full pinyin spelling of 微信 (wēixìn) instead of an abbreviation. Weixin is usually called WeChat in English, and is another popular free messaging app.
- wyy likely stands for 网易邮箱 (wǎngyì yóuxiāng), or NetEase Mail – a popular Chinese email service provider. Over half of the records contain email addresses ending in @163.com or @126.com, both popular domains that NetEase offers as freemail options in addition to their enterprise email services.
- zfb is likely short for 支付宝 (zhīfùbǎo), the Chinese name for AliPay.
A select few indexes appear to be English words or abbreviations. One of the indexes is simply labeled company. Another is imei, a common abbreviation for the unique international mobile equipment identity number assigned to most mobile devices. The records in this index all include these mobile device identifiers. There is also a good chance that the index labeled kfc actually does correspond to Kentucky Fried Chicken, which is massively popular in China and has over 11,000 Chinese locations.
Other strong hypotheses we have about database titles include:
- sanys, which probably stands for 三要素 (sān yàosù) or three elements, a common Chinese cybercriminal term for data that includes the three main personal data asset types: name, national ID, and phone number.
- jxxy likely stands for 驾校学员 (jiàxiào xuéyuán) or driving school student – the presence of the index "驾校名称" (driving school name) in the raw data also points to this conclusion.
- sfkd probably refers to 顺丰快递 (shùnfēng kuàidì) or SF Express Delivery – the largest shipping company in China.
- We also think that fjgd refers to Fujian province (福建), because all of the national ID numbers start with 35 – the province code for Fujian. We are less certain about the meaning of the last two letters, ‘gd’; they might refer to 归档 (guīdǎng), or archived records.
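One quick way to sanity-check hypotheses like these is to compare each index title against the initials of the proposed pinyin reading. The sketch below is illustrative only: the syllable lists are hand-transcribed from the readings proposed in this section (tone marks dropped), the combined reading for fjgd is tentative, and the names `HYPOTHESES`, `initials`, and `match_kind` are our own for this example rather than part of any existing tooling.

```python
# Hand-transcribed pinyin syllables (tone marks dropped) for the readings
# hypothesised in this post. This mapping is an illustration, not source data.
HYPOTHESES = {
    "bcsj": ["bo", "cai", "shu", "ju"],      # 博彩数据, gambling data
    "dksj": ["dai", "kuan", "shu", "ju"],    # 贷款数据, loan data
    "tbsj": ["tao", "bao", "shu", "ju"],     # 淘宝数据, Taobao data
    "mmsj": ["mi", "ma", "shu", "ju"],       # 密码数据, password data
    "ylsj": ["yin", "lian", "shu", "ju"],    # 银联数据, UnionPay data
    "ddcx": ["di", "di", "chu", "xing"],     # 滴滴出行, DiDi
    "zfb": ["zhi", "fu", "bao"],             # 支付宝, AliPay
    "wyy": ["wang", "yi", "you", "xiang"],   # 网易邮箱, NetEase Mail
    "jxxy": ["jia", "xiao", "xue", "yuan"],  # 驾校学员, driving school student
    "sfkd": ["shun", "feng", "kuai", "di"],  # 顺丰快递, SF Express
    "fjgd": ["fu", "jian", "gui", "dang"],   # 福建 + 归档 (tentative)
}


def initials(syllables):
    """Return the first letter of each pinyin syllable, joined together."""
    return "".join(s[0] for s in syllables)


def match_kind(title, syllables):
    """Classify how well an index title matches the syllable initials."""
    abbr = initials(syllables)
    if abbr == title:
        return "exact"
    if abbr.startswith(title):
        return "prefix"
    return "none"


for title, syllables in HYPOTHESES.items():
    print(f"{title:5} {initials(syllables):5} {match_kind(title, syllables)}")
```

Under these transcriptions every title matches its initials exactly except wyy, which covers only the first three initials of the proposed four-syllable reading. Titles such as sanys, which mix a fully spelled syllable with initials, would need a looser matching rule than the one sketched here.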
Derived data
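Some fields in the raw data appear to have been derived by the threat actor from other values rather than collected from the original sources:
- phoneInfo contains data in the format province | city | mobile_carrier – data that can be identified for any phone number using free, publicly available lookup tools.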
- idcardInfo contains data in the format province, city, county | dob | astrological zodiac | chinese zodiac | gender. All of this idcardInfo data is derivable from a Chinese national ID number. The Chinese citizenship identification number (公民身份号码) is an 18-digit national ID number that embeds the province, prefecture, and county where the ID number was assigned (usually a birthplace), the citizen’s birthdate, and the citizen’s gender. Zodiac signs can then be further derived from the birthdate. Thus, the threat actor likely derived these seven additional idcardInfo assets just by decoding the Chinese national ID numbers.
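To illustrate how little effort this derivation takes, here is a minimal sketch of decoding an 18-digit ID number. It assumes the publicly documented layout (a 6-digit address code, an 8-digit birthdate, a 3-digit sequence code whose last digit is odd for male and even for female, and a MOD 11-2 check character). The function names are ours, the Western zodiac boundaries are the conventional approximate dates, and the Chinese zodiac is approximated from the Gregorian year, which ignores the Lunar New Year boundary. We do not know what logic the threat actor actually used.

```python
import re
from datetime import date

# Check-character weights and lookup string for the MOD 11-2 scheme.
_WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
_CHECK_CHARS = "10X98765432"

_CHINESE_ZODIAC = ["Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
                   "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig"]

# (last month/day covered by the sign, sign); approximate conventional dates.
_WESTERN_ZODIAC = [
    ((1, 19), "Capricorn"), ((2, 18), "Aquarius"), ((3, 20), "Pisces"),
    ((4, 19), "Aries"), ((5, 20), "Taurus"), ((6, 20), "Gemini"),
    ((7, 22), "Cancer"), ((8, 22), "Leo"), ((9, 22), "Virgo"),
    ((10, 22), "Libra"), ((11, 21), "Scorpio"), ((12, 21), "Sagittarius"),
    ((12, 31), "Capricorn"),
]


def check_char_ok(id_number):
    """Validate the format and the final check character."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", id_number):
        return False
    total = sum(int(d) * w for d, w in zip(id_number[:17], _WEIGHTS))
    return _CHECK_CHARS[total % 11] == id_number[17].upper()


def western_zodiac(d):
    for end, sign in _WESTERN_ZODIAC:
        if (d.month, d.day) <= end:
            return sign


def decode_national_id(id_number):
    """Derive idcardInfo-style fields from an 18-digit ID number."""
    if not check_char_ok(id_number):
        raise ValueError("not a valid 18-digit ID number")
    dob = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    return {
        "province_code": id_number[0:2],    # e.g. 35 for Fujian
        "prefecture_code": id_number[0:4],
        "county_code": id_number[0:6],
        "dob": dob,
        "gender": "male" if int(id_number[16]) % 2 else "female",
        "western_zodiac": western_zodiac(dob),
        # Gregorian-year approximation; ignores the Lunar New Year boundary.
        "chinese_zodiac": _CHINESE_ZODIAC[(dob.year - 4) % 12],
    }


# A sample number widely used in documentation, not taken from the leak.
print(decode_national_id("11010519491231002X"))
```

Turning the address codes into province, city, and county names would additionally require a lookup table of administrative division codes, which is omitted here.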
What this database likely represents within the Chinese cybercrime ecosystem
Based on the data present, the fact that it appeared to have been recently compiled together from a variety of unrelated sources, and the fact that it was discovered on a bulletproof hosting service, we believe that this was most likely a collection of data assembled by a Chinese criminal actor who was planning to sell the data, either as a complete collection or as the backend for an illicit PII lookup service.
In the past, we have written about Chinese cybercriminal PII sellers and the lookup services they offer, called SGKs. SGKs – short for 社工库 (shègōng kù) or social engineering libraries – are repositories of leaked and stolen PII created by Chinese-language threat actors. They generally compile hacked and leaked databases so that PII on Chinese citizens and users can be queried easily.
SpyCloud tracks dozens of these SGKs, which often take the form of basic clearnet websites or Telegram bots. Often SGKs include basic PII lookup bots, which usually return data similar to the data contained in this leaked database, as well as “premium lookup” services, where customers can pay to have corrupt insiders who work in the government, telcos, or banks retrieve more detailed data from privileged sources.
Sample results from a basic SGK query. This SGK interfaces with users via a Telegram bot.
Visualizing the data from this leak
We were able to extract 6.38 billion unique records from 39 indices in the database. Across these indices, city and state/province were the most common data asset types – likely because they appear in indexes where they were derived from the national ID, in indexes where they were derived from a phone number (from the number’s area code), and in indexes that included a full address.
phone and mobile_carrier are the next most common – they are extremely close in number because mobile_carrier appears to have been looked up based on the phone value. These are followed by full_name, and then national_id, county, gender, and dob – these latter four data assets are also relatively close in number because county, gender, and dob generally appear to have been derived from the national_id number.
Pie chart showing the frequency of data asset types that we parsed out of the dataset.
We also mapped out leaked users of the DiDi app, a Chinese rideshare and food delivery service. It’s not terribly surprising that Henan and Sichuan, the third and fifth most populous provinces respectively, are the most heavily represented in the data. It’s a little less expected that third on the list is Yunnan – the 12th most populous province in China. We aren’t totally sure why the data doesn’t roughly track the populations of the provinces here. Unfortunately, it’s hard to speculate further without knowing more about this data’s provenance.
We also looked at the age distribution for the DiDi index. The average age of individuals in this dataset is 36 years old. Interestingly, the data cuts off sharply below age 20 – no individuals in the data have a birthdate later than 2006.
At first we thought that this perhaps indicated that the data only included DiDi drivers, a hypothesis further bolstered by the presence of VIN numbers in the index. However, DiDi drivers actually have a minimum age of 21 and only about 4,000 of the 62 million records in the dataset include VIN numbers.
Instead, we think this cutoff is actually more likely to be an artifact of the date when the data was exfiltrated. DiDi imposes a minimum rider age of 16 (it used to be 18, but was decreased to 16 in 2019). Therefore, we can conclude with moderate confidence that the data in this index was likely exfiltrated from its original source around 4 years ago (in 2022) – when individuals born in 2006 were turning 16.
Bar graph showing the age distribution of individuals present in the leaked ddcx data.
Next, we took a look at the two main indexes with credit card numbers – bcsj (gambling data) and ylsj (UnionPay data). One noticeable difference is the lack of any JCB (Japan Credit Bureau) or Visa Electron (Visa Debit) cards in the gambling dataset, while the UnionPay data contains a significant number of each. After further research, we found that while both of these financial institutions are based outside of China, UnionPay actually offers co-branded cards with both institutions.
Common credit card BINs in the gambling and UnionPay datasets.
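For readers unfamiliar with BIN analysis, the sketch below shows the general idea: the leading digits of a card number indicate the card network. The prefix rules here are a simplified set of widely published ranges, not the full BIN tables a production analysis would use and not necessarily the method behind the chart above. The function name `card_network` and the sample numbers are made up for illustration.

```python
import re
from collections import Counter


def card_network(card_number):
    """Roughly classify a card number by its leading digits (simplified)."""
    digits = re.sub(r"[^0-9]", "", card_number)
    if len(digits) < 6:
        return "unknown"
    p2, p3, p4, p6 = (int(digits[:n]) for n in (2, 3, 4, 6))
    if 622126 <= p6 <= 622925:
        return "UnionPay/Discover"  # range shared between the two networks
    if p2 == 62:
        return "UnionPay"
    if p4 == 6011 or 644 <= p3 <= 649 or p2 == 65:
        return "Discover"
    if 3528 <= p4 <= 3589:
        return "JCB"
    if p4 in (4026, 4508, 4844, 4913, 4917) or p6 == 417500:
        return "Visa Electron"
    if digits[0] == "4":
        return "Visa"
    if 51 <= p2 <= 55 or 2221 <= p4 <= 2720:
        return "Mastercard"
    if p2 in (34, 37):
        return "American Express"
    return "other"


# Synthetic numbers for illustration only; none come from the leak.
sample = [
    "6212 3456 7890 1234",
    "6221260000000000",
    "3530111333300000",
    "4026000000000000",
    "4111111111111111",
]
print(Counter(card_network(n) for n in sample))
```

Counting the result per index, as in the last line, is enough to surface differences such as one dataset containing JCB-range numbers and another containing none.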
Finally, we decided to visualize both the Western and Chinese zodiac symbols in the data. Why? Because it’s not every day that we ingest data containing zodiac information.
Each zodiac system has 12 categories – with the Western zodiac rotating on a monthly basis over an annual cycle and the Chinese zodiac rotating on an annual basis over a 12-year cycle. As you can see, both zodiac systems appear to be relatively evenly distributed in the leaked data. Libra and Rabbit are the most common signs in their respective systems, each by a small margin. This relatively (but not exactly) even distribution is what we’d expect from real-world data.
Distribution of astrological signs across the entire leaked dataset.
Distribution of Chinese zodiac signs across the entire leaked dataset.
Who is exposed in this massive leak?
This is an extremely thorough dataset in terms of its coverage of the Chinese population, amounting to massive exposure for Chinese citizens.
- Looking at national ID numbers alone, the dataset contains 813 million unique national IDs. That means the dataset covers approximately 58% of the country’s population of 1.4 billion people.
- Across the entire dataset, each unique national ID value appears in an average of 3.1 records, meaning that (again, just looking at the records containing a national ID) the leak includes records from approximately 3 distinct sources for the majority of Chinese citizens.
- In short, this dataset includes multiple points of personal information about most Chinese people.
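These two figures can be reproduced from numbers already given in this post. The short calculation below assumes that the 3.1 average is the count of records containing a national ID divided by the count of unique IDs, and it uses the rounded 813 million and 1.4 billion figures, so it is a consistency check rather than an exact recomputation.

```python
unique_ids = 813_000_000     # unique national IDs (rounded figure from above)
population = 1_400_000_000   # approximate population figure used above
id_records = 2_545_485_625   # records containing a national ID number

coverage = unique_ids / population
records_per_id = id_records / unique_ids

print(f"coverage: {coverage:.0%}")
print(f"records per unique ID: {records_per_id:.1f}")
```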
A practical example of the scale of exposure
Let’s look at what this means in practice. Take the domain mfa.gov.cn, the domain for the Chinese Ministry of Foreign Affairs (MFA). In this 6-billion-record leak alone, we can identify 133 unique @mfa.gov.cn email addresses across 207 total records, meaning we have an average of 1.6 distinct records per MFA employee email address.
If we pivot off of the national IDs and phone numbers included in this first set of results, we get 563 total records, for an average of 4.2 distinct records for each unique MFA employee email address we can find in the breach. At SpyCloud, we refer to this concept as holistic identity exposure – the idea that the exposure of an employee’s personal information also puts their employer’s organizational identity security at risk. Finally, if we tally up all of the assets across each of those 563 records, we get 3,469 matching data assets – for an average of 26 matching data assets per unique MFA email address.
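The pivot described above can be sketched in a few lines. This is a toy illustration under our own assumptions, not SpyCloud’s actual pipeline: records are plain dictionaries, the function name `pivot` is hypothetical, all values are fabricated placeholders on the example.org domain, and only a single pivot hop is performed.

```python
def pivot(records, seed_domain, pivot_keys=("national_id", "phone")):
    """Find records for a domain, then expand via shared identifiers."""
    suffix = "@" + seed_domain.lower()
    seed_idx = {i for i, r in enumerate(records)
                if r.get("email", "").lower().endswith(suffix)}
    # Collect every pivot identifier seen in the seed records.
    identifiers = {(k, records[i][k]) for i in seed_idx
                   for k in pivot_keys if records[i].get(k)}
    # One hop: any record sharing one of those identifiers joins the set.
    expanded_idx = seed_idx | {
        i for i, r in enumerate(records)
        if any((k, r.get(k)) in identifiers for k in pivot_keys)
    }
    seed = [records[i] for i in sorted(seed_idx)]
    expanded = [records[i] for i in sorted(expanded_idx)]
    return seed, expanded


# Fabricated placeholder records; nothing here comes from the leak.
records = [
    {"email": "a.li@example.org", "phone": "13800000001"},
    {"email": "a.li@example.org", "national_id": "ID-0001", "full_name": "Li A"},
    {"phone": "13800000001", "address_1": "1 Example Rd", "full_name": "Li A"},
    {"national_id": "ID-0001", "password": "placeholder"},
    {"email": "b.wang@example.org"},
    {"phone": "13900000002", "full_name": "Unrelated"},
]

seed, expanded = pivot(records, "example.org")
unique_emails = {r["email"].lower() for r in seed}
total_assets = sum(len(r) for r in expanded)

print(len(seed) / len(unique_emails))      # records per address before pivoting
print(len(expanded) / len(unique_emails))  # records per address after pivoting
print(total_assets / len(unique_emails))   # data assets per address
```

The same three ratios applied to the figures above give 207 / 133 ≈ 1.6, 563 / 133 ≈ 4.2, and 3,469 / 133 ≈ 26.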
Key takeaways from SpyCloud’s analysis of the leak
There is so much data in this leak, and many opportunities to slice and dice what’s inside, but our key takeaways include:
Scale and scope:
Data assets
The leak aggregates PII across 39 indices, including over 4.48 billion phone numbers, 3.61 billion full names, 2.55 billion Chinese national ID numbers, and 433.2 million passwords.
Source aggregation
The data was compiled from various breached sources, likely by a Chinese-language actor planning to sell it or use it as a backend for illicit PII lookup services known as SGKs.
Exposure quantity
The national ID records alone cover approximately 58% of China’s 1.4 billion population. On average, each unique national ID appears in 3.1 records, meaning the majority of Chinese citizens have personal information exposed from multiple distinct sources.
Breadth of exposures
The leak demonstrates holistic identity exposure, where pivoting off a single piece of PII (like a national ID or phone number) can yield multiple records and associated assets about an individual. For example, 133 unique @mfa.gov.cn email addresses were associated with an average of 4.2 distinct records and 26 matching data assets per employee.
This massive breach contains data about most Chinese people, exposing Chinese individuals, Chinese organizations, and multinational organizations alike. It also showcases the immense breadth and depth of breached personal data being collected and aggregated by Chinese cybercriminals.