David Truong

Releasing X23 data corpus

For nearly three years I worked on a startup called x23.ai, which I recently decided to close down. The announcement can be seen here.

Shutting something down is always difficult. You've dedicated a lot of time and effort to it, and it ended in failure. But as an eternal optimist, I see a silver lining: instead of all that effort going to waste, I can contribute something useful to the public domain.

X23 Data Corpus

Enter the corpus, a massive normalized dataset covering both DAO/crypto governance forum data (from before our pivot) and Solidity smart contract vulnerability findings. I believe this is one of the largest crypto-related dataset releases (11 GB!), and I hope it can be useful to researchers and builders on the frontier.

All data is released under the permissive CC BY 4.0 licence, unless otherwise noted.

Everything in the dataset comes from publicly accessible source data. The value we’re releasing is the normalized structure, schemas, deduplication, packaging, samples, checksums, and embeddings that make it easier to use.
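As a rough sketch of how the checksums could be used after downloading a file, the snippet below hashes a file in chunks and compares the result with a published value. It assumes the checksums are SHA-256 hex digests, which is my assumption here and should be checked against the dataset card. The function names are illustrative and not part of the release.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a published hex digest (assumed SHA-256)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks matters for a release of this size, since individual files may be too large to load comfortably in one go.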

Note: This is not intended to be a canonical CVE/NVD-style database. It’s a practical, normalized dataset from the work we did at x23.ai, released so other people can build on it.

DAO/crypto governance data

The governance dataset covers the period from September 2015 through September 2025, across 42 projects/DAOs/communities, and includes 40k+ forum topics, 14k+ authors, 155k+ forum posts, 400k+ extracted links, and other related data. The communities include Aave, Aavegotchi, Arbitrum, Aura, Balancer, Celestia, Compound, CoW DAO, Curve, dYdX, EigenLayer, ENS, Ethereum Magicians / EthResearch, ether.fi, Euler, Flashbots, Frax, Gitcoin, GMX, Gnosis, Instadapp, Jupiter, Kamino, Lido, MakerDAO, Merit Circle / Beam, Moonwell, Morpho, Octant, Optimism, Paladin, Polygon, Reserve, Rocket Pool, SafeDAO, Scroll, Spectra, Superfluid, The Graph, Uniswap, Venus, and zkSync / ZK Nation.

Hugging Face Link

Solidity vulnerability data

The (mostly Solidity) vulnerability dataset covers audit reports published between May 2019 and April 2026. In total there are 55k+ individual vulnerability findings from 4k+ reports.

For each finding, we include normalized fields, source/report references where available, severity/category/status metadata, affected artifacts, functions, remediation text, and embedding-ready text views. The release also includes short and long vector embeddings for the retained findings, so the dataset can be used directly for retrieval, semantic search, clustering, agent memory, or benchmark-style evaluation.
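To give a feel for the retrieval use case, here is a minimal sketch of semantic search over finding embeddings using cosine similarity. The field names ("id", "embedding") and the tiny three-dimensional vectors are made up for illustration. The real schema and vector dimensions are documented with the dataset and will differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, findings, k=3):
    """Return the k (score, id) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, f["embedding"]), f["id"]) for f in findings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data only: hypothetical ids and 3-d vectors, not real dataset rows.
toy_findings = [
    {"id": "finding-a", "embedding": [1.0, 0.0, 0.0]},
    {"id": "finding-b", "embedding": [0.0, 1.0, 0.0]},
    {"id": "finding-c", "embedding": [0.9, 0.1, 0.0]},
]

results = top_k([1.0, 0.0, 0.0], toy_findings, k=2)
```

For the full 55k+ findings, a vectorized library or a vector index would likely be a better fit than this pure-Python loop, but the ranking logic stays the same.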

Hugging Face Link


If you do something cool with the data, please do let me know!
