Integrates data from over 30 authoritative resources, encompassing TCM terminology, Chinese patent medicines, herbal pieces, natural products, chemical components, disease targets, and more. Bridges traditional TCM knowledge with modern biomedical sciences, facilitating research and application in both domains.
Comprises 143,331 entities and 3,609,559 relationships, constructed using seven types of previously organised data. Provides a comprehensive platform for TCM research, enabling advanced queries and analysis of TCM knowledge.
A multimodal question-answering dataset designed for evaluating large TCM language models, encompassing over 52,000 questions across various TCM disciplines. Provides a standardised benchmark for assessing the performance of TCM-specific LLMs on real-world tasks. Supports the development and evaluation of AI models in TCM education and clinical decision-making.
Comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. Serves as a comprehensive testbed for evaluating LLMs on TCM knowledge, reasoning, and safety. Guides the development of more competent and trustworthy medical AI systems in the TCM domain.
Supports interactive Q&A, linking herbs, formulas, symptoms, and modern terminology. Backend for HuatuoGPT consultation AI. Real-time TCM decision-making and education.
Integrates six high-quality TCM and Western medicine databases, encompassing 20 entities and 46 kinds of relations, totalling over 3.4 million records. Facilitates advanced data integration, enabling comprehensive analysis of biological processes, pathways, anatomical sites, and side effects. Supports modernised TCM research and development by providing a unified intelligence platform.
Holds over 40 million herbal compounds and 15 million species records. Links plant species, compound structures, and clinical applications. Herb authentication, quality control, and pharmaceutical R&D.
Knowledge network integrating herbs, diseases, gene ontology, and phenotypes. Performs mesh-style inference for drug development and mechanistic research. Network pharmacology and personalised treatment matching.
Contains over 1,400 herbs, 44,000 formulas, and related molecular info. Supports prescription decoding, target prediction, and disease matching. Clinical reference and drug design.
Incorporates TCM formulas, pharmacokinetic data, and bioactivity evidence. Facilitates herb-syndrome-disease analysis. AI training sets for herbal efficacy and interaction modelling.


TCM Knowledge Acquisition and Novel Knowledge Validation
Knowledge acquisition in Traditional Chinese Medicine (TCM) is the systematic process of collecting, interpreting, and structuring knowledge from classical texts, clinical experience, and empirical data to improve the field's understanding, diagnosis, treatment, and innovation.
Historically, TCM knowledge was transmitted through classical literature such as the Huangdi Neijing, case studies, and master-apprentice relationships. In the modern context, this process has expanded to include digital databases, clinical records, and biomedical research. With the rise of artificial intelligence and data science, the field of TCM knowledge acquisition is undergoing a radical transformation.
Key aspects of the TCM knowledge acquisition process include:
-
Classical Knowledge Extraction: This involves extracting valuable diagnostic and therapeutic information from ancient texts written in Classical Chinese. Natural language processing (NLP) and machine learning techniques are now used to digitise and analyse these texts, facilitating greater accessibility and interpretation.
-
Clinical Data Mining: Systematically analysing electronic medical records, treatment protocols, and patient outcomes. AI tools help identify patterns and correlations that may not be immediately evident, allowing clinicians and researchers to refine syndrome differentiation and treatment strategies.
-
Standardisation and Ontology Development: Creating structured vocabularies, taxonomies, and knowledge graphs for TCM concepts such as syndromes, herbs, acupoints, and formulas. This enables interoperability across systems, facilitates education, and supports AI training models.
-
Multi-source Integration: Combining data from herbal medicine, acupuncture, imaging, pulse diagnosis, genomics, and patient-reported outcomes. AI integrates and cross-validates this information, providing a more holistic understanding of health and disease.
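The standardisation step in the list above can be illustrated with a minimal sketch of term normalisation, in which variant spellings of a herb name resolve to one canonical identifier. The identifiers (`HERB:0001`, `HERB:0002`) and the function names are hypothetical; a real vocabulary would come from a curated ontology rather than a hand-written dictionary.

```python
import unicodedata
from typing import Optional

# Hypothetical canonical IDs mapped to known name variants (toy vocabulary).
SYNONYMS = {
    "HERB:0001": ["Huang Qi", "黄芪", "Astragalus", "Radix Astragali"],
    "HERB:0002": ["Gan Cao", "甘草", "Licorice", "Liquorice"],
}


def _key(term: str) -> str:
    """Fold case and character width, then drop all whitespace."""
    folded = unicodedata.normalize("NFKC", term).casefold()
    return "".join(folded.split())


# Reverse index: normalised surface form -> canonical ID.
LOOKUP = {_key(name): cid for cid, names in SYNONYMS.items() for name in names}


def canonical_id(term: str) -> Optional[str]:
    """Return the canonical ID for a term, or None if it is not in the vocabulary."""
    return LOOKUP.get(_key(term))
```

Calling `canonical_id("huang  qi")` and `canonical_id("黄芪")` would both resolve to the same identifier, which is the property that lets records from different systems be joined.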
The evolving field of TCM knowledge acquisition, powered by AI and big data, marks a shift from static knowledge preservation to dynamic knowledge generation. The novel knowledge it produces bridges traditional wisdom with modern science, ensuring that TCM remains a living, adaptive system capable of meeting the demands of 21st-century healthcare. This integration enhances clinical efficacy and innovation while upholding the integrity of TCM’s holistic worldview in a rapidly changing medical landscape.
AI-Based Data Mining
Data mining in Traditional Chinese Medicine (TCM) involves using computational techniques to analyse large datasets of TCM information (such as medical records, herbal prescriptions, and ancient texts) to extract meaningful patterns and knowledge. This helps in understanding the effectiveness of TCM treatments, identifying potential drug candidates, and optimising TCM practices.
Applications of data mining in TCM:
-
Analysing TCM Prescriptions: Data mining can help identify patterns in TCM prescriptions, such as which herbs are commonly used together and how they relate to specific symptoms or diseases.
-
Understanding TCM Theory: By analysing ancient TCM texts, data mining can help uncover the underlying principles and theories of TCM, such as the five elements, yin and yang, and the flow of qi.
-
Developing New TCM Treatments: Data mining can help identify potential new TCM treatments by analysing large datasets of herbal ingredients and their effects.
-
Improving TCM Diagnostics: Data mining can be used to develop more accurate and efficient diagnostic tools for TCM by analysing patient data and patterns.
-
Optimising TCM Manufacturing Processes: Data mining can help optimise the manufacturing processes of TCM products, such as herbal medicines, by identifying critical process parameters and improving quality control.
-
Researching TCM-Related Diseases: Data mining can be used to analyse large datasets of patient records and research articles to identify the relationships between TCM and various diseases, such as COVID-19, depression, and insomnia.
Data mining techniques used in TCM include association rule mining, clustering analysis, factor analysis, topic modelling, and deep learning.
Examples of data mining applications in TCM:
-
Identifying Core Herb Pairs: Data mining can identify the most frequently used pairs of herbs in TCM prescriptions.
-
Summarising the Utility and Attributes of TCM Prescriptions: Data mining can analyse the characteristics and effects of different TCM prescriptions.
-
Finding the Optimal Dose of Chinese Herbs: Data mining can help determine the optimal dosage for different conditions.
-
Developing New TCM Prescriptions: Data mining can help generate new TCM prescriptions based on existing knowledge and data.
-
Identifying Drug Candidates: Data mining can help identify potential TCM-based drug candidates, for example for treating depression.
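As a rough illustration of how core herb pairs might be identified, the sketch below counts pairwise co-occurrence across prescriptions and reports support, confidence, and lift, the standard measures in association rule mining. The function name, the threshold, and the `TOY_PRESCRIPTIONS` data (placeholder herb names, not real prescriptions) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def mine_herb_pairs(prescriptions, min_support=0.5):
    """Return herb pairs whose co-occurrence support meets min_support."""
    n = len(prescriptions)
    herb_counts = Counter()
    pair_counts = Counter()
    for rx in prescriptions:
        herbs = sorted(set(rx))  # de-duplicate and fix the order within a pair
        herb_counts.update(herbs)
        pair_counts.update(combinations(herbs, 2))

    results = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        results.append({
            "pair": (a, b),
            "support": support,
            # Confidence of the rule a -> b: share of a's prescriptions that also contain b.
            "confidence": count / herb_counts[a],
            # Lift > 1 means the pair co-occurs more often than independence predicts.
            "lift": support / ((herb_counts[a] / n) * (herb_counts[b] / n)),
        })
    # Highest support first; ties broken alphabetically by pair.
    return sorted(results, key=lambda r: (-r["support"], r["pair"]))


# Toy data with placeholder herb names.
TOY_PRESCRIPTIONS = [
    ["herb_a", "herb_b", "herb_c"],
    ["herb_a", "herb_b"],
    ["herb_a", "herb_d"],
    ["herb_b", "herb_c"],
]

pairs = mine_herb_pairs(TOY_PRESCRIPTIONS)
```

On real prescription data, a dedicated frequent-itemset algorithm such as Apriori or FP-Growth would scale better than exhaustive pair counting, and itemsets larger than two herbs would usually be of interest as well.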
Large-scale data mining can uncover new knowledge from clinical experience, literature, and databases. When applied to TCM databases, machine learning has surfaced novel herb combinations and network pharmacology relationships.
Using AI to integrate data from various TCM databases, including chemical, clinical and pharmacological data, significantly enhances the comprehensiveness and accuracy of diagnostics. This holistic data analysis approach ensures that all relevant aspects of patient health are considered, resulting in more informed and effective diagnostic outcomes. The integration process involves advanced data mining techniques and AI algorithms capable of handling the complexity and diversity of TCM data.



HIT links herbal active ingredients to their molecular targets, providing insights into the pharmacological actions of TCM herbs.
HIT (Herbal Ingredients' Targets) Database
TCM-SD is a benchmark dataset for probing syndrome differentiation via natural language processing. It contains over 54,000 real-world clinical records covering 148 syndromes, supporting AI-driven TCM research.
An English-language database containing detailed data and standard specifications of many traditional Chinese medicines, facilitating convenient searching and supporting research and development of new drugs.
TCMGeneDIT is a database that uses text mining techniques to associate TCM herbs with genes and diseases. It facilitates the exploration of molecular mechanisms underlying TCM therapies.
TCM Gene-Disease Association Database (TCMGeneDIT)
Offers detailed information on the majority of herbs used in clinical settings worldwide. Herbs are categorised by function, channel, and property, providing insights into their traditional uses and applications.
Focuses on commonly used TCM herbs, providing pictures and overviews of their sources, basic nature, actions, usage, and safety profiles.
ITCM integrates ten TCM-related databases, including information from various pharmacopoeias and official recommendations for COVID-19 formulas. It serves as a comprehensive resource for TCM research and application.
A comprehensive clinical tool featuring over 680 single herbs, 1,485 formulas, 361 acupuncture points, and 15,000 Western diagnoses (including ICD-9 codes). It also includes a Medical Chinese dictionary with over 100,000 words and unique translations of classical Chinese texts.
TCMSP integrates pharmacokinetic properties, drug-likeness, and target information for 499 Chinese herbs. It provides tools for analysing herb-compound-target-disease networks, aiding in understanding TCM mechanisms and drug discovery processes.
TCMM focuses on modernising TCM by aligning traditional knowledge with modern medical insights. It includes over 3 million records covering prescriptions, ingredients, targets, and diseases, and offers tools like Rx Gen for customisable prescription generation.
Integrating Artificial Intelligence (AI) into Traditional Chinese Medicine (TCM) research marks a transformative shift in how ancient wisdom is studied, applied, and advanced. No longer confined to oral transmission or classical texts, TCM knowledge is being digitised, analysed, and reinterpreted through the lens of modern computational science. AI enables researchers to uncover patterns, validate clinical efficacy, and generate new insights by processing complex and voluminous TCM data that spans centuries of empirical practice. This convergence of traditional healing and intelligent technology heralds a new era, where TCM can evolve into a more standardised, data-driven, and globally recognised medical system without losing its holistic essence.
Central to this evolution are robust TCM databases, AI-driven data mining, knowledge acquisition methods, and the construction of knowledge graphs—tools that enable the systematic organisation and exploration of traditional knowledge. The possibility of creating an integrated TCM repository—one central, intelligent platform for accessing and expanding the entire body of TCM knowledge—holds immense potential. Collectively, these innovations are reshaping the landscape of TCM research, bridging tradition and technology to ensure that the wisdom of the past continues to inform and empower future medicine.

TCM Databases
TCM databases are vibrant repositories of ancient herbal wisdom, diagnostic techniques, and clinical case records. Today, they form the backbone of AI systems capable of learning, predicting, and evolving with remarkable precision. These databases serve as comprehensive resources, seamlessly enabling data exchange for drug discovery, clinical trials, and a deeper understanding of the fundamental mechanisms underpinning TCM.
Core information contained in TCM databases:
-
TCM prescriptions
-
Herbs and medicinal materials
-
Chemical ingredients and compounds
-
Biological targets
-
Associated diseases and symptoms
-
Component identification
-
Biological pathways
-
Symptom mapping
-
Historical and linguistic data
TCM databases enhance research efficiency, accuracy, and scope by providing accurate, standardised information. They streamline manual data collection, significantly reduce the time and resources required for analysis, and offer a centralised knowledge base for students, researchers, and clinicians. Importantly, these databases also foster interdisciplinary collaboration—encouraging the convergence of pharmacology, bioinformatics, data science, and traditional medicine. This fusion enhances the understanding of TCM’s therapeutic potential and supports its integration into mainstream global healthcare systems. The emergence of TCM databases has revolutionised the research landscape. They facilitate access to critical information, accelerate innovation, foster cross-domain partnerships, and pave the way for a digitally empowered future of Traditional Chinese Medicine.
TCM databases offer many features that significantly advance Traditional Chinese Medicine research. By providing reliable, diverse, and standardised information, alongside specialised tools, they enable in-depth exploration of complex therapeutic relationships and patterns. Large-scale databases supply abundant, high-quality training samples essential for developing and refining AI models.
The ongoing expansion of these databases is critical. As the volume and granularity of data grow, so do the precision, relevance, and contextual adaptability of AI applications in TCM. Enhanced datasets allow AI systems to better navigate the subtleties of TCM diagnosis and treatment, enabling them to adapt across diverse clinical scenarios and patient populations. This improves the robustness of digital health tools and accelerates the integration of TCM into evidence-based, personalised healthcare solutions.
TCMBank is the largest non-commercial TCM database, encompassing over 9,000 herbs, 61,000 ingredients, 15,000 targets, and 32,000 diseases. It provides 3D structures of ingredients and integrates AI-assisted drug discovery tools, making it a valuable resource for modern drug development.
ETCM offers comprehensive information on TCM herbs, formulas, ingredients, and their relationships with gene targets and diseases. It includes over 48,000 formulas and nearly 10,000 Chinese patent drugs, facilitating systematic analysis and network construction for research purposes.

AI in TCM Research
TCMID integrates data on over 8,000 herbs, 25,000 herbal compounds, and 17,500 targets. It bridges TCM with modern life sciences, providing a platform for exploring herb-compound-target-disease relationships.
Chem-TCM contains chemical information on approximately 350 herbs, listing over 12,000 chemical records. It links botanical information with Western therapeutic targets, aiding in assessing molecular activity within TCM categories.
Developed by Nigel Wiseman, this database includes over 30,000 Chinese terms with Pinyin transcription and English translation, 6,000 medicinals, 2,000 formulas, and detailed information on approximately 400 acupuncture points.
Developed by the China Academy of Chinese Medical Sciences, this system encompasses 48 databases with over 2.2 million records, including TCM journal literature, disease diagnosis and treatment, various herbal databases, formulas, and national standards.
Chinese Medicine Network Research
Chinese Medicine Network research represents a groundbreaking initiative to express and validate Traditional Chinese Medicine (TCM) theories through quantitative models. By integrating systems biology, pharmacology, and network science, this research transforms TCM's traditionally qualitative framework into structured, computational forms that can be analysed, validated, and applied in a modern context. This approach leverages two complementary strategies—top-down and bottom-up—to design and refine herbal prescriptions and understand the complex mechanisms underlying disease and therapy.
-
Top-Down Approach: From Classical Formulas to New Applications: The top-down strategy begins with established classical prescriptions as knowledge-rich starting points. Researchers use network analysis to dissect these prescriptions, mapping the relationships between herbs and their active compounds, biological targets and pathways, disease networks, and symptom clusters. This analysis enables the rational modification or reassembly of existing formulas to enhance efficacy, reduce toxicity, or tailor them to specific modern clinical needs. For instance, by understanding how a classical formula like Liu Wei Di Huang Wan interacts with kidney-related disease pathways, researchers can design new variants optimised for diabetic nephropathy or hypertension.
-
Bottom-Up Approach: Data-Driven Prescription Design from Disease Networks: The bottom-up approach does not rely on existing prescriptions. Instead, it starts by constructing disease-related networks based on genomic, proteomic, and clinical data. These networks identify critical disease-related targets and pathways, which are then matched to herbal components using computational tools such as network pharmacology, target prediction algorithms, and machine learning models. Through this process, entirely new TCM formulas can be designed based on objective disease mechanisms. This approach is particularly promising for emerging diseases, complex syndromes, or conditions where classical prescriptions offer limited guidance.
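The matching step of the bottom-up approach can be sketched as a coverage problem: given a set of disease-related targets and the known targets of each herb, repeatedly pick the herb that covers the most targets not yet addressed. This greedy heuristic is an assumption chosen for illustration, not the method of any particular network pharmacology tool, and the herb and target names are placeholders.

```python
def greedy_formula(disease_targets, herb_targets, max_herbs=3):
    """Greedily choose herbs that cover as many disease targets as possible.

    disease_targets: set of target identifiers from the disease network.
    herb_targets: dict mapping herb name -> set of targets it is linked to.
    Returns (chosen herbs in selection order, targets left uncovered).
    """
    remaining = set(disease_targets)
    chosen = []
    candidates = sorted(herb_targets)  # alphabetical order makes tie-breaking deterministic
    while remaining and len(chosen) < max_herbs:
        best = max(
            (h for h in candidates if h not in chosen),
            key=lambda h: len(remaining & herb_targets[h]),
            default=None,
        )
        # Stop when no unused herb covers any remaining target.
        if best is None or not (remaining & herb_targets[best]):
            break
        chosen.append(best)
        remaining -= herb_targets[best]
    return chosen, remaining


# Toy inputs with placeholder names.
DISEASE_TARGETS = {"T1", "T2", "T3", "T4"}
HERB_TARGETS = {
    "herb_x": {"T1", "T2"},
    "herb_y": {"T2", "T3"},
    "herb_z": {"T4", "T9"},
}

formula, uncovered = greedy_formula(DISEASE_TARGETS, HERB_TARGETS)
```

A practical pipeline would weight targets by their importance in the disease network and account for dosage, toxicity, and herb compatibility, none of which this sketch models.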
Chinese Medicine Network research bridges the traditional wisdom of TCM with modern biomedical science, offering several key benefits: quantitative validation of ancient theories and empirical knowledge; personalised prescription design tailored to patient-specific biomolecular profiles; scientific standardisation that improves reproducibility and regulatory acceptance; and accelerated drug discovery through the identification of multi-target, multi-pathway herbal combinations. By combining the top-down reverence for classical knowledge with the bottom-up power of modern data science, Chinese Medicine Network research marks a transformative step in the modernisation and globalisation of TCM.






Knowledge Graph Construction
Knowledge graph construction is a transformative method for digitising and organising the vast and complex body of Traditional Chinese Medicine (TCM) knowledge. It provides a structured, interconnected framework that maps relationships among herbs, symptoms, syndromes, diagnostic methods, and treatment strategies. This approach enables more intelligent, efficient, and scalable applications of TCM in both clinical and research settings.
-
Representation Learning and Chemical Data Organisation: Representation learning is central to constructing knowledge graphs in TCM. By encoding entities (e.g., herbs, formulas, symptoms, compounds) and their relationships as vector representations, AI models can better understand the semantics and associations within the TCM corpus. This enables efficient information retrieval across large TCM databases, semantic search and recommendation (e.g., finding similar herbs or prescriptions), and improved chemical and pharmacological data organisation within the TCM knowledge base. This supports advanced applications in drug discovery, clinical decision support, and educational tools.
-
Mixed-Scale Graph Learning for Formula Optimisation: Mixed-scale graph learning allows researchers to analyse and predict effective herbal combinations by examining multi-layered interactions: between herbs and their active chemical compounds, between compounds (synergies or antagonisms), and between herbs and therapeutic targets across multiple diseases. This method provides insights into formula optimisation, uncovering patterns that guide the discovery or refinement of prescriptions for improved efficacy and safety.
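To make the idea of vector representations concrete, here is a minimal sketch of a translation-based scoring function in the style of TransE, one common knowledge graph embedding method. It is used here as an illustrative stand-in, since the text does not name a specific model. The two-dimensional vectors are hand-set toy values rather than trained embeddings, and the entity and relation names are placeholders.

```python
import math


def transe_score(head, relation, tail):
    """Negative Euclidean distance between (head + relation) and tail.

    Scores closer to zero indicate a more plausible triple.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hand-set toy vectors; real embeddings would be learned from graph triples.
ENTITY = {
    "herb_a": [0.0, 0.0],
    "symptom_x": [1.0, 0.0],
    "symptom_y": [0.0, 3.0],
}
RELATION = {"treats": [1.0, 0.0]}


def rank_tails(head, relation, candidates):
    """Order candidate tail entities from most to least plausible."""
    return sorted(
        candidates,
        key=lambda t: transe_score(ENTITY[head], RELATION[relation], ENTITY[t]),
        reverse=True,
    )


ranking = rank_tails("herb_a", "treats", ["symptom_y", "symptom_x"])
```

The same scoring idea underlies semantic search and recommendation: nearby vectors stand for related entities, so ranking candidates by score gives a list of likely links to review.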
The foundation of TCM’s practical wisdom lies in centuries of clinical experience. Digitising this body of knowledge involves meticulously collecting and curating real-world data from clinical records, including patient complaints and symptom patterns, pulse and tongue diagnostics, syndromic classifications, treatment strategies, and outcomes. This raw data is then structured and formalised using knowledge engineering techniques. The structured data populates the TCM knowledge graph, linking clinical features with diagnostic patterns, herbs, acupoints, and treatment outcomes. Once structured, the data is deployed into a web-based knowledge graph platform. This enables advanced visualisation of TCM relationships and clinical reasoning pathways; intelligent recommendation systems that suggest treatments based on patient input and historical cases; support for automated or semi-automated diagnosis and treatment planning; and educational tools that let students and practitioners explore diagnostic logic and formula composition.
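The pipeline described above, structured clinical facts feeding a graph that drives recommendations, can be sketched with a small in-memory triple store. The class, the predicate names (`indicates`, `treated_by`), and all entity names are hypothetical; a deployed platform would typically use a graph database, and vote counting is only one simple way to rank syndromes.

```python
from collections import defaultdict


class TripleStore:
    """Minimal subject-predicate-object store indexed by (subject, predicate)."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._index[(subject, predicate)].add(obj)

    def objects(self, subject, predicate):
        # .get avoids creating empty entries for unknown keys.
        return set(self._index.get((subject, predicate), set()))


def recommend(store, symptoms):
    """Rank syndromes by how many input symptoms indicate them, with linked formulas."""
    votes = defaultdict(int)
    for symptom in symptoms:
        for syndrome in store.objects(symptom, "indicates"):
            votes[syndrome] += 1
    ranked = sorted(votes, key=lambda s: (-votes[s], s))
    return [(s, votes[s], sorted(store.objects(s, "treated_by"))) for s in ranked]


# Toy graph with placeholder entities.
kg = TripleStore()
kg.add("symptom_1", "indicates", "syndrome_a")
kg.add("symptom_2", "indicates", "syndrome_a")
kg.add("symptom_2", "indicates", "syndrome_b")
kg.add("syndrome_a", "treated_by", "formula_1")
kg.add("syndrome_b", "treated_by", "formula_2")

suggestions = recommend(kg, ["symptom_1", "symptom_2"])
```

Because every suggestion is traceable to explicit triples, the reasoning path (symptom, syndrome, formula) can be shown to the user, which is what makes graph-based recommendation suitable for teaching as well as decision support.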
These graphs promote a nuanced, interconnected understanding of TCM principles and therapeutic strategies. By supporting real-time decision-making, intelligent querying, and automated reasoning, knowledge graphs are powerful tools for modernising TCM while preserving its holistic essence.
Focuses on herbal medicine, integrating herb–compound–disease–gene interactions. Enables pathway analysis and drug repurposing. Herbal formulation analysis and compound screening.
Maps TCM symptoms with modern biomedical diseases, herbs, and targets. Enables comparison between TCM and Western medicine symptoms. Translational medicine, symptom ontology alignment.
Integrates herbs, symptoms, diseases, syndromes, prescriptions, and compounds. Supports semantic search, diagnosis assistance, and prescription generation. Used in TCM recommendation systems and intelligent consultation bots.
Links TCM herbs with diseases, genes, proteins, and chemical constituents. Supports drug discovery, systems biology, and mechanistic research. Network pharmacology, gene-target exploration.
Constructed from over 68 classical Chinese medical texts, resulting in a multi-relational knowledge graph with more than 48,000 entities and 152,000 interrelationships. Serves as the backbone for the OpenTCM system, enabling GraphRAG-powered large language models (LLMs) for TCM knowledge retrieval and diagnosis. Enhances TCM ingredient search and diagnostic question-answering without requiring pre-training or fine-tuning of LLMs.
This project complements the Traditional Chinese Medicine Multi-dimensional Knowledge Graph (TCM-MKG) project, which focuses on quantifying compatibility mechanisms within TCM. Employs graph neural networks (GNNs) to analyse and interpret the intricate compatibility relationships within TCM formulations. Facilitates advanced TCM compatibility studies, aiding in the quantitative evaluation of herbal combinations.
Defines an ontology with 29 types of entities and 32 kinds of relations, annotated in a high-quality entity and relation extraction dataset. Facilitates the construction of domain-specific knowledge graphs in TCM, enhancing information retrieval and knowledge discovery.
Models the dynamics of patient conditions over multiple visits, incorporating interactions between herbs and patient conditions. Enhances the accuracy of TCM prescription recommendations by considering the temporal evolution of patient states.
Utilises a fine-tuned ChatGLM3-6B model for TCM-related question-and-answer tasks, extracting key entities via a specialised TCM entity recognition model (TCMER). Supports the construction and application of TCM knowledge graphs, enhancing information retrieval and decision-making in TCM.
Leverages large language models to collect and embed substantial TCM-related data, generating precise representations that are transformed into a knowledge graph format. Aids teaching, disease diagnosis, and treatment decisions, contributing to TCM modernisation.

TCM Repository
The author of Integrating AI and TCM envisions a future where the full breadth of Traditional Chinese Medicine (TCM) knowledge is unified into a single, comprehensive digital repository. Imagine a digitalised and interconnected web-based platform that houses the entirety of TCM resources—classical texts, herbal compendia, clinical case records, diagnostic patterns, acupoint atlases, pharmacopoeias, and modern research.
This centralised TCM knowledge base would be designed for high accessibility and interoperability. Practitioners, researchers, educators, and patients could easily retrieve information with intuitive search tools and structured metadata. More importantly, this repository would not be static—it would serve as a dynamic, evolving foundation for training artificial intelligence systems, including large language models (LLMs) and domain-specific AIs tailored to the nuances of TCM.
By integrating disparate TCM databases—currently housed in isolated institutions or scattered across various platforms—this unified system would provide:
-
Standardised, machine-readable formats for herbs, formulas, diagnostic patterns, and treatment protocols.
-
Linked knowledge graphs showing relationships between symptoms, herbs, syndromes, and pathways.
-
Historical and contemporary insights woven together to preserve lineage while encouraging innovation.
-
A multilingual interface to make TCM knowledge accessible across cultures and regions.
-
An open platform for AI training, enabling the development of more accurate, culturally competent, and clinically relevant TCM-AI systems.
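As one possible shape for the standardised, machine-readable formats listed above, the sketch below defines a herb record and round-trips it through JSON. The schema, the field names, and the identifier are hypothetical, and the field values are illustrative; an actual repository would need an agreed standard rather than an ad hoc class.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class HerbRecord:
    """Hypothetical schema for a single herb entry."""
    herb_id: str
    names: Dict[str, str]  # language or script code -> name
    properties: List[str] = field(default_factory=list)
    channels: List[str] = field(default_factory=list)
    source_db: str = ""


def to_json(record: HerbRecord) -> str:
    # ensure_ascii=False keeps Chinese characters readable in the output.
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def from_json(text: str) -> HerbRecord:
    return HerbRecord(**json.loads(text))


# Illustrative record; the ID and source name are placeholders.
record = HerbRecord(
    herb_id="HERB:0002",
    names={"pinyin": "Gan Cao", "zh": "甘草", "en": "Licorice"},
    properties=["sweet", "neutral"],
    source_db="example_source",
)

restored = from_json(to_json(record))
```

Keeping multilingual names in a single keyed field is one way to serve the multilingual interface mentioned in the list, since the same record can be rendered in whichever language the reader selects.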
Such a platform would revolutionise how TCM is taught, practised, and evolved in the digital age. It would empower not only AI models but also human practitioners, fostering collaboration between ancient healing wisdom and modern technological advancement. Ultimately, this digital integration would mark a critical step toward globalising TCM and securing its relevance in the 21st century and beyond.