ChexMix was designed to extract hierarchical and topological information related to bioentities. To this end, ChexMix extracts the bioentities that co-occur with the queried keywords in PubMed and encodes them as unique identifiers that index their associated information. Combining a hierarchical representation with the mapping of bioentities to identifiers at each level makes it possible to organize and cross-reference the relationships between them. For example, species retrieved with keywords of interest, such as chemicals or diseases, can be represented hierarchically starting from the highest rank, “cellular organisms”, according to the phylogenetic taxonomic system of the NCBI Taxonomy database16. The search results are organized according to the hierarchical characteristics of each bioentity and can be displayed as graphs for hierarchical data visualization or as nested lists (Fig. 1); the information may therefore be useful for inspecting relationships among keywords of interest. Here, ChexMix was applied to discover biomedical sources of natural products that produce the bioactive compound amentoflavone, which has a wide range of biological activities, including antioxidant, anti-inflammatory, anticancer, antiviral, and antifungal properties22. This compound also exhibits potent antisenescent activity against skin aging induced by ultraviolet B irradiation, preventing nuclear aberrations23; thus, it can be used for the prevention of skin aging in the cosmetics industry.
First, 319 bioentities were extracted from ChexMix using the keyword “amentoflavone” under the highest taxonomic rank, “cellular organisms” (Fig. 2). Among them, 223 species included in the clade Viridiplantae (literally “green plants”) were targeted. It was possible to verify that these species co-occurred with amentoflavone in the same study and to determine whether any plant species could produce amentoflavone (Supplementary Table S1).
To avoid duplicating earlier studies and to find new bioactive sources, the analysis focused on related species of the genus Viburnum, collecting 19 samples from different parts of eight species native to Korea that had not previously been studied in relation to amentoflavone (Fig. 3, Supplementary Table S2). The presence of amentoflavone in these samples was then evaluated and quantified by HPLC. It was confirmed by the isotopic peak at 537.4 m/z [M − H]− detected by liquid chromatography-mass spectrometry. Among the samples, the leaves of V. erosum contained the highest amount of amentoflavone (7.39 mg/g) when compared with Selaginella tamariscina, which is the representative natural anti-wrinkle ingredient and the main source of amentoflavone in the cosmetics industry24. Overall, summarizing hierarchical bioentity information with ChexMix should help future investigations inspect both massive and rare bioentities in databases.
The performance of ChexMix was quantitatively assessed based on the bioentities extracted using a set of keywords associated with the original keyword “amentoflavone”. A total of 243 taxonomy networks were obtained with ChexMix from the MeSH terms of chemicals co-occurring with “amentoflavone” in the literature, and they were analyzed using basic network properties and similarity measures (Supplementary Table S3). The similarity metrics compared each of the 243 networks with the “amentoflavone” network, with the number of true positives defined as the number of nodes common to the two networks (Supplementary Table S3).
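The node-based comparison described above can be illustrated with a minimal sketch. It assumes that each network is reduced to a set of node identifiers and that the “amentoflavone” network serves as the reference. The function name, the choice of precision, recall, F1 and Jaccard index, and the toy identifiers are illustrative assumptions and are not necessarily the measures reported in Supplementary Table S3.

```python
def node_similarity(reference, candidate):
    """Compare two networks by their node sets.

    True positives are the nodes common to both networks, as in the text;
    the particular metrics returned here are an illustrative choice.
    """
    reference, candidate = set(reference), set(candidate)
    tp = len(reference & candidate)   # nodes shared by both networks
    fp = len(candidate - reference)   # nodes only in the candidate network
    fn = len(reference - candidate)   # nodes only in the reference network
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"tp": tp, "precision": precision, "recall": recall,
            "f1": f1, "jaccard": jaccard}


# Hypothetical taxonomy identifiers, for illustration only.
amentoflavone_nodes = {"tax:1", "tax:2", "tax:3", "tax:4"}
other_chemical_nodes = {"tax:3", "tax:4", "tax:5"}
scores = node_similarity(amentoflavone_nodes, other_chemical_nodes)
```

Treating the networks as node sets keeps the comparison independent of edge weights; a comparison that also accounts for edges would need a different measure.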
Moreover, ChexMix can also integrate the results of several keywords. The MeSH identifiers of the bioentities co-occurring with the keywords of interest can be used to link the results of two different queries (Fig. 4). For example, two species names, Taxus cuspidata and Podophyllum peltatum, were queried with ChexMix, which generated two small arrays consisting of bioentities with MeSH identifiers extracted from PubTator. It was possible to inspect co-occurring bioentities among the MeSH identifiers in the integrated network. The network of each species showed a different MeSH identifier profile, but MeSH identifiers related to “cancer”, in particular “ovarian neoplasms”, co-occurred in both. This is consistent with the fact that paclitaxel from T. cuspidata and podophyllotoxin from P. peltatum are well-known potent anti-cancer drugs for ovarian cancer25,26,27.
Here, we described a scenario in which ChexMix eases the complex task of compiling big data by narrowing the scope of bioentities or grouping similar bioentities through their hierarchical relationships. First, to obtain the number of appearances of the bioentities in the literature queried by the keywords of interest, ChexMix collects the PubMed and PMC literature, then retrieves the annotations for these documents from PubTator Central (PTC) and converts them into unique identifiers depending on the respective bioentity class. For the PubMed literature search, ChexMix allows Boolean operators (‘AND’, ‘OR’, ‘NOT’), double quotes for phrases and asterisks for truncated terms. Each bioentity extracted by ChexMix is assigned to more general bioentity categories and arranged in a hierarchical structure.
When single or multiple keywords of interest are entered into ChexMix, bioentities from all citations that contain the keywords are retrieved and automatically mapped to unique identifiers. The search results indicate the co-occurrence of the bioentities in the available literature, which allows them to be linked into a co-occurrence network. ChexMix simplifies the process by managing access to data from multiple sources and providing functions to manipulate the data structure of the network.
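As a rough illustration of how co-occurrence in the literature turns into a network, the sketch below counts how often two identifiers are annotated in the same citation. The input layout (a mapping from citation ID to annotated identifiers), the function name and all identifiers are hypothetical; this is not the ChexMix API.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(annotations):
    """Count how often two bioentity identifiers share a citation.

    `annotations` maps a citation ID to the identifiers annotated in it.
    The returned Counter maps an identifier pair to its citation count.
    """
    edges = Counter()
    for identifiers in annotations.values():
        # Deduplicate within a citation and sort so that each pair
        # is always stored in the same order.
        for pair in combinations(sorted(set(identifiers)), 2):
            edges[pair] += 1
    return edges


# Hypothetical citation IDs and identifiers, for illustration only.
annotations = {
    "pmid:1": ["mesh:D000001", "tax:100", "tax:200"],
    "pmid:2": ["mesh:D000001", "tax:100"],
    "pmid:3": ["tax:200"],
}
edges = cooccurrence_edges(annotations)
```

Each key of `edges` is an edge of the network and each value is its weight, so pairs that appear together in more citations receive heavier edges.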
The analysis is mainly focused on taxonomy terms to inspect species that biologically affect physiological disorders or diseases within the network. Each taxonomy name in the search results is listed in a hierarchical form. Trivial bioentities are located in the upper ranks of the list. Other species close to the resulting taxonomic tree are expected to have similar biological effects, representing potential alternative biomedical options. ChexMix can also generate the links between taxonomic terms and MeSH identifiers, which can be found under ‘Diseases [C]’ and ‘Chemicals and Drugs [D]’, in the same literature. MeSH identifiers coexisting with a taxonomic term in the literature are expected to have a close relationship.
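The hierarchical listing of taxonomy names can be sketched as follows, assuming that each species comes with its lineage from the root rank down to the species name. The function names are hypothetical and the lineages are abbreviated for illustration; real NCBI Taxonomy lineages contain many more ranks.

```python
def build_tree(lineages):
    """Merge root-to-leaf lineages into one nested dictionary."""
    tree = {}
    for lineage in lineages:
        node = tree
        for name in lineage:
            # Reuse the branch if it exists, otherwise create it.
            node = node.setdefault(name, {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented lines, one taxon per line."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(render(tree[name], depth + 1))
    return lines


# Abbreviated lineages, for illustration only.
lineages = [
    ["cellular organisms", "Viridiplantae", "Viburnum", "Viburnum erosum"],
    ["cellular organisms", "Viridiplantae", "Selaginella",
     "Selaginella tamariscina"],
]
nested = build_tree(lineages)
listing = render(nested)
```

Because shared ancestors are merged, broad ranks such as “cellular organisms” appear once at the top of the listing, and species that sit close together in the tree end up under the same branch.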
In Fig. 4, the intersection of the MeSH terms co-occurring with each taxonomy keyword is highlighted on the network formed by the union of the two networks. Networks generated from a single keyword in ChexMix can simply be reprocessed by combining set operations, such as union, difference and intersection, with other networks. Reorganizing complex networks from single or multiple keywords provides new information or clues about bioentities in PubMed, the largest biomedical database.
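The set operations mentioned above can be sketched with plain Python sets, assuming that each keyword's network is reduced to the set of MeSH terms co-occurring with that keyword. The function name is hypothetical and the terms listed are illustrative, not actual ChexMix output.

```python
def combine(networks):
    """Combine per-keyword MeSH term sets with basic set operations.

    `networks` maps a keyword to the set of MeSH terms that co-occur
    with it and must contain at least one keyword.
    """
    term_sets = list(networks.values())
    union = set().union(*term_sets)          # every term in any network
    shared = set.intersection(*term_sets)    # terms present in all networks
    # Terms found for one keyword but for none of the others.
    unique = {
        keyword: terms - set().union(
            *(t for k, t in networks.items() if k != keyword))
        for keyword, terms in networks.items()
    }
    return union, shared, unique


# Illustrative term sets, not actual ChexMix output.
networks = {
    "Taxus cuspidata": {"Ovarian Neoplasms", "Paclitaxel"},
    "Podophyllum peltatum": {"Ovarian Neoplasms", "Podophyllotoxin"},
}
union, shared, unique = combine(networks)
```

In this toy example the shared set corresponds to the highlighted intersection, while the union corresponds to the full integrated network.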
In the present study, we focused on how to use ChexMix to build a taxonomic tree or co-occurrence network from multiple keywords and to analyze networks of bioentities identified by PTC. We designed ChexMix to adapt easily to various types of bioentities and to integrate other existing databases as well as recently introduced state-of-the-art text mining systems28. We hope that other researchers will use ChexMix to integrate other datasets and to manipulate and visualize relationships between bioentities.