Chemical structure searching plays an important role in cheminformatics. For example, identification of a lead compound, which is a primary task in the early stage of drug discovery, requires tremendous effort on high-volume throughout screening of a great number of candidate compounds. It possesses a challenge for cheminformatics to improve the efficiency and accuracy to save drug development cost. Computational screening methods have been deemed as being very useful in this endeavor .
In the last three decades, the methodology of chemical structure screening has been rapidly developed. The most prevalent search scenario, similarity search, i.e. searching molecules in the data-base that have similar structure from given input, came into wide use in late 1980s . The key benefits of similarity search include: (ⅰ) Little information is needed to formulate a reasonable query. In particular, no assumption needs to be made about which part(s) of the query molecule confers activity. Thus, similarity methods can be used at the beginning of a chemical or material discovery project when there is little information about the target and only one or two known activities. (ⅱ) Many implementations of similarity methods are computationally inexpensive, so large database searches cannot be efficiently performed.
Despite the advantages offered by structure search and its growing interest in both academia and industry, the majority of community of cheminformatics has been kept away from this technology for years, largely because of the lack of high-performance structure search engines for the general research community.
Conventionally, chemical structure screening systems are developed on a specialized database or indexing engine. However, existing specialized databases are less flexible compared with general-purpose database engines. Furthermore, in most existing specialized databases, screening through millions of structures takes seconds or even minutes, which is too slow for an interactive Web service.
The challenges to improve performance are threefold. First, most existing systems do a linear scan of almost all candidate structures. Although the similarity computation of a single candidate structure is inexpensive, such scan is still too costly when multiplied by millions of structures in the database. Second, the search criteria should support not only structure similarity, but also other chemical properties. Chemical information is highly variable in structure and therefore requires high flexibility in management such as indexing and storage. Third, high performance databases require extensive optimizations, e.g. leveraging event-driven I/O and managing in-memory cache efficiently.
For the first challenge, our solution is to transform the similarity search problem onto a general text search problem. It is inspired from image search, where the search engine returns the most similar images without calculating the similarity of all candidate images one by one. A large-scale image search engine extracts a descriptor, like PCA-SIFT , from every image in its database, then builds an inverted index of the descriptors to store the list of occurrences of each descriptor. When the user up-loads an image to search, the engine calculates the descriptors of the uploaded image, and searches for the best matches according to the inverted index.
In the field of chemical searching, to describe the topological structure of a molecule, there are many kinds of descriptors. The most famous ones are ap  (Atomic Pair) and tt  (Topological Torsion). Asides from ap and tt, other descriptors include dp, the Donor-Acceptor Pair , which is a reduced pharmacophore version of ap. Such descriptors can be used for molecular structural searching.
In this article, we built a database that contained a correspondence between molecules and their descriptors, and an inverted index that treated each descriptor as a unique keyword. When performing similarity search, we translated the query (a structure) into descriptors (keywords) and performed a general text search on the keywords.
The second challenge is large variety and hetero-geneity of chemical information. Relational data models require a fixed schema of all possible properties (fields), but materials may have hundreds of properties which are hard to enumerate a priori. So we choose schema-less document-oriented database instead of relational database. We describe each material or molecule as a document, which is roughly equivalent to the programming concept of an object. Instead of designing a "type" or a "structure" for the data, we simply put the semistructured document into database, based on the naming conventions.
This schema-less design gives us the ability to maintain multiple heterogeneous datasets in a same database and search their structure or performance data via a unified query interface. The index server incrementally indexes every property of inserted or updated documents, and the new documents will appear in search results after indexing process completes.
Over the past decade or so, we see the advent of open source schema-less databases (e.g., MongoDB  and CouchDB ) and text search engines (e.g., Lucene  and ElasticSearch ) that have high scalability and high performance. These tools are employed to solve the third challenge. Building on the basis of well-developed software platform enjoys the benefits of performance, scalability, system stability, and community support.
In this work, we present DCAIKU, a large-scale chemical structure & data search engine, built on CouchDB  and ElasticSearch . CouchDB, as one of the most prevalent schema-less databases, is used for document storage. ElasticSearch, as one of the most advanced text search engines, to provide inverted index with high performance. DCAIKU's database design, search interface and visualization, implementation of keyword search and chemical structural search, and finally, evaluation results are also presented.
DCAIKU is a Web service for users to search molecules and crystals by keyword and/or structure. The molecular information of the chemicals is stored in a CouchDB  database and indexed by ElasticSearch . Users specify structure query via a molecular drawing panel. When the Web backend in DCAIKU server receives a query from the browser, it parses the query, searches the ElasticSearch index to find matching chemicals ordered by relevance, then fetches their molecular information from the CouchDB database and sends them to the browser. Finally, the browser visualizes the molecular structure for the user.
When new molecular structures are collected, we optimize their geometric structures and calculate their structural descriptors for searching. Then the molecular information is stored in the CouchDB database, shown as the red dotted line in FIG. 1. A synchronizer subscribes changes to the CouchDB database and automatically creates indexes for inserted or updated records in ElasticSearch, shown as the red dashed line in FIG. 1.Ⅱ. SCHEMA-LESS DATABASE DESIGN
Relational data models impose strict constraints on data types, which is impractical for chemical data. For example, a crystallographic information file (CIF) usually contains atom coordinates, which is a two-dimensional table. In relational model, such a table should be represented in multiple entries of a one-to-many relation table, which adds complexity in query and inefficiency in storage. Furthermore, the CIF format defines hundreds of properties regarding experimental measurements, analysis, atomicity, chemical properties, structure, publication and metadata, but a typical CIF file may only contain a small fraction of defined properties. If we use relational model, most cells in the table would be empty, which is a waste of storage space.
Therefore, we design our database to be non-relational and schema-less. Because chemical datasets are heterogeneous, and different datasets may have dramatically different sets of properties, we categorized our data into several cards. Each record card belongs to one type, and contains the properties in a specific domain, like naming, identification, structure, or material performance.
The top level of our database is an unordered map from material ID (MID) to an array of data cards. Each card is a CouchDB Document with a field named "type" indicating its type and rest of fields for its detailed data. For any material, it can contain multiple cards with the same type, for storing multiple data records within the same domain.
Aside from "＄type" field, other metadata fields of record cards include "＄origin" and "＄provider". The "＄origin" may be a link to a scholar article, or an open-access database; and the "＄provider" indicates the person or organization who uploaded this record.
Currently, we have defined five record types in our database: identification, aliases, structure, crystal-lography, and computed properties. An identification card records the identification of given material or molecule, typically IUPAC names, SMILES, InCHI, and CAS numbers. Alias cards describe the synonyms of given material, and it can be used in the name search.
Structure cards contain one structure of a given material, with the format of it. Currently we have standard delay format (SDF) and CIF structure records, and the SDF structures are searchable. The actual structure data can be stored either in the card, or in another file, and the card contains only the link to that file. A structure may be marked as "visualization", "measured" or "computed", to indicate its source and precision.
The crystallography and computed properties cards are examples of performance data cards. In these cards, data records are mostly numerical, and can be searched using range filters.
The sources of the data in DCAIKU could be collected in various ways: providing an interface for researchers to upload their "fresh" data from experiments or using a spider to collect from other public services. Our card-based design is capable to record multiple data for one type of one material, and their source could be also recorded. The database can also provide quality gating, by adding an additional field in the cards, and filter out the cards marked as "low quality".Ⅲ. SEARCH INTERFACE
To optimize user experience, we simplify our user-facing search interface to two simple components: a query string, or a structural query. Either query string or structural query can be empty, but they cannot be both empty, which makes this search query meaningless. The user interface for DCAIKU search engine is only a simple input box with two buttons, "structure" and "filter" (FIG. 2):
After the "structure" button is clicked, it will show a drawing panel with a ChemDoodle  drawing control, where the user draws the molecular structure he/she wants to search. The drawn molecule will be serialized into a standardized MOL format, which becomes the structural query.
The query string contains all non-structural searching demands in the free-form input box. The backend search engine breaks the query string into several keywords and forms a search query applied to ElasticSearch. A keyword may be a word in natural language, indicating name search, or a filter, for searching performance or other detailed data in our database. All filters are in this form: fieldname operator value. The fieldname is an English word indicating one searchable data field in our database, usually from a performance record card. The operator indicates the operation of filtering. For numerical fields, users can use "
For users' convenience, we designed a "filter" button aside the input control. This button contains a menu of filterable fields in our database, such as "molecular mass" and "band width". When a filter menu item is clicked, its keyword will be appended into the input box, so that users can type the search condition after it.
Similar to Web search, sometimes we need a "strict" filter that forces all matches to pass it, therefore DCAIKU supports adding a "+" operator before any keyword or filter, indicating this item must be satisfied; or a "
The search query will be encoded into a JSON object and sent to the server via HTTP POST request, while its results will be returned as a response. A normal response (with HTTP status code 200) indicates a successful query, and the response body will be a JSON array containing the matches within one page. Other HTTP status codes indicate an invalid query.Ⅳ. INTERACTIVE STRUCTURE VISUALIZATION
We develop an interactive frontend rendering engine based on WebGL  to visualize the structure of molecules on the material details page, so that the user can view the 3-dimensional structure of the material chosen and navigate around or even inside the molecule interactively.
We made the first effort to consider diffuse reflection in molecule visualization. In previous molecule visualization systems, like RasMol  and VMD , all frontfacing atoms receive the same amount of light, so atoms in outer and inner layers are simply indistinguishable. Our system can clearly visualize the difference of light intensity and shadow effects of surface atoms and internal atoms (see FIG. 3). In addition, we considered the nature of the atom when calculating reflection, for example metal atoms have larger reflectivity than nonmetal ones. We also optimized the
Given that DCAIKU is built on ElasticSearch, the query string part will be converted into a standard ElasticSearch query object
After the conversion, the ElasticSearch query object will be sent to the ElasticSearch service. The ElasticSearch service will return an ordered list of matching MIDs, then the backend server fetches the corresponding documents from CouchDB to form the JSON to return to user.
The matches returned are first sorted by the quantity of satisfied filters in the "OR" query in descending order; then sorted by relevance in descending order as well. Relevance of a document is defined to be the sum of keyword relevance and structural relevance. Keyword relevance is the sum of occurrences of each keyword in query string, normalized by the overall frequency of the keyword in the database. This ranking algorithm is known as term frequency-inverse document frequency (TF-IDF) . Similarly, structural relevance is calculated as the sum of occurrences of each matching descriptor, normalized by the overall frequency of the descriptor in the database and the total number of descriptors in the molecule.
To speed up keyword search, DCAIKU builds full-text indexes on each field name. For structure search, an unordered index is created for each ap/tt descriptor. On searching a query, DCAIKU orders the fields by the cardinality of distinct values in the field, so that the most discriminative field is filtered first by the index, resulting in less records to be examined one by one.Ⅵ. STRUCTURAL SEARCH A. Descriptors
In DCAIKU, we choose to use ap and tt descriptors for molecular structural search. An ap descriptor records the information of two heavy atoms in a given molecule, and the length of the shortest path connecting them, while a tt descriptor describes four atoms connected by three bonds. The atomic information contains three parts: (ⅰ) The element name of a certain atom, (ⅱ) Number of heavy atoms connected to it, and (ⅲ) Number of
Ap descriptors are obtained from the atom information of all atom pairs in the molecule, with the shortest distance between each pair. In the example above, an ap will be "C10.O11.3" between atom 1 and atom 8. Note that there may be some identical aps, and our system records all of them, because their quantity is also important in this situation.
The tt descriptors contain the atom information from four connected atoms, like the C1
All descriptors are encoded into a sequence of alphanumeric "words" separated by spaces because the word-based searching behavior of ElasticSearch handles single "words" better.B. Algorithm
Given that a molecule structure is also a graph with vertex labels identical to the element name, and edge labels representing the bond order, our descriptor-generation algorithm takes a molecule encoded in the MOL format first, and converts it into a graph
To obtain the ap descriptors, we need to calculate the shortest distances of all pairwise atoms. This can be easily achieved by the Floyd-Warshall algorithm . However, considering
For tt descriptors, we can find that the distance between the first and the last atoms are exactly 3. Therefore, we can obtain a
The operation, "Encode", is used to convert a descriptor to a string that can be stored in our database. We choose to use Base64 VLQ  to encode all our descriptors.C. Structural search protocol
Structural searching does not share the same protocol as query strings. When a user performs a structural search, the backend server will receive the structure he/she has drawn, and convert that structure into a sequence of descriptors. After that, we convert all descriptors of the query structure into a series of match queries and send them to ElasticSearch. Users may combine structural search with query strings, which is handled by a simple and Boolean search provided by ElasticSearch.Ⅶ. EVALUATION
We've built a test database based on PubChem's open data  and Crystallography Open Database (COD) , resulting a test database containing 3, 585, 725 small molecules and 295, 223 crystal structures.
The DCAIKU search engine runs on a single server with two Intel Xeon E5-2650 v3 CPUs, 128 GiB of DRAM and 2 TB of NVMe SSD. We use 6 representative keyword queries (Table Ⅰ) and 6 representative structural queries (Table Ⅱ) to measure the accuracy and performance.
Because network latency and response downloading time may be nonnegligible, we precalibrate the network round-trip latency using ping, and measure the time from sending the last byte of the query to receiving the first byte of the response. The reported latency is the measured query to response time subtracted by the network latency. The latency is measured when the system is almost idle. For throughput benchmark, we stress the site with a continuous stream of identical queries, keeping 50 queries in parallel. We found that none of the search engines cache query results. The throughput is given by the average number of responses received per second.A. Keyword search
The results of keyword search are shown in Table Ⅰ. It reveals that though the input query is relatively complex, our system built on ElasticSearch can handle them well, providing our desired results, while maintaining the low latency. The results of searching for "Silicon without element Si" (test 2) results with Germanium, because that the data entry of Germanium has a reference to silicon. The contradiction test 4 returns an expected null-result.
Tests 5 and 6 in our crystallography database shows the potential of our system for searching molecules or materials satisfying specific performance requirements.
A keyword search returning hundreds of matches typically completes within 0.3 s, regardless of the number of search conditions.B. Structural search
Structural search results are shown in Table Ⅱ. From this table, we can see that our descriptor system works well for most queries: in most cases, they return the same molecule in the first place of the candidate list. The first non-identical similar result reveals that our ap/tt descriptor system can verify the similarity between the query and other molecules in our database.
However, in test 6, our system does not return the exact query in the first place. This shows one of the disadvantages of the ap/tt system: for topologically similar molecules, but with different lengths of carbon chains, our ap/tt descriptors may be too similar to distinguish relevant candidates effectively.
The result of test 3 demonstrates the ability of searching molecules that are similar to an unrecorded molecule in our system. Our system returns a 7-membered ring instead of the six-membered in our query, while preserving other properties, especially leaving heterocycles as it is.
Structural search has higher latency than keyword search because a large portion of molecules match at least one ap/tt descriptor with the query. Our system can retrieve the top 25 results sorted by relevance in 3 s for all test queries. Tests 4
In addition, we measure the latency of DCAIKU under stress test of 20 concurrent queries. Compared to idle latency, the latency under high load varies from 1.1
For comparison, we benchmark several public chemical search engines, namely eMolecules , PubChem  and ChemSpider , using the same methodology as testing DCAIKU. The latency measurements have less than 10% standard deviation, so we only report the average latency of 10 repeated tests. For public services not supporting advanced filters, we convert each filter into a simple keyword search.
Our system has only one Linux server and contains 129 GiB data and 47 GiB index of 3.9 M records. The corpus size of eMolecules, PubChem and ChemSpider are 1.5, 91, and 58 M respectively; these services are also using multiple servers to provide their service, with load balancing and clustering, rather than single server. Therefore, although our corpus size is smaller than PubChem and ChemSpider, DCAIKU is able to seamlessly scale to a cluster of servers by splitting the corpus to multiple disjoint portions to be served by different servers. Merging the top N results (e.g. the first page of results) from each server is a cheap operation, thus the overall query latency and throughput would not differ much from a single node.
As shown in FIG. 5 (a) and (b), the query latency of DCAIKU is significantly lower than existing public services, especially in structural searching. Some services, like ChemSpider, takes about one minute to return a structural search result, while all queries to DCAIKU return within 3 s.
As shown in FIG. 5 (c) and (d), the throughput of DCAIKU also outperforms other public services. For structural search, the throughput is only bench-marked for DCAIKU and eMolecules due to concurrency restrictions of other two services. The throughput of DCAIKU is bounded by the CPU performance of fetching and merging indexes in relevance order. The structural search throughput is an order of magnitude lower than keyword search, because a structural query may have tens of ap/tt descriptors while a keyword search typically only contains a few keywords or filters.Ⅸ. CONCLUSION
This paper presents DCAIKU, a high-performance chemical structure & data search engine, built on CouchDB and ElasticSearch. DCAIKU converts the chemical structure similarity search problem into a general text search problem to utilize off-the-shelf full-text search engines. DCAIKU also supports flexible document structures and heterogeneous datasets with the help of schema-less document database, so that users can freely combine their demands about molecules, materials and other chemistry topics via a unified query interface.
Our evaluations show that DCAIKU can handle both keyword search and structural search against millions of records with both high accuracy and low latency. For queries that have no perfect match in the database, DCAIKU can return similar results as well. Our work makes a crystal-clear case showing the feasibility of large-scale molecule structure search through ap/tt descriptor matching.
The DCAIKU search engine will enable large-scale and cost-effective structural search in materials science and chemistry research. We plan to further improve the system in the following aspects: (ⅰ) Support more material types, especially in biochemistry, like proteins, drugs, etc. (ⅱ) Improve the accuracy of structural search by adding more descriptors to distinguish different lengths of carbon chains in topologically similar structures. (ⅲ) Support donor-acceptor search, which is very common in catalyst design and drug design.Ⅹ. ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China, the Ministry of Science and Technology of China, and the Swedish Research Council.
|||P. Willett, J. M. Barnard, and G. M. Downs, J. Chem. Inf. Comput. Sci. 38 , 983 (1998). DOI:10.1021/ci9800211|
|||G. M. Downs and P. Willett, Reviews in Com-putational Chemistry, K. B. Lipkowitz and D. B. Boyd Eds., Hoboken, New Jersey: Wiley-VCH, Inc., (2007).|
|||Y. Ke and R. Sukthankar, Proceedings of the 2004 IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition, Washington, DC, USA: IEEE, (2004).|
|||R. E. Carhart, D. H. Smith, and R. Venkata-raghavan, J. Chem. Inf. Comput. Sci. 25 , 64 (1985). DOI:10.1021/ci00046a002|
|||R. Nilakantan, N. Bauman, J. S. Dixon, and R. Venkataraghavan, J. Chem. Inf. Comput. Sci. 27 , 82 (1987). DOI:10.1021/ci00054a008|
|||V. Gutmann, The Donor-Acceptor Approach to Molecular Interactions. New York: Plenum Press (1978).|
|||K. Chodorow, MongoDB: the Definitive Guide, 2nd Ed, Sebastopol, US: O'Reilly Media, Inc., (2013).|
|||J. C. Anderson, J. Lehnardt, and N. Slater, CouchDB: the Definitive Guide, Cambridge: O'Reilly Media, Inc., (2010).|
|||M. McCandless, E. Hatcher, and O. Gospodnetić, Lucene in Action: Covers Apache Lucene 3. 0, 2nd Ed, Greenwich: Manning Publications Co., (2010).|
|||C. Gormley and Z. Tong, Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine, Sebastopol: O'Reilly Media, Inc., 328(2015).|
|||M. C. Burger, J. Cheminform. 7 , 35 (2015). DOI:10.1186/s13321-015-0085-3|
|||C. Marrin, WebGL Specification, Khronos WebGL Working Group, (2011).|
|||R. A. Sayle, and E. J. Milner-White, Trends Biochem. Sci. 20 , 374 (1995). DOI:10.1016/S0968-0004(00)89080-5|
|||W. Humphrey, A. Dalke, and K. Schulten, J. Mol. Graph. 14 , 33 (1996). DOI:10.1016/0263-7855(96)00018-5|
|||G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, Auckland, Tokyo: McGraw-Hill, Inc., (1983).|
|||R. W. Floyd, Commun. ACM 5 , 345 (1962).|
|||M. Thorup, J. ACM 46 , 362 (1999). DOI:10.1145/316542.316548|
|||S. Josefsson, The Base16, Base32, and Base64 Data Encodings, US: Network Working Group, (2006).|
|||E. E. Bolton, Y. L. Wang, P. A. Thiessen, and S. H. Bryant, Annu. Rep. Comput. Chem. 4 , 217 (2008). DOI:10.1016/S1574-1400(08)00012-1|
|||S. Gražulis, A. Daškevič, A. Merkys, D. Chateigner, L. Lutterotti, M. Quir, N. R. Serebryanaya, P. Moeck, R. T. Downs, and A. Le Bail, Nucleic Acids Res. 40 , D420 (2011).|