Chinese Journal of Polar Since  2018, Vol. 31 Issue (3): 341-349

The article information

Ren-zhi Li, Bo-jie Li, Guo-zhen Zhang, Jun Jiang, Yi Luo
李任之, 李博杰, 张国桢, 江俊, 罗毅
A High-Performance and Flexible Chemical Structure & Data Search Engine Built on CouchDB & ElasticSearch
基于CouchDB和Elastic Search的高性能化学结构搜索引擎与数据库的构建
Chinese Journal of Polar Since, 2018, 31(3): 341-349
化学物理学报, 2018, 31(3): 341-349
http://dx.doi.org/10.1063/1674-0068/31/cjcp1711202

Article history

Received on: November 6, 2017
Accepted on: December 25, 2017
A High-Performance and Flexible Chemical Structure & Data Search Engine Built on CouchDB & ElasticSearch
Ren-zhi Li, Bo-jie Li, Guo-zhen Zhang, Jun Jiang, Yi Luo     
Dated: Received on November 6, 2017; Accepted on December 25, 2017
Hefei National Laboratory for Physical Sciences at the Microscale, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei 230026, China
*Author to whom correspondence should be addressed. Jun Jiang, E-mail:jiangj1@ustc.edu.cn; Yi Luo, E-mail:yiluo@ustc.edu.cn
Abstract: Computer-assisted chemical structure searching plays a critical role for efficient structure screening in cheminformatics. We designed a high-performance chemical structure & data search engine called DCAIKU, built on CouchDB and ElasticSearch engines. DCAIKU converts the chemical structure similarity search problem into a general text search problem to utilize off-the-shelf full-text search engines. DCAIKU also supports flexible document structures and heterogeneous datasets with the help of schema-less document database. Our evaluations show that DCAIKU can handle both keyword search and structural search against millions of records with both high accuracy and low latency. We expect that DCAIKU will lay the foundation towards large-scale and cost-effective structural search in materials science and chemistry research.
Key words: Search engine     Cheminformatics     Structural search     Schema-less databases    
Ⅰ. INTRODUCTION

Chemical structure searching plays an important role in cheminformatics. For example, identification of a lead compound, which is a primary task in the early stage of drug discovery, requires tremendous effort on high-volume throughout screening of a great number of candidate compounds. It possesses a challenge for cheminformatics to improve the efficiency and accuracy to save drug development cost. Computational screening methods have been deemed as being very useful in this endeavor [1].

In the last three decades, the methodology of chemical structure screening has been rapidly developed. The most prevalent search scenario, similarity search, i.e. searching molecules in the data-base that have similar structure from given input, came into wide use in late 1980s [2]. The key benefits of similarity search include: (ⅰ) Little information is needed to formulate a reasonable query. In particular, no assumption needs to be made about which part(s) of the query molecule confers activity. Thus, similarity methods can be used at the beginning of a chemical or material discovery project when there is little information about the target and only one or two known activities. (ⅱ) Many implementations of similarity methods are computationally inexpensive, so large database searches cannot be efficiently performed.

Despite the advantages offered by structure search and its growing interest in both academia and industry, the majority of community of cheminformatics has been kept away from this technology for years, largely because of the lack of high-performance structure search engines for the general research community.

Conventionally, chemical structure screening systems are developed on a specialized database or indexing engine. However, existing specialized databases are less flexible compared with general-purpose database engines. Furthermore, in most existing specialized databases, screening through millions of structures takes seconds or even minutes, which is too slow for an interactive Web service.

The challenges to improve performance are threefold. First, most existing systems do a linear scan of almost all candidate structures. Although the similarity computation of a single candidate structure is inexpensive, such scan is still too costly when multiplied by millions of structures in the database. Second, the search criteria should support not only structure similarity, but also other chemical properties. Chemical information is highly variable in structure and therefore requires high flexibility in management such as indexing and storage. Third, high performance databases require extensive optimizations, e.g. leveraging event-driven I/O and managing in-memory cache efficiently.

For the first challenge, our solution is to transform the similarity search problem onto a general text search problem. It is inspired from image search, where the search engine returns the most similar images without calculating the similarity of all candidate images one by one. A large-scale image search engine extracts a descriptor, like PCA-SIFT [3], from every image in its database, then builds an inverted index of the descriptors to store the list of occurrences of each descriptor. When the user up-loads an image to search, the engine calculates the descriptors of the uploaded image, and searches for the best matches according to the inverted index.

In the field of chemical searching, to describe the topological structure of a molecule, there are many kinds of descriptors. The most famous ones are ap [4] (Atomic Pair) and tt [5] (Topological Torsion). Asides from ap and tt, other descriptors include dp, the Donor-Acceptor Pair [6], which is a reduced pharmacophore version of ap. Such descriptors can be used for molecular structural searching.

In this article, we built a database that contained a correspondence between molecules and their descriptors, and an inverted index that treated each descriptor as a unique keyword. When performing similarity search, we translated the query (a structure) into descriptors (keywords) and performed a general text search on the keywords.

The second challenge is large variety and hetero-geneity of chemical information. Relational data models require a fixed schema of all possible properties (fields), but materials may have hundreds of properties which are hard to enumerate a priori. So we choose schema-less document-oriented database instead of relational database. We describe each material or molecule as a document, which is roughly equivalent to the programming concept of an object. Instead of designing a "type" or a "structure" for the data, we simply put the semistructured document into database, based on the naming conventions.

This schema-less design gives us the ability to maintain multiple heterogeneous datasets in a same database and search their structure or performance data via a unified query interface. The index server incrementally indexes every property of inserted or updated documents, and the new documents will appear in search results after indexing process completes.

Over the past decade or so, we see the advent of open source schema-less databases (e.g., MongoDB [7] and CouchDB [8]) and text search engines (e.g., Lucene [9] and ElasticSearch [10]) that have high scalability and high performance. These tools are employed to solve the third challenge. Building on the basis of well-developed software platform enjoys the benefits of performance, scalability, system stability, and community support.

In this work, we present DCAIKU, a large-scale chemical structure & data search engine, built on CouchDB [8] and ElasticSearch [10]. CouchDB, as one of the most prevalent schema-less databases, is used for document storage. ElasticSearch, as one of the most advanced text search engines, to provide inverted index with high performance. DCAIKU's database design, search interface and visualization, implementation of keyword search and chemical structural search, and finally, evaluation results are also presented.

DCAIKU is a Web service for users to search molecules and crystals by keyword and/or structure. The molecular information of the chemicals is stored in a CouchDB [8] database and indexed by ElasticSearch [10]. Users specify structure query via a molecular drawing panel. When the Web backend in DCAIKU server receives a query from the browser, it parses the query, searches the ElasticSearch index to find matching chemicals ordered by relevance, then fetches their molecular information from the CouchDB database and sends them to the browser. Finally, the browser visualizes the molecular structure for the user.

When new molecular structures are collected, we optimize their geometric structures and calculate their structural descriptors for searching. Then the molecular information is stored in the CouchDB database, shown as the red dotted line in FIG. 1. A synchronizer subscribes changes to the CouchDB database and automatically creates indexes for inserted or updated records in ElasticSearch, shown as the red dashed line in FIG. 1.

FIG. 1 System architecture of DCAIKU.
Ⅱ. SCHEMA-LESS DATABASE DESIGN

Relational data models impose strict constraints on data types, which is impractical for chemical data. For example, a crystallographic information file (CIF) usually contains atom coordinates, which is a two-dimensional table. In relational model, such a table should be represented in multiple entries of a one-to-many relation table, which adds complexity in query and inefficiency in storage. Furthermore, the CIF format defines hundreds of properties regarding experimental measurements, analysis, atomicity, chemical properties, structure, publication and metadata, but a typical CIF file may only contain a small fraction of defined properties. If we use relational model, most cells in the table would be empty, which is a waste of storage space.

Therefore, we design our database to be non-relational and schema-less. Because chemical datasets are heterogeneous, and different datasets may have dramatically different sets of properties, we categorized our data into several cards. Each record card belongs to one type, and contains the properties in a specific domain, like naming, identification, structure, or material performance.

The top level of our database is an unordered map from material ID (MID) to an array of data cards. Each card is a CouchDB Document with a field named "type" indicating its type and rest of fields for its detailed data. For any material, it can contain multiple cards with the same type, for storing multiple data records within the same domain.

Aside from "$type" field, other metadata fields of record cards include "$origin" and "$provider". The "$origin" may be a link to a scholar article, or an open-access database; and the "$provider" indicates the person or organization who uploaded this record.

Currently, we have defined five record types in our database: identification, aliases, structure, crystal-lography, and computed properties. An identification card records the identification of given material or molecule, typically IUPAC names, SMILES, InCHI, and CAS numbers. Alias cards describe the synonyms of given material, and it can be used in the name search.

Structure cards contain one structure of a given material, with the format of it. Currently we have standard delay format (SDF) and CIF structure records, and the SDF structures are searchable. The actual structure data can be stored either in the card, or in another file, and the card contains only the link to that file. A structure may be marked as "visualization", "measured" or "computed", to indicate its source and precision.

The crystallography and computed properties cards are examples of performance data cards. In these cards, data records are mostly numerical, and can be searched using range filters.

The sources of the data in DCAIKU could be collected in various ways: providing an interface for researchers to upload their "fresh" data from experiments or using a spider to collect from other public services. Our card-based design is capable to record multiple data for one type of one material, and their source could be also recorded. The database can also provide quality gating, by adding an additional field in the cards, and filter out the cards marked as "low quality".

Ⅲ. SEARCH INTERFACE

To optimize user experience, we simplify our user-facing search interface to two simple components: a query string, or a structural query. Either query string or structural query can be empty, but they cannot be both empty, which makes this search query meaningless. The user interface for DCAIKU search engine is only a simple input box with two buttons, "structure" and "filter" (FIG. 2):

FIG. 2 Searching user interface.

After the "structure" button is clicked, it will show a drawing panel with a ChemDoodle [11] drawing control, where the user draws the molecular structure he/she wants to search. The drawn molecule will be serialized into a standardized MOL format, which becomes the structural query.

The query string contains all non-structural searching demands in the free-form input box. The backend search engine breaks the query string into several keywords and forms a search query applied to ElasticSearch. A keyword may be a word in natural language, indicating name search, or a filter, for searching performance or other detailed data in our database. All filters are in this form: fieldname operator value. The fieldname is an English word indicating one searchable data field in our database, usually from a performance record card. The operator indicates the operation of filtering. For numerical fields, users can use "$>$", "$ <$" or "=", to filter these values. For text fields, the ":" operator specifies a specific field to match, for example, "system:Cubic" will not (mistakenly) match molecules with "Cubic" in its name.

For users' convenience, we designed a "filter" button aside the input control. This button contains a menu of filterable fields in our database, such as "molecular mass" and "band width". When a filter menu item is clicked, its keyword will be appended into the input box, so that users can type the search condition after it.

Similar to Web search, sometimes we need a "strict" filter that forces all matches to pass it, therefore DCAIKU supports adding a "+" operator before any keyword or filter, indicating this item must be satisfied; or a "$-$", indicating all matches must avoid it. Users can combine normal keywords, mandatory keywords, and negative keywords in arbitrary quantity or order.

The search query will be encoded into a JSON object and sent to the server via HTTP POST request, while its results will be returned as a response. A normal response (with HTTP status code 200) indicates a successful query, and the response body will be a JSON array containing the matches within one page. Other HTTP status codes indicate an invalid query.

Ⅳ. INTERACTIVE STRUCTURE VISUALIZATION

We develop an interactive frontend rendering engine based on WebGL [12] to visualize the structure of molecules on the material details page, so that the user can view the 3-dimensional structure of the material chosen and navigate around or even inside the molecule interactively.

We made the first effort to consider diffuse reflection in molecule visualization. In previous molecule visualization systems, like RasMol [13] and VMD [14], all frontfacing atoms receive the same amount of light, so atoms in outer and inner layers are simply indistinguishable. Our system can clearly visualize the difference of light intensity and shadow effects of surface atoms and internal atoms (see FIG. 3). In addition, we considered the nature of the atom when calculating reflection, for example metal atoms have larger reflectivity than nonmetal ones. We also optimized the $z$-index calculation algorithm for molecular scenario, so the rendering can be done in reasonable performance (about 25 frames per seconds on a laptop with modern GPU, with 5 s of loading and preparation time) even if there are thousands of molecules in a single search.

FIG. 3 (a) Structure of "bio-MOF-1" visualized in DCAIKU. The shading reveals its inner structure. (b) The same crystal shown in VESTA [23] for comparison. Lack of shading reduces the ability of clear visualization.

Due to web browsers' limitation of the number of WebGL instances per page (15 for Chrome, in particular), our WebGL-based structure visualization engine cannot be used to render thumbnails on search result page. So, we also develop a JavaScript-based rendering engine on the search result page to show a thumbnail of the 3D molecule structure for each search result.

Ⅴ. KEYWORD SEARCH

Given that DCAIKU is built on ElasticSearch, the query string part will be converted into a standard ElasticSearch query object $Q$ inside our server, and passed into ElasticSearch. $Q$ contains three parts: (ⅰ) Neutral indicates the general keywords, (ⅱ) Positive indicates mandatory positive keywords, and (ⅲ) Negative indicates negative keywords.

$Q$ will be converted into the "should", "must" and "must not" parts of an ElasticSearch bool query, respectively. Following the semantics of bool in ElasticSearch, such a map will be conformal to our design of query strings.

Items in $Q$'s Neutral list are all ElasticSearch match queries, with one field name associated to the keyword. Given that there may be multiple database fields that can be treated as a molecule or a material's "name", multiple terms will be produced, and passed into ElasticSearch. Items in $Q$'s positive and negative lists are somehow different. Despite that in most cases, they are identical to neutral, however sometimes we need to construct a more complex search expression from the keyword. For example, when searching a molecular formula like "C6H12O6", matching a formula field containing terms "C6", "H12", and "O6", is more appropriate than matching the entire "C6H12O6" directly. Therefore, we added a preprocessing step to segment certain kinds of keywords into a Lisplike S-expression, and construct an ElasticSearch Boolean query term based on it. The preprocessors may vary from field to field. In the former example, the keyword "C6H12O6" will be transformed into expression [OR, "C6H12O6", [AND, "C6", "H12", "O6"]]. We include the original keyword as well as the transformed S-expression to prioritize documents that match the original keyword exactly. The full conversion algorithm is shown in Algorithm 1.

Algorithm 1 ncoding a search term $q$ with field name $f$.

After the conversion, the ElasticSearch query object will be sent to the ElasticSearch service. The ElasticSearch service will return an ordered list of matching MIDs, then the backend server fetches the corresponding documents from CouchDB to form the JSON to return to user.

The matches returned are first sorted by the quantity of satisfied filters in the "OR" query in descending order; then sorted by relevance in descending order as well. Relevance of a document is defined to be the sum of keyword relevance and structural relevance. Keyword relevance is the sum of occurrences of each keyword in query string, normalized by the overall frequency of the keyword in the database. This ranking algorithm is known as term frequency-inverse document frequency (TF-IDF) [15]. Similarly, structural relevance is calculated as the sum of occurrences of each matching descriptor, normalized by the overall frequency of the descriptor in the database and the total number of descriptors in the molecule.

To speed up keyword search, DCAIKU builds full-text indexes on each field name. For structure search, an unordered index is created for each ap/tt descriptor. On searching a query, DCAIKU orders the fields by the cardinality of distinct values in the field, so that the most discriminative field is filtered first by the index, resulting in less records to be examined one by one.

Ⅵ. STRUCTURAL SEARCH A. Descriptors

In DCAIKU, we choose to use ap and tt descriptors for molecular structural search. An ap descriptor records the information of two heavy atoms in a given molecule, and the length of the shortest path connecting them, while a tt descriptor describes four atoms connected by three bonds. The atomic information contains three parts: (ⅰ) The element name of a certain atom, (ⅱ) Number of heavy atoms connected to it, and (ⅲ) Number of $\pi$-bonds connected to it. For example, in this molecule shown in FIG. 4, the atom information is: 1: C10, 2: C30, 3: C20, 4: C20, 5: C20, 6: C20, 7: C31, 8: O11.

FIG. 4 A sample molecule.

Ap descriptors are obtained from the atom information of all atom pairs in the molecule, with the shortest distance between each pair. In the example above, an ap will be "C10.O11.3" between atom 1 and atom 8. Note that there may be some identical aps, and our system records all of them, because their quantity is also important in this situation.

The tt descriptors contain the atom information from four connected atoms, like the C1$-$C2$-$C3$-$C4 or C1$-$C2$-$C7$-$O8 in the example above. Descriptors are simply the atom information of these atoms. For torsion C1$-$C2$-$C7$-$O8, the corresponding tt descriptor is "C10.C30.C31.O11".

All descriptors are encoded into a sequence of alphanumeric "words" separated by spaces because the word-based searching behavior of ElasticSearch handles single "words" better.

B. Algorithm

Given that a molecule structure is also a graph with vertex labels identical to the element name, and edge labels representing the bond order, our descriptor-generation algorithm takes a molecule encoded in the MOL format first, and converts it into a graph $G$=($V$, $E$) where $V_i$ represents the element name, and $E_{ij}$ represents the bond order between $V_i$ and $V_j$ (or zero if they are not connected). In the first step, we counted the bonds and $\pi$-bonds connected to each atom, stored in two arrays, $B_i$ and $P_i$, respectively. After this process, such information and each atom's element name are are bundled into a triplet termed as "atom information" $A_i$ and it would be used in the further encoding process.

Algorithm 2 Counting bonds and $\pi$-atoms.

To obtain the ap descriptors, we need to calculate the shortest distances of all pairwise atoms. This can be easily achieved by the Floyd-Warshall algorithm [16]. However, considering $G$ is sparse ($|E|$=$O(|V|)$), and we care only about the quantity of bonds, we decided to use a faster $O(VE)$ algorithm introduced in 1999 by Thorup [17]. The shortest distances will be stored in the matrix $D_{ij}$. Algorithm 3 shows the algorithm used to get ap descriptors.

Algorithm 3 Getting ap descriptors.

For tt descriptors, we can find that the distance between the first and the last atoms are exactly 3. Therefore, we can obtain a $ O(V^2)$algorithm (Algorithm 4) with the help of $D_{ij}$.

The operation, "Encode", is used to convert a descriptor to a string that can be stored in our database. We choose to use Base64 VLQ [18] to encode all our descriptors.

C. Structural search protocol

Structural searching does not share the same protocol as query strings. When a user performs a structural search, the backend server will receive the structure he/she has drawn, and convert that structure into a sequence of descriptors. After that, we convert all descriptors of the query structure into a series of match queries and send them to ElasticSearch. Users may combine structural search with query strings, which is handled by a simple and Boolean search provided by ElasticSearch.

Ⅶ. EVALUATION

We've built a test database based on PubChem's open data [19] and Crystallography Open Database (COD) [20], resulting a test database containing 3, 585, 725 small molecules and 295, 223 crystal structures.

The DCAIKU search engine runs on a single server with two Intel Xeon E5-2650 v3 CPUs, 128 GiB of DRAM and 2 TB of NVMe SSD. We use 6 representative keyword queries (Table Ⅰ) and 6 representative structural queries (Table Ⅱ) to measure the accuracy and performance.

Table Ⅰ Keyword search results of six test queries.
Table Ⅱ Structural search results of six test queries.

Because network latency and response downloading time may be nonnegligible, we precalibrate the network round-trip latency using ping, and measure the time from sending the last byte of the query to receiving the first byte of the response. The reported latency is the measured query to response time subtracted by the network latency. The latency is measured when the system is almost idle. For throughput benchmark, we stress the site with a continuous stream of identical queries, keeping 50 queries in parallel. We found that none of the search engines cache query results. The throughput is given by the average number of responses received per second.

A. Keyword search

The results of keyword search are shown in Table Ⅰ. It reveals that though the input query is relatively complex, our system built on ElasticSearch can handle them well, providing our desired results, while maintaining the low latency. The results of searching for "Silicon without element Si" (test 2) results with Germanium, because that the data entry of Germanium has a reference to silicon. The contradiction test 4 returns an expected null-result.

Tests 5 and 6 in our crystallography database shows the potential of our system for searching molecules or materials satisfying specific performance requirements.

A keyword search returning hundreds of matches typically completes within 0.3 s, regardless of the number of search conditions.

B. Structural search

Structural search results are shown in Table Ⅱ. From this table, we can see that our descriptor system works well for most queries: in most cases, they return the same molecule in the first place of the candidate list. The first non-identical similar result reveals that our ap/tt descriptor system can verify the similarity between the query and other molecules in our database.

However, in test 6, our system does not return the exact query in the first place. This shows one of the disadvantages of the ap/tt system: for topologically similar molecules, but with different lengths of carbon chains, our ap/tt descriptors may be too similar to distinguish relevant candidates effectively.

The result of test 3 demonstrates the ability of searching molecules that are similar to an unrecorded molecule in our system. Our system returns a 7-membered ring instead of the six-membered in our query, while preserving other properties, especially leaving heterocycles as it is.

Structural search has higher latency than keyword search because a large portion of molecules match at least one ap/tt descriptor with the query. Our system can retrieve the top 25 results sorted by relevance in 3 s for all test queries. Tests 4$-$6 takes longer time than tests 1$-$3 because these structure queries contain more ap/tt descriptors, therefore need to merge more indexes.

In addition, we measure the latency of DCAIKU under stress test of 20 concurrent queries. Compared to idle latency, the latency under high load varies from 1.1$\times$ to 1.6$\times$, showing that DCAIKU can support tens of concurrent queries without much performance degradation. This is because ElasticSearch utilizes at most 4 cores per query and our two CPUs on the server has 40 cores, supporting 10 parallel computations. The concurrency is further improved to 12$-$18 by overlapped computation and disk I/O access.

Ⅷ. PERFORMANCE COMPARISON WITH OTHER SYSTEMS

For comparison, we benchmark several public chemical search engines, namely eMolecules [21], PubChem [19] and ChemSpider [22], using the same methodology as testing DCAIKU. The latency measurements have less than 10% standard deviation, so we only report the average latency of 10 repeated tests. For public services not supporting advanced filters, we convert each filter into a simple keyword search.

Our system has only one Linux server and contains 129 GiB data and 47 GiB index of 3.9 M records. The corpus size of eMolecules, PubChem and ChemSpider are 1.5, 91, and 58 M respectively; these services are also using multiple servers to provide their service, with load balancing and clustering, rather than single server. Therefore, although our corpus size is smaller than PubChem and ChemSpider, DCAIKU is able to seamlessly scale to a cluster of servers by splitting the corpus to multiple disjoint portions to be served by different servers. Merging the top N results (e.g. the first page of results) from each server is a cheap operation, thus the overall query latency and throughput would not differ much from a single node.

As shown in FIG. 5 (a) and (b), the query latency of DCAIKU is significantly lower than existing public services, especially in structural searching. Some services, like ChemSpider, takes about one minute to return a structural search result, while all queries to DCAIKU return within 3 s.

FIG. 5 Performance benchmark results. "s0.5" means similarity search threshold is 0.5. Higher similarity threshold implies lower com-putation complexity. For DCAIKU, the similarity threshold is always 0. "limit100" stands for limiting to first 100 results, otherwise ChemSpider would not respond in 5 min. (a) Keyword search latency (seconds), (b) structural search latency (seconds), (c) keyword search throughput (queries per second), (d) structural search throughput (queries per second).

As shown in FIG. 5 (c) and (d), the throughput of DCAIKU also outperforms other public services. For structural search, the throughput is only bench-marked for DCAIKU and eMolecules due to concurrency restrictions of other two services. The throughput of DCAIKU is bounded by the CPU performance of fetching and merging indexes in relevance order. The structural search throughput is an order of magnitude lower than keyword search, because a structural query may have tens of ap/tt descriptors while a keyword search typically only contains a few keywords or filters.

Ⅸ. CONCLUSION

This paper presents DCAIKU, a high-performance chemical structure & data search engine, built on CouchDB and ElasticSearch. DCAIKU converts the chemical structure similarity search problem into a general text search problem to utilize off-the-shelf full-text search engines. DCAIKU also supports flexible document structures and heterogeneous datasets with the help of schema-less document database, so that users can freely combine their demands about molecules, materials and other chemistry topics via a unified query interface.

Our evaluations show that DCAIKU can handle both keyword search and structural search against millions of records with both high accuracy and low latency. For queries that have no perfect match in the database, DCAIKU can return similar results as well. Our work makes a crystal-clear case showing the feasibility of large-scale molecule structure search through ap/tt descriptor matching.

The DCAIKU search engine will enable large-scale and cost-effective structural search in materials science and chemistry research. We plan to further improve the system in the following aspects: (ⅰ) Support more material types, especially in biochemistry, like proteins, drugs, etc. (ⅱ) Improve the accuracy of structural search by adding more descriptors to distinguish different lengths of carbon chains in topologically similar structures. (ⅲ) Support donor-acceptor search, which is very common in catalyst design and drug design.

Ⅹ. ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China, the Ministry of Science and Technology of China, and the Swedish Research Council.

Reference
[1] P. Willett, J. M. Barnard, and G. M. Downs, J. Chem. Inf. Comput. Sci. 38 , 983 (1998). DOI:10.1021/ci9800211
[2] G. M. Downs and P. Willett, Reviews in Com-putational Chemistry, K. B. Lipkowitz and D. B. Boyd Eds., Hoboken, New Jersey: Wiley-VCH, Inc., (2007).
[3] Y. Ke and R. Sukthankar, Proceedings of the 2004 IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition, Washington, DC, USA: IEEE, (2004).
[4] R. E. Carhart, D. H. Smith, and R. Venkata-raghavan, J. Chem. Inf. Comput. Sci. 25 , 64 (1985). DOI:10.1021/ci00046a002
[5] R. Nilakantan, N. Bauman, J. S. Dixon, and R. Venkataraghavan, J. Chem. Inf. Comput. Sci. 27 , 82 (1987). DOI:10.1021/ci00054a008
[6] V. Gutmann, The Donor-Acceptor Approach to Molecular Interactions. New York: Plenum Press (1978).
[7] K. Chodorow, MongoDB: the Definitive Guide, 2nd Ed, Sebastopol, US: O'Reilly Media, Inc., (2013).
[8] J. C. Anderson, J. Lehnardt, and N. Slater, CouchDB: the Definitive Guide, Cambridge: O'Reilly Media, Inc., (2010).
[9] M. McCandless, E. Hatcher, and O. Gospodnetić, Lucene in Action: Covers Apache Lucene 3. 0, 2nd Ed, Greenwich: Manning Publications Co., (2010).
[10] C. Gormley and Z. Tong, Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine, Sebastopol: O'Reilly Media, Inc., 328(2015).
[11] M. C. Burger, J. Cheminform. 7 , 35 (2015). DOI:10.1186/s13321-015-0085-3
[12] C. Marrin, WebGL Specification, Khronos WebGL Working Group, (2011).
[13] R. A. Sayle, and E. J. Milner-White, Trends Biochem. Sci. 20 , 374 (1995). DOI:10.1016/S0968-0004(00)89080-5
[14] W. Humphrey, A. Dalke, and K. Schulten, J. Mol. Graph. 14 , 33 (1996). DOI:10.1016/0263-7855(96)00018-5
[15] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, Auckland, Tokyo: McGraw-Hill, Inc., (1983).
[16] R. W. Floyd, Commun. ACM 5 , 345 (1962).
[17] M. Thorup, J. ACM 46 , 362 (1999). DOI:10.1145/316542.316548
[18] S. Josefsson, The Base16, Base32, and Base64 Data Encodings, US: Network Working Group, (2006).
[19] E. E. Bolton, Y. L. Wang, P. A. Thiessen, and S. H. Bryant, Annu. Rep. Comput. Chem. 4 , 217 (2008). DOI:10.1016/S1574-1400(08)00012-1
[20] S. Gražulis, A. Daškevič, A. Merkys, D. Chateigner, L. Lutterotti, M. Quir, N. R. Serebryanaya, P. Moeck, R. T. Downs, and A. Le Bail, Nucleic Acids Res. 40 , D420 (2011).
[21] https://www.emolecules.com/.
[22] http://www.chemspider.com/.
[23] http://jp-minerals.org/vesta/en/.
基于CouchDB和Elastic Search的高性能化学结构搜索引擎与数据库的构建
李任之, 李博杰, 张国桢, 江俊, 罗毅     
中国科学技术大学化学与材料科学科学学院, 合肥微尺度物质科学国家研究中心, 合肥 230026
摘要: 计算机辅助的化学结构搜索在化学信息学中地位十分重要,本文设计了一套高性能的化学结构和化学数据搜索系统,称为DCAIKU.DCAIKU基于CouchDB无模式数据库和ElasticSearch基础架构构建,通过将结构相似性搜索变换为文字搜索实现了高性能和高灵活性的检索引擎:在满足化学信息存储的高灵活性条件下,仍然可以做到低延迟和高准确性,同时拥有良好的伸缩性,可以大规模并行化和集群化.
关键词: 搜索引擎     化学信息学     结构检索     无模式数据库