BingoBo: A New Web 3.0 Platform is Born

SQL or NoSQL and search engine solutions

We are all familiar with search engines, even if we are not very clear about their definition. According to a video [3] from a Google engineer, the major functions of a search engine involve the following components:

- Crawling
- Indexing
- Ranking
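
To make the three components concrete, here is a minimal sketch of how they fit together. It is an illustration only, not how any production search engine works: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web, the `.example` URLs are made up, and ranking is reduced to counting how often the query terms occur.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("sql databases store structured data", ["b.example", "c.example"]),
    "b.example": ("search engines index free text", ["c.example"]),
    "c.example": ("search engines rank text by relevance to search terms", []),
}

def crawl(seed):
    """Crawling: follow links breadth-first from a seed URL and collect page text."""
    seen, queue, docs = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        docs[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return docs

def build_index(docs):
    """Indexing: build an inverted index, term -> {url: occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in docs.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index

def rank(index, query):
    """Ranking: order pages by the total number of query-term occurrences."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))

docs = crawl("a.example")
index = build_index(docs)
results = rank(index, "search text")
print(results)
```

A real crawler fetches pages over the network and a real ranker weighs many more signals, but the division of labor between the three steps is the same.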

People constantly debate in online forums whether using an SQL database is better than using a NoSQL database. And what type of search engine solution should be adopted?

Here is my opinion.

There are two major types of database solutions: one is a relational database, usually queried with SQL, and the other is a full-text database. They serve different applications. If the data is highly structured and organized, such as ERP* or CRM* data, a relational database is the best fit. If instead the data consists of documents with free text or key-value pairs, as in search engines, a high-efficiency full-text database is a good fit.
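
The contrast between the two kinds of data can be sketched in a few lines. This is only an illustration under assumptions: the `invoices` table, its columns, and the sample documents are invented here, and the word lookup is a plain scan rather than the index a real full-text database would use.

```python
import sqlite3

# Structured data: fixed columns, so SQL can filter and aggregate directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [("Acme", 120.0, "paid"), ("Acme", 80.0, "open"), ("Globex", 50.0, "paid")],
)
paid_totals = conn.execute(
    "SELECT customer, SUM(amount) FROM invoices "
    "WHERE status = 'paid' GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

# Free-text documents: no fixed schema, so lookups go through the words.
# (Hypothetical sample documents; a full-text database would index the terms
# instead of scanning every document as this toy function does.)
documents = {
    1: "Quarterly report on cloud storage costs",
    2: "Notes from the search engine design meeting",
    3: "Cloud search service comparison",
}

def matching(word):
    """Return the ids of documents whose text contains the given word."""
    return sorted(
        doc_id for doc_id, text in documents.items()
        if word.lower() in text.lower().split()
    )

print(paid_totals)
print(matching("cloud"))
```

The SQL query relies on every row having the same columns, while the document lookup only assumes that each entry is a piece of text.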

In recent years, alongside the traditional search engines, new types of search servers such as Apache Solr and Elasticsearch, both based on the Apache Lucene project, have come into play to serve as a "fast open source enterprise search platform". They provide "full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search."

Both traditional search engines and the newer search servers use a full-text database with specific information retrieval techniques. Both apply indexing and ranking algorithms to sort information by relevance, and both handle large amounts of data using advanced algorithms designed for fast processing.
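
As an example of what ranking by relevance can mean, here is a small sketch of TF-IDF scoring, a classic information retrieval measure. It is chosen here purely for illustration; the text above does not say which algorithm any particular engine uses, and the sample documents and the function name `tf_idf_ranking` are assumptions.

```python
import math
from collections import Counter

# Hypothetical sample documents.
DOCS = {
    "d1": "solr is a search platform",
    "d2": "mysql is a relational database",
    "d3": "lucene powers search and search servers",
}

def tf_idf_ranking(docs, query):
    """Score each document by summing tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for t in tokenized.values() if term in t)
            if df == 0:
                continue
            tf = counts[term] / len(tokens)   # term frequency in this document
            idf = math.log(n_docs / df)       # rarer terms weigh more
            score += tf * idf
        scores[doc_id] = score
    # Most relevant document first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranked = tf_idf_ranking(DOCS, "search database")
print([doc_id for doc_id, _ in ranked])
```

The point of the weighting is that a term appearing in few documents ("database" here) says more about relevance than a term that appears in many.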

However, they differ in the scope of document acquisition. Traditional search engines require huge data centers to house their server farms in order to collect, or crawl, the most up-to-date information from all over the internet, while search servers retrieve and index documents from an enterprise's proprietary storage.

On the other hand, large ERP systems such as PeopleSoft, E-Business Suite, and JD Edwards, as well as SAP systems, use Oracle Database or Microsoft SQL Server intensively and extensively to store and process financial data for the majority of Fortune 1000 companies. Oracle serves a huge customer base with a large consulting team working 24/7. Along with the wide adoption of MySQL databases among small businesses and startups, there is no doubt that an SQL-based database solution is a winning strategy for that type of structured data processing.

There is also something in between: Amazon S3* and other cloud-based services use their proprietary algorithms to handle fast data access and processing needs, and also provide powerful data storage and retrieval capabilities.

However, by the time we entered the Big Data era, the amount of data had grown almost beyond the capability of any pure database server. We are pretty much lost in the forest. We noticed that, although most of the above solutions declare that they have a built-in distributed indexing mechanism (which means the data processing workload is shared by numerous server instances or a cluster instead of a centralized server system), they still run under a centralized configuration without communication between the peers.

That's still not enough. That is the reason why we keep debating which is best: none of them can solve the fundamental problem of huge data volume for a better search engine implementation. Usually a centralized solution means a bottleneck, but distributed systems face serious reliability challenges that depend on the robustness of their communication protocols. We may find some truly distributed solutions in later discussions.
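
The arrangement criticized above, where the indexing workload is spread across several servers but everything still passes through one central node, can be sketched roughly as follows. The `Shard` and `Coordinator` classes are hypothetical names for this illustration and do not mirror the API of Solr, Elasticsearch, or any other product.

```python
from collections import defaultdict

class Shard:
    """One index server holding a slice of the documents."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        return self.index.get(term.lower(), set())

class Coordinator:
    """Central node: the only component that knows about every shard."""

    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]

    def add(self, doc_id, text):
        # Route by document id; the shards never talk to each other.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, term):
        # Scatter the query to every shard, then gather and merge the hits.
        hits = set()
        for shard in self.shards:
            hits |= shard.search(term)
        return sorted(hits)

cluster = Coordinator(n_shards=3)
for doc_id, text in enumerate([
    "big data search", "relational data model", "cloud storage",
    "search server", "data center",
]):
    cluster.add(doc_id, text)

print(cluster.search("data"))
```

The work of indexing is shared, yet every write and every query goes through the coordinator, which is exactly where the bottleneck and the single point of failure sit.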

* ERP stands for Enterprise Resource Planning
* CRM stands for Customer Relationship Management
* S3 stands for Simple Storage Service