BingoBo: A New Web 3.0 Platform is Born

How to organize super large volumes of data

As we discussed in the previous blog post, depending on the nature of the data format, we can choose either a relational database or a NoSQL database. For search engines, however, even NoSQL solutions are not sufficient to handle the enormous amount of data being collected and indexed across the internet every day.

So we have two choices for building our search engine in order to provide users with high-quality links rather than merely a high quantity of links.

1. Stick to the traditional search engine architecture

   Looking at the architecture in more technical detail, we found that Google's architecture provides reliability at the software level by replicating services across many different machines and automatically detecting and handling failures. The design is tailored for the best aggregate request throughput rather than peak server response time. This replication of services and data significantly increases the computational workload and network traffic.

   Furthermore, at the software level, all the web pages collected across the internet are first stored on the document servers; an inverted index is then built and saved on the index servers (a simplified sketch of this indexing step follows this list), and the data is replicated among distributed locations so it can be served locally to the requesting users. To ensure fast processing, they use MapReduce and BigTable to deal with petabyte-scale data. But stragglers are a serious problem: a straggler is a computation that runs slower than the others and holds everyone up. The solution is to run multiple copies of the same computation and, once one finishes, kill all the rest. Along with the data compression between the map servers and reduce servers, this further increases the processing workload.

   Therefore, the energy consumption and cooling issues cannot be resolved by using this type of architecture.

2. Switch to different architectures and solutions

   As we can see, the large number of server farms means a high cost to build the data centers as well as high energy consumption. The critical question is: do we have to handle data this way, by building an inverted index of every word from every web page?

   For document systems in large corporations, content management and document indexing is a different story, where Big Data solutions may apply and produce good results. For search engine algorithms, however, it's necessary to find a more efficient way to index and organize the super large volumes of information from all websites.
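
To make the indexing step mentioned in point 1 concrete, here is a minimal, hypothetical sketch of building an inverted index in a MapReduce style: the map phase emits (word, doc_id) pairs, and the reduce phase groups them into posting lists. The toy documents and function names are made up for illustration; this is a sketch of the general technique, not Google's actual implementation.

```python
from collections import defaultdict

# Hypothetical toy corpus: doc_id -> page text (a stand-in for the document servers).
documents = {
    "doc1": "web search engines index every word",
    "doc2": "semantic web organizes data by meaning",
    "doc3": "every search engine builds an inverted index",
}

def map_phase(doc_id, text):
    """Map step: emit a (word, doc_id) pair for each word on the page."""
    for word in text.lower().split():
        yield word, doc_id

def reduce_phase(pairs):
    """Reduce step: group doc_ids by word into posting lists (the inverted index)."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}

# Run the two phases over the toy corpus.
all_pairs = (pair for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text))
inverted_index = reduce_phase(all_pairs)

print(inverted_index["search"])    # ['doc1', 'doc3']
print(inverted_index["semantic"])  # ['doc2']
```

Even in this toy form it is easy to see where the cost comes from: every word of every page becomes an index entry, and in a real deployment these posting lists are replicated across data centers, which is exactly the workload and energy issue described above.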

That's why the Semantic Web came into the picture: to make the data well-organized and more meaningful. This should be the direction to follow. The problem is what kind of solution is feasible to implement the feature requirements of Web 3.0.
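
As a rough illustration of what "well-organized and more meaningful" can look like, the sketch below stores a few facts as subject-predicate-object triples, the data model behind the Semantic Web's RDF, and answers a question by matching structure rather than by scanning a keyword index. The facts, predicate names, and query helper are made up for this example and are not part of any particular standard vocabulary.

```python
# Hypothetical example: facts stored as subject-predicate-object triples
# (the Semantic Web's RDF model) instead of a bag of keywords.
triples = [
    ("BingoBo",      "is_a",      "search_engine"),
    ("BingoBo",      "targets",   "Web_3.0"),
    ("Web_3.0",      "builds_on", "Semantic_Web"),
    ("Semantic_Web", "organizes", "data_by_meaning"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the given pattern (None acts as a wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# "What does BingoBo target?" -- answered by structure, not keyword frequency.
print(query(subject="BingoBo", predicate="targets"))
# [('BingoBo', 'targets', 'Web_3.0')]
```

The point of the contrast is that structured, meaningful data can be queried directly, instead of inverting every word of every page and ranking the results afterwards.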

Among the various available solutions, the distributed search engine plays an interesting role. This sheds light on finding feasible solutions for the next generation of search engines.