Google is a simple-looking but massive search engine that always stands by for us to search anything on the web. This multi-billion-dollar company is also one of the behind-the-scenes players that power the modern internet. But can we imagine how this search machine manages its operations?
Google depends on a distributed computing system to provide users with the infrastructure they require to access, create and alter data. Distributed computing networks many computers together and pools their resources to execute a task. Each computer contributes some of its resources, such as memory, processing power and hard drive space, to the network. As a result, the whole network acts as one massive virtual computer, with each individual machine serving as a processor and data storage device.
Search engine giant Google takes advantage of distributed computing, having developed a relatively cost-effective arrangement built mainly from inexpensive machines running the Linux operating system. But how can this technology heavyweight depend on cheap hardware? The answer is the Google File System (GFS), which integrates the capacity of off-the-shelf servers while compensating for any hardware weaknesses. Google uses GFS to handle huge files and to give application developers the research and development resources they require.
Google developers frequently come across large files that can be difficult to manipulate using a traditional computer file system. Another crucial consideration is scalability, which in practice refers to the ease of adding capacity to the system. A system is scalable if it can accommodate changes, such as growth in capacity, without major rework. Scalability is mandatory for Google, which maintains a vast network of computers to manage all its files.
Because the network operates on such a wide scale, monitoring and maintaining it is a critical process. With GFS, the developers decided to automate as much as possible of the administrative work required to keep the system running. This is a core principle of autonomic computing, a concept in which computers are able to detect and fix problems in real time without any human assistance. The challenge for the GFS team was not only to develop an automatic monitoring system, but also to design it so that it could function across a massive network of computers.
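To make the idea concrete, here is a minimal sketch in Python of the kind of automated health check such monitoring relies on. The names and the timeout value are illustrative assumptions rather than Google's actual code; the real system exchanges periodic heartbeat messages between the master and its chunk servers.

    import time

    HEARTBEAT_TIMEOUT = 60  # seconds; an illustrative value, not Google's setting

    # Maps each chunk server's address to the time it last reported in.
    last_heartbeat = {
        "chunkserver-01:7000": time.time(),
        "chunkserver-02:7000": time.time() - 120,  # this one has gone quiet
    }

    def find_unresponsive(last_heartbeat, now=None):
        """Return the chunk servers that have missed their heartbeat window."""
        now = now if now is not None else time.time()
        return [addr for addr, seen in last_heartbeat.items()
                if now - seen > HEARTBEAT_TIMEOUT]

    # A monitoring loop would call this periodically and, with no human involved,
    # trigger repair of the data stored on any server it reports.
    print(find_unresponsive(last_heartbeat))  # ['chunkserver-02:7000']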
GFS handles large files, typically in the multi-gigabyte (GB) range. Retrieving and manipulating files of that magnitude would take up a lot of the network's bandwidth, the capacity of the system to move data from one location to another. GFS addresses this problem by breaking files up into chunks of 64 megabytes (MB) each. Every chunk is assigned a unique 64-bit identification number called a chunk handle. By requiring all file chunks to be the same size, GFS simplifies resource allocation: it's easy to see which computers in the system are near capacity and which are underused, and it's easy to move chunks from one resource to another to balance the workload across the system.
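The arithmetic behind fixed-size chunking is simple, as the rough sketch below shows. The 64 MB chunk size and the 64-bit handle come from the GFS design; the function names and the random handle generation are assumptions made purely for illustration.

    import secrets

    CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB GFS chunk size

    def chunk_index(byte_offset):
        """Which chunk of a file holds this byte offset?"""
        return byte_offset // CHUNK_SIZE

    def chunks_needed(file_size):
        """How many 64 MB chunks does a file of this size occupy?"""
        return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

    def new_chunk_handle():
        """A unique 64-bit identifier for a chunk (random here for illustration)."""
        return secrets.randbits(64)

    print(chunks_needed(10 * 1024**3))  # a 10 GB file occupies 160 chunks
    print(chunk_index(200_000_000))     # byte 200,000,000 falls in chunk 2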
Google runs GFS on clusters of computers. A cluster is simply a group of networked machines, and each cluster might contain hundreds or even thousands of them. Within a GFS cluster there are three kinds of entities: clients, master servers and chunk servers. A client is any entity that places a file request; it can be another computer or a computer application, and it is in effect the customer of the GFS. Client requests range from retrieving and manipulating existing files to creating new files on the system.
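To fix the client's role in mind, here is a minimal sketch of what a client request might carry. The field names and the set of operations are assumptions for illustration, not the actual GFS protocol.

    from dataclasses import dataclass

    # Operations a client might ask the cluster to perform (an assumed set).
    OPERATIONS = {"read", "write", "append", "create"}

    @dataclass
    class ClientRequest:
        operation: str    # one of OPERATIONS
        file_name: str    # which file the client wants
        byte_offset: int  # where in the file the operation starts

    # A client -- another computer or an application -- issues requests like this.
    request = ClientRequest(operation="read",
                            file_name="/logs/crawl.dat",
                            byte_offset=150_000_000)
    print(request)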
The master server acts as the coordinator for the cluster. Its duties include maintaining an operation log, which keeps track of the activities of the master's cluster. The operation log helps keep service interruptions to a minimum: if the master server crashes, a replacement server that has monitored the operation log can take its place. The master server also keeps track of metadata, the information that describes the chunks. The metadata tells the master server which file each chunk belongs to and where it fits within the overall file.
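A minimal sketch of the two bookkeeping structures described above, using hypothetical names and layouts: a metadata map from file names to their ordered chunk handles, and an append-only operation log that a replacement master could replay.

    # The master's in-memory metadata (an assumed layout): each file name maps to
    # the ordered list of 64-bit chunk handles that make up that file.
    file_to_chunks = {
        "/logs/crawl.dat": [0x1A2B3C4D5E6F7081, 0x2B3C4D5E6F708192],
    }

    # Append-only operation log; replaying it rebuilds the metadata above, which
    # is how a replacement master that has monitored the log can take over.
    operation_log = []

    def create_file(name):
        file_to_chunks[name] = []
        operation_log.append(("create", name))

    def add_chunk(name, handle):
        file_to_chunks[name].append(handle)
        operation_log.append(("add_chunk", name, handle))

    create_file("/logs/index.dat")
    add_chunk("/logs/index.dat", 0x3C4D5E6F70819203)
    print(operation_log)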
Each cluster has just one master server, although every cluster maintains copies of the master server in case of a hardware failure. It might seem that this arrangement would lead to traffic congestion, with a single master server ruling a cluster of thousands of computers. The GFS gets around this sticky situation by keeping the messages the master server sends and receives very small. The master server doesn't actually handle file data at all; it leaves that up to the chunk servers.
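The sketch below illustrates why the single master doesn't become a bottleneck: the client asks it only for a tiny piece of metadata (which chunk it needs and where that chunk's copies live), then fetches the heavy 64 MB payload directly from a chunk server. All names and data here are hypothetical.

    CHUNK_SIZE = 64 * 1024 * 1024

    def master_lookup(file_name, byte_offset, file_to_chunks, chunk_locations):
        """The master's reply: a chunk handle plus the servers holding it --
        a few dozen bytes, no matter how large the chunk itself is."""
        handle = file_to_chunks[file_name][byte_offset // CHUNK_SIZE]
        return handle, chunk_locations[handle]

    def read_chunk_from(server, handle):
        """The client then contacts a chunk server directly for the actual data;
        a placeholder string stands in for the network transfer here."""
        return f"<64 MB of chunk {handle:#x} streamed from {server}>"

    # Hypothetical state the master might hold for one file:
    file_to_chunks = {"/logs/crawl.dat": [0x1A2B3C4D5E6F7081]}
    chunk_locations = {0x1A2B3C4D5E6F7081: ["cs-07:7000", "cs-12:7000", "cs-31:7000"]}

    handle, servers = master_lookup("/logs/crawl.dat", 10_000_000,
                                    file_to_chunks, chunk_locations)
    print(read_chunk_from(servers[0], handle))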
Chunk servers are the workhorses of the GFS. They’re responsible for storing the 64 MB file chunks. The chunk servers do not send chunks to the master server. Instead, they send requested chunks directly to the client. The GFS copies every chunk multiple times and stores it on different chunk servers. Each copy is called a replica. By default, the GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if necessary.
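A last sketch shows the replication rule: each new chunk is copied onto several distinct chunk servers, three by default. Choosing the least-loaded servers is merely one plausible placement policy assumed here for illustration; it is not presented as Google's actual method.

    REPLICATION_FACTOR = 3  # the GFS default; users can raise or lower it

    def place_replicas(chunk_handle, server_load, factor=REPLICATION_FACTOR):
        """Pick `factor` distinct chunk servers for a new chunk's replicas, here
        simply the least-loaded ones (an assumed policy)."""
        ranked = sorted(server_load, key=server_load.get)
        return {chunk_handle: ranked[:factor]}

    # server_load maps each chunk server to how many chunks it already stores.
    server_load = {"cs-01": 900, "cs-02": 350, "cs-03": 610, "cs-04": 120}
    print(place_replicas(0x2B3C4D5E6F708192, server_load))
    # The three least-loaded servers (cs-04, cs-02, cs-03) receive the replicas.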
Google discloses little about the hardware platform on which it runs the GFS. But in an official GFS report, Google revealed the specifications of the equipment it used to run some benchmarking tests on GFS performance. While the test equipment might not be a true representation of the current GFS hardware, it gives you an idea of the sort of computers Google uses to handle the massive amounts of data it stores and manipulates.
The test equipment included one master server, two master replicas, 16 clients and 16 chunk servers. All of them used the same hardware with the same specifications, and they all ran on Linux operating systems. Each had dual 1.4 gigahertz Pentium III processors, 2 GB of memory and two 80 GB hard drives. In comparison, several vendors currently offer consumer PCs that are more than twice as powerful as the servers Google used in its tests. Google developers proved that the GFS could work efficiently using modest equipment.
The network connecting the machines consisted of 100 megabit-per-second (Mbps) full-duplex Ethernet connections and two Hewlett-Packard 2524 network switches. The GFS developers connected the 16 client machines to one switch and the other 19 machines to the second switch, then linked the two switches together with a 1 gigabit-per-second (Gbps) connection.