What is data?
Before talking about storage, let us stop for a moment on data itself. Data is the representation of a piece of information. There are two kinds of data:
- Analog data is represented as a wave that can take infinitely many values: it is a continuous signal. Physical media that record analog data include, for instance, cassette tapes and vinyl records.
- Digital data, by contrast, is coded with only two values: 0 or 1. This is the technology used on CDs, for instance: the laser detects the microscopic pits (“holes”) pressed into the disc and the flat areas between them, and the resulting pattern is decoded into 0s and 1s. Digital data is what IT uses most, and this article focuses on how it is stored.
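To make the difference between the two kinds of data concrete, here is a minimal Python sketch that turns a continuous signal into 0s and 1s by sampling it at regular intervals and rounding each sample to a fixed number of levels. The function names, the sine wave, the 8 samples and the 3 bits per sample are all assumptions chosen for illustration; real audio formats use far more of both.

```python
import math


def sample_and_quantize(signal, n_samples, bits):
    """Sample a signal on [0, 1) and code each sample on `bits` bits."""
    levels = 2 ** bits
    codes = []
    for i in range(n_samples):
        t = i / n_samples
        value = signal(t)  # "analog" value, expected in [-1.0, 1.0]
        # Map [-1, 1] onto the integer levels 0 .. levels - 1.
        level = min(levels - 1, int((value + 1.0) / 2.0 * levels))
        codes.append(format(level, f"0{bits}b"))
    return codes


def sine(t):
    """One period of a sine wave, standing in for an analog signal."""
    return math.sin(2 * math.pi * t)


codes = sample_and_quantize(sine, n_samples=8, bits=3)
print(codes)
```

Each string in the result is one sample. The more samples and bits we use, the closer the digital copy gets to the original wave, without ever holding its infinity of values.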
First of all, note that we set aside “read-only” storage (for instance non-rewritable CDs and DVDs) to focus on “read-write” storage.
There are three major modern technologies for storing rewritable digital data: magnetic tape, hard disks and RAM.
Magnetic tape can seem out of date, but it is still used. Its advantage is that the data remains intact as long as the medium is well preserved. However, access to this data is particularly slow (fetching the tape, processing the data…). That is why magnetic tapes are now used only for archiving. They hold the data that companies rarely or never need, and for which a long recovery time is not a problem.
Two very different technologies can be used in hard disks:
- Mechanical hard disk: the way these disks work can be compared to a record player. The data is recorded magnetically on a circular platter (each tiny region is magnetized one way or the other, for 0 or 1), and a head moves to read or change the data it is looking for. This involves physical movement, which limits the reading speed. The drawback of these disks, besides their limited performance, is the risk of breakage, as with every physical mechanism. For instance, if a machine is shut down for the first time in a year, the mechanism, no longer used to cooling down, may well seize up. Their benefit, however, is that they offer the greatest storage capacities on the market: there are 10 TB mechanical hard disks.
- SSD (solid-state drive): these drives are based on flash memory, a descendant of EEPROM (electrically erasable programmable read-only memory), which uses electricity rather than moving parts to reach or modify data. This technology is thus much quicker than the previous one; the only operations that slow the drives down are the ones that keep the data permanent even when there is no electricity. Their storage capacity is, however, smaller than that of mechanical drives, even if the gap is narrowing (in 2015, SSDs of 6 and even 8 TB were introduced to the market!).
RAM is the live, volatile memory of a machine. It also uses electricity to access data; however, RAM is even faster than SSDs, since it is not slowed down by any operation that makes the data permanent. That is why programs use it only to hold temporary data: once the machine, and thus the current, is shut down, nothing maintains the data, and it is lost.
Thus, hard disks are the most fitting technology for storing data. They offer a large choice of solutions, adapted to all situations, as we will see.
In every machine or computer (with some exceptions), there is an internal hard disk. We also know about external disks, a technology that is accessible to private individuals. However, the technology changes depending on what the disk is used for and where it is located!
Internal hard disks
An internal hard disk has to be linked to the machine’s motherboard. For this, a bus links it to a controller card (in this case, a drive controller), whose role is to be a relay, a translator, between the motherboard and the other components of the machine. Most buses use one of the following two formats:
- SATA (Serial Advanced Technology Attachment) is the current standard format for hard disks, both mechanical and SSD. It is cheap and offers good storage capacities, but does not allow cables longer than 1 meter.
- SAS (Serial Attached SCSI) is mostly reserved for professional use. It is particularly efficient and suited to production, since it limits the risk of data loss. This is thanks to its use of SCSI commands, which provide good error recovery and reporting. It also works with dual ports, which limits the need for additional equipment; these ports can also read and write data at the same time (full duplex), which is not possible with SATA. With this format, it is also possible to use cables up to 10 meters long.
These formats evolve slightly over time, although the changes are minor. They transfer 3 to 6 Gb/s of data, which is faster than most mechanical disks can deliver: with such disks, the disk itself is the performance ceiling. That is not the case for SSDs, which were quickly limited by the bus speed. This triggered the creation of a new standard for SSDs called NVMe: the drives are plugged directly into the motherboard and communicate through the only protocol it understands, PCI-E. This removes the intermediary (the disk controller) and thus gives better performance.
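The bottleneck reasoning above can be sketched as a small calculation. We assume here that the bus figures are line rates in gigabits per second and that the link uses 8b/10b encoding (8 data bits carried by 10 bits on the wire), as SATA does. The two device speeds are hypothetical values chosen only to illustrate the point, not measurements.

```python
def usable_mb_per_s(line_rate_gbps, data_bits=8, line_bits=10):
    """Usable MB/s of a serial link, given its line rate in Gb/s.

    With 8b/10b encoding, every 8 data bits travel as 10 bits on the wire.
    """
    usable_megabits = line_rate_gbps * 1000 * data_bits / line_bits
    return usable_megabits / 8  # 8 bits per byte


def effective_throughput(device_mb_per_s, bus_mb_per_s):
    """The slower of the device and the bus sets the pace."""
    return min(device_mb_per_s, bus_mb_per_s)


bus = usable_mb_per_s(6)  # a 6 Gb/s link

# Hypothetical device speeds in MB/s, for illustration only.
mechanical_disk = 200
ssd_flash = 2000

print(effective_throughput(mechanical_disk, bus))  # the disk is the limit
print(effective_throughput(ssd_flash, bus))        # the bus is the limit
```

With these assumed numbers, the mechanical disk never fills the bus, while the SSD is capped by it, which is the situation a direct PCI-E connection is meant to avoid.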
External hard disks
There are two possible ways to link an external disk to a machine: a physical link, or a link through a network. Disks physically linked to a machine are the ones private individuals use, for instance. In fact, the protocols used internally by the machine do not change (SAS, SATA, PCI-E)! The fact that the disk is external only adds an intermediary: the external controller into which the device is plugged. There are two categories of external disks:
- USB disk: these are the best known! Their performance is average, because the USB controller into which the disk is plugged has to convert the data between USB and SATA or SAS. USB3, however, is beginning to provide good performance (400 MB/s today, 800 MB/s to come with USB3.1).
- eSATA (External SATA) disk: this disk, as its name implies, sends data directly in SATA, which means that no data conversion is needed. It therefore performs better than USB disks.
Things are quite different, however, for disks that are linked to a machine through a network. That is what NBS System uses for its clients.
Hard disks using the network
The main reason why NBS System uses external hard disks linked to its servers via the network is to save space. To achieve maximum density, the best solution is to pool our storage equipment in dedicated spaces. We thus do not have one disk per client, placed next to its server, but rather a disk farm, gathered in one place. This lets us make better use of the space and limit the number of empty slots. This configuration also enables us to organize our storage space using RAID. RAID consists in combining several hard disks, either to get better performance (by distributing the data of a project over two disks, so that processing it takes half the time) or to limit the risk of loss (data redundancy on several disks).
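The two uses of RAID described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the "disks" are plain byte strings in memory, the function names are ours, and the chunk size is arbitrary. Striping corresponds to what is usually called RAID 0 and mirroring to RAID 1; real RAID implementations work on whole devices and handle many more cases.

```python
def stripe(data, n_disks, chunk_size):
    """Performance: deal fixed-size chunks out to the disks in turn."""
    disks = [bytearray() for _ in range(n_disks)]
    for k, offset in enumerate(range(0, len(data), chunk_size)):
        disks[k % n_disks] += data[offset:offset + chunk_size]
    return [bytes(d) for d in disks]


def unstripe(disks, chunk_size, total_length):
    """Rebuild the original data by reading the chunks back in order."""
    out = bytearray()
    k = 0
    while len(out) < total_length:
        disk = disks[k % len(disks)]
        start = (k // len(disks)) * chunk_size
        chunk = disk[start:start + chunk_size]
        if not chunk:
            raise ValueError("a chunk is missing: striping has no redundancy")
        out += chunk
        k += 1
    return bytes(out)


def mirror(data, n_disks):
    """Safety: every disk holds a full copy of the data."""
    return [bytes(data) for _ in range(n_disks)]


def read_mirror(disks):
    """Return the data from the first disk that is still available."""
    for disk in disks:
        if disk is not None:
            return disk
    raise IOError("all mirrored disks are lost")


project = b"ABCDEFGHIJ"
striped = stripe(project, n_disks=2, chunk_size=2)
mirrored = mirror(project, n_disks=2)
mirrored[0] = None  # simulate the failure of one disk
print(striped, unstripe(striped, 2, len(project)), read_mirror(mirrored))
```

With striping, each disk only holds half of the data, so two disks can work in parallel; with mirroring, losing one disk loses nothing.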
However, to serve clients, these disks have to be linked to the matching servers! For this, rather than having a large number of cables, we use the network. Two protocols can be used:
- Fibre Channel: a specialized protocol that provides a high-speed connection between a computer and its storage space. It offers good performance (up to 16 Gb/s), but its main interest is that it limits risks: it offers a unique guarantee on latency and integrity. It is very expensive, however, and requires a dedicated infrastructure (suitable switches and cards): it is thus only used by some professionals, after a meticulous study!
- Ethernet: a better-known name… This protocol, unlike Fibre Channel, is a simple packet-based network protocol. It is accessible to all, for the following reasons: it is cheap, using it only requires average technical skills, and it offers good performance (1 Gb/s for domestic use, up to 100 Gb/s for professionals).
NBS System uses the Ethernet protocol to link its storage space and its servers. It allows us to benefit from a simple network architecture, with only one kind of equipment. Here again, pooling is the way to go!
The network is thus used to transport data from the storage space to the servers. With pooling, there is no need for one disk per client: there are several disks, each containing the data of several clients. This data nevertheless has to be separated according to who it belongs to! That is why the disks contain several volumes, each grouping a set of data. Within these volumes, information is organized according to a File System: this term designates “a way to store information and to organize it into files”. There are several File Systems: NTFS, FAT32 (Microsoft), ext3 & ext4 (Linux), ZFS (Solaris / FreeBSD), and many others.
This is how it works everywhere. However, to build a storage space, one can choose between two methods: SAN or NAS. Their difference lies in the way the volumes are exposed.
SAN, or Storage Area Network, is a method that gives the machine the illusion of local storage: the exposed volumes are seen by the machine as physically plugged-in hard disks. This is because with SAN, the data is not directly exposed: the whole volume is presented to the machine, as a block.
Usually, the iSCSI (Internet Small Computer System Interface) protocol is used to forward data with SAN. It derives from the SCSI protocol (just like SAS) and allows SCSI to be carried over the network (specifically over TCP/IP). It assembles the data into packets and attaches SCSI commands to them. When the data arrives, the packets are disassembled, and the SCSI commands give the illusion that the disk is physically plugged in.
SAN’s great advantage is that it leaves a certain freedom to administrators. Since the volume, rather than the data, is transmitted directly, the receiving machine does not know what it contains: it has to find out which File System is used to store the data. Consequently, it can change this File System: the admin controls how the data is organized. Another positive point: since the machine sees the volumes as physical disks, they can be organized using RAID.
The drawbacks of the SAN technique, however, also come from the fact that machines see the volumes as physical disks. On the one hand, an unplanned disconnection between the storage space and the machine (breakdown, network cut-off…) will not be handled well by the latter: just like when one unplugs a USB device without notifying the computer first, there are risks of data loss, hardware damage… On the other hand, it is impossible for several people to work on the same volume at the same time, since it is “sent” to the machine: it is as if it were no longer available on the storage space.
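The idea of wrapping SCSI commands into network packets can be illustrated with a toy Python example. The command built here follows the standard 10-byte layout of the SCSI READ(10) command (operation code 0x28), but the framing around it is a deliberately simplified invention of ours: real iSCSI packets use a much richer header and travel inside a TCP session. The function names are hypothetical as well.

```python
import struct

READ_10 = 0x28  # SCSI READ(10) operation code


def build_read10(lba, n_blocks):
    """Build a READ(10) command: read `n_blocks` blocks starting at `lba`."""
    # opcode, flags, 4-byte block address, group number,
    # 2-byte transfer length, control byte (all big-endian)
    return struct.pack(">BBIBHB", READ_10, 0, lba, 0, n_blocks, 0)


def parse_read10(command):
    """Extract the block address and block count from a READ(10) command."""
    opcode, _flags, lba, _group, n_blocks, _control = struct.unpack(
        ">BBIBHB", command
    )
    if opcode != READ_10:
        raise ValueError("not a READ(10) command")
    return lba, n_blocks


def encapsulate(command, payload=b""):
    """Toy framing (NOT the real iSCSI layout): two lengths, command, data."""
    return struct.pack(">HI", len(command), len(payload)) + command + payload


def decapsulate(packet):
    """Split a toy packet back into its SCSI command and its data."""
    command_len, payload_len = struct.unpack(">HI", packet[:6])
    command = packet[6:6 + command_len]
    payload = packet[6 + command_len:6 + command_len + payload_len]
    return command, payload


packet = encapsulate(build_read10(lba=2048, n_blocks=8))
command, payload = decapsulate(packet)
print(parse_read10(command))
```

The receiving side ends up with exactly the command a locally attached disk would have received, which is what creates the illusion of a physical disk.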
With NAS, or Network Attached Storage, the data is directly exposed: here, the storage equipment handles the File System and processes the data before sending it. NAS generally uses the NFS (Network File System) or SMB protocols to forward the data. They make it possible to share data between systems over TCP/IP.
With this method, the freedom offered by SAN disappears: since the storage equipment handles the File System, the server administrators have no control over it. They cannot change the File System, and have to make do with the one used by the equipment. NBS System, for instance, uses NetApp appliances for its NAS storage: the machines thus necessarily receive data that was stored and organized with WAFL. In the same way, it is impossible to use RAID with NAS, since the servers do not see the volumes as physical equipment.
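The difference between the two methods comes down to what the machine is given: raw blocks with SAN, files with NAS. Here is a minimal sketch of that difference, using toy classes of our own invention (`BlockVolume`, `ToyFileSystem` and `FileServer` are hypothetical names, and the 16-byte block size is arbitrary). The toy file system never frees blocks and does no error handling.

```python
class BlockVolume:
    """A volume seen as raw fixed-size blocks, with no notion of files."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(n_blocks)]

    def read_block(self, index):
        return self.blocks[index]

    def write_block(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("a block must be written whole")
        self.blocks[index] = bytes(data)


class ToyFileSystem:
    """Maps each file name to the blocks that hold its content."""

    def __init__(self, volume):
        self.volume = volume
        self.table = {}     # name -> (block indices, content length)
        self.next_free = 0  # blocks are never reused in this sketch

    def write_file(self, name, content):
        size = self.volume.block_size
        indices = []
        for offset in range(0, len(content), size):
            chunk = content[offset:offset + size].ljust(size, b"\0")
            self.volume.write_block(self.next_free, chunk)
            indices.append(self.next_free)
            self.next_free += 1
        self.table[name] = (indices, len(content))

    def read_file(self, name):
        indices, length = self.table[name]
        raw = b"".join(self.volume.read_block(i) for i in indices)
        return raw[:length]


class FileServer:
    """NAS-style equipment: it owns the file system and only exposes files."""

    def __init__(self, volume):
        self._fs = ToyFileSystem(volume)

    def write_file(self, name, content):
        self._fs.write_file(name, content)

    def read_file(self, name):
        return self._fs.read_file(name)


# SAN style: the machine receives the blocks and runs the file system itself.
san_volume = BlockVolume(n_blocks=8)
machine_fs = ToyFileSystem(san_volume)
machine_fs.write_file("notes.txt", b"hello, block storage")

# NAS style: the machine only ever asks the equipment for files.
nas = FileServer(BlockVolume(n_blocks=8))
nas.write_file("notes.txt", b"hello, file storage")

print(san_volume.read_block(0), nas.read_file("notes.txt"))
```

In the SAN case the machine could replace `ToyFileSystem` with any other file system of its choice; in the NAS case that choice belongs to the storage equipment.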
However, NAS also offers advantages: this technique makes full use of the network’s capabilities, especially its resilience. If the connection is cut off, the machine simply waits for the equipment to be reachable again, then resumes the operation. There is thus no risk of data loss or equipment damage.
With NAS, it is also possible for several people to work on the same volume at the same time. This is a positive point, even if simultaneous writing can be a problem. However, a great number of simultaneous users on the same space can hurt the equipment’s performance: it has to handle the File System and process the data for each user, which is heavy work…
Another problem with NAS: it is impossible to do disk caching. This method consists in using a machine’s spare RAM to cache some data from the disk. Requests involving this data can thus be served more quickly, and performance improves. This also lightens the load on the disk. But this is only possible with a physical hard disk, which a volume served over NAS is not. However, NBS System’s experts found a way to still benefit from this cache: on our NAS, we store images containing the data of a volume, and these are sent to the machine. The machine turns an image into a volume and treats it as an external hard drive. This also enables us to organize our volumes using RAID, even though they use NAS.
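The "wait, then start again" behaviour can be sketched as a simple retry loop. This is only an illustration: real NFS or SMB clients handle this inside the operating system, and the names used here (`StorageUnreachable`, `read_with_retry`, `make_flaky_reader`) as well as the number of attempts and the delay are assumptions of ours.

```python
import time


class StorageUnreachable(Exception):
    """Raised by our toy reader when the storage equipment cannot be reached."""


def read_with_retry(read, max_attempts=5, delay_s=0.01, sleep=time.sleep):
    """Call `read` until it succeeds, waiting between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read()
        except StorageUnreachable:
            if attempt == max_attempts:
                raise  # give up: the equipment never came back
            sleep(delay_s)


def make_flaky_reader(failures, data):
    """Simulate a network cut: fail `failures` times, then return `data`."""
    state = {"calls": 0}

    def read():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise StorageUnreachable("network cut off")
        return data

    return read, state


read, state = make_flaky_reader(failures=2, data=b"client data")
result = read_with_retry(read)
print(result, state["calls"])
```

Nothing is lost during the cut: the request is simply replayed once the equipment answers again.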
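Disk caching itself can be sketched in a few lines: recently read blocks are kept in RAM, and the disk is only touched when a block is not there. The `CachedDisk` class below is a hypothetical illustration. It evicts the least recently used block when the cache is full, which is one common policy and an assumption on our part, since the text does not say how the cache chooses what to drop.

```python
from collections import OrderedDict


class CachedDisk:
    """Keeps the most recently used blocks of a slow disk in memory."""

    def __init__(self, read_block, capacity):
        self._read_block = read_block  # function that really reads the disk
        self._capacity = capacity      # number of blocks the cache may hold
        self._cache = OrderedDict()    # block index -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, index):
        if index in self._cache:
            self._cache.move_to_end(index)  # mark as most recently used
            self.hits += 1
            return self._cache[index]
        self.misses += 1
        data = self._read_block(index)      # slow path: go to the disk
        self._cache[index] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used
        return data


disk_reads = []


def slow_disk_read(index):
    """Stand-in for a real disk: records every access it receives."""
    disk_reads.append(index)
    return f"block-{index}".encode()


disk = CachedDisk(slow_disk_read, capacity=2)
for index in [0, 1, 0, 2, 1]:
    disk.read(index)
print(disk.hits, disk.misses, disk_reads)
```

Every hit is a request served from RAM that the disk never sees, which is where both the speed-up and the lighter disk load come from.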
Storage thus offers a multitude of possibilities, all complementary. There are personal and professional solutions fitting each need and constraint!
Sources: Denis Pompilio & Benoît Depail