Voice/Data Comm 101
 
 
 

 

Voice over IP (VoIP) Tutorial

CommWeb.com
07/09/01

In order for you to understand Voice over IP (VoIP), it would be a great idea if you understood the TCP/IP protocol suite. For those of you who didn't read the previous lesson, I humbly recommend that you consider clicking on TCP/IP Essentials. Heck, even if you were here, it wouldn't hurt to review it before we get started.

Now that we're all together again, let's examine what is perhaps the wackiest idea yet for voice networking. In some ways it's not as wacky as Voice over Frame Relay (VoFR), which we covered in a previous lesson, but it's wacky, alright.

Let's Review Some Basics

You'll recall that the TCP/IP protocol suite was conceived and developed as a means of gluing together disparate data networks, independent of host computers, hardware and operating systems, transmission media, and data link technologies. TCP/IP was originally developed for the ARPANET, a network that linked together government agencies and institutions of higher learning with a small group of supercomputers used for various advanced research and development projects.

It was a time-share application that involved interactive data communications between asynchronous terminal devices and host computers. Since connectivity based on either circuit switching or dedicated leased lines was way too inefficient and expensive, TCP/IP took a packet-switched approach.

As we discussed in a previous lesson, Circuits, Packets, Frames and Cells, packet networks are highly shared data networks that always involve some degree of variability and unpredictability in terms of levels of latency (i.e., delay), jitter (i.e., variability in latency) and loss. Some applications can tolerate considerable levels of such problems, since there is time to adjust and recover, perhaps through retransmission.
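
To make latency, jitter and loss a bit more concrete, here is a minimal Python sketch that computes all three from send and arrival times. The function name `packet_stats`, the sample timings, and the particular jitter definition (the mean change in delay between consecutive arrivals) are my own assumptions for illustration; real measurement tools may define jitter differently.

```python
from statistics import mean

def packet_stats(sent, received):
    """sent: {seq: send_time_ms}; received: {seq: arrival_time_ms}."""
    # One-way delay of every packet that actually arrived, in sequence order.
    delays = [received[seq] - sent[seq] for seq in sorted(received)]
    latency = mean(delays)
    # Jitter (as assumed here): mean absolute change in delay between arrivals.
    changes = [abs(b - a) for a, b in zip(delays, delays[1:])]
    jitter = mean(changes) if changes else 0.0
    # Loss: fraction of sent packets that never arrived.
    loss = 1 - len(received) / len(sent)
    return latency, jitter, loss

# Hypothetical timings: one packet sent every 10ms, packet 3 never arrives.
sent = {0: 0, 1: 10, 2: 20, 3: 30, 4: 40}
received = {0: 50, 1: 65, 2: 72, 4: 95}

latency, jitter, loss = packet_stats(sent, received)
print(f"latency {latency:.1f}ms, jitter {jitter:.1f}ms, loss {loss:.0%}")
```

An e-mail transfer would shrug off numbers like these; a realtime voice stream has to cope with them as they happen.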

E-mail is a good example, as is a file transfer, perhaps associated with a database backup. Some applications don't tolerate much, if any, of this sort of thing. Realtime voice and video are good examples. It appears to be quite clear that voice over an IP-based packet network doesn't make sense. So let's call it quits for the day. I'll see you at the next class.

Whoa! Not so fast! VoIP can be made to work, and to work quite well.

Voice The Conventional Way

Before we get into the specifics of VoIP, let's review the basics of voice communications as it is handled in the conventional PSTN (Public Switched Telephone Network). You will recall that voice is analog in its native form and that the PSTN also was entirely analog for the first 75 years or so.

In networks worldwide, the analog PSTN provided for each voice conversation to be carried in a 4-KHz channel. (Note: Hz is an abbreviation for Hertz, a unit of frequency equal to one cycle, or one complete waveform, per second. In a voice application, the signal starts out as an audio compression wave, which then is converted into an electrical wave. All electromagnetic energy travels in waveform.)

In fundamental terms, that means a channel runs between 0 KHz and 4 KHz. In a multichannel analog carrier system, one channel might run at 0-4 KHz, the next at 4-8 KHz, the next at 8-12 KHz, and so on. Therefore, each voice-grade channel supports a range of frequencies that is 4 KHz wide. That's not enough for perfect voice transmission (we are capable of creating audio well above 4 KHz), but it's good enough.

Further, each channel supports a range of signal amplitude (i.e., signal strength) that relates to a volume level. The amplitude level also is limited, so your loudest screams can't quite be heard over the network, but that's probably just as well. Again, it's not enough for perfect voice transmission, but it's good enough. It's known as toll quality voice.

Around the end of WWII (World War II, for you youngsters), the networks began the transition from analog to digital technology. Digital offers a lot of advantages, including greater bandwidth, better error performance, and enhanced management and control.

Virtually all contemporary switches of all types are digital in nature, and so is a lot of terminal equipment. Most transmission facilities also are digital, with the notable exception of most copper local loops serving residential and small business applications. That makes the contemporary WAN virtually 100% digital, from edge-to-edge, at least in developed countries.

To support voice in its native analog form over a digital network, the analog signal has to be coded (i.e., converted) into a digital format at some point after leaving your lips and prior to entering the WAN. On the receiving end, the digital signal has to be decoded (i.e., reconverted) back into an analog format in order to be intelligible to the human ear.

Those conversion processes are accomplished by a matching pair of codecs (coder/decoders), with the traditional method being PCM (Pulse Code Modulation), standardized by the ITU-T as G.711. PCM is based on the Nyquist Theorem, developed by Harry Nyquist of Bell Telephone Laboratories in 1928.

The theorem (paraphrased) states that, in order to convert analog voice to a digital format, send it over a digital circuit, and reproduce high-quality analog voice at the receiving end, one must sample the amplitude of the analog waveform at a rate of at least twice the highest frequency on the line.

If one samples at twice the highest frequency on the line, one samples, therefore, at a rate of 4,000 x 2 = 8,000 times a second. (It's necessary to sample only the amplitude. The frequency will take care of itself at that rate.) If you do the math, you see that 8,000 samples per second x 8 bits per sample = 64,000 bits per second, or 64 Kbps. That's a voice-grade digital channel.
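
For those who like to see the math run, the arithmetic above can be sketched in a few lines of Python. The constant names are mine, chosen only for illustration.

```python
HIGHEST_FREQ_HZ = 4_000   # top of the 4-KHz voice-grade channel
BITS_PER_SAMPLE = 8       # each amplitude sample is coded as one byte

# Nyquist: sample at twice the highest frequency on the line.
sample_rate = 2 * HIGHEST_FREQ_HZ

# Bit rate of one voice-grade digital channel.
bit_rate = sample_rate * BITS_PER_SAMPLE

print(f"{sample_rate:,} samples per second")
print(f"{bit_rate:,} bits per second ({bit_rate // 1000} Kbps)")
```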

PCM further specifies that the sampling process take place at precise intervals of 125µs (microseconds, or millionths of a second), which is exactly 1/8000th of a second. Each sample is coded into an eight-bit digital value. The resulting eight-bit bytes are interleaved by multiplexers, and sent across channelized digital circuits (e.g., T-carrier) to be directed and redirected by circuit switches, sent across circuits (e.g., SONET) that interconnect the switches in the network core and ultimately decoded on the receiving end of the transmission.
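
The byte interleaving done by the multiplexers can be sketched roughly as follows. This is a simplification under my own assumptions: the function names and the three toy channels are hypothetical, and real T-carrier framing also adds framing bits, which I've left out.

```python
SAMPLE_RATE = 8_000
FRAME_INTERVAL_US = 1_000_000 / SAMPLE_RATE   # one frame every 125 microseconds

def interleave(channels):
    """Take one byte from each channel in turn, frame after frame."""
    return bytes(b for frame in zip(*channels) for b in frame)

def deinterleave(stream, n_channels):
    """Recover each channel by picking every n-th byte from the stream."""
    return [stream[i::n_channels] for i in range(n_channels)]

# Three toy voice channels, three samples (bytes) apiece.
channels = [b"AAA", b"BBB", b"CCC"]
stream = interleave(channels)
print(stream)                       # the time-division multiplexed byte stream
print(deinterleave(stream, 3))      # what the far-end demultiplexer recovers
```

Each channel gets its time slot in every frame whether it has anything to say or not, which is where both the precision and the inefficiency come from.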

The decoded signal, now in analog form once again, is only an approximation of the original analog signal, but it's thoroughly understandable to the human ear. It's not quite that simple, of course. Timing is critical. The network must be in a position to accept, switch, transport, and deliver every voice byte precisely every 125µs. That means that latency (i.e., delay) must be minimal and jitter (i.e., variability in delay) must be virtually zero.
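
To see why the decoded signal is only an approximation, here is a rough Python sketch of eight-bit coding with mu-law companding, the curve used by the North American flavor of G.711. Please treat it as an illustration, not as G.711 itself: the real codec uses a segmented approximation of this curve with its own bit layout, and the helper names and the test sample of 0.3 are my own assumptions.

```python
import math

MU = 255  # mu-law companding parameter

def compress(x):
    """Continuous mu-law curve for a sample x in the range -1.0 to 1.0."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse of compress()."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

def encode(x):
    """Code one amplitude sample as an eight-bit value (0-255)."""
    return round((compress(x) + 1) / 2 * 255)

def decode(code):
    """Turn an eight-bit value back into an approximate amplitude."""
    return expand(code / 255 * 2 - 1)

original = 0.3
code = encode(original)
restored = decode(code)
error = abs(restored - original)
print(f"sample {original} -> code {code} -> {restored:.4f} (error {error:.4f})")
```

The restored amplitude lands close to the original, but not exactly on it. That small gap is the price of squeezing a continuous signal into 256 levels.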

Notice that we didn't mention loss, because there is none. That translates into a network based on circuit-switching and channelized T-carrier or E-carrier and/or SONET. Taken together, this approach ensures that, once the call is set up, the associated bandwidth is committed for the entire duration of a circuit-switched call, absolutely and without question.

You really can't ask for more than this. It's virtually perfect. It performs beautifully. The unfortunate part of it is that it's inefficient. It's certainly inefficient for most data applications, but that's another story. It's also inefficient for voice, if we consider the fact that voice can be compressed in some interesting ways; we'll get to that later.

Voice The IP Way

Now that your memory is refreshed about both traditional voice and TCP/IP, it should be very clear that the two have very little in common. Voice speaks to circuit-switching, not to packet switching and routing. Voice speaks to channelized TDM (Time Division Multiplexing), not to unchannelized statistically multiplexed circuits. Voice speaks to sound bytes and silence bytes created, accepted, multiplexed, switched, transported and delivered at the frequent, regular and precise pace of every 125µs; not to packets that are created whenever there's some data around, and that move through the network whenever it's available. Voice speaks to committed bandwidth from call setup to call teardown, not to bandwidth that's available whenever it happens to be available.

Voice expects perfection; most data wouldn't even understand it, much less appreciate it. Packet data thrives over a voice network, but voice trembles at just the thought of traveling over a packet data network. VoIP? It just won't work!

Actually, it will work, and quite nicely -- if everything works just right. The trick is to keep latency, jitter and loss within limits. While that can be tough in a network that is built around applications that can tolerate all three, it's possible to pull a trick or two within the network, and at the endpoints, to make things work out.

The real key to sending voice over any packet data network is compression, and VoIP is no exception. Compression offers several advantages, one of which is the reduction of raw bandwidth required to support the information transfer. If we follow the trail of a voice signal over a typical digital network, we will see that it begins life as an analog signal, which is converted by a codec (coder/decoder) into a PCM format.

VoIP often makes use of DSPs (Digital Signal Processors, which are fancy codecs) that not only convert the analog signal into a digital format but also compress the signal, thereby requiring less than the 64 Kbps demanded by PCM. The standard approaches (there also are proprietary techniques) include ADPCM and CS-ACELP, which I'll briefly explain below. I'll start with a brief explanation of PCM, for the sake of comparison, and I'll throw in LD-CELP.

PCM (Pulse Code Modulation), the basic coding technique used in traditional PSTNs worldwide, requires bandwidth of 64 Kbps. The delay imposed by the coding/decoding process is 0.75ms (milliseconds, or thousandths of a second), which is imperceptible. (Note: voice communications are bidirectional. Therefore, delay impacts communications in both directions.) PCM yields what we have come to know as toll quality voice, which rates a 4.4 MOS (Mean Opinion Score) on a scale of 1.0-5.0, with 5.0 being perfect.

ADPCM (Adaptive Differential Pulse Code Modulation), a technique also used in some PSTN networks, offers voice coding at 40, 32, 24 and 16 Kbps. At the most common implementation rate of 32 Kbps, the compression rate is 2:1 (2 to 1), which exactly halves the bandwidth required for voice. ADPCM imposes delay of 1.0ms, which is imperceptible.

At 32 Kbps, ADPCM yields voice quality at a 4.2 MOS, which is very close to that of PCM. At the higher compression rates of 24 Kbps and 16 Kbps, there is a corresponding drop in quality.

CS-ACELP (Conjugate Structure-Algebraic Code Excited Linear Prediction) runs at 8 Kbps, a compression rate of 8:1. There are two versions, which vary in terms of computational complexity, either of which offers raw voice quality that is similar to that of ADPCM at 32 Kbps. If everything goes just right, CS-ACELP rates as high as a 4.2 MOS. Compression delay, however, is in the range of 10.0ms, which is decidedly perceptible.

LD-CELP (Low Delay-Code Excited Linear Prediction) runs at 16 Kbps, a compression rate of 4:1. Compression delay is 3.0ms-5.0ms, and raw voice quality is similar to that of ADPCM at 32 Kbps. That translates into a MOS of as much as 4.2, assuming that everything goes just right.
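
The compression rates quoted above all fall out of one division against the 64 Kbps PCM baseline. A quick Python sketch, using only the bit rates given in this lesson (the variable names are mine):

```python
PCM_KBPS = 64

# Bit rates as quoted in this lesson (ADPCM at its most common rate).
CODEC_KBPS = {"PCM": 64, "ADPCM": 32, "LD-CELP": 16, "CS-ACELP": 8}

# Compression rate relative to PCM, e.g. 64 / 8 = 8, written as 8:1.
ratios = {name: PCM_KBPS // kbps for name, kbps in CODEC_KBPS.items()}

for name, kbps in CODEC_KBPS.items():
    print(f"{name:9s} {kbps:2d} Kbps  {ratios[name]}:1")
```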

Regardless of the compression technique employed, the VoIP process is basically the same, although the specifics vary a bit. Just as we did with Voice over Frame Relay (VoFR), let's use CS-ACELP as the compression algorithm and build an example scenario.

On the transmit side, 80 PCM voice samples, representing 10ms of a voice stream, are gathered together to form a set of voice data comprising 640 bits. That data set is run through the CS-ACELP compression algorithm by a codec embedded in a VoIP gateway in the form of a highly intelligent router, and reduced to 80 bits (10 octets) to be transmitted, in keeping with the 8:1 compression rate. This is accomplished by consulting a code book that provides an abbreviated representation of an approximation of the data.
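
The frame arithmetic is worth checking in a short Python sketch. The constant names are my own. At the 8 Kbps rate quoted for CS-ACELP, a single 10ms frame works out to 80 bits (10 octets); a 160-bit (20-octet) payload would correspond to two such frames, or 20ms of speech.

```python
SAMPLE_RATE = 8_000      # PCM samples per second
BITS_PER_SAMPLE = 8
FRAME_MS = 10            # speech gathered per frame
CODEC_BPS = 8_000        # CS-ACELP output rate (8 Kbps)

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000
pcm_bits = samples_per_frame * BITS_PER_SAMPLE
compressed_bits = CODEC_BPS * FRAME_MS // 1000

print(f"{samples_per_frame} samples = {pcm_bits} PCM bits per frame")
print(f"{compressed_bits} bits ({compressed_bits // 8} octets) after compression")
print(f"two frames: {2 * compressed_bits} bits ({2 * compressed_bits // 8} octets)")
```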

The result is packed inside an IP packet, along with a UDP header for purposes of multiplexing, header error control, and identification of the application by port number. RTP (Real-time Transport Protocol) is run for end-to-end delivery services such as payload type identification, packet sequence numbering, time-stamping, and delivery monitoring.
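
Here is a minimal Python sketch of that packaging, showing the fixed 12-byte RTP header and what the headers do to the bandwidth. Several things are assumptions for the sake of the example: one 10-octet CS-ACELP frame per packet every 10ms, an IPv4 header without options, the hypothetical helper name `rtp_header`, and made-up sequence, timestamp and source identifier values. Link-layer overhead is ignored.

```python
import struct

def rtp_header(seq, timestamp, ssrc, payload_type):
    """Pack the fixed RTP header (no CSRC list, no extension, marker clear)."""
    byte0 = 2 << 6                # version 2; padding, extension, CSRC count all 0
    byte1 = payload_type & 0x7F   # payload type identifies the codec
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc)

IP_HEADER = 20    # bytes, IPv4 without options
UDP_HEADER = 8    # bytes

payload = bytes(10)   # placeholder for one compressed 10ms voice frame
# Payload type 18 is the static RTP assignment for G.729 (CS-ACELP).
packet = rtp_header(seq=1, timestamp=80, ssrc=0x1234ABCD, payload_type=18) + payload

total_bytes = IP_HEADER + UDP_HEADER + len(packet)
packets_per_second = 100              # one packet every 10ms
wire_kbps = total_bytes * 8 * packets_per_second / 1000

print(f"{len(payload)} payload bytes inside a {total_bytes}-byte IP packet")
print(f"{wire_kbps} Kbps on the wire for an 8 Kbps voice stream")
```

Under these assumptions the headers outweigh the voice data four to one, which is one reason gateways often pack more than one frame into each packet.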

On the receiving end, the process is reversed, and all is well -- at least as far as compression and decompression are concerned. The complication, of course, comes from the fact that the packets are not delivered by the network at the same pace that they entered it. Additionally, some packets may be lost in transit.

Remember that an IP network is a highly-shared packet network characterized by unpredictable levels of congestion. Latency is guaranteed, as is variability in delay (i.e., jitter). Loss isn't guaranteed, but it's highly likely, over time.

Much like VoFR, VoIP adjusts to latency, jitter and loss through various intelligent continuity algorithms employed by the receiving codec. These are designed to fill in the voids by stretching the voice frames received earlier and blending them with those received later.

This logic is embedded in predictive decompression algorithms such as CS-ACELP and LD-CELP, which take advantage of the 10ms delay built into the compression/decompression processes to make the necessary predictions and do their stretching and blending. VoIP also makes use of various techniques for echo cancellation, as echo becomes perceptible when delay exceeds 15ms-20ms.
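
A very crude Python sketch of the receiving side appears below. To be clear about what it is not: real CS-ACELP and LD-CELP decoders predict and blend across the gap, whereas this stand-in simply puts frames back in sequence order and repeats the last good frame when one is missing. The function name `playout` and the one-byte toy frames are hypothetical.

```python
def playout(received, first_seq, last_seq, silence=b"\x00"):
    """received: {sequence_number: frame}. Returns frames in playout order."""
    output = []
    last_good = silence   # played if the very first frame is missing
    for seq in range(first_seq, last_seq + 1):
        frame = received.get(seq)
        if frame is None:
            # Lost or too late: conceal the gap by repeating the previous frame.
            frame = last_good
        else:
            last_good = frame
        output.append(frame)
    return output

# Frames arrived out of order, and frame 2 never showed up.
arrived = {3: b"C", 1: b"A", 4: b"D"}
frames = playout(arrived, first_seq=1, last_seq=4)
print(frames)
```

The sequence numbers and time stamps carried by RTP are what make this reordering and gap detection possible in the first place.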

The actual end-to-end process is a little more involved, of course. In an enterprise-wide VoIP application, both the calling and called parties sit behind a PBX. The caller in San Francisco, for example, picks up the phone and dials the extension number of a co-worker in New York. The PBX checks its options for routing the call, courtesy of its LCR (Least Cost Routing) software.

Over a special link, the PBX hands the call off to a VoIP service provider. If the VoIP gateway has the sense that the call can be supported with acceptable QoS, the call is accepted, and the above scenario plays out. The gateway compresses the voice data, packs it into a packet every 10ms, and off we go.

If network congestion levels remain low during the course of the call, conversation quality remains pretty good, but never quite as good as it is over the good old PSTN. If network congestion levels increase, so do latency and jitter, and packet loss may result. Voice quality suffers, and fond memories of the PSTN haunt the balance of the conversation.

If, on the other hand, the ingress gateway has the sense that current congestion levels are such that the quality of the call is likely to be compromised, the call may be routed over the conventional PSTN. Not all service providers, of course, offer the PSTN backup option, but many do in order to support business-class users.
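
The gateway's decision can be sketched in Python along the following lines. Everything specific here is an assumption on my part: the threshold values, the `PathQuality` and `route_call` names, and the choice to reject the call when no PSTN backup exists. Actual gateways use their own measurements and policies.

```python
from dataclasses import dataclass

@dataclass
class PathQuality:
    latency_ms: float
    jitter_ms: float
    loss_pct: float

# Hypothetical limits for acceptable voice quality.
MAX_LATENCY_MS = 150
MAX_JITTER_MS = 30
MAX_LOSS_PCT = 1.0

def route_call(quality, pstn_backup=True):
    """Return 'voip', 'pstn' or 'reject' for a new call."""
    acceptable = (quality.latency_ms <= MAX_LATENCY_MS
                  and quality.jitter_ms <= MAX_JITTER_MS
                  and quality.loss_pct <= MAX_LOSS_PCT)
    if acceptable:
        return "voip"
    # Congested: fall back to the PSTN if the provider offers that option.
    return "pstn" if pstn_backup else "reject"

print(route_call(PathQuality(80, 10, 0.2)))                      # quiet network
print(route_call(PathQuality(300, 60, 5.0)))                     # congested
print(route_call(PathQuality(300, 60, 5.0), pstn_backup=False))  # no backup
```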

So What -- and Why?

So, voice data can be compressed in order to use shared bandwidth more efficiently; this can be done with little loss in voice quality, if everything goes just right.

Further, the decompression process can be sophisticated enough to smooth out some of the problems associated with latency, jitter, and loss of voice data over a packet data network -- within limits.

At this point, you have to ask yourself why in the world you would go to all of this trouble to run voice over a packet data network when the end result is uncertain quality that will never be as good as the PSTN, and which can be terrible when the network suffers congestion.

There are several answers, one of which is cost. VoIP is very inexpensive, at least in comparison to the PSTN. However, you still have to wonder if VoIP is worth the trouble in the face of PSTN voice calling at rates in the range of $0.04-$0.06 per minute. At rates perhaps as low as $0.02-$0.04 per minute, however, VoIP looks pretty good to the typical business enterprise. At rates in the general range of $0.04-$0.08 per minute, it looks real good to the small business and consumer market.

The cost issue takes on real significance in a multinational enterprise. Calls to Japan or South Africa, for example, may well be in the range $0.50-$0.60 per minute. VoIP starts to look real good at these prices. Unfortunately, many countries have declared VoIP and VoFR illegal, so you're running some real risks in even attempting such a thing.

In closing, I have to say that I'm not trying to pick a fight with the proponents of VoIP, any more than I did in the case of VoFR. (That column, in fact, did elicit comments from a slightly enraged consultant friend or two.) I'm just calling it the way I see it.

By the way, the way I see it is that VoIP will be huge, and for several reasons:

First, it will be inexpensive, even though quality will suffer a bit.

Second, many quality issues can be overcome by ever faster switches and routers, and ever faster transmission systems. (Admittedly, this is something of a brute force attack on a congestion issue, but bandwidth is pretty cheap these days.)

Third, VoIP offers tremendous advantages in terms of the fact that voice and data can be integrated over the same IP-based network, and through the same terminals in the form of integrated voice/data client workstations.

If you have any doubt about the integrated workstation thing (which so far has been very slow to develop), note that Microsoft announced VoIP support in its recent release of Windows XP. Imagine being able to exercise complete call control through your PC, share files with colleagues while discussing the contents in real-time, and in a truly collaborative mode over the same network at very low cost.

Imagine being able to negotiate with a call center agent to purchase an airline ticket or an automobile or a piece of furniture or clothing, all while looking at the same information on the Website, and all over the Internet, or other IP-based network. I repeat that this is going to be huge. There is very little doubt about that.

There is, of course, a lot more to VoIP than we've had a chance to discuss here. However, I'm out of my allotment of server memory, electrons, photons and such.