Voice/Data Comm 101
 
 
 
Voice over ATM Tutorial

by Ray Horak
CommWeb.com
05/29/01

To understand Voice over ATM (VoATM), you first need to understand Asynchronous Transfer Mode (ATM) itself, so the recommended prerequisite for this lesson is the previous lesson, ATM Fundamentals. For that matter, it wouldn't hurt to read the earlier lesson on Voice over Frame Relay (VoFR), so that you can make an informed comparison. Heck, go back and reread the lesson on Frame Relay Basics, so you can put VoFR in context too.

ATM is a connection-oriented, fast-packet cell switching standard developed for broadband signals in the WAN (Wide Area Network). Specifically, it was developed for application in the backbone, or core, of the WAN in support of B-ISDN (Broadband ISDN).

B-ISDN, just like its predecessor N-ISDN (Narrowband ISDN), is intended to support all information types, including voice, data, video, image, facsimile and multimedia. Unlike N-ISDN, B-ISDN is intended to support each information type through one of several Service Classes in ITU-T terminology, or Service Categories in ATM Forum terminology. That way, each information type -- and underlying application type -- is afforded just the treatment it expects. That translates into differentiation in terms of QoS (Quality of Service).

N-ISDN treats each information type exactly the same -- like real-time uncompressed voice, which demands a temporary, continuous and exclusive connection over which data can be transmitted at a Constant Bit Rate (CBR). It doesn't get any better than that, and it doesn't need to be that good. This approach is very effective, but very inefficient, even for voice.

Contemporary voice compression algorithms support real-time compressed voice in a much more efficient way, without yielding much in terms of quality. When it comes to data transmission, most data protocols don't expect anything even approaching this level of quality.

So ATM was designed to deliver just exactly the QoS level that each data type expects -- no more and no less. (I just repeated myself, but that thought bears repeating.) Further, ATM guarantees QoS, at multiple levels, simultaneously. No other protocol currently does that.

Voice The Conventional Way

Before we get into the specifics of VoATM, let's review the basics of voice communications as it is handled in the conventional PSTN (Public Switched Telephone Network).

You will recall that voice is analog in its native form and that the PSTN also was entirely analog for the first 75 years or so. In networks worldwide, the analog PSTN provided for each voice conversation to be carried in a 4-KHz channel. (Note: Hz is the abbreviation for Hertz, the unit of frequency; one Hertz is one complete waveform, or cycle, per second. In voice applications, the signal starts out as an acoustic compression wave, which is converted into an electrical wave. All electromagnetic energy travels in waveform.) In fundamental terms, that means a channel that runs between 0 KHz and 4 KHz.

In a multichannel analog carrier system, one channel might run at 0-4 KHz, the next at 4-8 KHz, the next at 8-12 KHz, and so on. Therefore, each voice-grade channel supports a range of frequencies that is 4 KHz wide. That's not enough for perfect voice transmission (we are capable of creating audio well above 4 KHz), but it's good enough. Each channel supports a range of signal amplitude (i.e., signal strength) that relates to a volume level. The amplitude level also is limited, so your loudest screams can't quite be heard; but that's probably just as well. Again, it's not enough for perfect voice transmission, but it's good enough. It's known as toll quality voice.

Around the end of WWII, the networks began the transition from analog to digital technology. Digital offers a lot of advantages, including greater bandwidth, better error performance, and enhanced management and control. Virtually all contemporary switches of all types are digital in nature, and so is a lot of terminal equipment. Most transmission facilities also are digital, with the notable exception of most copper local loops serving residential and small business applications. That makes the contemporary WAN virtually 100% digital, from edge-to-edge, at least in developed countries.

To support voice in its native analog form over a digital network, the analog signal has to be coded (i.e., converted) into a digital format at some point after leaving your lips and prior to entering the WAN. On the receiving end, the digital signal has to be decoded (i.e., reconverted) back into an analog format in order to be intelligible to the human ear.

Those conversion processes are accomplished by a matching pair of codecs (coder/decoders), with the traditional method being PCM (Pulse Code Modulation), standardized by the ITU-T as G.711. PCM is based on the Nyquist Theorem, developed by Harry Nyquist of Bell Telephone Laboratories in 1928. The theorem (paraphrased) states that, in order to convert analog voice to a digital format, send it over a digital circuit, and reproduce high-quality analog voice at the receiving end, you must sample the amplitude of the analog waveform at a rate of at least twice the highest frequency on the line.

Sampling at twice the highest frequency on the line therefore means sampling at a rate of 4,000 x 2 = 8,000 times a second. (It's necessary to sample only the amplitude. The frequency will take care of itself at that rate.) With each sample coded as an eight-bit value, the math works out to 8,000 samples per second x 8 bits per sample = 64,000 bits per second, or 64 Kbps. That's a voice-grade digital channel.
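The arithmetic above can be checked with a few lines of Python. This is only a sketch; the function and parameter names are made up for illustration and are not part of G.711.

```python
# Sketch of the PCM bit-rate arithmetic; names are illustrative only.

def pcm_bit_rate(highest_freq_hz: int = 4_000, bits_per_sample: int = 8) -> int:
    """Return the bit rate of one PCM voice channel, in bits per second."""
    sample_rate = 2 * highest_freq_hz       # Nyquist: twice the highest frequency
    return sample_rate * bits_per_sample    # one eight-bit value per sample

print(pcm_bit_rate())  # 64000
```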

PCM further specifies that the sampling process take place at precise intervals of 125 µs (microseconds, or millionths of a second), which is exactly 1/8000th of a second. Each sample is coded into an eight-bit digital value. The resulting eight-bit bytes are interleaved by multiplexers, and sent across channelized digital circuits (e.g., T-carrier) to be directed and redirected by circuit switches, sent across circuits (e.g., SONET) that interconnect the switches in the network core and decoded on the receiving end of the transmission.

The decoded signal, now in an analog form once again, is only an approximation of the original analog signal, but it's thoroughly understandable to the human ear. It's not quite that simple, of course. Timing is critical.
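To see why the decoded signal is only an approximation, here is a minimal Python sketch of eight-bit coding and decoding. It assumes the continuous mu-law companding curve (mu = 255) as a stand-in; real G.711 codecs use a segmented approximation of that curve (or A-law in much of the world), and the function names are invented for this example.

```python
import math

MU = 255  # companding parameter of the mu-law curve (assumed stand-in for G.711)

def encode(x: float) -> int:
    """Compress a sample in [-1.0, 1.0] into an eight-bit code (0..255)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)  # y in [-1, 1]
    return round((y + 1.0) / 2.0 * 255)

def decode(code: int) -> float:
    """Expand an eight-bit code back into an approximate sample in [-1.0, 1.0]."""
    y = code / 255 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

SAMPLE_INTERVAL_S = 1 / 8000  # one sample every 125 microseconds

# Sample a 1 KHz tone for 1 ms (eight samples), then code and decode each sample.
original = [0.5 * math.sin(2 * math.pi * 1000 * n * SAMPLE_INTERVAL_S) for n in range(8)]
restored = [decode(encode(s)) for s in original]
worst_error = max(abs(a - b) for a, b in zip(original, restored))
print(f"worst-case error: {worst_error:.4f}")
```

The restored samples are close to the originals but not identical, which is the quantization error the ear tolerates.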

The network must be in a position to accept, switch, transport, and deliver every voice byte precisely every 125 µs. That means that latency (i.e., delay) must be minimal and jitter (i.e., variability in delay) must be virtually zero. That translates into a network based on circuit-switching and channelized T-carrier and/or SONET. Taken together, this approach ensures that, once the call is set up, the associated bandwidth is committed for the entire duration of a circuit-switched call, absolutely and without question.

Voice The ATM Way

Now that your memory is refreshed about traditional voice, Frame Relay, VoFR, and ATM, let's explore VoATM. First, we'll examine pure VoATM. We'll spend some time later examining ATM as a backbone switching technology for VoFR and VoIP.

In a pure ATM scenario, there are two basic ways to support VoATM: as CBR (Constant Bit Rate) traffic, and as rt-VBR (real-time Variable Bit Rate) traffic. Let's spend just a minute reviewing those Service Categories, as defined by the ATM Forum:

  • CBR (Constant Bit Rate) traffic is characterized by a continuous flow of data, as you will recall from the previous lesson on ATM Fundamentals. Such stream-oriented data is highly intolerant of loss, latency (delay), and jitter (variability in delay).

    Example applications include real-time uncompressed voice, audio and video. CBR traffic is supported by AAL 1 (ATM Adaptation Layer 1) and supports Class A traffic, in ITU-T terminology.

  • rt-VBR (real-time Variable Bit Rate) traffic is characterized as highly bursty in nature, but reliant on tight synchronization (i.e., timing) between transmitter and receiver. Examples include real-time compressed voice, audio, and video. rt-VBR traffic is supported by AAL 2 and supports Class B traffic, in ITU-T terminology.

VoATM originally was cast exclusively in a CBR mode, based on a traditional PCM model. This approach supports PCM voice through Circuit Emulation (CE), which describes the emulation (i.e., imitation) of a T1 circuit through an ATM-based network. In other words, a T1's worth of bandwidth (1.544 Mbps) is set aside over a path set up switch-to-switch (ATM operates only at Layers 1 and 2 of the OSI Reference Model), and end-to-end through perhaps many switches. In effect, an ATM network in CBR mode behaves like a circuit-switched network.
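For reference, the T1 figure can be rebuilt from the 64 Kbps voice channel. The sketch below relies on the standard T1 frame layout (24 channels plus one framing bit per frame, 8,000 frames a second), which the text above does not spell out; the names are illustrative.

```python
# Sketch: where a T1's 1.544 Mbps comes from. Names are illustrative only.

DS0_BPS = 64_000              # one PCM voice channel
CHANNELS_PER_T1 = 24          # standard T1 channel count
FRAMES_PER_SECOND = 8_000     # one frame per 125-microsecond sampling interval
FRAMING_BITS_PER_FRAME = 1    # one framing bit precedes the 24 eight-bit samples

def t1_line_rate_bps() -> int:
    """Return the T1 line rate that Circuit Emulation must carry."""
    payload = CHANNELS_PER_T1 * DS0_BPS                   # 1,536,000 bps of voice
    framing = FRAMING_BITS_PER_FRAME * FRAMES_PER_SECOND  # 8,000 bps of framing
    return payload + framing

print(t1_line_rate_bps())  # 1544000
```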

As each ATM cell is readied to transit the ATM network, it is packed with PCM samples, one per octet, to fill up the 48-octet payload. (AAL 1 reserves one of those octets for its own header, which leaves room for as many as 47 samples.) The cell is fired across the network with very little delay, very little jitter, and very low likelihood of loss. Each ATM switch recognizes the high priority of the cell from the notation in the cell header and because it's been alerted in the call setup process to expect this data.

On the receiving end, the cell is deconstructed to get to the payload of PCM samples, and those samples are spliced onto the tail end of the set of samples that arrived in the previous cell. This approach works beautifully, but it's inefficient. It's not quite as inefficient as true circuit-switching, which literally commits resources all through the network for each call, but it's close. In terms of perceived quality, VoATM in CBR mode still ranks as toll quality, earning a ranking of 4.4 on a scale of 5.0.
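A rough Python sketch of what packing PCM samples into cells costs in delay and bandwidth. The 53-octet cell (5-octet header plus 48-octet payload) is standard ATM; the function name is hypothetical, and showing both a full 48-sample fill and a 47-sample fill (one octet given up to an adaptation-layer header) is an assumption made for comparison.

```python
# Sketch: fill delay and on-the-wire rate for one PCM voice channel over CBR.

CELL_OCTETS = 53            # 5-octet header + 48-octet payload
SAMPLE_INTERVAL_US = 125    # one PCM sample every 125 microseconds
SAMPLES_PER_SECOND = 8_000

def cbr_cell_cost(samples_per_cell: int):
    """Return (fill delay in ms, line rate in bps) for the given fill level."""
    fill_delay_ms = samples_per_cell * SAMPLE_INTERVAL_US / 1000
    cells_per_second = SAMPLES_PER_SECOND / samples_per_cell
    line_rate_bps = cells_per_second * CELL_OCTETS * 8
    return fill_delay_ms, line_rate_bps

for n in (48, 47):
    delay, rate = cbr_cell_cost(n)
    print(f"{n} samples/cell: {delay:.3f} ms to fill, {rate:,.0f} bps on the wire")
```

Either way, a 64 Kbps voice channel ends up costing a little over 70 Kbps once cell headers are counted, and each cell takes about 6 ms to fill.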

In an rt-VBR mode, VoATM is supported in burst mode, while still supporting the maintenance of tight synchronization between the transmitter and receiver. The nature of rt-VBR supports highly compressed voice, which generally makes use of a compression algorithm based on CELP (Code Excited Linear Prediction).

Such predictive techniques can mitigate the perceived loss in quality caused by a lost set of voice samples, because the previous sets provide enough information to predict the missing samples; speech follows a fairly predictable pattern, at least over the short term. Let's revisit the specifics of several common compression techniques based on CELP:

  • CS-ACELP (Conjugate Structure-Algebraic Code Excited Linear Prediction) runs at 8 Kbps, a compression rate of 8:1. There are two versions, which vary in terms of computational complexity, either of which offers a perceived raw voice quality ranking of 4.2, which is similar to that of ADPCM at 32 Kbps. Compression delay, however, is in the range of 10.0 ms, which is perceptible.
  • LD-CELP (Low Delay-Code Excited Linear Prediction) generally runs at 16 Kbps (rates as low as 12.8 Kbps can be achieved), a compression rate of 4:1. Compression delay is 3.0 ms-5.0 ms. Again, a perceived raw voice quality ranking of 4.2 can be achieved, similar to that of ADPCM at 32 Kbps (see my post below on ADPCM).

Now let's focus on LD-CELP, standardized by the ITU-T as G.728. This technique is described as low delay due to the fact that only five PCM samples, representing only 625 µs (i.e., 0.625 ms, or 0.000625 seconds) of a voice signal, are gathered at a time and examined as a data block. LD-CELP then converts each eight-bit PCM byte into a two-bit compressed value, yielding a set of five PCM samples (40 bits, in total) converted into a 10-bit compressed value.
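The block arithmetic can be laid out in a few lines of Python. This sketch follows the description above and treats the compression as a fixed two bits per sample; the constant names are illustrative, not taken from G.728.

```python
# Sketch of the LD-CELP block arithmetic described above.

SAMPLE_INTERVAL_US = 125       # PCM sampling interval, in microseconds
PCM_BITS_PER_SAMPLE = 8
LDCELP_BITS_PER_SAMPLE = 2     # the per-sample equivalent used in the text
SAMPLES_PER_BLOCK = 5

block_duration_us = SAMPLES_PER_BLOCK * SAMPLE_INTERVAL_US        # 625 microseconds
pcm_bits = SAMPLES_PER_BLOCK * PCM_BITS_PER_SAMPLE                # 40 bits in
compressed_bits = SAMPLES_PER_BLOCK * LDCELP_BITS_PER_SAMPLE      # 10 bits out
bit_rate_bps = compressed_bits * 1_000_000 // block_duration_us   # 16,000 bps

print(block_duration_us, pcm_bits, compressed_bits, bit_rate_bps)  # 625 40 10 16000
```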

The next step varies, depending on the service provider's approach. Using the least efficient approach, the 10-bit value is placed into the payload of an AAL 2 cell, with stuff bits completing the fill of 48 octets, and the cell is sent on its way.

A more efficient approach might call for four 10-bit values to be linked together and placed into the cell, with stuff bits completing the fill, and the cell is sent on its way. Each set of 40 bits represents only 2.5ms of a voice stream, which is well within delay limits. As the number of 10-bit values increases and the number of stuff bits decreases, the technique becomes more and more efficient, although compression delay increases and the nature of the conversational voice communication can be affected in a very negative way. In any event, once filled, each cell is fired across the network.
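The trade-off between stuff bits and delay can be tabulated with a short Python sketch. It assumes, as the text does, that all 48 payload octets are available for voice (it ignores any AAL 2 header octets inside the payload); the function name is hypothetical.

```python
# Sketch: payload efficiency versus voice delay as more 10-bit values share a cell.

PAYLOAD_BITS = 48 * 8        # 384 bits of cell payload
BLOCK_BITS = 10              # one compressed LD-CELP value
BLOCK_DURATION_MS = 0.625    # voice time represented by each value

def fill_tradeoff(blocks_per_cell: int) -> dict:
    """Return voice time carried, stuff bits, and payload efficiency for one cell."""
    used_bits = blocks_per_cell * BLOCK_BITS
    if used_bits > PAYLOAD_BITS:
        raise ValueError("too many values for one 48-octet payload")
    return {
        "voice_ms": blocks_per_cell * BLOCK_DURATION_MS,
        "stuff_bits": PAYLOAD_BITS - used_bits,
        "efficiency": used_bits / PAYLOAD_BITS,
    }

for k in (1, 4, 16, 38):
    t = fill_tradeoff(k)
    print(f"{k:2d} values: {t['voice_ms']:6.3f} ms of voice, "
          f"{t['stuff_bits']:3d} stuff bits, {t['efficiency']:.1%} of payload used")
```

At one value per cell almost the entire payload is stuffing; at 38 values the payload is nearly full, but each cell holds almost 24 ms of speech before it can leave.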

Each ATM switch in that network recognizes the high priority of the cell through analysis of the cell header, and processes the cell with no delay. On the receiving end, the process is reversed, with the voice data being decompressed, reconverted into analog format, and played back for the listening pleasure of your conversational partner. Now, if all of this is done just right, cell #2 arrives at the receiver just in time to be decompressed and played back just as the voice data from cell #1 finishes playing.

Now, let's consider another dimension of AAL 2 as it applies to compressed voice. Voice is two-way, so circuits and networks have to support full-duplex communications. However, the basic human voice communications protocol suggests that we should take turns talking. (My mother does not subscribe to this theory, but we'll keep that between us, please.)

Therefore, 50% of the circuit and network is quiet virtually 100% of the time. Further, the active speaker is silent about 40% of the time, due to natural pauses and the requirement to stop and breathe once in a while.

A silence suppression mechanism supported by AAL 2 can be used to notify the receiving device to play comfort noise (i.e., white noise) to assure the listener that the connection remains active even if the speaker does not. This mechanism relieves the network of the requirement to transmit silence, as well as active audio. In other words, only sound bytes need be transmitted. Silence bytes are unnecessary -- they consume precious bandwidth, while adding absolutely no value.

When you put LD-CELP together with silence suppression over AAL 2 in an ATM network, the overall savings in network resources can be quite dramatic. Bandwidth is always limited and, at some level in any network, bandwidth is always shared. In a highly-shared, statistically-multiplexed ATM network, compressed voice over AAL 2 yields bandwidth savings that can be used for other purposes, like other compressed voice conversations, LAN-to-LAN data, Frame Relay data or compressed video.
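Putting the round numbers from the text together gives a feel for the size of the savings. The Python sketch below is a back-of-the-envelope estimate only: it assumes each party talks half the time and is silent 40% of that time, and it ignores cell headers, stuff bits, and comfort-noise signaling. The names are illustrative.

```python
# Sketch: average voice payload per direction with LD-CELP plus silence suppression.

PCM_BPS = 64_000          # uncompressed voice channel
LDCELP_BPS = 16_000       # LD-CELP at 4:1 compression
TALK_SHARE = 0.5          # each party speaks about half the time
SPEECH_FRACTION = 0.6     # the active speaker is silent about 40% of the time

def average_voice_bps(codec_bps: int = LDCELP_BPS) -> float:
    """Average payload rate in one direction once silence is not transmitted."""
    return codec_bps * TALK_SHARE * SPEECH_FRACTION

avg = average_voice_bps()
print(f"average per direction: {avg:,.0f} bps, "
      f"about {PCM_BPS / avg:.1f}x less than a 64 Kbps PCM channel")
```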

So much for pure VoATM, although I should point out that there's not much of it. The incumbent carriers generally have plenty of capacity in their legacy circuit-switched networks, and it's very inexpensive to add capacity as needed. The cost of circuit switches has come way down over the last few years, in the face of competition from ATM, Frame Relay and IP.

So circuit-switching generally remains cost-effective, even though it's inefficient. That's especially so when you consider that ATM requires a forklift upgrade (i.e., total replacement), and that's very expensive. Now, VoATM would make a lot of sense in a private enterprise ATM network provided, of course, by a public carrier. But there aren't a lot of those around either -- not at the moment, at least.

Voice over Packet over ATM

For the most part, ATM is positioned in the backbone of data networks. That's where the growth is, and that's where ATM really shines. Once so positioned, it's a relatively simple matter to support not only Frame Relay, but also VoFR, and with a higher -- and guaranteed -- QoS level. Also, once so positioned, it's a relatively simple matter to support not only IP, but also VoIP, and with a higher -- and guaranteed -- QoS level. For that matter, once so positioned, a service provider can support them all, and a lot more, all simultaneously and all with guaranteed QoS levels. Only ATM can do that.