Voice over ATM Tutorial
by Ray Horak
CommWeb.com
05/29/01
In order to understand Voice over ATM (VoATM), it would be a
really good idea for you to understand Asynchronous Transfer Mode
itself. So the recommended prerequisite for this lesson is the
last lesson, ATM Fundamentals. For that matter, it wouldn't hurt
you to read the earlier lesson on Voice over Frame Relay (VoFR),
so that you can make an informed comparison. Heck, go back and
reread the lesson on Frame Relay Basics, so you can put VoFR in
context too.
ATM is a connection-oriented, fast-packet cell switching standard
developed for broadband signals in the WAN
(Wide Area Network). Specifically, it was developed for application
in the backbone, or core, of the WAN in support of B-ISDN
(Broadband ISDN).
B-ISDN, just like its predecessor N-ISDN (Narrowband ISDN), is
intended to support all information types, including voice, data,
video, image, facsimile and multimedia. Unlike N-ISDN, B-ISDN
is intended to support each information type through one of several
Service Classes in ITU-T terminology, or Service Categories in
ATM Forum terminology. That way, each information type -- and
underlying application type -- is afforded just the treatment
it expects. That translates into differentiation in terms of QoS
(Quality of Service).
N-ISDN treats each information type exactly the same -- like
real-time uncompressed voice, which demands a temporary, continuous
and exclusive connection over which data can be transmitted at
a Constant Bit Rate (CBR). It doesn't get any better than
that, and it doesn't need to be that good. This approach is very
effective, but very inefficient, even for voice.
Contemporary voice compression algorithms support real-time compressed
voice in a much more efficient way, without yielding much in terms
of quality. When it comes to data transmission, most data protocols
don't expect anything even approaching this level of quality.
So ATM was designed to deliver just exactly the QoS level that
each data type expects -- no more and no less. (I just repeated
myself, but that thought bears repeating.) Further, ATM guarantees
QoS, at multiple levels, simultaneously. No other protocol currently
does that.
Voice The Conventional Way
Before we get into the specifics of VoATM, let's review the basics
of voice communications as it is handled in the conventional PSTN
(Public Switched Telephone Network).
You will recall that voice is analog in its native form and that
the PSTN also was entirely analog for the first 75 years or so.
In networks worldwide, the analog PSTN provided for each voice
conversation to be carried in a 4-KHz channel. (Note: Hz is an
abbreviation for Hertz, a unit of frequency equal to one cycle,
or one complete waveform, per second. In voice applications, the
signal starts out as an audio compression wave, which is converted
into an electrical wave. All electromagnetic energy travels in
waves.) In fundamental terms, that means a channel that runs
between 0 KHz and 4 KHz.
In a multichannel analog carrier system, one channel might run
at 0-4 KHz, the next at 4-8 KHz, the next at 8-12 KHz, and so
on. Therefore, each voice-grade channel supports a range of frequencies
that is 4 KHz wide. That's not enough for perfect voice transmission
(we are capable of creating audio well above 4 KHz), but it's
good enough. Each channel supports a range of signal amplitude
(i.e., signal strength) that relates to a volume level. The amplitude
level also is limited, so your loudest screams can't quite be
heard; but that's probably just as well. Again, it's not enough
for perfect voice transmission, but it's good enough. It's known
as toll quality voice.
In the early 1960s, the networks began the transition from
analog to digital technology. Digital offers a lot of advantages,
including greater bandwidth, better error performance, and enhanced
management and control. Virtually all contemporary switches of
all types are digital in nature, and so is a lot of terminal equipment.
Most transmission facilities also are digital, with the notable
exception of most copper local loops serving residential and small
business applications. That makes the contemporary WAN virtually
100% digital, from edge-to-edge, at least in developed countries.
To support voice in its native analog form over a digital network,
the analog signal has to be coded (i.e., converted) into a digital
format at some point after leaving your lips and prior to entering
the WAN. On the receiving end, the digital signal has to be decoded
(i.e., reconverted) back into an analog format in order to be
intelligible to the human ear.
Those conversion processes are accomplished by a matching pair
of codecs (coder/decoders), with the traditional method being
PCM (Pulse Code Modulation), standardized by the ITU-T as G.711. PCM
is based on the Nyquist Theorem, developed by Harry Nyquist of
Bell Telephone Laboratories in 1928. The theorem (paraphrased)
states that, in order to convert analog voice to a digital format,
send it over a digital circuit, and reproduce high-quality analog
voice at the receiving end, you must sample the amplitude of the
analog waveform at a rate of at least twice the highest frequency
on the line.
If one samples at twice the highest frequency on the line, one
samples, therefore, at a rate of 4,000 x 2 = 8,000 times a second.
(It's necessary to sample only the amplitude. The frequency will
take care of itself at that rate.) If you do the math, you see
that 8,000 samples per second x 8 bits per sample = 64,000 bits per second,
or 64 Kbps. That's a voice-grade digital channel.
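
If you'd like to check that math yourself, here is a minimal Python sketch. The variable names are mine, chosen purely for illustration; the figures come straight from the text.

```python
# PCM channel arithmetic, using the figures from the text.
HIGHEST_FREQ_HZ = 4_000   # top of the 4 KHz voice-grade channel
BITS_PER_SAMPLE = 8       # each amplitude sample is coded as one byte

# Nyquist: sample at twice the highest frequency on the line.
sample_rate_hz = 2 * HIGHEST_FREQ_HZ
bit_rate_bps = sample_rate_hz * BITS_PER_SAMPLE

# The gap between samples, in microseconds.
sample_interval_us = 1_000_000 / sample_rate_hz

print(sample_rate_hz)      # 8000
print(bit_rate_bps)        # 64000
print(sample_interval_us)  # 125.0
```

The last figure is simply 1/8000th of a second expressed in microseconds.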
PCM further specifies that the sampling process take place at
precise intervals of 125 µs (microseconds, or millionths of a second),
which is exactly 1/8000th of a second. Each sample is coded into
an eight-bit digital value. The resulting eight-bit bytes are
interleaved by multiplexers, and sent across channelized digital
circuits (e.g., T-carrier) to be directed and redirected by circuit
switches, sent across circuits (e.g., SONET) that interconnect
the switches in the network core and decoded on the receiving
end of the transmission.
The decoded signal, now in an analog form once again, is only
an approximation of the original analog signal, but it's thoroughly
understandable to the human ear. It's not quite that simple, of
course. Timing is critical.
The network must be in a position to accept, switch, transport,
and deliver every voice byte precisely every 125 µs. That means
that latency (i.e., delay) must be minimal and jitter (i.e., variability
in delay) must be virtually zero. That translates into a network
based on circuit-switching and channelized T-carrier and/or SONET.
Taken together, this approach ensures that, once the call is set
up, the associated bandwidth is committed for the entire duration
of a circuit-switched call, absolutely and without question.
Voice The ATM Way
Now that your memory is refreshed about traditional voice, Frame
Relay, VoFR, and ATM, let's explore VoATM. First, we'll examine
pure VoATM. We'll spend some time later examining ATM as a backbone
switching technology for VoFR and VoIP.
In a pure ATM scenario, there are two basic ways to support VoATM:
as CBR (Constant Bit Rate) traffic, and as rt-VBR (real-time Variable
Bit Rate) traffic. Let's spend just a minute reviewing those Service
Categories, as defined by the ATM Forum:
VoATM originally was cast exclusively in a CBR mode, based on
a traditional PCM model. This approach supports PCM voice through
Circuit Emulation (CE), which describes the emulation (i.e., imitation)
of a T1 circuit through an ATM-based network. In other words,
a T1's worth of bandwidth (1.544 Mbps) is set aside over a path
set up switch-to-switch (ATM operates only at Layers 1 and 2 of
the OSI Reference Model), and end-to-end through perhaps many switches.
In effect, an ATM network in CBR mode behaves like a circuit-switched
network.
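
For the curious, the 1.544 Mbps figure can be rebuilt from first principles. The sketch below assumes the standard T1 frame layout (24 voice channels plus one framing bit per frame), which is not spelled out in this lesson; the variable names are illustrative.

```python
# Where a T1's 1.544 Mbps comes from.
# Assumption (not stated in the lesson): standard T1 framing of
# 24 channels x 8 bits plus 1 framing bit, 8,000 frames per second.
CHANNELS = 24
BITS_PER_SAMPLE = 8
FRAMING_BITS = 1
FRAMES_PER_SECOND = 8_000   # one frame per 125-microsecond sample interval

bits_per_frame = CHANNELS * BITS_PER_SAMPLE + FRAMING_BITS
t1_rate_bps = bits_per_frame * FRAMES_PER_SECOND
voice_rate_bps = CHANNELS * BITS_PER_SAMPLE * FRAMES_PER_SECOND

print(bits_per_frame)   # 193
print(t1_rate_bps)      # 1544000
print(voice_rate_bps)   # 1536000
```

In other words, Circuit Emulation has to reserve room for 24 voice channels at 64 Kbps each, plus the framing overhead.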
As each ATM cell is readied to transit the ATM network, it is
packed with as many as 48 PCM samples, which fills up the 48-octet
payload. The cell is fired across the network with very little
delay, very little jitter, and very low likelihood of loss. Each
ATM switch recognizes the high priority of the cell from the notation
in the cell header and because it's been alerted in the call setup
process to expect this data.
On the receiving end, the cell is deconstructed to get to the
payload of PCM samples, and those samples are spliced onto the
tail end of the set of samples that arrived in the previous cell.
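
The packing and splicing steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it takes the 48-sample payload from the text at face value, ignores the cell header, and ignores any overhead a real adaptation layer may reserve inside the payload. The function names are hypothetical.

```python
PAYLOAD_OCTETS = 48        # payload size given in the text
SAMPLE_INTERVAL_US = 125   # one PCM sample every 125 microseconds

def pack_cells(samples: bytes) -> list[bytes]:
    """Split a PCM byte stream into 48-octet payloads, zero-padding the last."""
    cells = []
    for start in range(0, len(samples), PAYLOAD_OCTETS):
        chunk = samples[start:start + PAYLOAD_OCTETS]
        cells.append(chunk.ljust(PAYLOAD_OCTETS, b"\x00"))
    return cells

def reassemble(cells: list[bytes], total_samples: int) -> bytes:
    """Splice payloads back together in arrival order and drop the padding."""
    return b"".join(cells)[:total_samples]

# 100 dummy PCM samples stand in for a snippet of voice.
voice = bytes(i % 256 for i in range(100))
cells = pack_cells(voice)

# Time spent waiting for 48 samples to accumulate before a cell can leave.
fill_time_ms = PAYLOAD_OCTETS * SAMPLE_INTERVAL_US / 1000

print(len(cells))      # 3
print(fill_time_ms)    # 6.0
print(reassemble(cells, len(voice)) == voice)   # True
```

Note the 6 ms it takes to fill a cell with uncompressed samples. That wait is part of the price of packing cells completely full.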
This approach works beautifully, but it's inefficient. It's not
quite as inefficient as true circuit-switching, which literally
commits resources all through the network for each call, but it's
close. In terms of perceived quality, VoATM in CBR mode still
ranks as toll quality, earning a ranking of 4.4 on a scale of
5.0.
In an rt-VBR mode, VoATM is supported in burst mode, while still
supporting the maintenance of tight synchronization between the
transmitter and receiver. The nature of rt-VBR supports highly
compressed voice, which generally makes use of a compression algorithm
based on CELP (Code Excited Linear Prediction).
Such predictive techniques can mitigate the perceived loss in
quality caused by a lost set of voice samples. Because voice follows
a fairly predictable pattern, at least over the short term, the
previous sets provide enough information to predict the missing
one. Let's revisit the specifics of several common
compression techniques based on CELP:
- CS-ACELP (Conjugate Structure-Algebraic Code Excited Linear
Prediction) runs at 8 Kbps, a compression rate of 8:1. There
are two versions, which vary in terms of computational complexity,
either of which offers a perceived raw voice quality ranking
of 4.2, which is similar to that of ADPCM at 32 Kbps. Compression
delay, however, is in the range of 10.0 ms, which is perceptible.
- LD-CELP (Low Delay-Code Excited Linear Prediction)
generally runs at 16 Kbps (rates as low as 12.8 Kbps can be
achieved), a compression rate of 4:1. Compression delay is 3.0
ms-5.0 ms. Again, a perceived raw voice quality ranking of 4.2
can be achieved, similar to that of ADPCM at 32 Kbps (see my
post below on ADPCM).
Now let's focus on LD-CELP, standardized by the ITU-T as G.728.
This technique is described as low delay due to the fact that
only five PCM samples, representing only 625 µs (i.e., 0.625 ms,
or 0.000625 seconds) of a voice signal, are gathered at a time and
examined as a data block. LD-CELP then converts each eight-bit
PCM byte into a two-bit compressed value, yielding a set of five
PCM samples (40 bits, in total) converted into a 10-bit compressed
value.
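
Here is a quick Python check of that block arithmetic. It uses only the figures given above; the variable names are mine.

```python
# LD-CELP block arithmetic, using the figures from the text.
SAMPLE_INTERVAL_US = 125         # one PCM sample every 125 microseconds
SAMPLES_PER_BLOCK = 5
PCM_BITS_PER_SAMPLE = 8
COMPRESSED_BITS_PER_SAMPLE = 2

block_duration_us = SAMPLES_PER_BLOCK * SAMPLE_INTERVAL_US
pcm_bits = SAMPLES_PER_BLOCK * PCM_BITS_PER_SAMPLE
compressed_bits = SAMPLES_PER_BLOCK * COMPRESSED_BITS_PER_SAMPLE

# 10 bits every 625 microseconds, expressed in bits per second.
bit_rate_bps = compressed_bits * 1_000_000 // block_duration_us
compression_ratio = pcm_bits / compressed_bits

print(block_duration_us)   # 625
print(pcm_bits)            # 40
print(compressed_bits)     # 10
print(bit_rate_bps)        # 16000
print(compression_ratio)   # 4.0
```

The numbers line up: 10 bits every 625 microseconds works out to 16 Kbps, a 4:1 compression of the 64 Kbps PCM stream.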
The next step varies, depending on the service provider's approach.
Using the least efficient approach, the 10-bit value is placed
into the payload of an AAL 2 cell, with stuff bits completing
the fill of 48 octets, and the cell is sent on its way.
A more efficient approach might call for four 10-bit values to
be linked together and placed into the cell, with stuff bits completing
the fill, and the cell is sent on its way. Each set of 40 bits
represents only 2.5 ms of a voice stream, which is well within
delay limits. As the number of 10-bit values increases and the
number of stuff bits decreases, the technique becomes more and
more efficient, although compression delay increases and the nature
of the conversational voice communication can be affected in a
very negative way. In any event, once filled, each cell is fired
across the network.
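
The tradeoff between cell fill and delay can be tabulated with a short Python sketch. It is a simplification: it treats the whole 48-octet payload as available for voice and ignores any AAL 2 overhead inside the payload. The function name `cell_fill` is hypothetical.

```python
PAYLOAD_BITS = 48 * 8   # 48-octet cell payload
VALUE_BITS = 10         # one LD-CELP compressed value
BLOCK_MS = 0.625        # voice time represented by one 10-bit value

def cell_fill(n_values: int) -> dict:
    """Voice time carried, stuff bits, and fill ratio for n values per cell."""
    used_bits = n_values * VALUE_BITS
    if n_values < 1 or used_bits > PAYLOAD_BITS:
        raise ValueError("values do not fit in one payload")
    return {
        "voice_ms": n_values * BLOCK_MS,
        "stuff_bits": PAYLOAD_BITS - used_bits,
        "fill": used_bits / PAYLOAD_BITS,
    }

max_values = PAYLOAD_BITS // VALUE_BITS   # most values one payload can hold

for n in (1, 4, 16, max_values):
    row = cell_fill(n)
    print(n, row["voice_ms"], row["stuff_bits"], f"{row['fill']:.0%}")
```

With four values per cell, the cell carries 2.5 ms of voice and 344 stuff bits. At the other extreme, 38 values leave only 4 stuff bits but hold each cell back for 23.75 ms of voice, which is where the delay starts to hurt.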
Each ATM switch in that network recognizes the high priority
of the cell through analysis of the cell header, and processes
the cell with no delay. On the receiving end, the process is reversed,
with the voice data being decompressed, reconverted into analog
format, and played back for the listening pleasure of your conversational
partner. Now, if all of this is done just right, cell #2 arrives
at the receiver just in time to be decompressed and played back
just as the voice data from cell #1 was finished playing.
Now, let's consider another dimension of AAL 2 as it applies
to compressed voice. Voice is two-way, so circuits and networks
have to support full-duplex communications. However, the basic
human voice communications protocol suggests that we should take
turns talking. (My mother does not subscribe to this theory, but
we'll keep that between us, please.)
Therefore, at any given moment, one direction of the connection, or
50% of the circuit and network capacity, is quiet. Further, the
active speaker is silent about
40% of the time, due to natural pauses and the requirement to
stop and breathe once in a while.
A silence suppression mechanism supported by AAL 2 can be used
to notify the receiving device to play comfort noise (i.e., white
noise) to assure the listener that the connection remains active
even if the speaker does not. This mechanism relieves the network
of the requirement to transmit silence, as well as active audio.
In other words, only sound bytes need be transmitted. Silence
bytes are unnecessary -- they consume precious bandwidth, while
adding absolutely no value.
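
To make the idea concrete, here is a toy Python sketch of silence suppression. Everything in it is an assumption for illustration: the amplitude threshold, the message names, and the frame format are hypothetical and do not represent the actual AAL 2 mechanism.

```python
SILENCE_THRESHOLD = 4   # hypothetical peak amplitude below which a frame is "silent"

def is_silent(frame: list[int]) -> bool:
    """Treat a non-empty frame as silence if no sample reaches the threshold."""
    return max(abs(sample) for sample in frame) < SILENCE_THRESHOLD

def transmit(frames: list[list[int]]) -> list[tuple]:
    """Send voice frames; announce a run of silence once, then send nothing."""
    sent = []
    in_silence = False
    for frame in frames:
        if is_silent(frame):
            if not in_silence:
                sent.append(("SILENCE", None))   # notify the receiver once
                in_silence = True
        else:
            sent.append(("VOICE", frame))
            in_silence = False
    return sent

def receive(messages: list[tuple]) -> list:
    """Play voice frames as they arrive; play comfort noise on a silence notice."""
    playout = []
    for kind, frame in messages:
        playout.append(frame if kind == "VOICE" else "comfort noise")
    return playout

frames = [[10, -12, 9], [0, 1, -1], [0, 0, 0], [20, 5, -7]]
sent = transmit(frames)
print(len(frames), len(sent))   # 4 3
print(receive(sent))
```

Four frames went in, but only three messages crossed the network, and the longer the pause, the bigger the savings. A real receiver would also need timing information to know how long to keep the comfort noise playing; this sketch leaves that out.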
When you put LD-CELP together with silence suppression over AAL
2 in an ATM network, the overall savings in network resources
can be quite dramatic. Bandwidth is always limited and, at some
level in any network, bandwidth is always shared. In a highly shared,
statistically multiplexed ATM network, compressed voice over AAL
2 yields bandwidth savings that can be used for other purposes,
like other compressed voice conversations, LAN-to-LAN data, Frame
Relay data or compressed video.
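
How dramatic? The rough Python estimate below combines only figures from this lesson: 64 Kbps PCM, 16 Kbps LD-CELP, each party talking about half the time, and an active speaker pausing about 40% of the time. It ignores cell headers and stuff bits, so treat the result as a ballpark rather than a measurement.

```python
PCM_BPS = 64_000       # uncompressed voice, one direction
LD_CELP_BPS = 16_000   # LD-CELP compressed voice, one direction
TALK_SHARE = 0.5       # each party talks about half the time
PAUSE_SHARE = 0.4      # an active speaker is silent about 40% of the time

# Fraction of the time one direction actually carries sound.
activity = TALK_SHARE * (1 - PAUSE_SHARE)

# Average rate per direction once silence is no longer transmitted.
average_bps = LD_CELP_BPS * activity
savings = 1 - average_bps / PCM_BPS

print(f"{activity:.2f}")      # 0.30
print(f"{average_bps:.0f}")   # 4800
print(f"{savings:.1%}")       # 92.5%
```

Under those assumptions, each direction of a conversation averages under 5 Kbps of voice payload, compared with the 64 Kbps a circuit-switched call ties up.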
So much for pure VoATM, although I should point out that there's
not much of it. The incumbent carriers generally have plenty of
capacity in their legacy circuit-switched networks, and it's very
inexpensive to add capacity as needed. The cost of circuit switches
has come way down over the last few years, in the face of competition
from ATM, Frame Relay and IP.
So circuit-switching generally remains cost-effective, even though
it's inefficient. That's especially so when you consider that
ATM requires a forklift upgrade (i.e., total replacement), and
that's very expensive. Now, VoATM would make a lot of sense in
a private enterprise ATM network provided, of course, by a public
carrier. But there aren't a lot of those around either -- not
at the moment, at least.
Voice over Packet over ATM
For the most part, ATM is positioned in the backbone of data
networks. That's where the growth is, and that's where ATM really
shines. Once so positioned, it's a relatively simple matter to
support not only Frame Relay, but also VoFR, and with a higher
-- and guaranteed -- QoS level. Also, once so positioned, it's
a relatively simple matter to support not only IP, but also VoIP,
and with a higher -- and guaranteed -- QoS level. For that matter,
once so positioned, a service provider can support them all, and
a lot more, all simultaneously and all with guaranteed QoS levels.
Only ATM can do that.