Voice over IP (VoIP) Tutorial
CommWeb.com
07/09/01
To understand Voice over IP (VoIP), it helps to understand the
TCP/IP protocol suite first. For those of you who didn't read the
previous lesson, I humbly recommend that you go back to TCP/IP
Essentials. Heck, even if you were here, it wouldn't hurt to
review it before we get started.
Now that we're all together again, let's examine what is perhaps
the wackiest idea yet for voice networking. In some ways it's
not as wacky as Voice over Frame Relay (VoFR), which we covered
in a previous lesson, but it's wacky, alright.
Let's Review Some Basics
You'll recall that the TCP/IP protocol suite was conceived and
developed as a means of gluing together disparate data networks,
independent of host computers, hardware and operating systems,
transmission media, and data link technologies. TCP/IP was originally
developed for the ARPANET, a network that linked together government
agencies and institutions of higher learning with a small group
of supercomputers used for various advanced research and development
projects.
It was a time-share application that involved interactive data
communications between asynchronous terminal devices and host computers.
Since connectivity based on either circuit switching or dedicated
leased lines was way too inefficient and expensive, TCP/IP took
a packet-switched approach.
As we discussed in a previous lesson, Circuits, Packets, Frames
and Cells, packet networks are
highly shared data networks that always involve some degree of
variability and unpredictability in terms of levels of latency
(i.e., delay), jitter (i.e., variability in latency) and loss.
Some applications can tolerate considerable levels of such problems,
since there is time to adjust and recover, perhaps through retransmission.
E-mail is a good example, as is a file transfer, perhaps associated
with a database backup. Some applications don't tolerate much,
if any, of this sort of thing. Realtime voice and video are good
examples. It appears to be quite clear that voice over an IP-based
packet network doesn't make sense. So let's call it quits for
the day. I'll see you at the next class.
Whoa! Not so fast! VoIP can be made to work, and to work quite
well.
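Before we move on, it may help to see latency, jitter and loss as
numbers. What follows is a minimal Python sketch, not a measurement
tool. It assumes we know the send time and arrival time (in
milliseconds) of each packet, keyed by a sequence number. The name
path_stats and the sample timestamps are made up for illustration,
and jitter is taken here as the average change in delay between
consecutive delivered packets, which is only one of several
definitions in use.

```python
from statistics import mean

def path_stats(sent, received):
    """sent: {seq: send_time_ms}; received: {seq: arrival_time_ms}."""
    # One-way delay of every packet that actually arrived, in send order.
    delays = [received[seq] - sent[seq] for seq in sorted(sent) if seq in received]
    latency = mean(delays)
    # Jitter: mean absolute change in delay between consecutive arrivals.
    jitter = mean([abs(b - a) for a, b in zip(delays, delays[1:])])
    # Loss: fraction of packets sent that never arrived.
    loss = 1 - len(delays) / len(sent)
    return latency, jitter, loss

# Hypothetical trace: one packet every 10 ms, packet 3 never arrives.
sent = {0: 0, 1: 10, 2: 20, 3: 30, 4: 40}
received = {0: 50, 1: 65, 2: 70, 4: 100}
latency_ms, jitter_ms, loss_fraction = path_stats(sent, received)
print(latency_ms, jitter_ms, loss_fraction)
```

An e-mail transfer would shrug at numbers like these; a realtime
voice stream would not.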
Voice The Conventional Way
Before we get into the specifics of VoIP, let's review the basics
of voice communications as it is handled in the conventional PSTN
(Public Switched Telephone Network). You will recall that voice
is analog in its native form and that the PSTN also was entirely
analog for the first 75 years or so.
In networks worldwide, the analog PSTN provided for each voice
conversation to be carried in a 4-KHz channel. (Note: Hz is an
abbreviation for Hertz, the unit of frequency; one Hertz is one
complete waveform cycle per second. In a voice application, the
signal starts out as an audio compression wave, which then is
converted into an electrical wave. All electromagnetic energy
travels in waveform.)
In fundamental terms, that means a channel runs between 0 KHz
and 4 KHz. In a multichannel analog carrier system, one channel
might run at 0-4 KHz, the next at 4-8 KHz, the next at 8-12 KHz,
and so on. Therefore, each voice-grade channel supports a range
of frequencies that is 4 KHz wide. That's not enough for perfect
voice transmission (we are capable of creating audio well above
4 KHz), but it's good enough.
Further, each channel supports a range of signal amplitude (i.e.,
signal strength) that relates to a volume level. The amplitude
level also is limited, so your loudest screams can't quite be
heard over the network, but that's probably just as well. Again,
it's not enough for perfect voice transmission, but it's good
enough. It's known as toll quality voice.
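The channel plan described above for a multichannel analog carrier
system is simple enough to write down. This is only a sketch of the
arithmetic; the function name channel_band is made up, and real
carrier systems place their channels at other frequencies.

```python
CHANNEL_WIDTH_KHZ = 4  # width of one voice-grade channel

def channel_band(index):
    """Return the (low, high) edges in KHz of the zero-based channel."""
    low = index * CHANNEL_WIDTH_KHZ
    return low, low + CHANNEL_WIDTH_KHZ

# The first three channels, stacked one above the other in frequency.
bands = [channel_band(i) for i in range(3)]
print(bands)
```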
Around the end of WWII (World War II, for you youngsters), the
networks began the transition from analog to digital technology.
Digital offers a lot of advantages, including greater bandwidth,
better error performance, and enhanced management and control.
Virtually all contemporary switches of all types are digital
in nature, and so is a lot of terminal equipment. Most transmission
facilities also are digital, with the notable exception of most
copper local loops serving residential and small business applications.
That makes the contemporary WAN virtually 100% digital, from edge to edge,
at least in developed countries.
To support voice in its native analog form over a digital network,
the analog signal has to be coded (i.e., converted) into a digital
format at some point after leaving your lips and prior to entering
the WAN. On the receiving end, the digital signal has to be decoded
(i.e., reconverted) back into an analog format in order to be
intelligible to the human ear.
Those conversion processes are accomplished by a matching pair
of codecs (coder/decoders), with the traditional method being
PCM (Pulse Code Modulation), standardized by the ITU-T as G.711.
PCM is based on the Nyquist Theorem, developed by Harry Nyquist
of Bell Telephone Laboratories in 1928.
The theorem (paraphrased) states that, in order to convert analog
voice to a digital format, send it over a digital circuit, and
reproduce high-quality analog voice at the receiving end, one
must sample the amplitude of the analog waveform at a rate of at
least twice the highest frequency on the line.
If one samples at twice the highest frequency on the line, one
samples, therefore, at a rate of 4,000 x 2 = 8,000 times a second.
(It's necessary to sample only the amplitude. The frequency will
take care of itself at that rate.) If you do the math, you see
that 8,000 samples per second x 8 bits per sample = 64,000 bits
per second, or 64 Kbps. That's a voice-grade digital channel.
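For those who like to check the math, here it is as a few lines of
Python. The constant names are mine; the figures are the ones used
in this lesson (a 4-KHz channel and eight bits per sample).

```python
HIGHEST_FREQUENCY_HZ = 4_000   # top of the voice-grade channel
BITS_PER_SAMPLE = 8            # each amplitude sample becomes one byte

# Sample at twice the highest frequency on the line.
sample_rate_hz = 2 * HIGHEST_FREQUENCY_HZ
# Bits per second needed to carry those samples.
bit_rate_bps = sample_rate_hz * BITS_PER_SAMPLE
# Time between two consecutive samples, in microseconds.
sample_interval_us = 1_000_000 / sample_rate_hz

print(sample_rate_hz, bit_rate_bps, sample_interval_us)
```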
PCM further specifies that the sampling process take place at
precise intervals of 125 microseconds (millionths of a second),
which is exactly 1/8000th of a second. Each sample is coded into
an eight-bit digital value. The resulting eight-bit bytes are
interleaved by multiplexers, and sent across channelized digital
circuits (e.g., T-carrier) to be directed and redirected by circuit
switches, sent across circuits (e.g., SONET) that interconnect
the switches in the network core and ultimately decoded on the
receiving end of the transmission.
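The byte interleaving done by the multiplexers can be sketched in a
few lines of Python. This is a toy model under stated assumptions:
three channels instead of a real carrier's full complement, no
framing or signaling bits in the stream, and function names
(interleave, deinterleave) that I made up. The last lines add the
well-known T1 figures (24 channels plus one framing bit per frame,
8,000 frames a second) to show where its line rate comes from.

```python
def interleave(channels):
    """Byte-interleave equal-length channel streams into one TDM stream."""
    frames = zip(*channels)  # one frame = one byte from every channel
    return bytes(b for frame in frames for b in frame)

def deinterleave(stream, n_channels):
    """Recover each channel by taking every n-th byte of the stream."""
    return [stream[i::n_channels] for i in range(n_channels)]

a, b, c = b"AAAA", b"BBBB", b"CCCC"   # three toy voice channels
line = interleave([a, b, c])
recovered = deinterleave(line, 3)
print(line, recovered)

# T1: 24 eight-bit time slots plus 1 framing bit, 8,000 frames per second.
t1_bits_per_frame = 24 * 8 + 1
t1_rate_bps = t1_bits_per_frame * 8_000
print(t1_rate_bps)
```

Notice that every channel gets its slot in every frame whether or
not anyone is talking, which is the commitment of bandwidth that
makes this approach both so dependable and so inefficient.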
The decoded signal, now in analog form once again, is only an
approximation of the original analog signal, but it's thoroughly
understandable to the human ear. It's not quite that simple, of
course. Timing is critical. The network must be in a position
to accept, switch, transport, and deliver every voice byte precisely
every 125 microseconds. That means that latency (i.e., delay) must
be minimal and jitter (i.e., variability in delay) must be virtually zero.
Notice that we didn't mention loss, because there is none. That
translates into a network based on circuit-switching and channelized
T-carrier or E-carrier and/or SONET. Taken together, this approach
ensures that, once the call is set up, the associated bandwidth
is committed for the entire duration of a circuit-switched call,
absolutely and without question.
You really can't ask for more than this. It's virtually perfect.
It performs beautifully. The unfortunate part of it is that it's
inefficient. It's certainly inefficient for most data applications,
but that's another story. It's also inefficient for voice, if
we consider the fact that voice can be compressed in some interesting
ways; we'll get to that later.
Voice The IP Way
Now that your memory is refreshed about both traditional voice
and TCP/IP, it should be very clear that the two have very little
in common. Voice speaks to circuit-switching, not to packet switching
and routing. Voice speaks to channelized TDM (Time Division Multiplexing),
not to unchannelized statistically multiplexed circuits. Voice
speaks to sound bytes and silence bytes created, accepted, multiplexed,
switched, transported and delivered at the frequent, regular and
precise pace of every 125 microseconds; not to packets that are created whenever
there's some data around, and that move through the network whenever
it's available. Voice speaks to committed bandwidth from call
setup to call teardown, not to bandwidth that's available whenever
it happens to be available.
Voice expects perfection; most data wouldn't even understand
it, much less appreciate it. Packet data thrives over a voice
network, but voice trembles at just the thought of traveling over
a packet data network. VoIP? It just won't work!
Actually, it will work, and quite nicely -- if everything works
just right. The trick is to keep latency, jitter and loss within
limits. While that can be tough in a network that is built around
applications that can tolerate all three, it's possible to pull
a trick or two within the network, and at the endpoints, to make
things work out.
The real key to sending voice over any packet data network is
compression, and VoIP is no exception. Compression offers several
advantages, one of which is the reduction of raw bandwidth required
to support the information transfer. If we follow the trail of
a voice signal over a typical digital network, we will see that
it begins life as an analog signal, which is converted by a codec
(coder/decoder) into a PCM format.
VoIP often makes use of DSPs (Digital Signal Processors, which
are fancy codecs) that not only convert the analog signal into
a digital format but also compress the signal, thereby requiring
less than the 64 Kbps demanded by PCM. The standard approaches
(there also are proprietary techniques) include ADPCM and CS-ACELP,
which I'll briefly explain below. I'll start with a brief explanation
of PCM, for the sake of comparison, and I'll throw in LD-CELP.
PCM (Pulse Code Modulation), the basic coding technique
used in traditional PSTNs worldwide, requires bandwidth of 64
Kbps. The delay imposed by the coding/decoding process is 0.75ms
(milliseconds, or thousandths of a second), which is imperceptible.
(Note: voice communications are bidirectional. Therefore, delay
impacts communications in both directions.) PCM yields what we
have come to know as toll quality voice, which rates a 4.4 MOS
(Mean Opinion Score) on a scale of 1.0-5.0, with 5.0 being perfect.
ADPCM (Adaptive Differential Pulse Code Modulation), a
technique also used in some PSTN networks, offers voice coding
at 40, 32, 24 and 16 Kbps. At the most common implementation rate
of 32 Kbps, the compression rate is 2:1 (2 to 1), which exactly
halves the bandwidth required for voice. ADPCM imposes delay of
1.0ms, which is imperceptible.
At 32 Kbps, ADPCM yields voice quality at a 4.2 MOS, which is
very close to that of PCM. At the more heavily compressed rates of
24 Kbps and 16 Kbps, there is a corresponding drop in quality.
CS-ACELP (Conjugate Structure-Algebraic Code Excited Linear
Prediction) runs at 8 Kbps, a compression rate of 8:1. There
are two versions, which vary in terms of computational complexity,
either of which offers raw voice quality that is similar to that
of ADPCM at 32 Kbps. If everything goes just right, CS-ACELP rates
as high as a 4.2 MOS. Compression delay, however, is in the range
of 10.0ms, which is decidedly perceptible.
LD-CELP (Low Delay-Code Excited Linear Prediction) runs
at 16 Kbps, a compression rate of 4:1. Compression delay is 3.0ms-5.0ms,
and raw voice quality is similar to that of ADPCM at 32 Kbps.
That translates into a MOS of as much as 4.2, assuming that everything
goes just right.
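The compression rates quoted above all fall out of one division
against the 64 Kbps PCM baseline. Here is a small Python sketch of
that comparison. The bit rates are the ones given in this lesson
(with ADPCM at its most common rate of 32 Kbps); the function and
variable names are mine.

```python
PCM_KBPS = 64  # baseline: one voice-grade digital channel

codecs = {"PCM": 64, "ADPCM": 32, "LD-CELP": 16, "CS-ACELP": 8}

def compression_ratio(kbps):
    """How many times smaller the stream is than 64 Kbps PCM."""
    return PCM_KBPS / kbps

def calls_per_pcm_channel(kbps):
    """How many such calls fit in the raw bandwidth of one PCM channel."""
    return PCM_KBPS // kbps

ratios = {name: compression_ratio(rate) for name, rate in codecs.items()}
for name, rate in codecs.items():
    print(f"{name:9s} {rate:2d} Kbps  {ratios[name]:.0f}:1  "
          f"{calls_per_pcm_channel(rate)} call(s) per 64 Kbps")
```

Keep in mind that this counts raw voice bits only. Once the voice
is packed into IP packets, the headers claim a share of the
bandwidth as well.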
Regardless of the compression technique employed, the VoIP process
is basically the same, although the specifics vary a bit. Just
as we did with Voice over Frame Relay (VoFR), let's use CS-ACELP
as the compression algorithm and build an example scenario.
On the transmit side, 80 PCM voice samples, representing 10ms
of a voice stream, are gathered together to form a set of voice
data comprising 640 bits. That data set is run through the CS-ACELP
compression algorithm by a codec embedded in a VoIP gateway in
the form of a highly intelligent router, and reduced to 80 bits
(10 octets) to be transmitted, consistent with the 8:1 compression
rate. This is accomplished by consulting a code book that provides
an abbreviated representation of an approximation of the data.
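The frame arithmetic is worth checking by hand, or by Python. The
sketch below assumes only the figures already given: 8,000 samples
a second, eight bits per sample, a 10ms frame, and a coder running
at 8 Kbps. By this arithmetic one 10ms frame compresses to 80 bits
(10 octets), so a 20-octet payload would hold two such frames, or
20ms of speech. The constant names are mine.

```python
SAMPLE_RATE_HZ = 8_000    # PCM samples per second
BITS_PER_SAMPLE = 8       # bits in each PCM sample
FRAME_MS = 10             # speech covered by one frame
CODEC_BPS = 8_000         # CS-ACELP output rate (8 Kbps)

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # PCM samples gathered
pcm_bits = samples_per_frame * BITS_PER_SAMPLE          # bits going in
coded_bits = CODEC_BPS * FRAME_MS // 1000               # bits coming out
coded_octets = coded_bits // 8

print(samples_per_frame, pcm_bits, coded_bits, coded_octets)
print(f"compression {pcm_bits // coded_bits}:1")
```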
The result is packed inside an IP packet, along with a UDP header
for purposes of multiplexing, header error control, and identification
of the application by port number. RTP (Real-time Transport Protocol)
is run for end-to-end delivery services such as payload type identification,
packet sequence numbering, time-stamping, and delivery monitoring.
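To make the packing concrete, here is a hedged Python sketch that
builds the fixed 12-octet RTP header and adds up what goes on the
wire. Several things here are my assumptions rather than part of
the scenario above: the payload is a placeholder of 10 zero octets
(one 10ms frame at 8 Kbps), the payload type is 18 (the static
type assigned to G.729 in the RTP audio/video profile), the SSRC
value is arbitrary, and the IP header is a 20-octet IPv4 header
with no options. The function names are made up.

```python
import struct

RTP_VERSION = 2
PAYLOAD_TYPE_G729 = 18   # assumed static payload type for G.729

def rtp_header(seq, timestamp, ssrc, payload_type=PAYLOAD_TYPE_G729):
    """Build the 12-octet fixed RTP header (no padding, extension or CSRCs)."""
    byte0 = RTP_VERSION << 6        # version in the top two bits
    byte1 = payload_type & 0x7F     # marker bit left clear
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

def wire_rate_bps(payload_octets, packets_per_second):
    """Bits per second on the wire, counting IPv4, UDP and RTP headers."""
    ip_header, udp_header, rtp_fixed = 20, 8, 12
    return (ip_header + udp_header + rtp_fixed + payload_octets) * 8 * packets_per_second

payload = bytes(10)  # placeholder for one compressed 10ms voice frame
packet = rtp_header(seq=1, timestamp=80, ssrc=0x1234ABCD) + payload
rate = wire_rate_bps(len(payload), packets_per_second=100)
print(len(packet), rate)
```

Under these assumptions the headers outweigh the voice four to
one: an 8 Kbps voice stream sent as one frame per packet occupies
40 Kbps at the IP layer, which is one reason gateways often pack
more than one frame into each packet or compress the headers.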
On the receiving end, the process is reversed, and all is well
-- at least as far as compression and decompression are concerned.
The complication, of course, comes from the fact that the packets
are not delivered by the network at the same pace that they entered
it. Additionally, some packets may be lost in transit.
Remember that an IP network is a highly shared packet network
characterized by unpredictable levels of congestion. Latency is
guaranteed, as is variability in delay (i.e., jitter). Loss isn't
guaranteed, but it's highly likely, over time.
Much like VoFR, VoIP adjusts to latency, jitter and loss through
various intelligent continuity algorithms employed by the receiving
codec. These are designed to fill in the voids by stretching the
voice frames received earlier and blending them with those received
later.
This logic is embedded in predictive decompression algorithms
such as CS-ACELP and LD-CELP, which take advantage of the 10ms
delay built into the compression/decompression processes to make
the necessary predictions and do their stretching and blending.
VoIP also makes use of various techniques for echo cancellation,
as echo becomes perceptible when delay exceeds 15ms-20ms.
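The idea of filling a void by blending what came before with what
came after can be shown with a deliberately crude Python sketch.
This is not what CS-ACELP or LD-CELP actually do; those coders
predict from their own speech model. Here I simply average the
samples of the neighboring frames, and the function name conceal,
the frame layout (a list of sample values, with None marking a lost
frame) and the sample numbers are all made up for illustration.

```python
def conceal(frames, frame_len):
    """Replace each lost frame (None) with a blend of its neighbors."""
    silence = [0.0] * frame_len
    out = []
    for i, frame in enumerate(frames):
        if frame is None:
            # Frame received (or already repaired) just before the gap.
            earlier = out[-1] if out else silence
            # Next frame that did arrive; fall back to repeating 'earlier'.
            later = next((f for f in frames[i + 1:] if f is not None), earlier)
            frame = [(a + b) / 2 for a, b in zip(earlier, later)]
        out.append(frame)
    return out

received = [[2.0, 4.0], None, [6.0, 8.0]]   # the middle frame was lost
repaired = conceal(received, frame_len=2)
print(repaired)
```

Note that the receiver has to wait for the later frame before it
can blend, which is why this kind of repair depends on a little
delay being built into the process.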
The actual end-to-end process is a little more involved, of course.
In an enterprise-wide VoIP application, both the calling and called
parties sit behind a PBX. The caller in San Francisco, for example,
picks up the phone and dials the extension number of a co-worker
in New York. The PBX checks its options for routing the call,
courtesy of its LCR (Least Cost Routing) software.
Over a special link, the PBX hands the call off to a VoIP service
provider. If the VoIP gateway has the sense that the call can
be supported with acceptable QoS, the call is accepted, and the
above scenario plays out. The gateway compresses the voice data,
packs it into a packet every 10ms, and off we go.
If network congestion levels remain low during the course of
the call, conversation quality remains pretty good, but never
quite as good as it is over the good old PSTN. If network congestion
levels increase, so do latency and jitter, and packet loss may
result. Voice quality suffers, and fond memories of the PSTN haunt
the balance of the conversation.
If, on the other hand, the ingress gateway has the sense that
current congestion levels are such that the quality of the call
is likely to be compromised, the call may be routed over the conventional
PSTN. Not all service providers, of course, offer the PSTN backup
option, but many do in order to support business-class users.
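The gateway's decision can be sketched as a simple threshold test.
Everything specific in this Python sketch is an assumption of mine:
the class and function names, the idea that the gateway has a
current estimate of latency, jitter and loss for the path, and the
limits themselves (150ms is a commonly cited one-way delay target
for voice; the jitter and loss limits are placeholders). A real
gateway will have its own measurements and its own policy.

```python
from dataclasses import dataclass

@dataclass
class PathEstimate:
    latency_ms: float      # estimated one-way delay
    jitter_ms: float       # estimated variability in delay
    loss_fraction: float   # estimated fraction of packets lost

# Hypothetical limits for acceptable voice quality.
MAX_LATENCY_MS = 150.0
MAX_JITTER_MS = 30.0
MAX_LOSS_FRACTION = 0.01

def route_call(estimate, pstn_backup=True):
    """Return 'ip', 'pstn' or 'reject' for a new call."""
    acceptable = (estimate.latency_ms <= MAX_LATENCY_MS
                  and estimate.jitter_ms <= MAX_JITTER_MS
                  and estimate.loss_fraction <= MAX_LOSS_FRACTION)
    if acceptable:
        return "ip"
    # Congested: fall back to the PSTN if the provider offers that option.
    return "pstn" if pstn_backup else "reject"

print(route_call(PathEstimate(40.0, 5.0, 0.0)))
print(route_call(PathEstimate(300.0, 45.0, 0.05)))
```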
So What -- and Why?
So, voice data can be compressed in order to use shared bandwidth
more efficiently; this can be done with little loss in voice quality,
if everything goes just right.
Further, the decompression process can be sophisticated enough
to smooth out some of the problems associated with latency, jitter,
and loss of voice data over a packet data network -- within limits.
At this point, you have to ask yourself why in the world you
would go to all of this trouble to run voice over a packet data
network when the end result is uncertain quality that will never
be as good as the PSTN, and which can be terrible when the network
suffers congestion.
There are several answers, one of which is cost. VoIP is very
inexpensive, at least in comparison to the PSTN. However, you
still have to wonder if VoIP is worth the trouble in the face
of PSTN voice calling at rates in the range of $0.04-$0.06 per
minute. At rates perhaps as low as $0.02-$0.04 per minute, however,
VoIP looks pretty good to the typical business enterprise. At
rates in the general range of $0.04-$0.08 per minute, it looks
real good to the small business and consumer market.
The cost issue takes on real significance in a multinational
enterprise. Calls to Japan or South Africa, for example, may well
be in the range $0.50-$0.60 per minute. VoIP starts to look real
good at these prices. Unfortunately, many countries have declared
VoIP and VoFR illegal, so you're running some real risks in even
attempting such a thing.
In closing, I have to say that I'm not trying to pick a fight
with the proponents of VoIP, any more than I did in the case of
VoFR. (That column, in fact, did elicit comments from a slightly
enraged consultant friend or two.) I'm just calling it the way
I see it.
By the way, the way I see it is that VoIP will be huge, and for
several reasons:
First, it will be inexpensive, even though quality will suffer
a bit.
Second, many quality issues can be overcome by ever faster switches
and routers, and ever faster transmission systems. (Admittedly,
this is something of a brute force attack on a congestion issue,
but bandwidth is pretty cheap these days.)
Third, VoIP offers tremendous advantages in terms of the fact
that voice and data can be integrated over the same IP-based network,
and through the same terminals in the form of integrated voice/data
client workstations.
If you have any doubt about the integrated workstation thing
(which so far has been very slow to develop), note that Microsoft
announced VoIP support in its recent release of Windows XP. Imagine
being able to exercise complete call control through your PC,
share files with colleagues while discussing the contents in real-time,
and in a truly collaborative mode over the same network at very
low cost.
Imagine being able to negotiate with a call center agent to purchase
an airline ticket or an automobile or a piece of furniture or
clothing, all while looking at the same information on the Website,
and all over the Internet, or other IP-based network. I repeat
that this is going to be huge. There is very little doubt about
that.
There, of course, is a lot more to VoIP than we've had a chance
to discuss here. However, I'm out of my allotment of server memory,
electrons, photons and such.