Audio Processing

Sound is an important aspect of nature. For us, it is a major source of information and entertainment. Words form an important part of auditory signals, but there is a lot more to sound than just words. Our mind is receptive to many kinds of sounds and can use them to draw important conclusions. In order to understand and use this aspect of human intelligence, we must start by understanding what sound is.

What is Sound?

Sound is defined as an "oscillation in pressure, stress, particle displacement, particle velocity, etc., propagated in a medium with internal forces (e.g., elastic or viscous), or the superposition of such propagated oscillation or an auditory sensation evoked by these oscillations." Sound can propagate through media such as air, water and solids as longitudinal waves, and also as transverse waves in solids.

Sound can be viewed as a wave motion in air or other elastic media. In this case, sound is a stimulus. Sound can also be viewed as an excitation of the hearing mechanism that results in the perception of sound. In this case, sound is a sensation.

The waveform of oscillations caused by sound waves is quite close to simple harmonic motion. In the absence of any decay, the amplitude and frequency of the sound remain unchanged with time, and the sound travels at a constant velocity specific to the medium. Scientists have calculated, measured and documented the speed of sound in different types of media. In air at room temperature, it is about 343 meters per second.


Any simple harmonic motion has a defined frequency and amplitude. The frequency of sound refers to the frequency of the oscillations that it creates in the medium. It is measured in hertz (Hz). Any common sound that we hear is not just one frequency but a mixture of various frequencies, often multiples of a base frequency. This results in a non-sinusoidal waveform.
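The mixing of a base frequency with its multiples can be sketched in a few lines of Python. This is only an illustration: the function name `harmonic_wave`, the 200 Hz base frequency, the 8000 Hz sample rate and the harmonic amplitudes are all assumed values chosen for the example, not something prescribed by the text.

```python
import math

def harmonic_wave(base_hz, amplitudes, sample_rate=8000, duration_s=0.01):
    """Sum sine waves at integer multiples of base_hz.

    amplitudes[k] is the amplitude of harmonic k + 1, so amplitudes[0]
    belongs to the base frequency itself.
    """
    n_samples = round(sample_rate * duration_s)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate  # time of this sample in seconds
        value = sum(
            amp * math.sin(2 * math.pi * base_hz * (k + 1) * t)
            for k, amp in enumerate(amplitudes)
        )
        samples.append(value)
    return samples

# A single frequency gives a pure sine wave.
pure = harmonic_wave(200, [1.0])
# Adding the 2nd and 3rd harmonics changes the shape of the waveform,
# although it still repeats 200 times per second.
mixed = harmonic_wave(200, [1.0, 0.5, 0.25])
```

Plotting `pure` and `mixed` side by side would show the same period but a visibly different, non-sinusoidal shape for the mixed wave.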

Sound passes across media through a phenomenon called resonance. Each object has a defined resonance frequency and a range of frequencies that it can carry. When it is influenced by sound waves in this range, it picks up that oscillation and passes it forward. Since the resonance frequency of each carrier is different from that of others, each carrier object enhances a subset of the frequencies, thus altering the net waveform. Typically, the higher the frequency of the sound, the more susceptible it is to decay in the medium.


The amplitude refers to the extent of oscillation caused by the sound waves. It is an indicator of the amount of energy pushed into the sound wave. As we noted above, any sound wave is a combination of different frequencies. Each of these individual waves has a different amplitude. The net sound pressure level is measured on a logarithmic scale, from the ratio of the sound pressure in the wave (typically its root-mean-square value) to a reference sound pressure in the carrier medium. It is denoted in decibels (dB).

    L_p = 20 \log_{10}(p / p_{ref})
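As a rough worked example of this formula, the sketch below computes the level for a few pressures. It assumes the commonly used reference pressure for air, 20 micropascals; the function name `sound_pressure_level` is made up for this illustration.

```python
import math

# Assumption: the usual reference pressure for sound in air, 20 micropascals.
P_REF_AIR = 20e-6  # pascals

def sound_pressure_level(p, p_ref=P_REF_AIR):
    """Return the sound pressure level in dB for a pressure p in pascals."""
    return 20 * math.log10(p / p_ref)

level_at_reference = sound_pressure_level(20e-6)  # pressure equal to the reference
level_at_one_pascal = sound_pressure_level(1.0)
gain_from_doubling = sound_pressure_level(2.0) - sound_pressure_level(1.0)
```

A pressure equal to the reference gives 0 dB, and because the scale is logarithmic, doubling the pressure always adds the same amount (about 6 dB) regardless of the starting level.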

Perception of Sound

That was the physical aspect of sound. In order to understand human perception of this sound, we need to understand some physiological aspects of our auditory system. The human ear is receptive to a large range of sound frequencies, from 20Hz to 20000Hz. Anything below 20Hz is called infrasonic, and anything above 20000Hz is called ultrasonic. The upper limit decreases with age. This is the frequency range that the ear responds to and passes on to the brain via the nervous system. Of course, these are not hard limits. The responsiveness reduces gradually on either side. Thus, the response to 19kHz is much weaker than the response to 5kHz.


Pitch is how "low" or "high" we perceive a sound to be. In a typical sound signal that is composed of several harmonic frequencies, it relates to the lowest frequency in the sound (called the fundamental harmonic). Pitch perception may vary based on the frequency response of our ear. Individuals may identify different pitches for the same sound, based on their response to the particular sound patterns. For example, white noise (random noise spread evenly across all frequencies) sounds higher in pitch than pink noise (random noise spread evenly across octaves), as white noise has more high frequency content. Selection of a particular pitch is determined by pre-conscious examination of vibrations, including their frequencies and the balance between them. Pitch is continuous (though musicians refer to it in steps). Any sound can be placed on a pitch continuum from low to high.


Duration is the perception of how "long" or "short" a sound is. It relates to onset and offset signals created by nerve responses to sounds. The perceived duration is the time from when the sound is first noticed until it is identified as having ceased or changed. This is not necessarily related to the physical duration of a sound. For example, in a noisy environment, we "learn" to ignore the noise and listen to the other sounds.


Loudness is how "loud" or "soft" a sound is perceived to be. It relates to the amount of auditory nerve stimulation over short cyclic time periods. This means that a very short sound can seem softer than a longer sound even though they are presented at the same intensity level. Past around 200 ms this is no longer the case, and the duration of the sound no longer affects its apparent loudness. Louder signals create a greater 'push' on the basilar membrane and thus stimulate more nerves, creating a stronger loudness signal. A more complex signal also creates more nerve firings and so sounds louder (for the same wave amplitude) than a simpler sound. Also, some waveforms seem louder than others because of their composition. For example, a shrill sound seems louder than a melody.


Timbre, also known as tone color or tone quality, is the perceived quality of the sound. Timbre distinguishes different types of sound production, such as choir voices and musical instruments (string instruments, wind instruments, and percussion instruments). It also enables listeners to distinguish between sound sources.

The physical characteristics of sound that determine the perception of timbre include the spectrum and the envelope of the waveform. In simple terms, timbre is what makes one musical sound different from another, even when they have the same pitch and loudness. For instance, it is the difference in sound between a guitar and a piano playing the same note at the same volume. Both instruments can be equally in tune with each other as they play the same note, and yet, even at the same amplitude level, each instrument will still sound distinctive, with its own unique tone color.


Timbre relates to the sound from one single source. But most often we concurrently hear multiple sounds from multiple sources. All of them present a combined effect. This defines the texture of the sound.

Even if the individual sounds are very pleasing to our ears, the combination may not be pleasing if the texture is not good.

Spatial Location

Based on what we hear, the mind makes a reasonable estimate of the direction of the source. The distance of the sound source from each ear is slightly different. Hence there is a small time gap between the reception at the two ears. Also, the kind of echo the sound generates in the ear depends upon the direction of the source. Based on these two cues, the mind estimates the direction and distance of the source of the sound.
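The size of the gap between the two ears can be estimated from simple geometry. The sketch below is only an illustration under stated assumptions: the ears are treated as two points 0.18 m apart on the x-axis (a hypothetical head width), the head itself is ignored, and the speed of sound in air is taken as 343 m/s. The name `arrival_gap` is made up for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air
EAR_SPACING = 0.18      # metres; hypothetical distance between the ears

def arrival_gap(source_x, source_y, ear_spacing=EAR_SPACING):
    """Return (left arrival time - right arrival time) in seconds.

    The ears sit at (-ear_spacing / 2, 0) and (+ear_spacing / 2, 0).
    A positive result means the sound reaches the right ear first.
    """
    left_distance = math.hypot(source_x + ear_spacing / 2, source_y)
    right_distance = math.hypot(source_x - ear_spacing / 2, source_y)
    return (left_distance - right_distance) / SPEED_OF_SOUND

gap_ahead = arrival_gap(0.0, 2.0)   # source straight ahead: no gap
gap_right = arrival_gap(2.0, 0.0)   # source directly to the right
gap_left = arrival_gap(-2.0, 0.0)   # source directly to the left
```

Even for a source directly to one side, the gap in this model is only about half a millisecond, which gives a sense of how fine the timing cues are.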

Software Model

Having understood the physical and mental model of the sound, let us look into the way we deal with sound in the technical world. In very simple words, the microphone contains a diaphragm that vibrates when sound vibrations are incident onto it. The diaphragm has a magnet attached to it - surrounded by a copper coil. As the diaphragm moves, the magnet moves with respect to the copper coil. This generates electrical vibrations in the copper coil - as per Faraday's laws. This electrical signal is what represents the sound waves.

Such a raw signal has a high voltage when the air pressure is high and a low voltage when the pressure is low. When we sample these values at regular intervals through a device driver and convert each sample to a binary number, we get the sound as PCM (Pulse Code Modulation) data. When this is packaged with appropriate headers and scaled appropriately, we get data that can be stored as a .wav file.
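The path from samples to a .wav file can be sketched with Python's standard `wave` module. This is a minimal illustration, not a recording pipeline: instead of a microphone, it samples a synthetic 440 Hz sine wave, and the sample rate (8000 Hz), the 16-bit sample width and the helper names are assumptions chosen for the example.

```python
import io
import math
import struct
import wave

def sine_to_pcm16(freq_hz, sample_rate, n_samples, amplitude=0.5):
    """Sample a sine wave and quantize each sample to a signed 16-bit integer."""
    pcm = []
    for i in range(n_samples):
        value = amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        pcm.append(int(round(value * 32767)))  # scale [-1, 1] to the 16-bit range
    return pcm

def pcm16_to_wav_bytes(pcm, sample_rate):
    """Wrap mono 16-bit PCM samples in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
    return buffer.getvalue()

pcm = sine_to_pcm16(440, 8000, 8000)       # one second of audio
wav_bytes = pcm16_to_wav_bytes(pcm, 8000)
```

Writing `wav_bytes` to a file with a `.wav` extension should produce something an ordinary audio player can open.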

There is a lot of redundant information in this wav file that can be eliminated to generate a compressed file. Audio compression has been a major research domain that has led to many different formats for compressed files. The mp3 and wma formats are among the common ones.

A large share of freely available audio files are encoded in mp3 format. When we work on such an audio file, we first need to decode it into an array of numbers representing the PCM samples. There is no dearth of libraries that help us do this job.

Time & Frequency Domain

As per the theory of Fourier series, any periodic signal can be expressed as a sum of sine and cosine functions. For example, if f(x) is a periodic function with period 2π, it can be proved that

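    f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n \cos(nx) + \sum_{n=1}^{\infty} b_n \sin(nx)

where

    a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(nx) \, dx
    b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(nx) \, dx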
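These coefficient integrals can be checked numerically. The sketch below takes a_n as the coefficient of cos(nx) and b_n as the coefficient of sin(nx), and approximates both integrals over [-π, π] with the midpoint rule. The square wave used as the test signal and the helper names are assumptions made for this illustration.

```python
import math

def fourier_coefficients(f, n, steps=20000):
    """Approximate (a_n, b_n) of a 2*pi-periodic f with the midpoint rule."""
    dx = 2 * math.pi / steps
    a_n = 0.0
    b_n = 0.0
    for k in range(steps):
        x = -math.pi + (k + 0.5) * dx  # midpoint of the k-th slice
        a_n += f(x) * math.cos(n * x) * dx
        b_n += f(x) * math.sin(n * x) * dx
    return a_n / math.pi, b_n / math.pi

def square_wave(x):
    """A square wave that is +1 where sin(x) >= 0 and -1 elsewhere."""
    return 1.0 if math.sin(x) >= 0 else -1.0

a1, b1 = fourier_coefficients(square_wave, 1)
a2, b2 = fourier_coefficients(square_wave, 2)
a3, b3 = fourier_coefficients(square_wave, 3)
```

For this square wave the cosine coefficients vanish and only the odd sine terms survive, with b_n close to 4/(nπ). In other words, the square wave is built from a base frequency and its odd multiples.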

With this in place, any audio signal can be represented in time domain as a sequence of numbers representing the strength of the pulse at that point. In the frequency domain, it can be represented as a two dimensional array of the amplitude related to each frequency at the given time.

This calculation may seem pretty complex, and one might wonder what the point is in going through all this, only to increase the size of the data. But the Fourier transform removes a lot of complication in processing the signal. Also, researchers have developed efficient algorithms for doing this job for us. The Fast Fourier Transform (FFT) computes the transform of N samples in O(N log N) operations, and optimized implementations are widely available for both CPUs and GPUs.
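To make the idea concrete, here is a small pure-Python sketch of a recursive radix-2 FFT (the Cooley-Tukey scheme), used to find the frequencies present in a synthetic signal. It is meant for illustration only; in practice one would normally call an optimized library routine instead. The sample rate, signal length and the two test frequencies (50 Hz and 120 Hz) are made-up values.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of the even-indexed samples
    odd = fft(samples[1::2])   # transform of the odd-indexed samples
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result

SAMPLE_RATE = 1024  # hypothetical sample rate in Hz
N = 1024            # one second of samples, so bin k corresponds to k Hz

# Time domain: a 50 Hz tone plus a weaker 120 Hz tone.
signal = [
    math.sin(2 * math.pi * 50 * i / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 120 * i / SAMPLE_RATE)
    for i in range(N)
]

# Frequency domain: amplitude per bin, up to half the sample rate.
spectrum = fft(signal)
magnitudes = [abs(c) / (N / 2) for c in spectrum[: N // 2]]
peaks = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)[:2]
```

The two largest entries of `magnitudes` sit at bins 50 and 120, with amplitudes close to 1.0 and 0.5, which are exactly the components that were mixed into the time-domain signal.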

We know that frequencies above 20kHz are inaudible to the human ear, so they can be ignored, thus reducing the load. It is also a lot easier to model human perception in the frequency domain, which is why we prefer to deal with sound there.
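As a small illustration of discarding the inaudible part of a spectrum, the sketch below lists which frequency bins of a transform fall inside the audible range. It assumes that bin k of an N-point transform corresponds to the frequency k × sample_rate / N; the function name `audible_bins`, the 96 kHz sample rate and the 4096-point transform size are hypothetical choices for the example.

```python
def audible_bins(n_fft, sample_rate, low_hz=20.0, high_hz=20000.0):
    """Return the bin indices (up to half the sample rate) that are audible."""
    bin_width = sample_rate / n_fft  # frequency spacing between adjacent bins
    return [
        k for k in range(n_fft // 2 + 1)
        if low_hz <= k * bin_width <= high_hz
    ]

# With a 96 kHz sample rate, bins run from 0 Hz up to 48 kHz.
bins = audible_bins(n_fft=4096, sample_rate=96000)
total_bins = 4096 // 2 + 1
```

Here only 853 of the 2049 bins carry audible content, so more than half of the frequency-domain data can be dropped before any further processing.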