HMM-based song-type classification and individual identification of ortolan bunting (Emberiza hortulana L.)

Kuntoro Adi and Marek Trawicki

This research presents a method for automatic song-type classification and individual identification of the ortolan bunting (Emberiza hortulana L.). The method is based on Hidden Markov Models (HMMs), commonly used in the signal processing and automatic speech recognition research communities. The features used for classification and identification include both fundamental frequency and spectral characteristics, with the spectral features derived from frequency-weighted cepstral coefficients. Using these features, one HMM is trained for each type of vocalization, both for individual birds and across the entire population. Preliminary results indicate accuracies above 90% for both the song-type classification and individual identification tasks.

Introduction

Many bird studies require identification of bird vocalizations. Most of these studies are based on manual inspection and labeling of sound spectrograms, sometimes over large corpora, making them extremely labor intensive. Manual inspection of multiple vocalizations is also prone to error.

Automated classification based on well-defined acoustic features would improve the quality of such measurements. Recent progress in automatic speech recognition suggests that reliable automated recognition of bird vocalizations is within reach.

Objective

This research describes a method for automatic song-type classification and individual identification of the ortolan bunting (Emberiza hortulana L.). The method is based on Hidden Markov Models (HMMs), commonly used in the signal processing and automatic speech recognition research communities.

Song analysis

The subjects of this study are ortolan buntings (Osiejuk, 2003). The ortolan bunting (Emberiza hortulana) is an age-limited learner with a relatively simple song and a small repertoire size (typically 2-3 song types per bird). Songs of ortolan buntings are described in terms of their syllables, song-types, and song-variants. In total, there are 63 different song types and 234 different song variants, composed of 20 different syllables.


Figure 1. Norwegian ortolan bunting

a. Syllable

A syllable is the minimal unit of song production. A song is described using letter notation, e.g. aaaabb or hhhhuff, where the letters denote particular syllables. Syllables of the same category have the same shape on sonograms, but they may differ in length and frequency between individuals.


Figure 2. Syllable types of ortolan bunting (Osiejuk, 2003)

b. Song-type

A song-type denotes a group of songs that consist of the same syllables arranged in the same order.


Figure 3. Song-type AB, CD and EB

c. Song-variant

Song-variants are songs of the same type that differ only in the number of syllables within the song. For example, song-type gb has the song-variants gggb, ggbbbb, and gggbb. The initial and final syllables may differ slightly in amplitude and frequency, probably because of sound-production mechanisms.


Figure 4. Song-variant AB: aaabb, aaaaab, aaaabb

HMM-based song-type classification and individual identification

a. HMM for song-type classification

Hidden Markov Models
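
A Hidden Markov Model represents a sequence of observations (here, frames of acoustic feature vectors) as being emitted by a system that moves through a set of hidden states. The model is specified by its state-transition probabilities, its per-state output (emission) distributions, and its initial-state distribution; since the features here are continuous vectors, continuous-density emission distributions are the natural choice.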

Markov Property
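
The Markov property states that the probability of the next state depends only on the current state, not on the earlier state history. In the usual HMM notation, with state sequence $q_1, q_2, \ldots$ over states $s_i$:

$$P(q_{t+1} = s_j \mid q_t = s_i, q_{t-1}, \ldots, q_1) = P(q_{t+1} = s_j \mid q_t = s_i) = a_{ij}$$

where $a_{ij}$ is the transition probability from state $s_i$ to state $s_j$.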

b. HMM Scheme


Figure 5. HMM models for song-types

First, an HMM is trained for each song-type using a number of examples of that song-type; here the song-types are ab, cd, eb, gb, and huf. To recognize an unknown song, the likelihood of each model having generated that song is computed, and the most likely model identifies the song-type.


Figure 6. Block diagram of song-type recognizer
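
As an illustration of this train-then-score scheme, the sketch below uses the open-source hmmlearn package. The toolkit, model topology, state count, and training settings of the original system are not given in this section, so every name and parameter in the sketch is an assumption rather than the authors' actual configuration.

```python
# Minimal sketch of one-HMM-per-song-type classification using hmmlearn.
# The state count, covariance type, and iteration limit are assumptions,
# not the settings used in the original experiments.
import numpy as np
from hmmlearn import hmm

SONG_TYPES = ["ab", "cd", "eb", "gb", "huf"]

def train_models(train_data, n_states=5):
    """train_data maps song-type -> list of (T_i, D) feature arrays."""
    models = {}
    for song_type in SONG_TYPES:
        examples = train_data[song_type]
        X = np.vstack(examples)                 # concatenate all examples
        lengths = [len(ex) for ex in examples]  # frame count per example
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)                   # Baum-Welch (EM) training
        models[song_type] = model
    return models

def classify(models, features):
    """Label an unknown song with the highest-likelihood model."""
    return max(models, key=lambda st: models[st].score(features))
```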

c. Results

Song-type classification

The song-type classification experiment is similar to a speech recognition experiment on human speech. Five common ortolan bunting song-types are classified: ab, cd, eb, gb, and huf.

No   Features                                            Accuracy (%)
 1   Pitch                                                      71.20
 2   Cepstral coefficients (MFCC)                               93.60
 3   MFCC + pitch                                               92.00
 4   MFCC + pitch + relative pitch                              92.00
 5   MFCC + pitch + relative pitch + energy                     94.40
 6   MFCC_E_D_A (MFCC + energy + delta + delta-delta)           92.10
 7   MFCC_E_D_A with cepstral variance normalization            96.90

Table 1. Song-type classification accuracy with various feature vectors
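
The MFCC_E_D_A label follows the HTK naming convention: static cepstral coefficients plus energy (E), delta (D), and acceleration/delta-delta (A) coefficients. As a rough illustration of the best-performing feature set (row 7 of Table 1), the sketch below computes MFCCs with deltas and delta-deltas and applies per-recording cepstral mean and variance normalization; librosa is used here as a stand-in front end, and all parameter values are assumptions.

```python
# Sketch of MFCC + delta + delta-delta features with per-recording
# cepstral mean/variance normalization. librosa is an assumed stand-in
# for the original front end; frame and filterbank settings are defaults.
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=None)                 # native rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    d1 = librosa.feature.delta(mfcc)                        # delta
    d2 = librosa.feature.delta(mfcc, order=2)               # delta-delta
    feats = np.vstack([mfcc, d1, d2]).T                     # (T, 3*n_mfcc)
    # Normalize each coefficient to zero mean and unit variance
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
    return feats
```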


Figure 7. Confusion Matrix of 10 Commonly Sung Song-Types

Individual identification

No   Features        Accuracy (%)
 1   MFCC                   94.00
 2   MFCC_O                 93.33
 3   MFCC_O_D               93.33
 4   MFCC_O_D_A             94.00

Table 2. Individual identification accuracy with various feature vectors
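
Individual identification uses the same recognizer structure as song-type classification, with one HMM trained per bird rather than per song-type. A minimal sketch of the decision rule, reusing the hypothetical setup from the earlier sketch:

```python
def identify_bird(bird_models, features):
    """bird_models maps bird ID -> trained HMM; pick the best-scoring bird."""
    return max(bird_models, key=lambda bird: bird_models[bird].score(features))
```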


Figure 8. Speaker Identification (Song-Type Dependent)

Discussion

HMMs have been successfully applied to modeling sequences of spectra in both the song-type recognition and individual identification systems. An HMM can model either a sequence of discrete symbols or a sequence of continuous vectors. However, neither discrete nor continuous HMMs can be applied directly to observations that mix continuous values with discrete symbols, as happens when a fundamental-frequency value exists only in voiced regions. An alternative is the multi-space probability distribution HMM (MSD-HMM), which can handle the unvoiced regions of the vocalization.
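
For reference, in the multi-space formulation (due to Tokuda and colleagues) each observation is drawn from one of several spaces of differing dimensionality, e.g. a one-dimensional F0 value in voiced frames and a zero-dimensional "unvoiced" symbol otherwise. A sketch of the state output probability, with notation assumed from that formulation:

$$b_j(o) = \sum_{g \in S(o)} w_{jg}\, \mathcal{N}_{jg}(V(o))$$

where $S(o)$ is the set of space indices consistent with observation $o$, $V(o)$ is its continuous part, $w_{jg}$ is the weight of space $g$ in state $j$ (with $\sum_g w_{jg} = 1$), and the density on the zero-dimensional unvoiced space is defined to be 1.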

Future work