Comparing models of music perception

Theoretical divergence in cognitive science: comparisons of neuroscientific and computational models of music perception

Music, much like language, is often seen as a uniquely human phenomenon. It appears to be present in human societies across both culture and time, with the oldest musical instruments discovered dating back to at least 30,000 B.C.E. (Lawergren, 1988). Often treated with an almost magical reverence, the question of what music actually is has baffled scholars throughout the centuries. The advent of modern brain imaging and computer technology has offered possible steps toward answering this timeless question. Although contested across different disciplines, recent developments (in neuroscience and computational psychology in particular) have led to the emergence of different theoretical approaches, which will be summarized, compared and contrasted in the subsequent discussion. Two (or three) papers in particular will be of relevance to this discussion: the first is “Toward a neural basis of music perception – a review and updated model” (Koelsch, 2011), and the second (and third) is “Computational models of music perception and cognition”, parts I and II (Purwins et al., 2008a,b).

Paper I

Koelsch (2011) appears to approach music perception from a phylogenetic perspective (although this is not explicitly stated). Throughout the paper, music perception is closely linked to language perception (and production), as well as to music’s role in physiological and social domains. In characterizing the problem, Koelsch (2011) states:

Music perception involves acoustic analysis, auditory memory, auditory scene analysis, processing of interval relations, of musical syntax and semantics and activation of (pre)motor representations of actions. Moreover, music perception potentially elicits emotions, thus giving rise to modulation of emotional effector systems such as the subjective feeling system, the autonomic nervous system, the hormonal, and the immune system.

Although somewhat contested, close links have been drawn between the phylogenetic development of music and language (Wallin et al., 2000), with Koelsch (2011) even going so far as to state that, when perceiving both music and language, the brain actually “treats language as a special case of music” rather than the more intuitive other way round. The paper is broken into ten broad sections dealing with various functional aspects of the neurobiology of music, ranging from simple auditory feature extraction (i.e. the transformation of physical acoustic signals into electrical and neurochemical ones) to hierarchical, complex processing of meaning, and embodied aspects such as vitalization in the autonomic nervous system. Owing to word-count constraints, only a brief summary of the findings will be presented here.

One parallel between language and music is how prosodic information may lay the foundations for language development in infants (Jusczyk, 1999). Auditory feature extraction is presented as both a bottom-up and a top-down process involving the auditory brainstem, which may also be involved in the communication of emotion: Strait et al. (2009) found that musical training affected the decoding of affective vocalizations (e.g. an infant’s cry). Phonemic and musical sounds are both characterized by spectrum and amplitude envelopes (as relevant to the musical concept of timbre), which are of particular relevance to tonal languages (such as Mandarin Chinese).

The primary auditory cortex may play a more fine-grained role in acoustic feature analysis and in the transformation of acoustic features into percepts, leading to (auditory) gestalt formation and echoic memory. Mismatch negativity (MMN) studies (e.g. Näätänen et al., 2001) are one of the primary methods of investigating these areas of music perception. As well as informing the process of (auditory) gestalt formation, these studies have contributed to the investigation of the effects of musical training on the operations of auditory sensory memory, as well as its response properties to musical and speech stimuli. Some MMN studies revealed “larger right-hemispheric responses to chord deviants than to phoneme deviants… and different neural generators of the MMN elicited by chords compared to… phonemes” (Koelsch, 2011). Such operations are relevant for recognising and following objects as well as establishing cognitive representations of the environment. Interval analysis was presented as being closely linked to gestalt formation, with melodic and temporal intervals being processed separately.

In terms of musical intelligence, syntax processing was the central theme. Lerdahl and Jackendoff’s (1999) generative theory of tonal music (GTTM) models major-minor tonal syntax using tree structures, which were noted to parallel those proposed by Patel (2008) for linguistic syntax. Two further models of syntax build upon the GTTM. The tonal pitch space theory (Lerdahl, 2001) uses an algorithm to model tension-resolution patterns from the tree structures. The generative syntax model (GSM; Rohrmeier, 2007) presents four levels of increasing abstraction (surface, scale degree, functional and phrase levels), which allow generative rules of phrase-structure grammar to emerge, building to the recursive derivation of complex sequences. The tree structure “provides a weighting of elements, leading to an objectively derivable deep structure” where “chord functions are a result of their position within the branches of the tree” (Koelsch, 2011).

The neural correlates of music-syntactic processing are mostly indexed by the early right anterior negativity (ERAN). Tillmann et al. (2003) suggested that the pars opercularis of the inferior frontal gyrus (BA44) is involved in this process owing to the hierarchical processing of syntactic information, leading to the speculation that

at least some cognitive operations of music-syntactic and language syntactic processing… overlap, and are shared with the syntactic processing of actions, mathematical formulas and other structures based on long-distance dependencies involving hierarchical organization (phrase structure grammar).

(Koelsch, 2011)

Koelsch (2011) provides many examples (e.g. Slevc et al., 2009) that present strong evidence for shared neural resources between music- and language-syntactic processing: the ERAN elicited by irregular chords interacts with the left anterior negativity (LAN) elicited by morpho-syntactic (linguistic) violations, as well as with the processing of the phrase structure of a sentence.

Meaning in music was presented in three classes (extra-musical, intra-musical and musicogenic meaning). Extra-musical meaning encompasses iconic (e.g. linguistically onomatopoetic) sound associations, indexical meanings (action-related patterns indicating an emotion or intention) and symbolic meanings (culturally established, e.g. a national anthem or wedding march). Juslin and Laukka (2003) found that speech and music encode emotional expression in their acoustic properties in very similar ways. Steinbeis and Koelsch (2008c) showed that social cognition is automatically engaged in music listeners. In both abstract and concrete ways, Koelsch (2011) suggests that the cognitive operations of decoding meaningful information in music listening “can be identical to those that serve semantic processing during language perception”. See Koelsch (2011) for descriptions of intra-musical and musicogenic meaning.

Koelsch (2010) describes music making as an activity that simultaneously engages the “seven C’s” (contact, social cognition, co-pathy, communication, co-ordination of movement, co-operation, social cohesion) leading to the “emotional power of music” and possible “evolutionarily advantageous social aspects” (Koelsch, 2011).

To conclude, Koelsch (2011) calls for the development of a theoretical framework, synthesizing the cognitive aspects of both music and language into what he calls the “music-language continuum” as a result of the pervasive overlap and shared cognitive and neural processes.

Paper II (and III)

The second paper, “Computational models of music perception and cognition”, is broken into part I (the perceptual and cognitive processing chain) and part II (domain-specific music processing) (Purwins et al., 2008a,b). One observation made by this author is that Purwins et al. (2008a,b) at no point present a solid definition of what they mean by music, whilst simultaneously mentioning the musical abilities of animals and claiming that

if we could grasp the universal principles of musical intelligence, we would get an idea of how our music understanding gets refined and adapted to a particular music style as a result of a developmental process triggered by stimuli of… musical culture

(Purwins et al., 2008a)

for purposes related to “music information retrieval”, “education” and “emotional engineering” (Purwins et al., 2008b). This may be due to a rationalist, empirical, materialistic approach and reluctance of the authors to adopt the musicological terminology of the “global music monoculture” situated primarily in “19th century European music” (Purwins et al., 2008a). Whilst maintaining that the socio-cultural phenomenon of music is inhomogeneous, the authors adopt the perspective of empirical, systematic, computational, and cognitive musicology. This

move(s) away from the classic notion of music as an art-form to be interpreted historically or aesthetically, towards music as a phenomenon that is to be primarily studied by measurement and modelling, as in any empirical science.

(Purwins et al., 2008b)

Part I primarily deals with the methodologies used in developing a viable cognitive representation of music, which should “be evaluated by its predictive and generalizable power, its simplicity and its relation to existing theories” (Purwins et al., 2008a). As with Koelsch (2011), the first step is to model the transduction of physical acoustic signals, here into digital data. This is primarily done using Hebbian learning mechanisms, where algorithms such as the self-organizing feature map mimic the auditory principle of tonotopy, beginning with the hair cells of the inner ear. Auditory discrimination may be modelled using the Weber-Fechner law (Roederer, 1995; Hirsh et al., 1990), with spectral, temporal and spectrotemporal approaches treating tones as approximately periodic signals. Sounds may be grouped based on proximity (distances between onsets, pitch, loudness), similarity (timbre), closure (gestalt) and common fate (frequency components grouped together “when similar changes occur synchronously” (Purwins et al., 2008a)).

One possible explanation for the “human fascination with music” may lie in the challenge of identifying sound sources in music, where there is not a “one-to-one correspondence between sounds and sound sources” as “a single instrument can play several tones simultaneously” (Purwins et al., 2008a), although this is speculative. Two solutions to this “binding problem” are presented. The first is hierarchical organization (e.g. context-free grammars and integration by anatomical convergence, although the latter is biologically implausible). The second is neural synchronization, for which Engel (1997) and Eggermont (2000) provide some evidence, though it remains controversial. In neural synchronization, Hebbian learning and oscillator models may mimic the principle of proximity.

The separation of sound sources may be modelled through computational auditory scene analysis (CASA) and acoustical source separation, approaches which either mimic aspects of the auditory system or “use digital signal processing without reference to biology” (Purwins et al., 2008a).
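
The Weber-Fechner law mentioned above can be illustrated with a minimal sketch. It assumes the common textbook form of the law, in which perceived magnitude grows with the logarithm of stimulus intensity and the smallest detectable change is proportional to the stimulus. The function names, the constant `k`, the reference intensity and the Weber fraction are hypothetical values chosen for illustration; none of them is taken from Purwins et al. (2008a).

```python
import math

def perceived_magnitude(intensity, reference=1.0, k=1.0):
    """Fechner's form: sensation grows with the log of the stimulus ratio."""
    if intensity <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return k * math.log(intensity / reference)

def just_noticeable_difference(intensity, weber_fraction=0.1):
    """Weber's form: the smallest detectable change is proportional to the stimulus."""
    return weber_fraction * intensity

# Doubling the stimulus adds the same perceptual step at any starting level.
step_low = perceived_magnitude(2.0) - perceived_magnitude(1.0)
step_high = perceived_magnitude(200.0) - perceived_magnitude(100.0)
```

Under these assumptions, `step_low` and `step_high` are equal (up to rounding), which is the sense in which equal stimulus ratios are heard as equal perceptual differences.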

In terms of memory and attention, Deliège (1987, 1989) proposes “the operation of a cue abstraction mechanism” which may reduce the load on memory storage. Wrigley and Brown (2004) extend the oscillator grid proposed by Wang (1996), using attentional leaky integrators (ALIs) to form segmented streams by correlating across channels, but this has only been “validated with artificial stimuli” (Purwins et al., 2008a). Higher-level processing involves long-term memory and schemata, with the definition of schemata presented by Mandler (1978) allowing them to be represented by “frames, scripts and conditional rules” (Purwins et al., 2008a). Categories (e.g. musical scales) appear to be a different form of musical knowledge from schemata, but the authors mention that it is unclear whether these two types of knowledge (as relevant to higher-level processing) are cultural constructs or truly universal (objectively real).

Information theory and expectancy also play a role in higher-level processing. Some of the subjective states and perceptible qualities of music can be compared to entropy and relative entropy, leading to Pearce and Wiggins’s (2004) development of computational models able to recognise the structure of a musical piece “in agreement with the judgement of a human listener”. Other ways of modelling expectation have been proposed by Dubnov et al. (2007), where suffix trees build predictions, and by Hazan et al. (2007), who developed an unsupervised system which creates symbolic expectations of music using information-theoretic computations.
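
The two quantities named above, entropy and relative entropy, can be sketched for a toy case. The example below assumes a melody is reduced to a sequence of pitch names and treats the relative frequencies as a probability distribution; it is not the model of Pearce and Wiggins (2004), and the function names and example melodies are hypothetical.

```python
import math
from collections import Counter

def distribution(symbols):
    """Relative frequency of each symbol (e.g. pitch name) in a sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return {s: c / total for s, c in counts.items()}

def entropy(p):
    """Shannon entropy in bits: average uncertainty about the next symbol."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must cover p's support."""
    return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)

predictable = distribution(["C", "C", "C", "C"])  # one repeated note
varied = distribution(["C", "E", "G", "B"])       # four equally likely notes
```

A repeated note carries no uncertainty (`entropy(predictable)` is zero bits), whereas four equally likely notes carry two bits. `relative_entropy(predictable, varied)` measures how far the first distribution departs from the second, which is one way a listener's expectation and what is actually heard might be compared.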

Part II of the paper (Purwins et al., 2008b) attempts to use the models presented in part I to develop empirical formulations of rhythm, melody and tonality, while noting that these “computational models should be regarded as a starting point for comparing different hypotheses or theories”. The authors differentiate between meter and rhythm using London’s (2006) description: “meter is how you count time, and rhythm is what you count – or what you play while counting”. Inter-onset intervals (IOIs) may be categorized by meter, with rhythm serving to group the sequences of these events. These events are differentiated early in auditory processing, with Phillips-Silver and Trainor (2005) showing how movement, through the motor, vestibular and proprioceptive systems, may influence the perception of meter in infants. Pulse finding, however, appears to be more thoroughly studied than rhythmic grouping, with oscillator, graphical neural network and primal sketch models being used to describe the pulse-finding process.
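
The notion of an inter-onset interval can be made concrete with a short sketch. It assumes note onsets are already available as times in seconds and that the pulse period is known in advance; finding that pulse is the harder problem that the oscillator and related models address. The function names and onset values are hypothetical.

```python
def inter_onset_intervals(onsets):
    """Differences between successive note-onset times (seconds)."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]

def quantize_to_pulse(iois, pulse):
    """Express each interval as the nearest whole number of pulses."""
    return [round(ioi / pulse) for ioi in iois]

onsets = [0.0, 0.5, 1.0, 2.0, 2.5]  # hypothetical onset times in seconds
iois = inter_onset_intervals(onsets)
pattern = quantize_to_pulse(iois, pulse=0.5)
```

Here the raw intervals (what is played) are separated from the pulse grid (how time is counted), which mirrors the distinction between rhythm and meter quoted above.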

In terms of melody, monophony, polyphony and homophony offer distinct varieties of phenomena for study, with expectation playing a key role in the development of such schemata. The implication-realization model (Narmour, 1990, 1992) combines bottom-up and top-down processes and accurately predicts the expectations of children of around five years of age, although self-organising architectures have also shown promise in modelling the “emergence of mental concepts” from simple exposure to musical stimuli (Purwins et al., 2008b).

Tonality is the final concept investigated, and appears to be defined with reference to the major-minor tonality period of Western music (~1685-1857). Probe tone ratings (Krumhansl & Shepard, 1979) and constant-Q (CQ) profiles (Purwins et al., 2000) are presented as the most promising tools for the computational quantification of tonality.
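
One way probe tone ratings can be used computationally is to correlate the pitch-class content of a passage with a rating profile for each candidate key. The sketch below is an illustration of that idea only. It assumes the Krumhansl-Kessler profile values as they are commonly reproduced in the key-finding literature, which come from outside the sources discussed in this essay, and the function names and the example melody are hypothetical.

```python
import math

# Probe-tone profiles as commonly reproduced in the key-finding literature
# (assumption: these values are not taken from the essay's sources).
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def estimate_key(pitches):
    """Return the key whose rotated profile best matches the pitch-class counts."""
    histogram = [0] * 12
    for pitch in pitches:
        histogram[pitch % 12] += 1
    best_score, best_key = None, None
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so that its first entry sits on the candidate tonic.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            score = pearson(histogram, rotated)
            if best_score is None or score > best_score:
                best_score, best_key = score, NAMES[tonic] + " " + mode
    return best_key

# Hypothetical melody as semitones above C: a C major scale plus a falling triad.
melody = [0, 2, 4, 5, 7, 9, 11, 12, 7, 4, 0]
```

Calling `estimate_key(melody)` compares the melody against all 24 major and minor keys. The approach counts notes only and ignores their order and duration, so it is a starting point for comparison rather than a model of how listeners hear tonality.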

The authors argue that their paper should provide insight into the development of a (currently non-existent) holistic approach.

Such an approach starts from the recognition of the inherent embodiment of cognition: rather than being a static feature, cognitive behaviour is the result of a developmental process of continuous interaction of an agent with its environment. In this view, perception is not a passive act of acquiring information but an active process that is an essential element of cognition

(Purwins et al., 2008b)


Koelsch (2011) and Purwins et al. (2008a,b) approach the concept of music perception in rather different manners. The phylogenetically based, linguistically modelled approach of Koelsch (2011) contrasts sharply with the empirical (and perhaps more objective) approach presented by Purwins et al. (2008a,b).

Koelsch (2011) appears to assume that the value of music lies in its (apparent) phylogenetic role as a communicative tool, and his paper would indicate that music has a far greater tangible value to the human species than previously believed, particularly in the context of communication and language development. Without dismissing the aesthetic value of music, Koelsch (2011) manages to present a coherent, non-self-referential argument for the intrinsic value of music perception as the foundational grounding for the development of language (which has a largely accepted intrinsic phylogenetic value). The mode of questioning presented by Koelsch (2011) reflects this embodied social concept of music, from the initial line of questioning centred on the linguistic development of infants through to the vitalizing processes reflected in bodily functions and the communication of meaning through iconic, indexical and symbolic qualities. The finding of Strait et al. (2009) that musical training can increase the ability “to decode the acoustic features of… affective vocalizations” of infants suggests that subcortical processing benefits from such training, which could have serious implications for both healthcare and parenting practice/policy. Extra-musical (music-semantic) meaning, as indexed by the N400 component of the event-related potential, showed no difference between music and language in the classic semantic priming effect. This has implications for healthcare areas such as speech and language therapy and music therapy. The finding that part of Broca’s area (BA44) is involved in both music- and language-syntactic processing (as well as possibly the organization of mathematical formulas and termini) has implications for educational policy and the teaching profession. The possible overlap between late-stage processing of music perception and early stages of premotor function (e.g. action planning) has relevance to rehabilitative healthcare practices.

Possibly the most promising finding, however, is the ability of music to simultaneously engage the “seven C’s” of social activity, revealing music to be an intrinsically social phenomenon. Music engages the subjectivity of an individual human brain and connects it, in an embodied fashion, to the subjectivities of other participants in the music process (be they composers, musicians or listeners). This gives solid grounding for a clear phylogenetic advantage for music, providing a coherent and credible solution to an age-old question with implications for a wide variety of fields of enquiry (e.g. healthcare, education, social sciences, etc.).

Purwins et al. (2008a,b), in contrast, attempt to define music as an empirical, objective phenomenon and, in doing so, appear to neglect the communicative and social aspects of music altogether. Whilst attempting to remove themselves from the conception of music as an aesthetic artform, the computational models they review appear to have their most tangible value in developing artificial intelligences that could potentially create original musical pieces that would be aesthetically pleasing to the human ear. Their goal of conceiving a computational representation of music as relevant to “music information retrieval… education… (and) emotional engineering” (Purwins et al., 2008b) leads to a mode of questioning that objectifies music in a way that seems to characterize it as a disembodied, theoretical, cultural construct (although they do note that cognition is an inherently embodied phenomenon). This mode of questioning relies heavily on the field of musicology, despite their attempts to remove themselves from it in giving an empirical account. While assuming a musicological formulation of music, they criticize it as “predominantly self-referential instead of referring to a world that is outside its own symbols, as is the case with language” (Purwins et al., 2008a), yet they apparently neglect to give an account of the transfer of meaning in music. The computational models presented vary between strictly artificial constructs and more biologically plausible models, and do give tangible computational accounts of the domains of rhythm, melody and tonality. These may be seen as very relevant to the music industry in particular and, to a somewhat lesser extent, to music education.

In conclusion, these two theoretically divergent conceptions of music perception give rise to two very different conceptualizations of what the phenomenon of music actually is. The implications for each conceptualization vary wildly in their potential applications, with each being relevant in their own right, in their own domains. This essay may reveal how seemingly convergent research questions may give rise to two completely divergent conclusions depending on the theoretical perspective of any given researcher. This highlights the critical importance of an awareness of theoretical perspective in the field of cognitive science.

References (APA format)

Deliège, I. (1987). Le parallélisme, support d'une analyse auditive de la musique: Vers un modèle des parcours cognitifs de l'information musicale. Analyse musicale, 6, 73-79.

Deliège, I. (1989). A perceptual approach to contemporary musical forms. Contemporary Music Review, 4(1), 213–230. doi:10.1080/07494468900640301

Dubnov, S., Assayag, G., & Cont, A. (2007). Audio oracle: A new algorithm for fast learning of audio structures.

Eggermont, J. J. (2000). Sound-Induced Synchronization of Neural Activity Between and Within Three Auditory Cortical Areas. Journal of Neurophysiology, 83(5), 2708–2722. doi:10.1152/jn.2000.83.5.2708

Engel, A. (1997). Role of the temporal domain for response selection and perceptual binding. Cerebral Cortex, 7(6), 571–582. doi:10.1093/cercor/7.6.571

Hazan, A., Brossier, P., Marxer, R., & Purwins, H. (2007). What/when causal expectation modelling in monophonic pitched and percussive audio. In NIPS music, brain and cognition workshop. Whistler, CA.

Hirsh, I. J., Monahan, C. B., Grant, K. W., & Singh, P. G. (1990). Studies in auditory timing: 1. Simple patterns. Perception & Psychophysics, 47(3), 215–226. doi:10.3758/bf03204997

Jusczyk, P. (1999). How infants begin to extract words from speech. Trends Cogn. Sci. (Regul. Ed.) 3, 323–328.

Juslin, P., and Laukka, P. (2003). Communication of emotions in vocal expression and music performance: different channels, same code? Psychol. Bull. 129, 770–814.

Koelsch, S. (2010). Towards a neural basis of music-evoked emotions. Trends Cogn. Sci. (Regul. Ed.) 14, 131–137.

Koelsch, S. (2011). Toward a Neural Basis of Music Perception – A Review and Updated Model. Frontiers in Psychology, 2. doi:10.3389/fpsyg.2011.00110

Krumhansl, C. L., & Shepard, R. N. (1979). Quantification of the hierarchy of tonal functions within a diatonic context. Journal of Experimental Psychology: Human Perception and Performance, 5(4), 579.

Lawergren, B. (1988). The Origin of Musical Instruments and Sounds. Anthropos, 83(1/3), 31-45. Retrieved December 18, 2020, from

Lerdahl, F., and Jackendoff, R. (1999). A Generative Theory of Tonal Music. Cambridge: MIT.

Lerdahl, F. (2001). Tonal Pitch Space. New York, NY: Oxford University Press.

London, J. (2006). How to talk about musical meter. In Colloquium talks, Carleton College, MN.

Mandler, J. M. (1978). Categorical and schematic organization in memory. Center for Human Information Processing, Department of Psychology, University of California, San Diego.

Näätänen, R., Tervaniemi, M., Sussman, E., Paavilainen, P., and Winkler, I. (2001). ‘Primitive intelligence’ in the auditory cortex. Trends Neurosci. 24, 283–288.

Narmour, E. (1990). The analysis and cognition of basic melodic structures: The implication-realization model. University of Chicago Press.

Narmour, E. (1992). The analysis and cognition of melodic complexity: The implication-realization model. University of Chicago Press.

Patel, A. (2008). Music, Language, and the Brain. New York, NY: Oxford University Press.

Pearce, M., & Wiggins, G. (2004). Improved methods for statistical modelling of monophonic music. Journal of New Music Research, 33(4), 367-385.

Phillips-Silver, J., & Trainor, L. J. (2005). Feeling the beat: movement influences infant rhythm perception. Science, 308(5727), 1430-1430.

Purwins, H., Blankertz, B., & Obermayer, K. (2000, July). A new method for tracking modulations in tonal music in audio data format. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium (Vol. 6, pp. 270-275). IEEE.

Purwins, H., Herrera, P., Grachten, M., Hazan, A., Marxer, R., & Serra, X. (2008a). Computational models of music perception and cognition I: The perceptual and cognitive processing chain. Physics of Life Reviews, 5(3), 151–168. doi:10.1016/j.plrev.2008.03.004

Purwins, H., Grachten, M., Herrera, P., Hazan, A., Marxer, R., & Serra, X. (2008b). Computational models of music perception and cognition II: Domain-specific music processing. Physics of Life Reviews, 5(3), 169–182. doi:10.1016/j.plrev.2008.03.005

Roederer, J. G. (1995). Music, Physics, Psychophysics, and Neuropsychology: An Interdisciplinary Approach. The Physics and Psychophysics of Music, 1–14. doi:10.1007