Soundscapes: how can ML teach a computer to reflect the sounds of a space?

My goal in this project is to have a computer digitally digest my own music based on objects and re-create the space I composed a song in (my kitchen). This is the first step in a larger journey of mine to study how the computer, as an object, plays into our musical consciousness. It also reflects my general desire to give a computer greater agency over its own sound by using image data, rather than music files, as its data source.

Introduction: can music be a reflection of how we perceive objects?

I think that music is an experience of time/space separate from other languages like English. There may be some scientific support for this idea, but the neuroscience will (probably) not fully be there for at least another century. For the time being, it seems like the best way we can try to understand how humans process music is through non-scientific approaches.

It seems to me that music is like dreaming or remembering in the sense that it is a transferral: you occupy a separate time and space (while still existing in the present), a reality belonging to the story you are trying to tell, are experiencing (does someone washing dishes in the background make its way into my piano playing?), or are listening to. For instance, I remember Liszt’s Un Sospiro making me think of a hat flying off in the wind (like in Studio Ghibli’s The Wind Rises, a mixture of these two things).

I think that this experience is based on the noises we associate with objects; in the above example, I associate arpeggios (hands scattering on a piano) with wind and, for some reason, hats.

Objects in the image may inspire in you some different musical ideas: maybe the purple flowers sound like Schubert’s Abschied or something like that.

While it’s impossible to prove, is generally unscientific, and is totally farfetched, it’s interesting to think that this image is composed of a bunch of objects which carry their own musicality and that an image is a composite of these things which becomes a “soundscape”.

Understanding this hypothetical music-object relationship through data — can a computer interpret things this way?

I used a neural network Autoencoder to explore this question. Here’s what I did:

I recorded music about a bunch of objects and had a computer learn from it via an Autoencoder.

Some songs that I recorded, the training data, sounded like this:

Building an Autoencoder

The Autoencoder works by encoding our input MIDI files down to a low-dimensional latent space, then reconstructing them with a decoder. During training, the neural network gets better at this reconstruction, subject to our hyperparameters (such as the learning rate, the number of epochs, the optimizer, etc.). In our context, if each object is a part of a composite image, the Autoencoder is pixelating those objects and figuring out how we relate to them.

""" Seeding """

""" Hyperparameters """
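# Notes_ (the encoded training songs) and sequence_length are defined earlier, outside this excerpt
latent_dim = 2
input_sample = Notes_.shape[0]  # number of training samples
input_notes = Notes_.shape[1]  # size of the note/chord dimension
input_dim = input_notes * sequence_length
num_epochs = 10

""" Model """
# Import path assumed (TensorFlow 2.x Keras); the original imports are not shown
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

def get_autoencoder(input_sample, input_notes, input_dim, latent_dim):
    # Encoder: squeeze the flattened song down to latent_dim numbers
    EncInput = Input(shape=(input_dim,))
    Enc = Dense(latent_dim, activation='tanh')(EncInput)
    encode = Model(EncInput, Enc)

    # Decoder: expand a latent point back to the flattened song
    DecInput = Input(shape=(latent_dim,))
    Dec = Dense(input_dim, activation='sigmoid')(DecInput)
    decode = Model(DecInput, Dec)

    # Full autoencoder: encoder followed by decoder
    autoencoder = Model(EncInput, decode(Enc))
    # The rest of this call was cut off in the original; with no optimizer
    # given, Keras falls back to its default
    autoencoder.compile(loss='binary_crossentropy')

    return autoencoder, decode

autoencoder, Decode_r = get_autoencoder(input_sample, input_notes,
                                        input_dim, latent_dim)
fit_on = Notes_.reshape(input_sample, input_dim)
autoencoder.fit(fit_on, fit_on, epochs=num_epochs)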
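The code above relies on a Notes_ array and a sequence_length that are not shown here. As a rough sketch of what such an input could look like, the example below assumes (this is my guess, not something the code above confirms) that each song is cut into fixed-length windows and one-hot encoded, so that Notes_ has the shape (samples, notes, steps). The toy songs, the window length and the helper one_hot_windows are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: the real songs and note vocabulary are not shown in the post.
songs = [["C4", "E4", "G4", "C5", "G4", "E4"],
         ["A3", "C4", "E4", "A4", "E4", "C4"]]
sequence_length = 4  # assumed window length

# Dictionaries for swapping between note names and integers.
vocab = sorted({name for song in songs for name in song})
note_to_int = {name: i for i, name in enumerate(vocab)}
int_to_note = {i: name for name, i in note_to_int.items()}

def one_hot_windows(song, note_to_int, sequence_length):
    """Slide a window over the song; one-hot encode each window as (notes, steps)."""
    windows = []
    for start in range(len(song) - sequence_length + 1):
        window = np.zeros((len(note_to_int), sequence_length), dtype=np.float32)
        for step, name in enumerate(song[start:start + sequence_length]):
            window[note_to_int[name], step] = 1.0
        windows.append(window)
    return windows

Notes_ = np.array([w for song in songs
                   for w in one_hot_windows(song, note_to_int, sequence_length)])

# Same bookkeeping as in the hyperparameters: flatten each window to one vector.
input_sample, input_notes = Notes_.shape[0], Notes_.shape[1]
input_dim = input_notes * sequence_length
fit_on = Notes_.reshape(input_sample, input_dim)
print(Notes_.shape, fit_on.shape)
```

With an encoding like this, every value is either 0 or 1, which is the range a sigmoid output layer and a binary cross-entropy loss expect.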

After fitting the autoencoder, we want to save the decoder (named Decode_r in my code) to use later on when we feed our algorithm a random starting image. This is different from using something like an LSTM, where you input a starting point and let the algorithm unfold the story — here we are giving it a random input image. It’s almost like a variation on a theme, like a random sketch, as opposed to treating songs like texts.

If we give the decoder a random place to start from, Decode_r(np.random.normal(size=(1, latent_dim))).numpy(), reshape its output to the shape of our input with .reshape(input_notes, sequence_length), and then take the index of the strongest note using .argmax, we can pick out just one sequence of note indices as a starting image.

We can use our decode function to translate that random input into numbers, and use our dictionary to swap those numbers back to notes or chords.

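import numpy as np

def decode(int_to_note, Decode_r):
    # latent_dim, input_notes and sequence_length come from the hyperparameters above;
    # get_key is a lookup helper that is not shown in this post

    ## Computer's melody
    # This line and the two prints after it were cut off in the original.
    # They are completed from the description in the text; axis=0 (one note
    # index per time step) and the printed values are assumptions.
    ComputersMelody = (Decode_r(np.random.normal(size=(1, latent_dim)))
                       .numpy()
                       .reshape(input_notes, sequence_length)
                       .argmax(axis=0))
    print('Output shape (the length of your song) is', ComputersMelody.shape)
    print('Raw output (prior to passing back to dictionary) is', ComputersMelody)
    MelodyNotes = [get_key(c, int_to_note) for c in ComputersMelody]

    print('After decoding int->note, we get \n', MelodyNotes)

    return MelodyNotes

MelodyNotes = decode(int_to_note, Decode_r)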
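To try the sampling step without a trained model, here is a small NumPy-only sketch of the same idea: draw a random point in latent space, decode it, reshape, take the argmax, and look the numbers up in a dictionary. The toy_decoder with random weights only stands in for Decode_r, and the sizes, the dictionary and the choice of axis=0 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim = 2
input_notes = 4        # toy vocabulary size (assumption)
sequence_length = 8    # toy song length (assumption)
int_to_note = {0: "C4", 1: "E4", 2: "G4", 3: "C5"}  # hypothetical dictionary

# Stand-in for the trained decoder: one dense layer with a sigmoid and
# random, untrained weights.
W = rng.normal(size=(latent_dim, input_notes * sequence_length))
b = np.zeros(input_notes * sequence_length)

def toy_decoder(z):
    """Map latent points of shape (n, latent_dim) to activations in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W + b)))

# A random place to start from in latent space.
z = rng.normal(size=(1, latent_dim))

# Reshape the flat output to (notes, steps) and keep the strongest note per step.
scores = toy_decoder(z).reshape(input_notes, sequence_length)
melody_ints = scores.argmax(axis=0)

# Swap the integers back to note names.
melody_notes = [int_to_note[int(i)] for i in melody_ints]
print(melody_notes)
```

Because the weights here are random, the melody is meaningless; with a trained decoder, the same steps would turn nearby latent points into related variations.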

Results: the computer as an observer, and the absence of rhythm

What’s interesting about breadknife + coffee grinder is that the computer, acting as an observer of me playing the piano, is re-creating the physical space I composed the song breadknife in. I was in a room playing breadknife and someone started to make coffee — you can hear the chords, but they are rudely interrupted by the industrial machine of a coffee grinder we have.

The computer, in re-creating the music it’s given by the composer, can attempt to recreate the experience of the composer by taking into account both the composer’s song and the objects that were around them, almost like a memory of the space the song comes from.

Here, the computer, as it does not yet understand rhythm, is repurposing the slow chordal structure it was given and creating a faster, more upbeat song with the same notes.
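One way to picture why the rhythm disappears: if the network only learns which note or chord sounds at each step, and not how long it lasts, then something outside the model has to decide the timing when the output is written back to music. The sketch below assumes a fixed step between decoded notes; the helper to_events, the step size and the chord names are hypothetical and are not taken from my actual pipeline.

```python
def to_events(melody_notes, step=0.5):
    """Place each decoded note on a fixed grid as (start_in_beats, note_name)."""
    return [(i * step, name) for i, name in enumerate(melody_notes)]

# Hypothetical timing of the original performance: one slow chord every 4 beats.
slow_chords = [(0.0, "C4.E4.G4"), (4.0, "A3.C4.E4")]

# Hypothetical decoder output: the same chords, but with no timing attached.
decoded = ["C4.E4.G4", "A3.C4.E4", "C4.E4.G4", "A3.C4.E4"]

# Every decoded chord gets the same short slot, so the result sounds faster.
events = to_events(decoded, step=0.5)
print(events)
```

Under this assumption, the harmony survives while the pacing comes entirely from the chosen step, which matches the faster, more constant feel of the output.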

The computer can recast the same melody it’s given into an entirely new emotional affect, one that’s more jagged and constant. While a result of algorithmic shortcomings, the weird positivity of the computer is an interesting recollection of an object. The computer might have its own style.

The future of soundscapes — is a computer right for the job of echoing our physical reality?

Soundscape music needs to be dynamic, as it echoes the fluid physical spaces that we inhabit. The computer may be the instrument best suited to echoing our environments, since it can see and interpret so much in such a short amount of time (so long as the data is constructed well). The computer can find a focus by taking the objects it is able to see as a medium for the musical expression of a space: we are no longer limited to labeling the objects in an image, because through music we might be able to let the computer “experience” them.

This project, though, barely scratches the surface of that idea. I think by incorporating computer vision, we can start to enable this active reflection of a space, almost like a sound mirror. I think using my music (and my own opinion of objects) is not a bad start for playing around with things, but scraping data about thousands of sounds of a particular kind of object might be interesting too, since we could start to get at a computer’s ability to merge a bunch of different instances of experience. Or maybe we could even record the sounds of San Francisco’s Golden Gate Park for a month or so, and “reconstruct” the place that way.

Thank you

Thank you for reading. Please feel free to listen to the music the algorithm and I created here and stay tuned for future updates.

If you’re curious about the source code used to create this project, you can see my GitHub repository.

To see more of my work, please check out my website or find me on LinkedIn or GitHub.

Note: this project was completed as part of the Metis 3-month data science intensive bootcamp, a program focused on applying machine learning and statistical design, through projects, to my own inquiries about the world and its data.

叶秋, pianist & data scientist