Audio Search


Audio Similarity Search

This notebook shows how to use Pinecone as the vector database within an audio search application. Audio search can be used to find songs and metadata within a catalog, find similar sounds in an audio library, or detect who is speaking in an audio file.

We will index a set of audio recordings as vector embeddings. These vector embeddings are rich, mathematical representations of the audio recordings, making it possible to determine how similar the recordings are to one another. We will then take some new (unseen) audio recording, search through the index to find the most similar matches, and play the returned audio in this notebook.
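To make the idea of "most similar matches" concrete, here is a minimal brute-force sketch of what a similarity search computes. It is not how Pinecone works internally; the toy 3-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, vectors, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# toy 3-dimensional "embeddings" (made-up values, for illustration only)
toy_index = {
    "dog_bark": [0.9, 0.1, 0.0],
    "car_horn": [0.1, 0.9, 0.2],
    "siren": [0.2, 0.8, 0.3],
}

# query with a vector identical to the stored "car_horn" vector
matches = top_k([0.1, 0.9, 0.2], toy_index, k=2)
print([vec_id for vec_id, _ in matches])
```

A vector database does the same job over many high-dimensional vectors without comparing the query against every stored vector one by one.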

Install Dependencies

!pip install -qU pinecone-client panns-inference datasets librosa

Load Dataset

In this demo, we will use audio from the ESC-50 dataset, a labeled collection of 2000 environmental audio recordings, each 5 seconds long. The dataset can be loaded from the Hugging Face Hub as follows:

from datasets import load_dataset

# load the dataset from the Hugging Face Hub
data = load_dataset("ashraq/esc50", split="train")
data
Dataset({
    features: ['filename', 'fold', 'target', 'category', 'esc10', 'src_file', 'take', 'audio'],
    num_rows: 2000
})

The audio clips in the dataset are sampled at 44100 Hz and loaded into NumPy arrays. Let's take a look.

# select the audio feature and display the first three entries
audios = data["audio"]
audios[:3]
[{'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 44100},
 {'path': None,
  'array': array([-0.01184082, -0.10336304, -0.14141846, ...,  0.06985474,
          0.04049683,  0.00274658]),
  'sampling_rate': 44100},
 {'path': None,
  'array': array([-0.00695801, -0.01251221, -0.01126099, ...,  0.215271  ,
         -0.00875854, -0.28903198]),
  'sampling_rate': 44100}]
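Since each clip is 5 seconds long and sampled at 44100 Hz, each array should hold 220500 samples. The short sketch below works through that arithmetic; the helper names are hypothetical and only there for illustration.

```python
SAMPLE_RATE = 44100  # Hz, as reported in the dataset output
CLIP_SECONDS = 5     # ESC-50 clips are 5 seconds long

def expected_samples(sample_rate, seconds):
    """Number of samples in a clip of the given length."""
    return sample_rate * seconds

def duration_seconds(num_samples, sample_rate):
    """Length of a clip in seconds, given its sample count."""
    return num_samples / sample_rate

n = expected_samples(SAMPLE_RATE, CLIP_SECONDS)
print(n)                                 # 220500
print(duration_seconds(n, SAMPLE_RATE))  # 5.0
```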

We only need the NumPy arrays, as these contain all of the audio data. We will later input these NumPy arrays directly into our embedding model to generate audio embeddings.

import numpy as np

# select only the audio data from the dataset and store in a numpy array
audios = np.array([a["array"] for a in data["audio"]])

Load Audio Embedding Model

We will use an audio tagging model from the paper PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition to generate our audio embeddings. We use the panns_inference Python package, which provides an easy interface for loading and using the model.

from panns_inference import AudioTagging

# load the default model into the gpu.
model = AudioTagging(checkpoint_path=None, device='cuda') # change device to cpu if a gpu is not available
Checkpoint path: /root/panns_data/Cnn14_mAP=0.431.pth
GPU number: 1

Initialize Pinecone Index

The Pinecone index stores the audio embeddings, which we can later retrieve using another vector embedding (a query audio vector). We first need to initialize our connection to Pinecone and create our vector index. For this, we need a free API key. We then initialize the connection like so:

import pinecone

# connect to pinecone environment
pinecone.init(
    api_key="YOUR_API_KEY",
    environment="YOUR_ENV"  # find next to API key
)

Now we create our index. We need to give it a name (you can choose anything; we use "audio-search-demo" here). The dimension is set to 2048 because the model we use to generate audio embeddings outputs 2048-dimension vectors. Finally, we use cosine as our similarity metric because the model is trained to embed audio into a cosine metric space.

index_name = "audio-search-demo"

# check if the audio-search index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=2048,
        metric="cosine"
    )
# connect to audio-search index we created
index = pinecone.Index(index_name)
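Because the index is created with a dimension of 2048, every vector we upsert must have exactly that length, and record IDs must be strings. As a rough sketch, a small local check like the one below could catch mismatches before they reach the API. `validate_records` is a hypothetical helper, not part of the Pinecone client.

```python
INDEX_DIMENSION = 2048  # must match the dimension used when creating the index

def validate_records(records, dimension=INDEX_DIMENSION):
    """Raise ValueError if any (id, vector) pair would not fit the index."""
    for vec_id, vector in records:
        if not isinstance(vec_id, str):
            raise ValueError(f"ID {vec_id!r} must be a string")
        if len(vector) != dimension:
            raise ValueError(
                f"vector {vec_id!r} has length {len(vector)}, expected {dimension}"
            )
    return True

# two placeholder records with the right shape (values are dummies)
good = [("0", [0.0] * 2048), ("1", [0.1] * 2048)]
print(validate_records(good))  # True
```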

Generate Embeddings and Upsert

Now we generate the embeddings using the audio embedding model. We do this in batches because processing all items at once could exceed machine memory and API request size limits.

from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(audios), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(audios))
    # extract batch
    batch = audios[i:i_end]
    # generate embeddings for all the audios in the batch
    _, emb = model.inference(batch)
    # create unique IDs
    ids = [f"{idx}" for idx in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb.tolist()))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()
{'dimension': 2048,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 2000}},
 'total_vector_count': 2000}

We now have 2000 audio records indexed in Pinecone, so we're ready to begin querying.
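The batching arithmetic used in the upsert loop can be checked in isolation. The sketch below uses a hypothetical generator, `batch_ranges`, that yields the same `(i, i_end)` pairs as the loop: with 2000 items and a batch size of 64, there are 32 batches, and the last one holds only 16 items.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batches."""
    for start in range(0, total, batch_size):
        # the final batch may be shorter than batch_size
        yield start, min(start + batch_size, total)

ranges = list(batch_ranges(2000, 64))
print(len(ranges))  # 32
print(ranges[-1])   # (1984, 2000)
```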


Let's first listen to an audio clip from our dataset. We will generate an embedding for the clip and use it to find similar clips in the Pinecone index.

from IPython.display import Audio, display

# we set an audio number to select from the dataset
audio_num = 400
# get the audio data of the audio number
query_audio = data[audio_num]["audio"]["array"]
# get the category of the audio number
category = data[audio_num]["category"]
# print the category and play the audio
print("Query Audio:", category)
Audio(query_audio, rate=44100)
Query Audio: car_horn

We have the sound of a car horn. Let's generate an embedding for this sound.

# reshape query audio
query_audio = query_audio[None, :]
# get the embeddings for the audio from the model
_, xq = model.inference(query_audio)
xq.shape
(1, 2048)

We have now converted the audio into a 2048-dimension vector the same way we did for all the other audio we indexed. Let's use this to query our Pinecone index.

# query pinecone index with the query audio embeddings
results = index.query(xq.tolist(), top_k=3)
results
{'matches': [{'id': '400', 'score': 1.0, 'values': []},
             {'id': '1667', 'score': 0.842124522, 'values': []},
             {'id': '1666', 'score': 0.831768811, 'values': []}],
 'namespace': ''}

Notice that the top result is the audio number 400 from our dataset, which is our query audio (the most similar item should always be the query itself). Let's listen to the top three results.

# play the top 3 similar audios
for r in results["matches"]:
    # select the audio data from the dataset using the id as an index
    a = data[int(r["id"])]["audio"]["array"]
    display(Audio(a, rate=44100))

The results are good: all three clips sound like a busy city street with car horns.
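When the query clip is itself in the index, the top match is the query, with a score of 1.0. If only the other neighbours are of interest, one option is to filter the query id out of the matches. The sketch below assumes a response shaped like the query output shown earlier; `drop_self_match` is a hypothetical helper, and in practice you would also request one extra result (for example `top_k=4`) to still end up with three neighbours.

```python
def drop_self_match(matches, query_id):
    """Return the matches without the entry whose id equals query_id."""
    return [m for m in matches if m["id"] != query_id]

# response shaped like the query output shown earlier
response = {
    "matches": [
        {"id": "400", "score": 1.0, "values": []},
        {"id": "1667", "score": 0.842124522, "values": []},
        {"id": "1666", "score": 0.831768811, "values": []},
    ],
    "namespace": "",
}

neighbours = drop_self_match(response["matches"], query_id="400")
print([m["id"] for m in neighbours])  # ['1667', '1666']
```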

Let's write a helper function to easily run queries using audio from our dataset. We do not need to embed these audio samples again because their embeddings are already stored in Pinecone. Instead, we specify the id of the query audio and tell Pinecone to search with the vector stored under that id.

def find_similar_audios(id):
    print("Query Audio:")
    # select the audio data from the dataset using the id as an index
    query_audio = data[id]["audio"]["array"]
    # play the query audio
    display(Audio(query_audio, rate=44100))
    # query pinecone index with the query audio id
    result = index.query(id=str(id), top_k=5)
    # play the top 5 similar audios
    for r in result["matches"]:
        a = data[int(r["id"])]["audio"]["array"]
        display(Audio(a, rate=44100))

Here we return a set of revving motors (they seem to either be vehicles or lawnmowers).


And now a more relaxing set of birds chirping in nature.

Let's use another audio sample from elsewhere (that is, not from this dataset) and see how the search performs.

HTTP request sent, awaiting response... 200 OK
Length: 215546 (210K) [audio/x-wav]
Saving to: ‘miaow_16k.wav.1’

2022-09-25 20:47:00 (54.1 MB/s) - ‘miaow_16k.wav.1’ saved [215546/215546]

We can load the audio into a NumPy array as follows:

import librosa

a, _ = librosa.load("miaow_16k.wav", sr=44100)
Audio(a, rate=44100)
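Passing `sr=44100` asks librosa to resample the file to 44100 Hz, matching the rate of the indexed clips regardless of the file's native rate. The file name suggests a 16 kHz recording, but that is an assumption. As a rough sketch, the native rate of an uncompressed PCM WAV file can be checked with the standard library `wave` module; `wav_info` is a hypothetical helper, and the example writes its own short test tone so that it runs without the downloaded file.

```python
import math
import os
import struct
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / rate

# write a 1-second, 16 kHz, mono, 16-bit test tone so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "tone_16k.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wf.setframerate(16000)
    frames = b"".join(
        struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * t / 16000)))
        for t in range(16000)
    )
    wf.writeframes(frames)

rate, seconds = wav_info(path)
print(rate, seconds)  # 16000 1.0
```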

Now we generate the embeddings for this audio and query the Pinecone index.

# reshape query audio
query_audio = a[None, :]
# get the embeddings for the audio from the model
_, xq = model.inference(query_audio)

# query pinecone index with the query audio embeddings
results = index.query(xq.tolist(), top_k=3)

# play the top 3 similar audios
for r in results["matches"]:
    a = data[int(r["id"])]["audio"]["array"]
    display(Audio(a, rate=44100))

Our audio search application has identified a set of similar cat sounds, which is excellent.

Delete the Index

Delete the index once you are sure that you no longer need it. Once the index is deleted, you cannot use it again.

pinecone.delete_index(index_name)