Mastering Vector Embeddings: Search Text, Audio, Video, and Images with Ease
Introduction
Vector embeddings are fundamental to artificial intelligence (AI). Computers cannot process text, images, audio, or video directly; they only operate on numbers. This is where embeddings come into play. Embeddings convert complex data such as text, images, audio, and video into numerical representations, enabling machines to understand and process this data.
Embeddings represent high-dimensional data in a lower-dimensional space while preserving the meaning, semantics, and relationships within the data. The purpose of this article is to explain how to create vector embeddings for various data types and demonstrate similarity search using Qdrant.
Overall, working with vector embeddings entails:
- Choosing an appropriate embedding model for the data type.
- Generating embeddings from the raw data.
- Storing the embeddings in a vector database (in this guide, we’ll use Qdrant).
- Performing similarity searches or other operations on the embeddings.
The Importance of Vector Embeddings
The following advantages of vector embeddings make them a vital tool in many data science and artificial intelligence domains:
- Semantic Understanding: Embeddings capture meaning and context, enabling more nuanced analysis.
- Feature Extraction: Embeddings automatically learn important features from raw data.
- Similarity Comparisons: They allow for efficient similarity calculations between data points.
- Transfer Learning: Pre-trained embeddings can be used across different tasks and domains.
- Dimensionality Reduction: Embeddings represent complex, high-dimensional data in a more compact, manageable form. This efficient representation is essential for processing and analyzing large collections.
- Personalization and Recommendation Systems: Embeddings facilitate the development of sophisticated recommendation systems in e-commerce and content platforms. By capturing trends and preferences in user behavior, they enable recommendations that are more relevant and tailored to the individual.
- Data Visualization and Clustering: High-dimensional data can be visualized in fewer dimensions using vector embeddings. This is helpful in exploratory data analysis, where finding clusters and patterns is crucial.
Types of Vector Embeddings
1. Text Embeddings
These are vector representations of words, phrases, or entire documents. They capture semantic relationships between textual elements, allowing machines to understand and process human language more effectively. This category includes word embeddings, sentence embeddings, and document embeddings, all of which convert words and sentences into numbers a computer can work with.
Creating Text Embeddings
There are several ways to do this, such as the Word2Vec or GloVe models; however, we'll look at some modern methods using the Hugging Face library. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova proposed the BERT model in their paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". A bidirectional transformer was pre-trained on a large dataset comprising the Toronto Book Corpus and Wikipedia, using a combination of masked language modeling and next-sentence prediction objectives. Unlike earlier language representation models, BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to produce state-of-the-art models for a wide range of tasks, such as question answering and language inference, without significant task-specific architecture changes.
BERT, which stands for "Bidirectional Encoder Representations from Transformers," is a self-supervised transformer model pre-trained on a huge corpus of English data. It was pre-trained on raw text only, with no human labeling (which allows it to use large amounts of publicly available data), using an automatic process to generate inputs and labels from that text. Specifically, it was pre-trained with two objectives:
- Masked Language Modeling (MLM): Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and predicts the masked words (see the short fill-mask sketch after this list).
- Next Sentence Prediction (NSP): During pre-training, the model concatenates two masked sentences as input and then has to predict whether the two sentences followed each other in the original text.
- Sentence Embeddings: These represent the meaning of entire sentences or paragraphs. Examples of models used for this include Google's BERT and InferSent.
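To make the MLM objective concrete, here is a minimal sketch, assuming the transformers library is installed (as in the examples below), that uses the fill-mask pipeline with bert-base-uncased to predict a masked word:
from transformers import pipeline
# Load a fill-mask pipeline backed by the pre-trained BERT model
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# BERT predicts the most likely tokens for the [MASK] position
predictions = fill_mask("Vector embeddings turn text into [MASK].")
for prediction in predictions:
    print(prediction["token_str"], round(prediction["score"], 4))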
Example 1
# Install the library (this assumes you are using Python)
pip install transformers
# Implementation with BERT
from transformers import BertModel, BertTokenizer
import torch
# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
# Define sentences to embed
sentences = ["Hugging Face provides a variety of models for text embeddings.",
"BERT is one of the most popular models for generating embeddings."]
# Tokenize sentences
inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True, max_length=128)
# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state
# Extract the embeddings for the [CLS] token (representative of the whole sentence)
cls_embeddings = embeddings[:, 0, :]
print(cls_embeddings)
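The [CLS] vector is one way to summarize a sentence with a plain BERT model, but mean pooling over the token embeddings (weighted by the attention mask) is a common alternative. Here is a minimal sketch that reuses the inputs and embeddings variables from the example above:
# Mean pooling: average the token embeddings, ignoring padding positions
attention_mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
masked_embeddings = embeddings * attention_mask
mean_embeddings = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1)
print(mean_embeddings.shape)  # (2, 768)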
Example 2
from sentence_transformers import SentenceTransformer
def get_sbert_embedding(text):
    # Load the SBERT model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    # Generate embedding
    embedding = model.encode(text)
    return embedding
# Example usage
text = "Vector embeddings are fascinating!"
embedding = get_sbert_embedding(text)
print(f"Sentence-BERT embedding shape: {embedding.shape}")
Example 3
from sentence_transformers import SentenceTransformer
# Load the pre-trained model (.cuda() assumes a GPU is available; remove it to run on CPU)
model = SentenceTransformer("infgrad/stella_en_1.5B_v5", trust_remote_code=True).cuda()
# Define sentences to embed
sentences = [
"Hugging Face provides a variety of models for text embeddings.",
"Sentence transformers are optimized for generating high-quality sentence embeddings."
]
# Generate embeddings
embeddings = model.encode(sentences)
# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding[:10]}...")  # Display the first 10 dimensions for brevity
    print()
2. Audio Embeddings
Audio embeddings are vector representations of audio signals that capture the relevant features and characteristics of the audio. Examples include music embeddings, speech embeddings, and so on. Popular embedding models include VGGish and YAMNet.
Creating Audio Embeddings: We can do this using the TensorFlow framework with models such as YAMNet and VGGish.
Example 1
import tensorflow as tf
import tensorflow_hub as hub
# Load the YAMNET model
model = hub.load('https://tfhub.dev/google/yamnet/1')
# Load an audio file (YAMNet expects a mono 16 kHz waveform)
audio, sample_rate = tf.audio.decode_wav(tf.io.read_file('audio_sample.wav'))
audio = tf.squeeze(audio, axis=-1)
# Generate embeddings
scores, embeddings, log_mel_spectrogram = model(audio)
print(f"Audio embedding shape: {embeddings.shape}")
Example 2
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
def get_vggish_embedding(audio_path):
    # Load the VGGish model
    model = hub.load('https://tfhub.dev/google/vggish/1')
    # Load and preprocess the audio file (VGGish expects a mono 16 kHz waveform)
    audio, sample_rate = tf.audio.decode_wav(tf.io.read_file(audio_path))
    audio = tf.squeeze(audio, axis=-1)
    # Ensure the audio is the correct length (0.96 seconds, one VGGish frame)
    audio = audio[:int(0.96 * sample_rate)]
    if len(audio) < int(0.96 * sample_rate):
        audio = tf.pad(audio, [[0, int(0.96 * sample_rate) - len(audio)]])
    # Generate embedding
    embedding = model(audio)
    return embedding.numpy()
# Example usage
audio_path = "sample_audio.wav"
vggish_embedding = get_vggish_embedding(audio_path)
print(f"VGGish embedding shape: {vggish_embedding.shape}")
3. Image Embeddings
Image embeddings are vector representations of images that capture visual features and semantic information. They convert image data into numbers that models can process. Models that can be used for this task include VGG16, ResNet50, VisualBERT, CLIP, and others.
Creating Image Embeddings: We can create image embeddings using various models such as CLIP, VisualBERT, and so on.
- VisualBERT: VisualBERT is a model that understands and performs tasks requiring both visual and textual information. It is trained on datasets that pair images with written descriptions or questions, so it learns the relationship between an image's visual content and the accompanying text. The resulting embeddings are also known as multimodal embeddings because, given an image and text, VisualBERT generates representations that capture both visual and linguistic information.
- CLIP: CLIP (Contrastive Language-Image Pretraining) is an OpenAI model capable of understanding both images and text. It is trained on a large collection of images and their associated text descriptions, learning to match each image to the correct description and to tell mismatched pairs apart. When you input an image or a piece of text, CLIP generates a numerical representation (embedding) of it. These embeddings can then be used to measure similarity between images and texts (see the image-text similarity sketch after Example 2), as well as to perform image classification and retrieval.
Example 1
# install the model
pip install transformers torchvision
# Implementation
from transformers import BertTokenizer, VisualBertModel, ViTImageProcessor, ViTModel
from PIL import Image
import torch
import requests
# Load the pre-trained VisualBERT model, a BERT tokenizer, and a ViT image encoder.
# Note: VisualBERT expects pre-extracted visual features (visual_embeds); in the
# original paper these come from a Faster R-CNN object detector.
visualbert_model = VisualBertModel.from_pretrained('uclanlp/visualbert-vqa')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
image_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vit_model = ViTModel.from_pretrained('google/vit-base-patch16-224')
# Load and preprocess the image
url = 'https://example.com/path/to/your/image.jpg'
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
# Tokenize the text input
text = "Describe the image"
text_inputs = tokenizer(text, return_tensors='pt')
# Extract visual features with ViT and project them to the 2048 dimensions this
# VisualBERT checkpoint expects. The untrained linear projection is only a
# stand-in to make the example runnable; use detector features in practice.
with torch.no_grad():
    vit_features = vit_model(pixel_values).last_hidden_state  # (1, 197, 768)
    visual_embeds = torch.nn.Linear(768, 2048)(vit_features)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
# Generate multimodal embeddings
with torch.no_grad():
    outputs = visualbert_model(
        **text_inputs,
        visual_embeds=visual_embeds,
        visual_token_type_ids=visual_token_type_ids,
        visual_attention_mask=visual_attention_mask,
    )
embeddings = outputs.last_hidden_state
print(embeddings)
Example 2
# install necessary libraries
pip install transformers torchvision ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
# Implementation
import torch
import clip
from PIL import Image
import requests
# Load the pre-trained CLIP model
model, preprocess = clip.load("ViT-B/32")
# Load and preprocess the image
url = 'https://example.com/path/to/your/image.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image_input = preprocess(image).unsqueeze(0)
# Generate image embeddings
with torch.no_grad():
    image_features = model.encode_image(image_input)
print(image_features)
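Because CLIP embeds images and text into the same space, you can also compare the image against candidate text descriptions. Here is a minimal sketch continuing from the example above (the candidate captions are made up for illustration):
# Encode candidate text descriptions into the same embedding space
text_inputs = clip.tokenize(["a photo of a dog", "a photo of a cat", "a photo of a car"])
with torch.no_grad():
    text_features = model.encode_text(text_inputs)
# Cosine similarity between the image and each caption
image_features_norm = image_features / image_features.norm(dim=-1, keepdim=True)
text_features_norm = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features_norm @ text_features_norm.T
print(similarity)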
4. Video Embeddings
They represent the content of video clips in a compact vector form, capturing both spatial and temporal information. In other words, video embeddings turn videos into numbers. In this type of embedding, we have video-frame embeddings and video-sequence embeddings.
- Video-frame Embeddings: These represent individual frames or images from videos. Models like VGG16 and ResNet50 can be used (see the frame-embedding sketch after this list).
- Video Sequence Embeddings: These represent sequences of video frames. Models like C3D and two-stream networks can be used.
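As a quick illustration of a video-frame embedding, the sketch below (assuming torchvision and OpenCV are installed, and using the same sample_video.mp4 placeholder as the later examples) embeds a single frame with a ResNet50 whose classification head has been removed:
import cv2
import torch
from torchvision import transforms
from torchvision.models import resnet50
# Load ResNet50 and drop the final classification layer to obtain a 2048-d embedding
model = torch.nn.Sequential(*list(resnet50(pretrained=True).children())[:-1])
model.eval()
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Read the first frame of the video and embed it
cap = cv2.VideoCapture("sample_video.mp4")
ret, frame = cap.read()
cap.release()
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
with torch.no_grad():
    frame_embedding = model(preprocess(frame).unsqueeze(0))
print(frame_embedding.squeeze().shape)  # torch.Size([2048])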
Example 1
import torch
from torchvision.models.video import r3d_18
from torchvision import transforms
import cv2
def get_r3d_embedding(video_path):
    # Load a pre-trained R3D-18 (3D ResNet) video model and drop its classification head
    model = r3d_18(pretrained=True)
    model = torch.nn.Sequential(*list(model.children())[:-1])
    model.eval()
    # Preprocess video frames
    transform = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((112, 112)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.43216, 0.394666, 0.37645], std=[0.22803, 0.22145, 0.216989]),
    ])
    # Load the video and extract frames
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < 16:  # use a 16-frame clip
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame = transform(frame)
        frames.append(frame)
    cap.release()
    # Pad with the last frame if necessary
    if len(frames) < 16:
        frames += [frames[-1]] * (16 - len(frames))
    # Stack frames and generate the embedding
    video_tensor = torch.stack(frames).unsqueeze(0).permute(0, 2, 1, 3, 4)
    with torch.no_grad():
        embedding = model(video_tensor)
    return embedding.squeeze().numpy()
# Example usage
video_path = "sample_video.mp4"
r3d_embedding = get_r3d_embedding(video_path)
print(f"R3D-18 embedding shape: {r3d_embedding.shape}")
Example 2
import torch
from torchvision.models.video import r2plus1d_18  # torchvision does not ship a pre-trained C3D model
from torchvision import transforms
import cv2
def get_r2plus1d_embedding(video_path):
    # Load a pre-trained R(2+1)D-18 video model (used here as a C3D-style
    # spatiotemporal CNN) and drop its classification head
    model = r2plus1d_18(pretrained=True)
    model = torch.nn.Sequential(*list(model.children())[:-1])
    model.eval()
    # Preprocess video frames
    transform = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((112, 112)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.43216, 0.394666, 0.37645], std=[0.22803, 0.22145, 0.216989]),
    ])
    # Load the video and extract frames
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < 16:  # use a 16-frame clip
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame = transform(frame)
        frames.append(frame)
    cap.release()
    # Pad with the last frame if necessary
    if len(frames) < 16:
        frames += [frames[-1]] * (16 - len(frames))
    # Stack frames and generate the embedding
    video_tensor = torch.stack(frames).unsqueeze(0).permute(0, 2, 1, 3, 4)
    with torch.no_grad():
        embedding = model(video_tensor)
    return embedding.squeeze().numpy()
# Example usage
video_path = "sample_video.mp4"
r2plus1d_embedding = get_r2plus1d_embedding(video_path)
print(f"R(2+1)D embedding shape: {r2plus1d_embedding.shape}")
Storing Embeddings in Qdrant
What Is Qdrant?
Qdrant is an open-source vector database and similarity search engine: a tool that helps us store, manage, and search vector embeddings efficiently. It can be used in AI applications across sectors such as e-commerce, healthcare, education, semantic analysis, and so on.
What is Similarity Search?
Similarity search finds the vectors in a collection that are most similar to a query vector. This is particularly useful in recommendation systems, search engines, semantic analysis, and so on.
Similarity between vector embeddings is typically measured with one of these metrics, as illustrated in the short sketch after this list:
- Cosine Similarity
- Euclidean Distance
- Dot Product
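Here is a minimal sketch (using NumPy, with two made-up vectors) of how these three measures are computed:
import numpy as np
# Two example embedding vectors (made up for illustration)
a = np.array([0.2, 0.5, 0.1, 0.8])
b = np.array([0.3, 0.4, 0.0, 0.9])
dot_product = np.dot(a, b)
cosine_similarity = dot_product / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean_distance = np.linalg.norm(a - b)
print(f"Dot product:        {dot_product:.4f}")
print(f"Cosine similarity:  {cosine_similarity:.4f}")
print(f"Euclidean distance: {euclidean_distance:.4f}")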
Setting Up Qdrant
This can be done using Docker.
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
Inserting Embeddings into Qdrant
This can be done using the Qdrant Python client.
# Install the client first: pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, PointStruct
client = QdrantClient("localhost", port=6333)
# Create a collection (the vector size must match your embedding model, e.g. 768 for BERT)
client.recreate_collection(
    collection_name="my_embeddings",
    vectors_config=VectorParams(size=768, distance="Cosine")
)
# Insert embeddings (embedding1 and embedding2 are vectors generated earlier, e.g. with BERT)
client.upsert(
    collection_name="my_embeddings",
    points=[
        PointStruct(id=1, vector=embedding1.tolist(), payload={"text": "Sample text 1"}),
        PointStruct(id=2, vector=embedding2.tolist(), payload={"text": "Sample text 2"}),
    ]
)
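With the embeddings stored, you can query the collection for the nearest neighbors of a new vector. Here is a minimal sketch, assuming query_embedding was produced by the same embedding model used above:
# Search for the stored embeddings most similar to a query vector
search_results = client.search(
    collection_name="my_embeddings",
    query_vector=query_embedding.tolist(),
    limit=3,  # return the top 3 nearest neighbors
)
for result in search_results:
    print(result.id, result.score, result.payload)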
Conclusion
Vector embeddings have revolutionized how we process and understand data, and they will only improve as the world of AI evolves. In this article, you learned what vector embeddings are, the different types, and how they are created. We demonstrated how to store different types of vector embeddings in Qdrant and how to perform similarity searches over them. These techniques power AI applications such as search engines, recommendation systems, and educational apps. By mastering how to create and use vector embeddings across different data types, you'll be well-equipped to tackle a wide range of modern machine learning and information retrieval (IR) tasks.