Vector embeddings transform words, images, or other inputs into multidimensional numeric vectors, enabling semantic similarity comparisons. This blog explores embedding models like word2vec and OpenAI’s text-embedding-3 family, similarity metrics such as cosine similarity, and practical vector search techniques for efficient data retrieval.

A Visual Introduction to Vector Embeddings: What You Need to Know
What Are Vector Embeddings?
Vector embeddings convert inputs like words or images into lists of floating-point numbers. These lists represent the input in a multidimensional space, called the embedding space. For example, a vector with 1024 numbers has 1024 dimensions. This numerical representation helps machines understand semantic meaning beyond simple text or pixels.
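As a concrete sketch, this is roughly what obtaining such a list of floats looks like with the openai Python package (it assumes an OPENAI_API_KEY environment variable; the input string is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Embed a single string; each float in the result is one coordinate
# of the input's position in the embedding space.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="A sample sentence to embed",
)

vector = response.data[0].embedding  # a plain list of floats
print(len(vector))  # 1536 dimensions for this model
```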
Major Embedding Models Explained
Word2vec was the pioneer, embedding single words into vectors (typically 300 dimensions). It’s lightweight and captures semantic meaning well. Then came OpenAI’s text-embedding-ada-002 in 2022, which supports inputs up to 8191 tokens and outputs 1536 dimensions. It’s faster, cheaper, and widely used.
Interestingly, this model always produces a “downward spike” at dimension 196, regardless of input. As Pamela Fox notes,
“For whatever reason, this model always produces a vector with that spike. Very peculiar!”
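A quick way to reproduce this observation (a sketch, assuming the same openai client setup as above and a handful of arbitrary inputs) is to embed a few unrelated strings and check where each vector’s minimum lands:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
texts = ["dog", "a much longer sentence about something else", "42"]
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)

for item in resp.data:
    vec = np.array(item.embedding)
    # Per the observation above, the minimum (the "downward spike")
    # should land at index 196 for every input.
    print(np.argmin(vec))
```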
In 2024, OpenAI released text-embedding-3-small and text-embedding-3-large. Both keep the same token input limit, but the output sizes differ: text-embedding-3-small outputs 1536 dimensions, while text-embedding-3-large defaults to 3072 (both can be shortened via the API’s dimensions parameter). They are also faster and cheaper, and unlike their predecessor they show no spike at dimension 196; values are well distributed.
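As a sketch of the dimension shortening just mentioned (the dimensions parameter is part of OpenAI’s embeddings API for the text-embedding-3 models; the input string is illustrative):

```python
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="A sample sentence",
    dimensions=256,  # request a shortened vector instead of the 3072 default
)
print(len(resp.data[0].embedding))  # 256
```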
Understanding Similarity Spaces and Metrics
Once inputs are embedded, comparing vectors reveals their semantic similarity. Each model has its own similarity space, so results vary. For example, the text-embedding-ada-002 model oddly ranks “god” as very similar to “dog,” likely due to spelling patterns.
Absolute scores also differ by model: ada-002’s cosine similarities cluster in a narrow band near the top of the range, so a score like 0.75 is actually among the lowest values it produces. As Fox points out,
“Many developers find that unintuitive, as we might see a value of 0.75 initially and think it indicates a very similar value, when it actually is the opposite for this model.”
Choosing the right similarity metric is crucial. Cosine similarity is popular; it measures the angle between vectors, ignoring their magnitudes. Dot product can also work: for unit-length vectors it equals cosine similarity while skipping the normalization step, which brings performance benefits in vector databases.
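To make the metrics concrete, here is a minimal numpy sketch (the tiny 3-dimensional vectors are stand-ins for real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the vectors' magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.1, 0.3, -0.2])
b = np.array([0.2, 0.25, -0.1])
print(cosine_similarity(a, b))

# OpenAI embeddings come back normalized to unit length, so for them a
# plain dot product, np.dot(a, b), yields the same number more cheaply.
```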
Vector Search: The Real Power of Embeddings
Vector search lets you find semantically similar items, not just lexically similar ones. It works across languages and, with multimodal models, even across images and text. For small datasets, exhaustive search is fine, but large databases need Approximate Nearest Neighbor (ANN) algorithms like HNSW for efficient querying.
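For the small-dataset case, exhaustive search is just a similarity computation against every stored vector. A brute-force sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 10,000 stored embeddings, normalized to unit length.
database = rng.normal(size=(10_000, 1536))
database /= np.linalg.norm(database, axis=1, keepdims=True)

query = rng.normal(size=1536)
query /= np.linalg.norm(query)

# For unit vectors, dot product equals cosine similarity.
scores = database @ query
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 most similar items
print(top_k, scores[top_k])
```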
Popular ANN algorithms and tools include:
- HNSW (hnswlib, PostgreSQL pgvector, Azure AI Search); see the sketch after this list
- DiskANN (diskannpy, Cosmos DB)
- IVFFlat (faiss, PostgreSQL)
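As an example of what ANN indexing looks like in practice, here is a minimal hnswlib sketch (random data stands in for real embeddings; the M and ef values shown are common starting points, not tuned settings):

```python
import hnswlib
import numpy as np

dim = 1536
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, dim)).astype(np.float32)

# Build an HNSW graph index over the vectors using cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=data.shape[0], ef_construction=200, M=16)
index.add_items(data, np.arange(data.shape[0]))

index.set_ef(50)  # higher ef = more accurate but slower queries

query = rng.normal(size=dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)  # approximate, not guaranteed exact, neighbors
```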
Why This Matters
Vector embeddings are reshaping how machines understand language and images. They enable smarter search, better recommendations, and cross-lingual applications. Staying updated on embedding models and similarity metrics helps developers build more intuitive AI systems.