Feature Vectors and Embeddings

Seeking clarity, it's been a while
Published on 2024/05/18

I've been trying to brush up on my ML knowledge and get up to speed. Things have changed so much in the last 5 years that it almost feels like a completely new industry. Contrary to how I'm used to working, I started with a practical project. I wanted to play with MongoDB Vector Search and thought this was a good chance to explore the ML space again. Back when I was doing this type of work, the building blocks were Feature Vectors.

I'll keep this very light and go over my mental model for them. Let's say I want my computer to tell me if a picture is portraying a dog. I could compare my picture to a picture of a dog pixel by pixel. That's not only computationally expensive but also not very robust. By not "robust" I mean that a small change in color, rotation, or whatnot could yield very different results. How about I find some significant details and see if those are present in the image? Keeping things simple, I can say that a dog has 4 legs, 1 tail, and 2 floppy ears. These details I just listed can be the features of a dog. For a computer to understand this, I can represent it as an array of numbers.

Here it goes: [4, 1, 2]

The value at position 0 represents the legs, at position 1 we have the tail, and at position 2 we have the floppy ears. Now I look at the image and detect 4 legs, 1 tail, and 1 floppy ear.

So I have: [4, 1, 1]

Hey, this looks like it could be a dog! Intuitively the two vectors seem numerically pretty close (I won't go into the details of the distance between two vectors). This is called a feature vector! It's called that because I extracted 3 dimensions (details, or features) that I can use to detect a dog. Now pretend you have a black box that you give an image to, and it spits out 3 values. You can compare those values to how you defined a dog and know whether that image shows a dog or not. Eventually, you keep iterating on this until you extract enough robust details that you can tell with high confidence that an image is showing a dog. Thinking in terms of visual cues is easier for me since I worked more closely in the computer vision realm. Cutting this short, this is where I was before dipping my toes into generative AI.
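
If you do want to see what "numerically close" could mean in code, here's a minimal sketch in plain Python using the toy vectors from above. It shows two common ways to compare vectors, Euclidean distance and cosine similarity; the vectors and thresholds are purely illustrative, not a real dog detector.

```python
import math

# Reference "dog" feature vector: [legs, tail, floppy ears]
dog = [4, 1, 2]
# Features detected in the image
detected = [4, 1, 1]

def euclidean_distance(a, b):
    """Straight-line distance between two vectors: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(euclidean_distance(dog, detected))  # 1.0 -> pretty close
print(cosine_similarity(dog, detected))   # ~0.98 -> almost the same direction
```

The smaller the distance (or the closer the cosine similarity is to 1), the more "dog-like" the detected features are, at least under this toy definition of a dog.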

I kept reading about embeddings and wasn't sure what they referred to at first. I went through a tutorial that dealt with natural language processing (NLP). A document or abstract was represented with a vector of values. My first reaction was: isn't this a feature vector? It turns out that feature vectors are often high-dimensional and sparse, while embeddings capture semantic relationships in a lower-dimensional, dense space.

I think my gap was that I was stuck in the computer vision world from a while ago. Feature Vectors are high-dimensional and sparse, often coming from feature engineering. For example, you look for specific patterns, textures, and color histograms, and their occurrences are then encoded in a high-dimensional feature vector. It is sparse because many of those values can simply be zero if, for example, a specific pattern is never encountered in that image. Then CNN layers do their work to reduce this representation to what is most relevant. This is a necessary step to make it computationally feasible to use and to compare against other images. The output is a low-dimensional, dense vector. In the NLP world, that's what an embedding is.
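
To illustrate the shape difference, here is a toy sketch with entirely made-up values: a sparse, handcrafted, high-dimensional vector next to the kind of short, dense vector a learned model typically produces.

```python
import random

# Handcrafted, high-dimensional feature vector (made-up values): one slot per
# pattern/texture/color bin. Most bins never fire for a given image, so the
# vector is mostly zeros (sparse).
handcrafted = [0.0] * 1000
handcrafted[12] = 3.0   # hypothetical texture pattern seen 3 times
handcrafted[405] = 1.0  # hypothetical dominant color bin
handcrafted[871] = 7.0  # hypothetical edge-orientation count

# Learned, low-dimensional representation (again made-up values): every
# dimension carries some value, and no dimension has a hand-assigned meaning.
random.seed(0)
embedding = [round(random.uniform(-1, 1), 3) for _ in range(8)]

non_zero = sum(1 for v in handcrafted if v != 0.0)
print(f"handcrafted: {len(handcrafted)} dimensions, {non_zero} non-zero")
print(f"embedding:   {len(embedding)} dimensions, all populated")
print(embedding)
```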

Thoughts

Practically speaking, the way Feature Vectors and Embeddings are represented by a computer is the same: an array of values. My knowledge gap comes mostly from not digging that much into the NLP world. I am familiar with bag-of-words and TF-IDF (Term Frequency-Inverse Document Frequency), which are feature vectors. Times have changed, and with the advent of word embeddings we represent words in a lower-dimensional space that captures semantic relationships. These embeddings are learned from large amounts of text without any feature engineering. Or at least this is how I understand it.
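
As a rough sketch of that contrast (assuming scikit-learn is installed, and with made-up documents): a TF-IDF vectorizer produces one dimension per vocabulary word, with zeros for every word a document doesn't contain, whereas a learned embedding of the same document would just be a short, dense list of floats.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dog chased the ball",
    "a cat slept on the couch",
    "the dog and the cat played together",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

print(tfidf.shape)                       # (3, vocabulary_size): one column per word
print(vectorizer.get_feature_names_out())
print(tfidf[0].nnz, "non-zero values in the first document's vector")
# A word or sentence embedding would instead be a short dense vector,
# e.g. a few hundred floats, learned from large amounts of text.
```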
