Lecture 2: Supervised, Self-Supervised, & Weakly Supervised Learning

Stanford Online

When you think about AI models being deployed in the wild, it really comes down to two files somewhere in the cloud: one that describes the architecture of the model, and one that stores the parameters that are part of this architecture.
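
As a minimal sketch of what those two files can look like, assuming PyTorch (the lecture doesn't name a framework, and the file names here are made up):

```python
import json

import torch
import torch.nn as nn

# File 1: the architecture, described as plain configuration.
config = {"input_dim": 784, "hidden_dim": 256, "output_dim": 10}
with open("architecture.json", "w") as f:
    json.dump(config, f)

# Build the model from that description.
model = nn.Sequential(
    nn.Linear(config["input_dim"], config["hidden_dim"]),
    nn.ReLU(),
    nn.Linear(config["hidden_dim"], config["output_dim"]),
)

# File 2: the parameters that live inside that architecture.
torch.save(model.state_dict(), "parameters.pt")

# Deployment is just the reverse: rebuild from file 1, load file 2.
model.load_state_dict(torch.load("parameters.pt"))
```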

The deeper the network is, the more capacity it has. A network that is super deep with a billion parameters trained on one million pictures will just overfit to those pictures. It's not going to learn what a cat is, it's just going to learn by heart those million pictures because its capacity is way bigger than the data set it's fed.
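
To make the capacity argument concrete, here's a quick way to compare parameter count to dataset size (a sketch; the model below is a stand-in, not the billion-parameter network from the example):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 1000),
)

num_params = sum(p.numel() for p in model.parameters())
num_examples = 1_000_000  # the "one million pictures" from the lecture

# When this ratio is large, the network can memorize rather than generalize.
print(f"{num_params:,} parameters / {num_examples:,} examples "
      f"= {num_params / num_examples:.0f} parameters per example")
```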

An encoding is any vector representation. An embedding is an encoding that has meaning: the distance between two embeddings has meaning. They might be close or far from each other, and that tells you something.
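
One way to see the difference in code: with embeddings, distance is the thing that carries meaning. A sketch using cosine similarity, with made-up vectors for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Closer to 1.0 means the two embeddings are more similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.85, 0.15, 0.35])
truck = np.array([0.1, 0.9, -0.5])

print(cosine_similarity(cat, kitten))  # high: semantically close
print(cosine_similarity(cat, truck))   # low: semantically far
```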

The number one mistake that we see in projects is that people add more data, but forget to adjust the labels.

The neurons in the first layers are looking at pixels, so they're going to be good at stitching those pixels together. The first neuron will be good at detecting diagonal edges, the second neuron at vertical edges, the third at horizontal edges. One layer deeper, those neurons are not seeing pixels, they're seeing the output of the first layer, which is already slightly more complex.
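
The hand-written kernels below imitate what those first-layer neurons end up detecting (a sketch; real learned filters are similar in spirit, not identical):

```python
import numpy as np
from scipy.signal import convolve2d

# Kernels that respond to vertical, horizontal, and diagonal edges,
# the kinds of patterns first-layer neurons learn to detect.
vertical = np.array([[-1, 0, 1],
                     [-1, 0, 1],
                     [-1, 0, 1]])
horizontal = vertical.T
diagonal = np.array([[ 0,  1, 1],
                     [-1,  0, 1],
                     [-1, -1, 0]])

image = np.random.rand(28, 28)  # stand-in for a grayscale input

# Each feature map lights up where its pattern appears in the pixels.
feature_maps = [convolve2d(image, k, mode="valid")
                for k in (vertical, horizontal, diagonal)]
```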

Great deep learning researchers are very creative when it comes to designing loss functions.

We would print pictures at different resolutions and show them to our friends and ask: can you tell if it's day or night? It turns out that below a certain resolution, they can't anymore. Those types of insights we could get over the next 10 minutes, literally by using humans as a proxy.
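
The same experiment is easy to run in code by downsampling instead of printing (a sketch; the file name is hypothetical, and squashing to a square is a simplification):

```python
from PIL import Image

# Downsample an image to several resolutions, then show each version
# to people and ask: day or night? The resolution where humans start
# failing hints at how much signal the task actually needs.
original = Image.open("street.jpg")
for size in (256, 128, 64, 32, 16):
    small = original.resize((size, size), Image.BILINEAR)
    small.save(f"street_{size}px.jpg")
```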

The calculation is really about how fast our iteration cycles can be. It's not about necessarily the performance of the model. You want to be able to iterate super quickly. If your model takes one year to train, you're not going to be able to iterate.

If the labeling scheme is easier for you, it's easier for the model.

We went around campus and literally recorded people saying "activate" and other words. We created a Python script that randomly inserts those words into background audio. The trick is that because the Python script did the inserting, it knows where "activate" was put, so it can label automatically. Within three hours, we had millions of data points.
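
A sketch of that auto-labeling trick for audio (the function, array shapes, and the 0.1-second label window are assumptions; the point is that the script that inserts the clip gets the label for free):

```python
import numpy as np

def synthesize_example(background, activate_clip, sample_rate=16000):
    """Insert a recorded 'activate' clip into background audio at a
    random position. Because the script chose the position, it also
    knows the label; no human annotation needed."""
    # Assumes the background clip is longer than the inserted word.
    insert_at = np.random.randint(0, len(background) - len(activate_clip))
    audio = background.copy()
    audio[insert_at:insert_at + len(activate_clip)] += activate_clip

    # Label the frames just after the word ends as positive.
    label = np.zeros(len(background), dtype=np.int8)
    end = insert_at + len(activate_clip)
    label[end:end + sample_rate // 10] = 1  # ~0.1 s of positive frames
    return audio, label
```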

These hacks are valuable. Why does Meta go out and give crazy offers to a few researchers? Because they know stuff. They know those hacks, literally.

You take the top left pixel on one image, it's bright. You take the top left pixel on another image of the same person, it's dark. The difference between these two pixels is massive, close to 255, yet the pixel doesn't even matter. So why would you use that for comparison?
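
A toy sketch of that point, with a uniform brightness change standing in for the lighting difference (the pixel values are made up):

```python
import numpy as np

# Two photos of the same person: one in bright light, one in shadow.
bright = np.full((64, 64), 230, dtype=np.uint8)  # stand-in images
dark = (bright * 0.3).astype(np.uint8)

# The top-left pixels differ by ~160 out of 255, yet the identity in
# both images is identical. Raw pixel distance measures lighting,
# not the thing we care about.
print(int(bright[0, 0]) - int(dark[0, 0]))
```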

Contrastive learning: you take a picture, you make a variant of it with noise, rotation, cropping, translation, whatever you want. Then you put these two in the data set and say these should have the same vector. That's why it's called self-supervised learning: you don't have labels, you just create a learning environment that makes the network learn directly from the patterns of the data itself.
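
A minimal sketch of that setup, assuming PyTorch and torchvision (the encoder is a stand-in; note that real methods like SimCLR also push non-pairs apart, otherwise the network can collapse to mapping everything to the same vector):

```python
import torch.nn.functional as F
from torchvision import transforms

# Each call produces a different random variant of the same picture.
augment = transforms.Compose([
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
])

def pair_loss(encoder, image):
    # `image` is assumed to be a C x H x W tensor.
    # Two variants of the same picture should map to the same vector.
    z1 = encoder(augment(image).unsqueeze(0))
    z2 = encoder(augment(image).unsqueeze(0))
    return 1 - F.cosine_similarity(z1, z2).mean()
```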

The shift from supervised triplets to self-supervised pairs is why modern models are trained on billions of unlabeled images. You can literally write a script that scrapes images, auto-generates pairs by making variations, and you end up with a very powerful pre-trained model. Much simpler than people think.

This is not typically called supervised learning; it's called weakly supervised learning, because you're not actually labeling images with captions, you're benefiting from naturally occurring pairings in the world.
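
A sketch of how those naturally occurring pairs become a training signal, in the style of CLIP's symmetric contrastive objective (the embeddings are assumed to come from stand-in image and text encoders; this is not a specific library's API):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    """image_embeds[i] and text_embeds[i] came from the same web page;
    that pairing is the weak label nobody had to write by hand."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity of every image to every caption in the batch.
    logits = image_embeds @ text_embeds.T / temperature
    targets = torch.arange(len(logits))  # matching pairs sit on the diagonal

    # Push each image toward its own caption, and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```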

Emergent behaviors are unexpected capabilities that arise from simple training objectives at scale without being explicitly taught or labeled.

Most things connect through a single modality. For example, thermal data connects to imaging, and imaging connects to text, so thermal data connects to text through images. Most things connect to text, so text is typically what you want to use as your shared space.