It was Spring ‘22. The snow was meltin’, the birds were singin’, and my fellow ape Xi Chen was deep in the rabbit hole of crypto & NFTs. As he navigated this labyrinth, he often found himself screaming “ooooo-oo-ah-ah-oooo” which is ape-speak for — “Yo, why can’t I simply search for NFTs by describing what’s in the image”. Why? I can only speculate but I presume he wanted to search for something like “ape driving a lambo”. I mean, I know that’s what I’d do!

As a bonus, if there were none, when we did get our first Lambo trading NFTs, we could sell a picture of us driving it as an NFT to get a second Lambo! An ape can dream!

Xi’s ape noises increasingly drew the ire of his wife but luckily for him, he was friends with another ape, Reza Sohrabi, who was deep into NLP. He recalled Reza’s passionate howls “ooooo-oo-oo-ooooooo” which roughly translates to — “Yo, this OpenAI’s CLIP model is pretty neat and lets you embed images & text in the same space with a few lines of code”.

Intrigued, he decided to try it out on a few thousand NFT images and was so impressed with how easy it was, rumor has it that his howls that night rivaled those coming out of Oakland Zoo. Needless to say, he slept on the couch that week.

When I met Xi & Reza back in April and also saw how simple it was to build visual semantic search for NFTs, I jumped up on the chair like Tom Cruise on Oprah and screamed “ah-ah-oooo-oo-ah-ah” which, in this case, translates to — “Yo, I can easily scale this up to a million images and put it behind a web app”. Immediately, the three of us started to thump our chests and run around in circles while howling and screaming at each other. After a good fifteen minutes, we finally settled down to chart our course and got to work.

And, in just a couple of weekends, we stood up NFTopia, which let users search across roughly 1.1 million NFTs by simply describing the image (analogous to Google Photos) and also browse visually similar NFTs. While we expected self-driving Lambos to magically show up in our living room the following day, they never did. But until they do: Ape. Together. Strong. 🦍 🦍 🦍

On a serious note, while there is some emphasis on NFTs as a source for images in this post, I primarily use it as a use case to outline the process of building a simple visual semantic search engine. You can easily swap out NFTs with an image set of your choice like an e-commerce catalog or AI-generated images.

More broadly, this post is split into three sections — (1) getting a million NFT images using Alchemy (2) embedding them with CLIP & (3) powering search with Pinecone. Along the way, I’ll also cover techniques, tooling & infrastructure to build & scale this while emphasizing the ease of doing so thanks to the rapidly maturing tech stack underneath!

1. alchemy + nfts = Apegasm

The first step on our road to NFTopia was to get NFT images. And for a meaningful glimpse into this world, we wanted a sizeable number. Now I had no idea how many NFTs there were in the world, but that didn’t stop me from instinctively yelling out “one million NFTs” while holding my pinky like Dr. Evil cos obviously!

Dr. Evil say 'One Million NFTs' — pictured: visual approximation of me

1.1. now what the hell’s an nft?

Before diving into the logistics of getting a million NFTs, I’ll say just enough about them for what follows to make sense. Now, I am by no means an expert on NFTs but I presume a lot of folks who talk about them aren’t either so I’ll fit right in. The obvious starting point on this topic is Pete Davidson’s SNL video.

If you watched it and still have questions, here’s more mumbo-jumbo to add to it. To add some cringe, I’ll also frame it as a hypothetical conversation between you and me. Again, to reiterate, this is my partially informed perception of NFTs so don’t quote me on it!

You: That rap shiz was cool but Jack Harlow dropped his verse too fast and I didn’t understand a thing!
Me: No worries! Let’s start with the “token” part in Non-Fungible Tokens. This token is simply a number. In ERC-721, a popular Ethereum based standard for NFTs, these tokens are 256-bit unsigned integers (uint256) — commonly represented as integer strings ("2563") or hexadecimal strings ("0x000000000000000000000000000000000000000000000000c26cb1a2660d01ba").
You: Cool I guess. What makes these numbers so special?
Me: The blockchain!
*A light shines through a cloudy sky & an angel appears out of nowhere and sings for 10 minutes.*
You: Wait what? How exactly?
Me: Blockchain can simply be thought of as a spreadsheet that you can only add rows to. You can’t edit or delete any previous rows! You should probably watch this amazing video by 3Blue1Brown to know how exactly this is achieved.
You: Okay, cool. And this helps how exactly?
Me: There is no one ~~ring~~ person/organization to rule them all! Trust is distributed so your record of ownership in this spreadsheet can’t be altered.
You: but…
Me: Also, the underlying, open-source protocol ensures all new tokens for a given contract & blockchain are unique and hence, non-fungible.
You: Okay. I just bought an NFT. Does that mean a row with my wallet address & the token was added to this spreadsheet?
Me: First of all, congratulations! You just bought a number! Not just any number, an unsigned 256-bit integer! Secondly… yes.
You: But I thought I bought the image of a pixelated monkey flinging poop which, in my view, is a highly-priced piece of art!
Me: Well you, my friend, kinda did! ermm sorta.. actually, never mind!
You: Wait what?
Me: So storing data on the blockchain is quite expensive so when you bought the NFT, you probably bought some metadata the token is immutably linked to. This metadata typically has a URI that then points to the actual location of the image, video, audio, etc. These URIs can point to centralized stores accessed through http://, gs://, etc. or decentralized stores like ipfs://.
You: But Abhay, if NFTs mostly have URIs, does that mean the underlying file where the actual image is stored can be replaced with something else bearing the same name?
Me: OMG! Is that Kanye West behind you with a baseball bat?
.. and I run away
Me: (from far away) .. but apparently for ipfs, content hashes are used as tokens so I guess they’re kinda coupled?
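
For the curious, here is a tiny sketch of the "a token is just a number" bit from the conversation above. It only shows that the decimal-string and hex-string forms are the same uint256 underneath. The helper names (parse_token_id, to_hex) are made up for illustration and are not part of any NFT library.

```python
UINT256_MAX = 2**256 - 1  # largest value a uint256 can hold


def parse_token_id(token_id: str) -> int:
    """Turn a decimal ("2563") or hex ("0x0a03") token id string into an int."""
    text = token_id.strip().lower()
    value = int(text, 16) if text.startswith("0x") else int(text, 10)
    if not 0 <= value <= UINT256_MAX:
        raise ValueError(f"{token_id!r} does not fit in a uint256")
    return value


def to_hex(value: int) -> str:
    """Render a token id as a zero-padded, 64-digit hex string with a 0x prefix."""
    if not 0 <= value <= UINT256_MAX:
        raise ValueError(f"{value} does not fit in a uint256")
    return "0x" + format(value, "064x")


if __name__ == "__main__":
    print(parse_token_id("2563"))
    print(to_hex(2563))
```

Normalizing ids to one form early on makes it easier to de-duplicate tokens that show up in both notations.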

The last bit to add here is that, at its highest aspiration, NFTs don’t even need a URI and can point to physical objects like cars, houses, and perhaps, planets and galaxies! How exactly this gets enforced in the real world is beyond me since I haven’t been smoking what they’ve been smoking but I do know that Dennis Hope has already bought most of the solar system and sold several lunar plots. So if someone is trying to sell you a 256-bit integer representing a slice of the moon, do your due diligence and make sure it doesn’t conflict with a piece of paper already issued by the Lunar Embassy.

I mean, no one wants a uint256 vs. cellulose fiber supreme court case!

1.2. whatevs, just gimme nfts.

In case you’re wondering if you wasted a minute of your life reading the previous section, you’d be mostly right, especially with that last bit. But remember why you’re here? To build a visual semantic search engine for NFT images! And where are these images? We just found out that their URIs are in the token metadata!

Now you probably just cracked open a can of Red Bull ready to yank this data out directly from the blockchain. Brave, you are, but a lot to learn, you have!

In a classic crypto-contrarian fashion, we’ll instead use a centralized, for-profit organization to chew through all that raw blockchain data and have it feed us some easy-to-digest crypto-nft-goo. And, for hungry baby birds like us, there’s no momma quite like Alchemy and in this centralized mommy, we trust!

To their credit, Alchemy makes this process comically simple by offering a single endpoint to get all NFTs & metadata for a given collection (getNFTsForCollection). On top of being simple, it is also fast, with a generous free tier and capacity for high concurrency. So, a seed set of about 22K collections translated to roughly 370K requests that pulled in metadata for a whopping 34 million NFTs in just a few hours!
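
To give a feel for what paging through a collection might look like, here is a minimal sketch. Only the endpoint name (getNFTsForCollection) comes from this post. The URL layout, the contractAddress / withMetadata / startToken parameters, and the nfts / nextToken response fields are my recollection of Alchemy's NFT API, so treat them as assumptions and check the current docs. This is not the code I actually ran.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

# Assumed URL shape; verify against Alchemy's documentation before use.
BASE_URL = "https://eth-mainnet.g.alchemy.com/nft/v2/{api_key}/getNFTsForCollection"


def build_url(api_key: str, contract_address: str, start_token: Optional[str] = None) -> str:
    """Build the request URL for one page of a collection."""
    params = {"contractAddress": contract_address, "withMetadata": "true"}
    if start_token is not None:
        params["startToken"] = start_token  # assumed pagination cursor
    return BASE_URL.format(api_key=api_key) + "?" + urllib.parse.urlencode(params)


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def iter_collection_nfts(
    api_key: str,
    contract_address: str,
    fetch: Callable[[str], dict] = fetch_json,
) -> Iterator[dict]:
    """Yield every NFT in a collection by following the pagination cursor."""
    start_token = None
    while True:
        page = fetch(build_url(api_key, contract_address, start_token))
        yield from page.get("nfts", [])
        start_token = page.get("nextToken")
        if not start_token:  # no cursor means this was the last page
            break
```

The fetch argument is injectable so the paging logic can be exercised with canned pages, without touching the network or burning API quota.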

Collection IDs

You can download the full set of ~22K collection ids from here (~1.5MB)

1.3. downloading a million images!

Thanks to Alchemy, I now had 34 million NFT locations, 26 million of which pointed to media assets.

IPFS, a distributed file store, served as the most popular host, following which were base64 strings as inline data and a mix of different servers for the rest. Here’s a quick look at the top sources & for a more granular perspective, you can download the raw file here (~35KB)

Source distribution for NFT media
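
Since the media locations come in three flavors (IPFS, inline base64 and regular servers), a first pass is simply sorting them into those buckets. Here is a rough sketch of how that could look. The function names are hypothetical, and the ipfs.io gateway is just one public gateway I picked for illustration, not necessarily the one to use at scale.

```python
import base64
from typing import Tuple
from urllib.parse import unquote_to_bytes, urlparse

IPFS_GATEWAY = "https://ipfs.io/ipfs/"  # assumption: any public or private gateway works here


def classify_uri(uri: str) -> str:
    """Label a media URI as 'inline', 'ipfs', 'http' or 'other'."""
    if uri.startswith("data:"):
        return "inline"
    scheme = urlparse(uri).scheme.lower()
    if scheme == "ipfs":
        return "ipfs"
    if scheme in ("http", "https"):
        return "http"
    return "other"


def to_fetchable_url(uri: str, gateway: str = IPFS_GATEWAY) -> str:
    """Rewrite ipfs:// URIs to an HTTP gateway URL; leave everything else untouched."""
    if classify_uri(uri) != "ipfs":
        return uri
    path = uri[len("ipfs://"):]
    if path.startswith("ipfs/"):  # some URIs look like ipfs://ipfs/<hash>
        path = path[len("ipfs/"):]
    return gateway + path


def decode_data_uri(uri: str) -> Tuple[str, bytes]:
    """Split a data: URI into (mime type, raw bytes)."""
    header, _, payload = uri.partition(",")
    mime = header[len("data:"):].split(";")[0] or "text/plain"
    if header.endswith(";base64"):
        return mime, base64.b64decode(payload)
    return mime, unquote_to_bytes(payload)
```

Inline data never needs a network call, so handling it separately keeps those tokens out of the download queue entirely.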

At this point, the obvious instinct was to scream #yolo and start downloading but before that, it was important to be mindful of a few things that I’ll briefly cover here.

The process environment: When pulling a lot of files with little knowledge of what it is or does, knowing the environment the process runs in is pretty important. At the very least, the process should not have sudo access and preferably, it should run in an isolated environment. Also, it is important to understand how exactly these downloaded bytes are processed.
Mime type & file size: NFTs can point to anything from small images (.png, .jpg) to large media (.gif, .mp4, .wav etc.) so it’s best to check both mime type and file size before downloading. I mean, it’s not fun to accidentally download 1GB videos thinking they’re 1MB images especially when yolo-ing with a million files!
Retry strategy: Servers can be fickle, especially when you’re hammering them, so a retry strategy is a must.
Storage: From a quick sample, it looked like the average file was a few MBs which meant a million files would roughly translate to a few TBs. So, resizing & compressing images before saving to disk was a no-brainer.
Concurrency: Perhaps the most important thing to consider was concurrency. While it goes without saying that downloading a million files needs it, a bit of thought on “how” would go a long way to optimize throughput. But, as an eager ape, I simply decided to throw a LOT of threads at it.
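
The checklist above can be sketched with nothing but the standard library. This is an illustration of the ideas (header checks, retries with backoff, a single thread pool), not the code I ran. The 20MB cap, the retry counts and the worker count are hypothetical defaults, and download_one stands in for whatever function actually fetches the bytes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import partial

MAX_BYTES = 20 * 1024 * 1024  # hypothetical size cap per file


def looks_like_small_image(headers, max_bytes=MAX_BYTES):
    """Check response headers before pulling the body.

    `headers` is any mapping with a .get method. A plain dict is case-sensitive,
    so the keys must match exactly. Files that do not declare a length are skipped.
    """
    content_type = headers.get("Content-Type", "").lower()
    length = headers.get("Content-Length")
    if not content_type.startswith("image/"):
        return False
    return length is not None and length.isdigit() and int(length) <= max_bytes


def with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def download_all(urls, download_one, max_workers=64, attempts=3, base_delay=1.0):
    """Fan downloads out over one thread pool; return (results, failures) keyed by URL."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(with_retries, partial(download_one, url), attempts, base_delay): url
            for url in urls
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:  # keep going; one bad server should not stop the run
                failures[url] = error
    return results, failures
```

Collecting failures instead of raising means a long run survives flaky hosts, and the failed URLs can be retried later in a separate pass.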

And with that, I fired away my code and the logs went brrr making me feel like one of those operators from the Matrix.

Logs that made me feel like cool stuff was happening.

..and here’s my machine humming on all ~~cylinders~~ cores.

After a few initial hiccups, I pulled 1.1 million images amounting to 1.6TB of data in just a few days. In fact, I hit my Xfinity data cap of 1.2TB first but luckily, I only had to wait a couple of days for my next billing cycle to start. During this, I learned a few new things that I think are worth sharing.

Xfinity’s “Advanced Security” was actually throttling my downloads because it likely thought my machine was part of some botnet. I found this out much later and had to disable it to get my full speed back.
Decompression bombs are apparently files that, when loaded into memory, can occupy several orders of magnitude more space than the original file and crash your system. Since I used Pillow to read downloaded bytes as an image, I noticed this error several times. I never dug in to see if it was indeed a malicious file masquerading as an NFT, but it was interesting nonetheless!
IPFS, at first look, reminded me of torrents! Also, probably not a surprise but it was almost an order of magnitude faster at night than during the day (consider this an anecdote at best).
Since a large chunk of NFTs were like cartoonish avatars, resizing (800x800px) & compression significantly reduced storage footprint. In this case, the downloaded 1.6TB translated to just ~65GB as compressed jpgs on disk!
Initially, to optimize CPU usage, I had async calls inside multiple thread pools inside a process pool (matching the number of cores). However, since it didn’t seem to help all that much due to a heavy I/O skew, I decided to use a single thread pool to favor simplicity. It’s very likely that this bit me later when I increased the thread count to several thousand, due to the GIL & threading overhead.
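
Here is a rough sketch of the resize-and-compress step with a guard against decompression bombs, using Pillow. The 800px bound comes from the numbers above. The JPEG quality and the explicit pixel cap are hypothetical values I picked for illustration, and shrink_to_jpeg is a made-up helper name.

```python
import io

from PIL import Image

MAX_SIDE = 800           # bounding box from the post: 800x800px
JPEG_QUALITY = 85        # hypothetical quality setting
MAX_PIXELS = 50_000_000  # hypothetical cap, stricter than Pillow's own default limit


def shrink_to_jpeg(raw, max_side=MAX_SIDE, quality=JPEG_QUALITY, max_pixels=MAX_PIXELS):
    """Return resized JPEG bytes, or None if the bytes are unreadable or suspiciously large."""
    try:
        # Image.open only reads the header, so the size check runs before any full decode.
        with Image.open(io.BytesIO(raw)) as image:
            if image.width * image.height > max_pixels:
                return None
            image.thumbnail((max_side, max_side))  # in place, keeps the aspect ratio
            output = io.BytesIO()
            image.convert("RGB").save(output, format="JPEG", quality=quality)
            return output.getvalue()
    except (Image.DecompressionBombError, OSError):
        # DecompressionBombError: Pillow refused the image outright.
        # OSError: covers unidentified or truncated files.
        return None
```

Returning None rather than raising lets the caller log the token and move on, which matters when a few bad files are hiding among a million good ones.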

2. CLIP, a heavenly union between text & images.

By the following weekend, I had over a million images ready to be embedded by CLIP, a boringly straightforward process! In fact, this is all it takes to embed an image (source here).

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    embedding = model.encode_image(preprocess(Image.open("nft.jpg")).unsqueeze(0).to(device))
    embedding /= embedding.norm(dim=-1, keepdim=True)

# pickle embeddings
# ...

And, just like that, 1.1 million NFT images were batched, embedded and pickled in about half a day on my old 980Ti GPU!

Visual semantic search would now be as simple as embedding a user’s query and then computing its dot product with that of NFTs! Better yet, given any NFT, retrieving similar NFTs would be just as simple and would simply mean using that NFT’s embedding as the query. Further out, it would be just as trivial to retrieve similar NFTs based on arbitrary images!
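
To make the dot-product idea concrete, here is a minimal brute-force sketch with NumPy. It assumes the embeddings are already L2-normalized (as in the CLIP snippet above), so the dot product equals cosine similarity. The toy 2-d vectors are stand-ins for real CLIP embeddings, and normalize / top_k are hypothetical helper names.

```python
import numpy as np


def normalize(vectors):
    """Scale each vector to unit length so dot products become cosine similarities."""
    vectors = np.asarray(vectors, dtype=np.float32)
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)


def top_k(query, embeddings, k=10):
    """Return (indices, scores) of the k rows of `embeddings` most similar to `query`."""
    scores = embeddings @ query                      # one dot product per stored item
    k = min(k, len(scores))
    best = np.argpartition(-scores, k - 1)[:k]       # unordered top-k without a full sort
    best = best[np.argsort(-scores[best])]           # order just those k by score
    return best, scores[best]


if __name__ == "__main__":
    embeddings = normalize([[1, 0], [0, 1], [1, 1]])  # pretend these are NFT embeddings
    query = normalize([1, 0.1])                       # pretend this is an embedded text query
    print(top_k(query, embeddings, k=2))
    # "Similar NFTs" is the same call with a stored embedding as the query.
    print(top_k(embeddings[2], embeddings, k=2))
```

The same function serves text search, similar-NFT lookup and search by arbitrary image, since all three just hand it a vector from the shared space.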

clip?

I won’t go into the details of CLIP but at a high-level, it is a model that maps both images & text into the same vector space. It does this pretty well after having chewed through a LOT of images & associated captions from the web. To learn more, here’s a good reference from Roboflow.

3. thou shalt easily find thy neighbor with pinecone

Computing similarity over all NFTs and retrieving nearest neighbors is somewhat bearable when working with thousands and even tens of thousands of NFTs. But when you have millions, things start to slow down.

Enter approximate nearest neighbor (ANN) search. At a very high level, this can be thought of as “bucketing” the entire embedding space and computing similarity over a handful of representative points from these buckets instead. Then, by only considering buckets close to the query, the resulting search space would be just a fraction of what it would’ve been otherwise.
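
A toy version of that "bucketing" idea can be sketched in a few lines of NumPy. To be clear about the assumptions: this uses randomly sampled data points as the representative points, whereas real libraries train them (for example with k-means) and add many more tricks. It is an illustration of the concept, not how Annoy, Faiss or Pinecone work internally. Embeddings are assumed to be L2-normalized.

```python
import numpy as np


def build_buckets(embeddings, n_buckets, seed=0):
    """Pick random points as bucket representatives and assign every vector to its closest one."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(embeddings), size=n_buckets, replace=False)
    centroids = embeddings[picks]
    assignments = np.argmax(embeddings @ centroids.T, axis=1)  # most similar representative
    buckets = [np.flatnonzero(assignments == b) for b in range(n_buckets)]
    return centroids, buckets


def ann_search(query, embeddings, centroids, buckets, k=10, n_probe=2):
    """Score only the vectors living in the n_probe buckets closest to the query."""
    nearest_buckets = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.concatenate([buckets[b] for b in nearest_buckets])
    scores = embeddings[candidates] @ query
    order = np.argsort(-scores)[:k]
    return candidates[order], scores[order]
```

The n_probe knob is the accuracy versus speed trade-off: probing every bucket degenerates into exact brute-force search, while probing one or two scans only a fraction of the data and may miss a true neighbor sitting just across a bucket boundary.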

There are several off-the-shelf solutions to do this with varying degrees of accuracy and speed. An early hero (also my first introduction) from almost a decade ago is Annoy from Spotify while the current popular choice is Faiss from Facebook. Using any of these usually means managing both indices & deployment which I had zero interest in doing.

So, while on the prowl for managed services, I stumbled onto Pinecone which did just that while offering a simple interface and a generous free tier!

Here’s how simple it is to add vectors once you’ve created a project & an index (more here).

import pinecone

pinecone.init(api_key="api_key")
index = pinecone.Index("nftopia")

# load embeddings
# ...

index.upsert(
    vectors=vectors,
    namespace="image-embeddings"
)

To make things even better, it supports both batching & async calls translating to me loading 1.1 million embeddings very quickly!
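
For a sense of what batching might look like, here is a small sketch. It only relies on the index.upsert(vectors=..., namespace=...) call shown above. The batch size of 100 and the (id, values, metadata) tuple layout are assumptions on my part, and batched / upsert_in_batches are hypothetical helpers rather than part of the Pinecone client.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def upsert_in_batches(index, vectors, namespace, batch_size=100):
    """Send vectors to the index one batch at a time; return how many were sent.

    `index` is anything exposing upsert(vectors=..., namespace=...).
    `vectors` is an iterable of (id, values, metadata) tuples (assumed layout).
    """
    total = 0
    for batch in batched(vectors, batch_size):
        index.upsert(vectors=batch, namespace=namespace)
        total += len(batch)
    return total
```

Because batched works on any iterable, the embeddings can be streamed from disk instead of being held in memory all at once.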

Once the index is ready, querying it is as simple as this (more here).

# embed query
# ...

results = index.query(
    namespace="image-embeddings",
    top_k=10,
    queries=[query_embedding],
    include_metadata=True,
)

.. and retrieving similar NFTs is as simple as this.

# get anchor nft id
# ...

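# fetch returns a response keyed by id, so pull the raw values out before querying
fetched = index.fetch([anchor_nft_id], namespace="image-embeddings")
nft_embedding = fetched["vectors"][anchor_nft_id]["values"]

results = index.query(
    namespace="image-embeddings",
    top_k=10,
    queries=[nft_embedding],
    include_metadata=True,
)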

4. putting it all together

With the core pieces in place, it was time to put it all together. Functionally, this meant a few things.

(1) A web GUI that would allow users to (i) type in text and view relevant NFTs (ii) browse visually similar NFTs.
(2) A service to embed user queries with CLIP in real-time.
(3) A storage solution to host & serve the NFT images.

Since most of my familiarity was with Google Cloud Platform, I decided to use a subset of their services for deployment. Also, while (1) & (2) could be bundled, I decided to keep them separate for simplicity.

For (1), I used Django on App Engine, which may have been overkill, but was fast to get up & running given my familiarity. For (2), I simply wrapped CLIP with FastAPI and deployed it on Cloud Run (this is trivial and I’ve put the code up on GitHub if you’re curious). Finally, for (3), I used Cloud Storage.

That’s it! So when you head over to NFTopia and type in a query, Django first fetches its embedding from the CLIP service. Then, it hits Pinecone for nearest neighbors and renders them. For similar NFTs, Django hits Pinecone to first fetch the anchor NFT’s embedding and then retrieves its nearest neighbors.

All this happens pretty fast - CLIP takes ~150-200ms (Cloud Run only supports CPU) & Pinecone takes another ~100-150ms. Also, both App Engine & Cloud Run automatically scale with traffic and are pretty cheap to run, while Pinecone requires moving into a paid tier to scale up and needs manual management.

If you have any questions, feedback or want any data/code, feel free to leave a comment or hit me up on X or LinkedIn.

Visual semantic search for a million NFTs with CLIP