A while ago I discussed the Photo Search tool that I’ve created and that I use to index all my photos. One thing that had bothered me from the beginning was the need to use Python to load and run the models. I’m sure there are cases where Python is not the worst choice, but those typically involve rapid prototyping rather than production-like scenarios where efficiency and resource consumption matter more.

While it may not matter much for my personal use, I was nevertheless interested in options for improving the Python-related components, even if perhaps only out of curiosity. I kept looking around periodically to see what was happening in the AI/ML space around running models more efficiently. After all, others surely must have realized, too, what a pretty bad idea it is to run Python code in production.

That’s when, a few weeks back, I ran across candle, a minimalist ML framework for Rust by the people over at Hugging Face. Very excited about writing some more code in Rust, I started playing with candle, but soon had to realize that the documentation is somewhat poor outside of the many examples they have created.

After digging my way through some of the Python implementation of sentence transformers and looking for corresponding implementations in candle, I managed to build a small Rust-based replacement for the embedding server that I had previously written in Python. Packaging it up into a container image was a piece of cake, and with Rust making static linking almost as easy as Go, I ended up with an image of <4MB in size (from scratch!). The thing that made me happiest, though, was the comparison of the memory footprint: the old Python-based embedding server required more than 1.3GiB of memory at runtime, while the new Rust-based server runs with about 540MiB. Note that this is pretty much the amount of memory needed to hold the embedding model itself, so one should treat that as the practical minimum. It won’t get any better than this, unless the model is unloaded from memory after use.
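
For the curious, the core of the new server roughly follows candle’s BERT example. The sketch below shows the general idea under a few assumptions: it loads a sentence-transformers-style BERT model from local files and computes a mean-pooled embedding on the CPU. The `embed` helper, the file names, and the pooling step are mine rather than anything prescribed by candle, and exact signatures can differ between candle versions (newer releases, for example, pass an optional attention mask to `forward`).

```rust
use anyhow::{Error as E, Result};
use candle_core::{Device, Tensor};
use candle_nn::VarBuilder;
use candle_transformers::models::bert::{BertModel, Config, DTYPE};
use tokenizers::Tokenizer;

// Hypothetical helper: turn a piece of text into a single embedding vector,
// using model files (config.json, tokenizer.json, model.safetensors) that
// live in a local directory, e.g. a sentence-transformers checkpoint.
fn embed(text: &str, model_dir: &std::path::Path) -> Result<Vec<f32>> {
    let device = Device::Cpu;

    let config: Config =
        serde_json::from_str(&std::fs::read_to_string(model_dir.join("config.json"))?)?;
    let tokenizer = Tokenizer::from_file(model_dir.join("tokenizer.json")).map_err(E::msg)?;
    let vb = unsafe {
        VarBuilder::from_mmaped_safetensors(&[model_dir.join("model.safetensors")], DTYPE, &device)?
    };
    let model = BertModel::load(vb, &config)?;

    // Tokenize and run the encoder (newer candle versions take an extra
    // attention-mask argument here).
    let ids = tokenizer.encode(text, true).map_err(E::msg)?.get_ids().to_vec();
    let token_ids = Tensor::new(ids.as_slice(), &device)?.unsqueeze(0)?;
    let token_type_ids = token_ids.zeros_like()?;
    let hidden = model.forward(&token_ids, &token_type_ids)?;

    // Mean-pool the token embeddings into one sentence embedding, similar to
    // the default pooling used by many sentence-transformers models.
    let (_batch, n_tokens, _hidden_size) = hidden.dims3()?;
    let pooled = (hidden.sum(1)? / (n_tokens as f64))?;
    Ok(pooled.squeeze(0)?.to_vec1::<f32>()?)
}
```

Loading the weights via `from_mmaped_safetensors` also means the model files can come from anywhere on disk, which ties in nicely with the deployment setup described below.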

I remember that early on in my endeavor to build the Photo Search tool, I also checked what it would take to containerize the Python-based embedding server. I quickly gave up on this when I realized that the container image would be way over 1GB in size, even without the actual model. That’s just sad, but probably no surprise to anybody familiar with Python. By the way, the small Rust-based container image mentioned above does not include the model either. That’s because the model doesn’t change and can easily be mounted read-only at runtime from a central location. In fact, that’s exactly what I do for my installation of Photo Search, running in Kubernetes: I simply mount the necessary files into the container running the embedding server. So whenever there are changes to the embedding server, I only need to download <4MB for the container image and have K8s replace those parts that have actually changed.
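
To make that work, the server itself doesn’t hard-code where the weights come from. A minimal sketch of the idea, with a hypothetical environment variable and fallback path that are not taken from the actual implementation:

```rust
use std::path::PathBuf;

/// Resolve the directory holding the model files. The container image ships
/// no weights; Kubernetes mounts them read-only at this path at runtime.
fn model_dir() -> PathBuf {
    // MODEL_DIR and the fallback path are made-up names for this sketch.
    std::env::var("MODEL_DIR")
        .map(PathBuf::from)
        .unwrap_or_else(|_| PathBuf::from("/models/embedding"))
}
```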

There are still some things left to do, though. I want to remove Python from the photo indexing process too, and I’m hoping candle will help me there as well. Plus, I still want to abstract the actual deployment to Kubernetes so that others can benefit from this. Stay tuned!