Live version: https://image-similarity-clustering-engine.streamlit.app/
Github Repo: https://github.com/pedropcamellon/image-similarity-clustering-engine
Privacy concerns about cloud storage are growing, with 75% of users worried about photo leaks, and many people are looking for alternatives to cloud-based photo processing. Cloud APIs, while common for image processing, have clear drawbacks: they require internet connectivity, and transmitting your photos to external servers exposes them to potential security risks. This solution instead processes everything on your device using MediaPipe. You get advanced photo organization without sending anything to the cloud; your photos never leave your machine, and you keep full control. Here's how this local-first, privacy-first approach works and why it matters for managing your photos.
At the heart of our zero-cloud clustering solution lies the interplay between computer vision and unsupervised learning. Our system uses MediaPipe's Image Embedder, built on the MobileNet V3 architecture, to create image embeddings: numerical representations that capture the essence of each image. These embeddings work like digital fingerprints, condensing visual information such as shapes, colors, textures, and patterns into a 1024-dimensional vector. This mathematical representation allows computers to measure similarity between images by comparing their embedding vectors; two photos of dogs would have similar embedding patterns, while a photo of a dog and a photo of a car would have very different ones. The entire process runs on your device using TensorFlow Lite optimization, embedding each image in less than a millisecond on modern CPUs, which ensures both speed and privacy.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Example file names; replace with your own images in demo_data/
IMAGE_FILENAMES = ["dog_1.jpg", "dog_2.jpg", "car_1.jpg"]

# Create options for Image Embedder
model_file = "models/mobilenet_v3_small.tflite"
base_options = python.BaseOptions(model_asset_path=model_file)
l2_normalize = True  # scale each embedding to unit length
quantize = True  # quantize embedding values to bytes
options = vision.ImageEmbedderOptions(
    base_options=base_options, l2_normalize=l2_normalize, quantize=quantize
)

# Create Image Embedder
with vision.ImageEmbedder.create_from_options(options) as embedder:
    # Format images for MediaPipe
    mp_images = [
        {"file_name": name, "mp_image": mp.Image.create_from_file(f"demo_data/{name}")}
        for name in IMAGE_FILENAMES
    ]

    # Embed images
    embedding_results = [
        {
            "file_name": mp_image["file_name"],
            "embeddings": embedder.embed(mp_image["mp_image"]),
        }
        for mp_image in mp_images
    ]

    # Print embeddings
    for result in embedding_results:
        print(f"File name: {result['file_name']}")
        print(f"Embeddings: {result['embeddings']}")
Our image comparison system uses cosine similarity to measure how alike two MediaPipe-generated embeddings are. Since MediaPipe L2-normalizes these 1024-dimensional vectors during the embedding process (setting their magnitudes to 1), cosine similarity is a natural fit: it captures the angular relationship between vectors and is inherently magnitude-agnostic. We don't compare pixels directly; we compare the directions of these normalized mathematical representations, which makes the comparison particularly effective in high-dimensional feature spaces. The similarity is calculated with a straightforward formula that measures the cosine of the angle between two vectors:
similarity = cos(θ) = (A·B)/(||A||·||B||)
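To make the formula concrete, here is a tiny standalone NumPy sketch with two made-up vectors; note that for L2-normalized embeddings the denominator is 1, so the similarity reduces to a plain dot product:

import numpy as np

a = np.array([0.2, 0.5, 0.1])  # hypothetical embedding A
b = np.array([0.3, 0.4, 0.2])  # hypothetical embedding B

# cos(theta) = (A · B) / (||A|| · ||B||)
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)  # 1.0 = same direction, 0.0 = orthogonal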
In our implementation, we leverage MediaPipe's built-in cosine similarity calculation:
similarity = vision.ImageEmbedder.cosine_similarity(
    embedding_results[i]["embeddings"].embeddings[0],
    embedding_results[j]["embeddings"].embeddings[0],
)
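The indices i and j come from iterating over image pairs. A sketch of building the full pairwise similarity matrix, assuming the embedding_results list from earlier, could look like this:

import numpy as np

n = len(embedding_results)
similarity_matrix = np.zeros((n, n))

# The matrix is symmetric, with 1.0 on the diagonal (each image vs. itself)
for i in range(n):
    for j in range(n):
        similarity_matrix[i, j] = vision.ImageEmbedder.cosine_similarity(
            embedding_results[i]["embeddings"].embeddings[0],
            embedding_results[j]["embeddings"].embeddings[0],
        )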
While cosine similarity is a powerful tool for comparing image embeddings, it comes with several important limitations. First, since it only considers the angle between vectors and not their magnitudes, images with vastly different intensities might be grouped together if they share similar patterns. The method also heavily relies on the quality of embeddings generated by the underlying neural network (MobileNet V3 in our case), making its effectiveness dependent on the model's ability to capture meaningful features. Additionally, cosine similarity assumes linear relationships between features and doesn't directly account for spatial relationships within images. This means that two images might be considered similar even if their spatial arrangements differ significantly. Despite these constraints, cosine similarity remains highly effective for comparing high-dimensional embeddings, offering computational efficiency and relative insensitivity to feature scaling.
K-means clustering serves as one of our primary methods for organizing similar images. The process begins by converting image embeddings into a numpy array for efficient computation. Using the Elbow Method, we determine the optimal number of clusters by analyzing distortion values across different cluster counts. The algorithm then follows an iterative process: it starts by randomly placing k centroids in the embedding space, assigns each image to its nearest centroid, updates centroid positions based on the mean of assigned points, and continues this cycle until reaching convergence or hitting the maximum iteration limit.
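As a sketch of that pipeline with scikit-learn's KMeans (an illustration under that assumption; the project's own implementation may differ in details):

import numpy as np
from sklearn.cluster import KMeans

# Stack the embedding vectors into an (n_images, 1024) float array
X = np.stack(
    [r["embeddings"].embeddings[0].embedding for r in embedding_results]
).astype(np.float32)

# Elbow Method: record the distortion (inertia) for a range of cluster counts
distortions = {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    distortions[k] = km.inertia_

# Pick k where the distortion curve flattens (the "elbow"), then run
# K-means once more to assign each image to a cluster
best_k = 4  # hypothetical value read off the elbow curve
labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)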
The K-means implementation has several key strengths worth highlighting. By converting image data to numpy arrays, we get fast matrix operations for efficient processing. The Elbow Method reduces the guesswork in choosing the number of clusters by automating that selection. And the algorithm's iterative refinement passes let it converge to a stable grouping of similar images.
The K-means clustering algorithm offers several key advantages and limitations. On the positive side, it is straightforward to implement and understand, remains computationally efficient even on large datasets, and tends to produce clusters of roughly comparable size. However, it also has notable drawbacks: it requires the number of clusters to be specified beforehand, is sensitive to the initial placement of centroids, and struggles with irregularly shaped clusters. It may also force images into artificial groupings just to satisfy the requested cluster count. These characteristics make it important to consider carefully whether K-means is the right choice for a specific image clustering application.
Here's the core implementation: