Image Embedding API: Fashion Item Detection & Visual Embedding

APIClaw Team · April 22, 2026 · 8 min read

Tags: image-embedding · fashion · computer-vision · embedding · api · visual-search

Finding visually similar products across millions of listings is one of the hardest problems in ecommerce. A customer photographs a pair of shoes on the street and wants to find where to buy them. A seller wants to monitor competitors selling visually identical products under different names. Traditional text-based search fails here — you can't keyword-search a visual style.

Our Image Embedding API solves this with two endpoints: one that detects fashion items in any image, and one that generates visual embeddings for similarity search. Both are powered by GensmoRetro Pro (GR-Pro), a continuously optimized production variant of the model described in the LookBench tech report.

Why Visual Search Matters for E-commerce

The gap between what shoppers see and what they can search for has always been the core limitation of text-based product discovery. A consumer spots a handbag on Instagram, a jacket in a magazine, or a pair of sneakers on the street — and their only option has been to guess keywords and scroll through hundreds of irrelevant results. This gap costs retailers billions in missed conversions every year.

For sellers, the problem is equally acute. Counterfeit detection, catalog deduplication, and competitive monitoring all require comparing product images at scale. Manually reviewing thousands of listings for visual similarity is not feasible, and text-based matching fails when sellers use different titles and descriptions for visually identical products.

Visual embedding technology closes this gap by converting images into numerical vectors that capture perceptual similarity. Two images of the same red sneaker — photographed from different angles, with different backgrounds, by different sellers — will produce embeddings that are close together in vector space. This makes nearest-neighbor search over millions of products both fast and accurate.

The key challenge is domain specificity. General-purpose vision models like CLIP were trained on broad internet data and lack the fine-grained understanding of fashion attributes — fabric textures, silhouette shapes, hardware details, color gradients — that distinguish one product from another. A model trained specifically on fashion imagery captures these nuances, which is why purpose-built fashion embeddings consistently outperform general models on retrieval benchmarks.

The Model Behind the API

GR-Pro is a high-capacity Vision Transformer (ViT) with ~0.3B parameters, trained on 6.5M curated fashion images spanning 1.9M unique product identities. It produces L2-normalized embeddings that capture fine-grained visual features — textures, patterns, silhouettes, accessories — that matter for fashion retrieval.

On the LookBench benchmark, GR-Pro achieves 67.4% Fine Recall@1 overall, outperforming the best public baselines (Marqo-FashionCLIP at 63.2%, SigLIP2 at 59.4%) by a significant margin. On real-world street-style images — the hardest subset — the gap widens to 5.5 points over the next best model. On the legacy Fashion200K benchmark, GR-Pro reaches 88.7% Recall@1.

Key technical details:

  • ArcFace training with Partial FC for scalable identity-based metric learning across 1.9M classes
  • Pure vision encoder — no text input needed, yet outperforms multimodal CLIP-family models on fashion retrieval
  • 10 fashion categories — Bag, Cap, Down-Clothing, Glasses, Jewelry, Others, Shoes, Sock, Up-Clothing, Watch
  • Continuously optimized — the API serves a production variant that is regularly updated beyond the published paper results
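
Because the embeddings are L2-normalized, cosine similarity between two items reduces to a plain dot product. A minimal numpy sketch (the short vectors here are illustrative placeholders, not real API output):

import numpy as np

# Hypothetical embeddings for two detected items. Real vectors are full-length
# and already unit-norm; these toy stand-ins are normalized below for illustration.
emb_a = np.array([0.0312, -0.0156, 0.0478])
emb_b = np.array([0.0295, -0.0148, 0.0461])

emb_a /= np.linalg.norm(emb_a)
emb_b /= np.linalg.norm(emb_b)

# For unit-length vectors, the dot product IS the cosine similarity
similarity = float(np.dot(emb_a, emb_b))
print(f"similarity: {similarity:.3f}")  # 1.0 = identical direction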

Two Endpoints

1. Image Embedding — Generate Visual Feature Vectors

Extract embeddings from fashion items detected in an image. Use these vectors for similarity search, product deduplication, visual recommendations, or clustering.

Endpoint: POST https://api.apiclaw.io/openapi/v2/model/image-embedding

Credits: 1 credit per request.

Request

{
  "image": "https://example.com/photo.jpg",
  "withTag": true,
  "withEmbedding": true,
  "text": ["red dress"]
}

Fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| image | string | Yes | Public URL of the image to analyze |
| topK | integer | No | Maximum number of items to detect in the image |
| boundingBoxes | int[][] | No | Regions of interest as pixel coordinates [[x1, y1, x2, y2], ...]. Skip to auto-detect. |
| text | string or string[] | No | Search text(s) to score how well each detected item matches |
| withTag | boolean | No | Return category labels like "Shoes" or "Bag" for each item (default: false) |
| withEmbedding | boolean | No | Return feature vectors for similarity search (default: true) |

Response

{
  "success": true,
  "data": {
    "boundingBoxes": [[120, 45, 380, 520]],
    "tags": [["Up-Clothing", 0.92]],
    "embeddings": [[0.0312, -0.0156, 0.0478, ...]],
    "textRelevanceScores": [0.847]
  },
  "meta": {
    "requestId": "req_a1b2c3d4e5f67890",
    "timestamp": "2026-04-22T10:30:00.000000Z",
    "creditsRemaining": 998,
    "creditsConsumed": 1
  }
}
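
The optional text parameter is easy to overlook: when you pass a query string, each detected item also gets a score in textRelevanceScores. A minimal Python sketch under our reading of the response shape above (one score per detected item for a single query string):

import httpx

resp = httpx.post(
    "https://api.apiclaw.io/openapi/v2/model/image-embedding",
    headers={"Authorization": "Bearer hms_live_YOUR_API_KEY"},
    json={
        "image": "https://example.com/photo.jpg",
        "withTag": True,
        "text": ["red dress"],
    },
    timeout=30.0,
)
data = resp.json()["data"]

# Pair each item's category label with its relevance to the query text;
# assumes tags and textRelevanceScores align per item, as in the example above
for (label, confidence), relevance in zip(data["tags"], data["textRelevanceScores"]):
    print(f"{label} (det conf {confidence:.2f}): text relevance {relevance:.3f}")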

2. Image Detection — Locate Fashion Items

Detect and classify fashion items in an image with bounding boxes and confidence scores. Filter by category to detect only specific item types.

Endpoint: POST https://api.apiclaw.io/openapi/v2/model/image-detection

Credits: 1 credit per request.

Request

{
  "image": "https://example.com/street-photo.jpg",
  "topK": 5,
  "classes": [6, 8],
  "returnImage": false
}

Fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| image | string | Yes | Public URL of the image to analyze |
| topK | integer | Yes | Maximum number of items to detect (1-50) |
| classes | int[] | No | Which categories to look for (omit to detect all). See category IDs below. |
| returnImage | boolean | No | Return the image with detection boxes drawn on it (default: false) |

Category IDs: 0=Bag, 1=Cap, 2=Down-Clothing, 3=Glasses, 4=Jewelry, 5=Others, 6=Shoes, 7=Sock, 8=Up-Clothing, 9=Watch
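
In application code it's convenient to keep this mapping next to your client. A small helper mirroring the IDs above (the dict and function are our own convenience, not part of the API):

# Category IDs as documented above
CATEGORY_NAMES = {
    0: "Bag", 1: "Cap", 2: "Down-Clothing", 3: "Glasses", 4: "Jewelry",
    5: "Others", 6: "Shoes", 7: "Sock", 8: "Up-Clothing", 9: "Watch",
}

def class_name(class_id: int) -> str:
    """Translate a detection classId into its human-readable category."""
    return CATEGORY_NAMES.get(class_id, "Unknown")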

Response

{
  "success": true,
  "data": {
    "detections": [
      {
        "areaRatio": 0.15,
        "centerDistanceRatio": 0.08,
        "classId": 8,
        "score": 0.96,
        "boundingBox": [120, 45, 380, 520]
      },
      {
        "areaRatio": 0.07,
        "centerDistanceRatio": 0.35,
        "classId": 6,
        "score": 0.91,
        "boundingBox": [200, 480, 350, 620]
      }
    ],
    "annotatedImage": null
  },
  "meta": {
    "requestId": "req_b2c3d4e5f6789012",
    "timestamp": "2026-04-22T10:30:00.000000Z",
    "creditsRemaining": 997,
    "creditsConsumed": 1
  }
}
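
A common post-processing step is picking the most prominent detection, e.g. the hero product in a listing photo. Assuming areaRatio is the box's share of the image area and centerDistanceRatio is the box center's normalized distance from the image center (our reading of the fields above, not confirmed behavior), a simple heuristic might look like:

def most_prominent(detections: list[dict]) -> dict:
    """Rank detections so large, central, high-confidence boxes win.

    The weights are illustrative, not API guidance; tune on your own data.
    """
    return max(
        detections,
        key=lambda d: d["score"] + d["areaRatio"] - 0.5 * d["centerDistanceRatio"],
    )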

Start with 1,000 free API credits — sign up here.

Code Recipes

Recipe 1: Detect Fashion Items (curl)

curl -s -X POST https://api.apiclaw.io/openapi/v2/model/image-detection \
  -H "Authorization: Bearer hms_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image": "https://example.com/outfit.jpg",
    "topK": 10
  }'

Recipe 2: Python — Visual Similarity Search

import httpx
import numpy as np

APICLAW_URL = "https://api.apiclaw.io/openapi/v2/model/image-embedding"
APICLAW_KEY = "hms_live_YOUR_API_KEY"

def get_fashion_embeddings(image_url: str) -> list[list[float]]:
    """Extract fashion item embeddings from an image."""
    resp = httpx.post(
        APICLAW_URL,
        headers={"Authorization": f"Bearer {APICLAW_KEY}"},
        json={"image": image_url, "withEmbedding": True, "withTag": True},
        timeout=30.0,
    )
    result = resp.json()
    if not result["success"]:
        raise RuntimeError(f"Embedding failed: {result}")
    return result["data"]["embeddings"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Get embeddings for a query image and a catalog image
query_embeds = get_fashion_embeddings("https://example.com/query-shoe.jpg")
catalog_embeds = get_fashion_embeddings("https://example.com/catalog-shoe.jpg")

# Compare the first detected item from each
similarity = cosine_similarity(query_embeds[0], catalog_embeds[0])
print(f"Visual similarity: {similarity:.3f}")
# => Visual similarity: 0.927

Recipe 3: Python — Detect + Embed Pipeline

import httpx

BASE_URL = "https://api.apiclaw.io/openapi/v2/model"
HEADERS = {"Authorization": "Bearer hms_live_YOUR_API_KEY"}

def detect_and_embed(image_url: str, categories: list[int] | None = None):
    """Detect fashion items, then generate embeddings with bounding boxes."""
    # Step 1: Detect items (omit "classes" entirely to detect all categories,
    # since the docs say to leave it out rather than send null)
    detect_payload = {"image": image_url, "topK": 10}
    if categories is not None:
        detect_payload["classes"] = categories
    detect_resp = httpx.post(
        f"{BASE_URL}/image-detection",
        headers=HEADERS,
        json=detect_payload,
        timeout=30.0,
    )
    detections = detect_resp.json()["data"]["detections"]
    if not detections:
        return []

    # Step 2: Get embeddings using detected bounding boxes
    bboxes = [d["boundingBox"] for d in detections]
    embed_resp = httpx.post(
        f"{BASE_URL}/image-embedding",
        headers=HEADERS,
        json={
            "image": image_url,
            "boundingBoxes": [[int(x) for x in bb] for bb in bboxes],
            "withEmbedding": True,
            "withTag": True,
        },
        timeout=30.0,
    )
    embed_data = embed_resp.json()["data"]

    # Combine detection metadata with embeddings
    results = []
    for i, det in enumerate(detections):
        results.append({
            "classId": det["classId"],
            "score": det["score"],
            "boundingBox": det["boundingBox"],
            "tag": embed_data["tags"][i] if embed_data.get("tags") else None,
            "embedding": embed_data["embeddings"][i],
        })
    return results

# Detect shoes and clothing in a street photo
items = detect_and_embed(
    "https://example.com/street-fashion.jpg",
    categories=[6, 8],  # Shoes + Up-Clothing
)
for item in items:
    print(f"  Class {item['classId']} (conf={item['score']:.2f}): {item['tag']}")

Recipe 4: TypeScript — Visual Search API Route

import { NextRequest, NextResponse } from "next/server";

const BASE_URL = "https://api.apiclaw.io/openapi/v2/model";
const APICLAW_KEY = process.env.APICLAW_API_KEY!;

interface Detection {
  areaRatio: number;
  centerDistanceRatio: number;
  classId: number;
  score: number;
  boundingBox: number[];
}

interface EmbeddingResult {
  boundingBoxes: number[][];
  tags: [string, number][] | null;
  embeddings: number[][] | null;
  textRelevanceScores: number[] | null;
}

async function detectFashionItems(imageUrl: string, topK: number = 10): Promise<Detection[]> {
  const res = await fetch(`${BASE_URL}/image-detection`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${APICLAW_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ image: imageUrl, topK }),
  });
  const data = await res.json();
  if (!data.success) throw new Error("Detection failed");
  return data.data.detections;
}

async function getEmbeddings(imageUrl: string, text?: string): Promise<EmbeddingResult> {
  const res = await fetch(`${BASE_URL}/image-embedding`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${APICLAW_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      image: imageUrl,
      withEmbedding: true,
      withTag: true,
      text: text ? [text] : undefined,
    }),
  });
  const data = await res.json();
  if (!data.success) throw new Error("Embedding failed");
  return data.data;
}

// Usage: find visually similar products
export async function POST(req: NextRequest) {
  const { imageUrl, query } = await req.json();
  const embeddings = await getEmbeddings(imageUrl, query);
  // Use embeddings.embeddings[0] for nearest-neighbor search in your vector DB
  return NextResponse.json({ embeddings });
}

See the full endpoint reference in our API documentation.

Use Cases

| Scenario | Endpoint | How |
| --- | --- | --- |
| Visual product search | Both | Detect items in query photo, embed them, find nearest neighbors in your catalog |
| Competitor monitoring | Embedding | Embed your product images and competitors' — flag high-similarity pairs |
| Catalog deduplication | Embedding | Cluster products by embedding similarity to find duplicates |
| Fashion trend analysis | Detect | Run detection across social media images to count category frequencies |
| Outfit recommendation | Both | Detect all items in an outfit, embed each, recommend complementary pieces |
| Content moderation | Detect | Verify product listings contain the claimed fashion category |

Visual Product Search in Practice

The most common integration pattern combines both endpoints into a real-time visual search flow. A user uploads or photographs an item — say, a leather crossbody bag spotted at a café. The detection endpoint identifies and localizes the bag within the image, filtering out background noise, other people, and irrelevant objects. The embedding endpoint then converts that cropped region into a 768-dimensional vector. Your application runs a nearest-neighbor query against a pre-indexed catalog of product embeddings stored in a vector database, returning the top-k most visually similar products in under 50ms on the retrieval side. The entire pipeline — from raw photo to ranked results — completes in well under a second, making it viable for consumer-facing search bars and mobile apps alike.
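
As a concrete sketch of the retrieval step, here is a brute-force nearest-neighbor search with numpy over an in-memory catalog. It stands in for a real vector database; catalog_embeddings, catalog_ids, and the query embedding are assumed to come from your own ingestion pipeline (e.g. via get_fashion_embeddings from Recipe 2):

import numpy as np

def search_catalog(query_embedding, catalog_embeddings, catalog_ids, top_k=5):
    """Return the top_k most visually similar products.

    catalog_embeddings: (N, D) array of L2-normalized product embeddings
    catalog_ids: N product identifiers, aligned with the rows
    """
    q = np.asarray(query_embedding)
    scores = catalog_embeddings @ q            # dot product == cosine for unit vectors
    best = np.argsort(scores)[::-1][:top_k]    # highest similarity first
    return [(catalog_ids[i], float(scores[i])) for i in best]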

Competitor Monitoring and Counterfeit Detection

For brands and marketplace sellers, visual similarity search is a powerful enforcement tool. By embedding your own product catalog and periodically embedding new listings from competitors or third-party sellers, you can automatically flag listings whose images exceed a similarity threshold — often a cosine similarity above 0.85 indicates a near-identical product. This approach catches cases that text-based monitoring misses entirely: sellers who use different titles, translate descriptions into other languages, or deliberately alter keywords to avoid detection. Paired with scheduled batch jobs, you can build a continuous monitoring pipeline that surfaces potential counterfeits or unauthorized resellers within hours of listing.
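
A minimal flagging pass under those assumptions (the 0.85 threshold comes from the paragraph above and should be tuned per catalog; embeddings are fetched as in Recipe 2):

import numpy as np

SIMILARITY_THRESHOLD = 0.85  # starting point from the discussion above; tune it

def flag_lookalikes(own_embedding, listing_embeddings, listing_ids):
    """Return (listing_id, score) pairs that look near-identical to our product."""
    scores = np.asarray(listing_embeddings) @ np.asarray(own_embedding)
    return [
        (listing_ids[i], float(s))
        for i, s in enumerate(scores)
        if s >= SIMILARITY_THRESHOLD
    ]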

Building a Fashion Trend Dashboard

The detection endpoint opens up a category of analytics that was previously only available to large fashion houses with in-house computer vision teams. By running detection across thousands of social media images, street-style photography feeds, or influencer content, you can count the frequency of each fashion category over time — tracking, for example, the seasonal rise of down-clothing in autumn or the growing prevalence of specific accessory types. Aggregating these detection results by week or month gives you a quantitative trend signal that complements traditional keyword-based trend analysis. Retailers can use these signals to inform buying decisions, adjust inventory allocation, or time marketing campaigns to align with emerging style shifts.
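
A sketch of the aggregation step, assuming you have already run detection over a set of dated images (the input format here, a date mapped to its detected class IDs, is our own illustration):

from collections import Counter
from datetime import date

# Same mapping as the category ID table earlier in this post
CATEGORY_NAMES = {
    0: "Bag", 1: "Cap", 2: "Down-Clothing", 3: "Glasses", 4: "Jewelry",
    5: "Others", 6: "Shoes", 7: "Sock", 8: "Up-Clothing", 9: "Watch",
}

def weekly_category_counts(detections_by_date: dict[date, list[int]]):
    """Aggregate detected class IDs into per-ISO-week category counts."""
    weekly: dict[tuple[int, int], Counter] = {}
    for day, class_ids in detections_by_date.items():
        year, week, _ = day.isocalendar()
        weekly.setdefault((year, week), Counter()).update(
            CATEGORY_NAMES[c] for c in class_ids
        )
    return weekly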

Performance and Latency

Both endpoints are optimized for production workloads. Image embedding requests typically complete in 200-400ms for a single image, including detection and feature extraction. Image detection alone runs in under 200ms. These latencies make it practical to embed images inline during catalog ingestion or to power real-time visual search in consumer-facing applications.

For batch processing — such as embedding an entire product catalog — you can parallelize requests across multiple images. The API handles concurrent requests without degradation, and at 1 credit per request, embedding a 100,000-image catalog costs 100,000 credits. The resulting vectors can be stored in any vector database (Pinecone, Weaviate, Qdrant, pgvector) for sub-millisecond nearest-neighbor retrieval.
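
For batch ingestion, a hedged sketch that fans requests out with a thread pool (get_fashion_embeddings is from Recipe 2; max_workers is a placeholder, so check your plan's rate limits before choosing real concurrency):

from concurrent.futures import ThreadPoolExecutor, as_completed

def embed_catalog(image_urls: list[str], max_workers: int = 8) -> dict:
    """Embed catalog images concurrently; returns url -> embeddings (or None)."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(get_fashion_embeddings, url): url for url in image_urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = None  # collect failures for a separate retry pass
    return results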

References

  1. Gao, C.*, Xue, S.*, et al. "LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval". arXiv:2601.14706, 2026.
  2. Oquab, M., et al. "DINOv2: Learning Robust Visual Features without Supervision". arXiv:2304.07193, 2023.
  3. Deng, J., et al. "ArcFace: Additive Angular Margin Loss for Deep Face Recognition". arXiv:1801.07698, 2018.

Ready to build with APIClaw?

View API Docs · Get Started