Quickstart
==========

PGCuts provides a scikit-learn-compatible API: construct an estimator, then
call ``fit_predict``. If you've used ``KMeans`` or ``SpectralClustering``,
you already know how to use PGCuts.

Basic usage
-----------

.. code-block:: python

   from pgcuts import HyCut

   # X is an (N, D) numpy array of features (e.g., from DINOv2)
   labels = HyCut(n_clusters=10).fit_predict(X)

That's it. Under the hood, ``HyCut``:

1. Builds a KNN similarity graph on ``X``
2. Trains a linear model to minimize the hypergeometric NCut bound
3. Returns hard cluster assignments via argmax
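
The first and last steps can be sketched with scikit-learn and NumPy alone.
This is an illustration under stated assumptions, not the PGCuts
implementation: the helper name ``knn_rbf_graph``, the median-distance
bandwidth, and the symmetrisation rule are hypothetical choices, and the
training in step 2 is replaced by untrained random weights.

.. code-block:: python

   import numpy as np
   from sklearn.neighbors import kneighbors_graph

   def knn_rbf_graph(X, n_neighbors=10):
       """Symmetric KNN graph with RBF (Gaussian) edge weights (illustrative)."""
       # Sparse (N, N) matrix; stored entries are distances to the k nearest neighbours
       D = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
       sigma = np.median(D.data)  # assumed bandwidth heuristic
       W = D.copy()
       W.data = np.exp(-(D.data ** 2) / (2.0 * sigma ** 2))
       # Keep an edge if either endpoint lists the other as a neighbour
       return W.maximum(W.T)

   rng = np.random.default_rng(0)
   X_demo = rng.normal(size=(200, 16))
   W_demo = knn_rbf_graph(X_demo, n_neighbors=10)

   # Step 3 with an untrained linear model: one logit per cluster, then argmax
   n_clusters = 4
   weights = rng.normal(size=(X_demo.shape[1], n_clusters))
   labels_demo = (X_demo @ weights).argmax(axis=1)

With random weights the labels are meaningless. The sketch only shows the
shapes involved: an ``(N, N)`` sparse graph and an ``(N,)`` label vector.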

Comparison with sklearn
-----------------------

``HyCut`` is a drop-in alternative to scikit-learn's clustering estimators:

.. code-block:: python

   from sklearn.cluster import KMeans, SpectralClustering
   from pgcuts import HyCut

   X = ...  # (N, D) feature matrix

   # K-Means
   labels_km = KMeans(n_clusters=10).fit_predict(X)

   # Spectral Clustering
   labels_sc = SpectralClustering(n_clusters=10).fit_predict(X)

   # PGCuts (Hypergeometric NCut)
   labels_hycut = HyCut(n_clusters=10).fit_predict(X)

Choosing the objective
----------------------

PGCuts supports three graph cut objectives:

.. code-block:: python

   # Hypergeometric NCut (default) — best for most cases
   HyCut(n_clusters=K, objective="hyp_ncut")

   # Hypergeometric RatioCut — simpler, no degree binning
   HyCut(n_clusters=K, objective="hyp_rcut")

   # Probabilistic RatioCut — PRCut baseline
   HyCut(n_clusters=K, objective="prcut")

**When to use which:**

- ``hyp_ncut``: Default. Best when cluster sizes are unbalanced or
  the graph has a heterogeneous degree distribution.
- ``hyp_rcut``: Simpler variant. Works well when clusters are
  roughly equal-sized.
- ``prcut``: Original PRCut objective. Useful as a baseline.
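
The practical difference between the two normalisations is easiest to see in
the textbook definitions behind the names: RatioCut divides each cluster's cut
weight by the cluster size, while NCut divides it by the cluster volume (the
sum of its node degrees). The sketch below evaluates both for a hard
labelling. ``cut_values`` is a hypothetical helper, and it is an assumption
that PGCuts uses the same scaling conventions (some references include an
extra factor of 1/2).

.. code-block:: python

   import numpy as np

   def cut_values(W, labels):
       """Return (ratio_cut, normalized_cut) for a hard partition of a dense symmetric graph."""
       W = np.asarray(W, dtype=float)
       degrees = W.sum(axis=1)
       rcut = ncut = 0.0
       for k in np.unique(labels):
           inside = labels == k
           cut = W[inside][:, ~inside].sum()     # weight of edges leaving cluster k
           rcut += cut / inside.sum()            # normalise by cluster size
           ncut += cut / degrees[inside].sum()   # normalise by cluster volume
       return rcut, ncut

   # Two triangles joined by a single weak edge of weight 0.1
   W_toy = np.zeros((6, 6))
   for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
       W_toy[i, j] = W_toy[j, i] = 1.0
   W_toy[2, 3] = W_toy[3, 2] = 0.1
   labels_toy = np.array([0, 0, 0, 1, 1, 1])
   rcut_toy, ncut_toy = cut_values(W_toy, labels_toy)

When node degrees vary a lot, cluster size and cluster volume stop being
proportional, and that is where the two objectives can prefer different
partitions.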

Common options
--------------

.. code-block:: python

   model = HyCut(
       n_clusters=10,
       objective="hyp_ncut",    # "hyp_ncut", "hyp_rcut", or "prcut"
       n_neighbors=50,          # KNN graph neighbors (default: 50)
       steps=3000,              # optimization steps (default: 3000)
       lr=1e-3,                 # learning rate (default: 1e-3)
       m=512,                   # polynomial degree for 2F1 bound
       device="cuda",           # "cuda" or "cpu"
   )
   labels = model.fit_predict(X)

   # Access cut values after fitting
   print(f"NCut: {model.ncut_:.4f}, RCut: {model.rcut_:.4f}")

Evaluation
----------

.. code-block:: python

   from pgcuts import evaluate_clustering

   # y_true: (N,) ground-truth labels; labels: (N,) output of fit_predict
   results = evaluate_clustering(y_true, labels, n_clusters=10)
   print(f"ACC: {results['accuracy']:.4f}")
   print(f"NMI: {results['nmi']:.4f}")
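
Cluster IDs are arbitrary, so clustering accuracy is usually computed after
matching predicted clusters to ground-truth classes. The sketch below does
this with the Hungarian algorithm from SciPy and adds scikit-learn's NMI. It
is an assumption that ``evaluate_clustering`` follows the same recipe, and
``clustering_accuracy`` is a hypothetical helper name.

.. code-block:: python

   import numpy as np
   from scipy.optimize import linear_sum_assignment
   from sklearn.metrics import normalized_mutual_info_score

   def clustering_accuracy(y_true, y_pred):
       """Accuracy under the best one-to-one mapping of clusters to classes."""
       classes, true_idx = np.unique(y_true, return_inverse=True)
       clusters, pred_idx = np.unique(y_pred, return_inverse=True)
       # Contingency table: rows are predicted clusters, columns are true classes
       counts = np.zeros((len(clusters), len(classes)), dtype=np.int64)
       np.add.at(counts, (pred_idx, true_idx), 1)
       # Pick the cluster-to-class assignment that maximises the matched count
       rows, cols = linear_sum_assignment(counts, maximize=True)
       return counts[rows, cols].sum() / len(y_true)

   y_true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2])
   y_pred_demo = np.array([2, 2, 2, 0, 0, 1, 1, 1])  # permuted IDs, one mistake
   acc_demo = clustering_accuracy(y_true_demo, y_pred_demo)   # 7 of 8 matched
   nmi_demo = normalized_mutual_info_score(y_true_demo, y_pred_demo)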

Working with embeddings
-----------------------

PGCuts works best with pre-extracted embeddings from foundation models
(DINOv2, CLIP, etc.):

.. code-block:: python

   import numpy as np
   from pgcuts import HyCut

   # Load pre-extracted features
   X = np.load("features.npy")       # (N, D) float32
   y = np.load("labels.npy")         # (N,) int, for evaluation only

   model = HyCut(n_clusters=len(np.unique(y)))
   labels = model.fit_predict(X)

Graph quality
-------------

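Before clustering, check whether the KNN graph captures class structure.
The check needs ground-truth labels ``y`` (integer class IDs), so it applies
to labelled benchmark or validation data: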

.. code-block:: python

   from pgcuts.graph import build_rbf_knn_graph
   from pgcuts.metrics import compute_rcut_ncut
   import numpy as np

   W = build_rbf_knn_graph(X, n_neighbors=50)

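   # Row-normalise W into a transition matrix T (each row sums to 1).
   # Flatten the degree vector first so the division broadcasts per row
   # whether W.sum returns an np.matrix or a 1-D array.
   deg = np.asarray(W.sum(axis=1)).ravel()
   T = W.toarray() / deg[:, None]

   # Graph quality Q: 1.0 = perfect, 0.0 = chance level
   # y: (N,) integer ground-truth labels
   q = np.mean([T[i] @ (y == y[i]) for i in range(len(y))])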
   q_chance = np.sum((np.bincount(y) / len(y)) ** 2)
   Q = (q - q_chance) / (1 - q_chance)
   print(f"Graph quality Q = {Q:.3f}")

If ``Q < 0.3``, the embeddings probably don't separate the classes well
enough for a graph-cut method to recover them. Try a stronger embedding model.
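
Converting ``W`` to a dense array needs O(N^2) memory, which becomes a problem
for large ``N``. The same quantity can be computed directly on the sparse
matrix. This is a sketch, assuming ``W`` is a SciPy sparse matrix (or anything
``scipy.sparse.coo_matrix`` accepts), every node has at least one edge, and
``y`` holds non-negative integer labels. ``graph_quality`` is a hypothetical
helper, not part of PGCuts.

.. code-block:: python

   import numpy as np
   import scipy.sparse as sp

   def graph_quality(W, y):
       """Chance-corrected share of each node's edge weight that stays in its own class."""
       W = sp.coo_matrix(W)
       y = np.asarray(y)
       degrees = np.asarray(W.sum(axis=1)).ravel()
       same_class = y[W.row] == y[W.col]
       # Per-node total weight of edges that lead to same-class neighbours
       same_weight = np.bincount(
           W.row, weights=W.data * same_class, minlength=W.shape[0]
       )
       q = np.mean(same_weight / degrees)
       q_chance = np.sum((np.bincount(y) / len(y)) ** 2)
       return (q - q_chance) / (1.0 - q_chance)

   # Two disconnected triangles whose components match the labels exactly
   block = np.ones((3, 3)) - np.eye(3)
   W_blocks = sp.block_diag([block, block])
   y_blocks = np.array([0, 0, 0, 1, 1, 1])
   Q_blocks = graph_quality(W_blocks, y_blocks)

In this toy graph every edge stays inside its class, so ``Q_blocks`` reaches
the upper end of the scale.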