Community Detection

Discover clusters and communities using Louvain, spectral clustering, and other algorithms.

Community detection algorithms partition the graph into groups of densely connected nodes. In Lean projects, communities often correspond to mathematical domains or related proof clusters.

Modularity

Modularity measures the quality of a graph partition. It compares the number of edges within communities to the expected number in a random graph.

Formula

$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$

Where:

A_ij is the adjacency matrix
k_i, k_j are node degrees
m is the total number of edges
δ(c_i, c_j) = 1 if nodes i, j are in the same community

Interpretation

Modularity	Meaning
Q > 0.3	Strong community structure
0.1 < Q < 0.3	Moderate structure
Q < 0.1	Weak or no structure

Louvain Algorithm

The Louvain algorithm optimizes modularity through a greedy hierarchical approach. It's fast and produces high-quality partitions.

Algorithm

Local optimization: Move each node to the community that maximizes modularity gain
Aggregation: Build a new graph where nodes are communities
Repeat until modularity no longer improves

Parameters

Parameter	Default	Description
resolution	1.0	Higher values produce more communities
random_state	42	Random seed for reproducibility

Complexity

O(n log n) for most real-world graphs

Resolution Parameter

The resolution parameter γ modifies the modularity formula:

$Q_\gamma = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$

Resolution	Effect
γ < 1	Fewer, larger communities
γ = 1	Standard modularity
γ > 1	More, smaller communities

Spectral Clustering

Spectral clustering uses the eigenvectors of the graph Laplacian to embed nodes in a low-dimensional space, then applies k-means clustering.

Algorithm

Compute Laplacian: L = D - A (combinatorial) or L = I - D⁻¹/²AD⁻¹/² (normalized)
Eigendecomposition: Find the k smallest eigenvectors (skip the trivial first)
Embed: Use eigenvectors as coordinates
Cluster: Apply k-means in the embedding space

Parameters

Parameter	Default	Description
n_clusters	2	Number of clusters to find

Fiedler Vector

The second-smallest eigenvector (Fiedler vector) provides a 1D embedding that optimally separates the graph.

Positive values: One side of the partition
Negative values: Other side
Near zero: Nodes between communities

Spectral Gap

$\text{gap} = \lambda_2 - \lambda_1$

Where λ₁ = 0 (always). A large spectral gap indicates strong community structure.

Label Propagation

Label propagation is a fast, simple algorithm that iteratively assigns each node the most common label among its neighbors.

Algorithm

Assign unique labels to all nodes
For each node, update label to the majority among neighbors
Repeat until convergence

Characteristics

Property	Description
Speed	Very fast, O(E) per iteration
Determinism	Non-deterministic (random tie-breaking)
Quality	Good for well-separated communities

Hierarchical Clustering

Hierarchical clustering builds a dendrogram showing nested community structure at multiple scales.

Algorithm

Compute distance matrix: Based on shortest paths or structural dissimilarity
Agglomerative clustering: Merge closest clusters iteratively
Cut dendrogram: At desired level for flat clustering

Linkage Methods

Method	Description
single	Minimum distance between clusters
complete	Maximum distance between clusters
average	Average distance between clusters
ward	Minimize within-cluster variance

Output

Returns both:

Dendrogram: Full hierarchical structure
Flat clustering: Partition at specified number of clusters

Comparison

Algorithm	Speed	Quality	Parameters	Deterministic
Louvain	Fast	High	resolution	No
Spectral	Medium	High	n_clusters	Yes
Label Prop	Very fast	Medium	None	No
Hierarchical	Slow	Flexible	method, n_clusters	Yes

Interpretation in Lean Projects

Community Type	Description	Examples
Domain clusters	Related mathematical areas	Algebra, topology, analysis
Dependency chains	Proofs building on each other	Theorem → lemma hierarchy
Isolated modules	Self-contained components	Utility functions

Clustering Coefficient

The clustering coefficient measures local density around each node.

Local Clustering

$C_i = \frac{|\{e_{jk} : v_j, v_k \in N_i, e_{jk} \in E\}|}{|N_i|(|N_i| - 1)}$

The fraction of a node's neighbors that are also connected to each other.

Global Clustering (Transitivity)

$C_{global} = \frac{3 \times \text{triangles}}{\text{connected triples}}$

Interpretation

Value	Meaning
High local	Node is in a tight cluster
High global	Graph has many triangles (social network-like)
Low	Tree-like or sparse structure

API Endpoints

GET /api/project/analysis/communities          # Louvain (default)
GET /api/project/analysis/spectral?n_clusters=5
GET /api/project/analysis/hierarchical?n_clusters=5
GET /api/project/analysis/clustering           # Clustering coefficients

Each community detection endpoint returns:

partition: Node → community mapping
communities: Community → node list
modularity: Quality score
num_communities: Count