Community Detection

Discover clusters and communities using Louvain, spectral clustering, and other algorithms.

Community detection algorithms partition the graph into groups of densely connected nodes. In Lean projects, communities often correspond to mathematical domains or related proof clusters.

Modularity

Modularity measures the quality of a graph partition. It compares the number of edges within communities to the expected number in a random graph.

Formula

Q=12mij[Aijkikj2m]δ(ci,cj)Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

Where:

  • A_ij is the adjacency matrix
  • k_i, k_j are node degrees
  • m is the total number of edges
  • δ(c_i, c_j) = 1 if nodes i, j are in the same community

Interpretation

ModularityMeaning
Q > 0.3Strong community structure
0.1 < Q < 0.3Moderate structure
Q < 0.1Weak or no structure

Louvain Algorithm

The Louvain algorithm optimizes modularity through a greedy hierarchical approach. It's fast and produces high-quality partitions.

Algorithm

  1. Local optimization: Move each node to the community that maximizes modularity gain
  2. Aggregation: Build a new graph where nodes are communities
  3. Repeat until modularity no longer improves

Parameters

ParameterDefaultDescription
resolution1.0Higher values produce more communities
random_state42Random seed for reproducibility

Complexity

O(n log n) for most real-world graphs

Resolution Parameter

The resolution parameter γ modifies the modularity formula:

Qγ=12mij[Aijγkikj2m]δ(ci,cj)Q_\gamma = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

ResolutionEffect
γ < 1Fewer, larger communities
γ = 1Standard modularity
γ > 1More, smaller communities

Spectral Clustering

Spectral clustering uses the eigenvectors of the graph Laplacian to embed nodes in a low-dimensional space, then applies k-means clustering.

Algorithm

  1. Compute Laplacian: L = D - A (combinatorial) or L = I - D⁻¹/²AD⁻¹/² (normalized)
  2. Eigendecomposition: Find the k smallest eigenvectors (skip the trivial first)
  3. Embed: Use eigenvectors as coordinates
  4. Cluster: Apply k-means in the embedding space

Parameters

ParameterDefaultDescription
n_clusters2Number of clusters to find

Fiedler Vector

The second-smallest eigenvector (Fiedler vector) provides a 1D embedding that optimally separates the graph.

  • Positive values: One side of the partition
  • Negative values: Other side
  • Near zero: Nodes between communities

Spectral Gap

gap=λ2λ1\text{gap} = \lambda_2 - \lambda_1

Where λ₁ = 0 (always). A large spectral gap indicates strong community structure.

Label Propagation

Label propagation is a fast, simple algorithm that iteratively assigns each node the most common label among its neighbors.

Algorithm

  1. Assign unique labels to all nodes
  2. For each node, update label to the majority among neighbors
  3. Repeat until convergence

Characteristics

PropertyDescription
SpeedVery fast, O(E) per iteration
DeterminismNon-deterministic (random tie-breaking)
QualityGood for well-separated communities

Hierarchical Clustering

Hierarchical clustering builds a dendrogram showing nested community structure at multiple scales.

Algorithm

  1. Compute distance matrix: Based on shortest paths or structural dissimilarity
  2. Agglomerative clustering: Merge closest clusters iteratively
  3. Cut dendrogram: At desired level for flat clustering

Linkage Methods

MethodDescription
singleMinimum distance between clusters
completeMaximum distance between clusters
averageAverage distance between clusters
wardMinimize within-cluster variance

Output

Returns both:

  • Dendrogram: Full hierarchical structure
  • Flat clustering: Partition at specified number of clusters

Comparison

AlgorithmSpeedQualityParametersDeterministic
LouvainFastHighresolutionNo
SpectralMediumHighn_clustersYes
Label PropVery fastMediumNoneNo
HierarchicalSlowFlexiblemethod, n_clustersYes

Interpretation in Lean Projects

Community TypeDescriptionExamples
Domain clustersRelated mathematical areasAlgebra, topology, analysis
Dependency chainsProofs building on each otherTheorem → lemma hierarchy
Isolated modulesSelf-contained componentsUtility functions

Clustering Coefficient

The clustering coefficient measures local density around each node.

Local Clustering

Ci={ejk:vj,vkNi,ejkE}Ni(Ni1)C_i = \frac{|\{e_{jk} : v_j, v_k \in N_i, e_{jk} \in E\}|}{|N_i|(|N_i| - 1)}

The fraction of a node's neighbors that are also connected to each other.

Global Clustering (Transitivity)

Cglobal=3×trianglesconnected triplesC_{global} = \frac{3 \times \text{triangles}}{\text{connected triples}}

Interpretation

ValueMeaning
High localNode is in a tight cluster
High globalGraph has many triangles (social network-like)
LowTree-like or sparse structure

API Endpoints

GET /api/project/analysis/communities          # Louvain (default)
GET /api/project/analysis/spectral?n_clusters=5
GET /api/project/analysis/hierarchical?n_clusters=5
GET /api/project/analysis/clustering           # Clustering coefficients

Each community detection endpoint returns:

  • partition: Node → community mapping
  • communities: Community → node list
  • modularity: Quality score
  • num_communities: Count