Community Detection
Discover clusters and communities using Louvain, spectral clustering, and other algorithms.
Community detection algorithms partition the graph into groups of densely connected nodes. In Lean projects, communities often correspond to mathematical domains or related proof clusters.
Modularity
Modularity measures the quality of a graph partition. It compares the number of edges within communities to the expected number in a random graph.
Formula
Q = (1/(2m)) Σ_ij [ A_ij - (k_i k_j)/(2m) ] δ(c_i, c_j)
Where:
- A_ij is the adjacency matrix
- k_i, k_j are node degrees
- m is the total number of edges
- δ(c_i, c_j) = 1 if nodes i and j are in the same community, 0 otherwise
Interpretation
| Modularity | Meaning |
|---|---|
| Q > 0.3 | Strong community structure |
| 0.1 < Q < 0.3 | Moderate structure |
| Q < 0.1 | Weak or no structure |
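As a quick illustration, here is a minimal sketch of computing Q for a candidate partition with networkx; `karate_club_graph` and the two-set partition are stand-ins for a real Lean dependency graph and its communities.

```python
# Sketch: modularity of a given partition, using networkx.
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.karate_club_graph()  # illustrative stand-in for a dependency graph

# A partition is a list of disjoint node sets covering the whole graph.
communities = [set(range(0, 17)), set(range(17, 34))]

Q = modularity(G, communities)
print(f"Q = {Q:.3f}")  # Q > 0.3 would suggest strong community structure
```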
Louvain Algorithm
The Louvain algorithm optimizes modularity through a greedy hierarchical approach. It's fast and produces high-quality partitions.
Algorithm
- Local optimization: Move each node to the community that maximizes modularity gain
- Aggregation: Build a new graph where nodes are communities
- Repeat until modularity no longer improves
Parameters
| Parameter | Default | Description |
|---|---|---|
| resolution | 1.0 | Higher values produce more communities |
| random_state | 42 | Random seed for reproducibility |
Complexity
O(n log n) for most real-world graphs
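A minimal sketch of running Louvain with the parameters from the table above, assuming networkx >= 2.8; the graph is illustrative.

```python
# Sketch: Louvain community detection with networkx.
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()  # illustrative stand-in for a dependency graph

# resolution and seed mirror the parameter table above
communities = louvain_communities(G, resolution=1.0, seed=42)

print(f"{len(communities)} communities, Q = {modularity(G, communities):.3f}")
```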
Resolution Parameter
The resolution parameter γ scales the null-model term in the modularity formula:
Q_γ = (1/(2m)) Σ_ij [ A_ij - γ (k_i k_j)/(2m) ] δ(c_i, c_j)
| Resolution | Effect |
|---|---|
| γ < 1 | Fewer, larger communities |
| γ = 1 | Standard modularity |
| γ > 1 | More, smaller communities |
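A short sketch of sweeping γ to observe this effect (networkx assumed, illustrative graph):

```python
# Sketch: community count as a function of the resolution parameter.
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.karate_club_graph()  # illustrative graph

for gamma in (0.5, 1.0, 2.0):
    parts = louvain_communities(G, resolution=gamma, seed=42)
    print(f"resolution={gamma}: {len(parts)} communities")
```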
Spectral Clustering
Spectral clustering uses the eigenvectors of the graph Laplacian to embed nodes in a low-dimensional space, then applies k-means clustering.
Algorithm
- Compute Laplacian: L = D - A (combinatorial) or L = I - D⁻¹/²AD⁻¹/² (normalized)
- Eigendecomposition: Find the eigenvectors belonging to the k smallest eigenvalues (skip the trivial first one)
- Embed: Use eigenvectors as coordinates
- Cluster: Apply k-means in the embedding space
Parameters
| Parameter | Default | Description |
|---|---|---|
| n_clusters | 2 | Number of clusters to find |
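A sketch of the pipeline above, assuming numpy, scikit-learn, and networkx; the graph and cluster count are illustrative.

```python
# Sketch: normalized Laplacian -> eigenvectors -> k-means.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

G = nx.karate_club_graph()  # illustrative graph
n_clusters = 2

# Normalized Laplacian L = I - D^(-1/2) A D^(-1/2)
L = nx.normalized_laplacian_matrix(G).toarray()

# eigh returns eigenvalues in ascending order; skip the trivial first
# eigenvector and use the next n_clusters as coordinates.
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, 1:n_clusters + 1]

labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(embedding)
print(dict(zip(G.nodes(), labels)))
```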
Fiedler Vector
The eigenvector associated with the second-smallest Laplacian eigenvalue (the Fiedler vector) provides a 1D embedding that approximates the best balanced two-way cut of the graph.
- Positive values: One side of the partition
- Negative values: Other side
- Near zero: Nodes between communities
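A sketch of reading off the two-way split from the sign of the Fiedler vector (networkx assumed; the graph is illustrative):

```python
# Sketch: bipartition from the sign of the Fiedler vector.
import networkx as nx

G = nx.karate_club_graph()  # illustrative graph

fiedler = nx.fiedler_vector(G, seed=42)  # eigenvector for the second-smallest eigenvalue
side_a = [n for n, v in zip(G.nodes(), fiedler) if v >= 0]
side_b = [n for n, v in zip(G.nodes(), fiedler) if v < 0]
print(len(side_a), len(side_b))
```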
Spectral Gap
The spectral gap is the difference between consecutive Laplacian eigenvalues, where λ₁ = 0 (always). A large gap after the first k eigenvalues (between λ_k and λ_{k+1}) indicates strong community structure with roughly k well-separated communities.
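A sketch of inspecting the Laplacian spectrum and its gaps (networkx and numpy assumed; the graph is illustrative):

```python
# Sketch: Laplacian eigenvalues and the gaps between them.
import networkx as nx
import numpy as np

G = nx.karate_club_graph()  # illustrative graph

eigvals = np.sort(nx.laplacian_spectrum(G))  # ascending; eigvals[0] == 0
gaps = np.diff(eigvals)
print("smallest eigenvalues:", np.round(eigvals[:5], 3))
print("gaps:", np.round(gaps[:4], 3))
```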
Label Propagation
Label propagation is a fast, simple algorithm that iteratively assigns each node the most common label among its neighbors.
Algorithm
- Assign unique labels to all nodes
- For each node, update label to the majority among neighbors
- Repeat until convergence
Characteristics
| Property | Description |
|---|---|
| Speed | Very fast, O(E) per iteration |
| Determinism | Non-deterministic (random tie-breaking) |
| Quality | Good for well-separated communities |
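A sketch using networkx's asynchronous label propagation; the seed fixes the random tie-breaking, and the graph is illustrative.

```python
# Sketch: label propagation with networkx.
import networkx as nx
from networkx.algorithms.community import asyn_lpa_communities

G = nx.karate_club_graph()  # illustrative graph

communities = list(asyn_lpa_communities(G, seed=42))
print(f"{len(communities)} communities")
```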
Hierarchical Clustering
Hierarchical clustering builds a dendrogram showing nested community structure at multiple scales.
Algorithm
- Compute distance matrix: Based on shortest paths or structural dissimilarity
- Agglomerative clustering: Merge closest clusters iteratively
- Cut dendrogram: At desired level for flat clustering
Linkage Methods
| Method | Description |
|---|---|
| single | Minimum distance between clusters |
| complete | Maximum distance between clusters |
| average | Average distance between clusters |
| ward | Minimize within-cluster variance |
Output
Returns both:
- Dendrogram: Full hierarchical structure
- Flat clustering: Partition at specified number of clusters
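A sketch of this pipeline with scipy and networkx, using shortest-path distances, average linkage, and a cut at five clusters (all choices illustrative):

```python
# Sketch: shortest-path distances -> average linkage -> flat clustering.
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

G = nx.karate_club_graph()  # illustrative, connected graph

# Distance matrix based on shortest paths
D = nx.floyd_warshall_numpy(G)

# Condense the symmetric matrix and build the dendrogram
Z = linkage(squareform(D, checks=False), method="average")

# Cut the dendrogram into a flat clustering with 5 clusters
labels = fcluster(Z, t=5, criterion="maxclust")
for c in sorted(set(labels)):
    print(f"cluster {c}: {(labels == c).sum()} nodes")
```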
Comparison
| Algorithm | Speed | Quality | Parameters | Deterministic |
|---|---|---|---|---|
| Louvain | Fast | High | resolution | No |
| Spectral | Medium | High | n_clusters | Yes |
| Label Prop | Very fast | Medium | None | No |
| Hierarchical | Slow | Flexible | method, n_clusters | Yes |
Interpretation in Lean Projects
| Community Type | Description | Examples |
|---|---|---|
| Domain clusters | Related mathematical areas | Algebra, topology, analysis |
| Dependency chains | Proofs building on each other | Theorem → lemma hierarchy |
| Isolated modules | Self-contained components | Utility functions |
Clustering Coefficient
The clustering coefficient measures local density around each node.
Local Clustering
The fraction of a node's neighbors that are also connected to each other.
Global Clustering (Transitivity)
The global clustering coefficient (transitivity) is three times the number of triangles divided by the number of connected triples in the graph.
Interpretation
| Value | Meaning |
|---|---|
| High local | Node is in a tight cluster |
| High global | Graph has many triangles (social network-like) |
| Low | Tree-like or sparse structure |
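A sketch of computing both coefficients with networkx (illustrative graph):

```python
# Sketch: local and global clustering coefficients.
import networkx as nx

G = nx.karate_club_graph()  # illustrative graph

local = nx.clustering(G)           # node -> local clustering coefficient
global_cc = nx.transitivity(G)     # 3 * triangles / connected triples
avg_cc = nx.average_clustering(G)  # mean of the local coefficients

print(f"transitivity = {global_cc:.3f}, average clustering = {avg_cc:.3f}")
```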
API Endpoints
GET /api/project/analysis/communities # Louvain (default)
GET /api/project/analysis/spectral?n_clusters=5
GET /api/project/analysis/hierarchical?n_clusters=5
GET /api/project/analysis/clustering # Clustering coefficients
Each community detection endpoint returns:
- partition: Node → community mapping
- communities: Community → node list
- modularity: Quality score
- num_communities: Count
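A sketch of consuming one of these endpoints with Python's requests library; the base URL is an assumption, not part of the documented API.

```python
# Sketch: fetch the Louvain communities from the analysis API.
import requests

BASE = "http://localhost:8000"  # assumed host/port; adjust to your deployment

resp = requests.get(f"{BASE}/api/project/analysis/communities")
resp.raise_for_status()
data = resp.json()

print("communities:", data["num_communities"])
print("modularity:", data["modularity"])
# data["partition"] maps each node to its community id
# data["communities"] maps each community id to its list of nodes
```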