Centrality Metrics

Measure node importance using PageRank, betweenness, Katz, HITS, and other centrality algorithms.

Centrality metrics quantify the importance of nodes in a graph. In Lean dependency graphs, they reveal which theorems, lemmas, and definitions are most critical to the codebase.

𓂀

Start with PageRank for a quick overview of important nodes, then use betweenness centrality to find critical "bridge" lemmas.

PageRank

PageRank measures the importance of a node based on the structure of incoming links. A node is important if other important nodes depend on it.

Formula

$PR(v) = \frac{1-d}{N} + d \sum_{u \in N^-(v)} \frac{PR(u)}{|N^+(u)|}$

Where:

$d = 0.85$ is the damping factor (probability of following a link vs random jump)
$N$ is the total number of nodes
$N^-(v)$ is the set of nodes pointing to $v$ (predecessors)
$N^+(u)$ is the set of nodes $u$ points to (successors)

Parameters

Parameter	Default	Description
`alpha`	0.85	Damping factor
`max_iter`	100	Maximum iterations for convergence
`tol`	1e-6	Convergence tolerance

Interpretation in Lean

High PageRank: Fundamental theorems that many other proofs depend on (e.g., basic lemmas in Mathlib.Algebra)
Low PageRank: Specialized results at the "leaves" of the dependency tree

Betweenness Centrality

Betweenness measures how often a node lies on the shortest path between other nodes. High betweenness nodes are "bridges" connecting different parts of the graph.

Formula

$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$

Where:

$\sigma_{st}$ is the number of shortest paths from $s$ to $t$
$\sigma_{st}(v)$ is the number of those paths passing through $v$

Parameters

Parameter	Default	Description
`normalized`	true	Normalize by $\frac{2}{(n-1)(n-2)}$ for directed graphs
`k`	null	Sample size for approximation (auto-set for large graphs)

Optimization

ℹ

For graphs with more than 5,000 nodes, Astrolabe automatically uses sampling to speed up computation. Results are approximate but sufficient for identifying key nodes.

For large graphs, sampling uses min(1000, n/5) random source nodes.

Interpretation in Lean

High betweenness: "Gateway" lemmas that connect different mathematical domains
Low betweenness: Nodes in densely connected local clusters

Katz Centrality

Katz centrality measures influence based on the total number of walks from a node, not just shortest paths. It's better suited for DAGs than PageRank.

Formula

$x_i = \alpha \sum_j A_{ij} x_j + \beta$

Where:

$\alpha$ is the attenuation factor (should be $< 1/\lambda_{max}$ )
$\beta$ is the base centrality for all nodes
$A$ is the adjacency matrix

Parameters

Parameter	Default	Description
`alpha`	0.1	Attenuation factor
`beta`	1.0	Base centrality
`max_iter`	1000	Maximum iterations
`normalized`	true	Normalize scores

Katz vs PageRank

Aspect	Katz	PageRank
Link weight	All walks count	Only direct links
DAG handling	Native support	May need adaptation
Computation	Can diverge if $\alpha$ too high	More stable

HITS (Hub/Authority)

HITS identifies two types of important nodes: hubs that point to many authorities, and authorities pointed to by many hubs.

Formulas

Authority score (iterative): $a(v) = \sum_{u \to v} h(u)$

Hub score (iterative): $h(v) = \sum_{v \to u} a(u)$

Parameters

Parameter	Default	Description
`max_iter`	100	Maximum power method iterations
`tol`	1e-8	Convergence tolerance
`normalized`	true	Normalize scores

Interpretation in Lean

Role	Description	Example
Authority	Fundamental theorems pointed to by many proofs	Core type class instances
Hub	Comprehensive proofs that cite many sources	Major theorems combining many lemmas

Closeness Centrality

Closeness measures how close a node is to all other nodes. High closeness nodes can reach the rest of the graph quickly.

Formula

$CC(v) = \frac{n-1}{\sum_{u \neq v} d(v, u)}$

Where $d(v, u)$ is the shortest path distance from $v$ to $u$ .

Interpretation in Lean

High closeness: Central definitions that are within short dependency chains of most theorems
Low closeness: Peripheral or highly specialized results

Eigenvector Centrality

A node is important if it's connected to other important nodes. This is the principal eigenvector of the adjacency matrix.

Formula

$x_v = \frac{1}{\lambda} \sum_{u \in N(v)} x_u$

Where $\lambda$ is the largest eigenvalue of the adjacency matrix.

Parameters

Parameter	Default	Description
`max_iter`	1000	Maximum power method iterations

Fallback

If the power method fails to converge (common in sparse DAGs), Astrolabe falls back to PageRank.

Comparison Summary

Metric	Best For	Complexity	Lean Interpretation
PageRank	General importance	O(E) per iteration	Foundational theorems
Betweenness	Bridge detection	O(VE)	Gateway lemmas
Katz	DAG influence	O(E) per iteration	Total downstream impact
HITS	Role differentiation	O(E) per iteration	Hubs vs authorities
Closeness	Accessibility	O(V(V+E))	Central definitions
Eigenvector	Network effects	O(E) per iteration	Recursive importance

API Endpoints

GET /api/project/analysis/pagerank
GET /api/project/analysis/betweenness
GET /api/project/analysis/katz
GET /api/project/analysis/closeness
GET /api/project/analysis/eigenvector

Each returns a dictionary mapping node IDs to centrality scores, plus the top-k highest-scoring nodes.