Centrality Metrics

Measure node importance using PageRank, betweenness, Katz, HITS, and other centrality algorithms.

Centrality metrics quantify the importance of nodes in a graph. In Lean dependency graphs, they reveal which theorems, lemmas, and definitions are most critical to the codebase.

ð“‚€

Start with PageRank for a quick overview of important nodes, then use betweenness centrality to find critical "bridge" lemmas.

PageRank

PageRank measures the importance of a node based on the structure of incoming links. A node is important if other important nodes depend on it.

Formula

PR(v)=1−dN+d∑u∈N−(v)PR(u)∣N+(u)∣PR(v) = \frac{1-d}{N} + d \sum_{u \in N^-(v)} \frac{PR(u)}{|N^+(u)|}

Where:

  • d=0.85d = 0.85 is the damping factor (probability of following a link vs random jump)
  • NN is the total number of nodes
  • N−(v)N^-(v) is the set of nodes pointing to vv (predecessors)
  • N+(u)N^+(u) is the set of nodes uu points to (successors)

Parameters

ParameterDefaultDescription
alpha0.85Damping factor
max_iter100Maximum iterations for convergence
tol1e-6Convergence tolerance

Interpretation in Lean

  • High PageRank: Fundamental theorems that many other proofs depend on (e.g., basic lemmas in Mathlib.Algebra)
  • Low PageRank: Specialized results at the "leaves" of the dependency tree

Betweenness Centrality

Betweenness measures how often a node lies on the shortest path between other nodes. High betweenness nodes are "bridges" connecting different parts of the graph.

Formula

BC(v)=∑s≠v≠tσst(v)σstBC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

Where:

  • σst\sigma_{st} is the number of shortest paths from ss to tt
  • σst(v)\sigma_{st}(v) is the number of those paths passing through vv

Parameters

ParameterDefaultDescription
normalizedtrueNormalize by 2(n−1)(n−2)\frac{2}{(n-1)(n-2)} for directed graphs
knullSample size for approximation (auto-set for large graphs)

Optimization

ℹ

For graphs with more than 5,000 nodes, Astrolabe automatically uses sampling to speed up computation. Results are approximate but sufficient for identifying key nodes.

For large graphs, sampling uses min(1000, n/5) random source nodes.

Interpretation in Lean

  • High betweenness: "Gateway" lemmas that connect different mathematical domains
  • Low betweenness: Nodes in densely connected local clusters

Katz Centrality

Katz centrality measures influence based on the total number of walks from a node, not just shortest paths. It's better suited for DAGs than PageRank.

Formula

xi=α∑jAijxj+βx_i = \alpha \sum_j A_{ij} x_j + \beta

Where:

  • α\alpha is the attenuation factor (should be <1/λmax< 1/\lambda_{max})
  • β\beta is the base centrality for all nodes
  • AA is the adjacency matrix

Parameters

ParameterDefaultDescription
alpha0.1Attenuation factor
beta1.0Base centrality
max_iter1000Maximum iterations
normalizedtrueNormalize scores

Katz vs PageRank

AspectKatzPageRank
Link weightAll walks countOnly direct links
DAG handlingNative supportMay need adaptation
ComputationCan diverge if α\alpha too highMore stable

HITS (Hub/Authority)

HITS identifies two types of important nodes: hubs that point to many authorities, and authorities pointed to by many hubs.

Formulas

Authority score (iterative): a(v)=∑u→vh(u)a(v) = \sum_{u \to v} h(u)

Hub score (iterative): h(v)=∑v→ua(u)h(v) = \sum_{v \to u} a(u)

Parameters

ParameterDefaultDescription
max_iter100Maximum power method iterations
tol1e-8Convergence tolerance
normalizedtrueNormalize scores

Interpretation in Lean

RoleDescriptionExample
AuthorityFundamental theorems pointed to by many proofsCore type class instances
HubComprehensive proofs that cite many sourcesMajor theorems combining many lemmas

Closeness Centrality

Closeness measures how close a node is to all other nodes. High closeness nodes can reach the rest of the graph quickly.

Formula

CC(v)=n−1∑u≠vd(v,u)CC(v) = \frac{n-1}{\sum_{u \neq v} d(v, u)}

Where d(v,u)d(v, u) is the shortest path distance from vv to uu.

Interpretation in Lean

  • High closeness: Central definitions that are within short dependency chains of most theorems
  • Low closeness: Peripheral or highly specialized results

Eigenvector Centrality

A node is important if it's connected to other important nodes. This is the principal eigenvector of the adjacency matrix.

Formula

xv=1λ∑u∈N(v)xux_v = \frac{1}{\lambda} \sum_{u \in N(v)} x_u

Where λ\lambda is the largest eigenvalue of the adjacency matrix.

Parameters

ParameterDefaultDescription
max_iter1000Maximum power method iterations

Fallback

If the power method fails to converge (common in sparse DAGs), Astrolabe falls back to PageRank.

Comparison Summary

MetricBest ForComplexityLean Interpretation
PageRankGeneral importanceO(E) per iterationFoundational theorems
BetweennessBridge detectionO(VE)Gateway lemmas
KatzDAG influenceO(E) per iterationTotal downstream impact
HITSRole differentiationO(E) per iterationHubs vs authorities
ClosenessAccessibilityO(V(V+E))Central definitions
EigenvectorNetwork effectsO(E) per iterationRecursive importance

API Endpoints

GET /api/project/analysis/pagerank
GET /api/project/analysis/betweenness
GET /api/project/analysis/katz
GET /api/project/analysis/closeness
GET /api/project/analysis/eigenvector

Each returns a dictionary mapping node IDs to centrality scores, plus the top-k highest-scoring nodes.