Statistical Analysis

Analyze degree distributions, correlations, assortativity, and detect anomalies.

Statistical analysis reveals patterns in graph structure and identifies unusual nodes that may deserve attention.

Degree Distribution Fitting

Real-world networks often follow specific degree distributions. Astrolabe fits several models to determine the best match.

Models

Power Law

$P(k) \propto k^{-\gamma}$

Scale-free networks where a few nodes have very high degree. Common in natural and social networks.

MLE Estimator:

$\hat{\gamma} \approx 1 + \frac{n}{\sum_{i=1}^n \ln \frac{x_i}{x_{min}}}$

Exponential

$P(k) \propto e^{-k/\kappa}$

Networks with a characteristic scale. Degrees decay exponentially.

Truncated Power Law

$P(k) \propto k^{-\gamma} e^{-k/\kappa}$

Power law with an exponential cutoff for very high degrees.

Lognormal

$P(k) \propto \frac{1}{k} e^{-\frac{(\ln k - \mu)^2}{2\sigma^2}}$

Multiplicative growth processes produce lognormal distributions.

Fitting Process

Compute degree sequence
Fit each model using MLE or powerlaw library
Compare models using likelihood ratio tests
Report best fit and parameters

Output

{
  "degree_stats": {"min": 1, "max": 42, "mean": 5.3, "std": 4.2, "median": 4},
  "fits": {
    "power_law": {"alpha": 2.3, "xmin": 2},
    "exponential": {"lambda": 0.19},
    "lognormal": {"mu": 1.4, "sigma": 0.8}
  },
  "best_model": "power_law",
  "is_scale_free": true,
  "power_law_vs_exponential": {"R": 3.2, "p_value": 0.01}
}

Interpretation in Lean

Distribution	Interpretation
Power law	Few "hub" theorems, many specialized lemmas
Exponential	More uniform dependency structure
Lognormal	Gradual hierarchy of importance

Correlation Analysis

Compute pairwise correlations between different node metrics.

Methods

Method	Assumption	Best For
Pearson	Linear relationship	Normally distributed metrics
Spearman	Monotonic relationship	Ordinal or skewed data

Formula (Spearman)

$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$

Where d_i is the difference in ranks for each observation.

Correlation Strength

Absolute ρ	Strength
≥ 0.8	Very strong
≥ 0.6	Strong
≥ 0.4	Moderate
≥ 0.2	Weak
< 0.2	Negligible

Output

{
  "correlation_matrix": [[1.0, 0.85], [0.85, 1.0]],
  "p_value_matrix": [[0.0, 0.001], [0.001, 0.0]],
  "significant_pairs": [
    {"metric_a": "pagerank", "metric_b": "betweenness", "correlation": 0.85}
  ]
}

Common Correlations in Lean

Metrics	Expected Correlation	Meaning
PageRank vs In-degree	Strong positive	Importance from citations
Betweenness vs Clustering	Negative	Bridges have low clustering
Out-degree vs Depth	Positive	Deep nodes have more deps

Degree Assortativity

Assortativity measures whether high-degree nodes tend to connect to other high-degree nodes.

Formula

Pearson correlation of degrees at edge endpoints:

$r = \frac{\sum_{jk} jk(e_{jk} - q_j q_k)}{\sigma_q^2}$

Where:

e_jk is the fraction of edges connecting nodes of degrees j and k
q_j is the degree distribution
σ_q is the standard deviation of degree

Interpretation

r Value	Type	Description
> 0.3	Strongly assortative	Hub-hub connections (social networks)
0.1 to 0.3	Weakly assortative	Some hub preference
-0.1 to 0.1	Neutral	No degree preference
-0.3 to -0.1	Weakly disassortative	Hub-leaf connections
< -0.3	Strongly disassortative	Star-like structure (technical networks)

In Lean Projects

Most dependency graphs are disassortative: high-degree "hub" theorems are used by many low-degree specialized results.

Anomaly Detection

Identify unusual nodes that deviate from expected patterns.

Z-Score Method

The simplest approach: flag nodes with metrics far from the mean.

$z = \frac{x - \mu}{\sigma}$

Parameters:

threshold: 2.0 (roughly 95% confidence)

Output:

Nodes with |z| > threshold for any metric
Multi-anomaly nodes (unusual in ≥2 metrics)

Mahalanobis Distance

Accounts for correlations between metrics (unlike z-score).

$d_M = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$

Where:

x is the metric vector for a node
μ is the mean vector
Σ is the covariance matrix

Parameters:

threshold: 3.0

Advantage: Finds nodes unusual in the combination of metrics, not just individual outliers.

Local Outlier Factor (LOF)

Compares local density of a node to its neighbors.

$LOF(A) = \frac{\sum_{B \in N_k(A)} \frac{lrd(B)}{lrd(A)}}{|N_k(A)|}$

Where lrd is the local reachability density.

Parameters:

Parameter	Default	Description
n_neighbors	20	Number of neighbors
contamination	0.1	Expected outlier fraction

Interpretation:

LOF ≈ 1: Normal density
LOF > 1: Lower density than neighbors (outlier)
LOF < 1: Higher density than neighbors

Isolation Forest

Ensemble method that isolates observations using random splits.

Intuition: Outliers are easier to isolate (require fewer splits).

Parameters:

Parameter	Default	Description
contamination	0.1	Expected outlier fraction
random_state	42	Reproducibility seed

Comparison

Method	Strengths	Best For
Z-score	Simple, interpretable	Quick scan
Mahalanobis	Handles correlations	Multivariate anomalies
LOF	Density-based	Local structure anomalies
Isolation Forest	Robust, efficient	Large datasets

Anomalies in Lean Projects

Anomaly Type	Possible Meaning
High PageRank, low in-degree	Important but under-cited
High betweenness, low degree	Critical bridge lemma
Unusual metric combination	Structural peculiarity

Shannon Entropy

Measures the diversity of the degree distribution.

Formula

$H = -\sum_k p(k) \log_2 p(k)$

Where p(k) is the probability of degree k.

Interpretation

Entropy	Meaning
High	Uniform degree distribution (diverse)
Low	Skewed distribution (dominated by few degrees)

API Endpoints

GET /api/project/analysis/degree           # Degree stats and distribution
GET /api/project/analysis/statistics       # Overall statistics
GET /api/project/analysis/correlations     # Metric correlations
GET /api/project/analysis/patterns         # Pattern detection including anomalies

Example Response (Anomaly Detection)

{
  "z_score_anomalies": {
    "by_metric": {
      "pagerank": {"anomalies": ["Theorem.Important"], "threshold": 2.0}
    },
    "multi_anomaly_nodes": [
      {"node": "Lemma.Bridge", "anomalous_metrics": ["betweenness", "clustering"]}
    ]
  },
  "lof_anomalies": {
    "anomalies": [{"node": "Def.Unusual", "lof_score": 2.3}],
    "total_anomalies": 5
  }
}