Statistical Analysis
Analyze degree distributions, metric correlations, and assortativity, and detect anomalous nodes.
Statistical analysis reveals patterns in graph structure and identifies unusual nodes that may deserve attention.
Degree Distribution Fitting
Real-world networks often follow specific degree distributions. Astrolabe fits several models to determine the best match.
Models
Power Law
Scale-free networks where a few nodes have very high degree. Common in natural and social networks.
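MLE Estimator (continuous approximation, over the n degrees k_i ≥ x_min):
$$\hat{\alpha} = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{k_i}{x_{\min}} \right]^{-1}$$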
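As a rough illustration, the continuous-approximation estimate α = 1 + n / Σ ln(k_i / x_min) can be computed with the standard library alone. The function name `fit_power_law_alpha` and the fixed `xmin` argument are assumptions made for this sketch; it is not necessarily how Astrolabe obtains its fit, and the degree sequence is made up.

```python
import math

def fit_power_law_alpha(degrees, xmin):
    """Continuous-approximation MLE for the power-law exponent (sketch)."""
    # Only the tail k >= xmin is assumed to follow the power law.
    tail = [k for k in degrees if k >= xmin]
    log_sum = sum(math.log(k / xmin) for k in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree strictly above xmin")
    return 1.0 + len(tail) / log_sum

# Made-up degree sequence, for illustration only.
degrees = [1, 1, 2, 2, 2, 3, 4, 4, 8, 16]
alpha = fit_power_law_alpha(degrees, xmin=2)
print(f"alpha = {alpha:.2f}")
```

Here `xmin` is passed in by hand; a full fitting routine would normally also estimate it from the data.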
Exponential
Networks with a characteristic scale. Degrees decay exponentially.
Truncated Power Law
Power law with an exponential cutoff for very high degrees.
Lognormal
Multiplicative growth processes produce lognormal distributions.
Fitting Process
- Compute degree sequence
- Fit each model using MLE or the powerlaw library
- Compare models using likelihood ratio tests
- Report best fit and parameters
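The comparison step can be sketched as a log-likelihood ratio between two fitted models. The example below compares a power law against an exponential using continuous approximations and a normal-approximation p-value; the function names and the sample degree sequence are hypothetical, and the source does not say that Astrolabe computes `R` and `p_value` exactly this way.

```python
import math

def loglik_power_law(tail, xmin):
    """Per-point log-likelihoods under a continuous power law fitted by MLE."""
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    return [math.log((alpha - 1.0) / xmin) - alpha * math.log(x / xmin) for x in tail]

def loglik_exponential(tail, xmin):
    """Per-point log-likelihoods under a shifted exponential fitted by MLE."""
    lam = len(tail) / sum(x - xmin for x in tail)
    return [math.log(lam) - lam * (x - xmin) for x in tail]

def compare_power_law_vs_exponential(degrees, xmin):
    """Return the log-likelihood ratio R (positive favors the power law)."""
    tail = [k for k in degrees if k >= xmin]
    diffs = [a - b for a, b in zip(loglik_power_law(tail, xmin),
                                   loglik_exponential(tail, xmin))]
    n = len(diffs)
    R = sum(diffs)
    mean = R / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    if sigma == 0:
        raise ValueError("log-likelihood differences have no spread")
    # Two-sided p-value: probability of seeing |R| this large if both
    # models fit equally well.
    p_value = math.erfc(abs(R) / (math.sqrt(2 * n) * sigma))
    return {"R": R, "p_value": p_value}

# Made-up heavy-tailed degree sequence, for illustration only.
heavy = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5, 8, 20, 60]
result = compare_power_law_vs_exponential(heavy, xmin=1)
print(result)
```

A positive `R` with a small `p_value` suggests the power law is the better description of the tail; a value near zero, or a large `p_value`, means the data cannot distinguish the two models.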
Output
{
  "degree_stats": {"min": 1, "max": 42, "mean": 5.3, "std": 4.2, "median": 4},
  "fits": {
    "power_law": {"alpha": 2.3, "xmin": 2},
    "exponential": {"lambda": 0.19},
    "lognormal": {"mu": 1.4, "sigma": 0.8}
  },
  "best_model": "power_law",
  "is_scale_free": true,
  "power_law_vs_exponential": {"R": 3.2, "p_value": 0.01}
}
Interpretation in Lean
| Distribution | Interpretation |
|---|---|
| Power law | Few "hub" theorems, many specialized lemmas |
| Exponential | More uniform dependency structure |
| Lognormal | Gradual hierarchy of importance |
Correlation Analysis
Compute pairwise correlations between different node metrics.
Methods
| Method | Assumption | Best For |
|---|---|---|
| Pearson | Linear relationship | Normally distributed metrics |
| Spearman | Monotonic relationship | Ordinal or skewed data |
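Formula (Spearman)
$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$
Where d_i is the difference in ranks for each observation and n is the number of observations. This closed form is exact only when there are no tied ranks.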
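A minimal sketch of the rank-difference form, ρ = 1 − 6 Σ d_i² / (n(n² − 1)), assuming no tied values. The helper names `ranks` and `spearman_rho` are illustrative and the metric values are made up.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman correlation from squared rank differences."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Made-up metric values for five nodes.
pagerank = [0.01, 0.02, 0.03, 0.04, 0.05]
betweenness = [0.2, 0.1, 0.4, 0.3, 0.5]
rho = spearman_rho(pagerank, betweenness)
print(rho)
```

With ties present, the usual approach is to assign average ranks and compute the Pearson correlation of the ranks instead.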
Correlation Strength
| Absolute ρ | Strength |
|---|---|
| ≥ 0.8 | Very strong |
| ≥ 0.6 | Strong |
| ≥ 0.4 | Moderate |
| ≥ 0.2 | Weak |
| < 0.2 | Negligible |
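The thresholds in the table translate directly into a lookup. The function name `correlation_strength` is hypothetical.

```python
def correlation_strength(rho):
    """Map a correlation coefficient to the strength labels in the table."""
    magnitude = abs(rho)
    cutoffs = [(0.8, "Very strong"), (0.6, "Strong"),
               (0.4, "Moderate"), (0.2, "Weak")]
    for cutoff, label in cutoffs:
        if magnitude >= cutoff:
            return label
    return "Negligible"

print(correlation_strength(0.85))
```

Only the magnitude matters here; the sign tells you the direction of the relationship, not its strength.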
Output
{
  "correlation_matrix": [[1.0, 0.85], [0.85, 1.0]],
  "p_value_matrix": [[0.0, 0.001], [0.001, 0.0]],
  "significant_pairs": [
    {"metric_a": "pagerank", "metric_b": "betweenness", "correlation": 0.85}
  ]
}
Common Correlations in Lean
| Metrics | Expected Correlation | Meaning |
|---|---|---|
| PageRank vs In-degree | Strong positive | Importance from citations |
| Betweenness vs Clustering | Negative | Bridges have low clustering |
| Out-degree vs Depth | Positive | Deep nodes have more dependencies |
Degree Assortativity
Assortativity measures whether high-degree nodes tend to connect to other high-degree nodes.
Formula
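Pearson correlation of degrees at edge endpoints:
$$r = \frac{\sum_{jk} jk \, (e_{jk} - q_j q_k)}{\sigma_q^2}$$
Where:
- e_jk is the fraction of edges connecting nodes of degrees j and k
- q_j is the distribution of the degree found at the end of a randomly chosen edge
- σ_q is the standard deviation of q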
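Because r is a Pearson correlation over edge endpoints, it can be computed by listing the degree pair of every edge in both directions. The sketch below treats the graph as undirected, which is an assumption; a directed dependency graph would need a choice of in-degree or out-degree at each end. The function name is illustrative.

```python
import math
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation of endpoint degrees over an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Each undirected edge contributes both (deg u, deg v) and (deg v, deg u).
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return float("nan")  # all endpoints share one degree (regular graph)
    return cov / math.sqrt(var_x * var_y)

# A star: one hub connected to three leaves.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
r_star = degree_assortativity(star)
print(r_star)
```

A star is the extreme disassortative case: every edge joins the highest-degree node to a degree-1 node.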
Interpretation
| r Value | Type | Description |
|---|---|---|
| > 0.3 | Strongly assortative | Hub-hub connections (social networks) |
| 0.1 to 0.3 | Weakly assortative | Some hub preference |
| -0.1 to 0.1 | Neutral | No degree preference |
| -0.3 to -0.1 | Weakly disassortative | Hub-leaf connections |
| < -0.3 | Strongly disassortative | Star-like structure (technical networks) |
In Lean Projects
Most dependency graphs are disassortative: high-degree "hub" theorems are used by many low-degree specialized results.
Anomaly Detection
Identify unusual nodes that deviate from expected patterns.
Z-Score Method
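The simplest approach: compute a z-score for each metric value and flag nodes whose values lie far from the mean.
$$z = \frac{x - \mu}{\sigma}$$
Where x is a node's metric value, and μ and σ are that metric's mean and standard deviation across all nodes.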
Parameters:
- threshold: 2.0 (about 95% of values fall within ±2 standard deviations when the metric is normally distributed)
Output:
- Nodes with |z| > threshold for any metric
- Multi-anomaly nodes (unusual in ≥2 metrics)
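A minimal sketch of this method, producing both outputs listed above. The input layout (`{metric: {node: value}}`), the function name, and the use of the population standard deviation are assumptions; the metric values are made up.

```python
import statistics
from collections import Counter

def z_score_anomalies(metrics, threshold=2.0):
    """Flag nodes with |z| > threshold per metric, plus multi-metric anomalies."""
    by_metric = {}
    for name, values in metrics.items():
        data = list(values.values())
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)  # population std; an assumption
        flagged = []
        if std > 0:  # a constant metric has no outliers
            flagged = [node for node, v in values.items()
                       if abs((v - mean) / std) > threshold]
        by_metric[name] = flagged
    # Nodes flagged in at least two metrics.
    hits = Counter(node for nodes in by_metric.values() for node in nodes)
    multi = sorted(node for node, count in hits.items() if count >= 2)
    return by_metric, multi

nodes = [f"n{i}" for i in range(10)]
metrics = {
    "pagerank": dict(zip(nodes, [1.0] * 9 + [10.0])),
    "betweenness": dict(zip(nodes, [2.0] * 9 + [20.0])),
    "clustering": dict(zip(nodes, [0.5] * 10)),
}
by_metric, multi = z_score_anomalies(metrics)
print(by_metric, multi)
```

Note that with few nodes the largest achievable |z| is limited, so small graphs may never cross the threshold.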
Mahalanobis Distance
Accounts for correlations between metrics (unlike z-score).
$$D_M(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}$$
Where:
- x is the metric vector for a node
- μ is the mean vector
- Σ is the covariance matrix
Parameters:
- threshold: 3.0
Advantage: Finds nodes unusual in the combination of metrics, not just individual outliers.
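To make the advantage concrete, here is a sketch for the two-metric case, where the covariance matrix can be inverted by hand. The function name and the example mean and covariance are made up; a real implementation would estimate μ and Σ from the node metrics and handle more than two dimensions.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-metric vector from the mean."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(quad)

mean = (0.0, 0.0)
cov = ((1.0, 0.5), (0.5, 1.0))   # two positively correlated metrics
with_trend = mahalanobis_2d((1.0, 1.0), mean, cov)
against_trend = mahalanobis_2d((1.0, -1.0), mean, cov)
print(with_trend, against_trend)
```

Both points are equally far from the mean in each metric taken alone, yet the one that goes against the positive correlation gets the larger distance. A per-metric z-score would not separate them.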
Local Outlier Factor (LOF)
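Compares the local density of a node to that of its neighbors.
$$\mathrm{LOF}_k(A) = \frac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \frac{\mathrm{lrd}_k(B)}{\mathrm{lrd}_k(A)}$$
Where lrd is the local reachability density and N_k(A) is the set of the k nearest neighbors of A.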
Parameters:
| Parameter | Default | Description |
|---|---|---|
| n_neighbors | 20 | Number of neighbors |
| contamination | 0.1 | Expected outlier fraction |
Interpretation:
- LOF ≈ 1: Normal density
- LOF > 1: Lower density than neighbors (outlier)
- LOF < 1: Higher density than neighbors
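The score can be sketched from first principles: a node's LOF is the average local reachability density of its k nearest neighbors divided by its own. This brute-force version uses exactly k neighbors (ties broken arbitrarily) and assumes no duplicate points; the function name and sample points are made up, and it is meant to show the computation rather than replace a library implementation.

```python
import math

def local_outlier_factors(points, k):
    """Brute-force LOF scores for a small list of metric vectors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])  # distance to k-th neighbor

    def reach_dist(i, j):
        # Reachability distance of i from j: never smaller than j's k-distance.
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: mean neighbor density relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Four tightly grouped nodes and one far away (made-up 2-metric vectors).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = local_outlier_factors(points, k=2)
print([round(s, 2) for s in scores])
```

The four clustered points score about 1, while the isolated point scores well above 1, matching the interpretation above.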
Isolation Forest
Ensemble method that isolates observations using random splits.
Intuition: Outliers are easier to isolate (require fewer splits).
Parameters:
| Parameter | Default | Description |
|---|---|---|
| contamination | 0.1 | Expected outlier fraction |
| random_state | 42 | Reproducibility seed |
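The intuition can be checked with a stripped-down, one-metric sketch that counts how many random splits it takes to isolate a value. This is not a full Isolation Forest (no subsampling, no multi-metric splits, no normalized anomaly score); the function names, the tree count, and the sample values are assumptions.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # only duplicates left, cannot split further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_isolation_depth(values, target, n_trees=200, seed=42):
    """Average isolation depth over many random split sequences."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees

# Made-up metric values: seven similar nodes and one extreme one.
pagerank = [0.9, 0.95, 0.98, 1.0, 1.02, 1.05, 1.1, 10.0]
outlier_depth = mean_isolation_depth(pagerank, 10.0)
inlier_depth = mean_isolation_depth(pagerank, 1.0)
print(outlier_depth, inlier_depth)
```

The extreme value is usually cut off by the very first split, while a value in the middle of the cluster needs several, which is the signal an Isolation Forest turns into an anomaly score.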
Comparison
| Method | Strengths | Best For |
|---|---|---|
| Z-score | Simple, interpretable | Quick scan |
| Mahalanobis | Handles correlations | Multivariate anomalies |
| LOF | Density-based | Local structure anomalies |
| Isolation Forest | Robust, efficient | Large datasets |
Anomalies in Lean Projects
| Anomaly Type | Possible Meaning |
|---|---|
| High PageRank, low in-degree | Important but under-cited |
| High betweenness, low degree | Critical bridge lemma |
| Unusual metric combination | Structural peculiarity |
Shannon Entropy
Measures the diversity of the degree distribution.
Formula
$$H = -\sum_{k} p(k) \log p(k)$$
Where p(k) is the probability of degree k. The base of the logarithm sets the unit (base 2 gives bits).
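A short sketch of the computation, H = −Σ p(k) log p(k), with p(k) estimated as the fraction of nodes having degree k. The function name and the `base` parameter are illustrative; the source does not state which logarithm base Astrolabe uses.

```python
import math
from collections import Counter

def degree_entropy(degrees, base=math.e):
    """Shannon entropy of the empirical degree distribution."""
    counts = Counter(degrees)
    n = len(degrees)
    # Empirical probability of each degree value, then -sum p log p.
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

uniform_two = degree_entropy([1, 1, 2, 2], base=2)   # two equally likely degrees
single = degree_entropy([3, 3, 3])                   # every node has degree 3
print(uniform_two, single)
```

Two equally likely degrees give one bit of entropy; a graph where every node has the same degree gives zero.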
Interpretation
| Entropy | Meaning |
|---|---|
| High | Uniform degree distribution (diverse) |
| Low | Skewed distribution (dominated by few degrees) |
API Endpoints
GET /api/project/analysis/degree # Degree stats and distribution
GET /api/project/analysis/statistics # Overall statistics
GET /api/project/analysis/correlations # Metric correlations
GET /api/project/analysis/patterns # Pattern detection including anomalies
Example Response (Anomaly Detection)
{
  "z_score_anomalies": {
    "by_metric": {
      "pagerank": {"anomalies": ["Theorem.Important"], "threshold": 2.0}
    },
    "multi_anomaly_nodes": [
      {"node": "Lemma.Bridge", "anomalous_metrics": ["betweenness", "clustering"]}
    ]
  },
  "lof_anomalies": {
    "anomalies": [{"node": "Def.Unusual", "lof_score": 2.3}],
    "total_anomalies": 5
  }
}
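A client might flatten a response of this shape into a single list of flagged nodes. The sketch below works on the example payload as a string, so it runs without a server; the helper name `anomalous_nodes` is hypothetical, and the field names are taken only from the example above.

```python
import json

response_text = """
{
  "z_score_anomalies": {
    "by_metric": {
      "pagerank": {"anomalies": ["Theorem.Important"], "threshold": 2.0}
    },
    "multi_anomaly_nodes": [
      {"node": "Lemma.Bridge", "anomalous_metrics": ["betweenness", "clustering"]}
    ]
  },
  "lof_anomalies": {
    "anomalies": [{"node": "Def.Unusual", "lof_score": 2.3}],
    "total_anomalies": 5
  }
}
"""

def anomalous_nodes(payload):
    """Collect every node name mentioned by any detector, sorted."""
    found = set()
    z = payload.get("z_score_anomalies", {})
    for info in z.get("by_metric", {}).values():
        found.update(info.get("anomalies", []))
    for entry in z.get("multi_anomaly_nodes", []):
        found.add(entry["node"])
    for entry in payload.get("lof_anomalies", {}).get("anomalies", []):
        found.add(entry["node"])
    return sorted(found)

flagged = anomalous_nodes(json.loads(response_text))
print(flagged)
```

Using `.get` with defaults keeps the helper tolerant of responses where a detector section is absent.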