Statistical Analysis

Analyze degree distributions, correlations, assortativity, and detect anomalies.

Statistical analysis reveals patterns in graph structure and identifies unusual nodes that may deserve attention.

Degree Distribution Fitting

Real-world networks often follow specific degree distributions. Astrolabe fits several models to determine the best match.

Models

Power Law

P(k)kγP(k) \propto k^{-\gamma}

Scale-free networks where a few nodes have very high degree. Common in natural and social networks.

MLE Estimator:

γ^1+ni=1nlnxixmin\hat{\gamma} \approx 1 + \frac{n}{\sum_{i=1}^n \ln \frac{x_i}{x_{min}}}

Exponential

P(k)ek/κP(k) \propto e^{-k/\kappa}

Networks with a characteristic scale. Degrees decay exponentially.

Truncated Power Law

P(k)kγek/κP(k) \propto k^{-\gamma} e^{-k/\kappa}

Power law with an exponential cutoff for very high degrees.

Lognormal

P(k)1ke(lnkμ)22σ2P(k) \propto \frac{1}{k} e^{-\frac{(\ln k - \mu)^2}{2\sigma^2}}

Multiplicative growth processes produce lognormal distributions.

Fitting Process

  1. Compute degree sequence
  2. Fit each model using MLE or powerlaw library
  3. Compare models using likelihood ratio tests
  4. Report best fit and parameters

Output

{
  "degree_stats": {"min": 1, "max": 42, "mean": 5.3, "std": 4.2, "median": 4},
  "fits": {
    "power_law": {"alpha": 2.3, "xmin": 2},
    "exponential": {"lambda": 0.19},
    "lognormal": {"mu": 1.4, "sigma": 0.8}
  },
  "best_model": "power_law",
  "is_scale_free": true,
  "power_law_vs_exponential": {"R": 3.2, "p_value": 0.01}
}

Interpretation in Lean

DistributionInterpretation
Power lawFew "hub" theorems, many specialized lemmas
ExponentialMore uniform dependency structure
LognormalGradual hierarchy of importance

Correlation Analysis

Compute pairwise correlations between different node metrics.

Methods

MethodAssumptionBest For
PearsonLinear relationshipNormally distributed metrics
SpearmanMonotonic relationshipOrdinal or skewed data

Formula (Spearman)

ρ=16di2n(n21)\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}

Where d_i is the difference in ranks for each observation.

Correlation Strength

Absolute ρStrength
≥ 0.8Very strong
≥ 0.6Strong
≥ 0.4Moderate
≥ 0.2Weak
< 0.2Negligible

Output

{
  "correlation_matrix": [[1.0, 0.85], [0.85, 1.0]],
  "p_value_matrix": [[0.0, 0.001], [0.001, 0.0]],
  "significant_pairs": [
    {"metric_a": "pagerank", "metric_b": "betweenness", "correlation": 0.85}
  ]
}

Common Correlations in Lean

MetricsExpected CorrelationMeaning
PageRank vs In-degreeStrong positiveImportance from citations
Betweenness vs ClusteringNegativeBridges have low clustering
Out-degree vs DepthPositiveDeep nodes have more deps

Degree Assortativity

Assortativity measures whether high-degree nodes tend to connect to other high-degree nodes.

Formula

Pearson correlation of degrees at edge endpoints:

r=jkjk(ejkqjqk)σq2r = \frac{\sum_{jk} jk(e_{jk} - q_j q_k)}{\sigma_q^2}

Where:

  • e_jk is the fraction of edges connecting nodes of degrees j and k
  • q_j is the degree distribution
  • σ_q is the standard deviation of degree

Interpretation

r ValueTypeDescription
> 0.3Strongly assortativeHub-hub connections (social networks)
0.1 to 0.3Weakly assortativeSome hub preference
-0.1 to 0.1NeutralNo degree preference
-0.3 to -0.1Weakly disassortativeHub-leaf connections
< -0.3Strongly disassortativeStar-like structure (technical networks)

In Lean Projects

Most dependency graphs are disassortative: high-degree "hub" theorems are used by many low-degree specialized results.

Anomaly Detection

Identify unusual nodes that deviate from expected patterns.

Z-Score Method

The simplest approach: flag nodes with metrics far from the mean.

z=xμσz = \frac{x - \mu}{\sigma}

Parameters:

  • threshold: 2.0 (roughly 95% confidence)

Output:

  • Nodes with |z| > threshold for any metric
  • Multi-anomaly nodes (unusual in ≥2 metrics)

Mahalanobis Distance

Accounts for correlations between metrics (unlike z-score).

dM=(xμ)TΣ1(xμ)d_M = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}

Where:

  • x is the metric vector for a node
  • μ is the mean vector
  • Σ is the covariance matrix

Parameters:

  • threshold: 3.0

Advantage: Finds nodes unusual in the combination of metrics, not just individual outliers.

Local Outlier Factor (LOF)

Compares local density of a node to its neighbors.

LOF(A)=BNk(A)lrd(B)lrd(A)Nk(A)LOF(A) = \frac{\sum_{B \in N_k(A)} \frac{lrd(B)}{lrd(A)}}{|N_k(A)|}

Where lrd is the local reachability density.

Parameters:

ParameterDefaultDescription
n_neighbors20Number of neighbors
contamination0.1Expected outlier fraction

Interpretation:

  • LOF ≈ 1: Normal density
  • LOF > 1: Lower density than neighbors (outlier)
  • LOF < 1: Higher density than neighbors

Isolation Forest

Ensemble method that isolates observations using random splits.

Intuition: Outliers are easier to isolate (require fewer splits).

Parameters:

ParameterDefaultDescription
contamination0.1Expected outlier fraction
random_state42Reproducibility seed

Comparison

MethodStrengthsBest For
Z-scoreSimple, interpretableQuick scan
MahalanobisHandles correlationsMultivariate anomalies
LOFDensity-basedLocal structure anomalies
Isolation ForestRobust, efficientLarge datasets

Anomalies in Lean Projects

Anomaly TypePossible Meaning
High PageRank, low in-degreeImportant but under-cited
High betweenness, low degreeCritical bridge lemma
Unusual metric combinationStructural peculiarity

Shannon Entropy

Measures the diversity of the degree distribution.

Formula

H=kp(k)log2p(k)H = -\sum_k p(k) \log_2 p(k)

Where p(k) is the probability of degree k.

Interpretation

EntropyMeaning
HighUniform degree distribution (diverse)
LowSkewed distribution (dominated by few degrees)

API Endpoints

GET /api/project/analysis/degree           # Degree stats and distribution
GET /api/project/analysis/statistics       # Overall statistics
GET /api/project/analysis/correlations     # Metric correlations
GET /api/project/analysis/patterns         # Pattern detection including anomalies

Example Response (Anomaly Detection)

{
  "z_score_anomalies": {
    "by_metric": {
      "pagerank": {"anomalies": ["Theorem.Important"], "threshold": 2.0}
    },
    "multi_anomaly_nodes": [
      {"node": "Lemma.Bridge", "anomalous_metrics": ["betweenness", "clustering"]}
    ]
  },
  "lof_anomalies": {
    "anomalies": [{"node": "Def.Unusual", "lof_score": 2.3}],
    "total_anomalies": 5
  }
}