model_tools / sparsify_v3_notes.md

The two scripts you provided are nearly identical in structure, but the second script contains significant safety enhancements and robustness fixes for the della_magprune method.

Here are the specific differences:

1. Parameter Validation (Safety Guards)

The second script adds a "Safety Guard" block at the start of the della_magprune function. This keeps the function from crashing or producing invalid results when the input parameters (density or epsilon) fall outside the range the algorithm can handle.

  • Density Clipping: It ensures density stays within a valid range (between 1e-4 and 1-1e-4).
  • Epsilon Adjustment: It automatically shrinks epsilon if it is too large. The algorithm computes probabilities in the range density - epsilon to density + epsilon, so an overly large epsilon would produce probabilities below 0 or above 1. The second script caps epsilon so that both ends of that range remain valid probabilities.
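
The guard described above can be sketched in plain Python. This is a minimal illustration, not the code from the second script: the helper name sanitize_della_params, the margin parameter, and the exact epsilon bound are assumptions, since the notes give only the density limits (1e-4 and 1-1e-4).

```python
def sanitize_della_params(density: float, epsilon: float, margin: float = 1e-4):
    """Hypothetical helper: clip density and shrink epsilon instead of raising."""
    # Keep density strictly inside (0, 1), using the 1e-4 bounds from the notes.
    density = min(max(density, margin), 1.0 - margin)

    # Largest epsilon that keeps density - epsilon > 0 and density + epsilon < 1.
    # The 0.999 factor is an assumption; the actual script may use another bound.
    max_epsilon = min(density, 1.0 - density) * 0.999
    epsilon = min(max(epsilon, 0.0), max_epsilon)
    return density, epsilon


# Valid inputs pass through unchanged; an oversized epsilon is shrunk.
print(sanitize_della_params(0.5, 0.1))
print(sanitize_della_params(0.9, 0.5))
```

With a guard of this shape, every probability computed later as density +/- epsilon stays strictly between 0 and 1, which is the property the notes attribute to the second script.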

2. Division by Zero Protection

In the rank normalization step of della_magprune:

  • Script 1: rank_norm = ((ranks - min_ranks) / (max_ranks - min_ranks))
  • Script 2: rank_norm = ((ranks - min_ranks) / (max_ranks - min_ranks).clamp(min=1e-8))
  • Impact: If every rank in a row is identical (max_ranks == min_ranks), both the numerator and the denominator are zero, so the first script computes 0/0 and produces NaN values. The second script uses .clamp(min=1e-8) so the denominator is never zero, and the normalized rank becomes 0 instead of NaN.
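
The difference is easy to reproduce in isolation. The snippet below is a small sketch, assuming PyTorch is installed; the single-element row is an illustrative input chosen so that the minimum and maximum rank coincide, and it is not taken from either script.

```python
import torch

# A single-element row: its only rank is both the minimum and the maximum.
ranks = torch.tensor([[1.0]])
min_ranks = ranks.min(dim=1, keepdim=True).values
max_ranks = ranks.max(dim=1, keepdim=True).values

# Unguarded form (Script 1): 0 / 0 evaluates to NaN.
unguarded = (ranks - min_ranks) / (max_ranks - min_ranks)

# Guarded form (Script 2): the denominator is at least 1e-8, so 0 / 1e-8 == 0.
guarded = (ranks - min_ranks) / (max_ranks - min_ranks).clamp(min=1e-8)

print(unguarded, guarded)
```

A NaN here would propagate into the probabilities and then into the mask, so guarding the denominator protects every later step.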

3. Probability Clipping

In the final step of generating the mask:

  • Script 1: probs = (density - epsilon) + rank_norm * 2 * epsilon
  • Script 2: probs = (density - epsilon) + rank_norm * 2 * epsilon followed by torch.bernoulli(probs.clamp(0, 1))
  • Impact: Even with the epsilon guards, floating-point rounding could still push a probability slightly outside the $[0, 1]$ range. torch.bernoulli requires its input probabilities to lie within that range, so the second script applies .clamp(0, 1) to the Bernoulli input to keep PyTorch from raising an error.
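
As a rough illustration of the clamp, the sketch below builds a probability tensor with two slightly out-of-range entries and samples a mask from the clamped version. The values are made up for the example and PyTorch is assumed to be installed.

```python
import torch

# Hypothetical probabilities with small overshoots at both ends.
probs = torch.tensor([-1e-6, 0.3, 0.7, 1.0 + 1e-6])

# Clamp before sampling so every entry is a valid Bernoulli probability.
safe_probs = probs.clamp(0, 1)
mask = torch.bernoulli(safe_probs)

# The mask contains only 0.0 and 1.0 and has the same shape as probs.
print(safe_probs, mask)
```

The clamp changes nothing for values already inside the range, so it costs little and only affects the rare entries that rounding has pushed out of bounds.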

4. Logic Flow in della_magprune

  • Script 1 has a check: if density + epsilon >= 1 or density - epsilon <= 0: raise ValueError(...). The run therefore aborts whenever the parameters are out of range.
  • Script 2 removes that ValueError and replaces it with the "Safety Guard" logic described in section 1. Instead of aborting, it corrects the values and continues running.

Summary Table

| Feature | Script 1 | Script 2 |
| --- | --- | --- |
| Bad Inputs | Crashes with ValueError | Automatically fixes/clips values |
| Single-value Tensors | May produce NaN (Div by 0) | Safe (Clamped denominator) |
| Bernoulli Stability | Risk of out-of-bounds error | Guaranteed $[0, 1]$ range |
| Reliability | Experimental/Strict | Production-ready/Robust |

Recommendation: Use the second script. It is a more mature version of the code designed to handle edge cases and prevent runtime failures during automated optimization or training.

Note

This patch is required for Gemma 4 31B merges.