Update README.md
README.md
CHANGED
|
@@ -18,7 +18,7 @@ When models are released, their accompanying technical reports or papers often p
|
|
| 18 |
|
| 19 |
To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the **Evaluation Appendix**.
|
| 20 |
|
| 21 |
-
###
|
| 22 |
|
| 23 |
The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
|
| 24 |
|
|
@@ -40,7 +40,64 @@ The benchmarks and corresponding scores listed in the table below are taken dire
|
|
| 40 |
|
| 41 |
\* : We report the 4-shot score instead of the 4-shot, maj@4.
|
| 42 |
|
| 43 |
-
###
|
| 44 |
|
| 45 |
#### Llama 3
|
| 46 |
The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
|
|
@@ -77,7 +134,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
|
|
| 77 |
|
| 78 |
\*: We were unable to find an evaluation framework for this benchmark.
|
| 79 |
|
| 80 |
-
###
|
| 81 |
The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
|
| 82 |
|
| 83 |
|Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
|
|
@@ -106,61 +163,4 @@ The benchmarks and corresponding scores listed in the table below are taken dire
|
|
| 106 |
|MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
|
| 107 |
||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
|
| 108 |
|
| 109 |
-
\*: We were unable to find an evaluation framework for this benchmark.
|
| 110 |
-
|
| 111 |
-
### Comparsion to Gemma
|
| 112 |
-
|
| 113 |
-
#### Gemma 1 & 2
|
| 114 |
-
The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
|
| 115 |
-
|
| 116 |
-
_Note: Although referred to as "2B", Gemma 2 2B actually has 2.6 billion parameters._
|
| 117 |
-
|
| 118 |
-
|Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
|
| 119 |
-
|---|---|---|---|---|---|---|---|---|---|---|
|
| 120 |
-
|MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
|
| 121 |
-
|ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
|
| 122 |
-
|GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
|
| 123 |
-
|AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
|
| 124 |
-
|DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
|
| 125 |
-
|BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
|
| 126 |
-
|Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
|
| 127 |
-
|HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
|
| 128 |
-
|MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.68%|+65.43%|+151.25%|+9.84%|
|
| 129 |
-
|ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
|
| 130 |
-
|PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
|
| 131 |
-
|SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
|
| 132 |
-
|Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
|
| 133 |
-
|TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
|
| 134 |
-
|NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
|
| 135 |
-
|HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
|
| 136 |
-
|MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
|
| 137 |
-
|||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
|
| 138 |
-
|
| 139 |
-
\*: We were unable to find an evaluation framework for this benchmark.
|
| 140 |
-
|
| 141 |
-
#### Gemma 3
|
| 142 |
-
The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
|
| 143 |
-
|
| 144 |
-
|Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
|
| 145 |
-
|---|---|---|---|---|---|---|
|
| 146 |
-
|HellaS|10-shot|62.3|77.2|69.89|+12.18%|-9.47%|
|
| 147 |
-
|BoolQ|0-shot|63.2|72.3|67.76|+7.22%|-6.28%|
|
| 148 |
-
|PIQA|0-shot|73.8|79.6|75.59|+2.43%|-5.04%|
|
| 149 |
-
|SIQA|0-shot|48.9|51.9|61.97|+26.73%|+19.40%|
|
| 150 |
-
|TQA|5-shot|39.8|65.8|54.97|+38.12%|-16.46%|
|
| 151 |
-
|NQ|5-shot|9.48|20|10.91|+15.08%|-45.45%|
|
| 152 |
-
|ARC-C|25-shot|38.4|56.2|75.08|+95.52%|+33.59%|
|
| 153 |
-
|ARC-E|0-shot|73|82.4|87.21|+19.47%|+5.84%|
|
| 154 |
-
|WinoG|5-shot|58.2|64.7|67.09|+15.27%|+3.69%|
|
| 155 |
-
|BBH|few-shot, CoT|28.4|50.9|48.56|+70.99%|-4.60%|
|
| 156 |
-
|Drop|1-shot, F1|42.4|60.1|29.33|-30.83%|-51.20%|
|
| 157 |
-
|MMLU|5-shot|-|59.6|57.93|-|-2.80%|
|
| 158 |
-
|MMLUpro|5-shot, CoT|-|29.2|-|-|-|
|
| 159 |
-
|AGIE|3-5-shot|-|42.1|-|-|-|
|
| 160 |
-
|MATH|4-shot, CoT|-|24.2|40.2|-|+66.12%|
|
| 161 |
-
|GSM8K|8-shot, CoT|-|38.4|77.71|-|+102.37%|
|
| 162 |
-
|GPQA Diamond|5-shot, CoT|-|15|31.81|-|+112.07%|
|
| 163 |
-
|MBPP|3-shot|-|46|60.3|-|+31.09%|
|
| 164 |
-
|HumanE|0-shot|-|36|68.3|-|+89.72%|
|
| 165 |
-
|IFEval|-|80.2|90.2|74.02|-7.71%|-17.94%|
|
| 166 |
-
|||||**Average**|**+22.04%**|**+16.93%**|
|
|
|
|
| 18 |
|
| 19 |
To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the **Evaluation Appendix**.
|
| 20 |
|
| 21 |
+
### Comparison to Mistral 7B by Mistral AI
|
| 22 |
|
| 23 |
The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
|
| 24 |
|
|
|
|
| 40 |
|
| 41 |
\* : We report the 4-shot score instead of the 4-shot, maj@4.
|
| 42 |
|
| 43 |
+
### Comparison to the Gemma series by Google
|
| 44 |
+
|
| 45 |
+
#### Gemma 1 & 2
|
| 46 |
+
The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
|
| 47 |
+
|
| 48 |
+
_Note: Although referred to as "2B", Gemma 2 2B actually has <U>2.6 billion</U> parameters._
|
| 49 |
+
|
| 50 |
+
|Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
|
| 51 |
+
|---|---|---|---|---|---|---|---|---|---|---|
|
| 52 |
+
|MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
|
| 53 |
+
|ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
|
| 54 |
+
|GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
|
| 55 |
+
|AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
|
| 56 |
+
|DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
|
| 57 |
+
|BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
|
| 58 |
+
|Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
|
| 59 |
+
|HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
|
| 60 |
+
|MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.68%|+65.43%|+151.25%|+9.84%|
|
| 61 |
+
|ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
|
| 62 |
+
|PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
|
| 63 |
+
|SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
|
| 64 |
+
|Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
|
| 65 |
+
|TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
|
| 66 |
+
|NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
|
| 67 |
+
|HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
|
| 68 |
+
|MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
|
| 69 |
+
|||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
|
| 70 |
+
|
| 71 |
+
\*: We were unable to find an evaluation framework for this benchmark.
|
| 72 |
+
|
| 73 |
+
#### Gemma 3
|
| 74 |
+
The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
|
| 75 |
+
|
| 76 |
+
|Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
|
| 77 |
+
|---|---|---|---|---|---|---|
|
| 78 |
+
|HellaS|10-shot|62.3|77.2|69.89|+12.18%|-9.47%|
|
| 79 |
+
|BoolQ|0-shot|63.2|72.3|67.76|+7.22%|-6.28%|
|
| 80 |
+
|PIQA|0-shot|73.8|79.6|75.59|+2.43%|-5.04%|
|
| 81 |
+
|SIQA|0-shot|48.9|51.9|61.97|+26.73%|+19.40%|
|
| 82 |
+
|TQA|5-shot|39.8|65.8|54.97|+38.12%|-16.46%|
|
| 83 |
+
|NQ|5-shot|9.48|20|10.91|+15.08%|-45.45%|
|
| 84 |
+
|ARC-C|25-shot|38.4|56.2|75.08|+95.52%|+33.59%|
|
| 85 |
+
|ARC-E|0-shot|73|82.4|87.21|+19.47%|+5.84%|
|
| 86 |
+
|WinoG|5-shot|58.2|64.7|67.09|+15.27%|+3.69%|
|
| 87 |
+
|BBH|few-shot, CoT|28.4|50.9|48.56|+70.99%|-4.60%|
|
| 88 |
+
|Drop|1-shot, F1|42.4|60.1|29.33|-30.83%|-51.20%|
|
| 89 |
+
|MMLU|5-shot|-|59.6|57.93|-|-2.80%|
|
| 90 |
+
|MMLUpro|5-shot, CoT|-|29.2|-|-|-|
|
| 91 |
+
|AGIE|3-5-shot|-|42.1|-|-|-|
|
| 92 |
+
|MATH|4-shot, CoT|-|24.2|40.2|-|+66.12%|
|
| 93 |
+
|GSM8K|8-shot, CoT|-|38.4|77.71|-|+102.37%|
|
| 94 |
+
|GPQA Diamond|5-shot, CoT|-|15|31.81|-|+112.07%|
|
| 95 |
+
|MBPP|3-shot|-|46|60.3|-|+31.09%|
|
| 96 |
+
|HumanE|0-shot|-|36|68.3|-|+89.72%|
|
| 97 |
+
|IFEval|-|80.2|90.2|74.02|-7.71%|-17.94%|
|
| 98 |
+
|||||**Average**|**+22.04%**|**+16.93%**|
|
| 99 |
+
|
| 100 |
+
### Comparison to the Llama series by Meta
|
| 101 |
|
| 102 |
#### Llama 3
|
| 103 |
The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
|
|
|
|
| 134 |
|
| 135 |
\*: We were unable to find an evaluation framework for this benchmark.
|
| 136 |
|
| 137 |
+
### Comparison to the Phi series by Microsoft
|
| 138 |
The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
|
| 139 |
|
| 140 |
|Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
|
|
|
|
| 163 |
|MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
|
| 164 |
||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
|
| 165 |
|
| 166 |
+
\*: We were unable to find an evaluation framework for this benchmark.
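
The `Improvement(...)` columns and the **Average** rows in the tables above appear to follow simple arithmetic. The sketch below assumes each improvement is the relative change of the Motif 2.6B score against the baseline score, and that the average is the unweighted mean of those values with `-` entries skipped. The README does not state the formula, so this is an inference that happens to reproduce the published figures. The function names and the `GEMMA3_1B_ROWS` list are hypothetical; the scores are copied from the MT Bench row and the Gemma 3 1B column.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple

# (baseline score, Motif 2.6B score); None marks a "-" cell in the table.
Row = Tuple[Optional[float], Optional[float]]


def relative_improvement(motif: float, baseline: float) -> float:
    """Percent change of the Motif score relative to a baseline score (assumed formula)."""
    return (motif - baseline) / baseline * 100.0


def average_improvement(rows: Sequence[Row]) -> float:
    """Unweighted mean of per-benchmark improvements, skipping rows with a missing score."""
    return mean(
        relative_improvement(motif, baseline)
        for baseline, motif in rows
        if baseline is not None and motif is not None
    )


# Gemma 3 1B column versus Motif 2.6B, copied from the Gemma 3 table.
GEMMA3_1B_ROWS = [
    (62.3, 69.89),   # HellaS
    (63.2, 67.76),   # BoolQ
    (73.8, 75.59),   # PIQA
    (48.9, 61.97),   # SIQA
    (39.8, 54.97),   # TQA
    (9.48, 10.91),   # NQ
    (38.4, 75.08),   # ARC-C
    (73.0, 87.21),   # ARC-E
    (58.2, 67.09),   # WinoG
    (28.4, 48.56),   # BBH
    (42.4, 29.33),   # Drop
    (None, 57.93),   # MMLU: no Gemma 3 1B score reported, so the row is skipped
    (80.2, 74.02),   # IFEval
]

# MT Bench, Motif 2.6B (6.77) versus Phi-3 3.8B (8.38).
print(f"{relative_improvement(6.77, 8.38):+.2f}%")
print(f"{average_improvement(GEMMA3_1B_ROWS):+.2f}%")
```

Under these assumptions the two printed values round to the -19.21% and +22.04% shown in the tables. Note that a mean of relative changes is dominated by benchmarks with low baseline scores, which is why a single row such as GSM8K can move an average substantially.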