JH-Motif committed on
Commit 2f6f3b8 · verified · 1 Parent(s): c025eef

Update README.md

Files changed (1)
  1. README.md +61 -61
README.md CHANGED
@@ -18,7 +18,7 @@ When models are released, their accompanying technical reports or papers often p
18
 
19
  To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the **Evaluation Appendix**.
20
 
21
- ### Comparsion to Mistral
22
 
23
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
24
 
@@ -40,7 +40,64 @@ The benchmarks and corresponding scores listed in the table below are taken dire
40
 
41
  \* : We report the 4-shot score instead of the 4-shot, maj@4.
42
 
43
- ### Comparsion to Llama
44
 
45
  #### Llama 3
46
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
@@ -77,7 +134,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
77
 
78
  \*: We were unable to find an evaluation framework for this benchmark.
79
 
80
- ### Comparsion to Phi
81
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
82
 
83
  |Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
@@ -106,61 +163,4 @@ The benchmarks and corresponding scores listed in the table below are taken dire
106
  |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
107
  ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
108
 
109
- \*: We were unable to find an evaluation framework for this benchmark.
110
-
111
- ### Comparsion to Gemma
112
-
113
- #### Gemma 1 & 2
114
- The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
115
-
116
- _Note: Although referred to as "2B", Gemma 2 2B actually has 2.6 billion parameters._
117
-
118
- |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
119
- |---|---|---|---|---|---|---|---|---|---|---|
120
- |MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
121
- |ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
122
- |GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
123
- |AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
124
- |DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
125
- |BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
126
- |Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
127
- |HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
128
- |MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.88%|+65.43%|+151.25%|+9.84%|
129
- |ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
130
- |PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
131
- |SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
132
- |Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
133
- |TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
134
- |NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
135
- |HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
136
- |MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
137
- |||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
138
-
139
- \*: We were unable to find an evaluation framework for this benchmark.
140
-
141
- #### Gemma 3
142
- The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
143
-
144
- |Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
145
- |---|---|---|---|---|---|---|
146
- |HellaS|10-shot|62.3|77.2|69.89|+12.18%|-9.47%|
147
- |BoolQ|0-shot|63.2|72.3|67.76|+7.22%|-6.28%|
148
- |PIQA|0-shot|73.8|79.6|75.59|+2.43%|-5.04%|
149
- |SIQA|0-shot|48.9|51.9|61.97|+26.73%|+19.40%|
150
- |TQA|5-shot|39.8|65.8|54.97|+38.12%|-16.46%|
151
- |NQ|5-shot|9.48|20|10.91|+15.08%|-45.45%|
152
- |ARC-C|25-shot|38.4|56.2|75.08|+95.52%|+33.59%|
153
- |ARC-E|0-shot|73|82.4|87.21|+19.47%|+5.84%|
154
- |WinoG|5-shot|58.2|64.7|67.09|+15.27%|+3.69%|
155
- |BBH|few-shot, CoT|28.4|50.9|48.56|+70.99%|-4.60%|
156
- |Drop|1-shot, F1|42.4|60.1|29.33|-30.83%|-51.20%|
157
- |MMLU|5-shot|-|59.6|57.93|-|-2.80%|
158
- |MMLUpro|5-shot, CoT|-|29.2|-|-|-|
159
- |AGIE|3-5-shot|-|42.1|-|-|-|
160
- |MATH|4-shot, CoT|-|24.2|40.2|-|+66.12%|
161
- |GSM8K|8-shot, CoT|-|38.4|77.71|-|+102.37%|
162
- |GPQA Diamond|5-shot, CoT|-|15|31.81|-|+112.07%|
163
- |MBPP|3-shot|-|46|60.3|-|+31.09%|
164
- |HumanE|0-shot|-|36|68.3|-|+89.72%|
165
- |IFEval|-|80.2|90.2|74.02|-7.71%|-17.94%|
166
- |||||**Average**|**+22.04%**|**+16.93%**|
 
18
 
19
  To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the **Evaluation Appendix**.
20
 
21
+ ### Comparison to Mistral 7B by Mistral AI
22
 
23
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
24
 
 
40
 
41
  \* : We report the 4-shot score instead of the 4-shot, maj@4.
42
 
43
+ ### Comparison to the Gemma series by Google
44
+
45
+ #### Gemma 1 & 2
46
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
47
+
48
+ _Note: Although referred to as "2B", Gemma 2 2B actually has <U>2.6 billion</U> parameters._
49
+
50
+ |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
51
+ |---|---|---|---|---|---|---|---|---|---|---|
52
+ |MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
53
+ |ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
54
+ |GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
55
+ |AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
56
+ |DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
57
+ |BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
58
+ |Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
59
+ |HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
60
+ |MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.88%|+65.43%|+151.25%|+9.84%|
61
+ |ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
62
+ |PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
63
+ |SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
64
+ |Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
65
+ |TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
66
+ |NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
67
+ |HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
68
+ |MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
69
+ |||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
70
+
71
+ \*: We were unable to find an evaluation framework for this benchmark.
72
+
73
+ #### Gemma 3
74
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
75
+
76
+ |Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
77
+ |---|---|---|---|---|---|---|
78
+ |HellaS|10-shot|62.3|77.2|69.89|+12.18%|-9.47%|
79
+ |BoolQ|0-shot|63.2|72.3|67.76|+7.22%|-6.28%|
80
+ |PIQA|0-shot|73.8|79.6|75.59|+2.43%|-5.04%|
81
+ |SIQA|0-shot|48.9|51.9|61.97|+26.73%|+19.40%|
82
+ |TQA|5-shot|39.8|65.8|54.97|+38.12%|-16.46%|
83
+ |NQ|5-shot|9.48|20|10.91|+15.08%|-45.45%|
84
+ |ARC-C|25-shot|38.4|56.2|75.08|+95.52%|+33.59%|
85
+ |ARC-E|0-shot|73|82.4|87.21|+19.47%|+5.84%|
86
+ |WinoG|5-shot|58.2|64.7|67.09|+15.27%|+3.69%|
87
+ |BBH|few-shot, CoT|28.4|50.9|48.56|+70.99%|-4.60%|
88
+ |Drop|1-shot, F1|42.4|60.1|29.33|-30.83%|-51.20%|
89
+ |MMLU|5-shot|-|59.6|57.93|-|-2.80%|
90
+ |MMLUpro|5-shot, CoT|-|29.2|-|-|-|
91
+ |AGIE|3-5-shot|-|42.1|-|-|-|
92
+ |MATH|4-shot, CoT|-|24.2|40.2|-|+66.12%|
93
+ |GSM8K|8-shot, CoT|-|38.4|77.71|-|+102.37%|
94
+ |GPQA Diamond|5-shot, CoT|-|15|31.81|-|+112.07%|
95
+ |MBPP|3-shot|-|46|60.3|-|+31.09%|
96
+ |HumanE|0-shot|-|36|68.3|-|+89.72%|
97
+ |IFEval|-|80.2|90.2|74.02|-7.71%|-17.94%|
98
+ |||||**Average**|**+22.04%**|**+16.93%**|
99
+
100
+ ### Comparison to the Llama series by Meta
101
 
102
  #### Llama 3
103
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
 
134
 
135
  \*: We were unable to find an evaluation framework for this benchmark.
136
 
137
+ ### Comparison to the Phi series by Microsoft
138
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
139
 
140
  |Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
 
163
  |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
164
  ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
165
 
166
+ \*: We were unable to find an evaluation framework for this benchmark.
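
The Improvement columns in the tables above are consistent with the relative change of the Motif 2.6B score over each baseline score, expressed as a percentage. A minimal sketch of that calculation, assuming this formula; the function name `relative_improvement` is hypothetical and not taken from the README:

```python
from typing import Optional


def relative_improvement(motif: Optional[float], baseline: Optional[float]) -> str:
    """Format the relative change of `motif` over `baseline` as a signed percentage.

    Assumption: Improvement = (motif - baseline) / baseline * 100.
    Returns "-" when either score is missing, mirroring the "-" cells in the tables.
    """
    if motif is None or baseline is None:
        return "-"
    change = (motif - baseline) / baseline * 100
    # The "+" flag prints an explicit sign for positive and negative values alike.
    return f"{change:+.2f}%"


# Scores taken from the Gemma 1 & 2 table (Motif 2.6B vs. Gemma 1 2B).
print(relative_improvement(57.93, 42.3))  # MMLU
print(relative_improvement(48.56, 35.2))  # BBH
print(relative_improvement(29.33, 48.5))  # DROP
print(relative_improvement(None, 24.2))   # AGIEval, no Motif score
```

Formatting with an explicit sign keeps positive and negative cells visually consistent across a table.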