Motif-Technologies/Motif-2.6B

@@ -18,7 +18,7 @@ When models are released, their accompanying technical reports or papers often p
 To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the **Evaluation Appendix**.
-### Comparsion to Mistral
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
@@ -40,7 +40,64 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 \* : We report the 4-shot score instead of the 4-shot, maj@4.
-### Comparsion to Llama
 #### Llama 3
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
@@ -77,7 +134,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 \*: We were unable to find an evaluation framework for this benchmark.
-### Comparsion to Phi
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
 |Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
@@ -106,61 +163,4 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
 ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
-\*: We were unable to find an evaluation framework for this benchmark.
-### Comparsion to Gemma
-#### Gemma 1 & 2
-The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
-_Note: Although referred to as "2B", Gemma 2 2B actually has 2.6 billion parameters._
-|Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
-|---|---|---|---|---|---|---|---|---|---|---|
-|MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
-|ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
-|GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
-|AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
-|DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
-|BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
-|Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
-|HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
-|MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.88%|+65.43%|+151.25%|+9.84%|
-|ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
-|PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
-|SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
-|Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
-|TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
-|NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
-|HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
-|MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
-|||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
-\*: We were unable to find an evaluation framework for this benchmark.
-#### Gemma 3
-The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
-|Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
-|---|---|---|---|---|---|---|
-|HellaS|10-shot|62.3|77.2|69.89|+12.18%|-9.47%|
-|BoolQ|0-shot|63.2|72.3|67.76|+7.22%|-6.28%|
-|PIQA|0-shot|73.8|79.6|75.59|+2.43%|-5.04%|
-|SIQA|0-shot|48.9|51.9|61.97|+26.73%|+19.40%|
-|TQA|5-shot|39.8|65.8|54.97|+38.12%|-16.46%|
-|NQ|5-shot|9.48|20|10.91|+15.08%|-45.45%|
-|ARC-C|25-shot|38.4|56.2|75.08|+95.52%|+33.59%|
-|ARC-E|0-shot|73|82.4|87.21|+19.47%|+5.84%|
-|WinoG|5-shot|58.2|64.7|67.09|+15.27%|+3.69%|
-|BBH|few-shot, CoT|28.4|50.9|48.56|+70.99%|-4.60%|
-|Drop|1-shot, F1|42.4|60.1|29.33|-30.83%|-51.20%|
-|MMLU|5-shot|-|59.6|57.93|-|-2.80%|
-|MMLUpro|5-shot, CoT|-|29.2|-|-|-|
-|AGIE|3-5-shot|-|42.1|-|-|-|
-|MATH|4-shot, CoT|-|24.2|40.2|-|+66.12%|
-|GSM8K|8-shot, CoT|-|38.4|77.71|-|+102.37%|
-|GPQA Diamond|5-shot, CoT|-|15|31.81|-|+112.07%|
-|MBPP|3-shot|-|46|60.3|-|+31.09%|
-|HumanE|0-shot|-|36|68.3|-|+89.72%|
-|IFEval|-|80.2|90.2|74.02|-7.71%|-17.94%|
-|||||**Average**|**+22.04%**|**+16.93%**|

 To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the **Evaluation Appendix**.
+### Comparison to Mistral 7B by Mistral AI
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
 \* : We report the 4-shot score instead of the 4-shot, maj@4.
+### Comparison to the Gemma series by Google
+#### Gemma 1 & 2
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
+_Note: Although referred to as "2B", Gemma 2 2B actually has <U>2.6 billion</U> parameters._
+|Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
+|---|---|---|---|---|---|---|---|---|---|---|
+|MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
+|ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
+|GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
+|AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
+|DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
+|BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
+|Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
+|HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
+|MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.88%|+65.43%|+151.25%|+9.84%|
+|ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
+|PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
+|SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
+|Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
+|TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
+|NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
+|HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
+|MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
+|||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
+\*: We were unable to find an evaluation framework for this benchmark.
+#### Gemma 3
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
+|Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
+|---|---|---|---|---|---|---|
+|HellaS|10-shot|62.3|77.2|69.89|+12.18%|-9.47%|
+|BoolQ|0-shot|63.2|72.3|67.76|+7.22%|-6.28%|
+|PIQA|0-shot|73.8|79.6|75.59|+2.43%|-5.04%|
+|SIQA|0-shot|48.9|51.9|61.97|+26.73%|+19.40%|
+|TQA|5-shot|39.8|65.8|54.97|+38.12%|-16.46%|
+|NQ|5-shot|9.48|20|10.91|+15.08%|-45.45%|
+|ARC-C|25-shot|38.4|56.2|75.08|+95.52%|+33.59%|
+|ARC-E|0-shot|73|82.4|87.21|+19.47%|+5.84%|
+|WinoG|5-shot|58.2|64.7|67.09|+15.27%|+3.69%|
+|BBH|few-shot, CoT|28.4|50.9|48.56|+70.99%|-4.60%|
+|Drop|1-shot, F1|42.4|60.1|29.33|-30.83%|-51.20%|
+|MMLU|5-shot|-|59.6|57.93|-|-2.80%|
+|MMLUpro|5-shot, CoT|-|29.2|-|-|-|
+|AGIE|3-5-shot|-|42.1|-|-|-|
+|MATH|4-shot, CoT|-|24.2|40.2|-|+66.12%|
+|GSM8K|8-shot, CoT|-|38.4|77.71|-|+102.37%|
+|GPQA Diamond|5-shot, CoT|-|15|31.81|-|+112.07%|
+|MBPP|3-shot|-|46|60.3|-|+31.09%|
+|HumanE|0-shot|-|36|68.3|-|+89.72%|
+|IFEval|-|80.2|90.2|74.02|-7.71%|-17.94%|
+|||||**Average**|**+22.04%**|**+16.93%**|
+### Comparison to the Llama series by Meta
 #### Llama 3
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
 \*: We were unable to find an evaluation framework for this benchmark.
+### Comparison to the Phi series by Microsoft
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
 |Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
 |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
 ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
+\*: We were unable to find an evaluation framework for this benchmark.
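
The README does not state how the Improvement columns are derived. The per-benchmark values in the tables above are consistent with a relative change of the Motif 2.6B score against each baseline score, and the **Average** row of the Gemma 3 table is consistent with a plain arithmetic mean over the benchmarks where both scores are available. The following is a minimal sketch under that assumption; the function names are hypothetical and are not taken from the Motif repository.

```python
from typing import Iterable, Optional


def improvement(motif_score: float, baseline_score: float) -> float:
    """Relative change of the Motif score over a baseline score, in percent.

    Assumed formula: (motif - baseline) / baseline * 100.
    """
    return (motif_score - baseline_score) / baseline_score * 100.0


def format_improvement(pct: float) -> str:
    """Render a percentage with an explicit sign and two decimals, as in the tables."""
    return f"{pct:+.2f}%"


def average_improvement(pcts: Iterable[Optional[float]]) -> float:
    """Arithmetic mean of the available improvements; missing entries (None) are skipped."""
    values = [p for p in pcts if p is not None]
    return sum(values) / len(values)


if __name__ == "__main__":
    # MMLU, Motif 2.6B (57.93) vs. Gemma 1 2B (42.3)
    print(format_improvement(improvement(57.93, 42.3)))  # +36.95%

    # "Improvement(over 1B)" column of the Gemma 3 table; "-" cells are None.
    gemma3_1b = [12.18, 7.22, 2.43, 26.73, 38.12, 15.08, 95.52, 19.47,
                 15.27, 70.99, -30.83, None, None, None, None, None,
                 None, None, None, -7.71]
    print(format_improvement(average_improvement(gemma3_1b)))  # +22.04%
```

Because each value is relative to its baseline, a low baseline score produces a large percentage (for example GSM8K against Gemma 1 2B), so a few such rows can dominate an unweighted average.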