update readme

commit 1c493e8c2a (parent 9093519c37)

README.md (11 changed lines)
@@ -457,7 +457,6 @@ MobileLLM proposes that architectural depth matters more than width; a "deep and narrow", "slim"

[C]: Mount Tai is located in Tai'an, Shandong Province, China, and is one of China's Five Great Mountains.

[D]: As an AI assistant, I cannot access the internet, so I cannot provide information about your location. However, you can obtain information about your location by searching or searching for related keywords, or by checking related news or websites to obtain information about your location.

[E]: Mount Tai is located in Tai'an, Shandong Province, China. It is one of China's Five Great Mountains and one of the important symbols of Chinese culture and history. It is one of China's Five Great Mountains and also one of the birthplaces of Chinese Taoism and Buddhism. Mount Tai stands 1,545 meters above sea level; it is one of China's Five Great Mountains and one of China's most famous tourist attractions. Its history can be traced back to the 5th century BC, when it served as a site for the sacrificial rites of ancient Chinese emperors. Mount Tai is also an important part of Chinese culture, is honored as the "foremost of the Five Great Mountains", and is one of China's important cultural legacies.

[F]: Mount Tai is located in central Shandong Province, China. It is one of China's Five Great Mountains as well as a World Cultural and Natural Heritage site. It is renowned not only for its majestic natural scenery but also for the rich history and culture it carries, having been an important site for the imperial Fengshan ceremonies of ancient China. Mount Tai is one of the symbols of the Chinese nation, embodying the Chinese people's respect for, and inheritance of, nature and history.


[Q]: What is the highest mountain in the world?
@@ -532,13 +531,11 @@ MobileLLM proposes that architectural depth matters more than width; a "deep and narrow", "slim"

* The ranking within the minimind series (A, B, C) matches intuition: minimind-v1 (0.1B) scores the highest, and its answers to common-sense questions are basically free of errors and hallucinations.

* Surprisingly, minimind-v1-small (0.02B) has only 26M parameters, yet it gets close to the performance of minimind-v1 (0.1B).

* minimind-v1 (0.1B) received fewer than 2 SFT `epochs`; I lazily killed the run early to free resources for the smaller models. Even without being fully trained, the 0.1B model still ended up the strongest, which really just shows that a larger base wins out.

* minimind-v1-moe (0.1B) performed very poorly, likewise because I lazily killed the run early to free resources for the smaller models. But an MoE model's multi-expert setup inherently calls for more training epochs, so with epochs set to 2 it was trained far from sufficiently. A while back, in the experimental stage, minimind tried a fully trained MoE version on the Yi tokenizer, and it was visibly better than the dense model. v2 and v3 will be trained and released once a server frees up.

* minimind-v1-moe (0.1B) performs only slightly better than minimind-v1-small (0.02B), again because the run was stopped early to free resources for other training. However, the sparse multi-expert MoE setup inherently calls for more training epochs, so that every expert in the FFN layers is activated by the router often enough to be trained properly; with epochs currently set to 3, training is still not sufficient (see the routing sketch after this list).
  In an early round of validation experiments, minimind tried a fully trained MoE version on the Yi-Tokenizer, and it was visibly better than the dense small model. This part will likely have to wait for a server to free up before it is retrained and released as v2 and v3.


* Model E's answer looks like the most complete one here, even though it contains a few hallucinated, made-up details. Yet GPT-4o and Deepseek both agreed in their scoring that it is "overly verbose, with repeated content and hallucinations".
  That judgment is really too strict: if 10 characters out of 100 are hallucinated, the answer can easily be dropped to a score of 0. Because model F's training texts are longer by default and its dataset is much larger, its answers look very complete; at similar model sizes, data matters far more than the model.

* To the naked eye, model E's answer looks very good, even though it contains a few hallucinated, made-up details. Yet GPT-4o and Deepseek both agreed in their scoring that it is "overly verbose, with repeated content and hallucinations".
  That judgment is a little strict: even if 10 characters out of 100 are hallucinated, the answer can easily be pushed to a low score. Because model E's pre-training texts are longer and its dataset is much larger, its answers look very complete; at similar model sizes, both the quantity and the quality of the data matter a great deal.
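To make the MoE point above concrete, here is a minimal, hypothetical routing sketch in PyTorch. It is not minimind's actual implementation; the module name `MoEFFN`, the expert count, and the layer sizes are illustrative assumptions. A top-2 router over 4 FFN experts sends each token to only 2 of the 4 experts, so every expert trains on only a fraction of the data each epoch:

```python
# Toy top-k mixture-of-experts FFN block (illustrative only, not minimind's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1408, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). The router scores every expert for every token,
        # but only the top_k experts per token are actually run.
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # both (tokens, top_k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:                   # expert received no tokens this batch
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

# With top_k=2 of 4 experts, each expert sees only ~top_k/num_experts (= 50%) of the
# tokens, so the per-expert gradient signal is thinner than in a dense FFN that sees
# every token -- one intuition for why the MoE variant needs more epochs than dense.
moe = MoEFFN()
_ = moe(torch.randn(1024, 512))
```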

> 🙋♂️ Personal subjective evaluation: E>C>B≈A>D
README_en.md (22 changed lines)
@@ -545,6 +545,7 @@ better with the scaling law for small models.

[B]: Seawater contains abundant elements like calcium, phosphorus, potassium, and phosphorus, which are limited in quantity and distribution in the ocean. Additionally, seawater contains nutrients from various organisms and microorganisms, such as shells and seaweeds.

[C]: Seawater contains many dissolved salts, with the most common being sodium chloride. When salt is added to water, its molecules react with water molecules to form sodium hydroxide, which is a type of salt.

[D]: Water is essential for many organisms, as it can absorb and release carbon dioxide, release oxygen,

[E]: The primary reason seawater is salty is due to the high concentration of dissolved salts, with sodium chloride (NaCl) being the most significant. Sodium chloride is a very common salt that is present in most regions of the Earth, including oceans, lakes, rivers, and groundwater. When water evaporates from the ocean, it carries away some of the salt, making the seawater salty. Additionally, seawater contains other dissolved salts such as sulfates, magnesium sulfate, calcium sulfate, magnesium potassium sulfate, magnesium sodium sulfate, magnesium calcium sulfate, magnesium magnesium calcium sulfate, magnesium magnesium calcium sulfate, magnesium magnesium calcium sulfate, magnesium magnesium sodium sulfate, magnesium magnesium calcium sulfate, magnesium magnesium magnesium calcium sulfate, magnesium magnesium magnesium
```

> [!NOTE]
@@ -590,21 +591,14 @@ better with the scaling law for small models.


## 👉 Summary of Effects

* The ranking of the minimind series (ABC) is intuitive, with minimind-v1(0.1B) scoring the highest and providing mostly accurate answers to common knowledge questions.

* Surprisingly, minimind-v1-small (0.02B) with only 26M parameters performs close to minimind-v1(0.1B).

* Despite having less than 2 epochs of training, minimind-v1(0.1B) performed the best. This suggests that a larger model often yields better performance, even with limited training.

* minimind-v1-moe (0.1B) performed poorly, likely because it was terminated early to free up resources for smaller models. MoE models require more training epochs, and with only 2 epochs, it was under-trained. Previous experiments with a fully trained MoE model on Yi tokenizer showed visible improvements. Future versions, v2 and v3, will be updated with better training.

* The ranking of the minimind series (ABC) aligns with intuition, with minimind-v1(0.1B) scoring the highest, and its responses to common sense questions are mostly error-free and free of hallucinations.

* Surprisingly, minimind-v1-small(0.02B), with only 26M parameters, can perform nearly as well as minimind-v1(0.1B).

* minimind-v1(0.1B) underwent less than 2 epochs of SFT (Supervised Fine-Tuning) due to being prematurely killed to free up resources for smaller models. Despite not being fully trained, it still achieved the best performance, demonstrating that larger models generally outperform smaller ones.

* minimind-v1-moe(0.1B) performed only slightly better than minimind-v1-small(0.02B), also due to early termination to free up resources for other training. However, the MoE (Mixture of Experts) model, with its sparse multi-Experts mode, requires more training epochs to fully activate and train all FFN (Feed-Forward Network) layer experts. In the current setup with 3 epochs, the training is not yet sufficient.
  Early experiments with minimind on the Yi-Tokenizer showed that a fully trained MoE version could outperform dense small models visibly. This aspect may need to be reserved for future training and updates to v2 and v3 versions when more server resources are available.


* Model E's responses appear the most complete, despite some instances of hallucination and overly verbose content. However, GPT-4o and Deepseek's evaluations suggest it is "overly verbose and repetitive, with some hallucinations." This strict evaluation might penalize models with some hallucinations heavily. Due to F models having longer default text lengths and much larger datasets, the quality of responses depends significantly on the data rather than the model size alone.

* The responses from Model E appear to be quite good to the naked eye, although there are occasional instances of hallucinations and fabrications. However, both GPT-4o and Deepseek's evaluations consistently noted that it "provides overly verbose and repetitive information, and contains hallucinations."
  This evaluation seems somewhat strict, as even a small number of hallucinated words in a 100-word response can easily result in a low score. Given that Model E was pre-trained on longer texts and a larger dataset, its responses appear more comprehensive. In models of similar size, both the quantity and quality of the data are crucial (a rough sketch of this style of LLM-judge scoring follows below).
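The judge-style scoring mentioned above (GPT-4o / Deepseek grading each model's answer) can be pictured with a small, hypothetical sketch. This is not the project's actual evaluation script; the endpoint, model name, environment variable, and rubric below are assumptions, and any OpenAI-compatible judge model would work the same way:

```python
# Hypothetical LLM-as-judge sketch -- not minimind's actual evaluation code.
# Assumes an OpenAI-compatible endpoint (Deepseek-style) and a key in DEEPSEEK_API_KEY.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # assumed endpoint
)

RUBRIC = (
    "You are grading a small language model's answer on a 0-10 scale. "
    "Penalize factual hallucinations, repetition, and excessive verbosity. "
    "Reply with the integer score only."
)

def judge(question: str, answer: str) -> str:
    # One chat-completion call per (question, answer) pair; temperature 0 for stable scores.
    resp = client.chat.completions.create(
        model="deepseek-chat",  # assumed judge model name
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

# Under a rubric like this, a fluent answer with ~10% hallucinated content can still
# land near the bottom of the scale, which matches the strict scoring described above.
print(judge("Why is seawater salty?", "Seawater is salty mainly because of dissolved sodium chloride ..."))
```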

> 🙋♂️ Personal Subjective Evaluation: E>C>B≈A>D