【原】NLP之ModelScope：基于ModelScope框架的afqmc數(shù)據(jù)集利用StructBERT預(yù)訓(xùn)練模型的文本相似度算法實現(xiàn)文本分類任務(wù)圖文教程之詳細攻略

處女座的程序猿 2022-10-21 發(fā)布于上海

展開全文

ModelScope之NLP：基于ModelScope框架的afqmc數(shù)據(jù)集利用StructBERT預(yù)訓(xùn)練模型的文本相似度算法實現(xiàn)文本分類任務(wù)圖文教程之詳細攻略

官方文檔：
https://www./docs/%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8Notebook%E8%BF%90%E8%A1%8C%E6%A8%A1%E5%9E%8B

基于ModelScope框架的afqmc數(shù)據(jù)集利用StructBERT預(yù)訓(xùn)練模型的文本相似度算法實現(xiàn)文本分類任務(wù)圖文教程

建立在線環(huán)境

基于PAI-DSW在Jupyterlab內(nèi)建模地址：https://dsw-gateway-cn-hangzhou.data.aliyun.com/dsw-14046/lab/workspaces/auto-a/tree/NLP_test20221016.ipynb

打開在線Notebook—Jupyterlab

?打開CPU實例通常需要2-5分鐘，打開GPU實例通常需要8-10分鐘，請耐心等待

基于PAI-DSW在Jupyterlab內(nèi)建模

第一次運行，需要加載相關(guān)的庫和數(shù)據(jù)集

案例設(shè)計思路

1、載入數(shù)據(jù)集

afqmc（Ant Financial Question Matching Corpus）數(shù)據(jù)集

2、數(shù)據(jù)預(yù)處理

在ModelScope中，數(shù)據(jù)預(yù)處理與模型強相關(guān)，因此，在指定模型以后，ModelScope框架會自動從對應(yīng)的modelcard中讀取配置文件中的preprocessor關(guān)鍵字，自動完成預(yù)處理的實例化。

3、模型訓(xùn)練與評估

訓(xùn)練

根據(jù)參數(shù)實例化trainer對象，最后，調(diào)用train接口進行訓(xùn)練

/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py:736: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  "The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
2022-10-16 23:43:35,985 - modelscope - INFO - epoch [1][1/1]	lr: 1.000e-03, eta: 0:02:25, iter_time: 16.119, data_load_time: 2.028, loss: 0.0859
Total test samples: 100%|██████████| 3/3 [00:01<00:00,  1.58it/s]
2022-10-16 23:43:37,994 - modelscope - INFO - Saving checkpoint at 1 epoch
2022-10-16 23:43:38,568 - modelscope - INFO - epoch(eval) [1][1]	accuracy: 1.0000
2022-10-16 23:44:04,388 - modelscope - INFO - epoch [2][1/1]	lr: 5.500e-03, eta: 0:02:47, iter_time: 25.817, data_load_time: 2.032, loss: 0.0985
Total test samples: 100%|██████████| 3/3 [00:01<00:00,  1.60it/s]
2022-10-16 23:44:06,398 - modelscope - INFO - Saving checkpoint at 2 epoch
2022-10-16 23:44:06,971 - modelscope - INFO - epoch(eval) [2][1]	accuracy: 1.0000
2022-10-16 23:44:40,386 - modelscope - INFO - epoch [3][1/1]	lr: 1.000e-03, eta: 0:02:55, iter_time: 33.411, data_load_time: 2.034, loss: 0.0743
Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.32it/s]
2022-10-16 23:44:42,802 - modelscope - INFO - Saving checkpoint at 3 epoch
2022-10-16 23:44:43,388 - modelscope - INFO - epoch(eval) [3][1]	accuracy: 1.0000
2022-10-16 23:45:21,681 - modelscope - INFO - epoch [4][1/1]	lr: 1.000e-03, eta: 0:02:50, iter_time: 38.290, data_load_time: 2.035, loss: 0.0754
Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.13it/s]
2022-10-16 23:45:24,400 - modelscope - INFO - Saving checkpoint at 4 epoch
2022-10-16 23:45:24,988 - modelscope - INFO - epoch(eval) [4][1]	accuracy: 1.0000
2022-10-16 23:46:01,486 - modelscope - INFO - epoch [5][1/1]	lr: 1.000e-04, eta: 0:02:30, iter_time: 36.494, data_load_time: 2.035, loss: 0.0687
Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.04it/s]
2022-10-16 23:46:04,510 - modelscope - INFO - Saving checkpoint at 5 epoch
2022-10-16 23:46:05,140 - modelscope - INFO - epoch(eval) [5][1]	accuracy: 1.0000
2022-10-16 23:46:42,186 - modelscope - INFO - epoch [6][1/1]	lr: 1.000e-04, eta: 0:02:04, iter_time: 37.042, data_load_time: 2.036, loss: 0.0656
Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.26it/s]
2022-10-16 23:46:44,707 - modelscope - INFO - Saving checkpoint at 6 epoch
2022-10-16 23:46:45,311 - modelscope - INFO - epoch(eval) [6][1]	accuracy: 1.0000
2022-10-16 23:47:22,378 - modelscope - INFO - epoch [7][1/1]	lr: 1.000e-05, eta: 0:01:36, iter_time: 37.063, data_load_time: 2.037, loss: 0.0669
Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.08it/s]
2022-10-16 23:47:25,202 - modelscope - INFO - Saving checkpoint at 7 epoch
2022-10-16 23:47:25,785 - modelscope - INFO - epoch(eval) [7][1]	accuracy: 1.0000
2022-10-16 23:48:03,174 - modelscope - INFO - epoch [8][1/1]	lr: 1.000e-05, eta: 0:01:05, iter_time: 37.297, data_load_time: 2.037, loss: 0.0723
Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.01it/s]
2022-10-16 23:48:06,203 - modelscope - INFO - Saving checkpoint at 8 epoch
2022-10-16 23:48:06,807 - modelscope - INFO - epoch(eval) [8][1]	accuracy: 1.0000
2022-10-16 23:48:43,083 - modelscope - INFO - epoch [9][1/1]	lr: 1.000e-06, eta: 0:00:33, iter_time: 36.272, data_load_time: 2.039, loss: 0.0750
Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.11it/s]
2022-10-16 23:48:45,899 - modelscope - INFO - Saving checkpoint at 9 epoch
2022-10-16 23:48:46,486 - modelscope - INFO - epoch(eval) [9][1]	accuracy: 1.0000
2022-10-16 23:49:24,087 - modelscope - INFO - epoch [10][1/1]	lr: 1.000e-06, eta: 0:00:00, iter_time: 37.598, data_load_time: 2.036, loss: 0.0705
Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.32it/s]
2022-10-16 23:49:26,500 - modelscope - INFO - Saving checkpoint at 10 epoch
2022-10-16 23:49:27,696 - modelscope - INFO - epoch(eval) [10][1]	accuracy: 1.0000

評估

訓(xùn)練完成以后，配置評估數(shù)據(jù)集，直接調(diào)用trainer對象的evaluate函數(shù)，即可完成模型的評估。

Total test samples: 100%|██████████| 3/3 [00:02<00:00,  1.30it/s]
{'accuracy': 1.0}

完整代碼

from modelscope.msdatasets import MsDataset
# 載入訓(xùn)練數(shù)據(jù)
train_dataset = MsDataset.load('afqmc_small', split='train')
# 載入評估數(shù)據(jù)
eval_dataset = MsDataset.load('afqmc_small', split='validation')


# 指定文本分類模型
model_id = 'damo/nlp_structbert_sentence-similarity_chinese-base'



from modelscope.trainers import build_trainer

# 指定工作目錄
tmp_dir = "/tmp"

# 配置參數(shù)
kwargs = dict(
        model=model_id,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        work_dir=tmp_dir)



trainer = build_trainer(default_args=kwargs)
trainer.train()


# 直接調(diào)用trainer.evaluate，可以傳入train階段生成的ckpt
# 也可以不傳入?yún)?shù)，直接驗證model
metrics = trainer.evaluate(checkXSpoint_path=None)
print(metrics)