知识图谱论文学习并基于华为云计算平台复现

举报
yd_274944317 发表于 2023/11/24 19:38:39 2023/11/24
【摘要】 ConvE复现实验 论文信息论文:Convolutional 2D Knowledge Graph Embeddings项目链接:https://github.com/TimDettmers/ConvE 重要警告官方提供的运行配置说明基本有效,但如果你是在2022.10.15左右直接按照下面官方提供的运行配置不加修改地运行的话,会遇到以下问题:该模型所依赖的作者另一个代码库spoderne...

ConvE复现实验

论文信息

论文:Convolutional 2D Knowledge Graph Embeddings

项目链接:https://github.com/TimDettmers/ConvE

本次运行的配置

平台:华为云ECS弹性服务器

配置:

CPU: Intel SkyLake 6151 3.0GHz / Intel Cascade Lake 6278 2.6GHz 8vCPUs

GPU: 1 * NVIDIA T4 / 1 * 16G # 支持CUDA的GPU是必须的,我第一次训练时没有CUDA GPU,触发异常,无法正常训练

RAM: 32GiB

系统:CentOS 8.2 64bit for GPU(40GB)

Disk:通用型SSD 60GiB

环境:Anaconda, Python 3.8, PyTorch 1.12.1 Linux CUDA 11.3

重要警告

官方提供的运行配置说明基本有效,但如果你是在2022.10.15左右直接按照下面官方提供的运行配置不加修改地运行的话,会遇到以下问题:

该模型所依赖的作者另一个代码库spodernet用到了spaCy的早期版本。而直接按下面的方式利用requirement.txt安装依赖的话,会安装spaCy 3.0+,而spaCy 3.0废弃了作者在代码中使用的一个缩写写法,所以运行时会抛出异常。

Installation (Mostly copy from project README)

This repo supports Linux and Python installation via Anaconda.

  1. Install PyTorch using Anaconda.

  2. Install the requirements pip install -r requirements.txt

    如果在这一步可能卡在了网络连接问题上的话,可以尝试在机器上添加ssh-key,然后把requirements.txt改成下面这样(我也不知道为啥我弄完以后才成功,很玄学):

    scipy
    -e git+https://github.com/TimDettmers/spodernet.git#egg=spodernet
    -e git+https://github.com/TimDettmers/bashmagic.git#egg=bashmagic
    
    
  3. Download the default English model used by spaCy, which is installed in the previous step python -m spacy download en

  4. Run the preprocessing script for WN18RR, FB15k-237, YAGO3-10, UMLS, Kinship, and Nations: sh preprocess.sh

  5. You can now run the model

Running a model

Parameters need to be specified by white-space tuples for example:

CUDA_VISIBLE_DEVICES=0 python main.py --model conve --data FB15k-237 \
                                      --input-drop 0.2 --hidden-drop 0.3 --feat-drop 0.2 \
                                      --lr 0.003 --preprocess

will run a ConvE model on FB15k-237.

To run a model, you first need to preprocess the data once. This can be done by specifying the --preprocess parameter:

CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME --preprocess

After the dataset is preprocessed it will be saved to disk and this parameter can be omitted.

CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME

The following parameters can be used for the --model parameter:

conve
distmult
complex

The following datasets can be used for the --data parameter:

FB15k-237
WN18RR
YAGO3-10
umls
kinship
nations

And here a complete list of parameters.

Link prediction for knowledge graphs

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        input batch size for training (default: 128)
  --test-batch-size TEST_BATCH_SIZE
                        input batch size for testing/validation (default: 128)
  --epochs EPOCHS       number of epochs to train (default: 1000)
  --lr LR               learning rate (default: 0.003)
  --seed S              random seed (default: 17)
  --log-interval LOG_INTERVAL
                        how many batches to wait before logging training
                        status
  --data DATA           Dataset to use: {FB15k-237, YAGO3-10, WN18RR, umls,
                        nations, kinship}, default: FB15k-237
  --l2 L2               Weight decay value to use in the optimizer. Default:
                        0.0
  --model MODEL         Choose from: {conve, distmult, complex}
  --embedding-dim EMBEDDING_DIM
                        The embedding dimension (1D). Default: 200
  --embedding-shape1 EMBEDDING_SHAPE1
                        The first dimension of the reshaped 2D embedding. The
                        second dimension is infered. Default: 20
  --hidden-drop HIDDEN_DROP
                        Dropout for the hidden layer. Default: 0.3.
  --input-drop INPUT_DROP
                        Dropout for the input embeddings. Default: 0.2.
  --feat-drop FEAT_DROP
                        Dropout for the convolutional features. Default: 0.2.
  --lr-decay LR_DECAY   Decay the learning rate by this factor every epoch.
                        Default: 0.995
  --loader-threads LOADER_THREADS
                        How many loader threads to use for the batch loaders.
                        Default: 4
  --preprocess          Preprocess the dataset. Needs to be executed only
                        once. Default: 4
  --resume              Resume a model.
  --use-bias            Use a bias in the convolutional layer. Default: True
  --label-smoothing LABEL_SMOOTHING
                        Label smoothing value to use. Default: 0.1
  --hidden-size HIDDEN_SIZE
                        The side of the hidden layer. The required size
                        changes with the size of the embeddings. Default: 9728
                        (embedding size 200).

To reproduce most of the results in the ConvE paper, you can use the default parameters and execute the command below:

CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME

For the reverse model, you can run the provided file with the name of the dataset name and a threshold probability:

python inverse_model.py WN18RR 0.9

Changing the embedding size for ConvE

If you want to change the embedding size you can do that via the ``–embedding-dimparameter. However, for ConvE, since the embedding is reshaped as a 2D embedding one also needs to pass the first dimension of the reshaped embedding (–embedding-shape1) while the second dimension is infered. When once changes the embedding size, the hidden layer size–hidden-size` also needs to be different but it is difficult to determine before run time. The easiest way to determine the hidden size is to run the model, let it run on an error due to wrong shape, and then reshape according to the dimension in the error message.

Example: Change embedding size to be 100. We want 10x10 2D embeddings. We run python main.py --embedding-dim 100 --embedding-shape1 10 and we run on an error due to wrong hidden dimension:

   ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [128 x 4608], m2: [9728 x 100] at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/generic/THCTensorMathBlas.cu:273

Now we change the hidden dimension to 4608 accordingly: python main.py --embedding-dim 100 --embedding-shape1 10 --hidden-size 4608. Now the model runs with an embedding size of 100 and 10x10 2D embeddings.

Adding new datasets

To run it on a new datasets, copy your dataset folder into the data folder and make sure your dataset split files have the name train.txt, valid.txt, and test.txt which contain tab separated triples of a knowledge graph. Then execute python wrangle_KG.py FOLDER_NAME, afterwards, you can use the folder name of your dataset in the dataset parameter.

Adding your own model

You can easily write your own knowledge graph model by extending the barebone model MyModel that can be found in the model.py file.

Quirks

There are some quirks of this framework.

  1. The model currently ignores data that does not fit into the specified batch size, for example if your batch size is 100 and your test data is 220, then 20 samples will be ignored. This is designed in that way to improve performance on small datasets. To test on the full test-data you can save the model checkpoint, load the model (with the --resume True variable) and then evaluate with a batch size that fits the test data (for 220 you could use a batch size of 110). Another solution is to just use a fitting batch size from the start, that is, you could train with a batch size of 110.

Issues

It has been noted that #6 WN18RR does contain 212 entities in the test set that do not appear in the training set. About 6.7% of the test set is affected. This means that most models will find it impossible to make any reasonable predictions for these entities. This will make WN18RR appear more difficult than it really is, but it should not affect the usefulness of the dataset. If all researchers compared to the same datasets the scores will still be comparable.

Logs

Some log files of the original research are included in the repo (logs.tar.gz). These log files are mostly unstructured in names and might be created from checkpoints so that it is difficult to comprehend them. Nevertheless, it might help to replicate the results or study the behavior of the training under certain conditions and thus I included them here.

Citation

If you found this codebase or our work useful please cite us:

@inproceedings{dettmers2018conve,
	Author = {Dettmers, Tim and Pasquale, Minervini and Pontus, Stenetorp and Riedel, Sebastian},
	Booktitle = {Proceedings of the 32th AAAI Conference on Artificial Intelligence},
	Title = {Convolutional 2D Knowledge Graph Embeddings},
	Url = {https://arxiv.org/abs/1707.01476},
	Year = {2018},
        pages  = {1811--1818},
  	Month = {February}
}



【版权声明】本文为华为云社区用户转载文章,如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。