
File Analysis of the GCN Datasets Cora, Citeseer, and Pubmed

Published: 2022-07-05. Publishing site: 脚本宝典


Introduction

This article describes the Cora, Citeseer, and Pubmed datasets in detail.

Cora, Citeseer, Pubmed

Dataset	Source	Graphs	Nodes	Edges	Features	Labels (y)
Cora	“Collective classification in network data,” AI Magazine, 2008	1	2708	5429	1433	7
Citeseer	“Collective classification in network data,” AI Magazine, 2008	1	3327	4732	3703	6
Pubmed	“Collective classification in network data,” AI Magazine, 2008	1	19717	44338	500	3

├── gcn
│   ├── data            // graph data
│   │   ├── ind.citeseer.allx
│   │   ├── ind.citeseer.ally
│   │   ├── ind.citeseer.graph
│   │   ├── ind.citeseer.test.index
│   │   ├── ind.citeseer.tx
│   │   ├── ind.citeseer.ty
│   │   ├── ind.citeseer.x
│   │   ├── ind.citeseer.y
│   │   ├── ind.cora.allx
│   │   ├── ind.cora.ally
│   │   ├── ind.cora.graph
│   │   ├── ind.cora.test.index
│   │   ├── ind.cora.tx
│   │   ├── ind.cora.ty
│   │   ├── ind.cora.x
│   │   ├── ind.cora.y
│   │   ├── ind.pubmed.allx
│   │   ├── ind.pubmed.ally
│   │   ├── ind.pubmed.graph
│   │   ├── ind.pubmed.test.index
│   │   ├── ind.pubmed.tx
│   │   ├── ind.pubmed.ty
│   │   ├── ind.pubmed.x
│   │   └── ind.pubmed.y
│   ├── __init__.py
│   ├── inits.py        // shared initialization functions
│   ├── layers.py       // GCN layer definitions
│   ├── metrics.py      // evaluation metrics
│   ├── models.py       // model architecture definitions
│   ├── train.py        // training
│   └── utils.py        // utility functions
├── LICENCE
├── README.md
├── requirements.txt
└── setup.py

All three datasets consist of the following eight files, stored in similar formats:

ind.dataset_str.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
ind.dataset_str.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
ind.dataset_str.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.dataset_str.x) as scipy.sparse.csr.csr_matrix object;
ind.dataset_str.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
ind.dataset_str.ty => the one-hot labels of the test instances as numpy.ndarray object;
ind.dataset_str.ally => the labels for instances in ind.dataset_str.allx as numpy.ndarray object;
ind.dataset_str.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict object;
ind.dataset_str.test.index => the indices of test instances in graph, for the inductive setting as list object.

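Taking Cora as an example:

ind.dataset_str.x => feature vectors of the training instances, a scipy.sparse.csr.csr_matrix object, shape: (140, 1433)
ind.dataset_str.tx => feature vectors of the test instances, shape: (1000, 1433)
ind.dataset_str.allx => feature vectors of the labeled and unlabeled training instances, a superset of ind.dataset_str.x, shape: (1708, 1433)

ind.dataset_str.y => labels of the training instances, one-hot encoded, a numpy.ndarray object, shape: (140, 7)
ind.dataset_str.ty => labels of the test instances, one-hot encoded, a numpy.ndarray object, shape: (1000, 7)
ind.dataset_str.ally => labels corresponding to ind.dataset_str.allx, one-hot encoded, shape: (1708, 7)

ind.dataset_str.graph => the graph data, a collections.defaultdict instance in the format {index: [index_of_neighbor_nodes]}
ind.dataset_str.test.index => the IDs of the test instances, one per line (stated here as 2157 lines, although Cora has only 1000 test instances)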

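Taking Cora as an Example

The Cora dataset consists of machine learning papers and has been a popular benchmark in graph deep learning in recent years. Each paper in the dataset belongs to one of the following seven classes:

Case-Based
Genetic Algorithms
Neural Networks
Probabilistic Methods
Reinforcement Learning
Rule Learning
Theory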

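The papers were selected so that, in the final corpus, every paper cites or is cited by at least one other paper. The whole corpus contains 2708 papers.

After stemming and stop-word removal, and after dropping all words with a document frequency below 10, 1433 unique words remain. The features are therefore 1433-dimensional, and each entry is 0 or 1, indicating whether the corresponding word is absent from or present in the paper.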
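The 0/1 word-presence encoding described above can be illustrated with a small sketch. The five-word vocabulary, the two documents, and the helper name `binary_bag_of_words` are all made up for illustration; the real Cora vocabulary has 1433 words and its preprocessing (stemming, stop-word removal) is not reproduced here.

```python
import numpy as np

# Hypothetical toy vocabulary and documents, for illustration only.
vocabulary = ["graph", "neural", "network", "genetic", "policy"]
documents = [
    "neural network for graph data",
    "genetic search",
]


def binary_bag_of_words(text, vocab):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    words = set(text.lower().split())
    return np.array([1 if word in words else 0 for word in vocab], dtype=np.int64)


# One row per document, one column per vocabulary word.
features = np.stack([binary_bag_of_words(doc, vocabulary) for doc in documents])
print(features)
# [[1 1 1 0 0]
#  [0 0 0 1 0]]
```

Each row can contain several ones, which is why this is a binary bag-of-words vector rather than a strict one-hot vector.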

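File composition (Cora)

All three datasets consist of the following eight files, which fall into three groups and are stored in similar formats:

x, tx, and allx are the features (binary 0/1 vectors once converted to a dense array)

x (shape (140, 1433)) holds the feature vectors of the 140 training papers, tx (shape (1000, 1433)) holds the feature vectors of the 1000 test papers, and allx (shape (1708, 1433)) holds the feature vectors of the 1708 labeled and unlabeled training papers, indexed 0 to 1707.
Number of nodes = 1000 + 1708 = 2708 (the 1000 in tx plus the 1708 in allx).

y, ty, and ally are the corresponding labels (one-hot encoded)

y (shape (140, 7)) holds the labels of the 140 training papers, ty (shape (1000, 7)) holds the labels of the 1000 test papers, and ally (shape (1708, 7)) holds the labels corresponding to ind.dataset_str.allx, covering both labeled and unlabeled instances, indexed 0 to 1707.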

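graph, test.index

There are 2708 nodes in total. Only 140 are used for training, with indices in [0, 140); 500 are used for validation, with indices in [140, 640); and 1000 are used for testing, with indices in [1708, 2707]. The remaining nodes, with indices in [640, 1707], are not assigned to any of the three splits.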
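The split described above can be turned into boolean node masks. The following is a minimal sketch that hard-codes the Cora index ranges as assumptions (training 0 to 139, validation 140 to 639, test 1708 to 2707); in practice the test indices would be read from test.index rather than written as a range, and the helper name `sample_mask` is chosen here for illustration.

```python
import numpy as np


def sample_mask(indices, size):
    """Return a boolean vector of length size that is True at the given indices."""
    mask = np.zeros(size, dtype=bool)
    mask[list(indices)] = True
    return mask


num_nodes = 2708                 # Cora node count
idx_train = range(0, 140)        # assumed training indices
idx_val = range(140, 640)        # assumed validation indices
idx_test = range(1708, 2708)     # assumed test indices

train_mask = sample_mask(idx_train, num_nodes)
val_mask = sample_mask(idx_val, num_nodes)
test_mask = sample_mask(idx_test, num_nodes)

# Nodes that belong to none of the three splits (indices 640 to 1707).
unused_mask = ~(train_mask | val_mask | test_mask)

print(train_mask.sum(), val_mask.sum(), test_mask.sum(), unused_mask.sum())
# 140 500 1000 1068
```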

Code for the features:

import pickle as pkl

with open("data/ind.cora.x", 'rb') as f:
    data = pkl.load(f, encoding='latin1')
print(type(data))      # scipy.sparse.csr.csr_matrix, a sparse matrix
print(data.shape)      # (140, 1433): ind.cora.x has 140 rows and 1433 columns
print(data.shape[0])   # row: 140
print(data.shape[1])   # column: 1433
nonzero = data.nonzero()
print(nonzero)         # row and column coordinates of the non-zero elements
print(type(nonzero))   # <class 'tuple'>
print(nonzero[0])      # rows: [ 0 0 0 ... 139 139 139]
print(nonzero[1])      # columns: [ 19 81 146 ... 1263 1274 1393]
print(data.toarray())  # dense 0/1 array
print(data)            # (row, column) coordinates and values of the non-zero elements

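The variable data is a scipy.sparse.csr.csr_matrix, that is, a sparse matrix; printing it lists the row and column coordinates and the values of its non-zero entries. In other words, if a paper contains a given word, the corresponding entry is set to 1, a binary encoding similar in spirit to one-hot encoding.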

Code for the labels:

import pickle as pkl

with open("data/ind.cora.y", 'rb') as f:
    print(f)              # <_io.BufferedReader name='data/ind.cora.y'>
    data = pkl.load(f, encoding='latin1')
    print(type(data))     # <class 'numpy.ndarray'>
    print(data.shape)     # (140, 7)
    print(data.shape[0])  # row: 140
    print(data.shape[1])  # column: 7
    print(data[1])        # [0 0 0 0 1 0 0]
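One-hot label rows such as the one printed above are often converted to integer class indices. A minimal sketch on a made-up three-row array (not the real Cora labels) is shown below; using -1 for an all-zero row is a convention chosen here for illustration only.

```python
import numpy as np

# Toy one-hot labels with 7 classes; the last row is all zeros (no label).
one_hot = np.array([
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
])

has_label = one_hot.sum(axis=1) > 0
# argmax returns 0 for an all-zero row, so mark such rows with -1 instead.
class_ids = np.where(has_label, one_hot.argmax(axis=1), -1)

print(class_ids)  # [ 3  4 -1]
```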

Code for the edge relations:

import pickle as pkl

with open("data/ind.cora.graph", 'rb') as f:
    data = pkl.load(f, encoding='latin1')
    print(type(data))  # <class 'collections.defaultdict'>
    print(data)

defaultdict(<class 'list'>, {0: [633, 1862, 2582], 1: [2, 652, 654], 2: [1986, 332, 1666, 1, 1454], ..., 2706: [165, 2707, 1473, 169], 2707: [598, 165, 1473, 2706]})
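A neighbor dictionary of this form is usually converted to a sparse adjacency matrix before it is fed to a model. The GCN reference code does this with networkx (`nx.adjacency_matrix(nx.from_dict_of_lists(graph))`); the sketch below is a scipy-only alternative on a made-up four-node graph, and the helper name `adjacency_from_dict` is hypothetical.

```python
from collections import defaultdict

import numpy as np
import scipy.sparse as sp

# Toy graph in the same {index: [index_of_neighbor_nodes]} format.
graph = defaultdict(list, {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]})


def adjacency_from_dict(graph, num_nodes):
    """Build a symmetric 0/1 CSR adjacency matrix from a neighbor dictionary."""
    rows, cols = [], []
    for node, neighbors in graph.items():
        for neighbor in neighbors:
            rows.append(node)
            cols.append(neighbor)
    data = np.ones(len(rows), dtype=np.float32)
    adj = sp.coo_matrix((data, (rows, cols)), shape=(num_nodes, num_nodes)).tocsr()
    adj.data[:] = 1.0            # repeated edges are summed on conversion; reset to 1
    return adj.maximum(adj.T)    # make sure every edge is stored in both directions


adj = adjacency_from_dict(graph, num_nodes=4)
print(adj.toarray())
```

Symmetrizing explicitly is a design choice of this sketch: it treats the citation graph as undirected even if an edge is listed under only one of its endpoints.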

Code for data/ind.cora.test.index:

# parse_index_file is the helper defined in utils.py of the GCN code base;
# it reads one integer index per line.
test_idx_reorder = parse_index_file("data/ind.cora.test.index")
print("test index:", test_idx_reorder)
# test index: [2692, 2532, 2050, 1715, 2362, 2609, 2622, 1975, 2081, 1767, 2263,..]
print("min_index:", min(test_idx_reorder))
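For readers without the GCN code base at hand, the index-file parsing step can be sketched as follows. This is a minimal stand-in written for illustration, assuming one integer per line; the temporary file name `toy.test.index` and its three entries are made up so that the example runs without the dataset.

```python
import os
import tempfile


def parse_index_file(filename):
    """Read one integer index per line and return them as a list."""
    index = []
    with open(filename) as f:
        for line in f:
            index.append(int(line.strip()))
    return index


# Write a tiny, made-up index file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.test.index")
    with open(path, "w") as f:
        f.write("5\n3\n4\n")
    test_idx_reorder = parse_index_file(path)

test_idx_range = sorted(test_idx_reorder)

print("test index:", test_idx_reorder)  # test index: [5, 3, 4]
print("min_index:", min(test_idx_reorder), "max_index:", max(test_idx_reorder))
```

The file order (`test_idx_reorder`) and the sorted order (`test_idx_range`) are kept separately because the test rows are stored in file order and later need to be moved to their sorted positions.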

Special handling of isolated nodes in the Citeseer dataset:

    # Excerpt from load_data() in the GCN utils.py. x, y, tx, ty and test_idx_reorder
    # are loaded earlier in that function, and test_idx_range = np.sort(test_idx_reorder).
    # Handle the isolated nodes in citeseer
    if dataset_str == 'citeseer':
        # Fix citeseer dataset (there are some isolated nodes in the graph)
        # Find isolated nodes, add them as zero-vecs into the right position

        test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder)+1)
        # print("test_idx_range_full.length", len(test_idx_range_full))
        # test_idx_range_full.length 1015
        # print(np.array(test_idx_range_full))
        # [2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325
        # ....
        # 3321 3322 3323 3324 3325 3326]

        # Empty sparse matrix in LIL format, tx_extended.shape = (1015, 3703)
        tx_extended = sp.lil_matrix((len(test_idx_range_full), x.shape[1]))

        # test_idx_range - min(test_idx_range) subtracts min(test_idx_range) from every
        # element, so the index values in test_idx_range are renumbered from 0
        tx_extended[test_idx_range-min(test_idx_range), :] = tx
        # print(tx_extended.shape)  # (1015, 3703)

        # print(tx_extended)
        # (0, 19) 1.0
        # (0, 21) 1.0
        # (0, 169) 1.0
        # (0, 170) 1.0
        # (0, 425) 1.0
        #  ...
        # (1014, 3243) 1.0
        # (1014, 3351) 1.0
        # (1014, 3472) 1.0

        tx = tx_extended
        # print(tx.shape)
        # (1015, 3703)
        # 15 rows (997, 994, 993, 980, 938, ...) are all zeros

        ty_extended = np.zeros((len(test_idx_range_full), y.shape[1]))
        ty_extended[test_idx_range-min(test_idx_range), :] = ty
        ty = ty_extended
        # for i in range(ty.shape[0]):
        #     print(i, " ", ty[i])
        #     # 980 [0. 0. 0. 0. 0. 0.]
        #     # 994 [0. 0. 0. 0. 0. 0.]
        #     # 993 [0. 0. 0. 0. 0. 0.]
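The padding logic above is easier to follow on a tiny example. The sketch below reuses the same variable names on made-up data: three test instances with indices 7, 4, and 6, so index 5 is the "isolated" node missing from the index file. All values are illustrative and none come from Citeseer.

```python
import numpy as np
import scipy.sparse as sp

# Made-up test indices in file order; index 5 is absent (the isolated node).
test_idx_reorder = [7, 4, 6]
tx = sp.csr_matrix(np.array([[1, 0, 1],
                             [0, 1, 0],
                             [1, 1, 0]], dtype=np.float32))
ty = np.array([[1, 0],
               [0, 1],
               [1, 0]], dtype=np.float32)

test_idx_range = np.sort(test_idx_reorder)                       # [4 6 7]
test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder) + 1)

# Allocate all-zero rows for the full index range, then fill in the known rows.
tx_extended = sp.lil_matrix((len(test_idx_range_full), tx.shape[1]))
tx_extended[test_idx_range - min(test_idx_range), :] = tx
ty_extended = np.zeros((len(test_idx_range_full), ty.shape[1]))
ty_extended[test_idx_range - min(test_idx_range), :] = ty

# Rows whose label vector stayed all-zero correspond to the padded nodes.
padded_rows = np.flatnonzero(ty_extended.sum(axis=1) == 0)
missing_ids = padded_rows + min(test_idx_reorder)

print(tx_extended.toarray())
print(missing_ids)  # [5]
```

Note that, as in the excerpt, the rows of tx are written to the sorted positions in file order; moving each row to the position of its own index is a separate reordering step.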

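allx contains all training instances, both labeled and unlabeled, indexed 0 to 1707 (1708 in total).
ally holds the labels corresponding to allx, also indexed 0 to 1707; the 1000 test instances (tx, ty) occupy indices 1708 to 2707.
The Citeseer test set contains some isolated nodes (15 of them, with no corresponding index in test.index). These can be added to the test set tx as nodes whose features are all zero, with corresponding all-zero label rows in ty.
The input is one whole graph, so tx and allx are concatenated to form the feature matrix.
The y value of an instance without a label is [0,0,0,0,0,0,0].
The features in the datasets are also sparse and are stored as a LIL sparse matrix, in the following format: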

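import numpy as np
import scipy.sparse as sp

A = np.array([[1,0,2,0],[0,0,0,0],[3,0,0,0],[1,0,0,4]])
AS = sp.lil_matrix(A)
print(AS)  # lists the non-zero entries as (row, column) value:
# (0, 0) 1
# (0, 2) 2
# (2, 0) 3
# (3, 0) 1
# (3, 3) 4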
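The concatenation of allx and tx mentioned in the notes above, followed by moving the test rows to the positions given by the index file, can be sketched on toy data. The matrices and indices below are made up, and the non-binary values 5, 6, and 7 are used only so that each row is easy to track.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-ins: two training rows (nodes 0 and 1) and three test rows in file order.
allx = sp.csr_matrix(np.array([[1, 0],
                               [0, 1]], dtype=np.float32))
tx = sp.csr_matrix(np.array([[5, 0],
                             [0, 6],
                             [7, 7]], dtype=np.float32))
test_idx_reorder = [4, 2, 3]                 # node index of each tx row, in file order
test_idx_range = np.sort(test_idx_reorder)   # [2 3 4]

# Stack training and test features into one matrix for the whole graph.
features = sp.vstack((allx, tx)).tolil()
# Move each test row from its stacked position to the row of its node index.
features[test_idx_reorder, :] = features[test_idx_range, :]

print(features.toarray())
# [[1. 0.]
#  [0. 1.]
#  [0. 6.]
#  [7. 7.]
#  [5. 0.]]
```

After the reordering, the first row of tx (node index 4 in the file) sits in row 4 of the feature matrix, so row i of the matrix describes node i of the graph.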

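The Tox21 dataset

This dataset comes from a 2014 challenge hosted on a PubChem website: https://tripod.nih.gov/tox21/challenge/about.jsp
PubChem is the open chemistry database of the US National Institutes of Health (NIH) and the world's largest collection of freely accessible chemical information.
PubChem's data is contributed by hundreds of data sources, including government agencies, chemical vendors, and journal publishers.

The Toxicology in the 21st Century (Tox21) program is a federal collaboration among the NIH, the Environmental Protection Agency, and the Food and Drug Administration that aims to develop better toxicity assessment methods. The goal is to test quickly and efficiently whether certain compounds have the potential to disrupt processes in the human body that may lead to adverse health effects. The Tox21 dataset is the one used in that challenge; it contains structural information for chemical compounds measured in 12 toxicological assays: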

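Estrogen receptor α, LBD (ER, LBD)
Estrogen receptor α, full (ER, full)
Aromatase
Aryl hydrocarbon receptor (AhR)
Androgen receptor, full (AR, full)
Androgen receptor, LBD (AR, LBD)
Peroxisome proliferator-activated receptor γ (PPAR-γ)
Nuclear factor (erythroid-derived 2)-like 2 / antioxidant response element (Nrf2/ARE)
Heat shock factor response element (HSE)
ATAD5
Mitochondrial membrane potential (MMP)
p53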

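Every assay tests the same set of compounds, with PUBCHEM_SID values ranging from 144203552 to 144214049 (10486 compounds in total), and reports activity results for substances including environmental chemicals and some marketed drugs.

(Remainder omitted.)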
