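In this tutorial we walk through how to efficiently screen a large compound library (ZINC) with DeepChem. Screening a large compound library with machine learning is a CPU-bound, embarrassingly parallel problem. The code examples assume the available resource is a single very large machine (such as an AWS c5.18xlarge), but other systems (such as a supercomputing cluster) can be swapped in. At a high level, what we will do is:
1. Train a model on labeled data
2. Create work units
3. Create an inference script
4. Load work units into a work queue
5. Consume work units from the work queue
6. Gather results
This tutorial differs from the previous tutorials in that it is designed to run on AWS rather than on Google Colab, because we need access to a large machine with many cores to do this computation efficiently. We will give details on how to do this throughout the tutorial.
1. Train a model on labeled data
We will just build a simple model here. For a real problem you would probably try several models and tune the hyperparameters.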
In [1]:
from deepchem.molnet.load_function import hiv_datasets
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)
RDKit WARNING: [18:15:24] Enabling RDKit 2019.09.3 jupyter extensions
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
In [2]:
from deepchem.models import GraphConvModel
from deepchem.data import NumpyDataset
from sklearn.metrics import average_precision_score
import numpy as np
tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv")
train, valid, test = [NumpyDataset.from_DiskDataset(x) for x in all_datasets]
model = GraphConvModel(1, mode="classification")
model.fit(train)
Loading raw samples now.
shard_size: 8192
About to start loading CSV from /var/folders/st/ds45jcqj2232lvhr0y9qt5sc0000gn/T/HIV.csv
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 0 took 12.479 s
Loading shard 2 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 1 took 13.668 s
Loading shard 3 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 2 took 13.550 s
Loading shard 4 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 3 took 13.173 s
Loading shard 5 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
RDKit WARNING: [18:16:53] WARNING: not removing hydrogen atom without neighbors
RDKit WARNING: [18:16:53] WARNING: not removing hydrogen atom without neighbors
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 4 took 13.362 s
Loading shard 6 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 5 took 0.355 s
TIMING: dataset construction took 80.394 s
Loading dataset from disk.
TIMING: dataset construction took 16.676 s
Loading dataset from disk.
TIMING: dataset construction took 7.529 s
Loading dataset from disk.
TIMING: dataset construction took 7.796 s
Loading dataset from disk.
TIMING: dataset construction took 17.521 s
Loading dataset from disk.
TIMING: dataset construction took 7.770 s
Loading dataset from disk.
TIMING: dataset construction took 7.873 s
Loading dataset from disk.
TIMING: dataset construction took 15.495 s
Loading dataset from disk.
TIMING: dataset construction took 1.959 s
Loading dataset from disk.
TIMING: dataset construction took 1.949 s
Loading dataset from disk.
WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/layers.py:222: The name tf.unsorted_segment_sum is deprecated. Please use tf.math.unsorted_segment_sum instead.
WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/layers.py:224: The name tf.unsorted_segment_max is deprecated. Please use tf.math.unsorted_segment_max instead.
WARNING:tensorflow:Entity <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/keras_model.py:169: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/optimizers.py:76: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/keras_model.py:258: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/keras_model.py:260: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.
WARNING:tensorflow:Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING: Entity <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/losses.py:108: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.
WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/losses.py:109: The name tf.losses.Reduction is deprecated. Please use tf.compat.v1.losses.Reduction instead.
WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:318: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Out[2]:
0.0
In [3]:
y_true = np.squeeze(valid.y)
y_pred = model.predict(valid)[:,0,1]
print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))
sorted_results = sorted(zip(y_pred, y_true), reverse=True)
hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100
print("Hit Rate Top 100: %s" % hit_rate_100)
Average Precision Score:0.19783388433313015
Hit Rate Top 100: 0.37
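The hit rate reported above is the fraction of true actives among the 100 highest-scoring validation compounds. A minimal sketch of that metric on made-up data, using only the standard library (the function name hit_rate_at_k and the toy scores are illustrative and not part of DeepChem):

```python
def hit_rate_at_k(y_true, y_score, k):
    """Fraction of true positives among the k highest-scoring items."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank by score only, highest first; labels do not influence the order.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy example: 6 compounds, 3 of them active (label 1).
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(hit_rate_at_k(toy_labels, toy_scores, k=3))
```

Two of the three best-scoring toy compounds are active, so the hit rate at 3 is 2/3. Unlike average precision, this metric ignores everything below the cutoff, which matches how a screen is used in practice: only the top of the ranking gets followed up.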
Retrain the model on the full dataset for screening
In [29]:
tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv", split=None)
model = GraphConvModel(1, mode="classification", model_dir="/tmp/zinc/screen_model")
model.fit(all_datasets[0])
Loading raw samples now.
shard_size: 8192
About to start loading CSV from /tmp/HIV.csv
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 0 took 15.701 s
Loading shard 2 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 1 took 15.869 s
Loading shard 3 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 2 took 19.106 s
Loading shard 4 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 3 took 16.267 s
Loading shard 5 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 4 took 16.754 s
Loading shard 6 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 5 took 0.446 s
TIMING: dataset construction took 98.214 s
Loading dataset from disk.
TIMING: dataset construction took 21.127 s
Loading dataset from disk.
/home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
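2. Create work units
Download the full ZINC15 dataset
Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format. I found it easiest to download the wget script and then run it. For the rest of this tutorial we assume ZINC was downloaded to /tmp/zinc.
The way ZINC lays out its download is not well suited to inference. We want "work units" that a single CPU can process in a reasonable amount of time (10 minutes to an hour). To accomplish this we split the ZINC data into files of 500,000 lines each.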
mkdir -p /tmp/zinc/screen
find /tmp/zinc -name '*.smi' -exec cat {} \; | grep -iv "smiles" \
  | split -l 500000 - /tmp/zinc/screen/segment
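This bash command finds every .smi file under /tmp/zinc, prints the contents of each file to stdout, removes the header lines (any line containing "smiles"), and splits the result into files of 500,000 molecules each in /tmp/zinc/screen.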
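For readers who prefer to stay in Python, the same splitting step can be sketched as below. This is an illustration under stated assumptions, not part of the original tutorial: the function name split_smiles and the numbered output names (segment00000, segment00001, ...) are made up here and differ from the aa, ab, ... suffixes that split produces. Unlike grep, it also drops blank lines.

```python
import itertools
import os


def split_smiles(paths, out_dir, lines_per_unit=500000, prefix="segment"):
    """Concatenate .smi files, drop header lines, write fixed-size work units."""
    os.makedirs(out_dir, exist_ok=True)

    def molecule_lines():
        for path in paths:
            with open(path) as fin:
                for line in fin:
                    # Mirror `grep -iv "smiles"`: skip lines mentioning "smiles".
                    # Blank lines are skipped too (an addition to the bash version).
                    if line.strip() and "smiles" not in line.lower():
                        yield line if line.endswith("\n") else line + "\n"

    stream = molecule_lines()
    written = []
    for index in itertools.count():
        # Pull at most lines_per_unit lines; an empty chunk means we are done.
        chunk = list(itertools.islice(stream, lines_per_unit))
        if not chunk:
            break
        out_path = os.path.join(out_dir, "%s%05d" % (prefix, index))
        with open(out_path, "w") as fout:
            fout.writelines(chunk)
        written.append(out_path)
    return written
```

The input is consumed as a stream, so only one work unit is held in memory at a time.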
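3. Create the inference script
Now that we have work units, we need a program that ingests a work unit and logs the results. It is important that the logging mechanism is thread-safe! In this example we receive the work unit as a file path and log the results to a file. An easy extension for distributing work across multiple machines would be to fetch work units via a URL and log results to a distributed queue.
Here is roughly what this looks like.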
inference.py
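import sys

import deepchem as dc
import numpy as np
from rdkit import Chem


def create_dataset(fname, batch_size=50000):
    """Yield (dataset, original_lines) pairs of at most batch_size molecules."""
    featurizer = dc.feat.ConvMolFeaturizer()
    mols, orig_lines = [], []
    with open(fname) as fin:
        for line in fin:
            line = line.strip().split()
            try:
                mol = Chem.MolFromSmiles(line[0])
            except Exception:
                # Blank or malformed line: skip it.
                continue
            if mol is None:
                # RDKit could not parse the SMILES string.
                continue
            mols.append(mol)
            orig_lines.append(line)
            if len(mols) == batch_size:
                features = featurizer.featurize(mols)
                # Dummy labels; only the features are used for prediction.
                y = np.ones(shape=(len(mols), 1))
                yield dc.data.NumpyDataset(features, y), orig_lines
                mols, orig_lines = [], []
    # Flush the final, possibly smaller, batch.
    if len(mols) > 0:
        features = featurizer.featurize(mols)
        y = np.ones(shape=(len(mols), 1))
        yield dc.data.NumpyDataset(features, y), orig_lines


def evaluate(fname):
    # Each work unit writes to its own output file, so workers never share a file.
    fout_name = "%s_out.smi" % fname
    # 'screen_model' is resolved relative to the working directory, so this
    # assumes the script is run from /tmp/zinc, where the model was saved.
    # load_from_dir is the older TensorGraph loader; on DeepChem versions where
    # GraphConvModel is a KerasModel, this may need to be replaced by rebuilding
    # the model with the same model_dir and calling restore().
    model = dc.models.TensorGraph.load_from_dir('screen_model')
    for ds, lines in create_dataset(fname):
        y_pred = np.squeeze(model.predict(ds), axis=1)
        with open(fout_name, 'a') as fout:
            for index, line in enumerate(lines):
                # Append the predicted probability of the active class.
                line.append(y_pred[index][1])
                fout.write("%s\n" % "\t".join(str(x) for x in line))


if __name__ == "__main__":
    evaluate(sys.argv[1])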
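The generator in create_dataset keeps memory bounded by featurizing and yielding molecules in batches instead of loading a whole work unit at once. The batching pattern on its own, stripped of RDKit and DeepChem so that it runs anywhere, might look like the sketch below (batched_records and the is_valid callback are hypothetical stand-ins for the SMILES parsing step):

```python
def batched_records(lines, batch_size, is_valid=lambda fields: True):
    """Yield lists of whitespace-split records, at most batch_size per list."""
    batch = []
    for line in lines:
        fields = line.strip().split()
        if not fields or not is_valid(fields):
            continue  # skip blank or rejected lines
        batch.append(fields)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


sample = ["CCO ZINC1", "", "CCC ZINC2", "bad", "CCN ZINC3"]
for batch in batched_records(sample, batch_size=2, is_valid=lambda f: len(f) == 2):
    print(batch)
```

The check for a full batch runs only right after a record is appended, so a run of skipped lines can never cause the same batch to be emitted twice.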
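4. Load work units into a work queue
We will use a flat file as our distribution mechanism. It will be a bash script that calls our inference script once for each work unit. If you are at an academic institution, this step would instead mean queuing your jobs in pbs/qsub/slurm. A cloud-computing alternative is RabbitMQ or Kafka.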
In [ ]:
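import os

# Each file in /tmp/zinc/screen is one work unit.
work_units = os.listdir('/tmp/zinc/screen')
with open('/tmp/zinc/work_queue.sh', 'w') as fout:
    fout.write("#!/bin/bash\n")
    for work_unit in work_units:
        # Work units live in /tmp/zinc/screen, so build the path from there.
        full_path = os.path.join('/tmp/zinc/screen', work_unit)
        # One command per line so each can be run independently.
        fout.write("python inference.py %s\n" % full_path)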
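5. Consume work units from the "distribution mechanism"
We consume work units from our work queue using a very simple process pool. It takes lines from our "work queue" and runs them, with as many processes in parallel as our CPUs support. If you use a supercomputing cluster system such as pbs/qsub/slurm, it will take care of this for you. The key is to use one CPU per work unit to get the highest throughput. We accomplish this with the Linux utility "taskset".
Using an AWS c5.18xlarge, this will finish overnight.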
process_pool.py
import multiprocessing
import sys
from multiprocessing.pool import Pool

import delegator  # third-party package (delegator.py on PyPI)


def run_command(args):
    q, command = args
    # Borrow a free CPU id; blocks until one is available.
    cpu_id = q.get()
    try:
        # Pin the command to that single core.
        command = "taskset -c %s %s" % (cpu_id, command)
        print("running %s" % command)
        c = delegator.run(command)
        print(c.err)
        print(c.out)
    except Exception as e:
        print(e)
    finally:
        # Always return the CPU id so another command can use it.
        q.put(cpu_id)


def main(n_processors, command_file):
    with open(command_file) as fin:
        commands = [x.strip() for x in fin]
    # Skip blank lines and comments (including the shebang line).
    commands = [x for x in commands if x and not x.startswith("#")]
    # The queue holds one token per CPU id.
    q = multiprocessing.Manager().Queue()
    for i in range(n_processors):
        q.put(i)
    argslist = [(q, x) for x in commands]
    with Pool(processes=n_processors) as pool:
        pool.map(run_command, argslist)


if __name__ == "__main__":
    processors = multiprocessing.cpu_count()
    main(processors, sys.argv[1])
>> python process_pool.py /tmp/zinc/work_queue.sh
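The pool hands out CPU ids through a shared queue: a worker takes an id, pins its command to that core, and returns the id when it finishes, so no two running commands share a core. The sketch below isolates that token-passing idea, with threads and a short sleep standing in for taskset and the real inference job (all names here are illustrative):

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_cpu_tokens(n_cpus, n_jobs, job_seconds=0.01):
    """Run n_jobs fake jobs; return the most jobs ever seen sharing one CPU id."""
    tokens = queue.Queue()
    for cpu_id in range(n_cpus):
        tokens.put(cpu_id)

    in_use = {cpu_id: 0 for cpu_id in range(n_cpus)}
    lock = threading.Lock()
    max_sharing = 0

    def job(_):
        nonlocal max_sharing
        cpu_id = tokens.get()          # blocks until a core is free
        try:
            with lock:
                in_use[cpu_id] += 1
                max_sharing = max(max_sharing, in_use[cpu_id])
            time.sleep(job_seconds)    # stand-in for the real command
            with lock:
                in_use[cpu_id] -= 1
        finally:
            tokens.put(cpu_id)         # always hand the core back

    # Deliberately start more workers than tokens so that they contend.
    with ThreadPoolExecutor(max_workers=n_cpus * 2) as pool:
        list(pool.map(job, range(n_jobs)))
    return max_sharing


print(run_with_cpu_tokens(n_cpus=4, n_jobs=20))
```

Even with twice as many workers as tokens, each CPU id is held by at most one job at a time, which is the property that lets taskset give every inference process a core to itself.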
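6. Gather results
Since we logged our results to *_out.smi files, we now need to gather them all up and sort them by our predictions. The resulting file can be larger than 40GB. To analyze the data further you can use dask, or load the data into an RDKit Postgres cartridge.
Here I show how to join and sort the data to get the "best" results.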
find /tmp/zinc -name '*_out.smi' -exec cat {} \; > /tmp/zinc/screen/results.smi
sort -rg -k 3,3 /tmp/zinc/screen/results.smi > /tmp/zinc/screen/sorted_results.smi
# Put the top 100k scoring molecules in their own file
head -n 100000 /tmp/zinc/screen/sorted_results.smi > /tmp/zinc/screen/top_100k.smi
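If sorting a file of this size is inconvenient, the best-scoring rows can also be selected in a single streaming pass that keeps only the current top n in memory. A minimal sketch with the standard library's heapq module, assuming whitespace-separated rows with the score in the third column as written by inference.py (the function name top_n_lines is illustrative):

```python
import heapq


def top_n_lines(path, n, score_column=2):
    """Return the n lines with the largest score, highest first."""
    def scored(fin):
        for line in fin:
            fields = line.split()
            if len(fields) <= score_column:
                continue  # skip malformed lines
            try:
                yield float(fields[score_column]), line
            except ValueError:
                continue  # score was not a number

    with open(path) as fin:
        # nlargest keeps a heap of size n, so memory use does not grow with the file.
        best = heapq.nlargest(n, scored(fin), key=lambda pair: pair[0])
    return [line for _, line in best]
```

Calling top_n_lines('/tmp/zinc/screen/results.smi', 100000) would give the same rows as the sort and head commands above, apart from the ordering of tied scores.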
/tmp/zinc/screen/top_100k.smi is now small enough to investigate with standard tools such as pandas.
In [9]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import SVG
from rdkit.Chem.Draw import rdMolDraw2D
# Read the file once and keep the 100 best rows (SMILES in column 1, score in column 3).
with open('/tmp/zinc/screen/top_100k.smi') as fin:
    top_rows = [line.strip().split() for line in fin.readlines()[:100]]
best_mols = [Chem.MolFromSmiles(row[0]) for row in top_rows]
best_scores = [row[2] for row in top_rows]
In [10]:
print(best_scores[0])
best_mols[0]
0.98874843
Out[10]:
In [11]:
print(best_scores[0])
best_mols[1]
0.98874843
Out[11]:
In [12]:
print(best_scores[0])
best_mols[2]
0.98874843
Out[12]:
In [13]:
print(best_scores[0])
best_mols[3]
0.98874843
Out[13]:
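The screen seems to favor molecules containing sulfur trioxide groups. The top-scoring molecules also have low diversity. When assembling a list of compounds to purchase, we want to optimize for more than just activity, for example diversity and drug-like MPO.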
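One simple way to trade activity against diversity is a greedy filter: walk the candidates from best score down and keep a molecule only if it is not too similar to anything already kept. The sketch below is an illustration, not the tutorial's method. It uses plain Python sets as stand-in fingerprints; in practice the bit sets would come from a real fingerprint such as those RDKit computes, and the names, scores, and threshold here are made up.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_diverse(candidates, max_similarity, limit):
    """Greedy pick, best score first, rejecting near-duplicates of kept items."""
    kept = []
    for name, score, bits in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(tanimoto(bits, other[2]) <= max_similarity for other in kept):
            kept.append((name, score, bits))
        if len(kept) == limit:
            break
    return [name for name, _, _ in kept]


# Made-up (name, score, fingerprint bits) triples.
toy = [
    ("mol_a", 0.99, {1, 2, 3, 4}),
    ("mol_b", 0.98, {1, 2, 3, 5}),      # similarity 0.6 to mol_a
    ("mol_c", 0.90, {7, 8, 9}),
    ("mol_d", 0.85, {7, 8, 9, 10}),     # similarity 0.75 to mol_c
]
print(pick_diverse(toy, max_similarity=0.5, limit=3))
```

On the toy data this keeps mol_a and mol_c and drops their near-duplicates, even though mol_b scores higher than mol_c.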