• DeepFM
    • 1. Algorithm Overview
      • 1.1 The Embedding and BiInnerSumCross Layers
      • 1.2 Other Layers
      • 1.3 Network Construction
    • 2. Running and Performance
      • 2.1 Json Configuration File
      • 2.2 Submit Script

    DeepFM

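    1. Algorithm Overview

    DeepFM is built by adding deep layers on top of FM (Factorization Machine). Compared with PNN and NFM, it keeps FM's second-order implicit feature crosses while using a deep network to capture higher-order feature crosses. Its architecture is as follows: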

    [Figure: DeepFM network architecture]

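    1.1 The Embedding and BiInnerSumCross Layers

    Unlike a traditional FM implementation, the second-order implicit cross is implemented here by combining Embedding with BiInnerSumCross. The traditional FM second-order cross term is: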

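    $$\sum_i\sum_{j>i} x_i x_j \mathbf{v}_i^T\mathbf{v}_j = \frac{1}{2}\left(\left(\sum_i x_i\mathbf{v}_i\right)^T\left(\sum_j x_j\mathbf{v}_j\right) - \sum_i (x_i\mathbf{v}_i)^T(x_i\mathbf{v}_i)\right)$$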

    In the implementation, the vectors $\mathbf{v}_i$ are stored as an Embedding. Calling the Embedding's calOutput computes the products $x_i\mathbf{v}_i$ and outputs them together, so the Embedding output for one sample is:

    $$(x_1\mathbf{v}_1, x_2\mathbf{v}_2, x_3\mathbf{v}_3, \cdots, x_k\mathbf{v}_k) = (\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3, \cdots, \mathbf{u}_k)$$

    The original second-order cross term can therefore be rewritten as:

    $$\frac{1}{2}\left(\left(\sum_i \mathbf{u}_i\right)^T\left(\sum_j \mathbf{u}_j\right) - \sum_i \mathbf{u}_i^T\mathbf{u}_i\right)$$

    This is the forward computation of BiInnerSumCross. In Scala it is implemented as:

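    // Running sum of the embedding vectors u_i of the current sample
    val sumVector = VFactory.denseDoubleVector(mat.getSubDim)
    (0 until batchSize).foreach { row =>
      val partitions = mat.getRow(row).getPartitions
      partitions.foreach { vectorOuter =>
        // Subtract u_i^T u_i and accumulate u_i into the running sum
        data(row) -= vectorOuter.dot(vectorOuter)
        sumVector.iadd(vectorOuter)
      }
      // Add (sum_i u_i)^T (sum_j u_j), then halve
      data(row) += sumVector.dot(sumVector)
      data(row) /= 2
      // Reset the running sum before the next sample
      sumVector.clear()
    }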
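
    The identity behind this loop can be checked on plain arrays. The following is a minimal sketch, not Angel code: `BiInnerSumCrossSketch`, `fast`, and `pairwise` are hypothetical names, and `Array[Double]` stands in for Angel's vector types. `fast` follows the rewritten formula, while `pairwise` sums the inner products over all pairs i < j directly.

```scala
object BiInnerSumCrossSketch {
  // Dot product of two equal-length vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.indices.map(i => a(i) * b(i)).sum
  }

  // Fast form: 0.5 * ((sum_i u_i)^T (sum_j u_j) - sum_i u_i^T u_i).
  def fast(us: Seq[Array[Double]]): Double =
    if (us.isEmpty) 0.0
    else {
      val sum = new Array[Double](us.head.length)
      var selfDots = 0.0
      us.foreach { u =>
        selfDots += dot(u, u)
        u.indices.foreach(i => sum(i) += u(i))
      }
      (dot(sum, sum) - selfDots) / 2
    }

  // Reference form: explicit sum of u_i^T u_j over all pairs i < j.
  def pairwise(us: Seq[Array[Double]]): Double = {
    val pairs = for {
      i <- us.indices
      j <- (i + 1) until us.length
    } yield dot(us(i), us(j))
    pairs.sum
  }
}
```

    For the three vectors (1, 2), (3, 4), and (0, 1), both functions return 17.0: the pairwise inner products are 11 + 2 + 4, and the fast form gives (65 - 31) / 2. The fast form needs one pass over the k vectors instead of k(k-1)/2 inner products.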

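    1.2 Other Layers

    • SimpleInputLayer: the sparse data input layer, specially optimized for sparse high-dimensional data; in essence it is an FCLayer
    • FCLayer: the most common layer in a DNN, a linear transformation followed by a transfer function
    • SumPooling: computes the element-wise sum of several inputs, which must all have the same shape
    • SimpleLossLayer: the loss layer; different loss functions can be specified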
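
    The FCLayer and SumPooling descriptions above can be illustrated for a single sample on plain arrays. This is only a sketch under stated assumptions: `LayerSketch`, `fcForward`, and `sumPooling` are hypothetical names, and the code does not reflect how Angel stores weights or batches data.

```scala
object LayerSketch {
  // ReLU transfer function.
  val relu: Double => Double = v => math.max(0.0, v)

  // FCLayer-style forward pass for one sample: y = f(W x + b).
  def fcForward(w: Array[Array[Double]], b: Array[Double],
                x: Array[Double], f: Double => Double): Array[Double] = {
    require(w.length == b.length, "one bias per output unit")
    w.zip(b).map { case (row, bias) =>
      require(row.length == x.length, "weight row must match input size")
      f(row.zip(x).map { case (wi, xi) => wi * xi }.sum + bias)
    }
  }

  // SumPooling-style merge: element-wise sum of inputs with identical shape.
  def sumPooling(inputs: Seq[Array[Double]]): Array[Double] = {
    require(inputs.nonEmpty, "need at least one input")
    val n = inputs.head.length
    require(inputs.forall(_.length == n), "all inputs must have the same shape")
    Array.tabulate(n)(i => inputs.map(_(i)).sum)
  }
}
```

    The shape check in `sumPooling` mirrors the requirement that all SumPooling inputs share the same shape.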

    1.3 Network Construction

    override def buildNetwork(): Unit = {
      ensureJsonAst()

      // Wide part: linear model over the sparse input
      val wide = new SimpleInputLayer("input", 1, new Identity(),
        JsonUtils.getOptimizerByLayerType(jsonAst, "SparseInputLayer")
      )

      // Embedding shared by the FM cross and the deep part
      val embeddingParams = JsonUtils.getLayerParamsByLayerType(jsonAst, "Embedding")
        .asInstanceOf[EmbeddingParams]
      val embedding = new Embedding("embedding", embeddingParams.outputDim,
        embeddingParams.numFactors, embeddingParams.optimizer.build()
      )

      // FM part: second-order cross over the embedding output
      val innerSumCross = new BiInnerSumCross("innerSumPooling", embedding)

      // Deep part: fully connected layers over the embedding output
      val mlpLayer = JsonUtils.getFCLayer(jsonAst, embedding)

      // Sum the three branches and attach the loss
      val join = new SumPooling("sumPooling", 1, Array[Layer](wide, innerSumCross, mlpLayer))
      new SimpleLossLayer("simpleLossLayer", join, lossFunc)
    }
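
    To make the last two steps concrete, the sketch below adds the three width-1 branch outputs into one logit, as the SumPooling layer does, and evaluates one common form of log loss on it. It is an illustration only: `DeepFMOutputSketch` and its methods are hypothetical names, and the label encoding in {-1, +1} is an assumption that the text above does not confirm.

```scala
object DeepFMOutputSketch {
  // SumPooling with output width 1: the three branch outputs add up to one logit.
  def logit(wide: Double, fmCross: Double, deep: Double): Double =
    wide + fmCross + deep

  // Log loss log(1 + exp(-y * z)) for a label y in {-1, +1} (assumed encoding),
  // arranged so that exp never receives a large positive argument.
  def logLoss(label: Int, z: Double): Double = {
    require(label == 1 || label == -1, "label must be +1 or -1")
    val m = -label * z
    if (m > 0) m + math.log1p(math.exp(-m)) else math.log1p(math.exp(m))
  }
}
```

    With a logit of 0 the loss is log 2 for either label, which is a quick sanity check for an untrained model.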

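    2. Running and Performance

    2.1 Json Configuration File

    DeepFM has many parameters, so they are specified through a Json configuration file (for a complete description of the Json configuration file, see the Json documentation). A typical example: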

    {
      "data": {
        "format": "dummy",
        "indexrange": 148,
        "numfield": 13,
        "validateratio": 0.1,
        "sampleratio": 0.2
      },
      "model": {
        "modeltype": "T_DOUBLE_SPARSE_LONGKEY",
        "modelsize": 148
      },
      "train": {
        "epoch": 10,
        "numupdateperepoch": 10,
        "lr": 0.5,
        "decayclass": "StandardDecay",
        "decaybeta": 0.01
      },
      "default_optimizer": "Momentum",
      "layers": [
        {
          "name": "wide",
          "type": "simpleinputlayer",
          "outputdim": 1,
          "transfunc": "identity"
        },
        {
          "name": "embedding",
          "type": "embedding",
          "numfactors": 8,
          "outputdim": 104,
          "optimizer": {
            "type": "momentum",
            "momentum": 0.9,
            "reg2": 0.01
          }
        },
        {
          "name": "fclayer",
          "type": "FCLayer",
          "outputdims": [
            100,
            100,
            1
          ],
          "transfuncs": [
            "relu",
            "relu",
            "identity"
          ],
          "inputlayer": "embedding"
        },
        {
          "name": "biinnersumcross",
          "type": "BiInnerSumCross",
          "inputlayer": "embedding",
          "outputdim": 1
        },
        {
          "name": "sumPooling",
          "type": "SumPooling",
          "outputdim": 1,
          "inputlayers": [
            "wide",
            "biinnersumcross",
            "fclayer"
          ]
        },
        {
          "name": "simplelosslayer",
          "type": "simplelosslayer",
          "lossfunc": "logloss",
          "inputlayer": "sumPooling"
        }
      ]
    }
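
    Some values in this example depend on each other. The embedding `outputdim` of 104 equals `numfield` × `numfactors` (13 × 8), and the last FC output is 1 so that it can be summed with the other width-1 branches. The first relation is inferred from the numbers in the example rather than stated in the text, so treat it as an assumption. The sketch below checks these relations on values copied by hand; `DeepFMConfigCheck` and `Settings` are hypothetical names, not Angel classes, and no Json parsing is attempted.

```scala
object DeepFMConfigCheck {
  // Hypothetical holder for the few values checked here; not an Angel class.
  final case class Settings(numField: Int, numFactors: Int, embeddingOutputDim: Int,
                            fcOutputDims: Seq[Int], fcTransFuncs: Seq[String])

  // Returns human-readable problems; an empty result means all checks passed.
  def problems(s: Settings): Seq[String] = {
    val expected = s.numField * s.numFactors
    val checks = Seq(
      (s.embeddingOutputDim == expected,
        s"embedding outputdim ${s.embeddingOutputDim} != numfield * numfactors = $expected"),
      (s.fcOutputDims.length == s.fcTransFuncs.length,
        "outputdims and transfuncs must have the same length"),
      (s.fcOutputDims.lastOption.contains(1),
        "last FC output must be 1 to be summed with the width-1 branches")
    )
    checks.collect { case (ok, msg) if !ok => msg }
  }
}
```

    For the values of the example (13 fields, 8 factors, `outputdim` 104, FC dimensions 100, 100, 1) the result is empty.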

    2.2 Submit Script

    runner="com.tencent.angel.ml.core.graphsubmit.GraphRunner"
    modelClass="com.tencent.angel.ml.classification.DeepFM"

    $ANGEL_HOME/bin/angel-submit \
        --angel.job.name DeepFM \
        --action.type train \
        --angel.app.submit.class $runner \
        --ml.model.class.name $modelClass \
        --angel.train.data.path $input_path \
        --angel.workergroup.number $workerNumber \
        --angel.worker.memory.gb $workerMemory \
        --angel.ps.number $PSNumber \
        --angel.ps.memory.gb $PSMemory \
        --angel.task.data.storage.level $storageLevel \
        --angel.task.memorystorage.max.gb $taskMemory

    For deep learning models, prefer the Json file for specifying the data, training, and network configuration.