Skip to content

Bug: successor.local_gradient_for_argument(self) could be corrupted #52

@Banyc

Description

@Banyc

Issue

local_gradient_for_argument might read a corrupted float value from the successor.

Say we have the computation graph:

    h
    |
    |
    f
  • $h$ has parameters $w$
  • $f$ has parameters $v$
  • $h : \mathbb{R}^m \to \mathbb{R}$
  • $f : \mathbb{R}^n \to \mathbb{R}$
  • $h$ is the successor of $f$
  • $E$ is the graph
  • $\frac{\partial E}{\partial w} = \frac{\partial{E}}{\partial{h}} \frac{\partial{h}}{\partial{w}}$
  • $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
  • $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
  • when doing backpropagation, the steps will be
    1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$, and $\frac{\partial h}{\partial w}$
    2. $h$ updates $w$ to $w'$
    3. $f$ computes $\frac{\partial E}{\partial f}$ and $\frac{\partial h}{\partial f}$ is cached
      1. $\frac{\partial h}{\partial f}$ is not yet in cache, so $h$ will have to compute it now
      2. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
        • This is the problem!
        • $\frac{\partial h}{\partial f}$ is corrupted
      3. $\frac{\partial h}{\partial f}$ is in cache now
      4. $\frac{\partial E}{\partial f}$ is computed by looking both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in cache
      5. $\frac{\partial E}{\partial f}$ is in cache now
    4. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
    5. $f$ computes $\frac{\partial E}{\partial v}$ with $\frac{\partial f}{\partial v}$ and the corrupted $\frac{\partial h}{\partial f}$
    6. $f$ updates $v$ based on the corrupted $\frac{\partial E}{\partial v}$

Solutions

I can come up with two solutions

  • compute local_gradient ($\frac{\partial h}{\partial f}$) at the beginning of do_gradient_descent_step before parameters ($w$) is modified
  • the successor $h$ distributes local_gradient ($\frac{\partial h}{\partial f}$) and global_gradient ($\frac{\partial E}{\partial h}$) to $f$ before parameters ($w$) is modified

PS

Thank you for your book and the sample code so that I could deeper understand the neural network!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions