RESEARCH & DEVELOPMENT PROCESSORS
One common feature of in-memory
computation engines is that, in
contrast to Sun’s 3D RAM, many
of the candidate technologies sit
on the boundary between
analogue and digital computation. For
start-ups such as Mythic, analogue
processing within the array allows
the rapid processing of data at a
relatively low resolution: typically 8bit
or lower. To support this, data values
are converted to analogue values
using DACs before being fed into the
memory array for processing.
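A minimal model of that analogue multiply-accumulate path helps show where the DACs and ADCs sit. The sketch below assumes idealised components; the function name, conductance values and bit widths are all illustrative rather than any vendor's design.

```python
import numpy as np

def analogue_matvec(G, x, dac_bits=8, adc_bits=8):
    """Idealised analogue multiply-accumulate in a resistive array.

    G: conductance of each cell (row = wordline, column = bitline);
    x: digital input vector that the DACs convert to wordline voltages."""
    v = x / (2**dac_bits - 1)             # DAC: digital code -> voltage
    i = G.T @ v                           # Ohm + Kirchhoff: each bitline sums I = G*V
    lsb = i.max() / (2**adc_bits - 1)     # ADC full scale set to the peak current
    return np.round(i / lsb).astype(int)  # ADC: current -> digital code

# A 3x3 array driven by three DACs and read by three ADCs, as in the figure
G = np.array([[0.2, 0.5, 0.1],
              [0.7, 0.3, 0.9],
              [0.4, 0.8, 0.6]])
x = np.array([12, 200, 77])
print(analogue_matvec(G, x))
```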
The contribution of Das’s group at
Michigan to in-memory computing does
not rely on such heavy customisation
of the interfaces to the memory, an
attribute she says could help it gain
traction. It was also a requirement
from Intel, a sponsor of the work,
in order to keep implementation
costs down. The Neural Cache is
aimed squarely at machine-learning
applications and makes use of
simple logic operations to process
data. Another key difference from other
proposals is that it uses the more
expensive memory cells found in
high-speed caches rather than those
aimed at bulk memory technologies.
In operation, the Neural Cache
borrows a concept that drove
computation inside FPGAs before
they acquired hardwired digital signal
processing (DSP) cores: bit-serial
computation. The advantage of bit-serial
arithmetic is that it can use
simple AND and NOR operations rather
than logic-intensive multipliers. Sense
amplifiers detect the logic state of
pairs of complete wordlines when they
are activated. The main addition to
the memory is a transpose unit that
rearranges data so that the values can
be pumped straight into the array.
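How bit-serial arithmetic maps onto a memory array is easier to see in simulation. The sketch below models it in Python under simplifying assumptions: data is stored transposed, with one bit per wordline and one element per bitline, and each cycle the "sense amplifiers" supply only the AND and NOR of a wordline pair, from which the adder is built. The names and layout are illustrative, not the Neural Cache design itself.

```python
import numpy as np

def bit_serial_add(a_bits, b_bits):
    """Bit-serial addition over transposed data: a_bits and b_bits are
    (n_bits, n_elements) boolean arrays with the least-significant bit
    in row 0, so each column is one element and each row is one wordline.
    Each step uses only the AND and NOR a sense amplifier produces when
    a pair of wordlines is activated; XOR is derived from the two."""
    n_bits, n_elems = a_bits.shape
    out = np.zeros((n_bits + 1, n_elems), dtype=bool)
    carry = np.zeros(n_elems, dtype=bool)
    for i in range(n_bits):                  # one array pass per bit position
        a, b = a_bits[i], b_bits[i]
        and_ab = a & b                       # sense amp: wordline-pair AND
        nor_ab = ~(a | b)                    # sense amp: wordline-pair NOR
        xor_ab = ~and_ab & ~nor_ab           # derived from AND and NOR
        out[i] = xor_ab ^ carry              # sum bit (this XOR is built the same way)
        carry = and_ab | (xor_ab & carry)    # ripple carry
    out[n_bits] = carry
    return out

# 4-bit demo: four element-wise additions run in parallel, one per column
vals_a, vals_b = [3, 5, 9, 14], [6, 7, 2, 1]
A = np.array([[(v >> i) & 1 for v in vals_a] for i in range(4)], dtype=bool)
B = np.array([[(v >> i) & 1 for v in vals_b] for i in range(4)], dtype=bool)
S = bit_serial_add(A, B)
print([sum(int(S[i, j]) << i for i in range(5)) for j in range(4)])  # [9, 12, 11, 15]
```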
Although bit-serial operations
suffer from high latency, Das claims
the Neural Cache can win out against
high-end processors because it can
do many operations in parallel. For
example, an 8KB SRAM can operate
on 256 elements at once because
it has 256 wordlines and 256
bitlines. The group proposes using
something like the last-level cache of
a processor such as Intel’s Haswell,
which contains more than 4000 of
those 8KB SRAM arrays. In doing
so, more than a million elements
could be processed in parallel at
more than 2GHz. “It only adds a 3.5
percent overhead to a Xeon and is
as programmable as a GPU, which
helps make it futureproof for emerging
algorithms,” Das claims.
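The parallelism arithmetic is easy to check. A back-of-envelope calculation using the figures quoted above, and taking 4,096 as an assumed value for "more than 4000" arrays:

```python
bitlines_per_array = 256    # one element per bitline in each 8KB array
arrays_in_llc = 4096        # assumption standing in for "more than 4000"
print(bitlines_per_array * arrays_in_llc)  # 1,048,576 elements in flight
```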
Linear algebra
Although machine learning drives
most of the current crop of in-memory
computing start-ups, Engin Ipek,
associate professor at the University of
Rochester, says other matrix-oriented
algorithms can benefit. Ipek’s work has
focused on linear algebra and, similar
to the cache-based work at Michigan,
avoids the need to handle intermediate
values even on ReRAM arrays.
The practical problem with ReRAM
for analogue computing is that its
signal-to-noise ratio is very low.
“There are no ReRAM cells
today that give you 8bit or even 2bit
resolution reliably. But you don’t
necessarily need more than 1bit.
The nice thing about linear algebra is
that it’s linear,” Ipek says. “You can
split the execution into bit slices and
perform bit-wise matrix calculations.”
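Ipek's point about linearity can be made concrete. The sketch below, with illustrative names and bit widths, splits a small integer matrix and vector into 1-bit planes, performs only binary matrix-vector products of the kind a 1-bit ReRAM array can manage, and recombines the partial results with shifts; linearity guarantees the answer matches an ordinary multiply.

```python
import numpy as np

def bit_sliced_matvec(M, x, m_bits=4, x_bits=4):
    """Bit-sliced matrix-vector multiply on 1-bit cells.

    Because y = Mx is linear, M and x can each be split into binary
    bit planes; each plane pair needs only 1-bit multiplies, and the
    partial results are recombined with shifts."""
    y = np.zeros(M.shape[0], dtype=np.int64)
    for i in range(m_bits):              # bit planes of the matrix
        M_i = (M >> i) & 1               # a 1-bit slice, storable in 1-bit cells
        for j in range(x_bits):          # bit slices of the input vector
            x_j = (x >> j) & 1
            y += (M_i @ x_j) << (i + j)  # binary matvec, weighted by bit position
    return y

M = np.array([[3, 1], [2, 5]])
x = np.array([4, 7])
print(bit_sliced_matvec(M, x), M @ x)    # both give [19 43]
```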
Even with 1bit devices, a potential
problem with novel memories such as
memristive arrays based on ReRAM or
phase-change cells is their error rate.
“We can’t use error-correcting
codes because with that you have
to read then correct then compute.
With this architecture, by the time
you’ve corrected you’ve performed the
calculation,” Ipek says.
To deal with errors, the data itself
is pre-processed using AN-codes,
which make it possible to perform
the addition operations of bit-serial
arithmetic and correct single-bit errors
once the computation is complete.
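A minimal sketch of the AN-code idea, assuming a simple search-based decoder and an illustrative constant A = 29: every value is multiplied by A before it enters the array, additions preserve divisibility by A, and a single-bit error in the result can be located by testing which bit flip restores a valid codeword.

```python
A = 29  # illustrative code constant; odd and not a power of two

def encode(x):
    return A * x

def check_and_decode(r):
    """A valid result of computing on encoded values is divisible by A."""
    if r % A == 0:
        return r // A
    # Assume a single-bit error: try flipping each bit of the result back
    for b in range(r.bit_length() + 1):
        c = r ^ (1 << b)
        if c % A == 0:
            return c // A
    raise ValueError("uncorrectable error")

# Addition commutes with the encoding: A*x + A*y == A*(x + y)
r = encode(5) + encode(11)
assert check_and_decode(r) == 16
# Inject a single-bit fault into the result and still recover the answer
assert check_and_decode(r ^ (1 << 3)) == 16
```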
Although the technique is focused
on linear algebra, Ipek says it could be
extended to the sparse-matrix problems
found in many computing workloads,
including machine learning. “Many
matrices in real-world applications are
sparse,” Ipek says, noting that for
compatible, highly parallelised
algorithms the approach consumes
a tenth of the power of a GPU.
The focus of these architectures
remains one where data needs to be
imported into an accelerator – one
that happens to rely on a memory
architecture to perform calculations –
rather than running the application in
situ. Shekhar Borkar, senior director
of technology at Qualcomm, says this
may prove to be the weakness of in-memory
architectures. Data-intensive
applications need to work with such
large quantities of memory that the
cost of adding computation to the
entire array becomes prohibitive.
Unless the data has to move
through the array on its way
somewhere, which happens with
storage devices and caches, the idea
of keeping data in one place and
operating on it may prove to be short-lived.
Devices that simply use memory
as a processor for streaming data
may have more success as long as
workloads continue to favour highly
parallelised single operations.
At the DAC panel, a Facebook
research scientist summed the
situation up: “These techniques have
promise but they are still coming from
behind.”
“The biggest advantage of compute in memory is bandwidth, not so much energy in my opinion.”
Reetuparna Das
Below: Data values are converted to analogue values using DACs before being fed into the memory array for processing [figure: a 3×3 resistive array with DACs driving the rows and ADCs reading the column outputs]