Event-driven visual-tactile sensing and learning for robots
12257727 ยท 2025-03-25
Assignee
Inventors
- Chee Keong TEE (Singapore, SG)
- Hian Hian SEE (Singapore, SG)
- Brian LIM (Singapore, SG)
- Soon Hong Harold SOH (Singapore, SG)
- Tasbolat TAUNYAZOV (Singapore, SG)
- Weicong SNG (Singapore, SG)
- Sheng Yuan Jethro KUAN (Singapore, SG)
- Abdul Fatir ANSARI (Singapore, IN)
Cpc classification
B25J9/1694
PERFORMING OPERATIONS; TRANSPORTING
B25J9/161
PERFORMING OPERATIONS; TRANSPORTING
G01L1/18
PHYSICS
G06N3/049
PHYSICS
International classification
B25J9/00
PERFORMING OPERATIONS; TRANSPORTING
B25J13/08
PERFORMING OPERATIONS; TRANSPORTING
Abstract
A classifying sensing system, a classifying method performed using a sensing system, a tactile sensor, and a method of fabricating a tactile sensor. The classifying sensing system comprises a first spiking neural network, SNN, encoder configured for encoding an event-based output of a vision sensor into individual vision modality spiking representations with a first output size; a second SNN encoder configured for encoding an event-based output of a tactile sensor into individual tactile modality spiking representations with a second output size; a combination layer configured for merging the vision modality spiking representations and the tactile modality spiking representations; and a task SNN configured to receive the merged vision modality spiking representations and tactile modality spiking representations and output vision-tactile modality spiking representations with a third output size for classification.
Claims
1. A classifying sensing system comprising: a first spiking neural network, SNN, encoder configured for encoding an event-based output of a vision sensor into individual vision modality spiking representations with a first output size; a second SNN encoder configured for encoding an event-based output of a tactile sensor into individual tactile modality spiking representations with a second output size; a combination layer configured for merging the vision modality spiking representations and the tactile modality spiking representations; and a task SNN configured to receive the merged vision modality spiking representations and tactile modality spiking representations and output vision-tactile modality spiking representations with a third output size for classification.
2. The system of claim 1, wherein the task SNN is configured for classification based on a spike-count loss in the respective output vision/tactile modality representations compared to a desired spike count indexed by the output size.
3. The system of claim 1, wherein the task SNN is configured for classification based on a weighted spike-count loss in the respective output vision/tactile modality representations compared to a desired weighted spike count indexed by the output size.
4. The system of claim 1, wherein neurons in each of the first SNN encoder, the second SNN encoder, and the task SNN are configured for applying a Spike response Model, SRM.
5. The system of claim 1, comprising the tactile sensor.
6. The system of claim 5, wherein the tactile sensor comprises an event-based tactile sensor.
7. The system of claim 1, comprising the vision sensor.
8. The system of claim 1, comprising a robot arm and end-effector.
9. The system of claim 8, wherein the end-effector comprises a gripper.
10. A classifying method performed using a sensing system, the method comprising the steps of: encoding, using a first spiking neural network, SNN, encoder an event-based output of a vision sensor into individual vision modality spiking representations with a first output size; encoding, using a second SNN encoder, an event-based output of a tactile sensor into individual tactile modality spiking representations with a second output size; merging, using a combination layer, the vision modality spiking representations and the tactile modality spiking representations; and using a task SNN to receive the merged vision modality spiking representations and tactile modality spiking representations and to output vision-tactile modality spiking representations with a third output size for classification.
11. The method of claim 10, wherein the task SNN is configured for classification based on a spike-count loss in the respective output vision/tactile modality representations compared to a desired spike count indexed by the output size.
12. The method of claim 11, wherein the task SNN is configured for classification based on a weighted spike-count loss in the respective output vision/tactile modality representations compared to a desired weighted spike count indexed by the output size.
13. The system of claim 10, wherein each of the first SNN encoder, the second SNN encoder, and the task SNN is configured for applying a Spike response Model, SRM.
14. The system of claim 10, wherein the tactile sensor comprises an event-based tactile sensor.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
(23)
(24)
(25)
(26)
(27)
DETAILED DESCRIPTION
(28) Embodiments of the present invention provide crucial steps towards efficient visual-tactile perception for asynchronous and event-driven robotic systems. In contrast to resource-hungry deep learning methods, event driven perception forms an alternative approach that promises power-efficiency and low-latencyfeatures that are ideal for real-time mobile robots. However, event-driven systems remain under-developed relative to standard synchronous perception methods [4], [5].
(29) To enable richer tactile sensing, a 39-taxel fingertip sensor is provided, according to an example embodiment, referred to herein as NeuTouch. Compared to existing commercially-available tactile sensors, NeuTouch's neuromorphic design enables scaling to a larger number of taxels while retaining low latencies.
(30) Multi-modal learning with NeuTouch and the Prophesee event camera are investigated, according to example embodiments. Specifically, a visual-tactile spiking neural network (VT-SNN) is provided that incorporates both sensory modalities for supervised-learning tasks.
(31) Different from conventional deep artificial neural network (ANN) models [6], SNNs process discrete spikes asynchronously and thus, are arguably better suited to the event data generated by the neuromorphic sensors according to example embodiments. In addition, SNNs can be used on efficient low-power neuromorphic chips such as the Intel Loihi [7].
(32) It is noted that in example embodiments, other event-based tactile sensors may be used. Also, the tactile sensor may comprise a converter for converting an intrinsic output of the tactile sensor into the event-based output of the tactile sensor.
(33) Similarly, it is noted that in example embodiment, other event-based vision sensors may be used. Also, the vision sensor may comprise a converter for converting an intrinsic output of the vision sensor into the event-based output of the vision sensor.
(34) Experiments performed according to example embodiments center on two robot tasks: object classification and (rotational) slip detection. In the former, the robot was tasked to determine the type of container being handled and the amount of liquid held within. The containers were opaque with differing stiffness, and hence, both visual and tactile sensing are relevant for accurate classification. It is shown that relatively small differences in weight (30 g across 20 object-weight classes) can be distinguished by the prototype sensors and spiking models according to example embodiments. Likewise, the slip detection experiment indicates rotational slip can be accurately detected within 0.08 s (visual-tactile spikes processed every 1 ms). In both experiments, SNNs achieved competitive (and sometimes superior) performance relative to ANNs with similar architecture.
(35) Taking a broader perspective, event-driven perception according to example embodiments represents an exciting opportunity to enable power-efficient intelligent robots. An end-to-end event-driven perception framework can be provided according to example embodiments.
(36) NeuTouch according to an example embodiment provides a scalable event-based tactile sensor for robot end-effectors.
(37) A Visual-Tactile Spiking Neural Network according to an example embodiment leverages multiple event sensor modalities.
(38) Systematic experiments demonstrate the effectiveness of an event-driven perception system according to example embodiments on object classification and slip detection, with comparisons to conventional ANN methods.
(39) Visual-tactile event sensor datasets comprising more than 50 different object classes across the experiments using example embodiments were obtained, which also includes RGB images and proprioceptive data from the robot.
(40) Neutouch: An Event-Based Tactile Sensor According to an Example Embodiment
(41) Although there are numerous applications for tactile sensors (e.g., minimal invasive surgery [38] and smart prosthetics [39]), current tactile sensing technology lags behind vision. In particular, current tactile sensors remain difficult to scale and integrate with robot platforms. The reasons are twofold: first, many tactile sensors are interfaced via time-divisional multiple access (TDMA), where individual taxel electrodes, hereafter also referred to as taxels are periodically and sequentially sampled. The serial readout nature of TDMA inherently leads to an increase of readout latency as the number of taxels in the sensor is increased. Second, high spatial localization accuracy is typically achieved by adding more taxels in the sensor; this invariably leads to more wiring, which complicates integration of the skin onto robot end-effectors and surfaces.
(42) Motivated by the limitations of the existing tactile sensing technology, a Neuro-inspired Tactile sensor 100 (NeuTouch) is provided according to example embodiments, for use on robot end-effectors (see
(43) Specifically,
(44) Tactile sensing is achieved via the electrode layer 106 folded around the bone 112 such that the array of electrodes with 39 taxels e.g. 104 are on the top of the bone 112 with the graphene-based piezoresistive thin film 108 covering the 39 taxels e.g. 104. The graphene-based piezoresistive thin film 108 functions as a pressure transducer forming an effective tactile sensor [40], [41] due to its high Young's modulus, which helps to reduce the transducer's hysteresis and response time. The radial arrangement of the taxels e.g. 106 on NeuTouch 100 is designed such that the taxel density is varied from high-to-low; from the center to the periphery of the top touch surface of the NeuTouch 100 sensor. The initial point-of-contact between the object and sensor is located at the central region of NeuTouch 100 where the taxel e.g. 106 density is the highest, as such the rich spatio-temporal tactile data of the initial contact (between the object and sensor) can be captured. This rich tactile information can help algorithms to accelerate inference (e.g., early classification as will be described in more detail below).
(45)
(46) The 3D-printed bone component 112 was employed to serve the role of the fingertip bone, and Ecoflex 00-30 (Ecoflex) 110 was employed to emulate skin for NeuTouch 100. The Ecoflex 110 offers protection for the electrodes/taxels e.g. 104 for a longer use-life and amplifies the stimuli exerted on NeuTouch 100. The latter enables more tactile features to be collected, since the transient phase of contact (between object and sensor) encodes much of the physical description of a grasped object, such as stiffness or surface roughness [42]. The NeuTouch 100 exhibits a slight delay of 300 ms when recovering from a deformation due to the soft nature of Ecoflex 110. Nevertheless, the experiments described below showed this effect did not impede the NeuTouch's 100 sensitivity to various tactile stimuli.
(47) Compared to existing tactile sensors, NeuTouch 100 is event-based and scales well with the number of taxels; NeuTouch 100 can accommodate 240 taxels according to a non-limiting example embodiment while maintaining an exceptionally low constant readout latency of 1 ms for rapid tactile perception [43]. This is achieved according to example embodiments by leveraging upon the Asynchronously Coded Electronic Skin (ACES) platform [43]an event-based neuro-mimetic architecture that enables asynchronous transmission of tactile information. With ACES, the taxels e.g. 104 of NeuTouch 100 mimic the function of the fast-adapting (FA) mechano-receptors of a human fingertip, which capture dynamic pressure (i.e., dynamic skin deformations) [44]. FA responses are crucial for dexterous manipulation tasks that require rapid detection of object slippage, object hardness, and local curvature.
(48) Various suitable materials may be used for the fabrication of NeuTouch 100 according to example embodiments, including, but not limited to:
(49) Skin layer: Ecoflex Series (Smooth-On), Polydimethylsiloxane (PDMS), Dragon Skin Series (Smooth-on), Silicone Rubbers.
(50) Transducer layer (Piezoresistive): Velostat (3M), Linqstat Series (Caplinq), Conductive Foam Sheet (e.g., Laird Technologies EMI), Conductive Fabric/textile (e.g., 3M), any piezoresistive material.
(51) Electrode layer: Flexible printed circuit boards (Flex PCBs) of different thickness. Material: Polyimide Electrode lines: Metallic layers of traces, e.g. copper. Any conductive metal (e.g. silver) Taxels: Copper, any conductive metal (e.g. silver)
Asynchronous Transmission of Tactile Stimuli According to Example Embodiments
(52) Compared to existing tactile sensors, NeuTouch 100 is event-based and scales well with the number of taxels e.g. 104, and can maintain an exceptionally low constant readout latency of 1 ms for rapid tactile perception. This is achieved according to an example embodiment by leveraging upon the Asynchronously Coded Electronic Skin (ACES) platform [50]an event-based neuro-mimetic architecture that enables asynchronous transmission of tactile information. It was developed to address the increasing complexity and need for transferring a large array of skin-like transducer inputs while maintaining a high level of responsiveness (i.e., low latency).
(53) With ACES, the taxels e.g. 104 of NeuTouch 100 mimic the function of the fast-adapting (FA) mechano-receptors of a human fingertip, which capture dynamic pressure (i.e., dynamic skin deformations). Transmission of the tactile stimuli information is in the form of asynchronous spikes (i.e., electrical pulses), similar to biological systems; data is transmitted by individual taxels e.g. 104 only when necessary via single common conductor for signalling. This is made possible by encoding the taxels e.g. 104 of NeuTouch 100 with unique electrical pulse signatures. These signatures are robust to overlap and permit multiple taxels e.g. 104 to transmit data without specific time synchronization (see
(54) In an example embodiment, each taxel e.g. 104 connects, via electrode lines e.g. 105, to an encoder, (e.g., if there are 39 taxels, there will be 39 encoders). The signal outputs of the encoders are combined into one common output conductor for data transmission to a decoder. The decoder will then decode the combined pulse (spike) signature to identify the activated taxels.
(55) Real-time decoding of the tactile information (acquired by NeuTouch 100) is done via a Field Programmable Gated Array (FPGA) according to an example embodiment. The event-based tactile information can be easily accessed through Universal Asynchronous Receiver/Transmitter (UART) readout to a PC, according to an example embodiment.
(56) For more information on asynchronous transmission of tactile stimuli for event based tactile sensors suitable for us in example embodiments, reference is made to WO 2019/112516.
(57) Details of how the decoded tactile event data is used for learning and classification according to example embodiments will be described below.
(58) Visual-Tactile Spiking Neural Network (VT-SNN) According to Example Embodiments
(59) As mentioned above, the successful completion of many tasks is contingent upon using multiple sensory modalities. In example embodiments, the focus is on touch and sight, i.e., tactile and visual data from NeuTouch 100 and an event-based camera, respectively, are fused via a spiking neural model. This Visual-Tactile Spiking Neural Network (VT-SNN) enables learning and perception using both these modalities, and can be easily extended to incorporate other event sensors according to different example embodiments.
(60) Model Architecture According to Example Embodiments.
(61) From a bird's-eye perspective, the VT-SNN 200 according to example embodiments employs a simple architecture (see
(62) In the following, details of the precise network structures used in one example embodiment will be described, but VT-SNN may use alternative network structures for the Tactile, Vision and Task SNNs, according to different example embodiments. The Tactile SNN 208 employs a fully connected (FC) network consisting of 2 dense spiking layers (it is noted that in preliminary experiments, convolutional layers were also tested according to other example embodiments, but it resulted in poorer performance). It has an input size of 156 (two fingers, each with the 39 taxels with a positive and negative polarity channel per taxel) and a hidden layer size of 32. The input into the Tactile SNN 208 is obtained via the signature decoder described above with reference to
(63) Neuron Model According to Example Embodiments
(64) The Spike Response Model (SRM) [30], [45] was used in example embodiments. In the SRM, spikes are generated whenever a neuron's internal state (membrane potential) u(t) exceeds a predefined threshold . Each neuron's internal state is affected by incoming spikes and a refractory response:
u(t)=w.sub.i(*s.sub.i)(t)+(v+o)(t)(1)
(65) where w.sub.i is a synaptic weight, * indicates convolution, s.sub.i(t) are the incoming spikes from input i, () is the response kernel, v() is the refractory kernel, and o(t) is the neuron's output spike train 206. In words, incoming spikes s.sub.i(t) are convolved with a response kernel () to yield a spike response signal that is scaled by a synaptic weight w.sub.i. That is, and with reference again to
(66) Model Training According to Example Embodiments
(67) The spiking networks were optimized using SLAYER [30] in example embodiments. As mentioned above, the derivative of a spike is undefined, which prohibits a direct application of backpropagation to SNNs. SLAYER overcomes this problem by using a stochastic spiking neuron approximation to derive an approximate gradient, and a temporal credit assignment policy to distribute errors. SLAYER trains models offline on GPU hardware. Hence, the spiking data needs to be binned into fixed-width intervals during the training process, but the resultant SNN model can be run on neuromorphic hardware. A straight-forward binning process was used in an example embodiment where the (binary) value for each bin window V.sub.w was 1 whenever the total spike count in that window V.sub.w exceeded a threshold value S.sub.min:
(68)
(69) Following [30], class prediction is determined by the number of spikes in the output layer spike train; each output neuron is associated with a specific class and the neuron that generates the most spikes represents the winning class. The model was trained in an example embodiment by minimizing the loss:
(70)
(71) which captures the difference between the observed output spike count .sub.t=0.sup.Ts(t) and the desired spike count
(72)
for output neuron o (indexed by n).
(73) A generalization of the spike-count loss in equation (3) is introduced to incorporate temporal weighting:
(74)
(75) is referred to as the weighted spike-count loss. In the experiments, (t) is set to be monotonically decreasing, which encourages early classification by down-weighting later spikes. Specifically, a simple quadratic function is used, (t)=t.sup.2+ with 3<0, but other forms may be used in different example embodiments. For both
and
, appropriate counts are specified for the correct and incorrect classes and are task-specific hyperparameters. The hyperparameters were tuned manually and it was found that setting the positive class count to 50% of the maximum number of spikes (across each input within the considered time interval) worked well. In initial trials, it was observed that training solely with the losses above led to rapid over-fitting and poor performance on a validation set. Several techniques to mitigate this issue were explored (e.g.,
.sub.1 regularization and dropout), and it was found that simple l.sub.2 regularization led to the best results.
(76) Robot and Sensors Setup According to Example Embodiments
(77)
(78) Neutouch Tactile Sensor According to an Example Embodiment
(79) Two NeuTouch sensors 304, 306 were mounted to the Robotiq 2F-140 gripper 302 and the ACES decoder 316 was mounted on the Panda arm 300 (
(80) Prophesee Event Camera According to an Example Embodiment.
(81) Event-based vision data was captured using the Prophesee Onboard (https://www.prophesee.ai) 308. Similar to the tactile sensor, each camera pixel fires asynchronously and a positive (negative) spike is obtained when there is an increase (decrease) in luminosity. The Prophesee Onboard 308 was mounted on the arm 300 and pointed towards the gripper 302 to obtain information about the object of interest (
(82) TABLE-US-00001 TABLE 1 (Prophesee Biases) Bias Value Remarks bias_fo 1775 Pixel low-pass cut-off frequency bias_hpf 1800 Pixel high-pass cut-off frequency bias_pr 1550 Controls photo-receptor bias_diff_on 435 Sensitivity to positive change in luminosity bias_diff_off 198 Sensitivity to negative change in lummosity bias_refr 1500 Pixel refractory period
RGB Cameras According to an Example Embodiment
(83) Two Intel RealSense D435s RGB cameras 310, 312 were used to provide additional non-event image data (The infrared emitters were disabled as they increased noise for the event camera and hence, no depth data was recorded). The first camera 310 was mounted on the end-effector with the camera 310 pointed towards the gripper 302 (providing a view of the grasped object), and the second camera 312 was placed to provide a view of the scene. The RGB images were used for visualization and validation purposes, but not as input to the models; integration of these standard sensors to provide even better model performance can be provided according to different example embodiments
(84) OptiTrack According to an Example Embodiment
(85) The OptiTrack motion capture system 314 was used to collect object movement data for the slip detection experiment. 6 reflective markers were attached on the rigid parts of the end-effector and 14 markers on the object of interest. Eleven OptiTrack Prime 13 cameras were placed strategically around the experimental area to minimize tracking error (see e.g. 316, 318 in
(86) 3D-Printed Parts for Use in an Example Embodiment
(87) In an example embodiment, the visual-tactile sensor components are mounted to the robot via 3D printed parts. There are three main 3D printed parts in an example embodiment; a main holder (
(88) Specifically, in
(89) With reference to
(90) Further Details According to an Example Embodiment.
(91) In addition to the above sensors, proprioceptive data was also collected for the Panda arm 300 and Robotiq gripper 302; these were not currently used in the models but can be included in different example embodiments.
(92) Minimizing phase shift is critical, so that machine learning models can learn meaningful interactions between the different modalities. The setup according to an example embodiment spanned across multiple machines, each having an individual Real Time Clock (RTC). Chronyd was used to sync the various clocks to the Google Public NTP pool time servers. During data collection, for each machine, the record-start time is logged according to its own RTC, and thus it was possible to retrieve differences between the different RTCs and sync them accordingly during data pre-processing.
(93) In the data collection procedure, rotational slip typically happened in the middle of a recording. In order to extract the relevant portion of the data when slip occurred, the slip onset was first detected and annotated. OptiTrack markers were attached on Panda's end-effector and the object, such that the OptiTrack was able to determine their poses.
(94)
(95) It was checked when p.sub.z departed the empirical noise distribution within when the robot arm was stationary.
(96) For object orientation, the change in angle
(97) from at rest was calculated using
.sub.t=cos.sup.1(2q.sub.0,q.sub.t
.sup.21)
(98) where q.sub.0 is the quaternion orientation at rest. Similarly, the frame f.sub.slip when the object first rotates was annotated using the following heuristic:
(99)
(100) It was found that the time it took for the object to rotate upon lifting was on average 0.03 seconds across all of the slipping data points.
(101)
(102) I. Container & Weight Classification According to Example Embodiments
(103) A first experiment applies the event-driven perception frameworkcomprising NeuTouch, the Onboard camera, and the VT-SNN according to example embodimentsto classify containers with varying amounts of liquid. The primary goal was to determine if the multi-modal system according to example embodiments was effective at detecting differences in objects that were difficult to isolate using a single sensor. It is noted that the objective was not to derive the best possible classifier; indeed, the experiment did not include proprioceptive data which would likely have improved results [11], nor conduct an exhaustive (and computationally expensive) search for the best architecture. Rather, the experiments were designed to study the potential benefits of using both visual and tactile spiking data in a reasonable setup, according to example embodiments.
(104) I.1. Methods and Procedure According to Example Embodiments
(105) I.1.1. Objects Used According to Example Embodiments
(106) Four different containers were used: an aluminium coffee can, a plastic Pepsi bottle, a cardboard soy milk carton and a metal tuna can (see
(107) I.1.2. Robot Motion According to Example Embodiments
(108) The robot would grasp and lift each object class fifteen times, yielding 15 samples per class. Trajectories for each part of the motion was computed using the Movelt Cartesian Pose Controller [47]. Briefly, the robot gripper was initialized 10 cm above each object's designated grasp point. The end-effector was then moved to the grasp position (2 seconds) and the gripper was closed using the Robotiq grasp controller with a force setting of 1 (4 seconds). The gripper then lifted the object by 5 cm (2 seconds) and held it for 0.5 seconds.
(109) I.1.3. Data Pre-Processing According to Example Embodiments
(110) For both modalities, data from the grasping, lifting and holding phases (corresponding to the 2.0 s to 8.5 s window in
(111) I.1.4. Classification Models, Including VT-SNN According to an Example Embodiment
(112) The SNNs were compared against conventional deep learning, specifically Multi-layer Perceptrons (MLPs) with Gated Recurrent Units (GRUs) [48] and 3D convolutional neural networks (CNN-3D) [51]. Each model was trained using (i) the tactile data only, (ii) the visual data only, and (iii) the combined visual-tactile data, noting that the SNN model on the combined data corresponds to the VT-SNN according to an example embodiment. When training on a single modality, Visual or Tactile SNN were used as appropriate. All the models were implemented using PyTorch. The SNNs were trained with SLAYER to minimize spike count differences [30] and the ANNs were trained to minimize the cross-entropy loss using RMSProp. All models were trained for 500 epochs.
(113) I.2. Results and Analysis
(114) I.2.1. Model Comparisons, Including VT-SNN According to an Example Embodiment
(115) The test accuracies of the models are summarized in Table 2. The tactile only modality SNN gives 12% higher accuracy than the vision only modality. The multimodal VT-SNN model according to an example embodiment achieves the highest score of 81%, an improvement of over 11% compared to the tactile modality variant. It is noted that a closer examination of the vision only modality data showed that (i) the Pepsi bottle was not fully opaque and the water level was observable by Onboard on some trials, and (ii) the Onboard was able to see object deformations as the gripper closed, which revealed the fullness of the softer containers. Hence, the vision only modality results were better than anticipated.
(116) TABLE-US-00002 TABLE 2 Model Tactile Vision Combined SNN ( ) 0.71 (0.045) 0.73 (0.064) 0.81 (0.039) SNN (
) 0.71 (0.023) 0.72 (0.065) 0.80 (0.048) ANN (MLP-GRU) 0.50 (0.059) 0.43 (0.054) 0.44 (0.062) ANN (CNN-3D) 0.75 (0.061) 0.68 (0.022) 0.80 (0.041)
(117)
(118) Referring again to Table I, the SNN models performed far better than the ANN (MLP-GRU) models, particularly for the combined visual-tactile data. The poor performance was possibly due to the relatively long sample durations (325 time-steps) and the large number of parameters in the ANN models, relative to the size of the dataset.
(119) I.2.2. Early Classification, Including VT-SNN According to an Example Embodiment
(120) Instead of waiting for all the output spikes to accumulate, early classification can be performed based on the number of spikes seen up to time t.
(121) In and
.sub. have similar final accuracies, it can be seen from
variant 700b has a similar early accuracy profile as vision 702a, b, but achieves better performance as tactile information is accumulated for times beyond 2 s.
(122) II. Rotational Slip Classification According to Example Embodiments
(123) In this second experiment, the perception system according to example embodiments was used to classify rotational slip, which is important for stable grasping; stable grasp points can be incorrectly predicted for objects with center-of-mass that are not easily determined by sight, e.g., a hammer and other irregularly-shaped items. Accurate detection of rotational slip will allow the controller to re-grasp the object and remedy poor initial grasp locations. However, to be effective, slip detection needs to be performed accurately and rapidly.
(124) II.1. Method and Procedure According to Example Embodiments
(125) II.1.1. Objects Used According to Example Embodiments
(126) The test object was constructed using Lego Duplo blocks (see
(127) II.1.2. Robot Motion According to Example Embodiments
(128) The robot would grasp and lift both object variants 50 times, yielding 50 samples per class. Similar to the previous experiment, motion trajectories were computed using the MoveIt Cartesian Pose Controller [47]. The robot was instructed to close upon the object, lift by 10 cm off the table (in 0.75 seconds) and hold it for an additional 4.25 seconds. We tuned the gripper's grasping force to enable the object to be lifted, yet allow for rotational slip for the off-center object (see
(129) II.1.3. Data Preprocessing According to Example Embodiments
(130) Instead of training the models across the entire movement period, a short time period was extracted in the lifting stage. The exact start time was obtained by analyzing the OptiTrack data; specifically, the baseline orientation distribution (for 1 second or 120 frames) was obtained and rotational slip was defined as an orientation larger (or smaller) than 98% of the baseline frames lasting more than four consecutive OptiTrack frames. It was found that slip occurred almost immediately during the lifting. Since the interest was in rapid detection, a 0.15 s window was extracted around the start of the lift, and a bin duration of 0.001 s (150 bins) with binning threshold S.sub.min=1 were set. Again, stratified K-folds was used to obtain 5 splits, where each split contained 80 training examples and 20 testing examples.
(131) II.1.4. Classification Models, Including VT-SNN According to an Example Embodiment
(132) The model setup and optimization procedure are identical to those of the previous task/experiment, with three slight modifications. First, the output size is reduced to 2 for the binary labels. Second, the sequence length for the ANN GRUs was set to 150, the number of time bins. Third, the SNN's desired true and false spike counts were set to 80 and 5, respectively. Again, SNN and ANN models were compared using (i) the tactile data only, (ii) the visual data only, and (iii) the combined visual-tactile data, including the VT-SNN according to an example embodiment.
(133) II.2. Results and Analysis
(134) II.2.1. Model Comparisons, Including VT-SNN According to an Example Embodiment
(135) The test accuracies of the models are summarized in Table 3. For both the SNN and the ANN, both the vision and multi-modal models achieve 100% accuracy. This suggests that vision data is highly indicative of slippage, which is unsurprising as rotational slip produces a visually distinctive signature. Using only tactile events, the SNN and the MLP-GRU achieve 91% (with L_w) and 87% accuracy, respectively.
(136) TABLE-US-00003 TABLE 3

Model                                   Tactile        Vision         Combined
SNN (spike count loss)                  0.82 (0.045)   1.00 (0.000)   1.00 (0.000)
SNN (weighted spike count loss, L_w)    0.91 (0.020)   1.00 (0.000)   1.00 (0.000)
ANN (MLP-GRU)                           0.87 (0.059)   1.00 (0.000)   1.00 (0.000)
ANN (CNN-3D)                            0.44 (0.086)   0.55 (0.100)   0.77 (0.117)
(137) II.2.2. Early Slip Detection, Including VT-SNN According to an Example Embodiment
(138) Similar to the previous analysis on early container classification,
(139) For all SNNs, models trained with the weighted spike count loss 900b, 902b, 904b achieve better early classification than those trained with the spike count loss 900a, 902a, 904a; notably, the early classification accuracy of the VT-SNN with weighted spike count loss 900b is essentially the same as that of the tactile-based classification with weighted spike count loss 902b.
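The notion of early classification used here, i.e. reading out a prediction from the output spikes accumulated up to each timestep, can be sketched as a running argmax over cumulative spike counts. The function below and its toy spike trains are illustrative, not the original evaluation code.

```python
import numpy as np

def early_predictions(out_spikes):
    """out_spikes: (n_classes, n_timesteps) binary output-spike trains.
    Returns the predicted class at every timestep, taken as the argmax
    of the cumulative spike count observed so far."""
    cum = np.cumsum(out_spikes, axis=1)
    return np.argmax(cum, axis=0)

# Toy example: class 1 starts spiking densely after timestep 3,
# so the running prediction switches to class 1 once its count leads.
out = np.array([[1, 0, 0, 0, 0, 0],
                [0, 0, 0, 1, 1, 1]])
print(early_predictions(out))  # [0 0 0 0 1 1]
```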
(140) III. Speed and Power Efficiency According to Example Embodiments
(141) The inference speed and energy utilization of the classification model (using the VT-SNN with spike-count loss according to an example embodiment, noting that weighted spike count loss should not affect the power consumption) on both a GPU (Nvidia GeForce RTX 2080 Ti) and the Intel Loihi were compared.
(142) Specifically, the multi-modal VT-SNN was trained using the SLAYER framework, such that it ran identically on both the Loihi and, via simulation, on the GPU. The model is identical to that described in the previous sections except for two changes: 1) the Loihi neuron model is used in place of the SRM neuron model; 2) the polarity of the vision output is discarded to reduce the vision input size so that it fits into a single core on the Loihi.
(143) Both models attain 100% test accuracy and produce identical results on the Loihi and the GPU. All benchmarks were obtained using NxSDK version 0.9.5 on a Nahuku 32 board for the Loihi, and on an Nvidia RTX 2080 Ti GPU, respectively.
(144) The model is tasked to perform 1000 forward passes, with a batch size of 1 on the GPU. The dataset of 1000 samples is obtained by repeating samples from the test set. Each sample consists of 0.15 s of spike data, binned every 1 ms into 150 timesteps.
(145) Latency measurement: on the GPU, the system clock on the CPU was used to capture the start time (t_start) and end time (t_end) of model inference; on the Loihi, the system clock on the superhost was used. The latency per timestep is computed as (t_end − t_start)/(1000 × 150), dividing across the 1000 samples, each with 150 timesteps.
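The per-timestep latency computation can be sketched as below; `run_inference` is a stand-in for the actual model call (an assumption, not a name from the text), and `time.perf_counter` is used in place of the unspecified system clock.

```python
import time

def latency_per_timestep(run_inference, n_samples=1000, n_timesteps=150):
    """Time n_samples forward passes and report the mean latency per
    simulated timestep, mirroring (t_end - t_start) / (1000 * 150)."""
    t_start = time.perf_counter()
    for _ in range(n_samples):
        run_inference()
    t_end = time.perf_counter()
    return (t_end - t_start) / (n_samples * n_timesteps)

# Usage with a trivial stand-in workload (real code would call the model):
lat = latency_per_timestep(lambda: sum(range(100)), n_samples=10)
```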
(146) Power Utilization Measurement: To obtain power utilization on the GPU, the approach in [52] was followed, using the NVIDIA System Management Interface (nvidia-smi) to log (timestamp, power draw) pairs at 200 ms intervals. The power draw during the time under load was extracted and averaged to obtain the average power draw under load. To obtain the idle power draw of the GPU, power usage was logged for 15 minutes with no processes running on the GPU, and the power draw was averaged over that period. The performance profiling tools available within NxSDK 0.9.5 were used to obtain the power utilization of the VT-SNN on the Loihi. The model according to an example embodiment is small and occupies less than 1 chip on the 32-chip Nahuku 32 board. To obtain more accurate power measurements, the workload was replicated 32 times and the results reported per copy. The replicated workload occupies 594 neuromorphic cores and 586 cores, with 624 neuromorphic cores powered for barrier synchronization.
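Averaging the logged (timestamp, power draw) samples over the load window, as described above, might look like the following. The log values and window bounds are illustrative; in practice the pairs would come from a tool such as `nvidia-smi --query-gpu=power.draw --format=csv`.

```python
def average_power(samples, load_start, load_end):
    """samples: iterable of (timestamp_s, power_w) pairs logged at a
    fixed interval. Returns the mean power draw over the
    [load_start, load_end] span (inclusive)."""
    in_window = [p for t, p in samples if load_start <= t <= load_end]
    if not in_window:
        raise ValueError("no samples inside the measurement window")
    return sum(in_window) / len(in_window)

# Toy log: idle at the edges, load in the middle of the trace
log = [(0.0, 60.0), (0.2, 220.0), (0.4, 230.0), (0.6, 225.0), (0.8, 61.0)]
avg = average_power(log, load_start=0.2, load_end=0.6)  # 225.0 W under load
```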
(147) To simulate a real-world setting (where data arrives in an online sequential manner): 1) the x86 cores are artificially slowed down to match the 1 ms timestep duration of the data; 2) an artificial delay of 0.15 s is introduced to the dataset fetch for the GPU, to simulate waiting for the full window of data before inference can be performed.
(148) The benchmark results are shown in Table 4, where latency is the time taken to process 1 timestep. It was observed that the latency on the Loihi is slightly lower, because it is able to perform inference as the spiking data arrives. The power consumption on the Loihi is significantly (approximately 1900 times) lower than on the GPU.
(149) TABLE-US-00004 TABLE 4

Hardware   Latency (μs)   Total Power (mW)
Loihi      1039.9         32.3
GPU        1045.6         61930
(150)
(151) The task SNN 1412 may be configured for classification based on a spike-count loss in the respective output vision/tactile modality representations compared to a desired spike count indexed by the output size. Preferably, the task SNN 1412 is configured for classification based on a weighted spike-count loss in the respective output vision/tactile modality representations compared to a desired weighted spike count indexed by the output size.
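A plain (unweighted) spike-count loss of the kind referred to above can be sketched as below, using the desired true/false spike counts of 80 and 5 from the slip experiment; the function name and array shapes are assumptions, not the original implementation. The weighted variant would additionally apply a time-dependent weight to each bin's contribution before counting.

```python
import numpy as np

def spike_count_loss(out_spikes, label, true_count=80, false_count=5):
    """out_spikes: (n_classes, n_timesteps) binary output spike trains.
    Squared error between each class's total spike count and its
    desired count: true_count for the labelled class, false_count for
    the others (the plain, unweighted variant)."""
    counts = out_spikes.sum(axis=1).astype(float)
    target = np.full(counts.shape, float(false_count))
    target[label] = float(true_count)
    return float(np.sum((counts - target) ** 2))

out = np.zeros((2, 150))
out[1, :80] = 1                        # correct class spikes exactly 80 times
loss = spike_count_loss(out, label=1)  # (0-5)^2 + (80-80)^2 = 25
```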
(152) Neurons in each of the first SNN encoder 1402, the second SNN encoder 1406, and the task SNN 1412 may be configured for applying a Spike Response Model, SRM.
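One common choice of SRM response kernel (the form used in SLAYER-style models) is ε(t) = (t/τ)·e^(1−t/τ), which peaks at t = τ with value 1; a minimal sketch follows, where the time constant and function names are illustrative assumptions, not values from the text.

```python
import math

def srm_kernel(t, tau=0.005):
    """Spike-response kernel eps(t) = (t/tau) * exp(1 - t/tau).
    tau is an illustrative time constant, not a value from the text."""
    if t < 0:
        return 0.0  # causal: no response before the spike
    x = t / tau
    return x * math.exp(1.0 - x)

def membrane(t, spike_times, weights, tau=0.005):
    """Membrane potential at time t: weighted sum of kernel responses
    to all incoming spikes (the core of an SRM neuron, before the
    refractory term and threshold)."""
    return sum(w * srm_kernel(t - ts, tau) for ts, w in zip(spike_times, weights))
```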
(153) The sensor system 1400 may comprise the tactile sensor 1404. Preferably, the tactile sensor 1404 comprises an event-based tactile sensor. Alternatively, the tactile sensor 1404 comprises a converter for converting an intrinsic output of the tactile sensor 1404 into the event-based output of the tactile sensor 1404.
(154) The sensor system 1400 may comprise the vision sensor 1408. Preferably, the vision sensor 1408 comprises an event-based vision sensor. Alternatively, the vision sensor 1408 comprises a converter for converting an intrinsic output of the vision sensor into the event-based output of the vision sensor 1408.
(155) The sensor system 1400 may comprise a robot arm and end-effector. The end-effector may comprise a gripper. Preferably, the tactile sensor 1404 may comprise one tactile element on each finger of the gripper.
(156) The vision sensor 1408 may be mounted on the robot arm or on the end-effector.
(157)
(158) The task SNN may be configured for classification based on a spike-count loss in the respective output vision/tactile modality representations compared to a desired spike count indexed by the output size. Preferably, the task SNN is configured for classification based on a weighted spike-count loss in the respective output vision/tactile modality representations compared to a desired weighted spike count indexed by the output size.
(159) Each of the first SNN encoder, the second SNN encoder, and the task SNN may be configured for applying a Spike Response Model, SRM.
(160) Preferably, the tactile sensor comprises an event-based tactile sensor. Alternatively, the tactile sensor comprises a converter for converting an intrinsic output of the tactile sensor into the event-based output of the tactile sensor.
(161) Preferably, the vision sensor comprises an event-based vision sensor. Alternatively, the vision sensor comprises a converter for converting an intrinsic output of the vision sensor into the event-based output of the vision sensor.
(162) The method may comprise disposing one tactile element of the tactile sensor on each finger of a gripper of a robot arm.
(163) The method may comprise mounting the vision sensor on the robot arm or on the end-effector.
(164)
(165) The taxel electrodes e.g. 1606 of the electrode array may be arranged with a radially varying density around a centre of the electrode array. The density of the taxel electrodes e.g. 1606 may decrease with radial distance from the centre.
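One way such a radially decreasing taxel density could be realized is to place taxels on concentric rings whose per-ring count falls off faster than the circumference grows. The sketch below is purely illustrative; all parameter values and names are assumptions, not from the text.

```python
import math

def taxel_layout(n_rings=4, ring_spacing=2.0, inner_count=16, falloff=0.75):
    """Place taxels on concentric rings around the centre. Each ring's
    count shrinks geometrically (falloff < 1) while its circumference
    grows, so taxel density decreases with radial distance."""
    taxels = [(0.0, 0.0)]  # one taxel at the centre
    for k in range(1, n_rings + 1):
        r = k * ring_spacing
        count = max(4, int(inner_count * falloff ** (k - 1)))
        for i in range(count):
            theta = 2 * math.pi * i / count
            taxels.append((r * math.cos(theta), r * math.sin(theta)))
    return taxels
```

With the defaults this yields ring counts of 16, 12, 9, and 6 moving outward, i.e. progressively sparser coverage away from the centre.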
(166) The tactile sensor may comprise a plurality of encoder elements e.g. 1614 connected to respective ones of the electrode lines e.g. 1608, the encoder elements e.g. 1614 configured to asynchronously transmit tactile information based on the electrical signals in the electrode lines e.g. 1608 via a common output conductor 1616.
(167) The carrier structure 1602 may be configured to be connectable to a robotic gripper.
(168) The electrode layer 1604 and/or the electrode lines e.g. 1608 may be flexible.
(169)
(170) The taxel electrodes of the electrode array may be arranged with a radially varying density around a centre of the electrode array. The density of the taxel electrodes may decrease with radial distance from the centre.
(171) The method may comprise providing a plurality of encoder elements connected to respective ones of the electrode lines, and configuring the encoder elements to asynchronously transmit tactile information based on the electrical signals in the electrode lines via a common output conductor.
(172) The method may comprise configuring the carrier structure to be connectable to a robotic gripper.
(173) The electrode layer and/or the electrode lines may be flexible.
(174) As described above, an event-based perception framework is provided according to example embodiments that combines vision and touch to achieve better performance on two robot tasks. In contrast to conventional synchronous systems, the event-driven framework according to example embodiments can asynchronously process discrete events and, as such, may achieve higher temporal resolution and lower latency, with low power consumption.
(175) NeuTouch, a neuromorphic event tactile sensor according to example embodiments, and VT-SNN, a multi-modal spiking neural network that learns from raw unstructured event data according to example embodiments, have been described. Experimental results on container & weight classification, and rotational slip detection show that combining both modalities according to example embodiments is important for achieving high accuracies.
(176) Embodiments of the present invention can have one or more of the following features and associated benefits/advantages
(177) TABLE-US-00005

Feature: Incorporation of neuromorphic robotic gripping tactile elements with neuromorphic visual inputs
Benefit/Advantage: Fast and efficient capture of object deformation and contact mechanics for effective object grasping tasks. Captures dynamic pressure, which is crucial for dexterous manipulation tasks that require rapid detection of object slippage, object hardness, and local curvature. Captures dynamic visual elements, i.e. object deformation and dynamic changes in the object and environment.

Feature: End-effector gripper designed for robotic grasping with appropriate electrode design and materials
Benefit/Advantage: Grasping arbitrary objects with less slippage. Enhances the speed of robotic control loops.

Feature: Addition/removal of taxels (tactile pixels) in NeuTouch can be done
Benefit/Advantage: Highly scalable.

Feature: Simple wiring
Benefit/Advantage: Tactile information is transmitted via a single common conductor for signalling.

Feature: Flexible form factor
Benefit/Advantage: NeuTouch can be designed to conform to a myriad of 3D shapes and surfaces. It can be easily retrofitted onto a wide range of end-effectors, including anthropomorphic robotic hands.

Feature: Power efficiency
Benefit/Advantage: The NeuTouch and the Prophesee camera have energy use in the mW range. Tested on an experimental neuromorphic chip (the Intel Loihi [7]), the VT-SNN can perform the same number of inferences per second (approximately 300-350) while requiring orders of magnitude less energy per inference compared to standard GPU-based machine learning hardware.
(178) The various functions or processes disclosed herein may be described as data and/or instructions embodied in various computer-readable media, in terms of their behavioral, register transfer, logic component, transistor, layout geometries, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, non-volatile storage media in various forms (e.g., optical, magnetic or semiconductor storage media) and carrier waves that may be used to transfer such formatted data and/or instructions through wireless, optical, or wired signaling media or any combination thereof. Examples of transfers of such formatted data and/or instructions by carrier waves include, but are not limited to, transfers (uploads, downloads, e-mail, etc.) over the internet and/or other computer networks via one or more data transfer protocols (e.g., HTTP, FTP, SMTP, etc.). When received within a computer system via one or more computer-readable media, such data and/or instruction-based expressions of components and/or processes under the system described may be processed by a processing entity (e.g., one or more processors) within the computer system in conjunction with execution of one or more other computer programs.
(179) Aspects of the systems and methods described herein may be implemented as functionality programmed into any of a variety of circuitry, including programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), programmable array logic (PAL) devices, electrically programmable logic and memory devices and standard cell-based devices, as well as application specific integrated circuits (ASICs). Some other possibilities for implementing aspects of the system include: microcontrollers with memory (such as electronically erasable programmable read only memory (EEPROM)), embedded microprocessors, firmware, software, etc. Furthermore, aspects of the system may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. Of course the underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (MOSFET) technologies like complementary metal-oxide semiconductor (CMOS), bipolar technologies like emitter-coupled logic (ECL), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, etc.
(180) The above description of illustrated embodiments of the systems and methods is not intended to be exhaustive or to limit the systems and methods to the precise forms disclosed. While specific embodiments of, and examples for, the systems components and methods are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the systems, components and methods, as those skilled in the relevant art will recognize. The teachings of the systems and methods provided herein can be applied to other processing systems and methods, not only for the systems and methods described above.
(181) It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive. Also, the invention includes any combination of features described for different embodiments, including in the summary section, even if the feature or combination of features is not explicitly specified in the claims or the detailed description of the present embodiments.
(182) In general, in the following claims, the terms used should not be construed to limit the systems and methods to the specific embodiments disclosed in the specification and the claims, but should be construed to include all processing systems that operate under the claims. Accordingly, the systems and methods are not limited by the disclosure, but instead the scope of the systems and methods is to be determined entirely by the claims.
(183) Unless the context clearly requires otherwise, throughout the description and the claims, the words comprise, comprising, and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of including, but not limited to. Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words herein, hereunder, above, below, and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word or is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list.
REFERENCES
(184) [1] A. Billard and D. Kragic, Trends and challenges in robot manipulation, Science, vol. 364, no. 6446, p. eaat8414, 2019. [2] D. Li, X. Chen, M. Becchi, and Z. Zong, Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs, October 2016, pp. 477-484. [3] E. Strubell, A. Ganesh, and A. McCallum, Energy and policy considerations for deep learning in NLP, in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, Jul. 28-Aug. 2, 2019, Volume 1: Long Papers, 2019, pp. 3645-3650. [Online]. Available: https://doi.org/10.18653/v1/p19-1355 [4] M. Pfeiffer and T. Pfeil, Deep Learning With Spiking Neurons: Opportunities and Challenges, Frontiers in Neuroscience, vol. 12, no. October, 2018. [5] S.-C. Liu, B. Rueckauer, E. Ceolini, A. Huber, and T. Delbruck, Event-driven sensing for efficient perception: Vision and audition algorithms, IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 29-37, 2019. [6] Y. A. LeCun, Y. Bengio, and G. E. Hinton, Deep learning, Nature, vol. 521, no. 7553, pp. 436-444, 2015. [7] M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang, Loihi: A neuromorphic manycore processor with on-chip learning, IEEE Micro, vol. 38, no. 1, pp. 82-99, January 2018. [8] J. Sinapov, C. Schenck, and A. Stoytchev, Learning relational object categories using behavioral exploration and multimodal perception, in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 5691-5698. [9] Y. Gao, L. A. Hendricks, K. J. Kuchenbecker, and T. Darrell, Deep learning for tactile understanding from visual and haptic data, in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 536-543. [10] J. Li, S. Dong, and E.
Adelson, Slip detection with combined tactile and visual information, in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7772-7777. [11] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg, Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks, in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8943-8950. [12] J. Lin, R. Calandra, and S. Levine, Learning to identify object instances by touch: Tactile recognition via multimodal matching, in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3644-3650. [13] H. Liu, F. Sun et al., Robotic tactile perception and understanding, 2018. [14] P. Allen, Surface descriptions from vision and touch, in Proceedings. 1984 IEEE International Conference on Robotics and Automation, vol. 1. IEEE, 1984, pp. 394-397. [15] S. Luo, J. Bimbo, R. Dahiya, and H. Liu, Robotic tactile perception of object properties: A review, Mechatronics, vol. 48, pp. 54-67, 2017. [16] H. Liu, Y. Yu, F. Sun, and J. Gu, Visual-tactile fusion for object recognition, IEEE Transactions on Automation Science and Engineering, vol. 14, no. 2, pp. 996-1008, 2016. [17] H. Soh, Y. Su, and Y. Demiris, Online spatio-temporal Gaussian process experts with application to tactile classification, in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 4489-4496. [18] J. Varley, D. Watkins, and P. Allen, Visual-tactile geometric reasoning, in RSS Workshop, 2017. [19] J. Reinecke, A. Dietrich, F. Schmidt, and M. Chalon, Experimental comparison of slip detection strategies by tactile sensing with the BioTac on the DLR hand arm system, in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 2742-2748. [20] Y. Bekiroglu, R. Detry, and D.
Kragic, Learning tactile characterizations of object- and pose-specific grasps, in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2011, pp. 1554-1560. [21] Z. Su, K. Hausman, Y. Chebotar, A. Molchanov, G. E. Loeb, G. S. Sukhatme, and S. Schaal, Force estimation and slip detection/classification for grip control using a biomimetic tactile sensor, in 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids). IEEE, 2015, pp. 297-303. [22] W. Yuan, S. Dong, and E. H. Adelson, GelSight: High-resolution robot tactile sensors for estimating geometry and force, Sensors, vol. 17, no. 12, p. 2762, 2017. [23] R. Calandra, A. Owens, D. Jayaraman, J. Lin, W. Yuan, J. Malik, E. H. Adelson, and S. Levine, More than a feeling: Learning to grasp and regrasp using vision and touch, IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3300-3307, 2018. [24] S. Luo, W. Yuan, E. Adelson, A. G. Cohn, and R. Fuentes, ViTac: Feature sharing between vision and tactile sensing for cloth texture recognition, in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2722-2727. [25] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, K. Daniilidis, D. Scaramuzza, S. Leutenegger, and A. Davison, Event-based Vision: A Survey, Tech. Rep., 2018. [26] A. Mitrokhin, C. Ye, C. Fermuller, Y. Aloimonos, and T. Delbruck, EV-IMO: Motion Segmentation Dataset and Learning Pipeline for Event Cameras, in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019. [27] A. Z. Zhu and L. Yuan, EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras, in Robotics: Science and Systems, 2018. [28] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, Event-based vision meets deep learning on steering prediction for self-driving cars, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5419-5427. [29] A.
Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, Deep learning in spiking neural networks, Neural Networks, vol. 111, pp. 47-63, 2019. [Online]. Available: https://doi.org/10.1016/j.neunet.2018.12.002 [30] S. B. Shrestha and G. Orchard, SLAYER: Spike layer error reassignment in time, in Advances in Neural Information Processing Systems, 2018, pp. 1412-1421. [31] G. Bellec, F. Scherr, E. Hajek, D. Salaj, R. Legenstein, and W. Maass, Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets, arXiv preprint arXiv:1901.09049, 2019. [32] M. Akrout, C. Wilson, P. Humphreys, T. Lillicrap, and D. B. Tweed, Deep learning without weight transport, in Advances in Neural Information Processing Systems, 2019, pp. 974-982. [33] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, A million spiking-neuron integrated circuit with a scalable communication network and interface, Science, vol. 345, no. 6197, pp. 668-673, 2014. [Online]. Available: https://science.sciencemag.org/content/345/6197/668 [34] S. Chevallier, H. Paugam-Moisy, and F. Lemaître, Distributed processing for modelling real-time multimodal perception in a virtual robot, in Parallel and Distributed Computing and Networks, 2005, pp. 393-398. [35] N. Rathi and K. Roy, STDP-based unsupervised multimodal learning with cross-modal processing in spiking neural network, IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1-11, 2018. [36] E. Mansouri-Benssassi and J. Ye, Speech emotion recognition with early visual cross-modal enhancement using spiking neural networks, in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1-8. [37] T. Zhou and J. P.
Wachs, Spiking neural networks for early prediction in human-robot collaboration, The International Journal of Robotics Research, vol. 38, no. 14, pp. 1619-1643, 2019. [Online]. Available: https://doi.org/10.1177/0278364919872252 [38] J. Konstantinova, A. Jiang, K. Althoefer, P. Dasgupta, and T. Nanayakkara, Implementation of tactile sensing for palpation in robot-assisted minimally invasive surgery: A review, IEEE Sensors Journal, vol. 14, no. 8, pp. 2490-2501, 2014. [39] Y. Wu, Y. Liu, Y. Zhou, Q. Man, C. Hu, W. Asghar, F. Li, Z. Yu, J. Shang, G. Liu et al., A skin-inspired tactile sensor for smart prosthetics, Science Robotics, vol. 3, no. 22, p. eaat0429, 2018. [40] Q.-J. Sun, X.-H. Zhao, Y. Zhou, C.-C. Yeung, W. Wu, S. Venkatesh, Z.-X. Xu, J. J. Wylie, W.-J. Li, and V. A. Roy, Fingertip-skin-inspired highly sensitive and multifunctional sensor with hierarchically structured conductive graphite/polydimethylsiloxane foams, Advanced Functional Materials, vol. 29, no. 18, p. 1808829, 2019. [41] J. He, P. Xiao, W. Lu, J. Shi, L. Zhang, Y. Liang, C. Pan, S.-W. Kuo, and T. Chen, A universal high accuracy wearable pulse monitoring system via high sensitivity and large linearity graphene pressure sensor, Nano Energy, vol. 59, pp. 422-433, 2019. [42] T. Callier, A. K. Suresh, and S. J. Bensmaia, Neural coding of contact events in somatosensory cortex, Cerebral Cortex, vol. 29, no. 11, pp. 4613-4627, 2019. [43] W. W. Lee, Y. J. Tan, H. Yao, S. Li, H. H. See, M. Hon, K. A. Ng, B. Xiong, J. S. Ho, and B. C. Tee, A neuro-inspired artificial peripheral nervous system for scalable electronic skins, Science Robotics, vol. 4, no. 32, p. eaax2198, 2019. [44] R. S. Johansson and J. R. Flanagan, Coding and use of tactile signals from the fingertips in object manipulation tasks, Nature Reviews Neuroscience, vol. 10, no. 5, pp. 345-359, 2009. [45] W. Gerstner, Time structure of the activity in neural network models, Physical review E, vol. 51, no. 1, p. 738, 1995. [46] B. Calli, A. 
Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set, IEEE Robotics Automation Magazine, vol. 22, no. 3, pp. 36-52, September 2015. [47] D. Coleman, I. Sucan, S. Chitta, and N. Correll, Reducing the barrier to entry of complex robotic software: a moveit! case study, arXiv preprint arXiv:1404.3785, 2014. [48] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724-1734. [49] P. Blouw, X. Choo, E. Hunsberger, and C. Eliasmith, Benchmarking keyword spotting efficiency on neuromorphic hardware, 2018, arXiv:1812.01739. [50] W. W. Lee, et al., A neuro-inspired artificial peripheral nervous system for scalable electronic skins, Science Robotics, vol. 4, no. 32, p. eaax2198, 2019. [51] J. M. Gandarias, F. Pastor, A. J. García-Cerezo, and J. M. Gómez-de Gabriel, Active tactile recognition of deformable objects with 3d convolutional neural networks, in 2019 IEEE World Haptics Conference (WHC). IEEE, 2019, pp. 551-555. [52] P. Blouw, X. Choo, E. Hunsberger, and C. Eliasmith, Benchmarking keyword spotting efficiency on neuromorphic hardware, 2018, arXiv:1812.01739.