
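National Natural Science Foundation of China (61272145)

Works: 5  Citations: 9  H-index: 2
Related authors: Wen Mei, Shen Junzhong, Xiao Tao, Qiao Yuran, Yang Qianming
Related institutions: National University of Defense Technology, Hunan Provincial Fire Brigade
Funding sources: National Natural Science Foundation of China, Doctoral Program Foundation of the Ministry of Education of China, National High Technology Research and Development Program of China
Related fields: Automation and Computer Technology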

Document type

  • Journal articles (5)
  • Conference papers (1)

Field

  • Automation and Comput... (6)
  • Electronics and Telecommunications (1)

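Topic

  • FPGA (2)
  • OPENCL (2)
  • Mobile phone (1)
  • Architecture (1)
  • Logic (1)
  • Logic partitioning (1)
  • Matrix (1)
  • Matrix multiply (1)
  • Matrix multiplication (1)
  • Cluster architecture (1)
  • Accelerator (1)
  • Accelerator design (1)
  • Non-uniform (1)
  • Blocking (1)
  • Blocking strategy (1)
  • MULTI-... (1)
  • PERFOR... (1)
  • PROGRA... (1)
  • SHARED (1)
  • BASED (1)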

Institution

  • National University of Defense Tech... (2)
  • Hunan Provincial Fire... (1)
  • National University of Defense Technology (1)

Author

  • Wen Mei (2)
  • Yang Qianming (1)
  • Qiao Yuran (1)
  • Xiao Tao (1)
  • Shen Junzhong (1)

Source

  • Computer Engineering and... (2)
  • Fronti... (2)
  • Journa... (1)
  • The 18th Comput... (1)

Year

  • 2018 (1)
  • 2017 (1)
  • 2016 (1)
  • 2015 (1)
  • 2014 (1)
  • 2013 (1)
5 records; showing 1-6
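A matrix multiplication accelerator design supporting an optimized blocking strategy (cited by: 4)
2016
In many application domains, large-scale floating-point matrix multiplication is often one of the most time-consuming computational kernels. Emerging applications frequently involve large matrices in which at least one dimension is very small; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, computing a large matrix multiplication usually requires partitioning the matrices into fine-grained sub-block tasks. When accelerating non-uniform matrix multiplication, most existing hardware matrix multipliers with a linear-array structure suffer a substantial performance drop because they support only a fixed block size. To address this problem, an effective optimized blocking strategy is proposed. Based on this strategy, a matrix multiplier supporting variable block sizes is implemented on a Xilinx Zynq XC7Z045 FPGA. Integrating 224 processing elements, the multiplier achieves a measured performance of 48 GFLOPS at a clock frequency of 150 MHz on non-uniform matrix multiplications from real applications, while requiring only 4.8 GB/s of bandwidth. Experimental results show that the proposed blocking strategy delivers up to a 12% performance improvement over the conventional blocking algorithm.
Shen Junzhong, Xiao Tao, Qiao Yuran, Yang Qianming, Wen Mei
Keywords: FPGA; matrix multiplication; blocking strategy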
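The effect of block size on a non-uniform matrix can be illustrated with a minimal sketch. This is not the blocking strategy from the paper, which the abstract does not spell out; the function names `blocked_matmul` and `tile_utilization` and the example dimensions are assumptions made for illustration only.

```python
def blocked_matmul(a, b, block_m, block_n):
    """Multiply a (M x K) by b (K x N), one block_m x block_n output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, block_m):
        for j0 in range(0, n, block_n):
            # On an accelerator, this tile of results would live in on-chip memory.
            for i in range(i0, min(i0 + block_m, m)):
                for j in range(j0, min(j0 + block_n, n)):
                    c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c


def tile_utilization(m, n, block_m, block_n):
    """Fraction of tile slots that hold real output elements (1.0 means no padding)."""
    tiles_m = -(-m // block_m)  # ceiling division
    tiles_n = -(-n // block_n)
    return (m * n) / (tiles_m * block_m * tiles_n * block_n)


# Hypothetical non-uniform output: 8 rows by 1024 columns.
fixed = tile_utilization(8, 1024, 64, 64)      # square tile, mostly padding
adapted = tile_utilization(8, 1024, 8, 512)    # same tile capacity, reshaped
```

With the same tile capacity (4096 elements), the square tile leaves most slots idle on this shape while the reshaped tile fills every slot, which is the kind of loss a fixed block size incurs.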
Exploiting a depth context model in visual tracking with correlation filter
2017
Recently correlation filter based trackers have attracted considerable attention for their high computational efficiency. However, they cannot handle occlusion and scale variation well enough. This paper aims at preventing the tracker from failure in these two situations by integrating the depth information into a correlation filter based tracker. By using RGB-D data, we construct a depth context model to reveal the spatial correlation between the target and its surrounding regions. Furthermore, we adopt a region growing method to make our tracker robust to occlusion and scale variation. Additional optimizations such as a model updating scheme are applied to improve the performance for longer video sequences. Both qualitative and quantitative evaluations on challenging benchmark image sequences demonstrate that the proposed tracker performs favourably against state-of-the-art algorithms.
Zhao-yun CHEN, Lei LUO, Da-fei HUANG, Mei WEN, Chun-yuan ZHANG
Efficient fine-grained shared buffer management for multiple OpenCL devices
2013
OpenCL programming provides full code portability between different hardware platforms, and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are required to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between different devices and managing data transfer between multiple devices. All these tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly, by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified, shared, invalid (MSI) cache coherency protocol is implemented for DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy to minimize communication cost between devices by launching each necessary data transfer as early as possible. This strategy enables overlap of data transfer with kernel execution. Our experimental results show that the applicability of our method for buffer access range analysis is good, and the efficiency of DSOM is high.
Chang-qing XUN, Dong CHEN, Qiang LAN, Chun-yuan ZHANG
Keywords: OpenCL
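The modified/shared/invalid scheme mentioned in the abstract can be sketched as a small state machine. This is a whole-buffer toy model under stated assumptions, not DSOM's implementation: the class `SharedBuffer`, the device names, and the transfer log are hypothetical, and the fine-grained access-range analysis and early-transfer strategy are not modelled.

```python
class SharedBuffer:
    """Tracks a per-device MSI state for one shared buffer (illustrative only)."""

    def __init__(self, devices):
        self.state = {d: "I" for d in devices}  # every device copy starts Invalid
        self.transfers = []                     # log of (source, destination) copies

    def _fetch(self, device):
        # If another device holds a Modified copy, write it back to host memory first.
        owner = next((d for d, s in self.state.items() if s == "M"), None)
        if owner is not None:
            self.transfers.append((owner, "host"))
            self.state[owner] = "S"
        self.transfers.append(("host", device))

    def read(self, device):
        """A kernel on `device` reads the buffer."""
        if self.state[device] == "I":
            self._fetch(device)
            self.state[device] = "S"

    def write(self, device):
        """A kernel on `device` writes the buffer; all other copies become Invalid."""
        if self.state[device] == "I":
            self._fetch(device)
        for d in self.state:
            if d != device:
                self.state[d] = "I"
        self.state[device] = "M"


buf = SharedBuffer(["gpu0", "gpu1"])
buf.write("gpu0")   # gpu0 becomes Modified
buf.read("gpu1")    # gpu0 is written back; both copies become Shared
buf.write("gpu1")   # gpu1 becomes Modified, gpu0 Invalid, no new transfer
```

Host memory acts as the backing store here, mirroring the abstract's description of shared buffers living in system memory with on-device memory used as a cache.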
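A prototype verification system for on-chip cluster architectures
Verification methods for processor designs can be divided into software methods and hardware methods. Many-core processors have a large logic scale, so software-based verification is slow, and an FPGA-based prototype system is generally used for verification instead. In practice, hardware prototype systems are found to suffer from the following problems: (1) because the capacity of a single FPGA cannot...
Wang Ziwei, Qiao Yuran, Yang Qianming, Wu Nan, Wen Mei
Keywords: CLUSTERCHIP; logic partitioning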
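Accelerating CNN convolution on mobile GPUs (cited by: 5)
2018
Thanks to their excellent performance, convolutional neural networks (CNNs) are playing an increasingly important role in fields such as image classification and speech recognition, and some researchers have sought to bring this deep learning process to mobile phones. However, because of the enormous computational cost of CNNs, the performance of ported programs has remained unsatisfactory. To explore how to address this problem, the forward pass of a CNN is implemented on a mobile phone using the deep learning framework MXNet, with attention focused on another powerful computing device on the phone: the GPU. The OpenCL general-purpose programming framework is ultimately chosen to carry out convolution, the most time-consuming operation in the forward pass, as a matrix multiplication offloaded to the GPU. On this basis, several optimizations targeting mobile GPUs are also applied. Experimental results show that the forward-pass time is reduced to half of the original.
Wang Xiangxin, Shi Yang, Wen Mei
Keywords: CNN; mobile phone; OpenCL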
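Expressing convolution as a matrix multiplication is commonly done by unrolling image patches into rows (often called im2col). The sketch below shows that general idea on the CPU in plain Python; it assumes a single input channel, stride 1, and no padding, and the function names are hypothetical. It is not the MXNet or OpenCL code from the paper.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ]


def conv2d_as_matmul(image, kernels):
    """CNN-style convolution (cross-correlation) as patches times flattened kernels."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    patches = im2col(image, kh, kw)
    flat = [[v for row in k for v in row] for k in kernels]
    # One output row per output pixel, one column per kernel.
    return [[sum(p * q for p, q in zip(patch, f)) for f in flat] for patch in patches]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
result = conv2d_as_matmul(image, kernels)
```

Once the patches form a matrix, the whole layer reduces to one dense matrix product, which is the operation a GPU matrix-multiply kernel can take over.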
Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations
2015
OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus has been extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may lead to an opposite performance effect, because local-memory arrays no longer match well with the hardware and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that makes this transformation of GPU-specific OpenCL kernels into a CPU-friendly form, which is accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.
Mei WEN, Da-fei HUANG, Chang-qing XUN, Dong CHEN
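The transformation described in the abstract (coarsening work-items into a loop and dropping the local-memory array and its barrier) can be mimicked in plain Python. This is a hand-written emulation on a toy kernel that reverses each work-group's elements; both function names are hypothetical, and the paper's automated analysis with array-access descriptors is not reproduced here.

```python
def reverse_groups_gpu_style(data, group_size):
    """Emulate a GPU-style kernel: stage a tile in local memory, barrier, then read it."""
    out = [None] * len(data)
    for g0 in range(0, len(data), group_size):
        n = min(group_size, len(data) - g0)
        local = [None] * n
        for lid in range(n):                  # phase 1: each work-item loads one element
            local[lid] = data[g0 + lid]
        # On a GPU, barrier(CLK_LOCAL_MEM_FENCE) would separate the two phases.
        for lid in range(n):                  # phase 2: each work-item reads a mirrored slot
            out[g0 + lid] = local[n - 1 - lid]
    return out


def reverse_groups_cpu_style(data, group_size):
    """Coarsened form: one loop per work-group, no local array and no barrier."""
    out = [None] * len(data)
    for g0 in range(0, len(data), group_size):
        n = min(group_size, len(data) - g0)
        for lid in range(n):
            out[g0 + lid] = data[g0 + n - 1 - lid]  # read global memory directly
    return out


sample = list(range(10))
same = reverse_groups_gpu_style(sample, 4) == reverse_groups_cpu_style(sample, 4)
```

Both versions compute the same result; the second avoids the staging copy and the synchronization point, which is the cost the abstract attributes to local memory on CPUs.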