Dynamic thermal management for multi/many core 3D microprocessors

Principal Investigators

Graduate Students

Current Students

  • Xin Huang
  • Sahana Swarup
  • Taeyoung Kim
  • Daniel Quach

Graduate Students (graduated)

  • Dr. Zao Liu (Intel Corp)
  • Dr. Xuexin Liu (Synopsys)

Industry Liaisons

  1. Dr. Valeriy Sukharev, Mentor Graphics Corporation
  2. Dr. Ashish X. Gupta, Intel Corporation
  3. Dr. Jinjun Xiong, IBM Research
  4. Dr. Logendran Bharatham, Freescale Semiconductor, Inc.

Academic Collaborators

  • Dr. Hai Wang, The University of Electronic Science and Technology of China, Chengdu, China
  • Dr. Haibao Chen, Shanghai Jiaotong University, Shanghai, China

Funding

We thank the following funding agencies for their generous support of this project.

  1. National Science Foundation, NSF FRS (Failure Resistant Systems) program (CCF-1255899), "Thermal-Sensitive System-Level Reliability Analysis and Management for Multi-Core and 3D Microprocessors", $180K, April 1, 2013 to March 31, 2016. PI (single PI).
  2. Semiconductor Research Corporation, NSF/SRC Multi-core Program (SRC 2013-TJ-2417), "Thermal-Sensitive System-Level Reliability Analysis and Management for Multi-Core and 3D Microprocessors", $120K, April 1, 2013 to March 30, 2016. PI.
  3. Academic Senate COR (Committee on Research) Fellowships, "Runtime Thermal Management for Multi/many Core and 3D Integrated Systems", $7,500, July 2012 to June 2013. PI.

Awards

  • Dr. Valeriy Sukharev received the prestigious SRC Mahboob Khan Outstanding Industry Liaison Award!
    • The Mahboob Khan Outstanding Industry Liaison/Associate Awards recognize individuals who demonstrate outstanding commitment and effectiveness in facilitating university research, mentoring graduate students, and disseminating knowledge and research results to industry. Dr. Sukharev has been selected as a recipient of one of the 2014 Mahboob Khan Outstanding Industry Liaison/Associate Awards. His dedication and personal contributions as a liaison to SRC research programs, under the direction of Dr. Sheldon Tan, University of California, Riverside, on SRC research #2417.001 (Thermal-Sensitive System-Level Reliability Analysis and Management for Multi-Core and 3D Microprocessors), have served to strengthen our industry. SRC lauds his efforts and holds up his accomplishments as a model for others. The Mahboob Khan Outstanding Industry Liaison/Associate Awards will be presented at the SRC TECHCON 2014 banquet on Monday, September 8th in Austin, TX.
  • X. Huang, T. Yu, V. Sukharev, S. X.-D. Tan, "Physics-based electromigration assessment for power grid networks", Proc. IEEE/ACM Design Automation Conference (DAC'14), San Francisco, June, 2014. (Best Paper Award Nomination (12 out of 787 submissions, 1.5%))
  • H. Chen, S. X.-D. Tan, X. Huang, V. Sukharev, "New electromigration modeling and analysis considering time-varying temperature and current densities", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'15), Chiba, Japan, Jan. 2015. (Best Paper Award Nomination)

Project Descriptions

Background

Reliability has become a significant challenge in the design of current multi-core and emerging 3D microprocessors. Aggressive transistor scaling and increasing processor power density lead to excessive on-chip temperatures and increase the risk that microprocessors will fail. Many long-term failure mechanisms, such as electromigration, stress migration and thermal cycling, are very sensitive to temperature or temperature changes. The elevated temperatures and temperature gradients caused by continued integration in multi-core and emerging 3D microprocessors significantly worsen these reliability problems. Wear-out-based long-term reliability issues were traditionally addressed at the process and manufacturing stages. But as reliability becomes a major design constraint for nanometer VLSI systems, it must be addressed at multiple design layers. As a result, there is an urgent need for reliability awareness and optimization at the micro-architectural design stage. Since many failure mechanisms depend exponentially on temperature, fast and accurate thermal estimation is crucial for reliability analysis and optimization at the architecture and package levels.
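To make the exponential temperature dependence concrete, the sketch below evaluates the classical Black's equation for electromigration, MTTF = A * j^(-n) * exp(Ea / (k*T)). It is only an illustration of the general trend, not one of the models developed in this project. The activation energy (0.9 eV), the current-density exponent (n = 2) and the prefactor are assumed placeholder values.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def black_mttf(current_density, temp_kelvin, prefactor=1.0, n=2.0, ea_ev=0.9):
    """Black's equation: MTTF = A * j**(-n) * exp(Ea / (k * T)).

    prefactor, n and ea_ev are assumed placeholder values, not fitted data.
    """
    return prefactor * current_density ** (-n) * math.exp(
        ea_ev / (BOLTZMANN_EV * temp_kelvin)
    )


def mttf_ratio(temp_cool_k, temp_hot_k, ea_ev=0.9):
    """Ratio MTTF(temp_cool_k) / MTTF(temp_hot_k) at the same current density."""
    return math.exp(ea_ev / BOLTZMANN_EV * (1.0 / temp_cool_k - 1.0 / temp_hot_k))


# Compare 80 degrees C (353.15 K) with 100 degrees C (373.15 K).
ratio = mttf_ratio(353.15, 373.15)
print(f"MTTF at 80 C is {ratio:.2f}x the MTTF at 100 C")
```

Under these assumed parameters, a 20 K rise shortens the modeled lifetime by roughly a factor of five, which is why small errors in temperature estimates translate into large errors in lifetime estimates.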

The motivations of this project

This project addresses the fundamental challenges in system-level reliability modeling, analysis and optimization. The project consists of the following thrusts:

First, we propose to develop architecture-level full-chip reliability modeling and analysis techniques that consider the new structures introduced by emerging integration techniques and the dominant hard failure mechanisms. We will then develop reliability-aware dynamic thermal management techniques for multi-core and 3D stacked microprocessors, focusing on techniques based on task migration and on dynamic voltage and frequency scaling.

Second, we propose to develop full-chip thermal estimation and prediction techniques for run-time system-level reliability analysis and optimization. These techniques account for realistic conditions, such as a limited number of physical thermal sensors and errors in the thermal and power models. For fast thermal analysis and estimation at the design stage, we also propose a module-based hierarchical thermal analysis technique, which promises both accuracy and efficiency.
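One generic way to combine noisy sensor readings with an imperfect thermal model is a Kalman filter. The sketch below is a minimal scalar version for a single thermal node and is not the estimation technique proposed in this project. The discrete model x' = a*x + b*P (with x the temperature rise over ambient), the coefficients, the noise variances and the `demo` helper are all hypothetical. In the demo the model matches the simulated chip exactly, which is an idealization; the process variance `q` is where model error would be budgeted.

```python
import math
import random


def kalman_temperature(readings, powers, a=0.9, b=0.5, t_amb=45.0, q=1.0, r=4.0):
    """Scalar Kalman filter on x = T - t_amb with model x' = a*x + b*P.

    q: assumed process (model error) variance; r: assumed sensor noise variance.
    """
    x, p = 0.0, 10.0  # initial state estimate and its variance
    estimates = []
    for z, power in zip(readings, powers):
        # Predict step: propagate the thermal model one time step.
        x = a * x + b * power
        p = a * a * p + q
        # Update step: blend the prediction with the sensor reading.
        gain = p / (p + r)
        x += gain * ((z - t_amb) - x)
        p *= 1.0 - gain
        estimates.append(t_amb + x)
    return estimates


def demo(steps=300, seed=0):
    """Return (RMS error of raw readings, RMS error of filtered estimates)."""
    rng = random.Random(seed)
    # Power alternates between 10 W and 4 W every 50 steps (made-up workload).
    powers = [10.0 if (k // 50) % 2 == 0 else 4.0 for k in range(steps)]
    truth, x = [], 0.0
    for power in powers:
        x = 0.9 * x + 0.5 * power
        truth.append(45.0 + x)
    readings = [t + rng.gauss(0.0, 2.0) for t in truth]  # sensor noise, std 2 K
    estimates = kalman_temperature(readings, powers)

    def rms(values):
        return math.sqrt(sum((v - t) ** 2 for v, t in zip(values, truth)) / steps)

    return rms(readings), rms(estimates)


raw_err, est_err = demo()
print(f"raw RMS error {raw_err:.2f} K, filtered RMS error {est_err:.2f} K")
```

A full-chip estimator would track a temperature vector with far fewer sensors than nodes; the scalar case only shows the predict/update structure.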

We expect the following results from this research:

  1. Development of architecture-level full-chip reliability modeling and analysis techniques.
  2. Development of reliability-aware dynamic thermal management techniques for the multi-core and 3D stacking microprocessors.
  3. Development of full-chip thermal estimation and prediction techniques for run-time thermal management and optimization that account for practical constraints, such as a limited number of thermal sensors and sensor noise.
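As a rough illustration of what dynamic thermal management by dynamic voltage and frequency scaling (DVFS) involves, the sketch below runs a threshold controller on a single-node lumped RC thermal model, C * dT/dt = P - (T - T_amb) / R. This is a deliberately simple textbook-style controller, not the reliability-aware method targeted by this project. The thermal resistance and capacitance, the operating points and the thresholds are all hypothetical values.

```python
def simulate_dvfs(steps=200, dt=0.01, t_amb=45.0, r_th=0.8, c_th=0.05,
                  t_high=85.0, t_low=75.0):
    """Simulate a threshold-based DVFS controller on a one-node RC thermal model.

    Returns a list of (temperature in C, frequency scale) per time step.
    """
    # (frequency scale, power in W) operating points; hypothetical values.
    levels = [(0.5, 20.0), (0.75, 40.0), (1.0, 70.0)]
    level = len(levels) - 1  # start at full speed
    temp = t_amb
    trace = []
    for _ in range(steps):
        power = levels[level][1]
        # Forward-Euler step of C * dT/dt = P - (T - T_amb) / R.
        temp += dt * (power - (temp - t_amb) / r_th) / c_th
        # Hysteresis: throttle above t_high, speed up again below t_low.
        if temp > t_high and level > 0:
            level -= 1
        elif temp < t_low and level < len(levels) - 1:
            level += 1
        trace.append((temp, levels[level][0]))
    return trace


trace = simulate_dvfs()
peak = max(temp for temp, _ in trace)
print(f"peak temperature {peak:.1f} C, final frequency scale {trace[-1][1]}")
```

With these numbers the uncontrolled steady state at full power would be 45 + 70 * 0.8 = 101 C, so the controller has to throttle. The gap between the two thresholds keeps it from toggling on every step.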

Research tasks and objectives

The objective of this project is to develop novel, efficient system- and architecture-level reliability analysis and optimization techniques for multi-core and 3D microprocessors. We seek to regulate on-chip temperature, which affects wear-out faults the most, in order to manage system reliability dynamically. The work has three thrusts:

  1. Fast thermal estimation and prediction techniques
  2. Full-chip failure rate and MTTF modeling and analysis techniques
  3. New reliability-aware dynamic thermal management techniques

Features of the proposed methods

  1. Address long-term thermal-sensitive reliability issues, such as EM, SM, TDDB and thermal cycling effects, through system-level thermal and power management.
  2. New fast physics-based EM assessment techniques, which are more accurate and predictable than the existing Black's and Blech's equations.
  3. Thermal estimation and prediction techniques that account for more realistic conditions, such as a limited number of thermal sensors and errors in the thermal and power models.
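For context on the baseline named in item 2, the traditional Blech criterion treats a wire as immune to electromigration when the product of its current density and length stays below a critical value. The sketch below shows that check only; it does not represent the physics-based models of this project. The critical product of 3000 A/cm and the wire data are assumed placeholders, not measured values.

```python
def blech_product(current_density_a_per_cm2, length_cm):
    """Blech product j * L in A/cm."""
    return current_density_a_per_cm2 * length_cm


def mortal_wires(wires, critical_product_a_per_cm=3000.0):
    """Return names of wires whose j * L reaches the (assumed) critical product.

    wires: iterable of (name, current density in A/cm^2, length in cm).
    """
    return [name for name, j, length in wires
            if blech_product(j, length) >= critical_product_a_per_cm]


# Hypothetical wires: same current density, different lengths.
wires = [
    ("short_local", 1.0e6, 10e-4),   # 10 um long, j*L is about 1000 A/cm
    ("long_global", 1.0e6, 100e-4),  # 100 um long, j*L is about 10000 A/cm
]
print(mortal_wires(wires))
```

Only the wires flagged by this filter would then need a lifetime estimate, which is how the Blech check is typically paired with Black's equation in conventional flows.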

Invited Presentations by Dr. Sheldon Tan and collaborators

  • Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore, "Thermal Modeling, Estimation and Prediction for Package Design and On-Chip Temperature Regulation", Aug. 16, 2011.
  • The Hong Kong University of Science and Technology, Department of Electrical and Computer Engineering, Hong Kong, China, "Reliable Thermal Estimation and Prediction for On-Chip Temperature Regulation", Aug. 22, 2011.
  • Mentor Graphics Corp, Calibre Group, Fremont, CA, "Thermal Modeling and Analysis Research for High-Performance Package and Chip Design", Dec. 14, 2011.
  • MediaTek Singapore Pte Ltd, Singapore, "Thermal Analysis and Runtime Management Research for Multi-core Microprocessors", July 27, 2012.
  • International Talent Innovation and Entrepreneurship Week of Shanghai, 2012, Shanghai, "New Battery State of Charge Estimation Techniques for EV", Aug. 7, 2012.
  • International Workshop on Emerging Circuits and Systems (IWECS'13), University of Electronic Science and Technology of China (UESTC), Chengdu, Sichuan Province, China, "Thermal resistance modeling and characterization for TSV and TSV array", July 26, 2013.
  • Seoul National University, Embedded System Research Center (ESRC), Seoul, Korea, "Architecture Level Thermal Modeling, Management for Multi-core and 3D Microprocessors", Dec. 10, 2013. Host: Prof. Naehyuck Chang of SNU.
  • The University of Hong Kong, Department of Electrical and Electronic Engineering, Hong Kong, China, "New More Physics-Based Full-Chip Electron-migration Modeling and Analysis", Jan. 24, 2014. Host: Prof. Ngai Wong of Univ. of HK.
  • The University of California at San Diego, Department of Electrical and Computer Engineering, San Diego, CA, "New Physics-Based Full-Chip Electron-Migration Analysis and System-level Reliability Management", April 23, 2014. Host: Prof. Chung-Kuan Cheng of UCSD.
  • The Institute of Computing Technologies, State Key Lab of Computer Architecture, Chinese Academy of Science, Beijing, China, "Physics-Based Full-Chip Electron-Migration Analysis and System-level Reliability Management", July 4th, 2014. Host: Prof. Yu Hu of ICT, CAS.
  • 2nd International Workshop on Cross-layer Resiliency (IWCR 2014), USC Information Science Institute (ISI), Marina del Rey, CA, "Physics-Based Full-Chip Electron-Migration Modeling and System-level Reliability Management", July 28, 2014.
  • EDA workshop, Daejeon Convention Center, Daejeon, Korea, "Physics-Based Full-Chip Electron-Migration Modeling and Cross-Layer Reliability Management", August 26, 2014.
  • University of Electronic Science and Technology of China (UESTC), School of Microelectronics and Solid State Electronics, Chengdu, China, "Physics-Based Full-Chip Electron-Migration Modeling and Cross-Layer Reliability Management", Sept. 10, 2014.
  • 13th International Workshop on Stress-Induced Phenomena in Microelectronics (Stress Workshop), The University of Texas at Austin, Austin, "Physics-Based Electromigration Assessment for Power Grid Networks", Oct. 15th, 2014.

Tutorial Presentations by Dr. Sheldon Tan

  • Valeriy Sukharev, Sheldon Tan, Marko Chew, "Full-chip Electromigration Assessment and System-level EM Reliability Management", embedded tutorial, IEEE/ACM International Conference on Computer-Aided Design (ICCAD'14), Nov. 2014.

Software Download

The new physics-based EM models and analysis methods have been released on GitHub (see physics_based_em_assessment_analysis).

Publications

Journal publications

  • J1 D. Li, S. X.-D. Tan, E. H. Pacheco, M. Tirumala, "Parameterized architecture-level thermal modeling for multi-core microprocessors", ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 15, no. 2, pp. 1-22, February 2010 (one of the top 10 downloaded ACM TODAES articles published in 2010).
  • J2 T. Eguia, S. X.-D. Tan, R. Shen, D. Li, E. H. Pacheco, M. Tirumala, L. Wang, "General parameterized thermal modeling for high-performance microprocessor design", IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), vol. 20, no. 2, pp. 221-224, Feb. 2012. DOI: 10.1109/TVLSI.2010.2098054.
  • J3 H. Wang, S. X.-D. Tan, D. Li, A. Gupta, Y. Yuan, "Composable Thermal Modeling and Simulation for Architecture-Level Thermal Designs of Multi-core Microprocessors", ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 18, no. 2, March 2013.
  • J5 Z. Liu, S. X.-D. Tan, X. Huang and H. Wang, "Task migrations for distributed thermal management considering transient effects", IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), vol. 23, no. 2, Feb. 2015.
  • J6 Z. Liu, S. Swarup, S. X.-D. Tan, H. Chen, H. Wang, "Compact lateral thermal resistance model of TSVs for fast finite-difference based thermal analysis of 3D stacked ICs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 33, no. 10, Oct. 2014.
  • J7 H. Chen, S. X.-D. Tan, D. H. Shin, X. Huang, H. Wang and G. Shi, "H2-Matrix-based Finite Element Linear Solver for Fast Transient Thermal Analysis of High-Performance ICs", Int. J. Circ. Theor. Appl. (in press), DOI: 10.1002/cta.2051.
  • J8 H. Chen, Y. Li, S. X.-D. Tan, X. Huang, H. Wang and N. Wong, "H-matrix based finite-element-based thermal analysis for 3D ICs", ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 20, no. 47, pp. 47:1-25, 2015.
  • J10 X. Huang, A. Kteyan, S. X.-D. Tan, V. Sukharev, "Physics-based electromigration models and full-chip assessment for power grid networks", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 35, no. 11, pp. 1848-1861, Nov. 2016.
  • J11 H. Chen, S. X.-D. Tan, X. Huang, T. Kim, V. Sukharev, "Analytical modeling and characterization of electromigration effects for multi-branch interconnect trees", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 35, no. 11, pp. 1811-1824, Nov. 2016.
  • J12 H. Wang, J. Ma, S. X.-D. Tan, C. Zhang, H. Tang, and K. Huang, "Hierarchical dynamic thermal management method for high-performance many-core microprocessors", ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 22, no. 1, pp. 1:1-1:21, July 2016.
  • J13 X. Huang, V. Sukharev, J.-H. Choy, M. Chew, T. Kim, S. X.-D. Tan, "Electromigration assessment for power grid networks considering temperature and thermal stress effects", Integration, the VLSI Journal, vol. 55, pp. 307-315, September 2016, ISSN 0167-9260, https://doi.org/10.1016/j.vlsi.2016.04.001.
  • J14 X. Huang, V. Sukharev, T. Kim, S. X.-D. Tan, "Dynamic electromigration modeling for transient stress evolution and recovery under time-dependent current and temperature stressing", Integration, the VLSI Journal, available online 12 November 2016, ISSN 0167-9260, https://doi.org/10.1016/j.vlsi.2016.10.007.

 

Conference publications

  • C1 H. Wang, S. X.-D. Tan, X. Liu, A. Gupta, "Runtime power estimator calibration for high-performance microprocessors", Proc. Design, Automation and Test in Europe (DATE'12), pp.352-357, Dresden, Germany, March 2012.
  • C2 Z. Liu, S. X.-D. Tan, H. Wang, A. Gupta, and S. Swarup, "Compact nonlinear thermal modeling of packaged integrated systems", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'13), pp. 157-162, Yokohama, Japan, Jan. 2013.
  • C3 Z. Liu, T. Xu, S. X.-D. Tan, and H. Wang, "Dynamic thermal management for multi-core microprocessors considering transient thermal effects", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'13), pp.473-478, Yokohama, Japan, Jan. 2013.
  • C4 H. Wang, S. X.-D. Tan, S. Swarup, and X. Liu, "A power-driven thermal sensor placement algorithm for dynamic thermal management", Proc. Design, Automation and Test in Europe (DATE'13), pp.1215-1220, Grenoble, France, March 2013.
  • C5 Z. Liu, S. Swarup, and S. X.-D. Tan, "Compact lateral thermal resistance modeling and characterization for TSV and TSV array", Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD'13), San Jose, CA, Nov. 2013.
  • C6 Z. Liu, X. Huang, S. X.-D. Tan, H. Wang, H. Tang, "Distributed task migration for thermal hot spot reduction in many-core microprocessors", Proc. International Conference on ASIC (ASICON'13), Shenzhen, China, Oct. 2013.
  • C7 Y. Chi, S. X.-D. Tan, T. Yu, X. Huang and N. Wong, "Direct finite-element-based solver for 3D-IC thermal analysis via H-matrix representation", Proc. Int. Symposium on Quality Electronic Design (ISQED'14), San Jose, CA, March, 2014.
  • C8 X. Huang, T. Yu, V. Sukharev, S. X.-D. Tan, "Physics-based electromigration assessment for power grid networks", Proc. IEEE/ACM Design Automation Conference (DAC'14), San Francisco, June, 2014. (Best Paper Award Nomination (12 out of 787 submissions, 1.5%))
  • C9 Z. Liu, X. Huang, V. Sukharev and S. X.-D. Tan, "EM-reliability system modeling and performance optimization for high-performance microprocessors", TECHCON'14, Austin, TX, Sept. 2014.
  • C10 V. Sukharev, X. Huang, H. Chen and S. X.-D. Tan, "IR-drop based electromigration assessment: parametric failure chip-scale analysis", Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD'14), San Jose, CA, Nov. 2014.
  • C11 T. Kim, B. Zheng, H. Chen, Q. Zhu, V. Sukharev and S. X.-D. Tan, "Lifetime optimization for real-time embedded systems considering electromigration effects", Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD'14), San Jose, CA, Nov. 2014.
  • C12 J. Ma, H. Wang, S. X.-D. Tan, C. Zhang, H. Tang, "Hybrid dynamic thermal management method with model predictive control", IEEE Asia Pacific Conference on Circuits and Systems (APCCAS'14), Ishigaki Island, Okinawa, Japan, Nov. 2014.
  • C13 H. Chen, S. X.-D. Tan, X. Huang, V. Sukharev, "New electromigration modeling and analysis considering time-varying temperature and current densities", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'15), Chiba, Japan, Jan. 2015. (Best Paper Award Nomination)
  • C14 H. Chen, X. Huang, V. Sukharev, S. X.-D. Tan, T. Kim, "Interconnect reliability modeling and analysis for multi-branch interconnect trees", Proc. IEEE/ACM Design Automation Conference (DAC'15), San Francisco, June, 2015.
  • C15 T. Kim, X. Huang, V. Sukharev and S. X.-D. Tan, "A dynamic reliability management framework for dark silicon", TECHCON'2015, Austin, TX, September 2015.
  • C16 X. Huang, V. Sukharev, J.-H. Choy, H. Chen, E. Tlelo-Cuautle and S. X.-D. Tan, "Full-chip electromigration assessment: effect of cross-layout temperature and thermal stress distributions", International Conference on Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD), Istanbul, Turkey, Sept. 2015.
  • C17 T. Kim, X. Huang, V. Sukharev and S. X.-D. Tan, "Learning-based reliability management for dark silicon systems", 6th IEEE International Workshop on Testing 3D Stacked ICs (3D-Test), Anaheim, CA, Oct. 2015.
  • C18 X. Huang, V. Sukharev, T. Kim, H. Chen and S. X.-D. Tan, "Electromigration recovery modeling and analysis under time-dependent current and temperature stressing", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'16), Macao, China, Jan. 2016.
  • C19 T. Kim, X. Huang, H. Chen, V. Sukharev and S. X.-D. Tan, "Learning-based dynamic reliability management for dark silicon processor considering EM effects", Proc. Design, Automation and Test in Europe (DATE'16), Dresden, March 2016.
  • C20 X. Huang, V. Sukharev, Z. Qi, T. Kim and S. X.-D. Tan, "Physics-based full-chip TDDB assessment for BEOL Interconnects", Proc. IEEE/ACM Design Automation Conference (DAC'16), Austin, TX, June, 2016.
  • C21 T. Kim, Z. Sun, C. Cook, H. Zhao, R. Li, D. Wong and S. X.-D. Tan, "Cross-layer modeling and optimization for electromigration induced reliability", Proc. IEEE/ACM Design Automation Conference (DAC'16), Austin, TX, June 2016. (Invited)
  • C22 C. Cook, Z. Sun, T. Kim and S. X.-D. Tan, "Finite difference method for electromigration analysis of multi-branch interconnects", International Conference on Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD'16), Lisbon, Portugal, June 2016.
  • C23 H. Wang, M. Zhang, S. X.-D. Tan, C. Zhang, Y. Yuan, K. Huang and Z. Zhang, "New power budgeting and thermal management scheme for multi-core systems in dark silicon", 29th IEEE International SoC Conference (SOCC'16), Seattle, WA, Sept. 2016.
  • C24 C. Cook, Z. Sun, T. Kim and S. X.-D. Tan, "Finite difference time domain analysis of stress evolution and void growth for general interconnect wires", TECHCON'2016, Austin, TX, September 2016.
  • C25 H. Zhao, S. X.-D. Tan, H. Wang, H. Chen, "Online unusual behavior detection for temperature sensor networks", 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI'16), pp. 59-62, Pittsburgh, PA, Sept. 2016.
  • C26 X. Chen, H. Chen, W. Ma, X. Li, S. X.-D. Tan, "Energy-efficient wireless temperature sensoring for smart building application", Int. Conf. Solid State and Integrated Circuit Technology (ICSICT'16), Hangzhou, China, Oct. 2016. (Invited)
  • C27 Z. Sun, E. Demircan, M. Shroff, T. Kim, X. Huang, S. X.-D. Tan, "Voltage-based electromigration immortality check for general multi-branch interconnects", Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD'16), Austin, TX, Nov. 2016.
  • C28 T. Kim, Z. Sun, J. Gaddipati, H. Wang, H. Chen, S. X.-D. Tan, "Dynamic reliability management for near-threshold dark silicon processors", Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD'16), Austin, TX, Nov. 2016. (Invited)
  • C29 L. Xu, H. Wang, S. X.-D. Tan, C. Zhang, Y. Yuan, K. Huang, Z. Zhang, "Distributed model predictive control for dynamic thermal management of multi-core systems", Int. Conf. Solid State and Integrated Circuit Technology (ICSICT'16), Hangzhou, China, Oct. 2016.
  • C30 J. Wan, H. Wang, J. He, S. X.-D. Tan, Y. Cai, S. Yang, "A fast full-chip static power estimation method", Int. Conf. Solid State and Integrated Circuit Technology (ICSICT'16), Hangzhou, China, Oct. 2016.
  • C31 S. Wang, H. Zhao, S. X.-D. Tan and M. Tahoori, "Recovery-aware proactive TSV repair for electromigration in 3D ICs", Proc. Design, Automation and Test in Europe (DATE'17), Lausanne, Switzerland, March 2017.
  • C32 X. Wang, H. Wang, J. He, S. X.-D. Tan, Y. Cai and S. Yang, "Physics-based electromigration modeling and assessment for multi-segment interconnects in power grid networks", Proc. Design, Automation and Test in Europe (DATE'17), Lausanne, Switzerland, March 2017.