Transient Thermal Analysis of MCM towards Understanding Failure Mechanism of Intermittent Faults

Guanqian Deng, Peng Yang, Jing Qiu*, Guanjun Liu, Kehong Lv

Laboratory of Science and Technology on Integrated Logistics Support, College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China
qiujing16@sina.com

Temperature fluctuation, which can be treated as a fault event, has been recognized as a major cause of intermittent faults (IFs). To investigate the failure mechanism of temperature-related IFs, the regulation of IF evolution through fault and reset events is studied and a transient thermal analysis is carried out. The analysis shows that the temperature of the multi-chip module (MCM) fluctuates with variations in the power input and the ambient temperature, which may activate or deactivate IFs. Finally, a temperature test is carried out to confirm the IF phenomenon. This work can help in improving thermal design and understanding the occurrence of IFs and, hence, in implementing IF tolerance, diagnosis and recovery techniques.

1. Introduction

The continuous scaling of semiconductor technology has led to remarkable performance gains. At the same time, smaller transistor and interconnect sizes, higher clock frequencies, and lower supply voltages have contributed to higher rates of occurrence of certain types of faults, which are typically classified as permanent, intermittent and transient (Constantinescu, 2008). Field data and failure analysis have shown that intermittent faults (IFs) constitute a large proportion of logical errors in semiconductor devices. Even in an optimal environment, these faults can occur 10 to 30 times as frequently as permanent faults (PFs) (Ismaeel, 1997). Further, in-progress wear-out and residual manufacturing errors are likely to result in more IFs. Therefore, IFs are likely to be a significant concern in future systems, despite the extensive use of fault avoidance techniques (Rashid, 2010).

Many different factors contribute to IFs (Rashid, 2010). The behavior of semiconductor devices is closely related to junction temperature, and temperature fluctuation has been recognized as a prominent external disturbance causing IFs in electronic devices. Increasing temperature accelerates aging, decreases timing margins and thus promotes the occurrence of failures (Haiyu, 2008). To avoid faults caused by overheating, research has been devoted to the thermal analysis of semiconductor devices (Boudreaux, 2004), and the finite element method (FEM) is one of the most common approaches. Steady-state thermal analyses, in which the thermal load is assumed to be constant, have been carried out extensively, and electronic devices are usually designed to work within a certain temperature range based on their results. In practice, however, the power input and the ambient temperature of electronic devices vary with time, so adverse states that a steady-state analysis does not capture may appear (Guilhemsang, 2010). Therefore, transient thermal analysis is needed. Owing to advantages such as small size and high wiring capability, the multi-chip module (MCM) is widely applied in fields such as aerospace. In this paper, a transient thermal analysis of an MCM is carried out using ANSYS to analyse its thermal field.

2. Regulations of fault evolution

The temperature of an MCM usually varies continuously over a range rather than staying at a fixed value. When the temperature exceeds the operating limit (OL), the MCM usually works abnormally; once the temperature falls back to normal, the failure disappears automatically. When the temperature exceeds the destroy limit (DL), the failure is usually irreversible and the fault state persists until the faulty chip is replaced (Constantinescu, 2007). Therefore, we divide the temperature range into three intervals according to OL and DL and denote them as the normal (below OL), IF (between OL and DL) and PF (above DL) stress intervals. Since temperature is the prominent cause of these faults, we treat it as a fault event; hence the three intervals correspond to the normal event, the IF event $f_{ir}$ and the PF event $f_{IP}$, respectively. IF behavior often occurs intermittently, with a fault event followed by a corresponding “reset” event for this fault, followed by new occurrences of the fault event, and so forth (Contant, 2004). Thus, an IF involves the IF event $f_{ir}$ and its reset event $r_i$, and the IF occurs intermittently through $f_{ir}$ and $r_i$. Each $f_{ir}$ has its corresponding $r_i$, where $r_i$ cannot happen until $f_{ir}$ has occurred at least once. IFs can be activated and deactivated frequently but irregularly for a period of time (Ricks, 2010). An example of IF behavior activated and deactivated by temperature is shown in Figure 1.

![Figure 1: An example of fault behaviour of IFs](image)

In Figure 1, IF behavior usually follows a square wave pattern. The fault amplitude, the time between faulty behaviors and the duration of each faulty behavior may all vary. The high parts of the wave, caused by $f_{ir}$, represent points in time at which an IF is present and ongoing, and the low parts, caused by $r_i$, represent an IF that has occurred but is not currently ongoing. Each high-low pattern represents one cycle of the wave. Since fault events are usually unobservable and active IFs and PFs behave similarly, it may be unclear whether a fault is intermittent or permanent when it first occurs. Since temperature may lead to both IFs and PFs, the ascending part of a temperature fluctuation can be $f_{ir}$ or $f_{IP}$, and the descending part may be $r_i$. In this way, a temperature fluctuation can be any of the events mentioned above. We denote the current fault by $FD$, where $D$ stands for “to be diagnosed”. Since it is difficult to recognize the fault type (with bounded delay), we denote the current fault event by $f_{ID}$, which is either $f_{IP}$ or $f_{ir}$. The system evolution through $f_{IP}$, $f_{ir}$ and $r_i$ is illustrated in Figure 2.
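
The mapping from a temperature trace to fault and reset events described above can be sketched in Python. This is only an illustration under stated assumptions: the threshold values, the sample trace and the event labels `IF_fault`, `IF_reset` and `PF_fault` are hypothetical and are not taken from the paper's experiments.

```python
from enum import Enum


class Stress(Enum):
    NORMAL = "normal"
    IF = "IF"
    PF = "PF"


def classify(temp_c, ol, dl):
    """Map a temperature to a stress interval using the limits OL and DL."""
    if temp_c > dl:
        return Stress.PF
    if temp_c > ol:
        return Stress.IF
    return Stress.NORMAL


def fault_events(trace, ol, dl):
    """Return (sample index, event label) pairs for a temperature trace.

    An IF fault event is logged when the trace enters the IF interval,
    a reset event when it falls back to normal, and a PF event when it
    exceeds DL. Scanning stops at the PF event because it is irreversible.
    """
    events = []
    state = Stress.NORMAL
    for i, temp_c in enumerate(trace):
        new_state = classify(temp_c, ol, dl)
        if new_state is Stress.PF:
            events.append((i, "PF_fault"))
            break
        if state is Stress.NORMAL and new_state is Stress.IF:
            events.append((i, "IF_fault"))
        elif state is Stress.IF and new_state is Stress.NORMAL:
            events.append((i, "IF_reset"))
        state = new_state
    return events


# Hypothetical limits and trace, for illustration only.
trace = [40, 60, 90, 95, 70, 88, 60, 130, 50]
print(fault_events(trace, ol=85.0, dl=125.0))
```

Each `IF_fault`/`IF_reset` pair corresponds to one high-low cycle of the square wave, and a reset can only be logged after a fault event has occurred.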

![Figure 2: The system evolution through fault and reset events](image)

In Figure 2, $N$ denotes the normal state and $\epsilon$ denotes the empty event. Fault evolution follows certain regulations: IFs can recover automatically once they have occurred, and they can, and eventually will, turn into PFs through $f_{IP}$. $FD$ can be recognized from the succeeding observable events, combined with the regulations of fault evolution.
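
The evolution rules stated above can be written as a small state machine. The sketch below is not a transcription of Figure 2: the state names `IF_active` and `IF_dormant` and the event labels are assumptions introduced for illustration, chosen so that a reset is only allowed after an IF has been activated and a PF is absorbing.

```python
# Allowed transitions (state, event) -> next state; labels are hypothetical.
TRANSITIONS = {
    ("N", "IF_fault"): "IF_active",
    ("N", "PF_fault"): "PF",
    ("IF_active", "reset"): "IF_dormant",
    ("IF_active", "PF_fault"): "PF",
    ("IF_dormant", "IF_fault"): "IF_active",
    ("IF_dormant", "PF_fault"): "PF",
}


def step(state, event):
    """Advance one event; PF is absorbing, unknown moves raise ValueError."""
    if state == "PF":
        return "PF"
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state!r}") from None


def run(events, state="N"):
    """Apply a sequence of events starting from the normal state N."""
    for event in events:
        state = step(state, event)
    return state


print(run(["IF_fault", "reset", "IF_fault"]))
print(run(["IF_fault", "reset", "PF_fault", "reset"]))
```

In this sketch, an observed reset after a fault is what distinguishes an intermittent fault from a permanent one, which mirrors the idea of recognizing the fault type from succeeding events.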

3. Finite element model of MCM

The finite element model is the basis of the transient thermal analysis. The MCM YHFT is considered in this paper. The power devices on the front face of the MCM are the chips SDRAM, YHFT_DSP and TPS70445; the Flash chip is on the reverse side. The YHFT_DSP chip is mounted on the substrate by flip-chip bonding, and solder joints arranged in an array connect the substrate to the printed circuit board (PCB). The other chips are connected to the PCB by leads. The passive devices are neglected because they dissipate little or no power. The dimensions and material parameters of the MCM are listed in Table 1.

Table 1: Dimensions and material parameters of the MCM

<table>
<thead>
<tr>
<th>Component</th>
<th>Material</th>
<th>Dimension (mm)</th>
<th>Thermal conductivity (W/(m·K))</th>
<th>Modulus of elasticity EX (GPa)</th>
<th>Poisson ratio PRXY</th>
<th>Thermal expansion coefficient ALPX (1/K)</th>
</tr>
</thead>
<tbody>
<tr>
<td>YHFT_DSP</td>
<td>Si</td>
<td>23×23×1</td>
<td>82</td>
<td>131</td>
<td>0.3</td>
<td>2.8E-6</td>
</tr>
<tr>
<td>SDRAM</td>
<td>Si</td>
<td>22×10×1</td>
<td>82</td>
<td>131</td>
<td>0.3</td>
<td>2.8E-6</td>
</tr>
<tr>
<td>TPS70445</td>
<td>Si</td>
<td>8×4×1</td>
<td>82</td>
<td>131</td>
<td>0.3</td>
<td>2.8E-6</td>
</tr>
<tr>
<td>Substrate</td>
<td>polyimide</td>
<td>27×27×0.5</td>
<td>0.2</td>
<td>22</td>
<td>0.28</td>
<td>18E-6</td>
</tr>
<tr>
<td>Solder</td>
<td>63Pb37Sn</td>
<td>272,r:0.445, H:0.65,Pitch:1.27</td>
<td>50</td>
<td>43.25</td>
<td>0.363</td>
<td>21E-6</td>
</tr>
<tr>
<td>PCB</td>
<td>FR4</td>
<td>90×76×2</td>
<td>8.37,8.37,0.32</td>
<td>22</td>
<td>0.28</td>
<td>18E-6</td>
</tr>
<tr>
<td>Lead</td>
<td>Cu</td>
<td>L:0.8;L:0.5;W:0.2; H:0.4;D:0.5;N:43</td>
<td>390</td>
<td>120.658</td>
<td>0.345</td>
<td>17E-6</td>
</tr>
<tr>
<td>Paste</td>
<td>paste</td>
<td>0.15</td>
<td>1.1</td>
<td>5.2</td>
<td>0.3</td>
<td>40E-6</td>
</tr>
</tbody>
</table>
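
As a rough check on how the parameters in Table 1 influence heat flow, the through-thickness conduction resistance of a flat layer can be estimated as $R = t/(kA)$. The sketch below assumes one-dimensional conduction over the full footprint of each layer with no heat spreading, which is a simplification and not the finite element model itself; the helper name `slab_resistance` is hypothetical.

```python
def slab_resistance(thickness_mm, conductivity, length_mm, width_mm):
    """Through-thickness conduction resistance R = t / (k * A) in K/W.

    Dimensions are given in mm as in Table 1; conductivity is in W/(m*K).
    """
    area_m2 = (length_mm * 1e-3) * (width_mm * 1e-3)
    return (thickness_mm * 1e-3) / (conductivity * area_m2)


# Values taken from Table 1.
r_die = slab_resistance(1.0, 82.0, 23.0, 23.0)       # YHFT_DSP silicon die
r_substrate = slab_resistance(0.5, 0.2, 27.0, 27.0)  # polyimide substrate

print(f"YHFT_DSP die: {r_die:.3f} K/W")
print(f"Substrate:    {r_substrate:.2f} K/W")
```

Under these assumptions the die contributes about 0.023 K/W and the substrate about 3.43 K/W, so the low-conductivity substrate dominates the conduction path beneath the chip.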

The finite element model of YHFT is constructed as shown in Figure 3.

Figure 3: The finite element model of YHFT

In Figure 3, the mesh utilized is not uniform. The regions closer to the connections are assigned with finer meshes than the other regions.

4. Transient thermal analysis of MCM

To simulate the actual working conditions, the ambient temperature of the MCM must be obtained first. An example of the ambient temperature of an MCM is shown in Figure 4 (YU, 2000).

Figure 4: The ambient temperature of internal avionics pods

In Figure 4, the ambient temperature of the MCM varies from 37.5 °C to 50 °C, and the most severe thermal load occurs at about 30 minutes. The power dissipations of SDRAM, YHFT_DSP, TPS70445 and Flash are 1 W, 3 W, 0.5 W and 0.6 W, respectively. The forced-air convection coefficient is $40\ \text{W} \cdot (\text{m}^2 \cdot {}^\circ\text{C})^{-1}$. The transient thermal analysis is carried out, and the resulting temperature contour of the MCM is shown in Figure 5.

![Figure 5: The temperature contour of MCM](image)

Figure 5 shows that the temperature distribution of the MCM is non-uniform. The temperature is highest on the chips and decreases gradually from the center to the edge. The highest temperature occurs on the surface of the YHFT_DSP chip, which has the largest power dissipation; this indicates that the effect of power dissipation cannot be neglected. To analyze the effect of the ambient temperature, nodes on the surfaces of the chips are selected, and their temperature curves over time are shown in Figure 6.

![Figure 6: Temperature curves of the selected nodes on MCM](image)

Figure 6 shows that the node temperatures of the MCM vary with time and that their trend is similar to that of the ambient temperature. This indicates that the trend of the MCM surface temperature depends on the ambient temperature. According to the analysis in Section 2, IFs can therefore be activated or deactivated by the variation of the MCM temperature.
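
The way a device temperature tracks a time-varying ambient temperature can be illustrated with a single-node lumped model, $C\,\mathrm{d}T/\mathrm{d}t = P - hA\,(T - T_{amb}(t))$, integrated with explicit Euler steps. This is a minimal sketch and not the ANSYS finite element analysis: the total power (5.1 W) and the convection coefficient come from the text, while the triangular ambient profile, the convective area (both faces of the 90 mm × 76 mm PCB) and the heat capacity are assumptions chosen for illustration.

```python
def ambient(t_min):
    """Hypothetical ambient profile in deg C: 37.5 -> 50 at 30 min, then back."""
    if t_min <= 30.0:
        return 37.5 + 12.5 * t_min / 30.0
    return max(37.5, 50.0 - 12.5 * (t_min - 30.0) / 30.0)


def simulate(power_w=5.1, h=40.0, area_m2=0.01368, heat_cap=60.0,
             dt_s=1.0, t_end_min=60.0):
    """Integrate C*dT/dt = P - h*A*(T - T_amb(t)) with explicit Euler.

    area_m2 (both PCB faces) and heat_cap (J/K) are assumed values.
    Returns a list of (time in minutes, temperature in deg C).
    """
    # Start from the steady state for the initial ambient temperature.
    temp = ambient(0.0) + power_w / (h * area_m2)
    history = []
    steps = int(t_end_min * 60.0 / dt_s)
    for n in range(steps + 1):
        t_min = n * dt_s / 60.0
        history.append((t_min, temp))
        dT_dt = (power_w - h * area_m2 * (temp - ambient(t_min))) / heat_cap
        temp += dT_dt * dt_s
    return history


history = simulate()
t_peak, temp_peak = max(history, key=lambda point: point[1])
print(f"peak of {temp_peak:.1f} deg C at {t_peak:.1f} min")
```

With these assumed parameters the modelled temperature follows the ambient profile with a short lag and a roughly constant offset set by the dissipated power, which is the qualitative behavior reported for the selected nodes.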

5. Case study

To confirm the simulation results and the occurrence of the IF phenomenon, an experimental test system is constructed. The temperature-humidity chamber CH2000VT-15, imported from Italy, is used to provide the ambient temperature shown in Figure 4. The MCM is supplied with 5 V DC from a direct-current power supply, the YHFT_DSP chip works at 240 MHz, and free convection heat transfer is adopted. To validate the occurrence of IFs in the YHFT_DSP chip, the light-emitting diodes LED1 and LED2 are used to indicate the power supply condition and to monitor the working state, respectively. LED1 and LED2 flickering at fixed slow and fast frequencies, respectively, indicates normal operation. The temperature test is shown in Figure 7.

Figure 7: Temperature test of MCM

LED1 and LED2 operate normally at the beginning. After about 29 minutes of testing, LED2 flickers all the time, which shows that the YHFT_DSP chip has failed. Examining the temperature sensors shows that the surface temperature of the YHFT_DSP chip exceeds the maximum permissible temperature. Several minutes later, as the temperature falls, the failure recovers automatically. Repeated experiments are carried out with similar results. All chips test normal after the experiments. We can thus confirm that the fault caused by temperature fluctuation is an IF.

6. Conclusions

IFs can be activated or deactivated by $f_{ir}$ and $r_i$. The transient thermal analysis shows that the temperature of the MCM fluctuates due to variations in the power input and the ambient temperature. Since temperature can be treated as a fault event, a temperature fluctuation may be any of $f_{IP}$, $f_{ir}$ and $r_i$. Therefore, temperature fluctuation is a major cause of IFs. In future work, we plan to discriminate IFs from PFs by evaluating the fault events associated with the ambient temperature, combined with the regulations of fault evolution.

References

Rashid L., Pattabiraman K., Gopalakrishnan S., 2010, Towards understanding the effects of intermittent hardware faults on programs, International conference on dependable systems and network workshops, Chicago, IL, USA, 101-106.

