Behrooz Parhami's ECE 257A Course Page for Fall 2019

Fault-Tolerant Computing

Page last updated on 2019 December 09

Enrollment code: 13946
Prerequisite: ECE 154 (computer architecture), or equivalent
Class meetings: MW 10:00-11:30, North Hall 1111
Instructor: Professor Behrooz Parhami
Open office hours: M 12:00-2:00, W 1:00-2:00; HFH 5155
Course announcements: Listed in reverse chronological order
Course calendar: Lecture, homework, and exam schedules
Homework assignments: Four assignments, worth a total of 30%
Exams: Midterm, worth 30%; Final, worth 40% (both open-book)
Research paper: Not required for fall 2019
Research paper guidelines: Brief guide to format and contents
Poster presentation tips: Brief guide to format and structure
Policy on academic integrity: Please read very carefully
Grades: Statistics for homework and exam grades
References: Textbook and other sources (Textbook's web page)
Lecture slides: Via the textbook's Web page
Miscellaneous information: Motivation, catalog entry, history

Course Announcements

2019/12/09: The fall 2019 offering of ECE 257A is officially over and grades have been reported to the Registrar. Happy holidays and hope to see some of you in my ECE 254B class next quarter!
2019/12/05: Corrections to HW4 solutions: In part d of the solution to Problem 27.3, remove "5-out-of-8," leaving "an 8-input threshold voting unit." For Problem 23.1, the problem's statement rather than solution was provided in the handout. I will e-mail the solution to you.
2019/12/03: During our last class on W 12/04, we will discuss two ongoing research projects of mine. One has resulted in the just-accepted paper "Reliability Inversion: A Cautionary Tale" (IEEE Computer, to appear, PDF) and the other deals with synthesizing large majority circuits in a recursive manner.
2019/11/25: As announced in class today, I have cancelled the class meeting on Wednesday 11/27. We will cover Chapters 26 and 28 of the textbook on Monday 12/02 instead. I will still hold my office hour on W 11/27, 1:00-2:00 PM.
2019/11/22: Homework 4 has been posted to the homework area below. As we enter the discussion of failures and failure confinement in Part VII of the textbook, you may find my review of Henry Petroski's The Evolution of Useful Things: How Everyday Artifacts—From Forks and Pins to Paper Clips and Zippers—Came to Be as They Are on GoodReads or Facebook of some interest. Among many interesting topics, Petroski talks about "innovation by failure," whereby inventors act as critics who find faults with existing gadgets/processes and are also equipped to do something about it.
2019/11/18: Slide #125 in Part I has several links at the beginning that are broken. I failed to check and update these during the last revision. Here are updated versions of the links.
[PTC Windchill's fault-tree analysis tool and Markov analysis tool]
[UV Galileo fault-tree analysis tool: The link is to the user's manual. I couldn't find a link to the tool itself, which may have been discontinued.]
[Iowa State University's HIMAP tool]
In general, whenever you encounter a broken link, a Google search with a description of the information sought is likely to lead you to the new location for the information. This is why it is important to include identifying information with each posted link.
2019/11/09: Homework 3 has been posted to the homework area below. Here's a report on yesterday's lecture by Moshe Y. Vardi entitled "An Ethical Crisis in Computing?"
2019/10/26: Recommended lecture: Ethics and ethical behavior are big components of ensuring dependability in computing systems, as we will discuss in our class session on November 25. Accordingly, I highly recommend that you attend the CS Distinguished Lecture by Dr. Moshe Y. Vardi, entitled "An Ethical Crisis in Computing?" (Friday, November 8, Life Sciences 1001, 3:30 PM). Vardi will argue that technology "brings with it not only societal benefits, but also significant costs, such as labor polarization, disinformation, and smart-phone addiction. ... The real issue is how to deal with technology's impact on society. Technology is driving the future, but who is doing the steering?"
2019/10/24: Please correct the last number in the solution to Example 8.1 from 44% to 56%.
2019/10/18: Homework 2 has been posted to the homework area below. The study of bridges and their reliability/safety provides important lessons to designers of safety-critical systems. If you are interested in digging deeper, watch the episode of Nova (PBS science program) entitled "Why Bridges Collapse" (53-minute video).
2019/10/16: Latest development related to Problem 1.25 of HW1: The US Federal Aviation Administration and Boeing were found at fault over their problematic certification of the 737 Max.
2019/10/14: Notes on radiation hardening: Because of questions arising in class today, I did some digging about radiation-hardened electronic devices and how they are rated. First, I mis-spoke about the cumulative radiation effect during a one-way trip to Mars: It is 1000 kilo-rad (1 mega-rad), which is close to the limit of what can be tolerated now. Wikipedia has an excellent article about radiation hardening. The cumulative effect of radiation results from lattice displacement, a form of structural damage, in semiconductors. There is also possible damage from intense, short-term radiation, such as what we have in the vicinity of a nuclear explosion. This can be likened to the study of power dissipation in electronics, which entails paying attention to both peak power and average power (indicative of the energy used). Regarding rating of radiation hardness, some useful info is presented in the Wikipedia article under "Radiation-hardening techniques: Physical." The Wikipedia article also has an extensive list of rad-hard computers that are available. One example is RAD6000, which boasts radiation hardness of at least 1 mega-rad total dose, immunity to latch-ups, and a tiny probability (< 10^–9) of errors due to SEUs.
2019/10/05: Homework 1 has been posted to the homework area below.
2019/09/29: Welcome to the ECE 257A web page for fall 2019. Our classroom has 42 seats, with 11 students enrolled at this time. I plan to update the lecture slides and textbook chapters over the course of the fall quarter, with each revised chapter becoming available shortly before discussion in class. Updated versions of the first part of the book and lecture slides will be posted later today.
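
The 2019/12/05 announcement above mentions "an 8-input threshold voting unit." As a rough illustration of what such a unit computes, here is a minimal Python sketch of a generic threshold voter on binary inputs. The function names and the threshold values used in the usage note are hypothetical choices made for illustration only; they are not taken from the textbook or from the solution to Problem 27.3.

```python
from typing import Sequence


def threshold_vote(inputs: Sequence[int], threshold: int) -> int:
    """Return 1 if at least `threshold` of the binary inputs are 1, else 0.

    Illustrative helper (hypothetical name), not the textbook's design.
    """
    if not 1 <= threshold <= len(inputs):
        raise ValueError("threshold must be between 1 and the number of inputs")
    if any(bit not in (0, 1) for bit in inputs):
        raise ValueError("every input must be 0 or 1")
    return 1 if sum(inputs) >= threshold else 0


def majority_vote(inputs: Sequence[int]) -> int:
    """Strict majority voting: more than half of the inputs must be 1."""
    return threshold_vote(inputs, len(inputs) // 2 + 1)
```

For example, with an arbitrarily chosen threshold of 3, `threshold_vote([1, 0, 1, 0, 1, 0, 0, 0], 3)` yields 1, whereas the same eight inputs with a threshold of 4 yield 0. Majority voting is the special case in which the threshold is just over half the number of inputs.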

Course Calendar

Course lectures, homework assignments, and exams have been scheduled as follows. This schedule will be strictly observed. In particular, no extension is possible for homework due dates. Please begin work on your assignments early. Each lecture corresponds to topics in 1-2 chapters of the instructor's forthcoming textbook on dependable computing. Chapter numbers are provided in parentheses, after day & date.

Day & Date (book chapters) Lecture topic [Homework posted/due] {Special notes}
M 09/30 (0-1) Background and motivation
W 10/02 (1-2) Dependability attributes

M 10/07 (3) Combinational modeling [HW1 posted, chs. 1-4]
W 10/09 (4) State-space modeling

M 10/14 (5, 7) Defect avoidance; Shielding and hardening
W 10/16 (6, 8) Defect circumvention; Yield enhancement [HW1 due]

M 10/21 (9, 11) Fault testing; Design for testability [HW2 posted, chs. 5-12]
W 10/23 (10, 12) Fault masking; Replication with voting

M 10/28 (13, 15) Error detection; Self-checking modules
W 10/30 (14, 16) Error correction; Redundant disk arrays [HW2 due]

M 11/04 (1-12) Midterm exam, open-book/notes, 10:00-11:45 AM
W 11/06 (17, 19) Malfunction diagnosis; Standby redundancy

M 11/11 No lecture (Veterans' Day) [HW3 posted, chs. 13-20]
W 11/13 (18, 20) Malfunction tolerance; Robust parallel processing

M 11/18 (21, 23) Degradation allowance; Resilient algorithms
W 11/20 (22, 24) Degradation management; Software redundancy [HW3 due]

M 11/25 (25, 27) Failure confinement; Agreement and adjudication [HW4 posted, chs. 21-28]
W 11/27 No lecture (class cancelled)

M 12/02 (26, 28) Failure recovery; Fail-safe systems {Instructor and course evaluations}
W 12/04 (Special research presentation) Reliability inversion; Design of voting circuits [HW4 due]

M 12/09 (13-28) Final exam {Will be held in our regular classroom from 8:30 to 11:00 AM}
W 12/18 {Course grades due by midnight}

Homework Assignments

-Turn in your solutions as a PDF file attached to an e-mail sent by the due date/time.
-Because solutions will be handed out on the due date, no extension can be granted.
-Include your name, course name, and assignment number at the top of the first page.
-If homework is handwritten and scanned, make sure that the PDF is clean and legible.
-Although some cooperation is permitted, direct copying will have severe consequences.

Homework 1: Dependability and its modeling (chs. 1-4, due W 2019/10/16, 10:00 AM)
Do the following problems from the textbook or defined below: 1.4, 1.25 (see below), 2.14, 3.19, 4.4, 4.8
[Please use the following updated version of Problem 1.25's statement.]
1.25 The troubles of Boeing 737 Max 8: Following two crashes of Boeing 737 Max 8 passenger jets in late 2018 and early 2019, killing hundreds, airlines and various aviation authorities around the world grounded the planes until crash causes could be determined and attendant design flaws corrected. When Boeing 737 Max 8 was tested following its introduction, certain stability problems were detected, but rather than redesigning the plane, Boeing chose to augment it with a software system to compensate for the problems in flight. Both crashes were attributed to flaws in the aforementioned software and lack of certain monitoring instruments that could have warned the pilots of emerging challenges. As of late 2019, the planes remained grounded, because Boeing's purported fixes had not satisfied aviation authorities. Using on-line sources, study the causes of the two crashes, reasons for grounding the planes, and results of investigations conducted after the grounding, presenting your results in a 2-page report.

Homework 2: Defects and faults (chs. 5-12, due W 2019/10/30, 10:00 AM)
Do the following problems from the textbook: 5.1, 7.3, 8.4, 9.3, 11.2ab, 12.5

Homework 3: Errors and malfunctions (chs. 13-20, due W 2019/11/20, 10:00 AM)
Do the following problems from the textbook: 13.4, 14.9, 16.3, 17.8, 19.1, 20.5

Homework 4: Degradations and failures (chs. 21-28, due W 2019/12/04, 10:00 AM)
Do the following problems from the textbook: 21.2, 23.1, 24.8, 25.2, 26.1, 27.3

Sample Exams and Study Guide

The following sample exam problems are meant to indicate the types and levels of problems, rather than the coverage (which is outlined in the course calendar).
Students are responsible for all sections and topics in the textbook and class handouts that are not explicitly excluded in the study guide that follows each sample exam, even if the material was not covered in class lectures.

Sample Midterm Exam (105 minutes)
Problems 3.12, 4.4, 9.4, and 12.1 from the textbook.

Midterm Exam Study Guide
Study Chapters 1-12 and review the problems in homework assignments 1-2. The following textbook sections are excluded: 6.6, 7.6, 8.6, 9.4, 9.6, 11.6

Sample Final Exam (120 minutes)
Problems 15.5, 17.1, 21.2, and 27.3 from the textbook.

Final Exam Study Guide
Study Chapters 13-28 and review the problems in homework assignments 3-4. The following textbook sections are excluded: 13.6, 14.6

Research Paper and Presentation (not applicable to fall 2019)

Each student will review a subfield of dependable computing or do original research on a selected and approved topic. A preliminary list of research topics is provided below (new topics, and new references for the current topics, may be added later). However, students should feel free to propose their own topics for approval. To propose a topic, send via e-mail a one-page narrative, including 2-3 key references, to the instructor.

A publishable report earns an "A" for the course, regardless of homework and midterm grades. See the course calendar for schedule and due dates and Research Paper Guidelines for formatting tips.

This year's suggested research topics for ECE 257A are built around the theme "Robustness of Interconnection Networks." You can get started on each topic by taking a look at the following two common references, plus one topic-specific reference that is provided further down on this page. The two common references are:

[Parh10] Parhami, B., "Robustness Attributes of Interconnection Networks for Parallel Processing," Keynote Lecture at the First Int'l Supercomputing Conf., Guadalajara, Mexico, March 2010. {PPT and PDF slides are available from B. Parhami's Publications Web page; see publication [262].}

[Sall12] Salles, R. M. and D. A. Marion Jr., "Strategies and Metric for Resilience in Computer Networks," Computer J., Vol. 55, No. 6, pp. 728-739, June 2012.

1. Effects of Missing Nodes on Network Diameter and Average Distance (Assigned to: Adrian Fiorito)
[Kris87] Krishnamoorthy, M.S. and B. Krishnamurthy, "Fault Diameter of Interconnection Networks," Computers & Mathematics with Applications, Vol. 13, Nos. 5/6, pp. 577-582, 1987.

2. Effects of Missing Links on Network Diameter and Average Distance (Assigned to: TBD)
[Kris87] Krishnamoorthy, M.S. and B. Krishnamurthy, "Fault Diameter of Interconnection Networks," Computers & Mathematics with Applications, Vol. 13, Nos. 5/6, pp. 577-582, 1987.

3. Synthesis of Interconnection Networks with Maximal Fault Tolerance (Assigned to: TBD)
[Chen09] W. Chen, W. J. Xiao, and B. Parhami, "Swapped (OTIS) Networks Built of Connected Basis Networks are Maximally Fault Tolerant," IEEE Trans. Parallel and Distributed Systems, Vol. 20, pp. 361-366, March 2009.

4. Adaptive Schemes for Point-to-Point Communication in Networks (Assigned to: Xuan Wang)
[Ngai91] Ngai, J. Y. and C. L. Seitz, "A Framework for Adaptive Routing in Multicomputer Networks," Computer Architecture News, Vol. 19, No. 1, pp. 6-14, March 1991.

5. Adaptive Schemes for Collective Communication in Networks (Assigned to: Prashansa Mukim)
[Pand95] Panda, D. K., "Issues in Designing Efficient and Practical Algorithms for Collective Communication on Wormhole-Routed Systems," Proc. Int'l Conf. Parallel Processing Workshop on Challenges for Parallel Processing, 1995, pp. 8-15.

6. Deadlocks in Adaptive Routing and How to Avoid or Detect Them (Assigned to: Fengqiao Sang)
[Dall93] Dally, W. J. and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," IEEE Trans. Parallel and Distributed Systems, Vol. 4, No. 4, pp. 466-475, April 1993.

7. Diagnosability of Regular Degree-d Interconnection Networks (Assigned to: Sixin Tao)
[Chan05] Chang, G.-Y., G. J. Chang, and G.-H. Chen, "Diagnosabilities of Regular Networks," IEEE Trans. Parallel and Distributed Systems, Vol. 16, No. 4, pp. 314-323, April 2005.

8. Diagnosability of Hierarchical or Multilevel Interconnection Networks (Assigned to: Nan Wu)
[Xu09] Xu, M., K. Thulasiraman, and X.-D. Hu, "Conditional Diagnosability of Matching Composition Networks Under the PMC Model," IEEE Trans. Circuits and Systems II, Vol. 56, No. 11, pp. 875-879, November 2009.

9. Synthesis of Interconnection Networks with Maximal Diagnosability (Assigned to: Yiming Gan)
[Chan05] Chang, G.-Y., G. J. Chang, and G.-H. Chen, "Diagnosabilities of Regular Networks," IEEE Trans. Parallel and Distributed Systems, Vol. 16, No. 4, pp. 314-323, April 2005.

Topics outside the main theme for the quarter

a. Reasoning Under Uncertainty, with Applications to Dependable Computing (Assigned to: TBD)
[IJAR16] Int'l J. Approximate Reasoning, Vol. 71, pp. 1-62, December 2016 (Five review articles on 40 years of Dempster-Shafer Theory)

b. Probabilistic Analysis of Program Correctness Under Soft Errors (Assigned to: TBD)
[Carb16] Carbin, M., S. Misailovic, and M. C. Rinard, "Verifying Quantitative Reliability for Programs that Execute on Unreliable Hardware," Communications of the ACM, Vol. 59, No. 8, pp. 83-91, August 2016.

c. Effects of Temporal Resistance-State Variation on ReRAM Reliability (Proposed by: Abanti Basak)
[Ref 1] "Modeling Framework for Cross-Point Resistive Memory Design Emphasizing Reliability and Variability Issues"

d. Computation-Oriented Fault Tolerance Schemes for RRAM-Based Systems (Proposed by: Wenqin Huangfu)
[Chen15] Chen, C.-Y., et al., "RRAM Defect Modeling and Failure Analysis Based on March Test and a Novel Squeeze-Search Scheme," IEEE Trans. Computers, Vol. 64, No. 1, pp. 180-190, January 2015.

Poster Presentation Tips

Here are some guidelines for preparing your research poster. The idea of the poster is to present your research results and conclusions thus far, get oral feedback during the session from the instructor and your peers, and provide the instructor with something to comment on before your final report is due. Please send a PDF copy of the poster via e-mail by midnight on the poster presentation day.

Posters prepared for conferences must be colorful and eye-catching, as they are typically competing with dozens of other posters for the attendees' attention. Here is an example of a conference poster. Such posters are often mounted on a colored cardboard base, even if the pages themselves are standard PowerPoint slides. In our case, you should aim for a "plain" poster (loose sheets, to be taped to the wall in our classroom) that conveys your message in a simple and direct way. Eight to ten pages, each resembling a PowerPoint slide, would be an appropriate goal. You can organize the pages into a 2 x 4 (2 columns, 4 rows), 2 x 5, or 3 x 3 array on the wall. The top two of these might contain the project title, your name, course name and number, and a very short (50-word) abstract. The final two can perhaps contain your conclusions and directions for further work (including work that does not appear in the poster, but will be included in your research report). The rest will contain brief descriptions of ideas, with emphasis on diagrams, graphs, tables, and the like, rather than text, which is very difficult for a visitor to absorb in a very limited time span.

Grade Statistics

All grades listed are in percent, unless otherwise noted.
HW1 grades (letter): Range = [A–, A+], Mean = 3.91, Median = A
HW2 grades (letter): Range = [B+, A+], Mean = 3.70, Median = A–
HW3 grades (letter): Range = [B+, A+], Mean = 3.86, Median = A
HW4 grades (letter): Range = [B+, A+], Mean = 3.81, Median = A–
Overall homework grades (out of 16): Range = [13.6, 17.2], Mean = 15.3, Median = 15.0
Midterm exam grades: Range = [63, 88], Mean = 75, Median = 77
Final exam grades: Range = [64, 89], Mean = 76, Median = 77
Course grades (letter): Range = [B, A], Mean = 3.47, Median = B+
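
The weights stated near the top of this page (homework 30%, midterm 30%, final 40%) suggest a simple weighted sum. The Python sketch below shows that arithmetic only. It assumes, without confirmation from this page, that the homework total on its 16-point scale is converted linearly to a percentage before weighting; the function and constant names are hypothetical, and the mapping from a composite score to a letter grade is not stated here and is not modeled.

```python
# Weights as listed on this page: homework 30%, midterm 30%, final 40%.
HOMEWORK_WEIGHT = 0.30
MIDTERM_WEIGHT = 0.30
FINAL_WEIGHT = 0.40

# Assumption: the overall homework score (reported "out of 16") is
# scaled linearly to a percentage before the weights are applied.
HOMEWORK_MAX_POINTS = 16.0


def composite_percent(homework_points: float,
                      midterm_percent: float,
                      final_percent: float) -> float:
    """Weighted composite score in percent (illustrative sketch only)."""
    homework_percent = 100.0 * homework_points / HOMEWORK_MAX_POINTS
    return (HOMEWORK_WEIGHT * homework_percent
            + MIDTERM_WEIGHT * midterm_percent
            + FINAL_WEIGHT * final_percent)
```

Under these assumptions, plugging in the class means listed above (homework 15.3, midterm 75, final 76) gives `composite_percent(15.3, 75, 76)`, which evaluates to about 81.6.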

References

Required text: B. Parhami, Dependable Computing: A Multilevel Approach; chapters will be posted as they are updated. Please visit the textbook's web page for general information. Lecture slides are also available there.
Some useful books (not required):
Koren/Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007 (ISBN 0-12-088525-5)
Shooman, Reliability of Computer Systems and Networks, Wiley, 2002 (ISBN 0-471-29342-3)
Siewiorek/Swarz, Reliable Computer Systems, Digital Press, 1992 (ISBN 1-55558-075-0)
Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison Wesley, 1989 (ISBN 0-201-07570-9)

Research resources:
Proc. IEEE/IFIP Int'l Conf. Dependable Systems and Networks (DSN), formerly known as Fault-Tolerant Computing Symp. (FTCS), annual, since 1971.
IEEE Trans. Dependable and Secure Computing, published since 2004
IEEE Trans. Reliability, published since 1955
IEEE Trans. Computers, published since 1952
UCSB library's electronic journals, collections, and other resources

Miscellaneous Information

Motivation: Dependability concerns are integral parts of engineering design. Ideally, we would like our computer systems to be perfect, always yielding timely and correct results. However, just as bridges collapse and airplanes crash occasionally, so too computer hardware and software cannot be made totally immune to unpredictable behavior. Despite great strides in component reliability and programming methodology, the exponentially increasing complexity of integrated circuits and software systems makes the design of perfect computer systems nearly impossible. In this course, we study the causes of computer system failures (impairments to dependability), techniques for ensuring correct and timely computations despite such impairments, and tools for evaluating the quality of proposed or implemented solutions.

Catalog entry: 257A. Fault-Tolerant Computing. (4) PARHAMI. Prerequisites: ECE 154. Lecture, 3 hours. Basic concepts of dependable computing. Reliability of nonredundant and redundant systems. Dealing with circuit-level defects. Logic-level fault testing and tolerance. Error detection and correction. Diagnosis and reconfiguration for system-level malfunctions. Degradation management. Failure modeling and risk assessment.

History: Professor Parhami took over the teaching of ECE 257A in the fall quarter of 1998. Previously, the course had been taught primarily by Dr. John Kelly, who instituted the two-course sequence ECE 257A/B, the first covering general topics and the second (now discontinued) devoted to his research focus on software fault tolerance. Borrowing from his experience in teaching dependable computing at other universities and based on an extensive survey of the field that he published in 1994, Professor Parhami oriented the course toward an original multilevel view of impairments to computer system dependability and techniques for avoiding or tolerating them. The levels of this model, in increasing order of abstraction, are: defects, faults, errors, malfunctions, degradations, and failures. A textbook based on this multilevel model of dependable computing is in preparation.
Offering of ECE 257A in fall 2018
Offering of ECE 257A in fall 2016 (PDF file)
Offering of ECE 257A in fall 2015 (PDF file)
Offering of ECE 257A in winter 2015 (PDF file)
Offering of ECE 257A in fall 2013 (PDF file)
Offering of ECE 257A in fall 2012 (PDF file)
Offering of ECE 257A in fall 2009 (PDF file)
Offering of ECE 257A in fall 2007 (PDF file)
Offerings of ECE 257A in 1998 and 2006 (PDF file)