Behrooz Parhami's ECE 257A Course Page for Fall 2021

Fault-Tolerant Computing

Page last updated on 2021 December 14

Enrollment code: 12948
Prerequisite: ECE 154 (computer architecture), or equivalent
Class meetings: MW 10-11, Phelps 1437 (inverted classroom)
Instructor: Professor Behrooz Parhami
Open office hours: MW 11-11:30, Phelps 1437; W 1-2, HFH 5155
Course announcements: Listed in reverse chronological order
Course calendar: Lecture, homework, and exam schedules
Homework assignments: Four assignments, worth a total of 40%
Exams: None for fall 2021
Research paper: Report 50%; Poster 10%
Research paper guidelines: Brief guide to format and contents
Poster presentation tips: Brief guide to format and structure
Policy on academic integrity: Please read very carefully
Grades: Statistics for homework and other grades
References: Textbook and other sources (Textbook's web page)
Lecture slides: Via the textbook's Web page
Miscellaneous information: Motivation, catalog entry, history

Course Announcements

2021/12/14: The fall 2021 offering of ECE 257A is officially over. I have reported the grades to the Registrar and also e-mailed them to you, along with feedback on your submitted research paper & poster. During winter 2022, I will be seeing several of you in my ECE 254B course on parallel processing, which is structured similarly to this course, in that it will have flipped classes and the same 40/60% weights for homework/research. Happy holidays and best wishes for continued success in your academic pursuits!
2021/11/20: HW4 will be due by 10:00 AM on M 11/29. The next research milestone is submission of a final list of references and a provisional abstract of your work by M 11/22 (any time). Your poster in PDF format will be due on W 12/01 (any time). Please remember the 12/01 technical talk mentioned in my previous announcement. Next week, 11/29-12/03, we won't have regular classes. Instead, we will take M 11/29 off for advancing your research and have a poster-presentation session on W 12/01.
2021/11/13: HW3 will be due by 10:00 AM on M 11/15. I will hand out a solution sheet in class. HW4, the last one for the course, will be posted no later than W 11/17. I recommend that you attend the 12/01, 9:00 AM PST, talk by Dr. Daniel Jackson (MIT), entitled "The Essence of Software (or Why Systems Often Fail by Design, and How to Fix Them)." [Free registration]
2021/11/06: HW3 has been posted to the homework area below. Each of you has an assigned topic, plus a reasonable set of references for your research. The next milestone is submission of a final list of references and a provisional abstract of your work by M 11/22 (any time). I copied you on an e-mail sent to our 11/03 guest speaker, Dr. Jessica Santana. Please let me know if you have any comments on the material therein. Have a great week!
2021/10/30: There are no new lectures for the 6th week of classes, November 1-5, 2021, allowing you to focus on advancing your research project, after submitting your preliminary list of references on Monday 11/01 (any time). HW3 will be posted by W 11/03. On W 11/03, Dr. Jessica Santana (Assistant Professor, UCSB Technology Management Program) will give a guest lecture to our class on engineering ethics, during our normal 10:00-11:00 AM time slot. Please make every effort to be present for her interesting presentation. Have a great week!
2021/10/23: HW2 will be due by 10:00 AM on W 10/27. The next research milestone is submission of your preliminary list of references by M 11/01 (any time). I will provide feedback on your list in 1-2 days after submission. Once you have collected a reasonable set of references, you can concentrate on research during week 6, when we will have no lecture on Monday and a special in-person lecture (outside the main course content) on engineering ethics on Wednesday. Have a great week!
2021/10/16: HW2 has been posted to the homework area below. Each student has been assigned a research topic. The next milestone is your submission of a preliminary list of references by M 11/01. This is just a checkpoint, with no judgement or assessment. The idea is for you to have a reasonable set of sources, which may be pruned or augmented after 11/01, and for me to know if you are on the right track or need help in finding appropriate sources. Two upcoming free technical events may be of interest to you:
- Dr. Tevfik Bultan's 10/20, 6:30 PM, Zoom talk, "Computing, Logic, and Security" [Details & registration]
- International (Virtual) Conference on Computer Design, 10/24-27. [Registration]
In this course and in our daily social interactions, we talk a lot about the ill effects and dangers of technology. For balance, we should note that there are positive sides to high-tech, as discussed in Clive Thompson's Smarter than You Think: How Technology Is Changing Our Minds for the Better. [My detailed book review]
2021/10/10: Here is some information about a couple of items that I mentioned in class. Congressional testimony by Facebook whistleblower Frances Haugen is a watershed moment in holding social media accountable for the harm done by spreading disinformation and misinformation. Haugen also had a "60 Minutes" interview, during which her identity was revealed (she was anonymous prior to that). The Facebook case is an important development, because harm from computers (and advanced technology, more generally) comes not only from system defects/faults/errors/malfunctions, which we as engineers try to prevent or remedy, but also from greed, as well as intentional or unintentional abuse. A new book by Jessica M. Smith, Extracting Accountability: Engineers and Corporate Social Responsibility (MIT Press, 2021, open-access link), is an interesting and useful read in this regard.
2021/10/07: The list of research topics has been updated for fall 2021. Please send me your top-3 choices of topics no later than Monday, October 11, 2021. This is a soft deadline. I will start assigning topics to students on Tuesday 10/12. If you send me your choices after 10/11, you will be less likely to get your top choices. In fact, it would be best if you check the course Web page to avoid topics that have already been assigned to other students. This information will be provided within parentheses after the topic.
2021/10/01: Hope you are settled into the fall quarter and have been able to re-adjust to in-person instruction, as we begin the second week of classes. HW1 has been posted to the homework area below, a few days ahead of schedule. Please listen to Lectures 3 and 4, covering Chapters 3-4, and bring your questions to class on 10/04 and 10/06. I will cancel my W 1:00-2:00 PM office hour this week, due to a conflict with an important meeting.
The historical talk, entitled "Eight Key Ideas in Computer Architecture from Eight Decades of Innovation," which I presented as part of IEEE Computer Society's Distinguished Visitors Program and in connection with the Society's 75th anniversary celebration, has been recorded and I will include a link to the recording here as soon as I receive it.
I attended several of CRML 2021 Summit's talks on Friday 2021/10/01, and prepared a brief report on its opening keynote talk by Dr. Stuart Russell (UC Berkeley), entitled "Human-Compatible AI."
2021/09/20: I look forward to meeting you in the first inverted class meeting on M 9/27. Please watch Lecture 1, "Background and motivation," before class, so that you can ask questions about the material. I will also prepare thought-provoking questions for in-class and subsequent discussion. Please ignore references to Zoom office hours in the lecture video, as we will have in-person meetings MW 10:00-11:00 and 30 minutes of general open office hour after that (11:00-11:30), as well as W 1:00-2:00 in my office. Stay safe! See you soon!
2021/09/15: Please note the following important campus-wide requirements for in-person instruction during fall 2021: Everyone on campus must be vaccinated and, until further notice, wearing a mask is mandatory for all indoor activities. Here is a Web page where you can find information and resources dealing with campus and UC policies on the pandemic.
2021/05/27: Welcome to the ECE 257A web page for fall 2021. The course will be research-based, with 60% of your grade determined by your research report and poster presentation and 40% based on homework. I plan to update the lecture slides and textbook chapters over the summer months and through the fall quarter, with each revised chapter becoming available before discussion in class.
I will use an inverted classroom model. Video of each lecture will be made available and must be watched before the scheduled date on the course calendar. The first hour of our in-person class meeting will be devoted to discussion and Q&A on the topic, with the following half-hour serving as an open office hour held in the same classroom. Students will be free to arrive or leave after the one-hour discussion session.

Course Calendar

Course lectures, homework assignments, and research paper deadlines have been scheduled as follows. This schedule will be strictly observed. In particular, no extension is possible for homework due dates. Please begin work on your assignments early. Each lecture corresponds to topics in 1-2 chapters of the instructor's forthcoming textbook on dependable computing. Chapter numbers are provided in parentheses, after day & date.

Day & Date (book chapters) Lecture topic [Homework posted/due] {Special notes}
M 09/27 (1) Background and motivation {Lecture 1}
W 09/29 (2) Dependability attributes {Lecture 2}

M 10/04 (3) Combinational modeling [HW1 posted, chs. 1-4] {Lecture 3}
W 10/06 (4) State-space modeling {Lecture 4}

M 10/11 (5, 7) Defect avoidance; Shielding and hardening {Research topics specified} {Lecture 5}
W 10/13 (6, 8) Defect circumvention; Yield enhancement [HW1 due] {Lecture 6}

M 10/18 (9, 11) Fault testing; Design for testability [HW2 posted, chs. 5-12] {Lecture 7}
W 10/20 (10, 12) Fault masking; Replication with voting {Research assignments finalized} {Lecture 8}

M 10/25 (13, 15) Error detection; Self-checking modules {Lecture 9}
W 10/27 (14, 16) Error correction; Redundant disk arrays [HW2 due] {Lecture 10}

M 11/01 Research-focus week: Getting started {Preliminary reference list due}
W 11/03 Special in-person lecture on Engineering Ethics [HW3 posted, chs. 13-20]

M 11/08 (17, 19) Malfunction diagnosis; Standby redundancy {Lecture 11}
W 11/10 (18, 20) Malfunction tolerance; Robust parallel processing {Lecture 12}

M 11/15 (21, 23) Degradation allowance; Resilient algorithms [HW3 due] {Lecture 13}
W 11/17 (22, 24) Degradation management; Software redundancy [HW4 posted, chs. 21-28] {Lecture 14}

M 11/22 (25, 27) Failure confinement; Agreement and adjudication {Ref's & abst. due} {Lecture 15}
W 11/24 (26, 28) Failure recovery; Fail-safe systems {Lecture 16}

M 11/29 Research-focus week: Finishing up [HW4 due]
W 12/01 Poster presentations {PDF of poster due} {Instructor and course evaluations}

W 12/08 {Full research paper PDF file due by midnight}
W 12/15 {Course grades due by midnight}

Homework Assignments

- Turn in your solutions as a PDF file attached to an e-mail sent by the due date/time.
- Because solutions will be handed out on the due date, no extension can be granted.
- Include your name, course name, and assignment number at the top of the first page.
- If homework is handwritten and scanned, make sure that the PDF is clean and legible.
- Although some cooperation is permitted, direct copying will have severe consequences.

Homework 1: Dependability and its modeling (chs. 1-4, due W 2021/10/13, 10:00 AM)
Do the following problems from the textbook: 1.7, 2.9, 2.27 (defined below), 3.22, 4.1, 4.23
2.27   Trustworthiness of AI systems
Read the paper [Wing21] and, based on what you learn from it, add at least two terms to the diagram in Slide 56 of our textbook's Part 1 (which lists the -ilities, plus safety, robustness, and so on). Explain your choice of the terms and why they are important. Why do you think the application of formal systems to AI faces additional challenges?
[Wing21] Wing, J. M., "Trustworthy AI," Communications of the ACM, Vol. 64, No. 10, pp. 64-71, October 2021.

Homework 2: Defects and faults (chs. 5-12, due W 2021/10/27, 10:00 AM)
Do the following problems from the textbook: 5.4, 7.2, 8.3, 10.2, 11.1, 12.2

Homework 3: Errors and malfunctions (chs. 13-20, due M 2021/11/15, 10:00 AM)
Do the following problems from the textbook: 13.8, 14.7, 16.5, 17.4, 18.3ab, 19.4 (defined below)
19.4   Long-life computers for space missions
On October 16, 2021, NASA's billion-dollar Lucy probe began its 12-year, 4-billion-mile quest to make close fly-bys of eight carbon-rich asteroids (located in two clusters ahead of and behind Jupiter along its orbit around the sun) that may hold keys to the origins of life in the solar system. Named after the 3.2-million-year-old bones of a celebrated early human ancestor, the probe will travel along a trajectory that takes it through three velocity-boosting Earth fly-bys. Success of a 12-year unmanned space mission, where repair is impossible, rests upon highly reliable, long-life computing and guidance systems. Write a one-page single-spaced report on the methods used to ensure reliable digital systems on board Lucy.

Homework 4: Degradations and failures (chs. 21-28, due M 2021/11/29, 10:00 AM)
Do the following problems from the textbook: 21.3, 23.3, 24.9, 25.1, 26.1, 27.5

Sample Exams and Study Guide (does not apply to fall 2021)

The following sample exam problems are meant to indicate the types and levels of problems, rather than the coverage (which is outlined in the course calendar).
Students are responsible for all sections and topics in the textbook and class handouts that are not explicitly excluded in the study guide that follows each sample exam, even if the material was not covered in class lectures.

Sample Midterm Exam (105 minutes)
Problems 3.12, 4.4, 9.4, and 12.1 from the textbook.

Midterm Exam Study Guide
Study Chapters 1-12 and review the problems in homework assignments 1-2. The following textbook sections are excluded: 6.6, 7.6, 8.6, 9.4, 9.6, 11.6

Sample Final Exam (120 minutes)
Problems 15.5, 17.1, 21.2, and 27.3 from the textbook.

Final Exam Study Guide
Study Chapters 13-28 and review the problems in homework assignments 3-4. The following textbook sections are excluded: 13.6, 14.6

Research Paper and Presentation

Each student will review a subfield of dependable computing or do original research on a selected and approved topic. A preliminary list of research topics is provided below (new topics, and new references for the current topics, may be added later). However, students should feel free to propose their own topics for approval. To propose a topic, send via e-mail a one-page narrative, including 2-3 key references, to the instructor.

A publishable report earns an "A" for the course, regardless of homework grades. See the course calendar for schedule and due dates and Research Paper Guidelines for formatting tips.

Our research for fall 2021 will focus on diagnosis of malfunctions according to the PMC model, within a network of intelligent nodes. The first 55 minutes of Lecture 11 will give you the needed background.
Topics 1-12 below can be chosen for the current quarter. Topics 1-6 deal with diagnosability of classes of interconnection networks, while Topics 7-12 emphasize the interplay between diagnosability and other network attributes. Topics 13-22 are from a previous offering of the course, with the theme "robustness of interconnection networks." Topics 23-30 cover miscellaneous research domains for possible future use.
You can get started on a topic by taking a look at the two common references [Parh16] and [Chen20], plus 1-2 topic-specific reference(s) provided further down on this page. Once a topic has been assigned to you, the search for additional references should begin. Use keywords from the topic and sample references to identify additional sources; ask for help if you run into problems. You will be submitting a preliminary list of references by Monday 11/01 to show satisfactory progress toward completing your research.
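For orientation before Lecture 11, here is a minimal Python sketch of the basic PMC (Preparata-Metze-Chien) assumption: a fault-free node reports the true status of any node it tests, whereas the verdict of a faulty tester is unreliable. The function names, the 5-node test assignment, and the sample syndrome are illustrative assumptions, not taken from the textbook or the references, and the brute-force search is only practical for very small networks.

```python
from itertools import combinations

def consistent(tests, syndrome, faulty):
    """True if the fault set `faulty` could have produced `syndrome` under PMC."""
    for (u, v) in tests:
        if u in faulty:
            continue  # a faulty tester's verdict is unreliable, so it constrains nothing
        expected = 1 if v in faulty else 0  # a fault-free tester reports the truth
        if syndrome[(u, v)] != expected:
            return False
    return True

def diagnose(nodes, tests, syndrome, t):
    """Brute force: every fault set of size <= t that is consistent with the syndrome."""
    candidates = []
    for k in range(t + 1):
        for subset in combinations(sorted(nodes), k):
            if consistent(tests, syndrome, set(subset)):
                candidates.append(set(subset))
    return candidates

# Hypothetical 5-node system: node i tests nodes (i+1) mod 5 and (i+2) mod 5.
nodes = range(5)
tests = [(i, (i + d) % 5) for i in range(5) for d in (1, 2)]

# Sample syndrome (1 = "tested node judged faulty"), generated with nodes 1 and 3
# faulty; the verdicts issued by testers 1 and 3 were chosen arbitrarily.
syndrome = {
    (0, 1): 1, (0, 2): 0,
    (1, 2): 1, (1, 3): 0,
    (2, 3): 1, (2, 4): 0,
    (3, 4): 0, (3, 0): 1,
    (4, 0): 0, (4, 1): 1,
}

print(diagnose(nodes, tests, syndrome, 2))  # -> [{1, 3}]
```

In this small example, exactly one fault set of size at most 2 is consistent with the syndrome, so the faulty nodes can be identified. Whether such a unique answer is guaranteed for every syndrome with up to t faults is the diagnosability question that Topics 1-12 examine for specific network classes.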

[Parh16] B. Parhami, N. Wu, and S. Tao, "Taxonomy and Overview of Distributed Malfunction Diagnosis in Networks of Intelligent Nodes," J. Computer Science and Engineering, Vol. 13, No. 2, pp. 23-31, 2016.

[Chen20] E. Cheng, K. Qiu, and Z. Shen, "Diagnosability of Interconnection Networks: Past, Present and Future," Int'l J. Parallel, Emergent and Distributed Systems, Vol. 35, No. 1, pp. 2-8, 2020.

1. Diagnosability of Biswapped Interconnection Networks (Assigned to: Zhaodong Chen)
[Tsai13] C. H. Tsai and J. C. Chen, "Fault Isolation and Identification in General Biswapped Networks Under the PMC Diagnostic Model," Theoretical Computer Science, Vol. 501, pp. 62-71, 2013.
[Xiao11] W. Xiao, B. Parhami, W. Chen, M. He, and W. Wei, "Biswapped Networks: A Family of Interconnection Architectures with Advantages over Swapped or OTIS Networks," Int'l J. Computer Mathematics, Vol. 88, No. 13, pp. 2669-2684, 2011.

2. Diagnosability of Bus-Based Interconnection Networks (Assigned to: Noah De Los Santos)
[Bist16] F. Bistouni and M. Jahanshahi, "Reliability Analysis of Fault-Tolerant Bus-Based Interconnection Networks," J. Electronic Testing, Vol. 32, No. 5, pp. 541-568, 2016.

3. Diagnosability of Data-Center Interconnection Networks (Assigned to: Guyue Huang)
[Gu19] M. M. Gu, R. X. Hao, and S. Zhou, "Fault Diagnosability of Data Center Networks," Theoretical Computer Science, Vol. 776, pp. 138-147, 2019.

4. Diagnosability of k-ary n-cube Interconnection Networks (Assigned to: Destin Wong)
[Fan20] L. Fan and J. Yuan, "The Diagnosability of k-ary n-cubes with Missing Edges," Int'l J. Parallel, Emergent and Distributed Systems, Vol. 35, No. 1, pp. 57-68, 2020.

5. Diagnosability of Multistage Interconnection Networks (Assigned to: TBD)
[Feng81] T. Y. Feng and C. L. Wu, "Fault-Diagnosis for a Class of Multistage Interconnection Networks," IEEE Trans. Computers, Vol. C-30, No. 10, pp. 743-758, 1981.
[Wang01] S. J. Wang, "Distributed Diagnosis in Multistage Interconnection Networks," J. Parallel and Distributed Computing, Vol. 61, No. 2, pp. 254-264, 2001.

6. Diagnosability of Regularly-Connected Processor Arrays (Assigned to: TBD)
[Chan05] G. Y. Chang, G. J. Chang, and G. H. Chen, "Diagnosabilities of Regular Networks," IEEE Trans. Parallel and Distributed Systems, Vol. 16, No. 4, pp. 314-323, 2005.

7. Is Symmetry Good or Bad for Diagnosability? (Assigned to: Zexi Liu)
[Chen17] E. Cheng, K. Qiu, and Z. Shen, "On Diagnosability of Interconnection Networks," Int'l J. Unconventional Computing, Vol. 13, No. 3, pp. 245-251, 2017.
[Cai10] Z. Cai, W. Xiao, Q. Zhang, and Y. Liu, "Principle of Symmetry for Network Topology with Applications to Some Networks," J. Networks, Vol. 5, No. 9, p. 994, 2010.

8. Restructuring of Interconnection Networks for Better Diagnosability (Assigned to: Kerr Ding)
[Ghaf90] A. Ghafoor, "Partitioning of Even Networks for Improved Diagnosability," IEEE Trans. Reliability, Vol. 39, No. 3, pp. 281-286, 1990.

9. The Relationship Between Network Connectivity and Diagnosability (Assigned to: TBD)
[Lin17] L. Lin, L. Xu, R. Chen, S. Y. Hsieh, and D. Wang, "Relating Extra Connectivity and Extra Conditional Diagnosability in Regular Networks," IEEE Trans. Dependable and Secure Computing, Vol. 16, No. 6, pp. 1086-1097, 2017.

10. The Relationship Between Network Regularity and Diagnosability (Assigned to: TBD)
[Gu20] M. M. Gu, R. X. Hao, and E. Cheng, "Fault Diagnosability of Regular Graphs," Theory and Applications of Graphs, Vol. 7, No. 2, p. 4, 2020.

11. Intermittent-Fault Diagnosability in Interconnection Networks (Assigned to: Jackie Burd)
[Lian17] J. R. Liang, H. Feng, and X. Du, "Intermittent Fault Diagnosability of Interconnection Networks," J. Computer Science and Technology, Vol. 32, No. 6, pp. 1279-1287, 2017.
[Sun20] X. Sun, S. Zhou, M. Lv, J. Liu, and G. Lian, "Intermittent Fault Diagnosability of Some General Regular Networks," Computer J., Vol. 63, No. 1, pp. 16-24, 2020.

12. Effects of Network Composition on Diagnosability (Assigned to: TBD)
[Arak00] T. Araki and Y. Shibata, "Diagnosability of Networks Represented by the Cartesian Product," IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences, Vol. 83, No. 3, pp. 465-470, 2000.
[Hsie08] S. Y. Hsieh and T. Y. Chuang, "The Strong Diagnosability of Regular Networks and Product Networks Under the PMC Model," IEEE Trans. Parallel and Distributed Systems, Vol. 20, No. 3, pp. 367-378, 2008.

The following research topics do not apply to fall 2021 and are listed for information only.

13. Effects of Missing Nodes on Network Diameter and Average Distance
[Kris87] Krishnamoorthy, M.S. and B. Krishnamurthy, "Fault Diameter of Interconnection Networks," Computers & Mathematics with Applications, Vol. 13, Nos. 5/6, pp. 577-582, 1987.

14. Effects of Missing Links on Network Diameter and Average Distance
[Kris87] Krishnamoorthy, M.S. and B. Krishnamurthy, "Fault Diameter of Interconnection Networks," Computers & Mathematics with Applications, Vol. 13, Nos. 5/6, pp. 577-582, 1987.

15. Synthesis of Interconnection Networks with Maximal Fault Tolerance
[Chen09] W. Chen, W. J. Xiao, and B. Parhami, "Swapped (OTIS) Networks Built of Connected Basis Networks are Maximally Fault Tolerant," IEEE Trans. Parallel and Distributed Systems, Vol. 20, pp. 361-366, March 2009.

16. Adaptive Schemes for Point-to-Point Communication in Networks
[Ngai91] Ngai, J. Y. and C. L. Seitz, "A Framework for Adaptive Routing in Multicomputer Networks," Computer Architecture News, Vol. 19, No. 1, pp. 6-14, March 1991.

17. Adaptive Schemes for Collective Communication in Networks
[Pand95] Panda, D. K., "Issues in Designing Efficient and Practical Algorithms for Collective Communication on Wormhole-Routed Systems," Proc. Int'l Conf. Parallel Processing Workshop on Challenges for Parallel Processing, 1995, pp. 8-15.

18. Deadlocks in Adaptive Routing and How to Avoid or Detect Them
[Dall93] Dally, W. J. and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," IEEE Trans. Parallel and Distributed Systems, Vol. 4, No. 4, pp. 466-475, April 1993.

19. Diagnosability of Regular Degree-d Interconnection Networks
[Chan05] Chang, G.-Y., G. J. Chang, and G.-H. Chen, "Diagnosabilities of Regular Networks," IEEE Trans. Parallel and Distributed Systems, Vol. 16, No. 4, pp. 314-323, April 2005.

20. Diagnosability of Hierarchical or Multilevel Interconnection Networks
[Xu09] Xu, M., K. Thulasiraman, and X.-D. Hu, "Conditional Diagnosability of Matching Composition Networks Under the PMC Model," IEEE Trans. Circuits and Systems II, Vol. 56, No. 11, pp. 875-879, November 2009.

21. Synthesis of Interconnection Networks with Maximal Diagnosability
[Chan05] Chang, G.-Y., G. J. Chang, and G.-H. Chen, "Diagnosabilities of Regular Networks," IEEE Trans. Parallel and Distributed Systems, Vol. 16, No. 4, pp. 314-323, April 2005.

22. Network Virtualization as a Tool for Fault Tolerance in Interconnection Networks
[Fisc13] Fischer, A., J. F. Botero, M. T. Beck, H. De Meer, and X. Hesselbach, "Virtual Network Embedding: A Survey," IEEE Communications Surveys & Tutorials, Vol. 15, No. 4, pp. 1888-1906, 2013.

23. Reasoning Under Uncertainty, with Applications to Dependable Computing
[IJAR16] Int'l J. Approximate Reasoning, Vol. 71, pp. 1-62, December 2016 (Five review articles on 40 years of Dempster-Shafer Theory)

24. Probabilistic Analysis of Program Correctness Under Soft Errors
[Carb16] Carbin, M., S. Misailovic, and M. C. Rinard, "Verifying Quantitative Reliability for Programs that Execute on Unreliable Hardware," Communications of the ACM, Vol. 59, No. 8, pp. 83-91, August 2016.

25. Reliability of Reconfigurable 2D Processor Arrays with Distributed Switching
[Parh20] Parhami, B., "Reliability and Modelability Advantages of Distributed Switching for Reconfigurable 2D Processor Arrays," Proc. 11th Annual IEEE Information Technology, Electronics and Mobile Communication Conf., November 2020, to appear.

26. Reliability Considerations in the Design of Neuromorphic Chips
[Gree20] Greengard, S., "Neuromorphic Chips Take Shape," Communications of the ACM, Vol. 63, No. 8, pp. 9-11, August 2020.

27. Dependable Computing Under Extreme Nanoscale Parameter Variations
[Ghos10] Ghosh, S. and K. Roy, "Parameter Variation Tolerance and Error Resiliency: New Design Paradigm for the Nanoscale Era," Proceedings of the IEEE, Vol. 98, No. 10, pp. 1718-1751, October 2010.

28. Fault-Tolerant and Easily-Testable Fourier Transform Networks
[Jou88] Jou, J. Y. and J. A. Abraham, "Fault-Tolerant FFT Networks," IEEE Trans. Computers, Vol. 37, No. 5, pp. 548-561, May 1988.
[Lu05] Lu, S. K., J. S. Shih, and S. C. Huang, "Design-for-Testability and Fault-Tolerant Techniques for FFT Processors," IEEE Trans. VLSI Systems, Vol. 13, No. 6, pp. 732-741, 2005.

29. Robustness and Fault Tolerance in Natural and Artificial Neural Networks
[Torr17] Torres-Huitzil, C. and B. Girau, "Fault and Error Tolerance in Neural Networks," IEEE Access, Vol. 5, pp. 17322-17341, August 2017.

30. Reliability Considerations in the Design of Analog Mixed-Signal Chips
[Joks20] Joksas, D., P. Freitas, et al., "Committee Machines—A Universal Method to Deal with Non-Idealities in Memristor-Based Neural Networks," Nature Communications, Vol. 11, No. 1, pp. 1-10, 2020.
[Rekh19] Rekhi, A. S., et al., "Analog/Mixed-Signal Hardware Error Modeling for Deep Learning Inference," Proc. 56th Annual Design Automation Conf., 2019, pp. 1-6.

Poster Presentation Tips

Here are some guidelines for preparing your research poster. The idea of the poster is to present your research results and conclusions thus far, get oral feedback during the session from the instructor and your peers, and provide the instructor with something to comment on before your final report is due. Please send a PDF copy of the poster via e-mail by midnight on the poster presentation day.

Posters prepared for conferences must be colorful and eye-catching, as they are typically competing with dozens of other posters for the attendees' attention. Here is an example of a conference poster. Such posters are often mounted on a colored cardboard base, even if the pages themselves are standard PowerPoint slides. In our case, you should aim for a "plain" poster (loose sheets, to be taped to the wall in our classroom) that conveys your message in a simple and direct way. Eight to 10 pages, each resembling a PowerPoint slide, would be an appropriate goal. You can organize the pages into a 2 x 4 (2 columns, 4 rows), 2 x 5, or 3 x 3 array on the wall. The top two of these might contain the project title, your name, course name and number, and a very short (50-word) abstract. The final two can perhaps contain your conclusions and directions for further work (including work that does not appear in the poster, but will be included in your research report). The rest will contain brief descriptions of ideas, with emphasis on diagrams, graphs, tables, and the like, rather than text, which is very difficult for a visitor to absorb in a very limited time span.

Grade Statistics

All grades listed are in percent, unless otherwise noted.
HW1 grades (letter): Range = [A–, A+], Mean = 4.00, Median = A
HW2 grades (letter): Range = [B+, A], Mean = 3.68, Median = A–
HW3 grades (letter): Range = [B, A], Mean = 3.53, Median = A–
HW4 grades (letter): Range = [B+, A+], Mean = 3.82, Median = A
Overall homework grades: Range = [86, 102], Mean = 95, Median = 96
Research grades (letter): Range = [B–, A], Mean = 3.52, Median = A–
Research grades: Range = [68, 100], Mean = 88, Median = 93
Course grades (letter): Range = [B, A], Mean = 3.67, Median = ~A
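
As a rough illustration of how the weights stated on this page (homework 40%, research report 50%, poster 10%) would combine into a single percentage, here is a minimal Python sketch. The function name and the sample scores are hypothetical, and the sketch does not model the rule that a publishable report earns an "A" regardless of homework grades, nor the conversion from percentages to letter grades, which this page does not specify.

```python
# Weights in percent, as stated in the course information on this page.
WEIGHTS = {"homework": 40, "report": 50, "poster": 10}

def weighted_score(scores):
    """Combine component scores (each in percent) into one weighted percentage."""
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS) / 100

# Hypothetical student: 95% on homework, 90% on the report, 80% on the poster.
print(weighted_score({"homework": 95, "report": 90, "poster": 80}))  # -> 91.0
```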

References

Required text: B. Parhami, Dependable Computing: A Multilevel Approach, chapters will be posted as they are updated. Please visit the textbook's web page for general information. Lecture slides are also available there.
Some useful books (not required):
Koren/Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007 (ISBN 0-12-088525-5)
Shooman, Reliability of Computer Systems and Networks, Wiley, 2002 (ISBN 0-471-29342-3)
Siewiorek/Swarz, Reliable Computer Systems, Digital Press, 1992 (ISBN 1-55558-075-0)
Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison Wesley, 1989 (ISBN 0-201-07570-9)

Research resources:
Proc. IEEE/IFIP Int'l Conf. Dependable Systems and Networks (DSN), formerly known as Fault-Tolerant Computing Symp. (FTCS), annual, since 1971.
IEEE Trans. Dependable and Secure Computing, published since 2004
IEEE Trans. Reliability, published since 1955
IEEE Trans. Computers, published since 1952
UCSB library's electronic journals, collections, and other resources

Miscellaneous Information

Motivation: Dependability concerns are integral parts of engineering design. Ideally, we would like our computer systems to be perfect, always yielding timely and correct results. However, just as bridges collapse and airplanes crash occasionally, so too computer hardware and software cannot be made totally immune to unpredictable behavior. Despite great strides in component reliability and programming methodology, the exponentially increasing complexity of integrated circuits and software systems makes the design of perfect computer systems nearly impossible. In this course, we study the causes of computer system failures (impairments to dependability), techniques for ensuring correct and timely computations despite such impairments, and tools for evaluating the quality of proposed or implemented solutions.

Catalog entry: 257A. Fault-Tolerant Computing. (4) PARHAMI. Prerequisites: ECE 154. Lecture, 3 hours. Basic concepts of dependable computing. Reliability of nonredundant and redundant systems. Dealing with circuit-level defects. Logic-level fault testing and tolerance. Error detection and correction. Diagnosis and reconfiguration for system-level malfunctions. Degradation management. Failure modeling and risk assessment.

History: Professor Parhami took over the teaching of ECE 257A in the fall quarter of 1998. Previously, the course had been taught primarily by Dr. John Kelly, who instituted the two-course sequence ECE 257A/B, the first covering general topics and the second (now discontinued) devoted to his research focus on software fault tolerance. Borrowing from his experience in teaching dependable computing at other universities and based on an extensive survey of the field that he published in 1994, Professor Parhami oriented the course toward an original multilevel view of impairments to computer system dependability and techniques for avoiding or tolerating them. The levels of this model, in increasing order of abstraction, are: defects, faults, errors, malfunctions, degradations, and failures. A textbook based on this multilevel model of dependable computing is in preparation.
Offering of ECE 257A in fall 2021
Offering of ECE 257A in fall 2020
Offering of ECE 257A in fall 2019
Offering of ECE 257A in fall 2018
Offering of ECE 257A in fall 2016 (PDF file)
Offering of ECE 257A in fall 2015 (PDF file)
Offering of ECE 257A in winter 2015 (PDF file)
Offering of ECE 257A in fall 2013 (PDF file)
Offering of ECE 257A in fall 2012 (PDF file)
Offering of ECE 257A in fall 2009 (PDF file)
Offering of ECE 257A in fall 2007 (PDF file)
Offerings of ECE 257A in 1998 and 2006 (PDF file)