Page last updated on 2025 December 17
Enrollment code: 57638
Prerequisite: ECE 154 (computer architecture), or equivalent
Class meetings: MW 10-11, Phelps 1431 (flipped classroom)
Instructor: Professor Behrooz Parhami
Open office hours: MW 11:15-11:45, Phelps 1431
Course announcements: Listed in reverse chronological order
Course calendar: Class, homework, and research schedules
Homework assignments: Four assignments, worth a total of 40%
Exams: None for fall 2025
Research paper: Written paper in PDF format, 60%
Research paper guidelines: Brief guide to format and contents
Poster presentation tips: Not applicable to fall 2025
Policy on academic integrity: Please read very carefully
Grade stats: Statistics for homework and other grades
References: Textbook and other sources (Textbook's web page)
Lecture slides: Via the textbook's Web page
Miscellaneous information: Motivation, catalog entry, history
2025/12/17: The fall 2025 offering of ECE 257A is officially over and grades have been reported to the Registrar. Over the next couple of days, each student will receive a personalized e-mail message with feedback on his/her research paper, along with the course letter grade. As I end my teaching activities at UCSB due to retirement, I wish all of you a joyous & relaxing holiday season and continued success in your academic & personal endeavors.
2025/12/08: HW4, our last homework assignment, has been graded. Your final PDF research report will be due on W 12/10 (any time). The submission deadline is firm, with no extension possible. I will provide feedback on your paper and report the course grades by 12/17.
2025/11/23: A reminder about class and office hour being cancelled on Wed. 11/26. The last regular class will be on Mon. 12/01, with a special lecture entitled "The Machine Stops" planned for Wed. 12/03 (PDF slides).
2025/11/15: HW4, the last one for the course, has been posted to the homework area below. I am still missing the preliminary lists of references from 3 students (they have been e-mailed).
2025/11/02: HW3 has been posted to the homework area below.
2025/10/21: Research topics have been assigned to all students who submitted their top-4 preferences among the 20 pre-approved topics or whose self-defined research proposals have been approved. In the latter case, there may be some changes to the submitted title and/or defining references. Please do not change your title without seeking my approval. Details can be found in the "Research Paper and Presentation" section below.
2025/10/12: HW2 has been posted to the homework area below a couple of days ahead of schedule. Twenty pre-approved research topics have been finalized and a new topic proposed by one of the students has been approved and added. Tomorrow, I will talk about the research topics and the research process. Your prioritized list of 4 topics is due by Monday, 10/20. This deadline is soft. If you submit your topic preferences by the deadline, you will get priority in topic assignment. Otherwise, your top choices may be assigned to others and you may have to iterate with a new list of preferences (each topic will be assigned to a different student). I encourage you to submit a research topic that you define for my approval. I will talk about the criteria and process for such submissions in class.
2025/09/27: HW1 has been posted to the homework area below a couple of days ahead of schedule. Please watch Lecture 1, linked under "Course Calendar" below, before our first flipped class. Please also e-mail a 1-page PDF document to me containing an introduction to yourself (background, interests, goals, photo). Looking forward to meeting you on Monday 9/29.
2025/09/08: Welcome to the ECE 257A Web page for fall 2025. The course will be research-based, with 60% of your grade determined by your research paper and 40% based on homework. There will be no poster or oral presentation of your research, given the large class size (30 enrolled as of today).
I will use a flipped classroom model. Video of each lecture must be watched before the scheduled date on the course calendar. The first hour of our in-person class meeting will be devoted to discussion and Q&A on the topic, with the following 45 minutes serving as an open office hour held in the same classroom. Students will be free to leave after the one-hour discussion session.
Course lectures, homework assignments, and research paper deadlines have been scheduled as follows. This schedule will be strictly observed. In particular, no extension is possible for homework due dates. Please begin work on your assignments early. Each lecture corresponds to topics in 1-2 chapters of the instructor's forthcoming textbook on dependable computing. Chapter numbers are provided in parentheses, after day & date.
Day & Date (book chapters) Lecture topic [Homework posted/due] {Special notes}
M 09/29 (1) Background and motivation [HW1 posted, chs. 1-4] {Lec. 1}
W 10/01 (2) Dependability attributes {Lec. 2}
M 10/06 (3) Combinational modeling {Lec. 3}
W 10/08 (4) State-space modeling {Lec. 4}
M 10/13 Special presentation on research topics for fall 2025 [HW1 due] {Research topics defined}
W 10/15 (5, 7) Defect avoidance; Shielding and hardening [HW2 posted, chs. 5-12] {Lec. 5}
M 10/20 (6, 8) Defect circumvention; Yield enhancement {Research topic preferences due} {Lec. 6}
W 10/22 (9, 11) Fault testing; Design for testability {Research topics assigned} {Lec. 7}
M 10/27 (10, 12) Fault masking; Replication with voting {Lec. 8}
W 10/29 No class or office hour (instructor away at a conference)
M 11/03 (13, 15) Error detection; Self-checking modules [HW2 due] [HW3 posted, chs. 13-20] {Lec. 9}
W 11/05 (14, 16) Error correction; RAID systems {Prelim. references due} {Lec. 10}
M 11/10 (17, 19) Malfunction diagnosis; Standby redundancy {Lec. 11}
W 11/12 (18, 20) Malfunction tolerance; Robust parallel processing {Lec. 12}
M 11/17 (21, 23) Degradation allowance; Resilient alg's [HW3 due] [HW4 posted, chs. 21-28] {Lec. 13}
W 11/19 (22, 24) Degradation mgmt; SW redundancy {Ref's & provisional abst. due} {Lec. 14}
M 11/24 (25, 27) Failure confinement; Agreement and adjudication {Lec. 15}
W 11/26 No class or office hour on this day before Thanksgiving
M 12/01 (26, 28) Failure recovery; Fail-safe systems [HW4 due] {Lec. 16}
W 12/03 Special presentation, "The Machine Stops" {PDF slides}
W 12/10 {Full research paper PDF file due by midnight}
W 12/17 {Course grades due by midnight}
- Turn in your solutions as a PDF file attached to an e-mail sent by the due date/time.
- Because solutions will be handed out on the due date, no extension can be granted.
- Include your name, course name, and assignment number at the top of the first page.
- If homework is handwritten and scanned, make sure that the PDF is clean and legible.
- Although some cooperation is permitted, direct copying will have severe consequences.
Homework 1: Dependability and its modeling (chs. 1-4, due M 2025/10/13, 10:00 AM)
Do the following problems from the textbook: 1.7, 1.19, 2.14, 3.15, 4.1, 4.12
Homework 2: Defects and faults (chs. 5-12, due M 2025/11/03, 10:00 AM)
Do the following problems from the textbook: 5.2, 7.3, 8.2, 9.3, 11.1, 12.2
Homework 3: Errors and malfunctions (chs. 13-20, due M 2025/11/17, 10:00 AM)
Do the following problems from the textbook: 13.1, 14.7, 16.5, 17.9, 18.3ab, 19.3
Homework 4: Degradations and failures (chs. 21-28, due M 2025/12/01, 10:00 AM)
Do the following problems from the textbook: 21.3, 23.3, 24.6, 25.1, 27.6, 28.1
The following sample exam problems are meant to indicate the types and levels of problems, rather than the coverage (which is outlined in the course calendar).
Students are responsible for all sections and topics in the textbook and class handouts that are not explicitly excluded in the study guide that follows each sample exam, even if the material was not covered in class lectures.
Sample Midterm Exam (105 minutes)
Problems 3.12, 4.4, 9.4, and 12.1 from the textbook.
Midterm Exam Study Guide
Study Chapters 1-12 and review the problems in homework assignments 1-2. The following textbook sections are excluded: 6.6, 7.6, 8.6, 9.4, 9.6, 11.6
Sample Final Exam (120 minutes)
Problems 15.5, 17.1, 21.2, and 27.3 from the textbook.
Final Exam Study Guide
Study Chapters 13-28 and review the problems in homework assignments 3-4. The following textbook sections are excluded: 13.6, 14.6
Each student will review a subfield of dependable computing or do original research on a selected and approved topic. A list of pre-approved research topics is provided below. However, students should feel free to propose their own topics for approval. To propose a topic, send via e-mail a one-page narrative, including 2-3 key references, to the instructor.
A publishable report earns an "A" for the course, regardless of homework grades. See the course calendar for schedule & due dates and Research Paper Guidelines for formatting tips.
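For reports that do not trigger the publishable-report rule, the 40% homework / 60% research paper weighting stated on this page can be sketched as a simple weighted sum. This is only an illustration: the function name `course_percent` is hypothetical, both inputs are assumed to be overall percentages, and the mapping from the resulting percentage to a letter grade is not specified on this page and is therefore not modeled.

```python
def course_percent(homework_pct: float, research_pct: float) -> float:
    """Weighted course percentage: 40% homework, 60% research paper.

    Assumption: both arguments are overall percentages (values above 100
    are allowed, since the posted grade ranges exceed 100).
    """
    HOMEWORK_WEIGHT = 0.40   # four assignments, worth a total of 40%
    RESEARCH_WEIGHT = 0.60   # written research paper, 60%
    return HOMEWORK_WEIGHT * homework_pct + RESEARCH_WEIGHT * research_pct


if __name__ == "__main__":
    # Hypothetical student: 90% on homework, 100% on the research paper.
    print(round(course_percent(90, 100), 1))
```

With these hypothetical inputs, the weighted result is about 96 percent; the research paper dominates because of its larger weight.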
Our research for fall 2025 will focus on fault tolerance and robustness in biological systems, whose attributes may allow us to build ultra-reliable biologically-inspired systems. A side benefit of biologically-inspired systems is low power consumption. The following are titles and starting references for individual research papers.
01. Biologically-inspired Methods of Self-Repair [Assigned to: Parth P. Kulkarni]
Self-repair is any method that allows a system to automatically return to full or at least better functionality after an undesirable event has "injured" it.
Stauffer, A., Mange, D., & Tempesti, G. (2006). Bio-inspired computing machines with self-repair mechanisms. Proc. Int'l Workshop on Biologically Inspired Approaches to Advanced Information Technology, pp. 128-140. Springer, Berlin, Heidelberg.
Samie, M., Dragffy, G., & Pipe, T. (2009). Novel bio-inspired self-repair algorithm for evolvable fault tolerant hardware systems. Proc. 11th Annual Conf. on Genetic and Evolutionary Computation: Late Breaking Papers, pp. 2143-2148.
02. Trade-offs Between Efficiency and Robustness in Biological Systems [Assigned to: ]
We know that efficiency optimizations in computer systems are done at the expense of robustness. To what extent is the same true in biological systems?
Vardi, M. (2020), A Computational Lens on Economics, CACM.
https://cacm.acm.org/magazines/2020/7/245686-a-computational-lens-on-economics/fulltext
Carlson, J. M., & Doyle, J. (2002). Complexity and robustness. Proc. National Academy of Sciences, 99(suppl_1), 2538-2545.
03. Robust Computation in Biological Systems [Assigned to: Jash Shah]
A computation is robust if its quality is not affected by minor perturbations in system resources or data. How is this desirable property achieved in biological systems?
Kitano, H. (2007). Towards a theory of biological robustness. Molecular Systems Biology, 3(1), 137.
Krakauer, D. C. (2006). Robustness in Biological Systems: a provisional taxonomy. In Complex Systems Science in Biomedicine (pp. 183-205). Springer, Boston, MA.
04. Approximation Schemes in Biological Systems [Assigned to: Sijie Kong]
Biological computations are either analog or low-precision. How do these properties affect the accuracy of results and how are the ensuing inaccuracies tolerated?
Hopfield, J. J. (1994). Physics, computation, and why biology looks so different. J. Theoretical Biology, 171(1), 53-60.
Chelly Dagdia, Z., Avdeyev, P., & Bayzid, M. (2021). Biological computation and computational biology: survey, challenges, and discussion. Artificial Intelligence Review, 54(6), 4169-4235.
05. Genetic Redundancy and Its Benefits [Assigned to: Eliah R. Reeves]
Redundancy is one of the most important methods of ensuring dependability. Nature too uses redundancy. One example is redundancy in genes. Try to relate the two redundancy methods and draw conclusions.
Nowak, M. A., Boerlijst, M. C., Cooke, J., & Smith, J. M. (1997). Evolution of genetic redundancy. Nature, 388(6638), 167-171.
Laruson, A. J., Yeaman, S., & Lotterhos, K. E. (2020). The importance of genetic redundancy in evolution. Trends in Ecology & Evolution, 35(9), 809-822.
06. The Role of Redundancy in the Human Nervous System [Assigned to: ]
Studies of brains with various kinds of damage show that many essential functions are still performed, either by using the brain's natural redundancy or by remapping functions from one region to another.
Mizusaki, B. E., & O'Donnell, C. (2021). Neural circuit function redundancy in brain disorders. Current Opinion in Neurobiology, 70, 74-80.
Neilson, P. D., & Neilson, M. D. (2005). An overview of adaptive model theory: solving the problems of redundancy, resources, and nonlinear interactions in human movement control. J. Neural Engineering, 2(3), S279.
07. Regeneration and Self-Repair in Biological Systems [Assigned to: Leon Gold]
Most cells can repair injuries inflicted on them by various sources. Some creatures are capable of regenerating lost organs. These are examples of self-repair without external assistance.
Yang, I., Jung, S. H., & Cho, K. H. (2016). Self-repairing digital system based on state attractor convergence inspired by the recovery process of a living cell. IEEE Trans. VLSI Systems, 25(2), 648-659.
Koop, F. (2022). Scientists map the brain of the axolotl—a unique creature that can create new neurons, ZME Science.
https://www.zmescience.com/science/scientists-map-the-brain-of-the-axolotl-a-salamander-that-can-create-new-neurons-05092022/
08. Functional Redundancy in Humans and Other Animals [Assigned to: Edison Chen]
Redundancy in function is an effective complement to redundancy in resources. If multiple parts can perform the same function, then tasks can be prioritized and re-allocated, even in the absence of redundant resources.
Rosenfeld, J. S. (2002). Logical fallacies in the assessment of functional redundancy. Conservation Biology, 16(3), 837-839.
Biggs, C. R., Yeager, L. A., Bolser, D. G., Bonsell, C., Dichiera, A. M., Hou, Z., ... & Erisman, B. E. (2020). Does functional redundancy affect ecological stability and resilience? A review and meta-analysis. Ecosphere, 11(7), e03184.
09. Use of Repeated Computation and Voting in the Brain's Decision Processes [Assigned to: Yu Chen Chen]
We have seen that replication (in space or time) along with voting is an effective method of fault- and malfunction-tolerance. To what extent does the brain use these methods to improve on result correctness?
Bischoff, I., Neuhaus, C., Trautner, P., & Weber, B. (2013). The neuroeconomics of voting: Neural evidence of different sources of utility in voting. J. Neuroscience, Psychology, and Economics, 6(4), 215.
Hunt, L. T., & Hayden, B. Y. (2017). A distributed, hierarchical and recurrent framework for reward-based choice. Nature Reviews Neuroscience, 18(3), 172-182.
10. Use of Error Codes in Biological Systems [Assigned to: Dingjiang Liang]
Error-detecting and error-correcting codes are ubiquitous in computer and communication systems. How are these codes used in the human brain and other biological systems?
Battail, G. (2019). Error-correcting codes and information in biology. BioSystems, 184, 103987.
Leeson, M. S., & Higgins, M. D. (2012). Forward error correction for molecular communications. Nano Communication Networks, 3(3), 161-167.
11. Reconfiguration and Reprogramming in Biological Systems [Assigned to: Manan Gupta]
One way to achieve robustness and longevity is to reconfigure systems around non-functioning parts or to reprogram one part to perform the tasks of another part. How are these methods used in biological systems?
Finc, K., Bonna, K., He, X., Lydon-Staley, D. M., Kuhn, S., Duch, W., & Bassett, D. S. (2020). Dynamic reconfiguration of functional brain networks during working memory training. Nature Communications, 11(1), 1-15.
MacArthur, B. D., Ma'ayan, A., & Lemischka, I. R. (2009). Systems biology of stem cell fate and cellular reprogramming. Nature Reviews Molecular Cell Biology, 10(10), 672-681.
12. Self-Healing Biological Cells [Assigned to: Tzu-Chen Liang]
Most cells can recover from injuries inflicted on them by various sources. What are the biological bases for self-healing and to what extent are they transferable to computer systems?
Ghosh, D., Sharman, R., Rao, H. R., & Upadhyaya, S. (2007). Self-healing systems—survey and synthesis. Decision Support Systems, 42(4), 2164-2185.
Diesendruck, C. E., Sottos, N. R., Moore, J. S., & White, S. R. (2015). Biomimetic Self-Healing. Angewandte Chemie International Edition, 54(36), 10428-10447.
13. Self-Healing Materials and Their Biological Bases [Assigned to: Kane Deng]
One of the domains where self-healing methods have been used rather successfully is in materials science. What are these methods and to what extent are they inspired by biological systems?
Harrington, M. J., Speck, O., Speck, T., Wagner, S., & Weinkamer, R. (2015). Biological archetypes for self-healing materials. Self-Healing Materials, 307-344.
Bekas, D. G., Tsirka, K., Baltzis, D., & Paipetis, A. S. (2016). Self-healing materials: A review of advances in materials, evaluation, characterization and monitoring techniques. Composites Part B: Engineering, 87, 92-119.
14. Adaptation Schemes in Biological Systems to Improve Longevity [Assigned to: ]
Besides evolutionary changes that occur rather slowly, other adaptation schemes are at work for improving longevity. What are these adaptation schemes and how can we apply them to computing systems?
Peck, J. R., & Waxman, D. (2018). What is adaptation and how should it be measured? J. Theoretical Biology, 447, 190-198.
Gozhenko, A., Biryukov, V., Muszkieta, R., & Zukow, W. (2018). Physiological basis of human longevity: the concept of a cascade of human aging mechanism. Collegium Antropologicum, 42(2), 139-146.
15. Redundant Signaling in Biological Systems [Assigned to: Volkan Ozten]
Redundant signaling in the form of error-detecting and error-correcting codes has long been used in computer communications. Do biological systems use similar or vastly different methods?
Teng, K. K., & Hempstead, B. L. (2004). Neurotrophins and their receptors: signaling trios in complex biological systems. Cellular and Molecular Life Sciences, 61(1), 35-48.
Zimmermann, M. (1989). The nervous system in the context of information theory. In Human Physiology (pp. 166-173). Springer, Berlin, Heidelberg.
16. Robust Information Storage and Retrieval in Biological Systems [Assigned to: Kelly M. Flippo]
Correct storage of data and correct retrieval of what is stored are important in ensuring correct operation of an information system. How are these critical properties achieved in biological systems?
Yim, S. S., McBee, R. M., Song, A. M., Huang, Y., Sheth, R. U., & Wang, H. H. (2021). Robust direct digital-to-biological data storage in living cells. Nature Chemical Biology, 17(3), 246-253.
Yim, A. K. Y., Yu, A. C. S., Li, J. W., Wong, A. I. C., Loo, J. F., Chan, K. M., ... & Chan, T. F. (2014). The essential component in DNA-based information storage system: robust error-tolerating module. Frontiers in Bioengineering and Biotechnology, 2, 49.
17. Error Codes and Other Forms of Redundancy in Biological Systems [Assigned to: Zeiler Randall-Reed]
At the most fundamental level, the genetic code contains a built-in form of redundancy. What are some of the other examples of informational redundancy in nature and how can we mimic them?
Battail, G. (2019). Error-correcting codes and information in biology. BioSystems, 184, 103987.
Liebovitch, L. S., Y. Tao, A. T. Todorov, and L. Levine (1996). Is there an error correcting code in the base sequence in DNA? Biophysical J. 71(3), 1539–1544.
18. Self-Healing Systems Mimicking the Biological Immune System [Assigned to: Michael D. Smith]
Exploring the possibility of creating "immunotronic" circuits that can detect and repair hardware failures, similar to how the body identifies and eliminates pathogens.
Bradley, D. and A. Tyrrell (2000). Immunotronics: Hardware Fault Tolerance Inspired by the Immune System. Lecture Notes in Computer Science 1801.
Widhalm, D., K. M. Goeschka, and W. Kastner (2023). A review on immune-inspired node fault detection in wireless sensor networks with a focus on the danger theory. Sensors 23(3):1166.
19. Analogs of Aging and Biodegradation in Artificial Systems [Assigned to: Kogan Sam]
Both artificial and natural systems degrade with age; ditto for any fault-tolerance mechanism utilized. What can we learn from longevity studies in both types of system?
Li, Y., X. Tian, J. Luo, T. Bao, S. Wang, and X. Wu (2024). Molecular mechanisms of aging and anti‑aging strategies. Cell Communication and Signaling (22):285.
20. System Failures Due to Overload of Fault Tolerance Provisions [Assigned to: Yu Su]
The breakdown and overload of fault tolerance mechanisms contribute to conditions like Alzheimer's and Parkinson's diseases. Are similar mechanisms at play in artificial systems?
Perez, I. A., D. B. Porath, C. E. La Rocca, L. A. Braunstein, and S. Havlin (2024). Critical behavior of cascading failures in overloaded networks. Physical Review E (109):034302.
21. Fault Tolerance in Tensorized Neural Networks [Assigned to: Paolo Tonelli]
Deep learning is increasingly deployed in safety-critical settings where hardware faults can silently corrupt computations and degrade accuracy at precisely the wrong time.
M. A. Neggaz, I. Alouani, P. R. Lorenzo, and S. Niar (2018). A reliability study on CNNs for critical embedded systems. Proc. IEEE 36th Int'l Conf. Computer Design, pp. 476-479.
I. V. Oseledets (2011). Tensor-train decomposition. SIAM J. Scientific Computing 33(5), pp. 2295-2317.
M. Sabbagh, G. Cheng, Y. Fei, and Y. Wang (2019). Evaluating fault resiliency of compressed deep neural networks. Proc. IEEE Int'l Conf. Embedded Software and Systems, pp. 1-7.
22. Accuracy-Performance-Robustness Trade-offs in Tensor-Decomposed Pruned Neural Networks [Assigned to: Chih-Hao Wang]
Model compression methods such as pruning and tensor decomposition can reduce computational cost but may adversely affect accuracy or robustness to faults.
M. Sabbagh, G. Cheng, Y. Fei, and Y. Wang (2019). Evaluating fault resiliency of compressed deep neural networks. Proc. IEEE Int’l Conf. Embedded Software and Systems, pp. 1-7.
T. G. Kolda and B. W. Bader (2009). Tensor decompositions and applications. SIAM Review 51(3), pp. 455-500.
S. Rabanser, O. Shchur, and S. Günnemann (2017). Introduction to tensor decompositions and their applications in machine learning. arXiv preprint arXiv:1711.10781.
C. Hawkins, H. Yang, M. Li, L. Lai, and V. Chandra (2021). Low-rank + sparse tensor compression for neural networks. arXiv preprint arXiv:2111.01697.
23. Fault-Tolerance in Modern and Future Robotics [Assigned to: Shaoqian Zhou]
Due to their tight integration with the physical world in areas such as manufacturing, transportation, & healthcare, a high degree of fault tolerance is expected of modern robotic systems.
M. Qumar et al. (2025). Fault-tolerant control strategies for industrial robots: state of the art and future perspective on AI-based fault management.
R. T. R. W. Al-Musawi et al. (2025). Reinforcement learning-based fault-tolerant control for quadrotor with online transformer adaptation.
K. Chen et al. (2024). FogROS2-FT: Fault-tolerant cloud robotics.
24. Robustness Versus Efficiency Trade-offs in Key-Value Cache Strategies for Large Language Model Inference [Assigned to: Wei-Chung Lu]
Bridging the gap between high-performance key-value cache optimization and dependable system design through understanding the trade-offs between cache efficiency and inference reliability.
Xie et al. (2024). Efficient streaming language models with attention sinks. ICLR.
Han et al. (2024). H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv:2402.09268.
Xu et al. (2024). KIVI: Plug-and-play 2-bit KV cache quantization with streaming asymmetric quantization. arXiv:2405.16712.
Li et al. (2024). CacheGen: KV cache compression and sharing for fast and memory-efficient LLM inference. NeurIPS.
25. Process-Aware Fault Tolerance: Manufacturing for Reliable VLSI Devices [Assigned to: Lucas Liang]
As technology scales and devices become increasingly complex, we need to shift from traditional post-fab resilience to process-stage procedures that reduce the burden on BIST/ECC.
P. D. Hodgson, D. Lane, P. J. Carrington, E. Delli, R. Beanland, and M. Hayne (2022). ULTRARAM: A low-energy, high-endurance, compound-semiconductor memory on silicon. Advanced Electronic Materials 8(3), p. 2101103.
S. Smidstrup, T. Markussen, P. Vancraeyveld, J. Wellendorff, J. Schneider, T. Gunst, et al. (2020). QuantumATK: An integrated platform of electronic and atomic-scale modelling tools. J. Physics: Condensed Matter 32(1), p. 015901.
X. Wang, D. Vasudevan, and H.-H. S. Lee (2012). Global built-in self-repair for 3D Memories with redundancy sharing and parallel testing. Proc. IEEE Int’l Conf. 3D Systems Integration, pp. 1–6.
26. Fault Tolerance by Anyons in Quantum Computation [Assigned to: Kriteen Shrestha]
Although quantum error-correction algorithms exist, they require high levels of redundancy. In the anyon-based alternative, computation is carried out by creating pairs of anyons, braiding them around each other to enact logical gates, and fusing them to perform measurements. Because these operations depend only on the topology of the braiding path, not on small local errors, the resulting quantum computation is fundamentally protected against noise.
A. Y. Kitaev (2003). Fault-tolerant quantum computation by anyons. Annals of Physics 303(1), pp. 2-30.
C. Nayak et al. (2008). Non-Abelian anyons and topological quantum computation. Reviews of Modern Physics 80(3), pp. 1083-1159.
27. Balancing Approximate Computation and Fault Resilience in Quantifying TinyML Systems [Assigned to: Zejun Huang]
As TinyML is being deployed in safety-critical applications such as wearable health monitoring, the impact of approximate computations necessitated by limited power and hardware resources on its resilience must be further studied.
U. Sharif, D. Mueller-Gritschneder, R. Stahl, and U. Schlichtmann (2023). Efficient software-implemented HW fault tolerance for TinyML inference in safety-critical applications. Proc. Design, Automation & Test in Europe Conf.
A. Malhotra and S. K. Gupta (2022). FlatENN: Train flat for enhanced fault tolerance of quantized deep neural networks. arXiv:2301.00675
Shaojie Zhuo et al. (2022). Balancing approximate computation and fault resilience in quantifying TinyML Systems. arXiv:2203.05492
28. Fault-Tolerant Voters Based on Sorting and Selection Networks [Assigned to: Yuchen He]
In reliable systems based on replication and voting, voter failure presents a real challenge. If the voter design is based on a sorting/selection network, making the latter fault-tolerant is a viable approach to having robust voters.
A. C. Yao and F. F. Yao (1985). On Fault-tolerant networks for sorting. SIAM J. Computing 14(1), pp.
P. Balasubramanian and K. Prasad (2017). A fault tolerance improved majority voter for TMR system architectures. arXiv:1605.03771
J. Sun, E. Cerny, and J. Cecsei (1994). Fault tolerance in a class of sorting networks. IEEE Trans. Computers 43(7), pp. 827-837.
Y. Q. Aguiar et al. (2020). Design exploration of majority voter architectures based on the signal probability for TMR strategy optimization in space applications. Microelectronics Reliability 114, p. 113877.
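As a starting point for topic 28, the link between voting and sorting/selection networks can be illustrated with a small software model. The sketch below is not taken from any of the listed references; the names `compare_exchange` and `median_of_three` are hypothetical. It builds a 3-input sorting network from three compare-exchange elements and takes the middle output as the voter result. Whenever at least two of the three replicas agree, the median equals the majority value. The sketch models only a fault-free voter; making the network itself tolerate faulty comparators is the actual research question and is not addressed here.

```python
def compare_exchange(a, b):
    """One comparator of a sorting network: returns (min, max)."""
    return (a, b) if a <= b else (b, a)


def median_of_three(x0, x1, x2):
    """Voter for triplicated results, built as a 3-input sorting network.

    Comparator sequence (0,1), (1,2), (0,1) sorts the three inputs;
    the middle wire then carries the median, which is the majority
    value whenever at least two inputs agree.
    """
    x0, x1 = compare_exchange(x0, x1)
    x1, x2 = compare_exchange(x1, x2)
    x0, x1 = compare_exchange(x0, x1)
    return x1


if __name__ == "__main__":
    # Hypothetical replica outputs: one replica has produced a wrong value.
    print(median_of_three(5, 9, 5))   # two replicas agree on 5
    print(median_of_three(1, 0, 1))   # bit-level voting, majority is 1
```

For single-bit inputs the same network reduces to the familiar 2-out-of-3 majority function, since each comparator computes an AND (min) and an OR (max).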
Here are some guidelines for preparing your research poster. The idea of the poster is to present your research results and conclusions thus far, to get oral feedback during the session from the instructor and your peers, and to provide the instructor with something to comment on before your final report is due. Please send a PDF copy of the poster via e-mail by midnight on the poster presentation day.
Posters prepared for conferences must be colorful and eye-catching, as they are typically competing with dozens of other posters for the attendees' attention. Here is an example of a conference poster. Such posters are often mounted on a colored cardboard base, even if the pages themselves are standard PowerPoint slides. In our case, you should aim for a "plain" poster (loose sheets, to be taped to the wall in our classroom) that conveys your message in a simple and direct way.
Eight to 10 pages, each resembling a PowerPoint slide, would be an appropriate goal. You can organize the pages into a 2 x 4 (2 columns, 4 rows), 2 x 5, or 3 x 3 array on the wall. The top two of these might contain the project title, your name, course name and number, and a very short (50-word) abstract. The final two can perhaps contain your conclusions and directions for further work (including work that does not appear in the poster, but will be included in your research report). The rest will contain brief descriptions of ideas, with emphasis on diagrams, graphs, tables, and the like, rather than text, which is very difficult for a visitor to absorb in a very limited time span.
All grades listed are in percent, unless otherwise noted.
HW1 grades (letter): Range = [B, A+], Mean = 3.78, Median = A, SD = 0.37
HW2 grades (letter): Range = [B, A+], Mean = 3.71, Median = A–, SD = 0.33
HW3 grades (letter): Range = [B, A+], Mean = 3.59, Median = A–, SD = 0.37
HW4 grades (letter): Range = [B–, A+], Mean = 3.78, Median = A, SD = 0.46
Overall homework grades: Range = [79, 104], Mean = 93, Median = 94, SD = 7.0
Research grades (letter): Range = [B, A+], Mean = 3.84, Median = A, SD = 0.39
Research grades: Range = [75, 108], Mean = 96, Median = 100, SD = 9.7
Course grades (letter): Range = [B, A+], Mean = 3.76, Median = A–, SD = 0.36
Required text: B. Parhami, Dependable Computing: A Multilevel Approach, chapters will be posted as they are updated. Please visit the textbook's web page for general information. Lecture slides are also available there.
Some useful books (not required):
Koren/Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007 (ISBN 0-12-088525-5)
Shooman, Reliability of Computer Systems and Networks, Wiley, 2002 (ISBN 0-471-29342-3)
Siewiorek/Swarz, Reliable Computer Systems, Digital Press, 1992 (ISBN 1-55558-075-0)
Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison Wesley, 1989 (ISBN 0-201-07570-9)
Iyer/Kalbarczyk/Nakka, Dependable Computing: Design and Assessment, IEEE Press, 2024 (ISBN 9781118709443)
Research resources:
Proc. IEEE/IFIP Int'l Conf. Dependable Systems and Networks (DSN), formerly known as Fault-Tolerant Computing Symp. (FTCS), annual, since 1971.
IEEE Trans. Dependable and Secure Computing, published since 2004
IEEE Trans. Reliability, published since 1955
IEEE Trans. Computers, published since 1952
UCSB library's electronic journals, collections, and other resources
Motivation: Dependability concerns are integral parts of engineering design. Ideally, we would like our computer systems to be perfect, always yielding timely and correct results. However, just as bridges collapse and airplanes crash occasionally, so too computer hardware and software cannot be made totally immune to unpredictable behavior. Despite great strides in component reliability and programming methodology, the exponentially increasing complexity of integrated circuits and software systems makes the design of perfect computer systems nearly impossible. In this course, we study the causes of computer system failures (impairments to dependability), techniques for ensuring correct and timely computations despite such impairments, and tools for evaluating the quality of proposed or implemented solutions.
Catalog entry: 257A. Fault-Tolerant Computing. (4) PARHAMI. Prerequisites: ECE 154. Lecture, 3 hours. Basic concepts of dependable computing. Reliability of nonredundant and redundant systems. Dealing with circuit-level defects. Logic-level fault testing and tolerance. Error detection and correction. Diagnosis and reconfiguration for system-level malfunctions. Degradation management. Failure modeling and risk assessment.
History: Professor Parhami took over the teaching of ECE 257A in the fall quarter of 1998. Previously, the course had been taught primarily by Dr. John Kelly, who instituted the two-course sequence ECE 257A/B, the first covering general topics and the second (now discontinued) devoted to his research focus on software fault tolerance. Borrowing from his experience in teaching dependable computing at other universities and based on an extensive survey of the field that he published in 1994, Professor Parhami oriented the course toward an original multilevel view of impairments to computer system dependability and techniques for avoiding or tolerating them. The levels of this model, in increasing order of abstraction, are: defects, faults, errors, malfunctions, degradations, and failures. A textbook based on this multilevel model of dependable computing is in preparation.
Offering of ECE 257A in fall 2023
Offering of ECE 257A in fall 2022
Offering of ECE 257A in fall 2021
Offering of ECE 257A in fall 2020
Offering of ECE 257A in fall 2019
Offering of ECE 257A in fall 2018
Offering of ECE 257A in fall 2016 (PDF file)
Offering of ECE 257A in fall 2015 (PDF file)
Offering of ECE 257A in winter 2015 (PDF file)
Offering of ECE 257A in fall 2013 (PDF file)
Offering of ECE 257A in fall 2012 (PDF file)
Offering of ECE 257A in fall 2009 (PDF file)
Offering of ECE 257A in fall 2007 (PDF file)
Offerings of ECE 257A in 1998 and 2006 (PDF file)