The term “ergonomics” is derived from the Greek words ERGON (work) and NOMOS (natural law). It can also be defined as the study of the human aspects of the working environment, examined from the standpoints of anatomy, physiology, psychology, engineering, management, and design. Ergonomics is likewise concerned with optimization, efficiency, health, safety, and human comfort in the workplace, at home, and at places of recreation.

Ergonomics is also known as “human factors.”

The application of ergonomics is generally an activity of design or redesign. It may cover hardware such as work tools, benches, platforms, chairs, workholders, controls, displays, access ways, doors, windows, and so on. Closely related to this is the design of the working environment, because whenever a hardware system changes, its working environment changes with it.

Ergonomics can also play a part in job design within an organization, for example in determining the number of rest breaks, selecting shift schedules, increasing job variety, and so on. Ergonomics can likewise serve software design, since an ever greater share of work is closely tied to computers: the presentation of information in a computer system should be made as compatible as possible with human information-processing capabilities.

Ergonomics also plays an important role in improving occupational safety and health, for example: designing work systems to reduce pain and aching in the human musculoskeletal system; designing visual display unit workstations to reduce visual discomfort and awkward working postures; designing hand tools to reduce work fatigue; and designing instrument layouts and control systems so that the transfer of information is optimized, yielding fast responses with a minimal risk of error, and so that work is optimal and efficient, without the health risks that result from inappropriate working methods.

Another equally important application of ergonomics is the design and evaluation of products. Such products must be readily understood and used by a given population of people without creating hazards or risks in their use.

The term “ergonomics” was coined in 1949, but activities associated with it had been emerging for decades before then. Some of the important milestones are outlined below:
C.T. THACKRAH, ENGLAND, 1831.
Thackrah was an English physician who continued the work of the Italian Ramazzini in a series of studies on the uncomfortable working conditions experienced by operators in their workplaces. He observed working posture as part of the problem of occupational health. Thackrah observed a tailor working at a chair and table whose position and dimensions were anthropometrically unsuitable, under lighting that was far from ergonomic, so that his body was forced into a stoop and his eyes were irritated. He also observed workers in environments with high temperatures, poor ventilation, long working hours, and repetitive work.
F.W. TAYLOR, U.S.A., 1898.
Frederick W. Taylor was an American engineer who applied scientific methods to determine the best way of performing a given job. Several of his methods became concepts of modern ergonomics and management.
F.B. GILBRETH, U.S.A., 1911.
Gilbreth also observed and optimized work methods, going into greater detail in motion analysis than Taylor did. In his book Motion Study, published in 1911, he showed how a stooped posture could be overcome by designing a table system that could be adjusted up and down (adjustable).
THE INDUSTRIAL FATIGUE RESEARCH BOARD, ENGLAND, 1918.

This board was established to resolve problems that had arisen in munitions factories during the First World War. It showed how daily output increased as working hours per day were reduced. The board also studied the optimum cycle times for repetitive work systems and recommended variety and rotation of tasks.
E. MAYO AND COLLEAGUES, U.S.A., 1933.

Elton Mayo, an Australian, began a series of studies at the Western Electric Company plant in Hawthorne, Chicago. The aim of the studies was to quantify the effect of physical variables, such as lighting and the length of rest breaks, on the efficiency of operators in an assembly unit.
THE SECOND WORLD WAR, ENGLAND AND U.S.A.

The operational problems posed by rapidly developing military equipment (such as aircraft) required interdisciplinary groups to work together, and this accelerated the development of aircraft ergonomics.

The problems of the day included the placement and identification of aircraft controls, the effectiveness of displays, release handles, discomfort from excessive heat or cold, the design of clothing for very hot or very cold working conditions, and the effects of all these on operator performance.
THE FORMATION OF ERGONOMICS SOCIETIES

The formation of the Ergonomics Research Society in England in 1949 brought together a number of professionals who had long been active in the field. This led to the first scientific journal in the field of ergonomics, in November 1957. The International Ergonomics Association was formed in 1957, as was the Human Factors Society in the United States.

It is also worth noting that the first Australian ergonomics conference was held in 1964, and this prompted the formation of the Ergonomics Society of Australia and New Zealand.
1.3. THE SCIENTIFIC BASIS OF ERGONOMICS

Many applications of ergonomics rest on nothing more than common sense, and rightly so when a large benefit can be obtained simply by applying a simple principle. This is usually the case where ergonomics has not yet been fully accepted as a tool in the design process; yet many aspects of ergonomics lie well beyond ordinary awareness. Functional human characteristics such as sensory capability, response time, memory, and the optimum positions of the hands and feet for efficient muscular work are not yet fully understood by the lay public. To obtain an optimal design of jobs and products, rather than depending on trial and error, a scientific approach must be adopted.

The applied sciences most concerned with the functioning of the human body are anatomy and physiology. To become an ergonomist requires a basic knowledge of the functioning of the musculoskeletal system. Related to this are KINESIOLOGY (the mechanics of human movement) and BIOMECHANICS (the application of engineering mechanics to the analysis of the human musculoskeletal system). These sciences provide the basic equipment for dealing with problems of human posture and movement in the workplace and workspace.

Also vital to the scientific application of ergonomics is ANTHROPOMETRY (the measurement of the human body). Here anthropometric data are combined with statistics, whose use is a primary prerequisite.
1.4. THE STUDY OF WORK SYSTEMS AS A WHOLE

In applying ergonomics, it is important to address the system as a whole from the outset, so that further study or redesign does not become necessary.

For example, in designing the workspace for the driver of a vehicle, points such as the following need to be considered:

• Access (getting in and out): a major issue in the interior design of transport vehicles.
• Restraint: the fitting of seat belts in transport vehicles.
• Visibility: of pedestrians, parking lights, other vehicles, blind spots, etc.
• Seating: providing back support and armrests, distributing body weight evenly over the seat, absorbing vibration, adjustability, etc.
• Displays (instruments): the main concerns include visibility, lighting, and clarity.
• Controls: easy to reach, easy to identify and operate, with standard positions and movements.
• Environment: adequate ventilation, avoidance of excessive direct heat, avoidance of sharp contours on the instrument panel.
A central and fundamental concept in human factors is the system. Various authors have proposed different definitions for the term; however, we adopt a very simple one here. A system is an entity that exists to carry out some purpose (Bailey, 1982). A system is composed of humans, machines, and other things that work together (interact) to accomplish some goal which these same components could not produce independently. Thinking in terms of systems serves to structure the approach to the development, analysis, and evaluation of complex collections of humans and machines. As Bailey (1982) states,
The concept of a system implies that we recognize a purpose; we carefully analyze the purpose; we understand what is required to achieve the purpose; we design the system’s parts to accomplish the requirements; and we fashion a well-coordinated system that effectively meets our purpose, (p. 192)
We discuss aspects of human-machine systems and then present a few characteristics of systems in general. Finally, we introduce the concept of system reliability.
We can consider a human-machine system as a combination of one or more human beings and one or more physical components interacting to bring about, from given inputs, some desired output. In this frame of reference, the common concept of machine is too restricted, and we should rather consider a “machine” to consist of virtually any type of physical object, device, equipment, facility, thing, or what have you that people use in carrying out some activity that is directed toward achieving some desired purpose or in performing some function. In a relatively simple form, a human-machine system (or what we sometimes refer to simply as a system) can be a person with a hoe, a hammer, or a hair curler. Going up the scale of complexity, we can regard as systems the family automobile, an office machine, a lawn mower, and a roulette wheel, each equipped with its operator. More complex systems include aircraft, bottling machines, telephone systems, and automated oil refineries, along with their personnel. Some systems are less delineated and more amorphous than these, such as the servicing systems of gasoline stations and hospitals and other health services, the operation of an amusement park or a highway and traffic system, and the rescue operations for locating an aircraft downed at sea.
The essential nature of people’s involvement in a system is an active one, interacting with the system to fulfill the function for which the system is designed.
The typical type of interaction between a person and a machine is illustrated in Figure 1-1. This shows how the displays of a machine serve as stimuli for an operator and trigger some type of information processing on the part of the operator (including decision making), which in turn results in some action (as in the operation of a control mechanism) that controls the operation of the machine.
One way to characterize human-machine systems is by the degree of manual versus machine control. Although the distinctions between and among systems in terms of such control are far from clear-cut, we can generally consider systems in three broad classes: manual, mechanical, and automatic.
Manual Systems A manual system consists of hand tools and other aids which are coupled by a human operator who controls the operation. Operators of such systems use their own physical energy as the power source.
Mechanical Systems These systems (also referred to as semiautomatic systems) consist of well-integrated physical parts, such as various types of powered machine tools. They are generally designed to perform their functions with little variation. The power typically is provided by the machine, and the operator’s function is essentially one of control, usually by the use of control devices.
Automated Systems When a system is fully automated, it performs all operational functions with little or no human intervention. Robots are a good example of an automated system. Some people have the mistaken belief that since automated systems require no human intervention, they are not human-machine systems and involve no human factors considerations. Nothing could be further from the truth. All automated systems require humans to install, program, reprogram, and maintain them. Automated systems must be designed with the same attention paid to human factors that would be given to any other type of human-machine system.
Characteristics of Systems
We briefly discuss a few fundamental characteristics of systems, especially as they relate to human-machine systems.
Systems Are Purposive In our definition of a system, we stressed that a system has a purpose. Every system must have a purpose, or else it is nothing more than a collection of odds and ends. The purpose of a system is the system goal, or objective, and systems can have more than one.
Systems Can Be Hierarchical Some systems can be considered to be parts of larger systems. In such instances, a given system may be composed of more molecular systems (also called subsystems). When faced with the task of describing or analyzing a complex system, one often asks, “Where does one start and where does one stop?” The answer is, “It depends.” Two decisions must be made. First, one has to decide on the boundary of the system, that is, what is considered part of the system and what is considered outside the system. There is no right or wrong answer, but the choice must be logical and must result in a system that performs an identifiable function. The second decision is where to set the limit of resolution for the system. That is, how far down into the system is one to go? At the lowest level of analysis one finds components. A component in one analysis may be a subsystem in another analysis that sets a lower limit of resolution. As with setting system boundaries, there is no right or wrong limit of resolution. The proper limit depends on why one is describing or analyzing the situation.
Systems Operate in an Environment The environment of a system is everything outside its boundaries. Depending on how the system’s boundaries are drawn, the environment can range from the immediate environment (such as a workstation, a lounge chair, or a typing desk) through the intermediate (such as a home, an office, a factory, a school, or a football stadium) to the general (such as a neighborhood, a community, a city, or a highway system). Note that some aspects of the physical environment in which we live and work are part of the natural environment and may not be amenable to modification (although one can provide protection from certain undesirable environmental conditions such as heat or cold). Although the nature of people’s involvement with their physical environment is essentially passive, the environment tends to impose certain constraints on their behavior (such as limiting the range of their movements or restricting their field of view) or to predetermine certain aspects of behavior (such as stooping down to look into a file cabinet, wandering through a labyrinth in a supermarket to find the bread, or trying to see the edge of the road on a rainy night).
Components Serve Functions Every component (the lowest level of analysis) in a system serves at least one function that is related to the fulfillment of one or more of the system’s goals. One task of human factors specialists is to aid in making decisions as to whether humans or machines (including software) should carry out a particular system function. (We discuss this allocation of function process in more detail in Chapter 22.)
Components serve various functions in systems, but all typically involve a combination of four more basic functions: sensing (information receiving), information storage, information processing and decision, and action function; they are depicted graphically in Figure 1-2. Since information storage interacts with all the other functions, it is shown above the others. The other three functions occur in sequence.
1 Sensing (information receiving): One of these functions is sensing, or information receiving. Some of the information entering a system is from outside the system, for example, airplanes entering the area of control of a control-tower operator, an order for the production of a product, or the heat that sets off an automatic fire alarm. Some information, however, may originate inside the system itself. Such information can be feedback (such as the reading on the speedometer from an action of the accelerator), or it can be information that is stored in the system.
2 Information storage: For human beings, information storage is synonymous with memory of learned material. Information can be stored in physical components in many ways, as on magnetic tapes and disks, templates, records, and tables of data. Most of the information that is stored for later use is in coded or symbolic form.
3 Information processing and decision: Information processing embraces various types of operations performed with information that is received (sensed) and information that is stored. When human beings are involved in information processing, this process, simple or complex, typically results in a decision to act (or, in some instances, a decision not to act). When mechanized or automated machine components are used, their information processing must be programmed in some way. Such programming is, of course, readily understood if a computer is used. Other methods of programming involve the use of various types of schemes, such as gears, cams, electric and electronic circuits, and levers.
4 Action functions: What we call the action functions of a system generally are those operations which occur as a consequence of the decisions that are made. These functions fall roughly into two classes. The first is some type of physical control action or process, such as the activation of certain control mechanisms or the handling, movement, modification, or alteration of materials or objects. The other is essentially a communication action, be it by voice (in human beings), signals, records, or other methods. Such functions also involve some physical actions, but these are in a sense incidental to the communication function.
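The four functions just listed can be sketched as stages of a small pipeline. The thermostat scenario and every name below are illustrative assumptions, not something described in the text:

```python
# A minimal sketch of the four basic system functions (sensing, information
# storage, information processing and decision, action) for a hypothetical
# thermostat. All names and values here are invented for illustration.

def sense(environment):
    """Sensing: receive information entering the system from outside."""
    return environment["temperature"]

class Storage:
    """Information storage: holds the setpoint and remembers past readings."""
    def __init__(self, setpoint):
        self.setpoint = setpoint
        self.history = []

def decide(reading, storage):
    """Information processing and decision: combine the sensed input with
    stored information and produce a decision to act (or not)."""
    storage.history.append(reading)
    return reading < storage.setpoint

def act(heat_on):
    """Action function: a physical control action (switching the heater)."""
    return "heater on" if heat_on else "heater off"

storage = Storage(setpoint=20.0)
reading = sense({"temperature": 18.5})
action = act(decide(reading, storage))   # the three stages run in sequence
```

Note that `Storage` is consulted by the decision stage but sits outside the sensing-to-action sequence, mirroring how information storage interacts with all the other functions.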
Components Interact To say that components interact simply means that the components work together to achieve system goals. Each component has an effect, however small, on other components. One outcome of a system’s analysis is the description and understanding of these component and subsystem relationships.
Systems, Subsystems, and Components Have Inputs and Outputs At all levels of a complex system there are inputs and outputs. The outputs of one subsystem or component are the inputs to another. A system receives inputs from the environment and makes outputs to the environment. It is through inputs and outputs that all the pieces interact and communicate. Inputs can be physical entities (such as materials and products), electric impulses, mechanical forces, or information.
It might be valuable at this time to distinguish between open-loop and closed-loop systems. A closed-loop system performs some process which requires continuous control (such as in vehicular operation and the operation of certain chemical processes), and requires continuous feedback for its successful operation. The feedback provides information about any error that should be taken into account in the continuing control process. An open-loop system, when activated, needs no further control or at least cannot be further controlled. In this type of system the die is cast once the system has been put into operation; no further control can be exercised, such as in firing a rocket that has no guidance system. Although feedback with such systems obviously cannot serve continuous control, feedback can improve subsequent operations of the system.
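The difference between the two loop types can be sketched in a few lines of code. The vehicle-speed scenario, the step count, and the gain value are invented for illustration:

```python
def closed_loop(target, speed, steps=50, gain=0.5):
    """Closed loop: every step feeds information about the error back
    into the control action, continuously correcting the process."""
    for _ in range(steps):
        error = target - speed      # feedback: how far off are we?
        speed += gain * error       # corrective control action
    return speed

def open_loop(speed, thrust_profile):
    """Open loop: once activated, a fixed command sequence runs to
    completion; no feedback can alter it (like an unguided rocket)."""
    for thrust in thrust_profile:
        speed += thrust
    return speed

converged = closed_loop(target=100.0, speed=0.0)   # settles near the target
uncorrected = open_loop(0.0, [10.0] * 8)           # ends wherever the plan ends
```

The closed-loop version homes in on the target regardless of where it starts; the open-loop version simply executes its pre-set profile, so any error in the plan goes uncorrected.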
In a system’s analysis all the inputs and outputs required for each component and subsystem to perform its functions are specified. Human factors specialists are especially qualified to determine the inputs and outputs necessary for the human components of systems to successfully carry out their functions.
System Reliability

Unfortunately, nothing lasts forever. Things break or just fail to work, usually at the worst possible time. When we design systems, of course, we would like them to continue working. In this context, engineers speak of the reliability of a system or component to characterize its dependability of performance (including people) in carrying out an intended function. Reliability is usually expressed as the probability of successful performance (this is especially applicable when the performance consists of discrete events, such as starting a car). For example, if an automated teller machine gives out the correct amount of money 9999 times out of 10,000 withdrawal transactions, we say that the reliability of the machine, to perform the function, is .9999. (Reliabilities for electronic and mechanical devices are often carried out to four or more decimal places; reliabilities for human performance, on the other hand, usually are carried no further than three decimal places.)
Another measure of reliability is mean time to failure (abbreviated MTF). There are several possible variations, but they all relate to the amount of time a system or individual performs successfully, either until failure or between failures; this index is most applicable to continuous types of activities. Other variations could also be mentioned. For our present discussion, however, let us consider reliability in terms of the probability of successful performance.
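If one is willing to assume a constant failure rate (an assumption of this sketch, not a claim of the text), mean time to failure links the two views of reliability: the probability of performing successfully up to time t is e^(−t/MTF).

```python
import math

def reliability_at(t, mtf):
    """Probability that a constant-failure-rate component performs
    successfully until time t, given its mean time to failure (MTF).
    Both arguments are in the same time unit (hours here)."""
    return math.exp(-t / mtf)

# Under this assumption, a component with a 1000-h MTF has roughly a
# 37 percent chance of surviving a full 1000-h run, and roughly a
# 90 percent chance of surviving a shorter 105-h mission.
r_full = reliability_at(1000, mtf=1000)
r_short = reliability_at(105, mtf=1000)
```

The 1000-h figure is a hypothetical value chosen only to make the exponential relationship concrete.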
If a system includes two or more components (machine or human or both), the reliability of the composite system will depend on the reliability of the individual components and how they are combined within the system. Components can be combined within a system in series, in parallel, or in a combination of both.
Components in Series In many systems the components are arranged in series (or sequence) in such a manner that successful performance of the total system depends on successful performance of each and every component, person or machine. By taking some semantic liberties, we could assume components to be in series that may, in fact, be functioning concurrently and interdependently, such as a human operator using some type of equipment. In analyzing reliability data in such cases, two conditions must be fulfilled: (1) failure of any given component results in system failure, and (2) the component failures are independent of each other. When these assumptions are fulfilled, the reliability of the system for error-free operation is the product of the reliabilities of the several components. As more components are added in series, the reliability of the system decreases. If a system consisted of 100 components in series, each with a reliability of .9900, the reliability of the entire system would be only .366 (that is, 366 times out of 1000 the system would properly perform its function). The maximum possible reliability in a series system is equal to the reliability of the least reliable component, which often turns out to be the human component. In practice, however, the overall reliability of a series system is often much less than the reliability of the least reliable component.
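The series rule is simply the product of the component reliabilities; the sketch below reproduces the 100-component example (the mixed-component values are invented):

```python
from math import prod

def series_reliability(reliabilities):
    """A series system works only if every component works, so (assuming
    independent failures) system reliability is the product of the
    individual component reliabilities."""
    return prod(reliabilities)

r_system = series_reliability([0.9900] * 100)   # 0.99**100, about 0.366

# The product can never exceed the least reliable component, and is
# usually well below it once other components are multiplied in:
r_mixed = series_reliability([0.999, 0.95, 0.80])  # below 0.80
```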
Components in Parallel The reliability of a system whose components are in parallel is entirely different from that whose components are in a series. With parallel components, two or more in some way are performing the same function. This is sometimes referred to as a backup, or redundancy, arrangement—one component backs up another so that if one fails, the other can successfully perform the function. In order for the entire system to fail, all the components in parallel must fail. Adding components in parallel increases the reliability of the system. For example, a system with four components in parallel, each with a reliability of .70, would have an overall system reliability of .992. Because humans are often the weak link in a system, it is common to see human-machine systems designed to provide parallel redundancy for some of the human functions.
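The parallel case is the complement of the series case: the system fails only when every redundant component fails, which reproduces the four-component figure above:

```python
def parallel_reliability(reliabilities):
    """A parallel (redundant) system fails only if all components fail,
    so system reliability is 1 minus the product of the individual
    failure probabilities (again assuming independent failures)."""
    failure = 1.0
    for r in reliabilities:
        failure *= (1.0 - r)    # probability that this component also fails
    return 1.0 - failure

r_system = parallel_reliability([0.70] * 4)   # 1 - 0.30**4, about 0.992
```

Adding a fourth .70 component raises reliability only slightly over three (.973), illustrating the diminishing returns of extra redundancy.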
Discussion We have been discussing system reliability as if it were static and unchanging. As our own experience illustrates, however, reliability changes as a function of time (usually it gets worse). The probability that a 10-year-old car will start is probably lower than it was when the car was 1 year old. The same sort of time dependency applies to the reliability of humans, only over shorter periods. The probability of successful human performance often deteriorates over just a few hours of activity. Human reliability is discussed further in Chapter 2.

COVERAGE OF THIS TEXT
Since a comprehensive treatment of the entire scope of human factors would fill a small library, this text must be restricted to a rather modest segment of the total human factors domain. The central theme is the illustration of how the achievement of the two primary human factors objectives (i.e., functional effectiveness and human welfare) can be influenced by the extent to which relevant human considerations have been taken into account during the design of the object, facility, or environment in question. Further, this theme is followed as it relates to some of the more commonly recognized human factors content areas (such as the design of displays for presenting information to people, human control processes, and physical environment). Pursuing this theme across the several subareas would offer an overview of the content areas of human factors.
The implications of various perceptual, mental, and physical characteristics as they might affect or influence the objectives of human factors probably can best be reflected by the result of relevant research investigations and of documented operational experience. Therefore the theme of the text will generally be carried out by presenting and discussing the results of illustrative research and by bringing in generalizations or guidelines supported by research or experiences that have relevance to the design process in terms of human factors considerations. Thus, in the various subject or content areas much of the material in this text will consist of summaries of research that reflect the relationships between design variables, on the one hand, and criteria of functional effectiveness or human welfare, on the other hand.
We recognize that the illustrative material brought in to carry out this theme is in no way comprehensive, but we hope it will represent in most content areas some of the more important facets.
Although the central theme will, then, deal with the human factors aspects of the design of the many things people use, there will be some modest treatment of certain related topics, such as how human factors fits in with
HUMAN FACTORS RESEARCH METHODOLOGIES
Human factors is in large part an empirical science. The central approach of human factors is the application of relevant information about human capabilities and behavior to the design of objects, facilities, procedures, and environments that people use. This body of relevant information is largely based on experimentation and observation. Research plays a central role in this regard, and the research basis of human factors is emphasized throughout this book.
In addition to gathering empirically based information and applying it to the design of things, human factors specialists also gather empirical data to evaluate the “goodness” of their designs and the designs of others. Thus, empirical data, and hence research, play a dual role in the development of systems: at the front end as a basis for the design and at the back end as a means of evaluating and improving the design. For this reason, in this chapter we deal with some basic concepts of human research as it relates to human factors. Our purpose is not to present a handbook of research methods, but rather to introduce some of the purposes, considerations, and trade-offs involved in the research process. For a more complete discussion of research methodologies relevant to human factors, refer to Meister (1985), Meister and Rabideau (1965), Wilson and Corlett (1990), Keppel (1982), and Rosenthal and Rosnow (1984).
As one would expect, most human factors research involves the use of human beings as subjects, so we focus our attention there. Not all human factors research, however, involves human subjects. Sanders and Krohn (1983), for example, surveyed underground mining equipment to assess the field of view from the operator’s compartment. Aside from the data collectors, no humans were involved. Human factors research can usually be classified into one of three types: descriptive studies, experimental research, or evaluation research. Actually, not all human factors research fits neatly into only one category; often a particular study will involve elements of more than one category. Although each category has different goals and may involve the use of slightly different methods, all involve the same basic set of decisions: choosing a research setting, selecting variables, choosing a sample of subjects, deciding how the data will be collected, and deciding how the data will be analyzed.
We describe each type of research briefly and then discuss the basic research decisions listed above. This will give us an opportunity to say a few things we want to say and introduce some concepts that will be popping up now and again in later chapters.
Generally speaking, descriptive studies seek to characterize a population (usually of people) in terms of certain attributes. We present the results of many such studies throughout this book. Examples include surveys of the dimensions of people’s bodies, hearing loss among people of different ages, people’s expectations as to how a knob should be turned to increase the value on a display, and weights of boxes people are willing to lift.
Although descriptive studies are not very exciting, they are very important to the science of human factors. They represent the basic data upon which many design decisions are based. In addition, descriptive studies are often carried out to assess the magnitude and scope of a problem before solutions are suggested. A survey of operators to gather their opinions about design deficiencies and operational problems would be an example. In fact, the Nuclear Regulatory Commission (1981) required such a survey as part of its mandated human factors control room review process.
The purpose of experimental research is to test the effects of some variable on behavior. The decisions as to what variables to investigate and what behaviors to measure are usually based on either a practical situation which presents a design problem, or a theory that makes a prediction about variables and behaviors. Examples of the former include comparing how well people can edit manuscripts with partial-line, partial-page, and full-page computer displays (Neal and Darnell, 1984) and assessing the effect of seat belts and shoulder harnesses on functional arm reach (Garg, Bakken, and Saxena, 1982). Experimental research of a more theoretical nature would include a study by Hull, Gill, and Roscoe (1982) in which they varied the lower half of the visual field to investigate why the moon looks so much larger when it is near the horizon than when it is overhead.
Usually in experimental research the concern is whether a variable has an effect on behavior and the direction of that effect. Although the level of performance is of interest, usually only the relative difference in performance between conditions is of concern. For example, one might say that subjects missed, on the average, 15 more signals under high noise than under low noise. In contrast, descriptive studies are usually interested in describing a population parameter, such as the mean, rather than assessing the effect of a variable. When descriptive studies compare groups that differ on some variable (such as sex or age), the means, standard deviations, and percentiles of each group are of prime interest. This difference in goals between experimental and descriptive studies, as we will see, has implications for subject selection.
Evaluation research is similar to experimental research in that its purpose is to assess the effect of “something.” However, in evaluation research the something is usually a system or product. Evaluation research is also similar to descriptive research in that it seeks to describe the performance and behaviors of the people using the system or product.
Evaluation research is generally more global and comprehensive than experimental research. A system or product is evaluated by comparison with its goals; both intended consequences and unintended outcomes must be assessed. Often an evaluation research study will include a benefit-cost analysis. Examples of evaluation research include evaluating a new training program, a new software package for word processing, or an ergonomically designed life jacket. Evaluation research is the area where human factors specialists assess the “goodness” of designs, theirs and others, and make recommendations for improvement based on the information collected.
Evaluation is part of the overall systems design process, and we discuss it as such in Chapter 22. Suffice it to say, evaluation research is probably one of the most challenging and frustrating types of research endeavors to undertake. The rewards can be great because the results are often used to improve the design of an actual system or product, but conducting the research can be a nightmare. Murphy’s law seems to rule: “If anything can go wrong, it will.” Extra attention, therefore, must be paid to designing data collection procedures and devices to perform in the often unpredictable and unaccommodating field setting. One of your authors recalls evaluating the use of a helicopter patrol for fighting crime against railroads. Murphy’s law prevailed; the trained observers never seemed to be available when they were needed, the railroad coordination chairman had to resign, the field radios did not work, the railroads were reluctant to report incidents of crime, and finally a gang of criminals threatened to shoot down the helicopter if they saw it in the air.
CHOOSING A RESEARCH SETTING
In choosing the research setting for a study, the fundamental decision is whether to conduct the study in the field, also referred to as the “real world,” or in the laboratory.
With descriptive studies the choice of research setting is somewhat moot. The primary goal of descriptive studies is to generate data that describe a particular population of people, be they coal miners, computer operators, or the general civilian population. To poll such people, we must go to the real world. As we will see, however, the actual data collection may be done in a laboratory— often a mobile laboratory—which is, in essence, like bringing the mountain to Mohammad.
The choice of research setting for experimental research involves complex trade-offs. Research carried out in the field usually has the advantage of realism in terms of relevant task variables, environmental constraints, and subject characteristics including motivation. Thus, there is a better chance that the results obtained can be generalized to the real-world operational environment. The disadvantages, however, include cost (which can be prohibitive), safety hazards for subjects, and lack of experimental control. In field studies often there is no opportunity to replicate the experiment a sufficient number of times, many variables cannot be held constant, and often certain data cannot be collected because the process would be too disruptive.
The laboratory setting has the principal advantage of experimental control; extraneous variables can be controlled, the experiment can be replicated almost at will, and data collection can be made more precise. For this advantage, however, the research may sacrifice some realism and generalizability. Meister (1985) believes that this lack of realism makes laboratory research less than adequate as a source of applied human factors data. He believes that conclusions generated from laboratory research should be tested in the real world before they are used there.
For theoretical studies, the laboratory is the natural setting because of the need to isolate the subtle effects of one or more variables. Such precision probably could not be achieved in the uncontrolled real world. The real world, on the other hand, is the natural setting for answering practical research questions. Often, a variable that shows an effect in the highly controlled laboratory “washes out” when it is compared to all the other variables that are affecting performance in the real world.
In some cases, field research can be carried out with a good deal of control— although some people might say of the same situation that it is laboratory research that is being carried out with a good deal of realism. An example is a study in which Marras and Kroemer (1980) compared two distress signal designs (for flares). In one part of the study, subjects were taken by boat to an island in a large lake. They were told that their task was to sit in an inflatable rubber raft, offshore, and rate the visibility of a display which would appear on shore. They were given a distress signal to use in case of an emergency. While the subjects were observing the display, the raft deflated automatically, prompting the subjects to activate the distress signal. The time required for unpacking and successfully operating the device was recorded. The subjects were then pulled back to shore.
In an attempt to combine the benefits of both laboratory and field research, researchers often use simulations of the real world in which to conduct research. A distinction should be made between physical simulations and computer simulations. Physical simulations are usually constructed of hardware and represent (i.e., look like, feel like, or act like) some system, procedure, or environment. Physical simulations can range from very simple items (such as a picture of a control panel) to extremely complex configurations (such as a moving-base jumbo jet flight simulator with elaborate out-of-cockpit visual display capabilities). Some simulators are small enough to fit on a desktop; others can be quite large, such as a 400-ft2 underground coal mine simulator built by one of your authors.
Computer simulation involves modeling a process or series of events in a computer. By changing the parameters the model can be run and predicted results can be obtained. For example, workforce needs, periods of overload, and equipment downtime can be predicted from computer simulations of work processes. To develop an accurate computer model requires a thorough understanding of the system being modeled and usually requires the modeler to make some simplifying assumptions about how the real-world system operates.
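The modeling idea can be sketched in a few lines. The sketch below is a hypothetical Monte Carlo model of equipment downtime, not any particular real system; the failure probability and repair time are invented parameters of the kind a modeler would have to estimate and simplify.

```python
import random

def simulate_downtime(hours, p_fail_per_hour, repair_hours, seed=0):
    """Estimate the fraction of time a machine is down, given a
    per-hour failure probability and a fixed repair time.
    (Hypothetical parameters, for illustration only.)"""
    rng = random.Random(seed)
    down = 0
    t = 0
    while t < hours:
        if rng.random() < p_fail_per_hour:
            down += repair_hours       # machine is out of service while repaired
            t += repair_hours
        t += 1                         # advance the clock one operating hour
    return down / max(t, 1)

# Rerunning the model with different parameters predicts results that
# would be costly or disruptive to observe in the real system.
frac_down = simulate_downtime(hours=10_000, p_fail_per_hour=0.01, repair_hours=4)
```

Changing `p_fail_per_hour` or `repair_hours` and rerunning is the computational analog of asking "what if" questions about the work process.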
As with descriptive studies, choosing a research setting for evaluation research is also somewhat moot. For a true test of the “goodness” of a system or device, the test should be conducted under conditions representative of those under which the thing being tested will ultimately be used. A computerized map display for an automobile, as an example, should be tested in an automobile while driving over various types of roads and in different traffic conditions. The display may be very legible when viewed in a laboratory, but very hard to read in a moving automobile. Likewise, the controls may be easily activated when the device is sitting on a desk in a laboratory but may be very hard to operate while navigating a winding road or driving in heavy traffic.
The selection of variables to be measured in research studies is such a fundamental and important question that we have devoted two later sections of this chapter to it.
In descriptive studies two basic classes of variables are measured: criterion variables and stratification (or predictor) variables.
Criterion Variables Criterion variables describe those characteristics and behaviors of interest in the study. These variables can be grouped into the following classes according to the type of data being collected: physical characteristics, such as arm reach, stomach girth, and body weight; performance data, such as reaction time, visual acuity, hand grip strength, and memory span; subjective data, such as preferences, opinions, and ratings; and physiological indices, such as heart rate, body temperature, and pupil dilation.
Stratification Variables In some descriptive studies (such as surveys), it is the practice to select stratified samples that are proportionately representative of the population in terms of such characteristics as age, sex, education, etc. Even if a stratified sample is not used, however, information is often obtained on certain relevant personal characteristics of those in the sample. Thus, the resulting data can be analyzed in terms of the characteristics assessed, such as age, sex, etc. These characteristics are sometimes called predictors.
In experimental research, the experimenter manipulates one or more variables to assess their effects on behaviors that are measured, while other variables are controlled. The variables being manipulated by the experimenter are called independent variables (IVs). The behaviors being measured to assess the effects of the IVs are called dependent variables (DVs). The variables that are controlled are called extraneous, secondary, or relevant variables. These are variables that can influence the DV; they are controlled so that their effect is not confused (confounded) with the effect of the IV.
Independent Variables In human factors research, IVs usually can be classified into three types: (1) task-related variables, including equipment variables (such as length of control lever, size of boxes, and type of visual display) and procedural variables (such as work-rest cycles and instructions to stress accuracy or speed); (2) environmental variables, such as variations in illumination, noise, and vibration; and (3) subject-related variables, such as sex, height, age, and experience.
Most studies do not include more than a few IVs. Simon (1976), for example, reviewed 141 experimental papers published in Human Factors from 1958 to 1972 and found that 60 percent of the experiments investigated the effects of only one or two IVs and less than 3 percent investigated five or more IVs.
Dependent Variables DVs are the same as the criterion variables discussed in reference to descriptive studies, except that physical characteristics are used less often. Most DVs in experimental research are performance, subjective, or physiological variables. We discuss criterion variables later in this chapter.
Selecting variables for evaluation research requires the researcher to translate the goals and objectives of the system or device being evaluated into specific criterion variables that can be measured. Criterion variables must also be included to assess unintended consequences arising from the use of the system. The criterion variables are essentially the same as those used in descriptive studies and experimental research and are discussed further in later sections of this chapter and in Chapter 22.
Choosing subjects is a matter of deciding who to select, how to select them, and how many of them to select.
Proper subject selection is critical to the validity of descriptive studies. Often, in such studies, the researcher expends more effort in developing a sampling plan and obtaining the subjects than in any other phase of the project.
Representative Sample The goal in descriptive studies is to collect data from a sample of people representative of the population of interest. A sample is said to be representative of a population if the sample contains all the relevant aspects of the population in the same proportion as found in the real population. For example, if in the population of coal miners 30 percent are under 21 years of age, 40 percent are between 21 and 40 years, and 30 percent are over 40 years of age, then the sample—to be representative—should also contain the same percentages of each age group.
The key word in the definition of representative is relevant. A sample may differ from a population with respect to an irrelevant variable and still be useful for descriptive purposes. For example, if we were to measure the reaction time of air traffic controllers to an auditory alarm, we would probably carry out the study in one or two cities rather than sampling controllers in every state or geographic region in the country. This is so because it is doubtful that the reaction time is different in different geographic regions; that is, geographic region is probably not a relevant variable in the study of reaction time. We would probably take great care, however, to include the proper proportions of controllers with respect to age and sex because these variables are likely to be relevant. A sample that is not representative is said to be biased.
Random Sampling To obtain a representative sample, the sample should be selected randomly from the population. Random selection occurs when each member of the population has an equal chance of being included in the sample. In the real world, it is almost impossible to obtain a truly random sample. Often the researcher must settle for those who are easily obtained even though they were not selected according to a strict random procedure.
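In computational terms the draw itself is trivial once the population can be enumerated; the practical difficulty lies in obtaining that enumeration, not in the selection. A minimal sketch, using a hypothetical numbered roster:

```python
import random

# Hypothetical roster: every member of the population, numbered 1..1000.
population = list(range(1, 1001))

rng = random.Random(42)                  # fixed seed so the draw is repeatable
sample = rng.sample(population, k=50)    # each member equally likely to be chosen
```

`random.sample` draws without replacement, so no member appears twice; this is the usual meaning of random selection in survey work.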
Even though a particular study may not sample randomly or include all possible types of people in the proportions in which they exist in the population, the study may still be useful if the bias in the sample is not relevant to the criterion measures of interest. How does one know whether the bias is relevant? Prior research, experience, and theories form the basis for an educated guess.
Sample Size A key issue in carrying out a descriptive study is determining how many subjects will be used, i.e., the sample size. The larger the sample size, the more confidence one has in the results. Sampling costs money and takes time, so researchers do not want to collect more data than they need to make valid inferences about the population. Fortunately, there are formulas for determining the number of subjects required. [See, for example, Roebuck, Kroemer, and Thompson (1975).] Three main parameters influence the number of subjects required: degree of accuracy desired (the more accuracy desired, the larger the sample size required); variance in the population (the greater the degree of variability of the measure in the population, the larger the sample size needed to obtain the level of accuracy desired); and the statistic being estimated, e.g., mean, 5th percentile, etc. (some statistics require more subjects to estimate accurately than others; for example, more subjects are required to estimate the median than to estimate the mean with the same degree of accuracy).
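For the common case of estimating a population mean, one standard formula (a sketch of the general approach, not necessarily the procedure given by Roebuck, Kroemer, and Thompson) is n = (zσ/E)², where σ is the population standard deviation, E the tolerable error, and z the confidence multiplier:

```python
import math

def sample_size_for_mean(sigma, error, z=1.96):
    """Subjects needed so the sample mean falls within +/- error of the
    population mean at the stated confidence (z = 1.96 for 95 percent).
    Assumes sigma, the population standard deviation, is known or estimated."""
    return math.ceil((z * sigma / error) ** 2)

# Hypothetical example: stature with sigma of about 6.9 cm,
# desired accuracy of +/- 1 cm at 95 percent confidence.
n = sample_size_for_mean(sigma=6.9, error=1.0)   # -> 183 subjects
```

Note how the formula reflects two of the three parameters in the text: halving the tolerable error quadruples the required sample, and doubling the population variability quadruples it as well.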
The issue in choosing subjects for experimental research is to select subjects representative of those people to whom the results will be generalized. Subjects do not have to be representative of the target population to the same degree as in descriptive studies. The question in experimental studies is whether the subjects will be affected by the IV in the same way as the target population. For example, consider an experiment to investigate the effects of room illumination (high versus low) on reading text displayed on a computer screen. If we wish to use the data to design office environments, do we need to use subjects who have extensive experience reading text from computer screens? Probably not. Although highly experienced computer screen readers can read faster than novice readers (hence it would be important to include them in a descriptive study), the effect of the IV (illumination) is likely to be the same for both groups. That is, it is probably easier to read computer-generated text under low room illumination (less glare) than under high—no matter how much experience the reader has.
Sample Size The issue in determining sample size is to collect enough data to reliably assess the effects of the IV with minimum cost in time and resources. There are techniques available to determine sample size requirements; however, they are beyond the scope of this book. The interested reader can consult Cohen (1977) for more details.
In Simon’s (1976) review of research published in Human Factors, he found that 50 percent of the studies used fewer than 9 subjects per experimental condition, 25 percent used from 9 to 11, and 25 percent used more than 11. These values are much smaller than those typically used in descriptive studies. The danger of using too few subjects is that one will incorrectly conclude that an IV had no effect on a DV when, in fact, it did. The “danger,” if you can call it that, of using too many subjects is that a tiny effect of an IV, which may have no practical importance whatsoever, will show up in the analysis.
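The first danger can be illustrated with a small Monte Carlo sketch. The numbers below are invented, and the significance check is a crude z-test on group means rather than any specific published procedure:

```python
import random
import statistics

def miss_rate(effect, n_per_group, trials=2000, cutoff=1.96, seed=1):
    """Monte Carlo sketch: how often a two-group comparison fails to
    detect a real difference ('misses' it) at a given group size.
    All numbers are illustrative."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Two groups whose true means really do differ by `effect`.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        se = (statistics.variance(a) / n_per_group
              + statistics.variance(b) / n_per_group) ** 0.5
        z = (statistics.mean(b) - statistics.mean(a)) / se
        if abs(z) < cutoff:        # not significant: the real effect was missed
            misses += 1
    return misses / trials

small_groups = miss_rate(effect=0.5, n_per_group=8)    # few subjects: many misses
large_groups = miss_rate(effect=0.5, n_per_group=64)   # more subjects: few misses
```

With a modest real effect, groups of 8 miss it most of the time, while groups of 64 rarely do—which is exactly the risk run by the small samples Simon found to be typical.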
Choosing subjects for evaluation research involves the same considerations as discussed for descriptive studies and experimental research. The subjects must be representative of the ultimate user population. The number of subjects must be adequate to allow predictions to be made about how the user population will perform when the system or device is placed in use. Unfortunately, most evaluation research must make do with fewer subjects than we would like, and often those that are made available for the evaluation are not really representative of the user population.
COLLECTING THE DATA
Data in descriptive studies can be collected in the field or in a laboratory setting. Bobo et al. (1983), for example, measured energy expenditure of underground coal miners performing their work underground. Sanders (1981) measured the strength of truck and bus drivers turning a steering wheel in a mobile laboratory at various truck depots around the country.
Often, surveys and interviews are used to collect data. Survey questionnaires may be administered in the field or mailed to subjects. A major problem with mail surveys is that not everyone returns the questionnaires, and the possibility of bias increases. Return rates of less than 50 percent should probably be considered to have a high probability of bias.
The collection of data for experimental research is the same as for descriptive studies. Because experimental studies are often carried out in a controlled laboratory, often more sophisticated, computer-based methods are employed. As pointed out by McFarling and Ellingstad (1977), such methods provide the potential for including more IVs and DVs in a study and permit greater precision and higher sampling rates in the collection of performance data. Vreuls et al. (1973), for example, generated over 800 different measures to evaluate the performance of helicopter pilots doing a few common maneuvers. One, of course, has to be careful not to drown in a sea of data.
Collecting data in an evaluation research study is often difficult. The equipment being evaluated may not have the capability of monitoring or measuring user performance, and engineers are often reluctant to modify the equipment just to please the evaluation researcher. All too often, the principal method of data collection is observing users and interviewing them regarding problems they encountered and their opinions of the equipment.
ANALYZING THE DATA
Once a study has been carried out and the data have been gathered, the experimenter must analyze the data. It is not our intention here to deal extensively with statistics or to discuss elaborate statistical methods.
When analyzing data from descriptive studies, usually fairly basic statistics are compared. Probably most readers are already familiar with most statistical methods and concepts touched on in later chapters, such as frequency distributions and measures of central tendency (mean, median, mode). For those readers unfamiliar with the concepts of standard deviation, correlation, and percentiles, we describe them briefly.
Standard Deviation The standard deviation (S) is a measure of the variability of a set of numbers around the mean. When, say, the reaction time of a group of subjects is measured, not everyone has the same reaction time. If the scores varied greatly from one another, the standard deviation would be large. If the scores were all close together, the standard deviation would be small. In a normal distribution (bell-shaped curve), approximately 68 percent of the cases will be within ±1S of the mean, 95 percent within ±2S of the mean, and 99 percent within ±3S of the mean.
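With a handful of hypothetical reaction-time scores (in milliseconds), the computation looks like this:

```python
import statistics

# Hypothetical reaction times, in milliseconds.
reaction_times = [210, 250, 190, 300, 230, 270, 220, 260]

mean_rt = statistics.mean(reaction_times)   # 241.25 ms
sd_rt = statistics.stdev(reaction_times)    # sample standard deviation, ~35.6 ms

# Scores falling within +/- 1 SD of the mean.
within_1sd = [x for x in reaction_times if abs(x - mean_rt) <= sd_rt]
```

In this tiny sample 6 of the 8 scores (75 percent) fall within one standard deviation of the mean—close to the 68 percent expected for a normal distribution, with the discrepancy due to the small n.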
Correlation A correlation coefficient is a measure of the degree of relationship between two variables. Typically, we compute a linear correlation coefficient (e.g., Pearson product-moment correlation r) which indicates the degree to which two variables are linearly related, that is, related in a straight-line fashion. Correlations can range from +1.00, indicating a perfect positive relationship, through 0 (which is the absence of any relationship), to -1.00, a perfect negative relationship. A positive relationship between two variables indicates that high values on one variable tend to be associated with high values on the other variable, and low values on one are associated with low values on the other. An example would be height and weight because tall people tend to be heavier than short people. A negative relationship between two variables indicates that high values on one variable are associated with low values on the other. An example would be age and strength because older people tend to have less strength than younger people.
If the correlation coefficient is squared (r²), it represents the proportion of variance (standard deviation squared) in one variable accounted for by the other variable. This is somewhat esoteric, but it is important for understanding the strength of the relationship between two variables. For example, if the correlation coefficient between age and strength is .30, we would say that of the total variance in strength 9 percent (.30²) was accounted for by the subjects’ age. That means that 91 percent of the variance is due to factors other than age.
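A minimal computation of r and r², using invented height-weight data of the kind mentioned above:

```python
def pearson_r(xs, ys):
    """Pearson product-moment (linear) correlation between two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

heights = [160, 165, 170, 175, 180]   # hypothetical data, cm
weights = [55, 62, 66, 74, 80]        # hypothetical data, kg

r = pearson_r(heights, weights)       # strongly positive, near +1
r_squared = r * r                     # proportion of variance accounted for
```

A perfectly inverted pair of lists, such as `[1, 2, 3]` against `[3, 2, 1]`, would instead give r = -1.00, the perfect negative relationship described above.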
Percentiles Percentiles correspond to the value of a variable below which a specific percentage of the group fall. For example, the 5th percentile standing height for males is 63.6 in (162 cm). This means that only 5 percent of males are smaller than 63.6 in (162 cm). The 50th percentile male height is 68.3 in (173 cm), which is the same as the median since 50 percent of males are shorter than this value and 50 percent are taller. The 95th percentile is 72.8 in (185 cm), meaning that 95 percent of males are shorter than this height. Some investigators report the interquartile range. This is simply the range from the 25th to the 75th percentiles, and thus it encompasses the middle 50 percent of the distribution. (Interquartile range is really a measure of variability.) The concept of percentile is especially important in using anthropometric (body dimension) data for designing objects, workstations, and facilities.
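A sketch of the computation, using the nearest-rank definition (one of several common percentile definitions) and invented stature data:

```python
import math

def percentile(data, p):
    """Value below which p percent of the sorted data fall
    (nearest-rank method; other definitions interpolate instead)."""
    s = sorted(data)
    k = min(len(s) - 1, max(0, math.ceil(p / 100 * len(s)) - 1))
    return s[k]

statures = list(range(150, 200))      # 50 hypothetical stature values, cm

p5  = percentile(statures, 5)
p50 = percentile(statures, 50)        # the median
p95 = percentile(statures, 95)
iqr = percentile(statures, 75) - percentile(statures, 25)  # middle 50 percent
```

For design work, p5 and p95 are typically the values of interest: a workstation sized for this hypothetical population between its 5th and 95th percentile statures accommodates the middle 90 percent of users.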
Data from experimental studies are usually analyzed through some type of inferential statistical technique such as analysis of variance (ANOVA) or multivariate analysis of variance (MANOVA). We do not discuss these techniques since several very good texts are available (Box, Hunter, and Hunter, 1978; Tabachnick and Fidell, 1989; Hays, 1988; Siegel and Castellan, 1988).
The outcome of virtually all such techniques is a statement concerning the statistical significance of the data. It is this concept that we discuss because it is central to all experimental research.
Statistical Significance Researchers make statements such as “the IV had a significant effect on the DV” or “the difference between the means was significant.” The term significance is short for statistical significance. To say that something is statistically significant simply means that there is a low probability that the observed effect, or difference between the means, was due to chance. And because it is unlikely that the effect was due to chance, it is concluded that the effect was due to the IV.
How unlikely must it be that chance was causing the effect to make us conclude chance was not responsible? By tradition, we say .05 or .01 is a low probability (by the way, this tradition started in agricultural research, earthy stuff). The experimenter selects one of these values, which is called the alpha level. Actually, any value can be selected, and you as a consumer of the research may disagree with the level chosen. Thus, if the .05 alpha level is selected, the researcher is saying that if the results obtained could have occurred 5 times or less out of 100 by chance alone, then it is unlikely that they are due to chance. The researcher says that the results are significant at the .05 level. Usually the researcher will then conclude that the IV was causing the effect. If, on the other hand, the results could have occurred more than 5 times out of 100 by chance, then the researcher concludes that it is likely that chance was the cause, and not the IV.
The statistical analysis performed on the data yields the probability that the results could have occurred by chance. This is compared to the alpha level, and that determines whether the results are significant. Keep in mind that the statistical analysis does not actually tell the researcher whether the results did or did not occur by chance; only the probability is given. No one knows for sure.
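The logic can be made concrete with a permutation test, which estimates the chance probability directly by reshuffling the data. The missed-signal counts below are invented, echoing the noise example earlier in the chapter:

```python
import random
import statistics

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Probability that a difference in group means at least as large as
    the observed one would arise by chance alone (two-tailed)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)            # pretend group membership is arbitrary
        d = abs(statistics.mean(pooled[:len(a)])
                - statistics.mean(pooled[len(a):]))
        if d >= observed:
            count += 1
    return count / n_perm              # the probability the analysis reports

low_noise  = [3, 2, 4, 3, 2, 3, 4, 2]   # hypothetical missed-signal counts
high_noise = [6, 7, 5, 8, 6, 7, 6, 9]

p = permutation_p_value(low_noise, high_noise)
alpha = 0.05
significant = p < alpha    # compare the chance probability to the alpha level
```

Here the groups barely overlap, so almost no reshuffling reproduces a difference that large, `p` is far below .05, and the result is declared significant—yet, as the text stresses, the analysis yields only a probability, never certainty.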
Here are a few things to remember when you are reading about “significant results” or results that “failed to reach significance”: (1) results that are significant may still be due only to chance, although the probability of this is low. (2) IVs that are not significant may still be influencing the DV, and this can be very likely, especially when small sample sizes are used. (3) Statistical significance has nothing whatever to do with importance—very small, trivial effects can be statistically significant. (4) There is no way of knowing from the statistical analysis procedure whether the experimental design was faulty or uncontrolled variables confounded the results.
The upshot is that one study usually does not make a fact. Only when several studies, each using different methods and subjects, find the same effects should we be willing to say with confidence that an IV does affect a DV. Unfortunately, in most areas, we usually find conflicting results, and it takes insight and creativity to unravel the findings.
CRITERION MEASURES IN RESEARCH
Criterion measures, as we discussed, are the characteristics and behaviors measured in descriptive studies, the dependent variables in experimental research, and the basis for judging the goodness of a design in an evaluation study. Any attempt to classify criterion measures inevitably leads to confusion and overlap. Since a little confusion is part and parcel of any technical book, we relate one simple scheme for classifying criterion measures to provide a little organization to our discussion.
In the human factors domain, three types of criteria can be distinguished (Meister, 1985): those describing the functioning of the system, those describing how the task is performed, and those describing how the human responds. The problem is that measures of how tasks are performed usually involve how the human responds. We briefly describe each type of criteria, keeping in mind the inevitable overlap among them.
System-Descriptive Criteria
System-descriptive criteria usually reflect essentially engineering aspects of the entire system. So they are often included in evaluation research but are used to a much lesser extent in descriptive and experimental studies. System-descriptive criteria include such aspects as equipment reliability (i.e., the probability that it will not break down), resistance to wear, cost of operation, maintainability, and other engineering specifications, such as maximum rpm, weight, radio interference, etc.
Task Performance Criteria
Task performance criteria usually reflect the outcome of a task in which a person may or may not be involved. Such criteria include (1) quantity of output, for example, number of messages decoded, tons of earth moved, or number of shots fired; (2) quality of output, for example, accidents, number of errors, accuracy of drilling holes, or deviations from a desired path; and (3) performance time, for example, time to isolate a fault in an electric circuit or amount of delay in beginning a task. Task performance criteria are more global than human performance criteria, although human performance is inextricably intertwined in task performance and in the engineering characteristics of the system or equipment being used.
Human Criteria
Human criteria deal with the behaviors and responses of humans during task performance. Human criteria are measured with performance measures, physiological indices, and subjective responses.
Performance Measures Human performance measures are usually frequency measures (e.g., number of targets detected, number of keystrokes made, or number of times the “help” screen was used), intensity measures (e.g., torque produced on a steering wheel), latency measures (e.g., reaction time or delay in switching from one activity to another), or duration measures (e.g., time to log on to a computer system or time on target in a tracking task). Sometimes combinations of these basic types are used, such as number of missed targets per unit time. Another human performance measure is reliability, or the probability of errorless performance. There has been a good deal of work relating to human reliability, and we review some of it later in this chapter.
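Human reliability, for example, reduces to a simple proportion once trials and errors have been counted (the counts below are hypothetical):

```python
# Hypothetical observation of a repetitive task.
trials = 200            # observed attempts at the task
error_trials = 6        # attempts containing at least one error

human_reliability = 1 - error_trials / trials   # probability of errorless performance
```

With these invented counts the operator's reliability is .97, i.e., 97 percent of attempts were completed without error.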
Physiological Indices Physiological indices are often used to measure strain in humans resulting from physical or mental work and from environmental influences, such as heat, vibration, noise, and acceleration. Physiological indices can be classified by the major biological systems of the body: cardiovascular (e.g., heart rate or blood pressure), respiratory (e.g., respiration rate or oxygen consumption), nervous (e.g., electric brain potentials or muscle activity), sensory (e.g., visual acuity, blink rate, or hearing acuity), and blood chemistry (e.g., catecholamines).

Subjective Responses Often we must rely on subjects’ opinions, ratings, or judgments to measure criteria. Criteria such as comfort of a seat, ease of use of a computer system, or preferences for various lengths of tool handle are all examples of subjective measures. Subjective responses have also been used to measure perceived mental and physical workload. Extra care must be taken when subjective measures are designed because people have all sorts of built-in biases in the way they evaluate their likes, dislikes, and feelings. Subtle changes in the wording or order of questions, the format by which people make their response, or the instructions accompanying the measurement instrument can alter the responses of the subjects. Despite these shortcomings, subjective responses are a valuable data source and often represent the only reasonable method for measuring the criterion of interest.
Terminal versus Intermediate Criteria
Criterion measures may describe the terminal performance (the ultimate output of the action) or some intermediate performance that led up to the output. Generally, intermediate measures are more specific and detailed than terminal measures. Terminal measures are more valuable than intermediate ones because they describe the ultimate performance of interest. Intermediate measures, however, are useful for diagnosing and explaining performance inadequacies. For example, if we were interested in the effectiveness of a warning label that instructs people using a medicine to shake well (the bottle, that is) before using, the terminal performance would be whether the people shook the bottle. Intermediate criteria would include asking the people whether they recalled seeing the warning, whether they read the warning, and whether they could recall the warning message. The intermediate criterion data would be of value in explaining why people do not always follow warnings: is it because they do not see them, do not take the time to read them, or cannot recall them?
REQUIREMENTS FOR RESEARCH CRITERIA
Criterion measures used in research investigations generally should satisfy certain requirements. There are both practical and psychometric requirements. The psychometric requirements are those of reliability, validity, freedom from contamination, and sensitivity.
Meister (1985) lists six practical requirements for criterion measures, indicating that, when feasible, a criterion measure should (1) be objective, (2) be quantitative, (3) be unobtrusive, (4) be easy to collect, (5) require no special data collection techniques or instrumentation, and (6) cost as little as possible in terms of money and experimenter effort.
Reliability
In the context of measurement, reliability refers to the consistency or stability of the measures of a variable over time or across representative samples. This is different from the concept of human reliability, as we see later. Technically, reliability in the measurement sense is the degree to which a set of measurements is free from error (i.e., unsystematic or random influences).
Suppose that a human factors specialist in King Arthur’s court were commanded to assess the combat skills of the Knights of the Roundtable. To do this, the specialist might have each knight shoot a single arrow at a target and record the distance off target as the measure of combat skill. If all the knights were measured one day and again the next, quite likely the scores would be quite different on the two days. The best archer on the first day could be the worst on the second day. We would say that the measure was unreliable. Much, however, could be done to improve the reliability of the measure, including having each knight shoot 10 arrows each day and using the average distance off target as the measure, being sure all the arrows were straight and the feathers set properly, and performing the archery inside the castle to reduce the variability in wind, lighting, and other conditions that could change a knight’s performance from day to day.
Correlating the sets of scores from the two days would yield an estimate of the reliability of the measure. Generally speaking, test-retest reliability correlations around .80 or above are considered satisfactory, although with some measures we have to be satisfied with lower levels.
Validity
Several types of validity are relevant to human factors research. Although each is different, they all have in common the determination of the extent to which different variables actually measure what was intended. The types of validity relevant to our discussion are face validity, content validity, and construct validity. Another type of validity, which is more relevant to determining the usefulness of tests as a basis for selecting people for a job, is criterion-related validity. This refers to the extent to which a test predicts performance. We do not discuss this type of validity further; rather we focus on the other types.
Face Validity
Face validity refers to the extent to which a measure looks as though it measures what is intended. At first glance, this may not seem important; however, in some measurement situations (especially in evaluation research and experimental research carried out in the field), face validity can influence the motivation of the subjects participating in the research. Where possible, researchers should choose measures or construct tasks that appear relevant to the users. To test the legibility of a computer screen for office use, it might be better to use material that is likely to be associated with offices (such as invoices) rather than nonsense syllables.
Content Validity
Content validity refers to the extent to which a measure of some variable samples a domain, such as a field of knowledge or a set of job behaviors. In the field of testing, for example, content validity is typically used to evaluate achievement tests. In the human factors field, this type of validity would apply to such circumstances as measuring the performance of air traffic controllers. To have content validity, such a measure would have to include the various facets of the controllers’ performance rather than just a single aspect.
Construct Validity
Construct validity refers to the extent to which a measure is really tapping the underlying “construct” of interest (such as the basic type of behavior or ability in question). In our Knights of the Roundtable example, accuracy in shooting arrows at stationary targets would have only slight construct validity as a measure of actual combat skill. This is depicted in Figure 2-1, which shows the overlap between the construct (combat skill) and the measure (shooting accuracy with stationary targets). The small overlap denotes low construct validity because the measure taps only a few aspects of the construct. Construct validity is based on a judgmental assessment of an accumulation of empirical evidence regarding the measurement of the variable in question.
Freedom from Contamination
A criterion measure should not be influenced by variables that are extraneous to the construct being measured. This can be seen in Figure 2-1. In our example of the knights, wind conditions, illumination, and quality of the arrows could be sources of contamination because they could affect accuracy yet are unrelated to the concept being measured, namely combat skill.
Sensitivity
A criterion measure should be measured in units that are commensurate with the anticipated differences one expects to find among subjects. To continue with our example of the knights, if the distance off target were measured to the nearest yard, it is possible that few, if any, differences between the knights’ performance would have been found. The scale (to the nearest yard) would have been too gross to detect the subtle differences in skill between the archers.
Another example would be using a 3-point rating scale (uncomfortable-neutral-comfortable) to rate various chairs, all of which are basically comfortable. People may be able to discriminate between comfortable chairs; that is, some are more comfortable than others. With the 3-point scale given, however, all chairs would be rated the same: comfortable. If a 7-point scale were used with various levels of comfort (a more sensitive scale), we would probably see differences between the ratings of the various chairs. An overly sensitive scale, however, can sometimes decrease reliability.
HUMAN RELIABILITY
Human reliability is inextricably linked to human error, a topic we discuss further in Chapter 20. As Meister (1984) points out, human reliability has been used to refer to a methodology, a theoretical concept, and a measure. As a methodology, human reliability is a procedure for conducting a quantitative analysis to predict the likelihood of human error. As a theoretical concept, human reliability implies an explanation of how errors are produced. As a measure, human reliability is simply the probability of successful performance of a task or an element of a task by a human. As such, it is the same as systems reliability, discussed in Chapter 1. Human reliability, the measure, is expressed as a probability. For example, if the reliability of reading a particular display is .992, then out of 1000 readings we would expect 992 to be correct and 8 to be in error. It is worth noting that human reliability is not the same as the reliability we discussed as a requirement for criterion measures.
Interest in human reliability began in the 1950s when there was a desire to quantify the human element in a system in the same way that reliability engineers were quantifying the hardware element. Back then, human factors activities, especially in the aerospace industry, were often part of the reliability, or quality assurance, divisions of the companies. Thus it is understandable that human factors folks would want to impress their hosts and contribute information to already existing analytical methods used by their cohorts.
In 1962, the first prototype human reliability data bank (now we call them data bases) was developed and called, appropriately, the Data Store. Work has continued in the area on a relatively small scale by a handful of people. The Three Mile Island nuclear power plant incident renewed interest in human reliability, and considerable work was done to develop data and methodologies for assessing human reliability in the nuclear industry.
Recently, a few good overviews of the field of human reliability have appeared (Meister, 1984; Dhillon, 1986; Miller and Swain, 1987) which can be referred to for a more detailed discussion of the topic.
Data Bases and Methodologies
Over the years, there have been several attempts to develop data bases of human reliability information. Topmiller, Eckel, and Kozinsky (1982) reviewed nine such attempts. Although these data bases represent formidable efforts, their usefulness is rather limited. For the most part, the data bases are small and the tasks for which data exist tend to be manual in nature, such as operation of a control or repair of a component. But as Rasmussen (1985) points out, modern technology has removed many repetitive manual tasks and has given humans more supervisory and trouble-shooting tasks to perform. Human errors are now more related to decision making and problem solving than in the past. The human reliability data bases currently available simply do not deal adequately with such situations.
In addition to human reliability data bases, there are also methodologies for determining human reliability. Probably the most well-developed methodology is the technique for human error rate prediction (THERP). THERP has been developed in recent years mainly to assist in determining human reliability in nuclear power plants (Swain and Guttmann, 1983). THERP is a detailed procedure for analyzing a complex operation and breaking it down into smaller tasks and steps. Diagrams are used to depict the relationships between tasks and steps leading to successful and unsuccessful performance. Probabilities of successful performance, derived from empirical data and expert judgment, are assigned to the tasks and steps and are combined using rules of logic to determine the human reliability for performing the complex operation being analyzed.
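The full THERP procedure uses event trees and corrections for dependence between steps, but its core combination rule for independent sequential steps is simple multiplication of success probabilities. A hypothetical sketch (the step names and probabilities are invented for illustration):

```python
from math import prod

# Hypothetical per-step success probabilities for a simple serial procedure.
# THERP would derive such values from empirical data and expert judgment.
step_reliability = {
    "read display": 0.992,
    "select control": 0.995,
    "operate control": 0.990,
}

# When every independent step must succeed, reliabilities multiply.
task_reliability = prod(step_reliability.values())
print(f"task reliability: {task_reliability:.4f}")  # → 0.9772
```

Note how quickly reliability erodes: three steps, each better than 99 percent reliable, still yield a task that fails more than 2 times in 100.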
A quite different approach for determining human reliability is used by stochastic simulation models (Siegel and Wolf, 1969). In essence, a computer is used to simulate the performance of a task or job. The computer performs the job over and over, sampling from probabilistic (i.e., stochastic) distributions of task element success and failure, and computes the number of times the entire task was successfully completed. Of course, the models and procedures are far more complex than depicted here. The advantage of using computer simulations is that one can vary aspects of the task and assess the effects on reliability. For example, Siegel, Leahy, and Wiesen (1977), in a study of a sonar system, found that reliability for the simulated operators varied from .55 to .94 as the time the operators were permitted to do the task rose from 23 to 25 min. Here is an example in which a couple of minutes one way or the other could have profound effects on system performance. The results obtained from using such models, of course, are only as good as the models themselves and the data upon which the models are based.
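The stochastic simulation idea can be illustrated with a much simpler Monte Carlo sketch. Models such as Siegel and Wolf's are far more elaborate; the task structure and step probabilities below are invented for illustration:

```python
import random

def simulate_task(step_probs, trials=100_000, seed=42):
    """Monte Carlo estimate of task reliability: on each simulated run,
    the task succeeds only if every step succeeds."""
    rng = random.Random(seed)
    successes = sum(
        all(rng.random() < p for p in step_probs)  # sample each step
        for _ in range(trials)
    )
    return successes / trials

# Hypothetical success probabilities for a three-step task.
estimate = simulate_task([0.95, 0.98, 0.99])
print(f"estimated task reliability: {estimate:.3f}")
```

For independent steps the estimate converges on the analytic product (0.95 × 0.98 × 0.99 ≈ 0.922); the value of simulation appears once steps depend on time pressure, fatigue, or one another, where no closed-form product exists.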
Criticisms of Human Reliability
Despite all the effort at developing human reliability data banks and assessment methodologies, the majority of human factors specialists appear quite uninterested in the subject. Meister (1985) believes some of the problem is that the association of reliability with engineering is distasteful to some behavioral scientists and that some people feel behavior is so variable from day to day that it is ludicrous to assign a single point estimate of reliability to it.
Adams (1982) raises several practical problems with the concept of human reliability. Among them is that not all errors result in failure, and presently this problem is not handled well by the various techniques. Regulinski (1971) believes that point estimates of human reliability are inappropriate for continuous tasks, such as monitoring or tracking, and that tasks of a discrete nature may not be amenable to classical engineering reliability modeling. Williams (1985) points out that there has been little or no attempt made by human reliability experts to validate their various methodologies. Meister (1985), however, contends that although our estimates may not be totally accurate, they are good approximations and hence have utility.
Some scientists object to the illusion of precision implicit in reliability estimates down to the fourth decimal (for example, .9993) when the data upon which they were based come nowhere near such levels of precision. There is also concern about the subjective components that are part of all human reliability techniques. Most of the reliability data used were originally developed by expert judgment, that is, subjectively. In addition, the analyst must interject personal judgment into the reliability estimate to correct it for a particular application. And, of course, the biggest criticism is that the data are too scanty and do not cover all the particular applications we would like to see covered. Meister (1985), however, believes we cannot hibernate until some hypothetical time when there will be enough data; and as any human factors person will tell you, there are never enough data.
In this chapter we stressed the empirical data base of human factors. Research of all types plays a central role in human factors, and without it human factors would hardly be considered a science. A major continuing concern is the measurement of criteria to evaluate systems and the effects of independent variables. We outlined the major types of criteria used in human factors, with special reference to human reliability, and some of the requirements we would like to see in our measurement procedures. For some people, research methodology and statistics are not very exciting subjects. If these are not your favorite topics, think of them as a fine wine; reading about it cannot compare to trying the real thing.
Thank you for reading “Materi Lengkap Ergonomi Teknik Industri, Lengkap dengan Contohnya”; we hope it was useful.
Source: lecture notes from UMG