Does the SmartItem Work?

CASE STUDY EVIDENCE

Early Adopters and The SmartItem Impact

Background

SmartItem™ technology solves several critical validity issues that plague high-stakes testing programs today, including testwiseness, effective cheating, and the theft of exam content. This paper presents case-study evidence in support of that claim and answers many frequently asked questions about the SmartItem in practice. These questions include:

  • How do SmartItems perform?
  • What are their item statistics?
  • Do they contribute to the reliability of a test?
  • Do they add to the validity of a test? (That is, do they improve the ability to make valid inferences from test scores?)
  • Have they been tried out on students or certification candidates?
  • Have they been tried out on actual test takers in high-stakes exams?
  • What evidence do you have for the claims?

This paper will answer these questions through evidence-based research.

To be “evidence-based” is to provide evidence that indicates what you are claiming actually works. There is no set standard on how much evidence is enough, and for new innovations, not as much evidence is expected. Instead, the evidence must be compelling and interesting enough to encourage a further look.

SmartItem technology is still new. While some research has been done, more will be needed. Utilizing SmartItems solves important problems but also raises many new questions about the way “things have always been done.” Academicians, students, and practitioners from all areas of testing must join the effort to learn more about this innovation. New ideas and new evidence are always welcome.

This paper contains descriptions of two case studies. Each covers a different certification exam that was developed and published by the information technology company, SailPoint.

It would be helpful at this point to describe the difference between a case study and a scientific experiment. A case study is a set of conditions that are observed rather than controlled. As a result, a case study cannot determine cause and effect or rule out alternative explanations for what is observed. On the other hand, an experiment uses procedures to control for extraneous influences so the actual cause and effect can be seen and confidently concluded (at least within a small degree of acceptable statistical error).

While experiments are generally preferred, they are not always possible. One benefit of case studies is they allow us to study operational exams in a real-world setting. This is what we will see in the two case studies provided in this paper.

The Case Studies

The two case studies in this paper describe the development and implementation of certification exams for the information technology company SailPoint. SailPoint specializes in identity governance security and is therefore a natural advocate of test security. As such, it wanted to avoid the test security problems plaguing other IT certification programs. Specifically, SailPoint wished to prevent the rampant theft and online publication of test questions that often occur within days or weeks of a certification exam’s publication. An entire “braindump” industry has been spawned and supported by this type of easy, low-risk test fraud. As a result, many unqualified examinees are able to cheat on IT certification exams using information they find on the Internet.

Hoping to avoid this scenario, SailPoint welcomed the opportunity to try out SmartItem technology. The first exam (Case Study #1) was published in June of 2018 and is called the “IdentityIQ Engineer Exam.” The second exam (Case Study #2) was published in December of 2018 and is called the “IdentityIQ Architect Exam.” Both certification exams used SmartItem technology exclusively.

SmartItem case study evidence

Case Study #1

SailPoint IdentityIQ Engineer Exam

YOU-CAN-SKIP-THE-NEXT-PART SUMMARY

SailPoint used SmartItem technology exclusively for its two certification exams published in 2018. The company was attracted to SmartItem technology primarily for its security and cost-saving benefits. From the outset, the team learned that creating an exam with SmartItems was more of a design effort than a writing one. To support this effort, item writers were divided into small teams, each of which included a coder.

During the item development process, the teams created a significant number of SmartItems. Code, along with other pieces of item content, was reviewed continually and fixed when problems were discovered. Field tests of the items for the two exams went well, and a cutscore was set based on the collected data. Statistical analyses for the exams revealed the SmartItems to be high-quality items that ranged in difficulty and other properties. The large majority met the usual psychometric quality control criteria required to serve on the exams.

Since the tests went operational, the SmartItem technology has performed well, with item quality varying about as much as it does on certification exams made up of traditional items. Test-level reliability and validity evidence supported the use of SmartItems to produce test scores for high-stakes certification decisions.

Introduction

SailPoint began its certification program in 2018. The mission of the program was similar to that of every other technology-based certification program: to certify candidates who have sufficient skills to support the company’s technology. Instead of creating exams using traditional methods—with static items and multiple forms—the SailPoint program leadership decided to utilize SmartItem technology. This decision was made primarily for security reasons, but also for cost savings and other benefits. The goal for the exam development process was to create a single form entirely of SmartItems, with one SmartItem per identified skill. The skills, referred to as “objectives” by the SailPoint certification team, were determined using standard job analysis methodology.

Originally, SailPoint planned to create 170 traditional multiple-choice (MC) items in a single test development workshop to support two forms of the test. Once the decision was made to use SmartItem technology rather than traditional items, the structure of the workshop changed. Instead of 170 items, the team decided to create one, or occasionally two, SmartItems per objective (depending on the complexity of the objective). Since multiple forms are unnecessary with SmartItems,¹ it was also decided to create only a single test form with a target of around 65 SmartItems. If an objective needed to be weighted more heavily on the test, the SmartItem congruent with that objective could simply be presented on the form more than once.

Methodology
JOB ANALYSIS AND BLUEPRINTING

A job analysis was conducted by interviewing SailPoint subject matter experts. From these interviews, a list of objectives was created and then weighted for their perceived importance. There were 62 total objectives.

ITEM DEVELOPMENT

A total of 64 SmartItems were created and vetted for the objectives for the IdentityIQ Engineer exam.² The SmartItems used a mixture of item formats. Figure 1 shows the breakdown of item types used as SmartItems. To avoid the validity issues associated with testwiseness, no traditional MC items were used.

Each SmartItem was developed by a team of two or three SMEs, plus an individual with coding skills. (At the time, Caveon’s GUI-based tool was still in development, so coders were necessary to develop SmartItems.) The coder provided input to the SMEs on whether the design of their item was both functional and easy to code, thereby maximizing efficiency. The coder then implemented the SMEs’ design in Scorpion, Caveon’s exam development and delivery platform.

Item Types as SmartItems for Case Study #1

  DOMC™ (SuperDOMC)        27
  DOMC (Coded)³            27
  Build List (Coded)        7
  Matching (Coded)          2
  Short Answer (Coded)      1
  TOTAL:                   64

Figure 1
The breakdown of item types used as SmartItems for Case Study #1.

Once a draft of the SmartItem was complete, the SmartItem was “previewed”⁴ a sufficient number of times to determine whether the item was functioning as designed or needed revisions. This review process differed from what is typically done for traditional items because the SmartItems appear differently each time they are rendered. This process was more similar to a software quality assurance check than to a conventional item review. SmartItems that failed to pass this quality check were discarded or revised and then previewed again. Those that passed this quality step were compiled together for a field test.
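To give a concrete flavor of this kind of quality check, the sketch below renders a small, invented SmartItem many times and asserts a few basic rules on every rendering. The item content, the render_smartitem function, and the specific checks are hypothetical illustrations of the general idea, not Caveon’s actual Scorpion tooling or review procedure.

```python
import random

# A hypothetical SmartItem: a stem template plus pools of correct and incorrect
# options keyed to the sampled variable value. (Illustrative only; not the
# format used in Caveon's Scorpion platform.)
SMARTITEM = {
    "stem": "Which command restarts the {service} component?",
    "variables": {"service": ["web server", "task scheduler", "provisioning engine"]},
    "options": {
        "web server": {"correct": ["restart-web"], "incorrect": ["restart-db", "reload-config"]},
        "task scheduler": {"correct": ["restart-scheduler"], "incorrect": ["restart-web", "flush-queue"]},
        "provisioning engine": {"correct": ["restart-provisioner"], "incorrect": ["restart-web", "restart-db"]},
    },
}

def render_smartitem(item):
    """Sample a variable value and assemble one rendering of the SmartItem."""
    service = random.choice(item["variables"]["service"])
    pools = item["options"][service]
    options = [(text, True) for text in pools["correct"]]
    options += [(text, False) for text in random.sample(pools["incorrect"], k=2)]
    random.shuffle(options)
    return {"stem": item["stem"].format(service=service), "options": options}

def preview(item, n=500):
    """Render the item many times and flag renderings that break basic rules."""
    for _ in range(n):
        rendering = render_smartitem(item)
        texts = [text for text, _ in rendering["options"]]
        assert len(set(texts)) == len(texts), "duplicate option text in a rendering"
        assert sum(correct for _, correct in rendering["options"]) == 1, \
            "each rendering should key exactly one correct option"

preview(SMARTITEM)
print("All previewed renderings passed the basic checks.")
```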

FIELD TEST

The field test, often referred to as a “beta test” or an item “pre-test,” was conducted to obtain an empirical measure of SmartItem quality. Were the SmartItems too easy or too difficult? Did they discriminate well among test takers? 

Several dozen individuals were recruited to take all 64 SmartItems that made up the field test exam. The participants were recruited to span a range of ability in order to:

  • Obtain stable estimates of item performance metrics, and
  • Help determine the pass/fail standard for the final exam.

Based on field test results, 49 out of the 64 SmartItems were deemed to be of sufficient psychometric quality to serve on the actual certification exam. The issues with the remaining SmartItems were typical of those found in all test development projects. Some SmartItems were too difficult or too easy. A few had low correlations with the total test score.
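These screening decisions rest on standard classical item statistics. As a rough illustration only, the sketch below computes each item’s difficulty (proportion correct) and corrected item-total correlation from a scored 0/1 response matrix and flags items that fall outside some screening thresholds. The simulated responses and the specific cutoff values are invented for the example; they are not the criteria used for the SailPoint exam.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scored responses: rows = field-test participants, columns = SmartItems
# (1 = correct). Simulated so higher-ability participants answer more items correctly.
ability = rng.normal(0, 1, size=(60, 1))
responses = (rng.normal(0, 1, size=(60, 64)) < ability + 0.5).astype(int)

def item_analysis(scores, min_p=0.10, max_p=0.95, min_disc=0.10):
    """Return per-item difficulty, corrected item-total correlation, and a keep/flag decision."""
    total = scores.sum(axis=1)
    results = []
    for j in range(scores.shape[1]):
        item = scores[:, j]
        difficulty = item.mean()                        # proportion correct (classical p-value)
        rest = total - item                             # total score excluding this item
        discrimination = np.corrcoef(item, rest)[0, 1]  # corrected item-total correlation
        keep = (min_p <= difficulty <= max_p) and (discrimination >= min_disc)
        results.append((j, difficulty, discrimination, keep))
    return results

for j, p, r, keep in item_analysis(responses)[:5]:
    print(f"item {j:2d}: difficulty = {p:.2f}  discrimination = {r:+.2f}  {'keep' if keep else 'flag'}")
```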

All of the remaining 15 items were carefully reviewed and revised, and then included in the certification exam as unscored questions to collect data on the new versions. It was hoped that these repaired SmartItems might eventually perform well enough to be included on the scored portion of the exam.

Validity Coefficient for Case Study #1 (Field Test)

Which choice best describes your experience in the SailPoint IdentityIQ Engineer Role?

  • Minimal
  • Competent
  • Expert

r = 0.304, df = 248, p = .00004

Figure 2
The survey question and its validity coefficient from the field test.

VALIDITY EVIDENCE

The field test also included survey questions. One question asked the candidates to indicate their proficiency in the SailPoint skills covered by the exam. Their responses were then correlated with their test scores, providing empirical evidence of validity. Figure 2 shows the survey question and its validity coefficient.
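For readers who want to run the same kind of check on their own data, the sketch below correlates a self-rating (coded 1 = Minimal, 2 = Competent, 3 = Expert) with total test scores and reports r, the degrees of freedom, and a p-value. The data here are simulated for illustration; the figures reported in Figure 2 come from the actual SailPoint field test.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 250  # a Pearson correlation has df = n - 2, so n = 250 corresponds to df = 248

# Hypothetical data: a 1-3 self-rating and a total test score that tends to rise with it.
self_rating = rng.integers(1, 4, size=n)
test_score = 30 + 5 * self_rating + rng.normal(0, 8, size=n)

r, p_value = pearsonr(self_rating, test_score)
print(f"validity coefficient r = {r:.3f}, df = {n - 2}, p = {p_value:.5f}")
```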

RELIABILITY EVIDENCE

The Cronbach’s Alpha reliability coefficient, calculated using only the data from the 49 scored SmartItems, was α = .75.
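Cronbach’s alpha is computed from the individual item variances and the variance of the total score. The short sketch below shows that calculation for a 0/1 scored response matrix; the simulated data are for illustration only and are not intended to reproduce the SailPoint result.

```python
import numpy as np

def cronbach_alpha(scores):
    """Alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of scored items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item across examinees
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of examinee total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 0/1 responses: 60 examinees by 49 scored items, simulated so that
# higher-ability examinees answer more items correctly.
rng = np.random.default_rng(2)
ability = rng.normal(0, 1, size=(60, 1))
responses = (rng.normal(0, 1, size=(60, 49)) < ability).astype(int)
print(f"alpha = {cronbach_alpha(responses):.2f}")
```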

SETTING THE CUTSCORE

The Contrasting Groups method (Zieky, 2001) was used to set the cutscore for this exam, using data from the field test. Field test participants were pre-sorted based on judgments by the SailPoint management team and other supervisors regarding each participant’s capability. Based on these judgments, the participants were divided into three groups:

  • Those who were expected to pass the exam
  • Those whose expected outcome was unclear, and
  • Those who were not expected to pass the exam

After the SmartItems were evaluated and the final 49 SmartItems selected, the performance of the individuals was re-scored based on those 49 items. The score distributions for the three groups were plotted and compared. A cutscore that minimized classification errors was set.
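In outline, the Contrasting Groups method picks the score point that best separates the group judged ready to pass from the group judged not ready. The sketch below illustrates a simplified two-group version that scans candidate cutscores and counts misclassifications (a member of the “should pass” group scoring below the cut, or a member of the “should not pass” group scoring at or above it). The score lists are invented, and the “unclear” middle group from the actual study is set aside here for simplicity.

```python
# Hypothetical field-test scores (number correct out of 49), grouped by supervisor judgment.
should_pass = [38, 41, 35, 44, 39, 36, 42, 40, 33, 37]
should_not_pass = [25, 29, 31, 22, 27, 30, 24, 28, 26, 32]

def contrasting_groups_cutscore(pass_group, fail_group):
    """Return the cutscore that minimizes total classification errors between the two groups."""
    best_cut, best_errors = None, float("inf")
    for cut in range(min(fail_group), max(pass_group) + 2):
        errors = sum(score < cut for score in pass_group)    # qualified examinees who would fail
        errors += sum(score >= cut for score in fail_group)  # unqualified examinees who would pass
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors

cut, errors = contrasting_groups_cutscore(should_pass, should_not_pass)
print(f"cutscore = {cut} correct, misclassifications = {errors}")
```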

Field test participants were told in advance that their score, based on the qualified SmartItems, would be evaluated against an empirically derived cutscore following the field test. It was assumed that the motivation these individuals would feel to obtain the certification would produce test-taking behavior similar to that of candidates taking the operational exam in the future.

Performance of the Operational Exam
SMARTITEM DISCLOSURES

One year after the exam had been released and made available to candidates, there still had not been any public disclosures of exam content on the Internet. This was confirmed by extensive web patrolling efforts throughout the life of the exam. This result is very unusual for IT-based certification programs, as test content is usually stolen and disclosed within days.

PSYCHOMETRIC PERFORMANCE

With over 375 tests administered as of the authoring of this case study, the SailPoint IdentityIQ Engineer Exam continues to perform well. Similar to the field test results, the items show a typically wide range of item difficulty and item discrimination values.

VALIDITY EVIDENCE

Like the field test, the operational test also included survey questions. One of those questions asked the candidates to indicate their proficiency in the SailPoint skills covered by the exam. Their responses were then correlated with their test scores, providing initial empirical evidence of validity. Figure 3 shows the survey question and its validity coefficient.

RELIABILITY EVIDENCE

The Cronbach’s Alpha reliability coefficient, calculated using the data from the 55 scored SmartItems used on the operational exam, was α = .74.

Validity Coefficient for Case Study #1 (Operational Exam)

Which choice best describes your experience in the SailPoint IdentityIQ Engineer Role?

  • Minimal
  • Competent
  • Expert

r = 0.24

Figure 3
The survey question and its validity coefficient from the operational exam for Case Study #1.

Case Study #2

SailPoint IdentityIQ Architect Exam

YOU-CAN-SKIP-THE-NEXT-PART SUMMARY

Case Study #2 describes the second exam published by SailPoint, this one titled the “IdentityIQ Architect Exam.” The experience, including the positive outcomes, was similar to the first exam except for one main difference: for this exam, we changed the SmartItem review process to improve our ability to evaluate each SmartItem.

A new “mapping” phase was added to the development process. In this phase, the SMEs created the item variables and values⁵ for a SmartItem and indicated the relationships between them. This helped later reviewers understand how the SmartItem was intended to function and evaluate whether it was functioning properly.

Introduction

SailPoint began the planning and development process for its second certification exam—the SailPoint IdentityIQ Architect Exam—a bit differently than it had for the first exam. From the outset of the planning process, the goal was to create an exam completely of SmartItems. Given the experience from the initial exam, a small change was made in the SmartItem writing process.

The development team leaders refined the step that takes place between an SME designing a SmartItem and the actual coding of the SmartItem. For each SmartItem, the SME designers/writers created a stronger “road map” that included the stem, all variables, all option sets, and any extra guidance for the coder on the relationships between these components. This road map was archived in the development software so that anyone at any stage of the exam development process could view the SMEs’ original intentions for each item.
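As a concrete illustration of what such a road map might capture, the sketch below lays out a hypothetical SmartItem as a stem template, its variables and values, its option sets, and a note on how they relate, plus a small consistency check that could run before the item is handed to a coder. The structure and field names are invented for illustration and are not Scorpion’s actual format.

```python
# A hypothetical "road map" for one SmartItem: the stem, the variables and their
# values, the option sets, and the designers' notes on how they relate.
# (Field names and content are invented for illustration only.)
ROAD_MAP = {
    "objective": "Configure an application connector",
    "stem": "Which configuration file controls the {connector} connector's schedule?",
    "variables": {
        "connector": ["Active Directory", "LDAP", "ServiceNow"],
    },
    "option_sets": {
        "Active Directory": {"correct": "ad-schedule.xml", "incorrect": ["ad-mappings.xml", "ldap-schedule.xml"]},
        "LDAP": {"correct": "ldap-schedule.xml", "incorrect": ["ldap-filters.xml", "ad-schedule.xml"]},
        "ServiceNow": {"correct": "snow-schedule.xml", "incorrect": ["snow-auth.xml", "ad-schedule.xml"]},
    },
    "relationships": "The correct option must match the sampled connector; distractors come "
                     "from other connectors or from unrelated files for the same connector.",
}

def check_road_map(road_map):
    """Confirm every variable appears in the stem and every value has an option set."""
    for variable, values in road_map["variables"].items():
        assert "{" + variable + "}" in road_map["stem"], f"stem never uses variable '{variable}'"
        for value in values:
            assert value in road_map["option_sets"], f"no option set defined for '{value}'"

check_road_map(ROAD_MAP)
print("Road map is internally consistent and ready for coding.")
```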

With one SmartItem exam written, the development team was able to provide more examples of SmartItems to help the SME designers and coders envision and create new items. This made it easier for the SMEs to learn the process and think creatively about designing items to cover a specific objective.

Methodology
JOB ANALYSIS AND BLUEPRINTING

A job analysis was conducted in a manner similar to Case Study #1. A total of 67 objectives were identified.

ITEM DEVELOPMENT

Based on the objectives of the IdentityIQ Architect Exam, 67 SmartItems were eventually created. The SmartItems used a mixture of item formats. Figure 4 shows the breakdown of item types used as SmartItems. As with the original exam, no multiple-choice (MC) items were used. Each SmartItem was developed by a team of two or three SMEs who designed and crafted the SmartItem. A coder was assigned to each team to create the items in Scorpion, Caveon’s exam development and delivery platform. The writing process was enhanced with a “mapping” phase where SMEs documented their variables and logic behind each SmartItem. This additional step helped subsequent reviewers understand the intended function of the items.

Item Types as SmartItems for Case Study #2

  DOMC™ (SuperDOMC)        35
  DOMC (Coded)             28
  Build List (Coded)        3
  Matching (Coded)          1
  TOTAL:                   67

Figure 4
The breakdown of item types used as SmartItems for Case Study #2.

FIELD TEST

As with the first case study, a field test of the 67 items was conducted. Participants were recruited whose abilities spanned a range of competency on the content comprising the IdentityIQ Architect Exam. As before, the field test not only evaluated item performance empirically, but also provided the data to set the cutscore. Based on field test results, 61 of the 67 total SmartItems were deemed to be of sufficient psychometric quality to use on the actual certification exam. The remaining six SmartItems were revised and then included on the certification exam as unscored items in order to collect new data.

Validity Coefficients for Case Study #2

Have you previously taken any IdentityIQ Implementer courses offered by SailPoint?

  • Yes
  • No

r = 0.49

Which choice best describes your experience in the SailPoint IdentityIQ Architect role?

  • Minimal
  • Competent
  • Expert

r = 0.51

Figure 5
The survey questions and validity coefficients for Case Study #2.

VALIDITY EVIDENCE

The operational test also included survey items. Two of those survey items asked the candidates questions related to their proficiency in the content of the exam. Their responses were then correlated with their test scores, providing empirical evidence of validity. Figure 5 shows the survey questions and validity coefficients.

RELIABILITY EVIDENCE

The Cronbach’s Alpha reliability coefficient, calculated using only the data from the 61 scored SmartItems, was α = .82.

SETTING THE CUTSCORE

As with Case Study #1, the Contrasting Groups method was used to set the cutscore for this exam. Field test participants were pre-sorted based on the SailPoint management team’s firsthand experience with each participant. They were then divided into two groups:

  • Those who were expected to pass the exam, and
  • Those who were not expected to pass the exam.

After the SmartItems were evaluated and the 61 final SmartItems were selected, the performance of the individuals was re-scored based on those 61 items. The score distributions for the two groups were plotted and compared. A cutscore was set that minimized classification errors.

Case study evidence
Initial Conclusions

What did we learn from the two case studies? What are some of the initial insights from using SmartItem technology on actual high-stakes information technology certification exams? Here are some initial conclusions:

  • Using SmartItem technology reduces initial item development costs.⁶ It is also likely that SmartItems will reduce long-term maintenance costs because they do not need to be replaced.
  • With the addition of a unique step or two,⁷ SmartItems can be created easily in a typical item development workshop environment. Developing the SmartItems took no more time than a traditional workshop.
  • Multiple-choice (MC) items, either as traditional MC items or as MC-based SmartItems, are generally unnecessary. DOMC is a preferred substitute because it performs as well or better statistically and removes the testwiseness advantages for some test takers.
  • Not all SmartItems require coding. The SuperDOMC format⁸ was sufficient and appropriate for almost half of the skills on the SailPoint exams.
  • SmartItems, when created, can be effectively reviewed for accuracy, bias, etc.
  • SmartItems can use several varieties of item formats: DOMC, SuperDOMC, Build List, Matching, and Short Answer.
  1. DOMC: Some DOMC items employed code to display the item content.
  2. SuperDOMC: No code was used. Instead, a large number of options were produced.
  3. Build List: Examinees dragged elements from a large list to build a new list, then re-arranged that list according to the instructions in the item.
  4. Matching: This Matching format required the examinees to match two lists using drop-down menus.
  5. Short Answer: Examinees were required to type their answer in the response box provided using their keyboard.
  • SmartItems performed well statistically in both the field tests and the operational tests.
  • SmartItems, like traditional items, may be constructed poorly. Weak items are identified through field testing and statistical analysis.
  • A test composed entirely of SmartItems can produce strong evidence of reliability and validity.
  • SmartItems can be used as part of the process to set cutscores using the Contrasting Groups method.

Conclusion

It is clear from these two case studies that SmartItem technology is a viable alternative to fixed-content items. Replacing traditional items with SmartItems enhances security and lowers test development costs. In addition, the influence of testwiseness, which benefits some test takers to the detriment of others, is virtually eliminated when DOMC is part of the SmartItem design. Test takers adapted easily to the new DOMC format and to the variable nature of SmartItem renderings.

For SailPoint, the experience from the first two exams has encouraged the program to continue using the innovative methodology for its third exam, published in the fall of 2019.

With SailPoint’s support, SmartItem technology is no longer just an innovative idea. These case studies show SmartItems moving from a promising innovation to actual use on high-stakes tests.

Footnotes:

  1. The purpose of multiple forms is to provide candidates with a different form when re-testing takes place, as it isn’t advisable to provide the same form a second time. In addition, a second form would make it more difficult to cheat even if incomplete pre-knowledge were obtained from a braindump site. However, these security measures are unnecessary when SmartItems are used, as each test taker experiences a form different from everyone else’s, and different from any form he or she would see on a second, third, or subsequent attempt.
  2. When objectives describe complex skills, it may be reasonable to create more than one SmartItem for the objective. The decision to do this a couple of times resulted in the number of SmartItems (n=64) being higher than the number of objectives (n=62).
  3. The term “coded” used here indicates programming code was used in the development of some SmartItems. Other SmartItems, which are always selected-response items, do not use code, but instead use a much larger number of options.
  4. Previewing an item means the SmartItem was “run” in the same way it would appear and function on an exam for an actual test taker. The SmartItem reviewers would be able to answer and score the SmartItem as part of this step. By doing this a number of times with no apparent issues, the reviewers can be confident that the SmartItem would perform similarly on the operational exam.
  5. An example would probably be helpful here. A variable could be “mammals” and the values would be a list of specific mammals, such as bear, dog, llama, etc. The list of values could number in the hundreds, thousands, or more.
  6. Given that this case study describes one of the initial pioneering efforts, it isn’t clear how much savings will be gained by additional experience in designing and producing SmartItems.
  7. SmartItem development has unique steps, including a more complex design phase, along with innovative item review steps.
  8. SuperDOMC items are items where the number of options, both correct and incorrect, exceeds the typical number of three to five. SuperDOMC items commonly use dozens or hundreds of options.

References:

Zieky, M. J. (2001). So much has changed: How the setting of cutscores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: concepts, methods, and perspectives (pp. 19–51). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
