*Bruce Thompson,*

Texas A & M

Too few researchers understand what statistical significance testing does and doesn't do, and consequently their results are misinterpreted. Even more commonly, researchers understand elements of statistical significance testing, but the concept is not integrated into their research. For example, the influence of sample size on statistical significance may be acknowledged by a researcher, but this insight is not conveyed when interpreting results in a study with several thousand subjects.

This article will help you better understand the concept of significance testing. The meaning of probabilities, the concept of statistical significance, arguments against significance testing, misinterpretation, and alternatives are discussed.

**WHAT ARE THOSE PROBABILITIES IN STATISTICAL SIGNIFICANCE TESTING?**

Researchers may invoke statistical significance testing whenever they have a random sample from a population, or a sample that they believe approximates a random, representative sample. Statistical significance testing requires subjective judgment in setting a predetermined acceptable probability (ranging between 0 and 1.0) of making an inferential error caused by the sampling error (getting samples with varying amounts of "flukiness") inherent in sampling. Sampling error can only be eliminated by gathering data from the entire population.

One probability (p), the probability of deciding to reject a null hypothesis (e.g., a hypothesis specifying that $\text{Mean}_1 = \text{Mean}_2 = \text{Mean}_3$, or $R^2 = 0$) when the null hypothesis is actually true in the population, is called "alpha," and also $p_{\text{(CRITICAL)}}$. When we pick an alpha level, we set an upper limit on the probability of making this erroneous decision, called a Type I error. Therefore, alpha is typically set small, so that the probability of this error will be low. Thus, $p_{\text{(CRITICAL)}}$ is selected based on subjective judgment regarding what the consequences of Type I error would be in a given research situation, and given personal values regarding these consequences.

A second probability, $p_{\text{(CALCULATED)}}$ (which, like all p's, ranges between 0 and 1.0), is calculated. Probabilities can only be calculated in the context of assumptions sufficient to constrain the computations such that a given problem has only one answer.

What's the probability of getting mean IQ scores of 99 and 101 in two sample groups? It depends, first, on the actual statistical parameters (e.g., means) in the populations from which the samples were drawn. These two sample statistics ($\text{Mean}_1 = 99$ and $\text{Mean}_2 = 101$) would be most probable (yielding the highest $p_{\text{(CALCULATED)}}$) if the population means were respectively 99 and 101. These two sample statistics would be less likely (yielding a smaller $p_{\text{(CALCULATED)}}$) if the population means were both 100. Since the actual population parameters are not known, we must assume what the parameters are, and in statistical significance testing we assume the parameters to be correctly specified by the null hypothesis, i.e., we assume the null hypothesis to be exactly true for these calculations.

A second factor that influences the calculation of p involves the sample sizes. Samples (and thus the statistics calculated for them) will potentially be less representative of populations ("flukier") as sample sizes are smaller. For example, drawing two samples of sizes 5 and 5 may yield "flukier" statistics (means, r's, etc.) than two samples of sizes 50 and 50. Thus, the $p_{\text{(CALCULATED)}}$ computations also must (and do) take sample size influences into account. If the two samples both of size 5 had means of 100 and 90, and the two samples both of size 50 also had means of 100 and 90, the test of the null that the means are equal would yield a smaller $p_{\text{(CALCULATED)}}$ for the larger samples, because assuming the null is exactly true, unequal sample statistics are increasingly less likely as sample sizes increase. Summarizing, the $p_{\text{(CALCULATED)}}$ probability addresses the question:

Assuming the sample data came from a population in which the null hypothesis is (exactly) true, what is the probability of obtaining the sample statistics one got for one's sample data with the given sample size(s)?
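The sample-size effect in the 100-versus-90 example can be sketched numerically. The snippet below is only an illustration under stated assumptions, not the computation a real study would report: it assumes a known common population standard deviation of 15 (an IQ-style value chosen purely for illustration) and uses a two-sided z test where an actual analysis would more likely use a t test. The function name `p_calculated` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def p_calculated(mean_1, mean_2, n_1, n_2, sd=15.0):
    """Two-sided p for the null of equal population means (z test, known SD).

    The default sd=15.0 is an assumption made for this illustration.
    """
    standard_error = sd * sqrt(1.0 / n_1 + 1.0 / n_2)
    z = abs(mean_1 - mean_2) / standard_error
    # Probability, assuming the null is exactly true, of a mean
    # difference at least this large in either direction.
    return 2.0 * (1.0 - NormalDist().cdf(z))


p_small = p_calculated(100, 90, 5, 5)    # two groups of 5
p_large = p_calculated(100, 90, 50, 50)  # two groups of 50
print(f"n = 5 per group:  p = {p_small:.4f}")
print(f"n = 50 per group: p = {p_large:.4f}")
```

Under these assumptions the same 10-point difference gives a calculated p of about .29 with groups of 5 and below .001 with groups of 50.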

Even without calculating this p, we can make logical judgments about $p_{\text{(CALCULATED)}}$. In which one of each of the following pairs of studies will the $p_{\text{(CALCULATED)}}$ be smaller?

- In two studies, each involving three groups of 30 subjects: in one study the means were 100, 100, and 90; in the second study the means were 100, 100, and 100.
- In two studies, each comparing the standard deviations (SD) of scores on the dependent variable of two groups of subjects, in both studies $SD_1 = 4$ and $SD_2 = 3$, but in study one the sample sizes were 100 and 100, while in study two the sample sizes were 50 and 50.
- In two studies involving a multiple regression prediction of Y using predictors $X_1$, $X_2$, and $X_3$, and both with sample sizes of 75, in study one $R^2 = .49$ and in study two $R^2 = .25$.
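For the regression pair, the logical judgment can be checked with the standard F statistic for testing the null that the population R squared is 0: F = (R squared / k) / ((1 - R squared) / (n - k - 1)), with k predictors and n subjects. The snippet is a minimal sketch; the function name `f_for_r_squared` is hypothetical. Because both studies have the same degrees of freedom (3 and 71), the study with the larger F has the smaller calculated p.

```python
def f_for_r_squared(r_squared, n, k):
    """F statistic for the null that population R^2 = 0 (k predictors, n cases)."""
    df_model = k
    df_error = n - k - 1
    return (r_squared / df_model) / ((1.0 - r_squared) / df_error)


f_study_one = f_for_r_squared(0.49, n=75, k=3)
f_study_two = f_for_r_squared(0.25, n=75, k=3)
print(f"Study one: F(3, 71) = {f_study_one:.2f}")
print(f"Study two: F(3, 71) = {f_study_two:.2f}")
```

The F for study one is roughly 22.7 and for study two roughly 7.9, so study one yields the smaller calculated p.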

**WHAT DOES STATISTICAL SIGNIFICANCE REALLY TELL US?**

Statistical significance addresses the question:

"Assuming the sample data came from a population in which the null hypothesis is (exactly) true, and given our sample statistics and sample size(s), is the calculated probability of our sample results less than the acceptable limit ($p_{\text{(CRITICAL)}}$) imposed regarding a Type I error?"

When $p_{\text{(CALCULATED)}}$ is less than $p_{\text{(CRITICAL)}}$, we use a decision rule that says we will "reject" the null hypothesis. The decision to reject the null hypothesis is called a "statistically significant" result. All the decision means is that we believe our sample results are relatively unlikely, given our assumptions, including our assumption that the null hypothesis is exactly true.

However, though it is easy to derive $p_{\text{(CRITICAL)}}$, calculating $p_{\text{(CALCULATED)}}$ can be tedious. Traditionally, test statistics (e.g., F, t, $\chi^2$) have been used as equivalent (but more convenient) reexpressions of p's, because $\text{Test Statistics}_{\text{(CALCULATED)}}$ are easier to derive. The $TS_{\text{(CRITICAL)}}$ exactly equivalent to a given $p_{\text{(CRITICAL)}}$ can be derived from widely available tables; the tabled value is found given alpha and the sample size(s). Different $TS_{\text{(CALCULATED)}}$ are computed depending on the hypothesis being tested. The only difference in invoking test statistics in our decision rule is that we reject the null (called "statistically significant") when $TS_{\text{(CALCULATED)}}$ is greater than $TS_{\text{(CRITICAL)}}$. However, comparing p's and TS's for a given data set will always yield the same decision.
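The equivalence of the two decision rules can be sketched with a two-sided z test. The z test is used here only because the standard normal distribution ships with Python's standard library; it is an assumption of this illustration, and the same logic applies to F, t, and chi-square statistics. The function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def decide_by_p(z_calculated, alpha):
    """Reject the null when the calculated p is below the critical p (alpha)."""
    p_calc = 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z_calculated)))
    return p_calc < alpha


def decide_by_test_statistic(z_calculated, alpha):
    """Reject the null when |z calculated| exceeds the critical z."""
    z_critical = STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return abs(z_calculated) > z_critical


alpha = 0.05
for z in (0.50, 1.50, 2.50, 3.50):
    by_p = decide_by_p(z, alpha)
    by_ts = decide_by_test_statistic(z, alpha)
    print(f"z = {z:.2f}: reject by p? {by_p}; reject by test statistic? {by_ts}")
```

For each z value the two rules agree: the comparison simply moves from the probability scale to the test-statistic scale, and the direction of the inequality flips.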

Remember, knowing sample results are relatively unlikely, assuming the null is true, may not be helpful. An improbable result is not necessarily an important result, as Shaver (1985, p. 58) illustrates in his hypothetical dialogue between two teachers:

Chris: ...I set the level of significance at .05, as my thesis advisor suggested. So a difference that large would occur by chance less than five times in a hundred if the groups weren't really different. An unlikely occurrence like that surely must be important.

Jean: Wait a minute. Remember the other day when you went into the office to call home? Just as you completed dialing the number, your little boy picked up the phone to call someone. So you were connected and talking to one another without the phone ever ringing... Well, that must have been a truly important occurrence then?

**WHY NOT USE STATISTICAL SIGNIFICANCE TESTING?**

Statistical significance testing may require an investment of effort that lacks a commensurate benefit. Science is the business of isolating relationships that (re)occur under stated conditions, so that knowledge is created and can be cumulated. But statistical significance does not adequately address whether the results in a given study will replicate (Carver, 1978). As scientists, we must ask (a) what the magnitudes of sample effects are and (b) whether these results will generalize; statistical significance testing does not respond to either question (Thompson, in press). Thus, statistical significance may distract attention from more important considerations.

**MISINTERPRETING STATISTICAL SIGNIFICANCE TESTING**

Many of the problems in contemporary uses of statistical significance testing originate in the language researchers use. Several names can refer to a single concept (e.g., "$SOS_{\text{(BETWEEN)}}$" = "$SOS_{\text{(EXPLAINED)}}$" = "$SOS_{\text{(MODEL)}}$" = "$SOS_{\text{(REGRESSION)}}$"), and different meanings are given to terms in different contexts (e.g., "univariate" means having only one dependent variable but potentially many predictor variables, but may also refer to a statistic that can be computed with only a single variable).

Overcoming three habits of language will help avoid unconscious misinterpretations:

- **Say** "statistically significant" rather than "significant." Referring to the concept as a phrase will help break the erroneous association between rejecting a null hypothesis and obtaining an important result.
- **Don't** say things like "my results approached statistical significance." This language makes little sense in the context of the statistical significance testing logic. My favorite response to this is offered by a fellow editor who responds, "How did you know your results were not trying to avoid being statistically significant?"
- **Don't** say things like "the statistical significance testing evaluated whether the results were 'due to chance'." This language gives the impression that replicability is evaluated by statistical significance testing.

**WHAT ANALYSES ARE PREFERRED TO STATISTICAL SIGNIFICANCE TESTING?**

Two analyses should be emphasized over statistical significance testing (*Journal of Experimental Education*, 1993). First, effect sizes should be calculated and interpreted in all analyses. These can be r squared-type effect sizes (e.g., R squared, eta squared, omega squared) that evaluate the proportion of variance explained in the analysis, or standardized differences in statistics (e.g., standardized differences in means), or both. Second, the replicability of results must be empirically investigated, either through actual replication of the study, or by using methods such as cross-validation, the jackknife, or the bootstrap (see Thompson, in press).
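Both recommendations can be sketched in a few lines. The example below computes one standardized difference in means (Cohen's d with a pooled standard deviation) and a percentile bootstrap interval for it. Everything specific here is an assumption made for illustration: the scores are invented, the function names are hypothetical, and the percentile bootstrap is only one of several bootstrap variants. It is not presented as the procedure the cited sources recommend.

```python
import random
from math import sqrt
from statistics import mean, variance

# Hypothetical scores, invented for illustration only.
group_a = [104, 98, 110, 101, 95, 107, 99, 103, 112, 96]
group_b = [97, 92, 101, 94, 89, 100, 93, 96, 104, 90]


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def bootstrap_interval(a, b, resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for Cohen's d, resampling within groups."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(resamples):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        # Skip the (very rare) resample in which neither group varies,
        # because the pooled standard deviation would be zero.
        if len(set(resampled_a)) == 1 and len(set(resampled_b)) == 1:
            continue
        estimates.append(cohens_d(resampled_a, resampled_b))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * len(estimates))]
    upper = estimates[int((1.0 - tail) * len(estimates)) - 1]
    return lower, upper


d = cohens_d(group_a, group_b)
lower, upper = bootstrap_interval(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
print(f"95% percentile bootstrap interval: {lower:.2f} to {upper:.2f}")
```

The effect size describes the magnitude of the sample result, and the width of the bootstrap interval gives a rough, data-based indication of how much that magnitude might vary across samples. Neither replaces an actual replication of the study.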

