Student Academic Achievement

Overview
Evaluation Questions
Summary of Major Results

Component 1a: Student Academic Achievement

The ultimate goal of Urban Dreams is improved student achievement.  The project components are designed to contribute to academic gains.  The project objectives call for measurable student achievement in core academic areas by the end of the 2001-2002 academic year.

The project evaluators and staff instituted a more rigorous “quasi-experimental” design this year to meet the demands of the No Child Left Behind statute and to better understand the project’s impact on the academic achievement and technology proficiency of students in Urban Dreams classrooms. To accomplish this, the evaluators developed representative samples of Urban Dreams and non-Urban Dreams students. For the purpose of this quasi-experimental study, being treated was operationally defined as having taken one or more classes during the current school year, the past school year, or both (2000-2002) from at least one teacher who was associated with the Urban Dreams (UD) program.

A survey instrument was administered to students who attended one of six sites where teachers had the opportunity to participate in the Urban Dreams program.  For each school site, a list of the language arts and social studies teachers was developed.  Stratified sampling resulted in the random selection of six teachers (one from each site – 24 teachers total) for each of the following four groups:   a language arts teacher involved in the Urban Dreams program, a social studies teacher involved in the Urban Dreams program, a language arts teacher who was not involved in Urban Dreams, and a social studies teacher who was not involved in Urban Dreams.

The purpose of the following analysis is to summarize performance on standardized tests (specifically, the SAT/9 and STAR tests) administered during spring 2002 to students at high school sites where teachers participating in the Urban Dreams program work.  The test scores for these students on the Stanford Achievement Test, Version 9 (SAT/9) and California State Standards Test (STAR) were retrieved for students in both the “experimental” and “comparison” groups from the records kept by the school district.

Ninety-six percent of the students completed the question on the survey that allowed classification into the “experimental” vs. “comparison” group. As defined above, students were placed in the experimental group if they had taken one or more classes during the current or past school year (2000-2002) from at least one teacher who was associated with the program. The comparison group consisted of the 23% of the sample who were students at the same sites but who did not have a UD program teacher within the last two years.

The size of the sample used in each statistical analysis varies, reflecting the percentage of students for whom data on each dependent variable were obtained. Of the 890 students who could be classified as belonging to either the experimental or the comparison group, 90.8% had usable scores reflecting their total number of technology proficiency skills. In contrast, only 73.9%, 71.3%, 70.9%, 70.6%, and 72.7% of these group-classified students had accessible NCE SAT/9 reading, language, and social studies scores and STAR English language arts and history scores, respectively.
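The coverage percentages quoted above can be checked against the available sample sizes reported in the correlation tables later in this section. The following is a minimal sketch of that arithmetic; the variable names are illustrative only.

```python
# Students who could be classified into the experimental or comparison group.
classified = 890

# Available sample sizes per scale, as listed under "Size of Available Sample"
# in the group-membership correlation table later in this section.
available = {
    "SAT/9 reading": 658,
    "SAT/9 language": 635,
    "SAT/9 social studies": 631,
    "STAR English language arts": 628,
    "STAR history": 647,
}

# Percentage of classified students with a usable score on each scale.
coverage = {scale: round(100 * n / classified, 1) for scale, n in available.items()}

for scale, pct in coverage.items():
    print(f"{scale}: {pct}%")
```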

 

Evaluation Questions

The evaluation questions for student academic achievement are:

Evaluation Question One: Do students (in the “experimental group”) who were enrolled in at least one course taught by a teacher who participated in the Urban Dreams program, on average, perform better on the SAT/9 subtests (reading, language arts, and social studies) and STAR subtests (English language arts and history) than students who were not taught by such teachers (the “comparison” group)?
Evaluation Question Two: What is the correlation between program participation and standardized test scores?
Evaluation Question Three: Do students who perform better on the SAT/9 and STAR subtests also self-report higher levels of technology proficiency?
Evaluation Question Four: Is there a statistically significant difference between the experimental and comparison groups’ standardized test performance after controlling for background factors (i.e., those not attributable to program impact)? Such factors might influence the attainment of technology competencies (an intermediate outcome hypothesized to impact achievement) and/or represent a selection threat that gives one group an initial advantage with regard to standardized test performance.

 

Evaluation Question One:

To address evaluation question one, a series of independent-samples t-tests was performed using the normal curve equivalent (NCE) scores for each of the three SAT/9 subtests. For the STAR tests, the proficiency levels were compared between the groups using the non-parametric Mann-Whitney U test, since the data represent ordinal measurement (i.e., they lack the equal widths between successive levels that t-tests require). For all five analyses, group membership (experimental versus comparison) served as the independent variable.

For each of the five standardized test scores analyzed, students in the experimental group scored higher, on average, than those in the comparison group (see table below). It should be kept in mind, however, that these statistically significant differences may be explained in part by initial group differences in background factors having little, if anything, to do with the Urban Dreams program itself. Thus, greater attention should be paid to the results of evaluation question four.

It should be noted that for approximately 30% of the students who were classifiable as belonging to the experimental or comparison group, no standardized test score data were available. Thus, the sample sizes used in the analyses ranged from a low of 628 to a high of 658. (The degrees of freedom indicated in each statistical summary reflect a statistical adjustment that is made when the homogeneity of variance assumption is not met, as is true in these instances.)

Standardized SAT/9 achievement test score performance by group (experimental vs. comparison)

| Scale | Mean Difference Between Groups | Program Participation Indicator | N | Mean | Std. Deviation |
| --- | --- | --- | --- | --- | --- |
| SAT/9 Reading NCE | 10.4 | Experimental | 510 | 38.360 | 19.007 |
| | | Comparison | 148 | 27.921 | 15.637 |
| SAT/9 Language NCE | 9.0 | Experimental | 494 | 47.138 | 20.015 |
| | | Comparison | 141 | 38.118 | 16.764 |
| SAT/9 Social Science NCE | 8.0 | Experimental | 488 | 44.644 | 18.327 |
| | | Comparison | 143 | 36.676 | 15.164 |

Table 2b-1. Level on STAR English language arts test by group membership

| Group Membership | | Far Below | Below Basic | Basic | Proficient | Advanced | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Comparison | Count | 62.0 | 38.0 | 40.0 | 9.0 | 1.0 | 150.0 |
| | Expected Count | 36.5 | 37.5 | 43.9 | 21.3 | 10.7 | 150.0 |
| | % within group | 41.3% | 25.3% | 26.7% | 6.0% | 0.7% | 100.0% |
| | Std. Residual | 4.2 | 0.1 | -0.6 | -2.7 | -3.0 | |
| Experimental | Count | 91.0 | 119.0 | 144.0 | 80.0 | 44.0 | 478.0 |
| | Expected Count | 116.5 | 119.5 | 140.1 | 67.7 | 34.3 | 478.0 |
| | % within group | 19.0% | 24.9% | 30.1% | 16.7% | 9.2% | 100.0% |
| | Std. Residual | -2.4 | 0.0 | 0.3 | 1.5 | 1.7 | |
| Total | Count | 153.0 | 157.0 | 184.0 | 89.0 | 45.0 | 628.0 |
| | Expected Count | 153.0 | 157.0 | 184.0 | 89.0 | 45.0 | 628.0 |
| | % within group | 24.4% | 25.0% | 29.3% | 14.2% | 7.2% | 100.0% |

 

Table 2b-2. Level on STAR history test by group membership

| Group Membership | | Far Below | Below Basic | Basic | Proficient | Advanced | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Comparison | Count | 58.0 | 46.0 | 39.0 | 6.0 | 1.0 | 150.0 |
| | Expected Count | 41.5 | 38.7 | 48.2 | 18.8 | 2.8 | 150.0 |
| | % within group | 38.7% | 30.7% | 26.0% | 4.0% | 0.7% | 100.0% |
| | Std. Residual | 2.6 | 1.2 | -1.3 | -2.9 | -1.1 | |
| Experimental | Count | 121.0 | 121.0 | 169.0 | 75.0 | 11.0 | 497.0 |
| | Expected Count | 137.5 | 128.3 | 159.8 | 62.2 | 9.2 | 497.0 |
| | % within group | 24.3% | 24.3% | 34.0% | 15.1% | 2.2% | 100.0% |
| | Std. Residual | -1.4 | -0.6 | 0.7 | 1.6 | 0.6 | |
| Total | Count | 179.0 | 167.0 | 208.0 | 81.0 | 12.0 | 647.0 |
| | Expected Count | 179.0 | 167.0 | 208.0 | 81.0 | 12.0 | 647.0 |
| | % within group | 27.7% | 25.8% | 32.1% | 12.5% | 1.9% | 100.0% |
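The expected counts and standardized residuals in the two STAR tables above follow directly from the observed counts. The sketch below recomputes the residuals for the English language arts table, assuming the usual definitions (expected count = row total × column total / grand total; standardized residual = (observed − expected) / √expected). The function and variable names are illustrative and do not come from the report.

```python
import math

# Observed counts by proficiency level (Far Below ... Advanced),
# taken from the STAR English language arts table above.
observed = {
    "Comparison": [62, 38, 40, 9, 1],
    "Experimental": [91, 119, 144, 80, 44],
}

def standardized_residuals(table):
    """Return (observed - expected) / sqrt(expected) per cell, rounded to one decimal."""
    grand_total = sum(sum(row) for row in table.values())
    column_totals = [sum(col) for col in zip(*table.values())]
    result = {}
    for group, row in table.items():
        row_total = sum(row)
        # Expected count under independence of group and proficiency level.
        expected = [row_total * c / grand_total for c in column_totals]
        result[group] = [round((o - e) / math.sqrt(e), 1) for o, e in zip(row, expected)]
    return result

residuals = standardized_residuals(observed)
for group, values in residuals.items():
    print(group, values)
```

Large positive residuals mark cells with more students than independence would predict, and large negative residuals mark cells with fewer.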

 

SAT/9 reading scores - The mean difference of 10.4 NCE score points on the SAT/9 Reading scale, unadjusted for background factors that differ between the groups, was statistically significant, t(284.959) = 6.795, p < .001. A 95% confidence interval for the difference between the two population means suggests that the experimental group’s population mean lies between 6.5 and 14.4 NCE score points higher than that of the comparison group.

SAT/9 language scores - The mean difference of 9.0 NCE score points on the SAT/9 Language scale, unadjusted for background factors that differ between the groups, was statistically significant, t(264.646) = 5.387, p < .001. A 95% confidence interval for the difference between the two population means suggests that the experimental group’s population mean lies between 4.7 and 13.4 NCE score points higher than that of the comparison group.

SAT/9 social science scores - The mean difference of 8.0 NCE score points on the SAT/9 Social Science scale, unadjusted for background factors that differ between the groups, was statistically significant, t(274.892) = 5.259, p < .001. A 95% confidence interval for the difference between the two population means suggests that the experimental group’s population mean lies between 4.0 and 11.9 NCE score points higher than that of the comparison group.
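The t statistics and fractional degrees of freedom reported above can be approximately recovered from the group sizes, means, and standard deviations in the SAT/9 table. The sketch below assumes that the adjustment for unequal variances is the Welch-Satterthwaite approximation; the report does not name the method, so this is an assumption. Because the inputs are rounded, the results agree with the reported values only to about two decimal places.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first group's mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second group's mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# (mean, standard deviation, N) for the experimental and comparison groups,
# copied from the SAT/9 summary table earlier in this section.
summary = {
    "SAT/9 Reading": ((38.360, 19.007, 510), (27.921, 15.637, 148)),
    "SAT/9 Language": ((47.138, 20.015, 494), (38.118, 16.764, 141)),
    "SAT/9 Social Science": ((44.644, 18.327, 488), (36.676, 15.164, 143)),
}

results = {scale: welch_t(*exp, *comp) for scale, (exp, comp) in summary.items()}
for scale, (t, df) in results.items():
    print(f"{scale}: t({df:.3f}) = {t:.3f}")
```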

STAR English language arts proficiency level scores - As evident in Table 2b-1 (page 12), the mean rank for the proficiency level of students in the experimental group on the STAR English language arts test (N = 478, Mean Rank = 339.66) exceeds that of the comparison group (N = 150, Mean Rank = 234.31). The median English language arts proficiency level for the comparison group is 2 (below basic) and for the experimental group it is 3 (basic). The Mann-Whitney U test statistic (U = 23822) was significant (p < .001). The pattern of standardized residuals reveals that there is a much larger proportion of students in the comparison group who are far below basic proficiency (level = 1) than expected and a much smaller proportion of students in the experimental group at that level than expected. Because the levels assigned are supposed to be tailored for each grade level, it is unclear whether this merely reflects the selection threat introduced by having a larger proportion of freshmen and sophomores in the comparison group than in the experimental group. Still, it must be recognized that the results may reflect other initial group differences that the lack of random assignment to treatment (i.e., quasi-experimentation) may introduce.

STAR history proficiency level scores - As evident in Table 2b-2, the mean rank for the proficiency level of students in the experimental group on the STAR history test (N = 497, Mean Rank = 343.12) exceeds that of the comparison group (N = 150, Mean Rank = 260.66). The median history proficiency level for the comparison group is 2 (below basic) and for the experimental group it is 3 (basic). The Mann-Whitney U test statistic (U = 27774) was significant (p < .001). The pattern of standardized residuals reveals that there is a much larger proportion of students in the comparison group who are far below basic proficiency (level = 1) than expected and a much smaller proportion of students in the comparison group at the proficient level (level = 4) than expected. Because the levels assigned are supposed to be tailored for each grade level, it is unclear whether this merely reflects the selection threat introduced by having a larger proportion of freshmen and sophomores in the comparison group than in the experimental group. Still, it must be recognized that the results may reflect other initial group differences that the lack of random assignment to treatment (i.e., quasi-experimentation) may introduce.
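The mean ranks and U statistics quoted above can be reproduced from the frequency counts in the two STAR tables. The sketch below assumes the standard treatment of ties, in which every student at the same proficiency level receives the midrank of that level. The function name is illustrative. The U returned is the one for the first group passed in (here the comparison group), which is the smaller of the two possible U values for these data.

```python
def mann_whitney_from_counts(counts_a, counts_b):
    """Return (U for group a, mean rank of a, mean rank of b) from per-level counts.

    Levels must be listed from lowest to highest; ties within a level get the midrank.
    """
    rank_sum_a = 0.0
    rank_sum_b = 0.0
    ranked_below = 0  # observations already ranked at lower levels
    for count_a, count_b in zip(counts_a, counts_b):
        tied = count_a + count_b
        midrank = ranked_below + (tied + 1) / 2
        rank_sum_a += count_a * midrank
        rank_sum_b += count_b * midrank
        ranked_below += tied
    n_a, n_b = sum(counts_a), sum(counts_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, rank_sum_a / n_a, rank_sum_b / n_b

# Counts by level (Far Below, Below Basic, Basic, Proficient, Advanced).
ela = mann_whitney_from_counts([62, 38, 40, 9, 1], [91, 119, 144, 80, 44])
history = mann_whitney_from_counts([58, 46, 39, 6, 1], [121, 121, 169, 75, 11])

for name, (u, rank_comparison, rank_experimental) in [("ELA", ela), ("History", history)]:
    print(f"{name}: U = {u:.0f}, comparison mean rank = {rank_comparison:.2f}, "
          f"experimental mean rank = {rank_experimental:.2f}")
```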

Evaluation Question Two:

To address evaluation question two, point biserial correlations were calculated between group membership (experimental versus comparison) and NCE scores on the SAT/9 reading, language arts, and social studies subtests.  The rank-biserial correlation coefficient needed for correlating an ordinal and dichotomous variable was approximated by calculating Spearman correlations between group membership and the level indicators for each of the STAR tests (English language arts and history).

Once squared, the correlations between group membership and standardized test performance help in understanding how much the variation in the latter can be predicted on the basis of program participation.  Because the experimental and comparison groups were coded 1 and 0, respectively, the positive correlations indicate that the experimental group tends to outperform the comparison group (as noted above).  Although all statistical inference tests for one sample correlation coefficients are significant (p < .001), the correlations range between .186 and .255 which suggests the relationships are modest.  By squaring the correlation coefficients we can conclude that 5.4%, 3.6%, 3.5%, 6.5%, and 3.8% of the variation in SAT/9 reading, language, social science, STAR English language arts, and STAR history scores, respectively, can be predicted on the basis of group membership.

Table 3. Point biserial and rank-biserial correlations between program participation (experimental=1, comparison=0) and performance on SAT/9 and STAR standardized achievement tests

| Standardized Test and Scale | Size of Available Sample | Correlation | Percentage of Variance Predicted |
| --- | --- | --- | --- |
| SAT/9 Reading | 658 | .232 | 5.4 |
| SAT/9 Language | 635 | .191 | 3.6 |
| SAT/9 Social Studies | 631 | .186 | 3.5 |
| STAR English Language Arts | 628 | .255 | 6.5 |
| STAR History | 647 | .194 | 3.8 |
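For the three SAT/9 scales, the point biserial correlations above can be recovered from the group summary statistics reported under evaluation question one. The sketch below relies on the standard identity r = t / √(t² + df), where t is the pooled-variance (equal variances assumed) t statistic and df = N − 2. The function name is illustrative, not taken from the report.

```python
import math

def point_biserial_from_summary(mean1, sd1, n1, mean0, sd0, n0):
    """Point biserial r between a 0/1 group code and a score, from group summaries."""
    df = n1 + n0 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n0 - 1) * sd0 ** 2) / df
    # Pooled-variance t statistic for the difference between group means.
    t = (mean1 - mean0) / math.sqrt(pooled_var * (1 / n1 + 1 / n0))
    return t / math.sqrt(t ** 2 + df)

# (mean, SD, N) for the experimental group (coded 1) and comparison group (coded 0).
summary = {
    "SAT/9 Reading": ((38.360, 19.007, 510), (27.921, 15.637, 148)),
    "SAT/9 Language": ((47.138, 20.015, 494), (38.118, 16.764, 141)),
    "SAT/9 Social Studies": ((44.644, 18.327, 488), (36.676, 15.164, 143)),
}

correlations = {
    scale: point_biserial_from_summary(*exp, *comp)
    for scale, (exp, comp) in summary.items()
}
for scale, r in correlations.items():
    print(f"{scale}: r = {r:.3f}, variance predicted = {100 * r * r:.1f}%")
```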

 

 

Evaluation Question Three:  

To address evaluation question three, correlations were calculated between the total number of technology proficiency skills self-reported by the students (in response to questions on the STPI, where scores can range from 0 to 21) and NCE scores on the SAT/9 reading, language arts, and social studies subtests. Because the STAR proficiency levels are ordinal, Spearman correlations were calculated between the number of technology proficiency skills and the level indicators for each of the STAR tests (English language arts and history).

The correlations between students’ self-reported level of technology proficiency and standardized test performance do suggest that those with more technological proficiency tend to perform better on the five standardized test scales being used in this investigation (see Table 4).  Although all statistical inference tests for one sample correlation coefficients are significant (p < .001), the correlations range between .293 and .379 which suggests the relationships are in the low to moderate range.  By squaring the correlation coefficients we can conclude that 14.4, 13.4, 8.6, 13.5, and 10.6% of the variation in SAT/9 reading, language, social science, STAR English language arts, and STAR history scores, respectively, can be predicted on the basis of students’ self-reported levels of technology proficiency.  

Table 4. Pearson and Spearman correlations between students’ self-reported levels of technology proficiency and performance on SAT/9 and STAR standardized achievement tests

| Standardized Test and Scale | Size of Available Sample | Correlation | Percentage of Variance Predicted |
| --- | --- | --- | --- |
| SAT/9 Reading | 623 | .379 | 14.4 |
| SAT/9 Language | 606 | .366 | 13.4 |
| SAT/9 Social Studies | 600 | .293 | 8.6 |
| STAR English Language Arts | 601 | .367 | 13.5 |
| STAR History | 614 | .326 | 10.6 |
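The "Percentage of Variance Predicted" column in both correlation tables is the squared correlation expressed as a percentage. A minimal sketch of that step, using the reported coefficients:

```python
scales = [
    "SAT/9 Reading",
    "SAT/9 Language",
    "SAT/9 Social Studies",
    "STAR English Language Arts",
    "STAR History",
]

# Correlations as reported for group membership and for technology proficiency.
group_membership_r = [0.232, 0.191, 0.186, 0.255, 0.194]
technology_r = [0.379, 0.366, 0.293, 0.367, 0.326]

def percent_variance(correlations):
    """Square each correlation and express it as a percentage, to one decimal place."""
    return [round(100 * r ** 2, 1) for r in correlations]

group_membership_pct = percent_variance(group_membership_r)
technology_pct = percent_variance(technology_r)

for scale, g, t in zip(scales, group_membership_pct, technology_pct):
    print(f"{scale}: group membership {g}%, technology proficiency {t}%")
```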

 

 

Evaluation Question Four:  

To address evaluation question four, hierarchical regression analysis was employed where blocks of variables are entered successively and those in prior blocks are controlled when examining the effects of variables entering in later blocks (see Table 5, next page).  It should be noted that the analyses involving STAR level data are only approximations.  Technically, a more sophisticated analysis is warranted that properly treats the STAR levels as being ordinal (not interval) data.

Table 5. Variable blocks used in hierarchical regression analysis of program impact

| Block | Variables |
| --- | --- |
| Block 1 (Demographic) | Male (1=Yes, 0=No); Grade Level (9, 10, 11, 12); Ethnicity (indicators for 4 groups) |
| Block 2 (Academic Achievement/Aspirations) | Self-Reported Grades; Having Plans to Attend College (1=Yes, 0=No) |
| Block 3 (Computer-specific) | Having Home Computer (1=Yes, 0=No); Took Technology Class (1=Yes, 0=No); Belief in Importance of Computers (1=Yes, 0=No) |
| Block 4 (Program Treatment) | Experimental Group (1= Teacher Participated in Program, 0= No Teachers Participated in Program) |
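The block-wise entry described above can be sketched as follows. This is an illustration only: the data are synthetic, the predictors are a hypothetical subset of those in Table 5, and the helper functions `solve` and `r_squared` are written here for the example rather than taken from the evaluators' software. The point is the mechanics: fit the model after each block enters and record how much R squared rises.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(columns, y):
    """R squared of a least-squares fit of y on the predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic students (hypothetical values, not the study data).
rng = random.Random(0)
n = 400
male = [float(rng.random() < 0.5) for _ in range(n)]
grade = [float(rng.choice([9, 10, 11, 12])) for _ in range(n)]
self_grades = [rng.gauss(3.0, 0.7) for _ in range(n)]
home_computer = [float(rng.random() < 0.6) for _ in range(n)]
treated = [float(rng.random() < 0.77) for _ in range(n)]
score = [
    30 + 1.5 * (grade[i] - 9) + 6 * self_grades[i] + 3 * home_computer[i]
    + 7 * treated[i] + rng.gauss(0, 15)
    for i in range(n)
]

blocks = [
    ("Block 1 (demographic)", [male, grade]),
    ("Block 2 (achievement)", [self_grades]),
    ("Block 3 (computer-specific)", [home_computer]),
    ("Block 4 (program treatment)", [treated]),
]

entered = []
previous_r2 = 0.0
steps = []
for label, columns in blocks:
    entered.extend(columns)  # earlier blocks stay in the model as controls
    r2 = r_squared(entered, score)
    steps.append((label, r2, r2 - previous_r2))
    previous_r2 = r2

for label, r2, change in steps:
    print(f"{label}: R^2 = {r2:.3f}, R^2 change = {change:.3f}")
```

The R squared change at the final step corresponds to the quantity the report attributes to group membership once the control blocks have already entered.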

Students in the treatment group (i.e., whose teachers participated in Urban Dreams) did perform significantly better on standardized achievement tests than the students in the comparison group after controlling for demographic, academic achievement/aspiration, and computer-specific background variables. Though statistically significant, the change in the proportion of variance (for the test score, or level of proficiency, being predicted) accounted for by knowing whether the student was in the experimental or comparison group ranged from .017 to .032. It should be recognized that this estimate of program impact is conservative in that we control for computer-specific variables that the Urban Dreams program could, in fact, have affected (e.g., the acquisition of a home computer, the decision to take a computer class, beliefs in the importance of having computer skills).

The regression results for the model outlined above with program treatment dichotomously indicated are shown in Table 6 (next page).  The “R” column shows the multiple regression correlation coefficient at the last step when the group membership variable is entered.  The “R square change” shows the additional proportion of the variation in the test scores, or level of proficiency, that can be predicted on the basis of group membership after all the control variables have already entered.  The “b” column represents the unstandardized regression coefficients.  They indicate how much higher, on average, scores on the test will be for those in the experimental group (i.e., because the signs are all positive) as compared to those in the comparison group.  The “t” value for testing the statistical significance of adding the group membership variable to the model is given along with its associated probability (labeled “Sig t – Sig Change”).  In the last column, the partial correlations indicate the correlation between group membership and the test scores after controlling for the variables that have already entered the regression model.  It must be noted that multiple linear regression is used to approximate the results of an ordinal regression (which would be more appropriate since the levels of proficiency aren’t truly interval or ratio data).

Table 6. Hierarchical regression results at final block where group membership enters as a predictor

| Scale | R | R Square Change | b | t | Sig t = Sig Change | Partial Corr |
| --- | --- | --- | --- | --- | --- | --- |
| SAT/9 Reading | .550 | .023 | 7.315 | 4.4 | <.001 | .178 |
| SAT/9 Language | .565 | .017 | 6.653 | 3.8 | <.001 | .157 |
| SAT/9 Social Studies | .506 | .023 | 7.106 | 4.2 | <.001 | .173 |
| STAR English Language Arts | .571 | .032 | .539 | 5.2 | <.001 | .213 |
| STAR History | .492 | .023 | .387 | 4.0 | <.001 | .163 |

R= multiple correlation coefficient

R Square Change= the change in R squared once the group membership variable is added

b = unstandardized regression coefficient for group membership variable in final model

t = t test statistic value for determining if the regression coefficient differs from zero

Note:  Multiple linear regression is used to approximate the results of an ordinal regression for the STAR measures.  The STAR proficiency levels technically represent ordinal rather than interval or ratio data as linear regression generally assumes.
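The columns of the regression results table are linked by a standard identity: the squared partial correlation of the last predictor entered equals the R squared change divided by the proportion of variance left unexplained before that predictor entered. The sketch below applies the identity to the reported values. Because R and R squared change are rounded to three decimals, the recovered partial correlations match the reported ones only approximately (within about .01).

```python
import math

# (R at final step, R square change, reported partial correlation) per scale.
table6 = {
    "SAT/9 Reading": (0.550, 0.023, 0.178),
    "SAT/9 Language": (0.565, 0.017, 0.157),
    "SAT/9 Social Studies": (0.506, 0.023, 0.173),
    "STAR English Language Arts": (0.571, 0.032, 0.213),
    "STAR History": (0.492, 0.023, 0.163),
}

def partial_correlation(r_final, r2_change):
    """Partial correlation of the last-entered predictor from R and R square change."""
    r2_before = r_final ** 2 - r2_change  # R squared with only the control blocks
    return math.sqrt(r2_change / (1.0 - r2_before))

recovered = {
    scale: partial_correlation(r_final, change)
    for scale, (r_final, change, _) in table6.items()
}
for scale, value in recovered.items():
    print(f"{scale}: recovered {value:.3f}, reported {table6[scale][2]:.3f}")
```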

SAT/9 reading scores - The mean difference of 7.3 NCE score points on the SAT/9 reading scale, as reflected in the unstandardized regression coefficient, b, is statistically significant and suggests that SAT/9 reading NCE scores, on average, do vary between the experimental and comparison groups even after adjusting for background factors. Comparing the mean difference (10.4) for these scores found in addressing evaluation question 1 with the difference (7.3) found in this more conservative analysis of program impact, in which potential pre-existing group differences thought to affect either technology proficiency or test performance have been controlled, we find that the impact is still of practical importance. Similarly, one can compare the point biserial correlation reported in Table 3 (.232) to the partial correlation in Table 6 (.178) and conclude that the introduction of the control variables used in the hierarchical regression analysis did not dramatically alter the results.

SAT/9 language scores - The mean difference of 6.7 NCE score points on the SAT/9 language scale, as reflected in the unstandardized regression coefficient, b, is statistically significant and suggests that SAT/9 language NCE scores, on average, do vary between the experimental and comparison groups even after adjusting for background factors. Comparing the mean difference (9.0) for these scores found in addressing evaluation question 1 with the difference (6.7) found in this more conservative analysis of program impact, in which potential pre-existing group differences thought to affect either technology proficiency or test performance have been controlled, we find that the impact is still of practical importance. Similarly, one can compare the point biserial correlation reported in Table 3 (.191) to the partial correlation in Table 6 (.157) and again conclude that the introduction of the control variables used in the hierarchical regression analysis did not dramatically alter the results.

SAT/9 social science scores - The mean difference of 7.1 NCE score points on the SAT/9 social science scale, as reflected in the unstandardized regression coefficient, b, is statistically significant and suggests that SAT/9 social science NCE scores, on average, do vary between the experimental and comparison groups even after adjusting for background factors. Comparing the mean difference (8.0) for these scores found in addressing evaluation question 1 with the difference (7.1) found in this more conservative analysis of program impact, in which potential pre-existing group differences thought to affect either technology proficiency or test performance have been controlled, we find that the impact is still of practical importance. Similarly, one can compare the point biserial correlation reported in Table 3 (.186) to the partial correlation in Table 6 (.173) and again conclude that the introduction of the control variables used in the hierarchical regression analysis did not dramatically alter the results.

STAR English language arts proficiency levels - The mean difference of .539 level points on the STAR English Language Arts test, as reflected in the unstandardized regression coefficient, b, is statistically significant and suggests that STAR English Language Arts proficiency levels, on average, do vary between the experimental and comparison groups even after adjusting for background factors. The zero-order rank-biserial correlation reported in Table 3 (.255) is not that different from the partial correlation in Table 6 (.213), which suggests that the introduction of the control variables used in the hierarchical regression analysis did not dramatically alter the results.

STAR history proficiency levels - The mean difference of .387 level points on the STAR History test, as reflected in the unstandardized regression coefficient, b, is statistically significant and suggests that STAR History proficiency levels, on average, do vary between the experimental and comparison groups even after adjusting for background factors. The zero-order rank-biserial correlation reported in Table 3 (.194) is not that different from the partial correlation in Table 6 (.163), which suggests that the introduction of the control variables used in the hierarchical regression analysis did not dramatically alter the results.

 

Summary of Major Results

1.  For each of the five standardized test scores analyzed, students in the experimental group, on average, performed higher than those in the comparison group.  Ninety-five percent confidence intervals for the difference between the two population means suggest the amount that the experimental group’s population mean lies above that of the comparison group is between:

6.5 to 14.4 NCE score points on the SAT/9 reading test
4.7 to 13.4 NCE score points on the SAT/9 language test
4.0 to 11.9 NCE score points on the SAT/9 social studies test

2.  The STAR English language arts proficiency level and the history proficiency level results are as follows:

The median English Language Arts proficiency level for the comparison group is 2 (below basic) and lower than that of the experimental group which is 3 (basic).

The median History proficiency level for the comparison group is 2 (below basic) and lower than that of the experimental group which is 3 (basic).

3.  The point biserial and rank-biserial correlations between group membership and standardized test performance range between .186 and .255. Thus, the percentages of the variance in SAT/9 reading, language, social science, STAR English language arts, and STAR history scores that can be predicted on the basis of group membership (knowing whether a student is in the experimental vs. comparison group) are 5.4, 3.6, 3.5, 6.5, and 3.8%, respectively.

4.  The correlations between students’ self-reported level of technology proficiency and standardized test performance range between .293 and .379 which suggests that those with more technological proficiency tend to perform better on the five standardized test scales being used in this investigation.  14.4, 13.4, 8.6, 13.5, and 10.6% of the variation in SAT/9 reading, language, social science, STAR English language arts, and STAR history scores, respectively, can be predicted on the basis of students’ self-reported levels of technology proficiency.  These associations, however, should not be interpreted as reflecting a causal linkage, as a third variable (e.g., level of achievement motivation) may be linking technology proficiency levels with achievement test scores.

5.  Students in the treatment group (i.e., whose teachers participated in UD) did perform significantly better on standardized achievement tests than the students in the comparison group even after controlling for demographic, academic achievement/aspiration, and computer-specific background variables. The estimates for the two STAR tests are somewhat crude approximations to ordinal regression obtained through multiple linear regression, which assumes the data are measured on an interval or ratio scale; technically, the STAR proficiency levels should be treated as ordinal data. The unstandardized regression coefficients suggest that the amount by which the experimental group’s population mean lies above that of the comparison group is:

 

7.3 NCE points on the SAT/9 reading test
6.7 NCE points on the SAT/9 language test
7.1 NCE points on the SAT/9 social science test
.539 level points on the STAR English language arts test
.387 level points on the STAR history test

 

 

 

© Copyright 2002 Center for Evaluation and Research, LL