This spring my high school administered ACTFL’s AAPPL assessment to all World Languages students in their 3rd, 5th, and 7th semesters of study. The results were encouraging, enlightening, and challenging. Below is the express version of action research on the findings and implications of this assessment.
On May 3, 2018, approximately 170 students in grades 9-12 at Chillicothe High School in southern Ohio took the AAPPL assessment. My world languages colleagues and I hoped to see if our students were meeting the proficiency targets we had set for them and if our understanding of the proficiency levels and proficiency scoring was roughly equivalent to how the ACTFL-trained AAPPL raters were scoring the exams.
Chillicothe, OH is a small city in Appalachia with a mix of rural, urban, and suburban students. The majority of our students are on free or reduced lunch, so much so that the district received a grant to provide free lunch for 100% of students. Many of our students come from homes with no college education, and most of Chillicothe High School’s students live below the poverty line. Anecdotally speaking, many teachers in this school have given up assigning homework, because students will not do it, or they will rush to copy it right before class, even in AP or college prep classes. Academics are not devalued; however, they are not well valued beyond the limits of the school day. Furthermore, Chillicothe does not have a very linguistically diverse population. Less than 2% of students who took the AAPPL come from homes or families where a foreign language is spoken. In short, for the vast majority of our students, exposure to a language other than English (LOTE) is confined to the classroom.
Chillicothe High School offers three LOTEs – Spanish, French, and Chinese. Spanish is by far the most popular, having three to four times more students than any other language. Accordingly, three teachers teach Spanish at Chillicothe High School. Two of the three Spanish teachers are dedicated to levels 1-3, while the third teacher teaches levels 4-7 (AP). French is the second most studied language at CHS with about one hundred students and one teacher who teaches levels 1-7 (AP). Finally, Chinese has about twenty students and one teacher responsible for levels 1-7 (AP). CHS operates by semesters, with odd-level courses being offered in the spring and even-level courses offered in the fall. Classes are 70 minutes long and meet every day for one semester. Advanced middle school students may begin their LOTE study in 8th grade, while the general population of students begins their first LOTE in the spring of their 9th grade year.
The world language teachers at CHS do not teach using one uniform methodology. The French teacher (me) uses Organic World Language (OWL) strategies almost exclusively, as does the Spanish teacher who teaches levels 4-AP. The Chinese teacher uses a variety of comprehensible input methods (CI) including some OWL strategies. One of the lower level Spanish teachers teaches mostly in English, using a great deal of technology and her excellent rapport with students to teach about the language and its mechanics. The other lower level Spanish position has seen a great deal of turnover in the past few years. One teacher used mostly traditional, English methods to teach about the language. Another teacher who was brand new to the field of teaching attempted to use OWL in all of his classes. A third teacher relied mostly on traditional methods with some incorporation of OWL strategies before inverting that methodology in her final semester. In brief, students in Spanish 1-3 have been exposed to a variety of teachers and methods.
The department never formalized essential questions that we had going into the AAPPL testing, but informal conversations proved that we had two uniform concerns:
- What level of proficiency are our students demonstrating at the various levels? Specifically, are our students reaching Intermediate-Low by level 3 as the Ohio Department of Education recommends for level 3 students of French and Spanish? Furthermore, are our AP students performing near the advanced levels?
- Are we assessing our students in a way similar to how ACTFL experts would assess? We are familiar with ACTFL’s proficiency levels and sub-levels, but it is hard to apply these standards to our own students. We know them. We want them to be somewhere. We are unsure as to how much accuracy must be demonstrated in order to achieve various levels. Are we grading too harshly, too leniently, or just right?
On a personal level, I had one more essential question. I wanted to know what effect, if any, the various methodologies have on student performance, giving us a third essential question:
- Is OWL methodology producing higher levels of student performance than non-OWL / non-CI strategies?
I can only speak about my own assumptions for my French students, but I would like to clarify how I was teaching them prior to the AAPPL. As I said, I use almost exclusively OWL strategies for all classes. However, the prompts I used in each class varied by what I thought that they needed. I was convinced that my Level 3 students were Novice-High, so I delivered mostly intermediate prompts to them. I made the same assumptions about my Level 5 students, believing that their linguistic accuracy was too low to actually qualify them as intermediate speakers. I believed my Level 7 students to be Intermediate-Lows. Since I wanted to push them to Intermediate-Mid, and since they were in a mixed class with my Level 5 students, I provided them with mostly intermediate prompts and only the very rare advanced prompt. In brief, essentially all of the students that I tested were working almost exclusively with Intermediate-level prompts.
Data provided by ODE supported my beliefs. The data showed that students at the end of a 7-12 program are very likely to end up at Intermediate-Mid to Intermediate-High, and that advanced scores were a longshot, even with a 7-12 program. I had an 8-12 program, and I believed that my best students were likely to cap out at Intermediate-Mid.
Why we chose AAPPL
My department decided on AAPPL for a variety of reasons. It was cost effective. At only $20 per student (and a small discount after that, just for being in Ohio), we could easily tack this on as a lab fee for anyone in the levels 3, 5, or 7 courses. Furthermore, we wanted a test whose results could earn our students the Seal of Biliteracy, new to Ohio in 2018. This left us choosing between AAPPL and the STAMP test. While I believe that STAMP has a lot to offer as a test (even claiming to be a proficiency rather than performance test), and would not be closed to it in the future, we ultimately chose AAPPL because it was created by ACTFL. We know and trust ACTFL. We like that ACTFL trains the raters. We’ve invested a lot of time over the past few years learning about the proficiency levels, and AAPPL seemed to mirror proficiency more clearly than STAMP. Furthermore, AAPPL provides free rubrics that are easy to use, and my department incorporates these rubrics into our day-to-day grading. (As a side note, and I swear that I’m not being paid to say this (but I’m not closed to that idea if anyone from ACTFL is reading), the support we received from AAPPL was superb. They were very helpful and prompt when it came to administering the exams).
Choosing the examinees to be studied
Prior to administering the exam, we knew that we wanted to see first of all if our Level 3 students were attaining Intermediate-Low scores in French and Spanish. Although Chinese students also took the AAPPL, they are studying a Category 4 difficulty language, so their targets are a bit more complicated. Looking at Spanish 3 versus French 3 data gives us a bit more of an apples to apples comparison. Level 1 is also offered during the spring semester when we were administering the AAPPL, but all teachers in the department felt relatively confident that our Level 1 students were reaching the target of Novice-Mid. We also wanted to know how far into the Intermediate sub-levels our honors students were climbing, so we assessed Levels 5 and 7 as well. Comparisons of honors-level French and Spanish students are being left out of this study, because the sample size of honors French students is too small to be conclusive.
Administering the exam
We administered exams to all 170 students in one sitting on the morning of May 3. Students who were not testing were allowed to arrive at school two hours later, which mirrored a familiar state testing schedule that our school uses. We had enough headphones to record about 100 students at a time, so students were divided into groups A and B. A groups completed the Interpretive Listening (IL) test, followed by the Interpersonal Listening and Speaking (ILS) test. Following the ILS test, Group A students were given a brief break during which their proctors placed the recording microphones outside the classroom doors. After the break, students in Group A took the Interpretive Reading (IR) test and concluded with the Presentational Writing (PW) test. Group B students began by completing the IR and IL tests. They were allowed a break after IL, during which microphones were transferred from Group A testing rooms to Group B testing rooms. After the break, Group B students took the ILS test and concluded with the PW. We intentionally placed the PW test as the final test for both groups, because our research had revealed that it was the part of the test that took the longest to complete, and it could be stopped and restarted relatively easily.
Students were grouped in rooms of like students. All Spanish 3 students took the AAPPL Form A (measuring Novice-Low to Intermediate-Mid) and all Spanish 5 and 7 students took Form B (measuring Novice-High to Advanced). Some French 3 students attempted Form B, while the rest took Form A, and all French 5 and 7 students took Form B. All Chinese students attempted Form A. Rooms were grouped so that the same test form was given to all students. French A students were only with other French A students, Spanish B students were only with other Spanish B students, and so on.
We incentivized participation and effort on this test by allowing test-takers to leave school early on the test day. We furthermore promised that any students who met or exceeded course expectations on the AAPPL exam would immediately receive an A or A+ on the final exam and would be excused from class on the final exam date. Consequently, our attendance was excellent, with over 95% of expected students sitting the exam.
While the AAPPL itself is a very reliable measure, our study itself does present some limitations. The most significant factor is that in comparing OWL and non-OWL students at Level 3, the course methodology is not the only variable. The full-OWL students were also French students, while the some-OWL students were all Spanish students. This is significant because demographic and academic disparities are present between these two languages. Many Spanish students take Spanish by default. There are multiple Spanish I courses offered every year, and with scant prerequisites, it is an easy elective to opt into, even if they do not really want to learn Spanish. French on the other hand has only one teacher and limited French I course offerings. Getting into French is difficult to do by accident. So the students who take French generally intended to take French, while Spanish classes have a larger mix of intentional and accidental students. This demographic difference may have affected classroom management and therefore learning outcomes in the classroom. Furthermore, as mentioned above, the Spanish courses have had a great deal of teacher turnover in the past few years. Some of the Spanish 3 students had had 4 different Spanish teachers. Their LOTE-learning history was a little less stable than the French students’. So while AAPPL will give us a clear view of what is happening in our classes, it cannot tell us conclusively that OWL vs non-OWL methodology is to blame or to praise for the results.
AAPPL results come back very quickly. The computer-scored IL and IR sections are returned immediately. The human-scored ILS and PW sections were returned in under 3 weeks, with Chinese students getting their results back in less than 24 hours.
Quantifying the results
AAPPL reports scores using a mix of letters and numbers. The following chart shows the AAPPL to ACTFL scoring.
To quantify this a bit more simply, I converted these scores to a 1-10 scale, as seen in the middle column of the table below.
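For readers who want to reproduce this kind of conversion with their own results, here is a minimal sketch in Python. The exact mapping in my table is not reproduced here, so this version assumes the simplest possible conversion: the ten AAPPL score bands (N1 through A) are numbered 1 through 10 in order. Adjust the dictionary to match whatever conversion chart you use.

```python
# Hypothetical 1-10 conversion of AAPPL score bands, assuming the bands
# (N1..N4, I1..I5, A) are simply numbered in ascending order.
AAPPL_TO_NUMERIC = {
    "N1": 1, "N2": 2, "N3": 3, "N4": 4,            # Novice bands
    "I1": 5, "I2": 6, "I3": 7, "I4": 8, "I5": 9,   # Intermediate bands
    "A": 10,                                        # Advanced
}

def average_score(scores):
    """Average a list of AAPPL letter-number scores on the 1-10 scale."""
    return sum(AAPPL_TO_NUMERIC[s] for s in scores) / len(scores)

# Example: four students' ILS scores for one class
print(average_score(["I4", "I4", "I1", "N4"]))  # -> 6.25
```

A numeric scale like this is what makes it possible to average scores across a class and plot them against the ODE target lines in the charts that follow.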
The results of the French 3 (OWL) vs. Spanish 3 (some OWL) comparison are shown below. The Ohio Department of Education (ODE) targets are also shown in red.
Below are the results for all French and Spanish classes:
And overall scores by language are presented below. Here are the French score distributions:
And the Spanish Score distributions:
The differences between CHS students are most stark at Level 3 (full-OWL vs. little to no OWL). It is clear that the OWL students outperformed their non-OWL counterparts by 2-3 points. Also, Level 3 OWL students on average performed above the Intermediate-Low target level on all 4 tests, whereas non-OWL students performed at level only in presentational writing and below level on all other tests. In short, the average Level 3 student is performing at intermediate levels of proficiency when taught in an OWL setting. The average non-OWL student at CHS is performing within the novice range.
Spanish students in Level 5 have nearly caught up to their French counterparts after a full year of OWL instruction. The average French student is performing well above level on all four tests, whereas Spanish students are performing above level on all tests save the ILS, in which they are essentially at level.
In the seventh semester of instruction, the ODE target (Intermediate High) surpasses French and Spanish students’ performance in nearly all tests. Furthermore, the Level 7 students’ scores are nearly identical to the scores of the Level 5 students.
A further observation is that the Interpersonal Listening and Speaking score for all languages and for all levels was the lowest of the average scores.
When looking at the score distributions by language, I notice spikes at certain scores. French spikes in ILS at I4 and I1, and it spikes again in PW at I4. Spanish also spikes at I4 on both of those tests with additional spikes at N4 in ILS and I1 in PW. The I4 spikes in particular serve as a ceiling for the ILS and PW tests in both languages.
It seems that the ILS test was the most difficult for all students regardless of language or level. This difficulty was particularly pronounced in Level 3 non-OWL courses, but the difficulty seemed to be staved off by OWL instruction in Levels 5 and under. This seems to indicate that students need more practice in the interpersonal mode, even in OWL classes, and especially at the upper levels.
OWL vs Traditional Instruction
The data in this study favors full OWL instruction over limited or no OWL instruction. The effects were most striking in the presentational writing and interpersonal listening and speaking modes.
What’s happening at Level 7?
In Level 7 the ODE line far surpasses our students’ performance. I attribute this to ODE assuming linear growth in proficiency over time. In reality, we know that it takes longer to advance through the intermediate levels than the novice ones. The higher you climb in proficiency, the longer it takes you to reach the next level. ODE knows that. They even have charts that show that students will not reach Intermediate-High in a 4-year program (they can). I think they just don’t know how to represent that in course targets. No judgment. I don’t know how to do that either.
Not only does the ODE target surpass CHS students at Level 7, but in some cases our Level 5 students are outscoring our Level 7 students! I can only suppose that this is because language learners will naturally stagnate at upper proficiency levels, and that is why there is so little observable growth between Levels 5 and 7. I also have one other explanation that can be better explained by the spikes.
Both French and Spanish students show spikes and ceilings at I4 in Speaking and Writing. That spike in that position is telling. French and Spanish teachers were convinced that the upper-level students were intermediate-lows, and they were trying to grow them to intermediate-mids by providing them with a great deal of intermediate prompts. The data suggests that that strategy is wrong, because the upper-level students are not intermediate-lows. In fact, they’re strong intermediate-mids! We must keep in mind that a score of I5 indicates that the speaker/writer can produce at the advanced level about half of the time. Students scoring I4 do not need intermediate prompts to grow. They need advanced prompts, and we were not providing advanced prompts to them. They stalled out at I4 because they had never been pushed beyond intermediate. The same concept can be applied anywhere there is a spike.
OWL and Reading/Writing
One of the biggest questions or criticisms that I have heard about OWL is that it does not focus enough on reading or writing. The data does not support this criticism. Not only are OWL students performing above level in writing in Levels 3 and 5, but they are outperforming non-OWL students by 2 whole points in Level 3. While non-OWL students are hitting Intermediate-Low in Presentational Writing, the OWL students are firm Intermediate-Mids. Non-OWL Level 3 courses have students reading at a Novice-Mid level on average while OWL students are reading at Intermediate-Mid. That is an astounding difference. I think many people assume that OWL students will be slower to read and write than students in traditional instruction, but the data does not support that assumption. This data seems to support the opinion that the more language in a student’s head, the more language the student will read and write, and that the best way to get language into a student’s head is to give them a ton of comprehensible input. OWL instruction provides that input constantly. The constant use of the language in a conversational setting seems to translate quickly to reading and writing skills.
Before moving on, I should add that reading and writing instruction are not absent from OWL instruction. On the contrary, reading serves as an important source of input. OWL methodology seeks to provide authentic reading that is appropriate both to students’ proficiency level and to their interests. Furthermore, any speaking prompt in the OWL circle can be transformed into a writing prompt with almost no effort on the part of teacher or student. Those written responses can then serve as a form of reading input for other students. When done correctly, reading and writing are very much part of the OWL circle and classroom.
Interpersonal Listening and Speaking are Tough.
I was surprised that the ILS scores are the lowest of the scores for full-OWL classes. This is certainly the skill I spend the most time on in class. I was heartened to see that the same challenge existed for the non-OWL classes, with ILS earning the lowest of the four scores in Spanish 3 classes as well. This makes me think that the interpersonal speaking skill is the slowest skill to acquire. I would love to see data on students who took all four tests nationwide to see if ILS is consistently the lowest. Also, please post in the comments your experience with this if you have also administered the AAPPL.
I’m not exposing my students to enough authentic listening.
This wasn’t really part of the study. This is me using my own data for my own purposes. None of my French students scored Advanced on the IL test. Some Spanish students did. I’m not mad or jealous. I just notice it, and I’m glad to have the data to point it out. In my 12 years of teaching, this was the first official feedback that told me that my students are not doing as well in listening as I would like. Thanks for the data, AAPPL. I know what I need to change next year.
Knowing where our students are helps us break through spikey ceilings.
The biggest revelation I had during the study was that I4 spike phenomenon. I had intentionally been giving all of my students mostly intermediate prompts. I was so sure that they were maxing out at intermediate-low. I knew that they needed more intermediate prompts. The data is telling me that I underrated my students’ proficiency levels. Even my French 3 students are starting to comfortably approach the intermediate-mid levels. They need to start seeing advanced prompts before the end of French III. Students above French III really only needed advanced prompts this year, and I didn’t give them very many. Again, I now possess really important data, and I’m extremely grateful that AAPPL let me see what I can do to be a better teacher in the upcoming school year.
Side anecdote: I got this data back a couple of weeks before the school year ended. One of my French III students who scored an I4 was taking the final interview with me, and I decided to make him go for it. While discussing his summer plans, I asked him if he had any funny vacation stories from the past that he remembered. It was a ton of fun to make him stick to his narration. His grammar was a mess, but I think a native speaker would have gotten the gist. He needed my support to finish his narration. He couldn’t do it on his own. He needed me to ask questions like “What happened next?” or “How did you resolve that situation?” He fumbled his way through it, because he was an intermediate speaker talking through an advanced prompt, but he got to flex that advanced muscle. It was so fun to watch such a smart and talented student go through a productive struggle.
Grammar is only a problem when it’s a problem.
If I have been underrating my students’ performance all year, it’s because I thought the quantity of their grammatical errors prevented them from having the requisite accuracy to advance to the next level. The AAPPL raters did not seem to mind the grammar mistakes. This was refreshing, actually. They didn’t need tenses or verb endings to be perfect as long as they could understand who was doing the action and when.
Grammar is only a part of accuracy, along with spelling and pronunciation. Accuracy is not assessed by grammarians. At the novice level, accuracy means that very sympathetic speakers can understand. At the intermediate level, regular sympathetic speakers can understand. At the advanced level, any speaker can understand. Nothing about “speaker can understand” indicates perfection or even close to perfection in terms of grammar and spelling. Grammar only becomes a problem if it hinders comprehensibility. Case in point: “Tomorrow I went to grandma’s house.” In this sentence I really don’t understand if the speaker has already gone there or if she will go tomorrow. Compare to “Yesterday I goed to grandma’s house.” There is a cringeworthy grammar error in there, but it does not impede comprehensibility in any way. I know who did what and when they did it, and I don’t even have to think to figure it out. All that to say, grammar is only a problem when it’s a problem. In the first sentence, the grammar is a problem (or the word “tomorrow” is), because I really can’t figure out when someone is going to grandma’s house. The grammar is not problematic in the second sentence, because I understand it, and I would understand it even as a non-sympathetic speaker.
To conclude, I return to the three essential questions.
Are our students meeting targets?
In levels 3 and 5, OWL students are consistently meeting targets. Non-OWL students are not meeting most of the targets in level 3. Level 7 students come up a bit short of the targets in most skills. This informs us, though, that Intermediate low is a very realistic target for Level 3, Intermediate-mid is a realistic target for Level 5, and Intermediate-high is a challenging but realistic target for Level 7. Now that we know that our Level 7 students are ready for advanced prompts, I am excited to see if we can grow them quite a bit more next year.
Are the grades we are giving similar to the scores AAPPL raters give?
No, they are not. We are grading far more strictly than the AAPPL raters did. One of the great things about the AAPPL score reporting is that it not only shows each student’s score, but it also provides the writing or speaking sample that the student produced. I can read my student’s work, grade it myself, and see if I get the same score that AAPPL gave. That exercise alone could constitute days of really useful professional development for my department.
Are OWL students demonstrating higher levels of proficiency?
The data shows a distinct positive correlation between OWL methodology and student performance.
Assessing the assessment
If anyone is on the fence about administering AAPPL, I hope that they will take the plunge. World Languages teachers rarely get such scientific feedback as to how well their students are progressing. This test provides real data from outside sources that can actually inform my instruction. AP does not give me that level of feedback. My own assessments do not provide the objectivity or the scientific rigor that I need to make real course corrections. AAPPL has been a very positive experience, and I would highly recommend it to any other teacher.