Testing quality of different live subtitling methods: a Spanish into Italian case study
By inTRAlinea Webmaster
Abstract
Interlingual Live Subtitling (ILS) has its foundations in subtitling for the d/Deaf and Hard of Hearing (SDH) and Simultaneous Interpreting (SI), and is situated at the crossroads between Audiovisual Translation (AVT) and SI (Romero-Fresco & Alonso 2022), providing access to the audio content of live events and TV programmes for audiences with and without hearing impairments and disabilities (Romero-Fresco 2018). It is currently achieved through different approaches that require human-mediated translation and automatic language processing systems to different extents; among these, respeaking is attracting growing interest. This paper begins with an introduction to interlingual respeaking, a technique in which a transcript in the form of subtitles is produced by a respeaker working with speech recognition software and which, according to Carlo Eugeni and Francesca Marchionne (2014), derives from SI itself. The article presents a Spanish to Italian training proposal in interlingual respeaking, followed by the results of an experiment (Pagano 2022a) comparing interlingual respeaking with four other ILS workflows involving different degrees of human-machine interaction: SI and intralingual respeaking, SI and Automatic Speech Recognition (ASR), intralingual respeaking and Machine Translation (MT), and ASR and MT. Final results on the quality of the Live Subtitling (LS) outputs of the experiment cover their linguistic accuracy, assessed with a model by Pablo Romero-Fresco and Franz Pöchhacker (2017), and their broadcast delay. Each of the five methods is then described, pointing out its strengths and weaknesses, to shed some light on which may be best suited to providing high-quality ILS for live events.
Keywords: media accessibility, interlingual respeaking, simultaneous interpreting, automatic speech recognition, Machine Translation
©inTRAlinea & inTRAlinea Webmaster (2025).
"Testing quality of different live subtitling methods: a Spanish into Italian case study"
inTRAlinea Special Issue: Media Accessibility for Deaf and Blind Audiences
Edited by: Carlo Eugeni & María J. Valero Gisbert
This article can be freely reproduced under Creative Commons License.
Stable URL: https://www.intralinea.org/specials/article/2678
1. Introduction
This research deals with Media Accessibility (MA) for accessible multilingual communication, covering access provided through respeaking, a form of diamesic translation (Gottlieb 2007) that transfers an audio input into a written output and, in its interlingual form, from one language into another. Interlingual respeaking addresses not only sensory barriers, as SDH does, but also linguistic barriers, where cross-cultural factors need to be handled in a manner similar to SI. Integrating Pablo Romero-Fresco’s definition of respeaking, Hayley Dawson (2020) defines it as:
a technique in which one listens to the original sound of a (live) programme or event in one language and respeaks (interprets) it in another language, including punctuation marks and some specific features for an audience who cannot access the sound in its original form, to a speech recognition software, which turns the recognised utterances into text displayed on screen with the shortest possible delay (Romero-Fresco 2011: 1).
Respeaking is a technology-enabled hybrid modality of translation (Davitti & Sandrelli 2020) that shares common ground with SI in terms of the skills and competences required (Russello 2013), and with subtitling, as it was initially carried out only by simultaneous interpreters and subtitlers (Eugeni & Mack 2006; Szarkowska et al. 2018; Romero-Fresco & Eugeni 2020).
The second section of this paper (§2) briefly introduces Human-Machine Interaction (HMI), while §3 covers assessment in ILS, presenting the NTR model for linguistic accuracy. §4 outlines the Spanish to Italian course proposal in interlingual respeaking designed to train participants for the experiment, while §5 defines the research method, with emphasis on participants, tools, and materials. Finally, the results of the pilot study are presented (§6), followed by some concluding remarks (§7).
2. Human-Machine Interaction in Interlingual Live Subtitling
This pilot study focuses on ILS produced through different methods that require different degrees of HMI. In 1992, Hewett et al. defined this interdependence with technology as Human-Computer Interaction (HCI), or the implementation of interactive computing systems for human use. As explained by Eugeni (2019), in many fields today we see a rise in technological rather than human intervention, with the human contribution gradually being replaced by automated processes. In some fields where ASR and Natural Language Processing (NLP) systems are used, for example, “technological evolution is reducing the place of humans in this interaction to such an extent, that their profession could hardly be possible without it” (Eugeni 2019: 873). Three categories of HCI can be distinguished in LS (Eugeni 2019), applicable to both intra- and interlingual subtitling, given that the classification is technology-oriented (Pagano 2020a):
- Computer-Aided ILS, in which an interpreter performs SI and a live subtitler transcribes what is being said: a human carries out the job and a machine assists them;
- Human-Aided ILS, the reverse of Computer-Aided ILS: a machine carries out the job and humans assist it by editing the transcript;
- Fully-Automated ILS, without any human assistance. For intralingual LS, an ASR system produces the transcript. For ILS, an ASR system recognises the audio input in one language and is connected to MT software that produces the subtitles in a second language (a minimal sketch of this fully automated pipeline is given below).
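To make the contrast between these degrees of HMI more concrete, the sketch below outlines the fully automated workflow as a simple ASR-to-MT chain. It is a minimal illustration only: the function names (recognise_speech, machine_translate, fully_automated_ils) and their placeholder bodies are hypothetical and do not correspond to the actual software used in this study.

```python
# A minimal sketch of a fully-automated ILS workflow as an ASR-to-MT chain.
# The components are hypothetical stand-ins, not the tools used in this study
# (Dragon NaturallySpeaking and Google Translate).

def recognise_speech(audio_chunk: bytes) -> str:
    """Stand-in for an ASR engine returning source-language text."""
    return "texto reconocido en español"   # placeholder output

def machine_translate(text: str, src: str = "es", tgt: str = "it") -> str:
    """Stand-in for an MT engine returning target-language text."""
    return "testo tradotto in italiano"    # placeholder output

def fully_automated_ils(audio_stream):
    """Chain ASR and MT with no human in the loop: each recognised segment
    is immediately translated and cued as a subtitle."""
    for audio_chunk in audio_stream:
        source_text = recognise_speech(audio_chunk)   # ASR stage
        yield machine_translate(source_text)          # MT stage -> subtitle on screen

# Example: three dummy audio chunks produce three subtitles.
for subtitle in fully_automated_ils([b"", b"", b""]):
    print(subtitle)
```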
3. Assessment in ILS: linguistic accuracy
Different models have been designed and tested over the years to assess accuracy in ILS (Robert & Remael 2017), varying according to the method by which subtitles are produced. One example is the conceptual IRA (Idea Rendition Assessment) model by Eugeni (2017), based on the main distinction between rendered and non-rendered ideas in the Target Text (TT), and therefore operating at a communicative and conceptual level of analysis based on ideas rather than taking formal errors into account. Other assessment models have been proposed, such as the WIRA (Weighted-Idea-Rendition-Assessment) model (Eichmeyer-Hell, forthcoming) and the NERLE model (Moores 2023).
In this research, the NTR model (Number of words, Translation errors, Recognition errors) by Romero-Fresco and Pöchhacker (2017) is used to analyse the sets of subtitles created through the different methods. The model was first applied to test the feasibility of interlingual respeaking (Dawson 2020), and then to different ILS outputs within the experiments of the SMART (Shaping Multilingual Access Through Respeaking Technology) project[1] (Sandrelli 2020). The model can be applied to assess subtitles produced in and from any spoken language pair (not sign languages). As shown in Figure 1 below, the calculation consists of subtracting the sum of translation errors (T) and recognition errors (R) from the total number of words in the subtitles (N), dividing by N again, and then multiplying by 100; the reference threshold for a subtitle to be deemed acceptable is 98%. In addition, the model takes into account Effective Editions (EEs), namely reformulations that can be deemed strategic, i.e. that do not lead to a loss of information or content.
Figure 1. The NTR model with error type categorizations (Romero-Fresco & Pöchhacker 2017)
In this model, translation errors refer to the errors made by the respeakers in their performance, while recognition errors refer to misrecognitions by the Speech Recognition (SR) software. Translation errors are divided into error types based on content or form, the former accounting for omissions, additions, and substitutions, the latter for style and correctness errors. Error grading is threefold: minor errors (0.25), major errors (0.5), and critical errors (1). This grading applies to both translation and recognition processes. Minor errors cause only a small loss of content and do not impact comprehension; they are mainly recognition errors relating to punctuation, spelling, and the misrecognition of small words such as prepositions or articles. A major error deprives the reader of part of the content without them noticing: major translation errors include the omission of a full independent idea unit, while major recognition errors are those the reader can identify (either a misrecognition resulting in a nonsensical word, or a real word that can clearly be deemed a misrecognition because it makes no sense in the context). Finally, critical errors introduce new content unrecognisable as such by the reader, constituting misleading information. While recognition errors (R) have no subcategories, translation errors (T) can relate to both content and form: content errors account for omissions, additions, and substitutions, while form correctness errors refer to errors of grammar in the TT and form style errors to errors caused by unnatural translation or changes in register. Lastly, effective editions (EE) are omissions or substitutions (reformulation, generalisation, etc.) that do not lead to a loss of content. An EE is identified when the Source Text (ST) has been modified to a certain extent without losing relevant information and can therefore be considered an improvement, for example when the respeaker omits features of orality or reformulates something to make it more readable.
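For illustration, the following sketch restates the NTR calculation in code, assuming the error weights just described (minor 0.25, major 0.5, critical 1). The figures in the example are invented, and the function is not the tool used for the analyses reported below.

```python
# A minimal sketch of the NTR accuracy rate, using the model's error weights.
# Effective editions (EE) are noted in the model but carry no penalty here.

ERROR_WEIGHTS = {"minor": 0.25, "major": 0.5, "critical": 1.0}

def ntr_accuracy(n_words: int, translation_errors: dict, recognition_errors: dict) -> float:
    """Return the NTR accuracy rate in percent.

    translation_errors and recognition_errors map a severity
    ('minor', 'major', 'critical') to the number of errors of that severity.
    """
    t = sum(ERROR_WEIGHTS[sev] * count for sev, count in translation_errors.items())
    r = sum(ERROR_WEIGHTS[sev] * count for sev, count in recognition_errors.items())
    return (n_words - (t + r)) / n_words * 100

# Toy example: 300 subtitle words with a handful of errors of each severity.
score = ntr_accuracy(300,
                     {"minor": 4, "major": 1, "critical": 1},
                     {"minor": 3, "major": 0, "critical": 0})
print(f"{score:.1f}%")   # 98.9% in this toy example, above the 98% threshold
```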
3.1 About delay and the concept of ‘quality’
‘Quality’ is not always a straightforward concept to pin down. Judgments of ‘high quality’ or ‘low quality’ are heavily driven by subjective elements such as personal opinions and preferences, and are therefore difficult to standardise with clear-cut characteristics. As with translated and interpreted products, quality assessment in LS cannot be restricted to mere linguistic accuracy, but includes other factors (Pöchhacker 2013: 34) such as delay. In SI, delay plays an important role in assessing the overall quality of the output. The important link between accuracy and delay is underlined by Romero-Fresco:
[T]he interplay between accuracy and delay constitutes an intrinsic part of live subtitling and is often described as a trade-off: launching the subtitles without prior correction results in smaller delays but less accuracy, while correcting the subtitles before cueing them on air increases accuracy but also delay. (Romero-Fresco 2019b: 99)
Delay in subtitling is often referred to as ‘latency’, namely “the delay between the source speech and real-time target text, [which] will also vary in relation to the output delivery and degree of editing” (Davitti & Sandrelli 2020: 105). As described by Dan McIntyre et al. 2018 (in Moores 2020):
there is an inherent delay or latency in respeaking between a word being spoken and appearing on screen in a subtitle; this results from the time needed for the spoken word to be heard, respoken, recognised and processed through the subtitling software and onto the screen.
Therefore, the analysis of the experiment results sought an overall quality assessment that takes into account not only linguistic accuracy but also latency, in order to guarantee the most comprehensive assessment possible.
Having outlined the theoretical framework that lies behind the technique of interlingual respeaking, the following § 4 is dedicated to presenting the training modules of the course.
4. Training course proposal in interlingual respeaking
To train participants for the experiment in intra- and interlingual respeaking, they followed a Spanish to Italian respeaking course offered at the University of Genoa in the academic year 2021/22 (Dawson 2019; Dawson & Romero-Fresco 2021; Pagano 2022b; forthcoming). The training lasted 70 hours over a three-month period and included synchronous distance learning and individual practical exercises. Some materials were taken and adapted from the ILSA (Interlingual Live Subtitling for Access) course, a three-year Erasmus+ project co-financed by the European Union (2017-2020) whose key objective was to bridge the gap between intra- and interlingual live subtitling as recognised professional practices (Robert et al. 2019b) by identifying the profile of the interlingual live subtitler and developing the first training course on ILS.
As shown in Figure 2 below, the training at the University of Genoa comprised a theoretical introduction to MA and pre-recorded subtitling, a review of SI with preparatory exercises, an introduction to the use of ASR systems and, finally, intralingual and interlingual respeaking practice.
Figure 2. An overview of the Spanish to Italian interlingual respeaking course.
In more detail, Module 1 provided an introduction to MA and AVT, and an overview of the fundamentals of subtitling (condensation, line breaks, characters per line, subtitle duration, etc.) and of SDH. Module 2, on SI preparatory exercises, consisted of counting, shadowing, paraphrasing and summarising, and sight translation. Module 3 focused on dictation practice and presented the ASR software also used during the experiment, Dragon Naturally Speaking, working on pronunciation, sentence segmentation, punctuation dictation, and error editing. The final Modules 4 and 5 were dedicated to practice in both intra- and interlingual respeaking.
5. The experiment
5.1 Participants
According to several scholars, interpreters and translators are the preferable candidates to be trained as professionals in intra- and interlingual respeaking, thanks to their background and skills (Russello 2010; Romero-Fresco 2012; Pagano 2022b). The participants in the experiment were, indeed, five students of the master’s degree in Translation and Interpreting at the University of Genoa, the same students trained through the Spanish to Italian respeaking course. Four females and one male, aged between 21 and 23, they all had previous, albeit introductory, experience in subtitling and SI. They were all Italian native speakers with Spanish as their B language, at C1 proficiency level according to the CEFR. None of them had any knowledge of or training in LS or respeaking before the course. Since their performances were recorded on screen, they were informed in advance and signed an authorisation form in line with ethical procedures[2].
5.2 Tested methods
The aim of the experiment was to test the participants’ performance in five different modes of ILS by comparing their outputs (see also Dawson 2021; Romero-Fresco & Alonso-Bacigalupe 2022). The methods and the roles assigned to each participant in the experiment are presented in Table 1.
| | Tested methods | Roles |
|---|---|---|
| 1 | Interlingual respeaking | Participant A – interlingual respeaker ES>IT |
| 2 | SI + Intralingual respeaking | Participant B – simultaneous interpreter ES>IT; Participant C – intralingual respeaker IT>IT |
| 3 | SI + ASR | NA (same Participant B – simultaneous interpreter ES>IT) |
| 4 | Intralingual respeaking + MT | Participant D – intralingual respeaker ES>ES |
| 5 | ASR + MT | NA |

Table 1. Participants and roles needed for each tested method.
On a scale of different degrees of HMI, Method 1 is the most human-centred mode, while Method 5 is machine-centred and fully automated. The experiment was conducted on completion of the training, during which each participant had been trained in SI and in intra- and interlingual respeaking. The participant who was most fluent in SI during the training was assigned the role of the interpreter, the one who performed best in interlingual respeaking acted as the interlingual respeaker, and the one who scored lowest in interlingual respeaking was given the intralingual IT>IT respeaking role. Due to the lack of Spanish native speakers among the participants, Participant D carried out the ES>ES intralingual respeaking task despite not being a native speaker. Role assignment was based on the assessment of performances during the training and on formative assessment, as well as on the participants’ self-evaluation and self-confidence in performing one task or the other. It is important to note that, while there were five participants in the experiment, only data from four of them were taken into account in the final analysis: two people carried out the Spanish to Spanish intralingual respeaking task, and only the better performance was analysed.
5.3 Tools
As for tools, the ASR system used in all the methods was Dragon Naturally Speaking, version 15, the same software the students were trained with during the course. In Methods 4 and 5, the MT engine used was Google Translate, the neural machine translation service developed by Google and one of the most widely used worldwide. Similar experiments conducted in parallel with this pilot study (Romero-Fresco & Alonso-Bacigalupe 2021, 2022; Dawson 2021) also used Google Translate, so it was chosen here too in order to allow comparison of results, even across different language pairs.
During the experiment, participants also recorded their screens with FlashBack Recorder, exporting all the files necessary for the analysis and the delay calculation. Delay was calculated according to the Spanish Norma UNE 153010: 2012 for Methods 1 and 2, which consists of choosing one sentence ending per minute of the source video and calculating the lag between the moment a specific utterance was spoken and the moment it was displayed as a subtitle on screen (Romero-Fresco & Alonso-Bacigalupe 2022).
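As a simple illustration of this measurement, the sketch below averages the lag over one sampled sentence ending per minute of video. The timestamps are invented; in the experiment the corresponding values were read off the FlashBack screen recordings.

```python
# Minimal sketch of the delay measurement applied to Methods 1 and 2, following
# the one-sentence-ending-per-minute sampling described above.

def average_delay(spoken_times: list[float], displayed_times: list[float]) -> float:
    """Average lag, in seconds, between each sampled sentence ending in the
    source video (spoken_times) and the moment the corresponding text appears
    as a subtitle on screen (displayed_times)."""
    lags = [shown - spoken for spoken, shown in zip(spoken_times, displayed_times)]
    return sum(lags) / len(lags)

# Hypothetical example for a 3-minute video: one sampled sentence ending per minute.
spoken = [55.0, 118.0, 176.0]    # seconds into the source video
shown = [62.2, 126.3, 181.5]     # seconds at which the subtitle appears
print(round(average_delay(spoken, shown), 1))  # 7.0 seconds on average
```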
5.4 Materials
Each participant was asked to interpret, or to respeak intra- or interlingually, three short video chunks, as detailed in Table 2 below, according to their role. One participant performed as an interlingual respeaker, one as a simultaneous interpreter, one as an intralingual respeaker ES>ES, and another IT>IT. The texts were extracts from original speeches in Spanish that were not altered (re-read, accelerated, or adapted) and featured low levels of information density. The three text chunks were deliberately chosen with different variables – speech rate and difficulty of the topic – in order to observe how the five different methods handled the different text typologies.
| Title | Duration | Number of words | Words per minute (wpm) |
|---|---|---|---|
| Discurso del Papa Francisco para “El día de la Tierra[3]” | 00:04:20 | 424 | 98 |
| Discurso final como presidente de Mariano Rajoy a la Cámara[4] | 00:01:24 | 208 | 139 |
| Presentación del dictamen sobre el marco jurídico de las comunicaciones electrónicas[5] | 00:02:35 | 318 | 126 |

Table 2. Experiment material information.
The first speech had a slow speech rate and was given to the participants as a warm-up. It was a speech delivered by Pope Francis on the subject of Earth Day, mentioning climate change and Covid. The Pope speaks native Argentine Spanish, which differs from the Castilian Spanish the students were used to; nevertheless, the speech is delivered clearly, with good intonation and articulation. The second chunk was by former Spanish President Mariano Rajoy, addressing the Senate. It was faster, but shorter, and did not feature any specific terminology. The third and final video was taken from the Speech Repository of the European Commission[6] and was a video by the European Economic and Social Committee, this time featuring a higher density of information and institutional terminology such as dictamen (appraisal), marco jurídico (legal framework), directiva marco (framework directive), propuesta de reglamento (proposal for a regulation), autoridad (authority), gestión del espectro (spectrum management), and separación funcional (functional separation). Participants attempted their task – interlingual or intralingual respeaking or simultaneous interpreting, according to their roles – only once, and they were not given the opportunity to watch the videos beforehand. The experiment yielded a total of 15 outputs: 5 (one per tested method) for each of the 3 videos.
6. Results and discussion
Table 3 below shows the results of the testing for the three videos. The final NTR percentage scores for each of the five methods are displayed individually and then averaged.
| Methods | ST1 Pope Francis’ speech | ST2 Mariano Rajoy’s speech | ST3 European Economic and Social Committee | NTR % average score |
|---|---|---|---|---|
| Method 1 Interlingual respeaking | 96.1% (0/10) | 97.3% (3.3/10) | 97.9% (4.8/10) | 97.1% (2.8/10) |
| Method 2 SI + intralingual resp. | 99.1% (7.8/10) | 98.7% (6.8/10) | 98% (5/10) | 98.6% (7.5/10) |
| Method 3 SI + ASR | 97% (2.5/10) | 98% (5/10) | 96.6% (1.5/10) | 97.2% (3/10) |
| Method 4 Intralingual resp. + MT | 95.3% (0/10) | 97.7% (4.3/10) | 98.5% (6/10) | 97.2% (3/10) |
| Method 5 ASR + MT | 94.4% (0/10) | 96.7% (1.8/10) | 95.1% (0/10) | 95.4% (0/10) |

Table 3. NTR scores.
Together with the percentage score, each output was attributed a grade on a 10-point scale linked to a descriptive classification of the performance (Dawson, 2020), as shown in Table 4.
| Accuracy % | 10-point scale | Classification |
|---|---|---|
| < 96 | 0/10 | Unclassified |
| 96.4 | 1/10 | Very poor |
| 96.8 | 2/10 | Poor |
| 97.2 | 3/10 | Poor |
| 97.6 | 4/10 | Satisfactory |
| 98.0 | 5/10 | Satisfactory |
| 98.4 | 6/10 | Good |
| 98.8 | 7/10 | Good |
| 99.2 | 8/10 | Very good |
| 99.6 | 9/10 | Excellent |
| 100 | 10/10 | Exceptional |

Table 4. Classification of performances in reference to the NTR model.
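As an illustration only, the helper below restates Table 4 as a banded lookup from accuracy percentage to grade and label. The intermediate grades reported in Table 3 (e.g. 2.8/10) suggest that, in practice, scores were interpolated between bands; that refinement is not reproduced here.

```python
# Banded reading of Table 4: map an NTR accuracy percentage to (grade, label).

BANDS = [  # (minimum accuracy %, grade, classification)
    (100.0, 10, "Exceptional"), (99.6, 9, "Excellent"), (99.2, 8, "Very good"),
    (98.8, 7, "Good"), (98.4, 6, "Good"), (98.0, 5, "Satisfactory"),
    (97.6, 4, "Satisfactory"), (97.2, 3, "Poor"), (96.8, 2, "Poor"),
    (96.4, 1, "Very poor"), (96.0, 0, "Unclassified"),
]

def classify(accuracy: float) -> tuple[int, str]:
    """Return the (grade, label) pair of the highest band reached."""
    for threshold, grade, label in BANDS:
        if accuracy >= threshold:
            return grade, label
    return 0, "Unclassified"   # anything below 96%

print(classify(98.6))   # (6, 'Good') under this banded reading of the table
```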
While testing the NTR model for linguistic accuracy, subjectivity in error grading proved to be a crucial issue (Romero-Fresco & Pöchhacker 2017). One way to minimise it is to have more than one evaluator carry out the assessment (inter-annotators), so as to monitor any discrepancies between different people’s judgments. In this sense, NTR analyses for all the outputs of the experiment were first carried out by the participants in a self-evaluation process and revised in a peer-evaluation phase by the other course students. Only then did the trainer of the course and author of this contribution act as first evaluator; afterwards, a second inter-annotator with previous knowledge and experience of assessment with the NTR model reviewed all the analyses again. When a discrepancy in error grading between the first evaluator and the inter-annotator was detected (which happened rarely), they agreed on the severity to assign to the error after discussing its impact.
Pre- and post-experiment questionnaires were administered to participants. In the former, all participants expressed some anxiety about the forthcoming experiment, but they also stated that they felt adequately prepared for the task. In the latter, when asked how difficult they found the test as a whole (all three videos), two participants answered ‘neither easy nor difficult’ and the other three ‘difficult’. They were also asked how difficult they found each text chunk: for the speech by the Pope, all responded between ‘neither easy nor difficult’ and ‘very easy’; the speech by Mariano Rajoy was rated ‘very difficult’; and for the speech “Marco jurídico de las Comunicaciones electrónicas”, one person rated it ‘easy’, two ‘neither easy nor difficult’, and two ‘difficult’, showing a divergence of opinions on the matter. For all the videos, the participants deemed neither the topics nor the terminology to be complex.
6.1 NTR analyses for linguistic accuracy
The final corpus with all the subtitles from the five methods for each of the three videos amounted to 4,200 words, with 258 errors in total. Of these, 45.7% (118) were translation errors, and 54.3% (140) were recognition errors. Out of translation errors, 57.6% were minor, 18.7% were major, and 23.7% critical. Out of recognition errors, 71.4% were minor, 15% major, and 13.6% critical.
Translation and recognition errors for each method are now briefly outlined to observe their frequency. Translation errors in Methods 1, 2, and 3 were imputable only to the interpreter or the respeaker, while in Methods 4 and 5 they were imputable either to the respeaker or to the MT tool used, or they originated as recognition errors by Dragon.
6.1.1 Translation errors
Before moving on to the detail of error types and their frequency in the following sections, Figure 3 below gives an overview of the different errors across the five methods. The trend shows content omission errors (in blue) occurring most frequently as minor errors, followed by major ones. Content additions (in orange) are almost non-existent, while content substitutions (in gray) pertain almost exclusively to critical errors. Errors of form, both in style and in correctness, are confined to minor severity with only one exception.
Figure 3. Overview of the most frequent translation error typologies according to severity
The participant using Method 1 described it as the most challenging – understandably, as unlike all the other participants they had to perform two new tasks at once. Still, the results of the analyses for this method are promising, since an average of 97.1% was achieved. Translation errors detected totalled 30 (16 minor, 7 major, and 7 critical). Interestingly, Method 2 (SI + intralingual respeaking) ranked first in linguistic accuracy, far outperforming the other methods, with a total of 19 translation errors detected: 11 minor, 5 major, and 3 critical. Across Method 3 (SI + ASR), 10 translation errors were counted: 6 minor, 4 major, and no critical, ranking second in accuracy. Translation errors for Method 4 totalled 35, 10 of which were made by the Google Translate MT system (8 minor and 2 critical, with no major errors detected). The fully automated Method 5 (ASR + MT) totalled 24 translation errors.
Methods 4 and 5, in which the final subtitle was the product of an automatic MT process, showed some curious examples of mistranslations and misrecognitions, which will now be illustrated[7]. Examples 1-3, taken from Method 4, display the Spanish ST, the Spanish to Spanish respoken text (RT), the text recognised by Dragon (ASR), and the final translation produced by the MT. The examples from Method 5 display the ST, the ASR text, and the TT, namely the final output produced by the MT. A back translation of the final MT output from Italian into English is provided by the author – please note that uncommon or grammatically incorrect English expressions are deliberately retained to preserve traces of the machine-translated input. No back translation is provided for the other utterances, because the focus is on the final subtitle only, namely what the audience would read.
Example 1 (Method 4):
ST: “De una crisis no se sale igual. Salimos mejores o peores”.
RT: “De una crisis no se sale igual o se sale mejor o peor”.
ASR: “La Cristina se sale igua o se sale mejor o peor”.
MT: Cristina esce lo stesso o esce meglio o peggio.
Back translation: Cristina goes out any way either she goes out better or worse.
In this example, taken from text 1 (speech by Pope Francis), the ASR software transcribes a sentence that does not make any sense, misrecognising the dictation of ‘de una crisis’ as ‘La Cristina’, and therefore displaying a proper name that, despite not seeming correct in the given context, could confuse the reader to some extent.
Example 2 (Method 4):
ST: “Gracias a todos y de manera muy especial a mi partido sin el cual nada hubiera sido posible”.
RT: “Gracias a todos y de manera muy especial a mi partido sin el cual nada había sido posible”.
ASR: “Gracias a todos y de manera muy especial a mi partido sin el cual nada había sido posible”.
TT: “Grazie a tutti e in modo specialissimo alla mia festa senza la quale nulla sarebbe stato possibile”.
Back translation: Thanks to everybody and in a super special way to my party without her anything would have been possible.
In Example 2, taken from text 2 (speech by former President Mariano Rajoy), both the respoken text and the transcription worked well; the MT output, however, features firstly an issue of style (“de manera muy especial” rendered as “in modo specialissimo”, which reads as informal in Italian for a speech given in a parliamentary session), and secondly the mistranslation of “partido”: the English back translation as “party” may mean either a political party or a celebration, but this ambiguity does not exist in Spanish or Italian. It may be assumed that this derives from the Google Translate model first translating the Spanish utterance into English rather than directly into Italian. One positive aspect, however, is that when the respeaker poorly dictated a past conditional sentence using “había” instead of “habría”, the software corrected it, displaying the correct grammatical structure in the Italian TT.
Example 3 (Method 4):
ST: “Y vamos a ser más resilientes cuando trabajemos juntos en lugar de hacerlo solos”.
RT: “Vamos a ser más resilientes cuando trabajemos juntos en lugar de hacerlo solos”.
ASR: “Vamos a ser madres hirientes cuando trabajemos juntos en lugar de hacerlo solos”.
TT: “Saremo madri dolorose quando lavoreremo insieme invece che da sole”.
Back translation: We will be grieving mothers when we will work together instead of alone.
In this last example, the misrecognition by Dragon leads to a critical error, since it creates a new meaning for the reader that was not in the ST. Interestingly, the MT, despite having translated the second half of the sentence correctly, changed everything into the feminine plural, consistently with what had been wrongly recognised as “madres hirientes”, rather than keeping the masculine plural implied by the ST.
Example 4 (Method 5):
ST: “En resumen, la pandemia del COVID nos ha enseñado esta interdependencia […]”.
ASR: “En resumen, la pandemia del obispo nos ha enseñado esta interdependencia […]”.
TT: “Insomma, la pandemia del vescovo ci ha insegnato questa interdipendenza […]”.
Back translation: To sum up, the bishop pandemic taught us this interdependence.
In Example 4, taken from text 1, the word ‘Covid’ is misrecognised as “obispo” (bishop). While it would probably be clear that a ‘bishop pandemic’ is not the intended meaning, in this case it is the Pope speaking, so semantically he could conceivably be referring to bishops, and the output can be misleading.
Example 5 (Method 5):
ST: “Si alguien se ha sentido, en esta cámara o fuera de ella, ofendido o perjudicado le pido disculpas. Gracias a todos”.
ASR: “Si alguien se ha sentido en esta cámara oscura, bello, ofendido o perjudicado le pido gracias a todos”.
TT: “Se qualcuno si è sentito bello, offeso o leso in questa camera oscura, chiedo grazie a tutti”.
Back translation: If someone felt good looking, offended, or damaged in this darkroom, I ask thanks to everyone.
There are several problems in this last example, taken from the speech by Mariano Rajoy. First, “o fuera”, together with the word “cámara” (referring to the Parliament), is misrecognised as “oscura”, yielding “cámara oscura” (darkroom) – which makes very little sense as a photographic term, and could even be read as a ‘shady room’, which would be particularly misleading given the diplomatic setting in which the speech is given. Secondly, “de ella” is misrecognised as “bella” (good looking), and then turned into a masculine form by the MT. Lastly, the ST “disculpas” is omitted, which is an exception here, since the machine rarely misses parts of speech entirely. Together with the non-recognition of the full stop at the end of the sentence, the beginning of the next sentence is linked to the previous one, resulting in misinformation: “pedir gracias” should be “decir gracias”; it does not sound natural, but it could be read as a stylistic error rather than as a double omission (of “disculpas” and of the punctuation mark). Here the subtitle displays a worrying mistranslation.
6.1.2 Recognition errors
Recognition errors for the five tested methods were attributable to the ASR software; a total of 140 errors were detected.
In Method 1, 9 recognition errors were found: 5 were minor, 3 were major, and 1 was critical.
In Method 2, 6 recognition errors were found across the three videos: 5 minor and 1 major. Method 3 registered the highest number of recognition errors among the human-mediated methods, i.e. 57 in total: 51 minor, 2 major, and 4 critical. In Method 4 there was just 1 critical recognition error, while for Method 5, as with Method 3, many more recognition errors were detected, since there was no human intervention either in the dictation to the SR software or in the monitoring of the written output, as can be seen in Figure 4 below. A total of 61 recognition errors were detected for the fully automated mode: 36 minor, 12 major, and 13 critical.
Figure 4. Recognition errors in the five methods.
Recognition errors were mainly imputable to missing or misplaced punctuation marks throughout the subtitles. As a result, in Methods 3 and 5, where punctuation was not dictated by a human but left to the speech recognition stage, it is in some parts non-existent. In Method 3 alone, approximately 80% of detected errors were punctuation-based, and of the 5 critical errors in Methods 3 and 5, 3 were punctuation errors, as in the following examples.
Example 6 (Method 3):
ST: “Es el momento de actuar, estamos en el límite. Quisiera repetir un dicho viejo español”.
TT: “È il momento di agire perché siamo al limite secondo un vecchio detto spagnolo”.
Back translation: It is the moment to act because we have no time left according to an old Spanish saying.
Example 7 (Method 3):
ST: “Señoras y señores diputados, seré muy breve. A la vista de lo que todos sabemos […]”.
TT: “Signore, signori deputati, cercherò di essere molto breve come tutti sappiamo […]”.
Back translation: Ladies and gentlemen MPs, I will try to be concise as we all know.
Example 8 (Method 5):
ST: “[…] es bueno recordar cosas que nos decimos mutuamente para que no caigan en el olvido. Desde hace tiempo estamos tomando más conciencia […]”.
TT: “è sempre bene ricordare le cose che ci diciamo nell'oblio di quel tempo stiamo diventando più consapevoli…”.
Back translation: It is always good to remember things that we say in the oblivion of that time we are becoming more and more aware.
In this last example in particular, the sentence does not flow and it is clear that there was a misrecognition, since “oblio” (oblivion) is not relevant there. The lack of a full stop makes it seem as though “desde hace tiempo” refers to the previous sentence, thus giving it a different meaning.
6.2 Delay
The translation process is made up of translated idea units rather than sentences. To calculate delay, each rendered idea unit could in principle be considered, calculating how many seconds after an idea is uttered in the ST it appears on screen as a subtitle. According to the Norma UNE (ibid.), however, an average is to be taken over one sentence ending (idea-unit ending) per minute of video. Delay was therefore calculated by choosing one sentence ending per minute in each ST and assessing the lag between that sentence ending and the moment the relevant text first appeared on screen as a subtitle. For Methods 3, 4, and 5, delay calculations were speculative, since the whole process was split into different stages during the pilot study. For Method 3, it was only possible to calculate the delay of the simultaneous interpreter while translating, as their .mp3 recording was afterwards fed into Dragon rather than processed live. This was due to the fact that Dragon had no option for adding automatic punctuation, so a fully live output would have been a meaningless non-stop flow of words. An average of 1 extra second was added to the interpreter’s delay, corresponding to the time the ASR software usually takes to process the audio input and produce the written text. ASR usually takes less time than this, but it does take a little longer when longer utterances are dictated, as is often the case in SI. For Method 4, the intralingual respeaker’s delay was calculated and an average of 1 extra second was added for the time the MT software usually takes to produce the translated output. Lastly, for Method 5, an average of 2 seconds for the ASR and 1 extra second for the MT were assumed. It should be borne in mind that processing can sometimes take less time, and that this was a first attempt at comparing delays without, unfortunately, carrying out the five methods fully live. In any case, the estimated machine-processing delays in Methods 4 and 5 were consistently much shorter than the human-related delays, by a substantial margin.
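Purely as a back-of-the-envelope restatement of these estimates, the sketch below adds the assumed processing offsets to a measured human delay. The offsets are the averages assumed in this pilot, not values measured live.

```python
# Speculative delay estimates for the methods that were not run fully live.
ASR_OFFSET = 1.0   # assumed average ASR processing time after SI (seconds)
MT_OFFSET = 1.0    # assumed average MT processing time (seconds)
ASR_ONLY = 2.0     # assumed ASR time in the fully automated chain (Method 5)

def method3_delay(interpreter_delay: float) -> float:
    """SI + ASR: measured interpreter delay plus the assumed ASR offset."""
    return interpreter_delay + ASR_OFFSET

def method4_delay(respeaker_delay: float) -> float:
    """Intralingual respeaking + MT: measured respeaker delay plus the MT offset."""
    return respeaker_delay + MT_OFFSET

def method5_delay() -> float:
    """ASR + MT, fully automated: assumed ASR and MT offsets only."""
    return ASR_ONLY + MT_OFFSET

print(method5_delay())   # 3.0 seconds, matching the Method 5 estimate in Table 5
```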
Table 5 below shows the delay calculated for each method and each ST, together with the average of the three calculations and the resulting ranking.
| Methods | Texts | Delay average per text (seconds) | Delay average per method (seconds) | Rank |
|---|---|---|---|---|
| Method 1 (Participant A) | Text 1 | 7.2 | 7.0 | 4 |
| | Text 2 | 8.3 | | |
| | Text 3 | 5.5 | | |
| Method 2 (Participant B) | Text 1 | 7.6 | 11.3 | 5 |
| | Text 2 | 10.7 | | |
| | Text 3 | 15.5 | | |
| Method 3 (Participant C) | Text 1 | 4.2 | 5.7 | 2 |
| | Text 2 | 5.3 | | |
| | Text 3 | 4.5 | | |
| Method 4 (Participant E) | Text 1 | 7.4 | 6.8 | 3 |
| | Text 2 | 6.0 | | |
| | Text 3 | 7.0 | | |
| Method 5 | Text 1 | 3.0 | 3.0 | 1 |
| | Text 2 | 3.0 | | |
| | Text 3 | 3.0 | | |

Table 5. Participants’ delay results in seconds.
Concerning delay, an inverse correlation with accuracy was detected: the methods that were faster in producing subtitles were the ones that scored lower accuracy rates, while those that entailed a longer delay were more accurate. In particular, Method 2, which scored highest in linguistic accuracy, was the slowest, since two subjects were involved in the process. The correlation between accuracy and delay is all the more important and is best understood as a ‘trade-off’: more accurate subtitles take more time to be produced (and corrected), while less accurate ones are cued with a much shorter delay.
| Method | Accuracy | Delay |
|---|---|---|
| Interlingual respeaking | Poor | Acceptable |
| SI + intralingual respeaking | Good | Long |
| SI + ASR | Poor | Acceptable |
| Intralingual resp. + MT | Poor | Acceptable |
| ASR + MT | Insufficient | Short |

Table 6. Accuracy and delay results.
7. Conclusions
By combining the linguistic accuracy and delay results for each method, a consistent correlation can be observed: the least accurate method (Method 5) is the fastest in delivering the subtitles, while the most accurate (Method 2) takes longer, with up to 11 seconds of delay. The highest overall quality sought in this experiment is, in fact, determined by the setting and the situation in which the ILS is delivered: the need for faster subtitles comes at the expense of accurate transcription and translation, while better accuracy entails slower subtitles. In terms of overall quality, Methods 1, 3, and 4 can be a good compromise, ranging between 6 and 7 seconds of delay, which seems acceptable especially for live programmes, also according to broadcasters that use LS, such as the BBC. In the trade-off between accuracy and delay, Method 2, despite being slower, does provide a good level of accuracy. Another variable to consider in the search for quality ILS methods is the type of live event setting: Method 3 features too many errors in recognising and placing punctuation marks, making it difficult for audiences to read the subtitles. The more automated Methods 4 and 5 leave the final decision on what content to broadcast, and how, to the machine, which can be very risky in formal and high-stakes situations, especially where mistranslations are concerned. Nevertheless, these are also the cheapest methods, which needs to be considered when evaluating which mode is preferable and most usable.
Several limitations have affected this research, such as the distance setting for the training and testing, its restriction to the Spanish to Italian language combination, and the use of only a few of the available ASR and MT systems. Moreover, subjectivity in the NTR analyses played a major role in the interpretation of the final results (Romero-Fresco & Pöchhacker 2017), despite the attempt to minimise it through the two evaluators’ assessment. The fact that the intralingual respeaking in Spanish for Method 4 was carried out by a non-native speaker is also a variable to take into account, even if – judging from the audio recordings of the participant’s performance – their pronunciation was clear and did not in itself represent an obstacle to the ASR process.
Despite all these limitations, it is hoped that these preliminary results can begin to shed some light on the strengths and weaknesses of each ILS method.
References
Davitti, Elena, Sandrelli, Annalisa (2020) “Embracing the Complexity: A Pilot Study on Interlingual Respeaking”, Journal of Audiovisual Translation, 3(2): 103-139. URL: https://jatjournal.org/index.php/jat/article/view/135/40 (accessed 10 April 2023).
Dawson, Hayley (2019) “Feasibility, quality and assessment of interlingual live subtitling: A pilot study”, Journal of Audiovisual Translation, 2(2), pp. 36–56,
URL: https://www.jatjournal.org/index.php/jat/article/view/72 (accessed 15 April 2023).
Dawson, Hayley (2020) Interlingual live subtitling – A research-informed training model for interlingual respeakers to improve access for a wide audience, PhD diss., University of Roehampton, UK.
Dawson, Hayley (2021) Exploring the quality of different live subtitling methods: a Spanish to English follow up case study, paper presented on 17th September 2021 at the 7th IATIS conference, Universitat Pompeu Fabra, ES.
Dawson, Hayley, Romero-Fresco, Pablo (2021) “Towards research-informed training in interlingual respeaking: an empirical approach”, The Interpreter and Translator Trainer, 15(1), pp. 66-84. URL: https://www.tandfonline.com/doi/full/10.1080/1750399X.2021.1880261 (accessed May 2023).
Eichmeyer-Hell, Daniela (forthcoming) WIRA: Model for qualitative assessment of a speech-to-text interpreting service, taking into account the user perspective, PhD diss., University of Graz, GER.
Eugeni, Carlo, Mack, Gabriele (2006) “Proceedings of the first international seminar on real-time intralingual subtitling”, inTRAlinea, Special issue. URL: http://www.intralinea.org/specials/article/Proceedings_of_the_First_International_Seminar_on_real-time_intralingual_su (accessed March 2024).
Eugeni, Carlo, Marchionne, Francesca (2014) “Beyond Computer Whispering: Intralingual and French into Italian TV Respeaking Compared”, Petillo, Mariacristina (ed.) Reflecting on Audiovisual Translation in the Third Millennium. Perspectives and Approaches to a Complex Art. Bucarest: Editura Institutul European.
Eugeni, Carlo (2017) “La sottotitolazione intralinguistica automatica: Valutare la qualità con IRA”, CoMe, Studi di Comunicazione e Mediazione linguistica e culturale, 2(1), pp. 102-113. URL: http://comejournal.com/wp-content/uploads/2019/06/8.-CoMe-II-1-2017.-Eugeni.pdf (accessed 10 March 2023).
Eugeni, Carlo (2019) Technology in court reporting – Capitalising on human-computer interaction, International Justice Congress Proceedings, Uluslararası Adalet Kongresi (UAK), 2-4 May 2019.
Eugeni, Carlo (2020) “Human-Computer Interaction in Diamesic Translation. Multilingual Live Subtitling” in Translation Studies and Information Technology – New Pathways for Researchers, Teachers and Professionals”, Dejica, Daniel, Eugeni, Carlo and Dejica-Carţiş, Anca (eds), Translation Studies Series Editura Politehnica, Politehnica University Timișoara, pp. 19-31.
Fantinuoli, Claudio, Prandi, Bianca (2021) Towards the evaluation of automatic simultaneous speech translation from a communicative perspective, in 18th International Conference on Spoken Language Translation Proceedings, Bangkok, Thailand, August 5-6, 2021. Association for Computational Linguistics, pp. 245-254.
Gottlieb, Henrik (2007) Multidimensional Translation: Semantics turned Semiotics, Copenhagen, Proceedings of the Marie Curie Euroconferences MuTra: Challenges of Multidimensional, Sandra, Nauert, Heidrun, Gerzymisch-Arbogast (eds), pp. 1-29.
Hewett, Thomas, Baecker, Ronald, Card, Stuart, Gasen, Jean, Perlman, Gary, Strong, Gary, Tremaine, Marilyn, Verplank, William (1992) ACM SIGCHI curricula for human-computer interaction, report of the ACM SIGCHI Curriculum Development Group, Broadway, ACM, URL: https://www.researchgate.net/publication/234823126_ACM_SIGCHI_curricula_for_human-computer_interaction (accessed 12 May 2023).
McIntyre, Dan, Moores, Zoe, Price, Hazel (2018) Respeaking Parliament: Using Insights from Linguistics to Improve the Speed and Quality of Live Parliamentary Subtitles, Huddersfield: Language Unlocked/Institute for Applied Linguistics.
Moores, Zoe (2020) “Fostering access for all through respeaking at live events”, JoSTrans, Journal of Specialised Translation, 33, URL: https://jostrans.org/issue33/art_moores.php (accessed 5 April 2023).
Pagano, Alice (2020a) “Verbatim vs. Edited Live Parliamentary Subtitling” in Translation Studies and Information Technology – New Pathways for Researchers, Teachers and Professionals, Dejica, Daniel, Eugeni, Carlo, Dejica-Carţiş, Anca (eds), Translation Studies Series Editura Politehnica, Politehnica University Timișoara, pp. 32-44.
Pagano, Alice (forthcoming) “Formación de intérpretes simultáneos para la accesibilidad: relato de una experiencia didáctica”, Aproximaciones teóricas y prácticas a la accesibilidad desde la traducción y la interpretación, Varela Salinas, María José and Plaza Lara, Cristina (eds) Editorial Comares, Granada.
Pagano, Alice (2022b) “Interlingual respeaking training for simultaneous interpreting trainees: new opportunities in Media Accessibility”, CoMe, Studi di Comunicazione e Mediazione linguistica e culturale, VI(1). URL: http://comejournal.com/wp-content/uploads/2022/11/2.-Pagano.pdf (accessed 31 May 2023).
Pagano, Alice (2022a) Testing quality in interlingual respeaking and other methods of interlingual live subtitling, PhD diss., Università degli Studi di Genova, Italy.
Robert, Isabel, Remael, Aline (2017) “Assessing quality in live interlingual subtitling: A new challenge”, Linguistica Antverpiensa, New Series: Themes in Translation Studies, 16, pp. 168-195, URL: https://lans-tts.uantwerpen.be/index.php/LANS-TTS/article/view/454 (accessed 15 May 2023).
Romero-Fresco, Pablo (2011) Subtitling through speech recognition: Respeaking, St. Jerome Publishing, Manchester.
Romero Fresco, Pablo (2012) “Respeaking in translator training curricula. Present and future prospects”, in The Interpreter and Translator Trainer, 6(1), pp. 91-112,
URL: https://www.tandfonline.com/doi/abs/10.1080/13556509.2012.10798831?journalCode=ritt20 (accessed 3 May 2023).
Romero-Fresco, Pablo, Martínez, Juan (2015) “Accuracy rate in live subtitling: The NER model”, Audiovisual Translation in a Global Context: Mapping an Ever-changing Landscape, Díaz-Cintas, Jorge and Baños, Rocío (eds), London & New York: Palgrave Macmillan, pp. 28-50.
Romero-Fresco, Pablo, Pöchhacker, Franz (2017) “Quality assessment in interlingual live subtitling: The NTR model”, Linguistica Antverpiensia, New Series: Themes in Translation Studies, 16, pp. 149-167,
URL: https://lans-tts.uantwerpen.be/index.php/LANS-TTS/article/view/438 (accessed 3 May 2023).
Romero-Fresco, Pablo (2018) “In support of a wide notion of Media Accessibility: Access to content and access to creation”, Journal of Audiovisual Translation, 1(1), pp. 187-204,
URL: http://www.jatjournal.org/index.php/jat/article/view/53/12 (accessed 16 April 2023).
Romero-Fresco, Pablo (2019b) “Respeaking: subtitling through speech recognition”, The Routledge Handbook of Audiovisual Translation, Perez-González, Luis (eds), Oxon & New York: Routledge, pp. 96-113.
Romero-Fresco, Pablo, Eugeni, Carlo (2020) “Live subtitling through respeaking”, Handbook of Audiovisual Translation and Media Accessibility, Bogucki, Łukasz, Deckert Mikołaj (eds), Palgrave MacMillan, pp. 269-295.
Romero-Fresco, Pablo, Alonso-Bacigalupe, Luis (2022) “An Empirical Analysis on the Efficiency of Five Interlingual Live Subtitling Workflows”, XLinguae, Special issue, Bogucki, Łukasz and Deckert Mikołaj (eds).
Russello, Claudio (2010) “Teaching respeaking to conference interpreters”, Intersteno, Education committee archive. URL: https://www.intersteno.it/materiale/ComitScientifico/EducationCommittee/Russello2010Teaching Respeaking to Conference Interpreters.pdf (accessed 25 May 2023).
Russello, Claudio (2013) “Aspetti didattici”, Respeaking. Specializzazione online, Eugeni, Carlo, Zambelli, Luigi (eds), pp. 44-53,
URL: https://www.accademia-aliprandi.it/public/specializzazione/respeaking.pdf. (accessed 12 April 2023).
Sandrelli, Annalisa (2020) “Interlingual respeaking and simultaneous interpreting in a conference setting: a comparison”, Technology in Interpreter Education and Practice. inTRAlinea, Special issue, Spinolo, Nicoletta and Amato, Amalia (eds), URL: https://www.intralinea.org/specials/article/2518 (accessed: 5 April 2023).
Szarkowska, Agnieszka, Krejtz, Krzysztof, Dutka, Łukasz, Pilipczuk, Olga (2018) “Are interpreters better respeakers?”, in The Interpreter and Translator Trainer, 12(2), pp. 207-226. URL: https://www.tandfonline.com/doi/full/10.1080/1750399X.2018.1465679 (accessed 27 May 2023).
Notes
[1] http://galmaobservatory.eu/projects/shaping-multilingual-access-through-respeaking-technology-smart/.
[2] To see the audio and video recordings authorization form, please refer to Alice Pagano (2022a), Appendix 9.
[3] The video of the live speech delivery can be retrieved at: https://www.youtube.com/watch?v=LiTJvHmFtbE.
[4] The video of the full speech delivery can be retrieved at: https://www.youtube.com/watch?v=iSzk2Sm4Fl4.
[5] The video of the full speech delivery can be retrieved at: https://webgate.ec.europa.eu/sr/speech/marco-jur%C3%ADdico-de-las-comunicaciones-electrónicas.
[7] For the full display of the error analysis, please refer to Pagano (2022a), Appendix 5.