Trump's latest call with a world leader has given everyone a case of impeachment fever. The summarized transcript has been released (link) and it's fairly easy to see why the fever is so contagious. However, this is a "rough" transcript based on what the notetaker recalled about the conversation. Now people are also getting worked up about what may be missing from it. One senator did a readout of the transcript and suggested that there may be 20 minutes of conversation missing (link). However, I think the only thing that's missing is an allowance for translation, which I'll show below.
import pandas as pd
import numpy as np
import urllib.request
First I set up a dictionary with information on each transcript. Dictionaries are one of my favourite things to use in Python and I like to throw them in when I can.
files = {'nieto': {'link': 'https://raw.githubusercontent.com/sampurkiss/Misc/master/Trump/Data/call%20with%20nieto.txt',
                   'date': 'January 27, 2017, FROM 9:35', 'length in mins': 53},
         'turnbull': {'link': 'https://raw.githubusercontent.com/sampurkiss/Misc/master/Trump/Data/call%20with%20turnbull.txt',
                      'date': 'January 28, 2017 5:05 PM', 'length in mins': 24},
         'zelensky': {'link': 'https://raw.githubusercontent.com/sampurkiss/Misc/master/Trump/Data/call%20with%20zelenskyy.txt',
                      'date': 'July 25, 2019, 9:03 PM', 'length in mins': 30}}
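One nice side effect is that any call's details can be looked up directly by leader name, for example:
# Look up a single call's metadata by leader name
print(files['zelensky']['length in mins'])  # 30
print(files['nieto']['date'])               # January 27, 2017, FROM 9:35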
The question I'm interested in is: what do we know about these calls? We know what the administration claims was said, and we know how long each conversation lasted. Two leaked transcripts were provided to the Washington Post some months ago, and they give us context for what a normal Trump conversation with a world leader looks like. These two can be used to get an idea of whether or not Trump's call with the Ukrainian president fits the pattern of a "normal" conversation.
I cleaned the call transcripts for analysis and pull them in below, adding a column that counts the number of words in each line.
transcript = pd.DataFrame()
for leader in files.keys():
    link = files[leader]['link']
    # Read the raw transcript line by line
    d = list()
    f = urllib.request.urlopen(link)
    for line in f:
        d.append(line.decode('latin-1'))
    # Skip the header line, then split each line into speaker and text.
    # n=1 splits on the first colon only, so times like "9:35" in the
    # dialogue don't truncate the text.
    temp = pd.DataFrame({'transcript': leader, 'lines': d[1:]})
    new = temp['lines'].str.split(':', n=1, expand=True)
    temp['speaker'] = new[0]
    temp['lines'] = new[1]
    transcript = pd.concat([transcript, temp])
# Standardize speaker names and count the words in each line
transcript['speaker'] = transcript['speaker'].str.replace('The President', 'TRUMP')
transcript['speaker'] = transcript['speaker'].str.replace('President Zelenskyy', 'ZELENSKY')
transcript['num of words'] = transcript['lines'].str.split().str.len()
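Before moving on, a quick peek at the parsed data makes a handy sanity check (output omitted here):
# Inspect the first few parsed rows: one row per spoken line
print(transcript.head())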
If you look at line 4 of the Nieto transcript, you'll notice that Nieto switches to Spanish. Nieto would presumably speak Spanish, a translator on Trump's side would translate to English, Trump would reply in English, and a translator on Nieto's side would translate to Spanish. As a result, the conversation would take about twice as long and use twice as many words. If we're going to compare the conversations, we have to account for the fact that every sentence is effectively spoken twice. This complicates things, but for simplicity I've assumed that all words in the Trump-Nieto call are doubled, which should be approximately correct. It also seems reasonable to expect that Zelensky spoke his native language and used a translator as well (which the Washington Post, at least, has confirmed).
To adjust for this, I simply double the number of words in the Nieto and Zelensky conversations.
# Double the word counts for the two translated calls
translated = transcript['transcript'].isin(['nieto', 'zelensky'])
transcript['num of words'] = np.where(translated,
                                      transcript['num of words'] * 2,
                                      transcript['num of words'])
The easiest way to see the differences is to compare the number of words used per minute. This should give us a sense of how chatty Trump and friends are.
# Total words per call, divided by call length, gives words per minute
words = transcript.groupby(by='transcript').sum(numeric_only=True)
words['words per min'] = None
for name in words.index:
    words.loc[name, 'words per min'] = (words.loc[name, 'num of words']
                                        / files[name]['length in mins'])
words['words per min'] = (words['words per min']
                          .astype(float).round(0).astype(int))
words.style
As you can see, all the calls clock in at about 130 words per minute. What's even more noticeable is that the Nieto and Zelensky transcripts, both of which required a translator, clock in at identical words per minute. Even if you remove the translation adjustment, the two remain identical to each other.
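To double-check that last point, here's a quick sketch that recomputes the rates from the raw text with no doubling; the Nieto and Zelensky figures should still match each other, at roughly half the adjusted rate.
# Recompute words per minute with no translation doubling (sketch)
raw = transcript.copy()
raw['num of words'] = raw['lines'].str.split().str.len()  # undoes the doubling
raw_totals = raw.groupby(by='transcript')['num of words'].sum()
for name in raw_totals.index:
    print(name, round(raw_totals[name] / files[name]['length in mins']))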
Another way to approach this is to look at the number of words used by each leader in each conversation to see if there are any differences.
# Same calculation, but broken out by speaker within each call
words_per_speaker = (transcript.groupby(by=['transcript', 'speaker'])
                     .sum(numeric_only=True).reset_index())
words_per_speaker['words per min'] = None
for name in words_per_speaker['transcript'].unique():
    words_per_speaker['words per min'] = np.where(words_per_speaker['transcript'] == name,
                                                  words_per_speaker['num of words'] / files[name]['length in mins'],
                                                  words_per_speaker['words per min'])
words_per_speaker['words per min'] = (words_per_speaker['words per min']
                                      .astype(float).round(0).astype(int))
words_per_speaker.style
Strangely, in the Nieto and Turnbull calls, Trump manages about 70 words per minute and the other world leaders squeeze in about 60. In the Zelensky call, however, Trump only manages 50 words per minute while Zelensky speaks 80. This indicates that, for some reason, Trump spoke about 30% less and Zelensky spoke about 30% more.
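For reference, the rough arithmetic behind those percentages (the 50/70/60/80 figures come from the table above):
# Implied changes relative to the Nieto/Turnbull baseline
trump_change = (50 - 70) / 70      # about -0.29, i.e. ~30% fewer words
zelensky_change = (80 - 60) / 60   # about +0.33, i.e. ~30% more words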
So we already know that the transcript isn't the full transcript; it's been edited down somehow. My main question is: did Trump really speak a lot less? Did Zelensky really speak a lot more? Or have things been modified to hide one or the other?
As a next step, I want to translate the English words into each leader's language to get a more accurate sense of how many words were actually used. I think if King et al. want to accurately assess what's missing, they should use a transcript that accounts for translation. The other thing I hope to do is use sentiment analysis to dig into how each conversation actually went. Unfortunately, transcripts of calls between Trump and world leaders are notoriously hard to get (for good reason, probably), so even though I'd love to run more ML analysis, the training set is probably a bit too small.
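For the sentiment idea, here's a minimal sketch of what I have in mind, assuming NLTK's VADER analyzer (any line-level sentiment scorer would work just as well):
# Score each line's sentiment, then average by call and speaker (sketch)
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time lexicon download
sia = SentimentIntensityAnalyzer()
transcript['sentiment'] = (transcript['lines']
                           .fillna('')
                           .apply(lambda line: sia.polarity_scores(line)['compound']))
print(transcript.groupby(by=['transcript', 'speaker'])['sentiment'].mean())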