calculate_alignment module

calculate_alignment.BuildSemanticModel(semantic_model_input_file, pretrained_input_file, output_file_directory, use_pretrained_vectors=True, high_sd_cutoff=3, low_n_cutoff=1, save_vocab_freqs=False)

Given an input file produced by the ALIGN Phase 1 functions, build a semantic model from all transcripts in all conversations in the target corpus after removing high- and low-frequency words. High-frequency words are determined by a user-defined number of SDs over the mean (by default, high_sd_cutoff=3). Low-frequency words must appear more than a specified number of raw occurrences (by default, low_n_cutoff=1).

Frequency cutoffs can be removed by setting high_sd_cutoff=None and/or low_n_cutoff=0.

If save_vocab_freqs=True, also save to output_file_directory a .txt list of all unique words and their frequency counts.
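
The two cutoffs can be sketched in a few lines (a minimal illustration under stated assumptions, not the package's implementation; the function name is hypothetical):

```python
import statistics
from collections import Counter

def apply_frequency_cutoffs(words, high_sd_cutoff=3, low_n_cutoff=1):
    """Drop words that are too frequent (more than mean + N*SD occurrences)
    or too rare (low_n_cutoff occurrences or fewer)."""
    counts = Counter(words)
    freqs = list(counts.values())
    mean, sd = statistics.mean(freqs), statistics.pstdev(freqs)
    kept = set()
    for word, n in counts.items():
        if n <= low_n_cutoff:
            continue  # low-frequency cutoff (raw count)
        if high_sd_cutoff is not None and n > mean + high_sd_cutoff * sd:
            continue  # high-frequency cutoff (SDs over the mean)
        kept.add(word)
    return kept
```

Passing high_sd_cutoff=None skips the high-frequency check entirely, and low_n_cutoff=0 keeps every word that occurs at least once, mirroring the behavior described above.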

calculate_alignment.ConvoByConvoAnalysis(dataframe, maxngram=2, ignore_duplicates=True, add_stanford_tags=False)

Calculate multilevel similarity over a conversation between two interlocutors from a transcript dataframe prepared by Phase 1 of ALIGN. Automatically detect speakers by unique speaker codes.

By default, include a maximum n-gram comparison of 2. If desired, this may be changed by passing the appropriate value to the maxngram argument.

By default, return scores based only on the Penn POS tagger. If desired, also return scores using the Stanford tagger with add_stanford_tags=True.

By default, remove exact duplicates when calculating POS similarity scores (i.e., do not consider perfectly mimicked lexical items between speakers). If desired, duplicates may be included when calculating scores by passing ignore_duplicates=False.

calculate_alignment.GenerateSurrogate(original_conversation_list, surrogate_file_directory, all_surrogates=True, keep_original_turn_order=True, id_separator='\\-', dyad_label='dyad', condition_label='cond')

Create transcripts for surrogate pairs of participants (i.e., participants who did not genuinely interact in the experiment), which will later be used to generate baseline levels of alignment. Store surrogate files in a new folder each time the surrogate generation is run.

Returns a list of all surrogate files created.

By default, the separator between dyad ID and condition ID is a hyphen ('-'). If desired, this may be changed in the id_separator argument.

By default, condition IDs will be identified as any characters following cond. If desired, this may be changed with the condition_label argument.

By default, dyad IDs will be identified as any characters following dyad. If desired, this may be changed with the dyad_label argument.

By default, generate surrogates from all possible pairings. If desired, instead generate surrogates only from a subset of all possible pairings with all_surrogates=False.

By default, create surrogates by retaining the original ordering of each surrogate partner's data. If desired, create surrogates by shuffling all turns within each surrogate partner's data with keep_original_turn_order=False.
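
The all-pairings idea can be sketched as follows (a hypothetical helper over speaker labels only; the real function reads transcript files, respects condition groupings, and writes surrogate transcripts to disk):

```python
from itertools import combinations

def surrogate_pairings(dyads):
    """Pair speaker A from one real dyad with speaker B from another.

    `dyads` maps a dyad ID to its (speakerA, speakerB) labels; every
    cross-dyad combination yields two surrogate pairs, one in each
    direction, so no surrogate pair reproduces a real conversation."""
    pairs = []
    for d1, d2 in combinations(sorted(dyads), 2):
        pairs.append((dyads[d1][0], dyads[d2][1]))  # A from d1, B from d2
        pairs.append((dyads[d2][0], dyads[d1][1]))  # A from d2, B from d1
    return pairs
```

With all_surrogates=False, the package instead samples a subset of these pairings equal to the real sample size.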

calculate_alignment.LexicalPOSAlignment(tok1, lem1, penn_tok1, penn_lem1, tok2, lem2, penn_tok2, penn_lem2, stan_tok1=None, stan_lem1=None, stan_tok2=None, stan_lem2=None, maxngram=2, ignore_duplicates=True, add_stanford_tags=False)

Derive lexical and part-of-speech alignment scores between interlocutors (suffix 1 and 2 in arguments passed to function).

By default, return scores based only on the Penn POS tagger. If desired, also return scores using the Stanford tagger with add_stanford_tags=True and by providing appropriate values for stan_tok1, stan_lem1, stan_tok2, and stan_lem2.

By default, consider only bigrams when calculating similarity. If desired, this window may be expanded by changing the maxngram argument value.

By default, remove exact duplicates when calculating similarity scores (i.e., do not consider perfectly mimicked lexical items between speakers). If desired, duplicates may be included when calculating scores by passing ignore_duplicates=False.

calculate_alignment.TurnByTurnAnalysis(dataframe, vocablist, highDimModel, delay=1, maxngram=2, add_stanford_tags=False, ignore_duplicates=True)

Calculate lexical, syntactic, and conceptual alignment between interlocutors over an entire conversation. Automatically detect individual speakers by unique speaker codes.

By default, compare only adjacent turns. If desired, the comparison distance may be changed by increasing the delay argument.
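
The delay argument simply widens the gap between compared turns; a sketch of the pairing logic (a hypothetical helper, not the package's code):

```python
def turn_pairs(turns, delay=1):
    """Pair each turn with the turn `delay` positions later.

    delay=1 yields adjacent-turn comparisons; larger values compare
    turns separated by more intervening turns."""
    return list(zip(turns, turns[delay:]))
```

Each resulting (earlier, later) pair is then scored for lexical, syntactic, and conceptual similarity.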

By default, include a maximum n-gram comparison of 2. If desired, this may be changed by passing the appropriate value to the maxngram argument.

By default, return scores based only on the Penn POS tagger. If desired, also return scores using the Stanford tagger with add_stanford_tags=True.

By default, remove exact duplicates when calculating POS similarity scores (i.e., do not consider perfectly mimicked lexical items between speakers). If desired, duplicates may be included when calculating scores by passing ignore_duplicates=False.

calculate_alignment.build_composite_semantic_vector(lemma_seq, vocablist, highDimModel)

Build a composite semantic vector for a sequence of lemmas from the vocabulary list and high-dimensional semantic model; called in the main loop.
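
Assuming the composite vector is the average of the in-vocabulary word vectors (a common choice, and an assumption here rather than a confirmed detail), the core step can be sketched with a plain dict standing in for the gensim model:

```python
def composite_vector(lemmas, vocablist, model):
    """Average the vectors of lemmas that survive the vocabulary filter
    and exist in the semantic model; return None if none do."""
    vecs = [model[w] for w in lemmas if w in vocablist and w in model]
    if not vecs:
        return None
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]
```

The None case corresponds to the NaN scores described under the Returns sections below: a turn whose words were all removed from the corpus or absent from the model has no composite vector to compare.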

calculate_alignment.calculate_alignment(input_files, output_file_directory, semantic_model_input_file, pretrained_input_file, high_sd_cutoff=3, low_n_cutoff=1, delay=1, maxngram=2, use_pretrained_vectors=True, ignore_duplicates=True, add_stanford_tags=False, input_as_directory=True, save_vocab_freqs=False)

Calculate lexical, syntactic, and conceptual alignment between speakers.

Given a directory of individual .txt files and the vocabulary list that have been generated by the prepare_transcripts preparation stage, return multi-level alignment scores with turn-by-turn and conversation-level metrics.

Parameters:
  • input_files (str (directory name) or list of str (file names)) – Cleaned files to be analyzed. Behavior governed by input_as_directory parameter as well.

  • output_file_directory (str) – Name of directory where output for individual conversations will be saved.

  • semantic_model_input_file (str) – Name of file to be used for creating the semantic model. A compatible file will be saved as an output of prepare_transcripts().

  • pretrained_input_file (str or None) – If using a pretrained vector to create the semantic model, use name of model here. If not, use None. Behavior governed by use_pretrained_vectors parameter as well.

  • high_sd_cutoff (int, optional (default: 3)) – High-frequency cutoff (in SD over the mean) for lexical items when creating the semantic model.

  • low_n_cutoff (int, optional (default: 1)) – Low-frequency cutoff (in raw frequency) for lexical items when creating the semantic models. Items with frequency less than or equal to the number provided here will be removed. To remove the low-frequency cutoff, set to 0.

  • delay (int, optional (default: 1)) – Delay (or lag) at which to calculate similarity. A lag of 1 (default) considers only adjacent turns.

  • maxngram (int, optional (default: 2)) – Maximum n-gram size for calculations. Similarity scores for n-grams from unigrams to the maximum size specified here will be calculated.

  • use_pretrained_vectors (boolean, optional (default: True)) – Specify whether to use a pretrained gensim model for word2vec analysis (True) or to construct a new model from the provided corpus (False). If True, the file name of a valid model must be provided to the pretrained_input_file parameter.

  • ignore_duplicates (boolean, optional (default: True)) – Specify whether to remove exact duplicates when calculating part-of-speech similarity scores (True) or to retain perfectly mimicked lexical items for POS similarity calculation (False).

  • add_stanford_tags (boolean, optional (default: False)) – Specify whether to return part-of-speech similarity scores based on Stanford POS tagger in addition to the Penn POS tagger (True) or to return only POS similarity scores from the Penn tagger (False). (Note: Including Stanford POS tags will lead to a significant increase in processing time.)

  • input_as_directory (boolean, optional (default: True)) – Specify whether the value passed to input_files parameter should be read as a directory (True) or a list of files to be processed (False).

  • save_vocab_freqs (boolean, optional (default: False)) – Specify whether to save a .txt file to output_file_directory that contains a list of all unique words in the corpus with frequency counts.

Returns:

  • real_final_turn_df (Pandas DataFrame) – A dataframe of lexical, syntactic, and conceptual alignment scores between turns at specified delay. NaN values will be returned for turns in which the speaker only produced words that were removed from the corpus (e.g., too rare or too common words) or words that were present in the corpus but not in the semantic model.

  • real_final_convo_df (Pandas DataFrame) – A dataframe of lexical, syntactic, and conceptual alignment scores between participants across the entire conversation.

calculate_alignment.calculate_baseline_alignment(input_files, surrogate_file_directory, output_file_directory, semantic_model_input_file, pretrained_input_file, high_sd_cutoff=3, low_n_cutoff=1, id_separator='\\-', condition_label='cond', dyad_label='dyad', all_surrogates=True, keep_original_turn_order=True, delay=1, maxngram=2, use_pretrained_vectors=True, ignore_duplicates=True, add_stanford_tags=False, input_as_directory=True, save_vocab_freqs=False)

Calculate baselines for lexical, syntactic, and conceptual alignment between speakers.

Given a directory of individual .txt files and the vocab list that have been generated by the prepare_transcripts preparation stage, return multi-level alignment scores with turn-by-turn and conversation-level metrics for surrogate baseline conversations.

Parameters:
  • input_files (str (directory name) or list of str (file names)) – Cleaned files to be analyzed. Behavior governed by input_as_directory parameter as well.

  • surrogate_file_directory (str) – Name of directory where raw surrogate data will be saved.

  • output_file_directory (str) – Name of directory where output for individual surrogate conversations will be saved.

  • semantic_model_input_file (str) – Name of file to be used for creating the semantic model. A compatible file will be saved as an output of prepare_transcripts().

  • pretrained_input_file (str or None) – If using a pretrained vector to create the semantic model, use name of model here. If not, use None. Behavior governed by use_pretrained_vectors parameter as well.

  • high_sd_cutoff (int, optional (default: 3)) – High-frequency cutoff (in SD over the mean) for lexical items when creating the semantic model.

  • low_n_cutoff (int, optional (default: 1)) – Low-frequency cutoff (in raw frequency) for lexical items when creating the semantic models. Items with frequency less than or equal to the number provided here will be removed. To remove the low-frequency cutoff, set to 0.

  • id_separator (str, optional (default: '-')) – Character separator between the dyad and condition IDs in original data file names.

  • condition_label (str, optional (default: 'cond')) – String preceding ID for each unique condition. Anything after this label will be identified as a unique condition ID.

  • dyad_label (str, optional (default: 'dyad')) – String preceding ID for each unique dyad. Anything after this label will be identified as a unique dyad ID.

  • all_surrogates (boolean, optional (default: True)) – Specify whether to generate all possible surrogates across original dataset (True) or to generate only a subset of surrogates equal to the real sample size drawn randomly from all possible surrogates (False).

  • keep_original_turn_order (boolean, optional (default: True)) – Specify whether to retain original turn ordering when pairing surrogate dyads (True) or to pair surrogate partners’ turns in random order (False).

  • delay (int, optional (default: 1)) – Delay (or lag) at which to calculate similarity. A lag of 1 (default) considers only adjacent turns.

  • maxngram (int, optional (default: 2)) – Maximum n-gram size for calculations. Similarity scores for n-grams from unigrams to the maximum size specified here will be calculated.

  • use_pretrained_vectors (boolean, optional (default: True)) – Specify whether to use a pretrained gensim model for word2vec analysis. If True, the file name of a valid model must be provided to the pretrained_input_file parameter.

  • ignore_duplicates (boolean, optional (default: True)) – Specify whether to remove exact duplicates when calculating part-of-speech similarity scores. By default, ignore perfectly mimicked lexical items for POS similarity calculation.

  • add_stanford_tags (boolean, optional (default: False)) – Specify whether to return part-of-speech similarity scores based on Stanford POS tagger (in addition to the Penn POS tagger).

  • input_as_directory (boolean, optional (default: True)) – Specify whether the value passed to input_files parameter should be read as a directory or a list of files to be processed.

  • save_vocab_freqs (boolean, optional (default: False)) – Specify whether to save a .txt file to output_file_directory that contains a list of all unique words in the corpus with frequency counts.

Returns:

  • surrogate_final_turn_df (Pandas DataFrame) – A dataframe of lexical, syntactic, and conceptual alignment scores between turns at specified delay for surrogate partners. NaN values will be returned for turns in which the speaker only produced words that were removed from the corpus (e.g., too rare or too common words) or words that were present in the corpus but not in the semantic model.

  • surrogate_final_convo_df (Pandas DataFrame) – A dataframe of lexical, syntactic, and conceptual alignment scores between surrogate partners across the entire conversation.

calculate_alignment.conceptualAlignment(lem1, lem2, vocablist, highDimModel)

Calculate conceptual alignment scores from lists of lemmas from two interlocutors (suffix 1 and 2 in arguments passed to function) using word2vec.

calculate_alignment.get_cosine(vec1, vec2)

Derive the standard cosine similarity metric. Adapted from <https://stackoverflow.com/a/33129724>.
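
The linked answer computes cosine similarity over sparse count vectors represented as dicts (e.g., collections.Counter objects); a self-contained sketch of that pattern:

```python
import math

def cosine(vec1, vec2):
    """Cosine similarity between two sparse {term: count} dicts.

    Only terms shared by both vectors contribute to the dot product;
    a zero denominator (an empty vector) yields 0.0."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[t] * vec2[t] for t in shared)
    denominator = (math.sqrt(sum(v * v for v in vec1.values()))
                   * math.sqrt(sum(v * v for v in vec2.values())))
    return numerator / denominator if denominator else 0.0
```

This is the similarity measure applied to the n-gram count dictionaries produced by the ngram functions below.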

calculate_alignment.ngram_lexical(sequence1, sequence2, ngramsize=2)

Create ngrams of the desired size for each of two interlocutors’ sequences and return a dictionary of counts of ngrams for each sequence.

By default, consider bigrams. If desired, this may be changed by setting ngramsize to the appropriate value.
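
A sketch of the counting step with collections.Counter (a hypothetical helper for a single sequence; the real function handles both interlocutors' sequences at once):

```python
from collections import Counter

def ngram_counts(sequence, ngramsize=2):
    """Count contiguous n-grams of the given size in a token sequence."""
    return Counter(tuple(sequence[i:i + ngramsize])
                   for i in range(len(sequence) - ngramsize + 1))
```

Sequences shorter than ngramsize simply yield an empty Counter.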

calculate_alignment.ngram_pos(sequence1, sequence2, ngramsize=2, ignore_duplicates=True)

Remove mimicked lexical sequences from two interlocutors’ sequences and return a dictionary of counts of ngrams of the desired size for each sequence.

By default, consider bigrams. If desired, this may be changed by setting ngramsize to the appropriate value.

By default, ignore duplicate lexical n-grams when processing these sequences. If desired, this may be changed with ignore_duplicates=False.
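
The duplicate-removal idea can be sketched as dropping POS n-grams whose lexical form occurs verbatim in both speakers' sequences (a minimal illustration with assumed parallel token/tag lists, not the package's code):

```python
from collections import Counter

def pos_ngram_counts(tokens1, tags1, tokens2, tags2, ngramsize=2,
                     ignore_duplicates=True):
    """Count each speaker's POS n-grams, optionally dropping those whose
    underlying lexical n-gram appears in both speakers' token sequences."""
    def ngrams(seq):
        return [tuple(seq[i:i + ngramsize])
                for i in range(len(seq) - ngramsize + 1)]
    lex1, lex2 = ngrams(tokens1), ngrams(tokens2)
    pos1, pos2 = ngrams(tags1), ngrams(tags2)
    if ignore_duplicates:
        shared = set(lex1) & set(lex2)  # perfectly mimicked lexical n-grams
        pos1 = [p for l, p in zip(lex1, pos1) if l not in shared]
        pos2 = [p for l, p in zip(lex2, pos2) if l not in shared]
    return Counter(pos1), Counter(pos2)
```

This way, POS similarity reflects shared syntactic structure rather than verbatim repetition of the partner's words.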

calculate_alignment.returnMultilevelAlignment(cond_info, partnerA, tok1, lem1, penn_tok1, penn_lem1, partnerB, tok2, lem2, penn_tok2, penn_lem2, vocablist, highDimModel, stan_tok1=None, stan_lem1=None, stan_tok2=None, stan_lem2=None, add_stanford_tags=False, maxngram=2, ignore_duplicates=True)

Calculate lexical, syntactic, and conceptual alignment between a pair of turns by individual interlocutors (suffix 1 and 2 in arguments passed to function), including leading/following comparison directionality.

By default, return scores based only on the Penn POS tagger. If desired, also return scores using the Stanford tagger with add_stanford_tags=True and by providing appropriate values for stan_tok1, stan_lem1, stan_tok2, and stan_lem2.

By default, consider only bigrams when calculating similarity. If desired, this window may be expanded by changing the maxngram argument value.

By default, remove exact duplicates when calculating similarity scores (i.e., do not consider perfectly mimicked lexical items between speakers). If desired, duplicates may be included when calculating scores by passing ignore_duplicates=False.