Marcadores Discursivos

A blast from 4 years in the past - in this post I revisit my senior-year Spanish Linguistics project with a modern twist: combining OpenAI’s API and the Julia language.
Julia
Python
R
Spanish
Linguistics
Author
Affiliation

Data Analyst at CollegeVine

Published

November 8, 2023

Introduction

In college, I chose to pursue a double major in Mathematics and Spanish–a combination which, to this day, still prompts questions like “why…?” Though these two subjects seem far apart, I was able to link them during my senior year when I took a Spanish Linguistics class.

Linguistics is defined by the University at Buffalo as: “the study of language, and its focus is the systematic investigation of the properties of particular languages as well as the characteristics of language in general.” Our final project in the class was to contribute to the Virginia Corpus of Spanish Variation, intended to enable the general public to explore Spanish language variation in the state of Virginia.

I contributed to the project by interviewing participants, transcribing those interviews, and ultimately presenting a research project on marcadores discursivos as my final project for the class. My research focused specifically on the use of words like “uhm”, “este”, and others: how participants’ origins and Spanish experience influenced the presence and use of these discourse markers in their speech, and whether any one discourse marker was used more by Spanish speakers born outside the United States.

[Figure: an example of one of my plots, generated with R]

Turning text to data with Python

This project necessitated turning these interview transcriptions into tabular data that we could then analyze. For some, depending on the subject of their research, this involved endless CTRL + Fs and manually transcribing this data into Excel files. Because I had exposure to data analysis and data manipulation through my stats classes and my time outside of the classroom learning R, I knew how I wanted these tables to look for analysis. I also knew that I didn’t want to spend hours of my life manually finding these words.

Basically, I wanted to programmatically construct a dataframe which contained:

  • the word I was looking for
  • all of the information about an interview subject (their age, sex, etc.)
  • the word before and after the word that was being searched for
  • the phrase/sentence before, essentially the “context” in which the word was said so I could determine whether it was in fact a discourse marker

I ended up writing a Python script (in fact, one of the first I ever wrote!) that parsed interviews into dataframes of the format I envisioned, which made them easy to analyze in R (or Excel). I even got others in my class to use the script and cite it in their presentations! I did run the script for them, but still…

The main downside of the script, in my view, was that it did require some manual intervention up front. Specifically, it required me to manually remove interviewer questions from the transcripts to ensure it only captured responses from the interviewee.

An additional downside was that the script was not context-aware, so for some words I asked it to find like “uh”, it instead found instances of “uhm” and suggested that the word/sentence following the use of “uh” was “m”, which was obviously not what I was looking for.

What if I had GPT for this project?

Recently, as I’ve been playing with the OpenAI API for work, I had the idea of revisiting this project. After all, LLMs are great at parsing text, and this project is all about parsing text.

My theory is that we can build a more powerful, streamlined system for turning these interviews into data: first by identifying interviewer prompts and respondent answers in code (rather than manually, as I did a few years ago), and then by directing GPT to produce a similar, context-aware, machine-readable output using some clever prompt engineering.

My Julia Project Setup

I decided to approach this problem with Julia–and to keep it simple (and challenging at the same time!), I tried to use as few external dependencies as possible. I’ve published all the code used here so that I can keep this post focused on the process rather than the code.

For this small project, my setup was:

  • two modules containing helper functions for OpenAI’s API and the corpus interviews
  • one top-level script, which I named main.jl
  • the prompt itself, contained in a separate Julia script and defined as a const.

Parsing Interviews

In the corpus-helpers.jl file, there are a few functions that help with reading these interviews. I pulled one interview (the one I just so happened to take part in) from the corpus as a PDF. My one manual intervention here was to export this as a .txt file.

The read_interview function reads the text file as a whole, then splits it up based on newlines. It also filters out blank rows.
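A minimal sketch of what read_interview might look like (the function name comes from corpus-helpers.jl; the body here is my assumption, not the published code):

```julia
# Read the whole file, split on newlines, strip whitespace, drop blank rows
function read_interview(path::AbstractString)
    lines = split(read(path, String), '\n')
    return filter(!isempty, strip.(lines))
end
```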

The parse_lines_to_q_a function is intended to capture both interviewer questions and participant answers as separate vectors of strings. The function basically looks through every element of the vector of strings we created with read_interview, checks whether the line starts with “participante” or “entrevistador”, and then pushes those lines into separate vectors for questions and answers.
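A hedged sketch of that logic, assuming every line starts with the speaker’s role (e.g. “entrevistador: …” or “participante: …”); the function name matches the post, but the body is my reconstruction:

```julia
# Split interview lines into interviewer questions and participant answers
function parse_lines_to_q_a(lines)
    questions, answers = String[], String[]
    for line in lines
        lower = lowercase(line)
        if startswith(lower, "entrevistador")
            push!(questions, String(line))
        elseif startswith(lower, "participante")
            push!(answers, String(line))
        end
    end
    return questions, answers
end
```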

Warning

This function is a pretty naive attempt that assumes the interview follows a specific format. In a real production setting, we’d want to test this extensively with different formats to ensure it worked!

The format that we’ll provide the OpenAI API is just the combined answers as a single string.
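Combining the answers into one string might look like this (the “participante:” prefix format is an assumption based on the parsing step above):

```julia
answers = ["participante: me gusta el anime", "participante: este, juego videojuegos"]

# Strip the speaker prefix and join everything into a single string for the API
payload = join([replace(a, r"^participante:?\s*"i => "") for a in answers], " ")
# payload == "me gusta el anime este, juego videojuegos"
```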

OpenAI Helpers

We can access the OpenAI API through a simple HTTP request, meaning that a helper function like this can do a lot of heavy lifting:

using HTTP
using JSON3

function get_gpt_chat_completion(messages; model = "gpt-3.5-turbo", temperature = 0.7)
    r = HTTP.request(
        "POST",
        "https://api.openai.com/v1/chat/completions",
        [
            "Content-Type" => "application/json",
            "Authorization" => "Bearer $(ENV["OPEN_AI_KEY"])",
        ],
        # Serialize the whole payload rather than interpolating `messages`
        # into a string template, so the body is always valid JSON
        JSON3.write(Dict(
            "model" => model,
            "messages" => messages,
            "temperature" => temperature,
        )),
    )

    return JSON3.read(String(r.body))
end

Note that I’m using an environment variable (via ENV) to hold my OpenAI key. Please don’t put that in your code in plaintext :)

I won’t walk through the whole process of setting up requests to the API; you can read OpenAI’s docs for that.

The Prompt

The prompt is perhaps the most important part of this project. As I mentioned before, we want to return something which is machine-readable and could be incorporated into a production setting. An easy format for this is JSON, which plays nicely with a lot of systems and is easily parseable in Julia to a DataFrame.

I provide the prompt to Julia as a const. As I mentioned in a prior blog post, I give GPT a role to play and outline what’s expected of it. The first part of the prompt looks like this:

Your role is to identify 'marcadores discursivos' in interview transcriptions.

These markers may include specific words and phrases such as 'este' and its variations, 'uh,' 'um,' 'uhhh,' 'entonces,' and 'eh.'

Your task is to provide the following information for each instance of these markers in a machine-readable JSON format:
- instance number, a running count of the number of times the marcador has appeared in the interview
- word, the word you're looking for
- word_before, the word that appears before the one in question
- word_after, the word that appears after the one in question
- phrase_before, the sentence or phrase that appears before the one in question
- phrase_after, the sentence or phrase that appears after the one in question

After establishing context up front, I provide some examples of what I expect back.

An example of the JSON format:
{
"instances":
    [
        {
            "instance_no": 1,
            "marcador": "uh",
            "word_before": "este",
            "word_after": "parte",
            "phrase_before": "que mi favorita es este",
            "phrase_after": "del mar en cual"
        },
        {
            "instance_no": 2,
            "marcador": "uh",
            "word_before": "pintor",
            "word_after": "una",
            "phrase_before": "yo pienso de este pintor",
            "phrase_after": "una artista que"
        },
        {
            "instance_no": 1,
            "marcador": "este",
            "word_before": "con",
            "word_after": "hombre",
            "phrase_before": "estaba hablando con",
            "phrase_after": "hombre que me llamó"
        }
    ]
}

I provide some additional instructions below this, things like “make sure word_before and word_after are only one word” or “remove stray unescaped characters in the JSON response.”
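Putting this together in Julia might look like the sketch below. The const name and the user-message contents are my assumptions; the real prompt lives in its own file:

```julia
# Hypothetical const holding the (abbreviated) prompt
const MARCADORES_PROMPT = """
Your role is to identify 'marcadores discursivos' in interview transcriptions.
...
"""

# Chat-completion messages: the prompt as the system turn,
# the combined participant answers as the user turn
messages = [
    Dict("role" => "system", "content" => MARCADORES_PROMPT),
    Dict("role" => "user", "content" => "me gusta el anime este, juego videojuegos"),
]
```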

Reading in the response

The function I provided above returns the body of GPT’s chat completion as a JSON object in Julia. This means that turning the JSON outlined above into a DataFrame looks something like:

JSON3.read(resp_b[:choices][1][:message][:content])[:instances] |> DataFrame

Which, of course, I wrapped into a helper function included in the open-ai-helpers.jl file!
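A hedged sketch of what that helper might look like (the function name is mine; the published open-ai-helpers.jl may differ):

```julia
using DataFrames, JSON3

# Pull GPT's message content out of the response body and
# parse its "instances" array into a DataFrame
function completion_to_dataframe(resp_b)
    content = resp_b[:choices][1][:message][:content]
    return DataFrame(JSON3.read(content)[:instances])
end
```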

Our participant information is contained in a struct called CorpusParticipant. I designed the struct so that we can optionally call a function which creates a 1-row DataFrame containing the subject’s information, like this:

participant_1 = CorpusParticipant(
    ID = "MONO-016",
    Age = 21,
    Sex = "M",
    BirthPlace = "Lima, Peru",
    Studies = "university-college",
    Job = "skilled-worker",
    YearsInUS = 5,
    YearsInVirginia = 5,
    SpeakerType = "first-generation"
)

participant_row = generate_participant_row(participant_1)
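Behind the scenes, the struct might be defined like this. This is a hypothetical reconstruction (the published code may differ): Base.@kwdef provides the keyword constructor, and I’ve written generate_participant_row as a standalone function.

```julia
using DataFrames

Base.@kwdef struct CorpusParticipant
    ID::String
    Age::Int
    Sex::String
    BirthPlace::String
    Studies::String
    Job::String
    YearsInUS::Int
    YearsInVirginia::Int
    SpeakerType::String
end

# Build a 1-row DataFrame from a participant's fields
function generate_participant_row(p::CorpusParticipant)
    cols = (f => [getfield(p, f)] for f in fieldnames(CorpusParticipant))
    return DataFrame(; cols...)
end
```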

We can then take this 1 row DataFrame and cross-join it to the one that’s generated from the API response.
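The cross-join step, sketched with dummy frames standing in for the real participant row and GPT output:

```julia
using DataFrames

participant_row = DataFrame(ID = ["MONO-016"], Age = [21])
markers = DataFrame(marcador = ["este", "uh"], instance_no = [1, 1])

# Every marker instance gets the participant's info attached
result = crossjoin(participant_row, markers)  # 2 rows, 4 columns
```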

I find this to be an improvement on my Python script because we can explicitly define participant information this way rather than hiding it in a function argument–which makes it easy to forget to fill it out or update it between participants.

The result is extremely similar to the results from my Python script from a few years back.

How does GPT handle this task?

GPT-4 does alright at what the prompt asks of it, with some exceptions. Here’s a snippet of what it returned:

GPT-4’s attempt at ID’ing marcadores discursivos
| marcador | instance_no | word_before | word_after | phrase_before | phrase_after |
|----------|-------------|-------------|------------|---------------|--------------|
| este | 1 | yo, | tengo | En mi tiempo libre yo | tengo estos munecos chiqititos de plastico. |
| este | 2 | yo | para | y me gusta yo | para los que no saben, para los que no han escuchado vienen |
| este | 3 | en | en | para los que no han escuchado vienen | para los que no saben, para los que no han escuchado vienen |
| uhm | 1 | cuarto | juegos | veo anime con mis companeros el cuarto | juegos o anime? De juegos, estoy […] del tema de […] League of Legends con mis amigos que viene italia, yeah. |

Sometimes, GPT would return more than one word for the word_before and word_after fields even though I explicitly asked it not to! For example, I don’t think “enorme todavia esta construyendo” is a single word… My intuition here is that there’s an issue with the transcription’s formatting rather than with GPT, but I could be wrong.

I also could have been more restrictive about what defines a “phrase”, as some of the phrases GPT returned were pretty long. This is a tolerable problem, though, since these phrases are included so I can determine whether the word is a true marcador discursivo (e.g. did the participant pause and say “este” while thinking, or did they use “este” in a context like “este parte”).

Note

A fun idea here may be to feed GPT the responses it already generated, and ask it to classify the context the word was used in to determine whether it was a true marcador discursivo!

Wrapping Up

If I had GPT in college, I definitely would have let it take a swing at this project in a less formalized way. And by that, I mean I would have used ChatGPT rather than the API and just dumped the text in. Despite the issues I called out above, I feel pretty confident that the results would have been quite similar or even better than my original Python script.

I hope that this post has provided a clear idea of what a text analysis system combining Julia and GPT looks like. Of course, larger-scale operations would require a more formalized system of ingestion and analysis (probably in Python)–but for research and playing around, the provided code should give you a good idea of how you might apply GPT to your own linguistics analysis.

As always, if this post was interesting or helpful to you in any way, I’d love to hear from you!