Zero-shot classification with Hugging Face

R
Python
Hugging Face
Transformers
Zero-shot classification
Topic modeling
Author

Brian Leung

Published

February 23, 2024

Why Hugging Face?

Advances in natural language processing (NLP), particularly with the advent of large language models (LLMs), have created exciting opportunities for social science researchers to treat large amounts of text as data. But numerous barriers to entry remain: the knowledge, data, and computational resources required to train and fine-tune models for specific tasks can be daunting for us.

So, there is a gap between the NLP models and resources available out there and what we as social scientists can reasonably digest and incorporate into our workflow. Researchers with a technical comparative advantage in training and fine-tuning models have already produced resources with immense potential for social science applications.

For example, PoliBERTweet is a pre-trained BERT model – a transformer-based model, much like its cousin GPT (“Generative Pre-trained Transformer”). It is pre-trained in the sense that it was trained on 83 million politics-related Tweets, making it suitable for a wide range of downstream, domain-specific tasks related to politics. But the question is: how can we as social scientists take advantage of such readily available resources?

This is where Hugging Face comes into play. Much like GitHub, it is a community platform that allows practitioners and researchers to host and collaborate on AI models. Many state-of-the-art NLP models are available for specific downstream tasks, like text classification (e.g., for sentiment analysis or topic classification) or embedding documents to compare their similarity.

Most importantly, it comes with a Python package – transformers – that makes downloading and running those pre-trained models easy and dramatically lowers the entry cost. But it does require some knowledge of Python.

How to get started as an R user?

In this post, I want to develop a workflow that centers on an R environment (e.g., writing a .rmd/.qmd, or wrangling data with tidyverse) that feels familiar to us, but one that incorporates the power of Python packages like transformers only when we need them.

I can’t tell you how much the fear and discomfort from an interrupted workflow – switching from one language to a less-familiar one, and transporting objects between different interfaces – have discouraged people (myself included) from taking advantage of Python.

Hopefully, an integrated workflow that makes R and Python interoperable will remove the last barrier to entry to unleash the power of NLP in our research.

Setting up Python in R with reticulate

First, let’s set up a virtual environment and install the required Python packages – particularly transformers – via the reticulate package in R:

library(reticulate)

virtualenv_create("r-reticulate")
#> virtualenv: r-reticulate

packages <- c("transformers==4.37.2", "tensorflow", "torch", "datasets", "pandas")
virtualenv_install("r-reticulate", packages)
#> Using virtual environment 'r-reticulate' ...

# tell reticulate to use this virtual environment before running any Python code
use_virtualenv("r-reticulate")

If this is your first time installing these packages, it might take a while, as they are quite large.

Basic text classification with transformers

To check that you have installed the packages and selected the correct Python interpreter, run the following code to import pipeline, the key function from transformers:

from transformers import pipeline
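
If the import succeeds, you are set. As an optional sanity check – a minimal sketch using standard attributes of sys and transformers – you can confirm which interpreter and package version reticulate picked up:

import sys
import transformers

print(sys.executable)            # should point inside the r-reticulate virtualenv
print(transformers.__version__)  # should match the pinned version, 4.37.2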

Now we can take advantage of pre-trained models on Hugging Face and perform text analyses in just a few lines of code. But you must first define the language task you want to perform and select a corresponding model. For example, I can perform sentiment analysis on a text by running:

classifier = pipeline(task="sentiment-analysis")
text = "This blog post is not unhelpful"
output = classifier(text)
print(output)
#> [{'label': 'POSITIVE', 'score': 0.7062997817993164}]

The sentiment classifier assigns a positive label to my double-negative sentence, which is reasonable. More generally, in pipeline(...), you should declare both the task (e.g., “sentiment-analysis”) and the model. Because I didn’t specify a model above, the pipeline fell back on the default, distilbert/distilbert-base-uncased-finetuned-sst-2-english – relying on the default is not a recommended practice. You can browse Hugging Face for models suited to your particular NLP task. Be aware that NLP models tend to be quite large (several gigabytes), so the first download can take a while.
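
For reproducibility, it is better to pin the task and the model explicitly – a minimal sketch, naming the same default model as above:

classifier = pipeline(
  task="sentiment-analysis",
  model="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
)
print(classifier("This blog post is not unhelpful"))  # same result as before, but now explicit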

Classifying political stances with transformers

The following section showcases a DeBERTa-based model trained for stance detection, first by Laurer et al. and further improved on by Michael Burnham. Behind the model is an interesting literature on natural language inference (NLI), or textual entailment. This approach is suitable for detecting political or issue stances in text in a zero-shot setting (i.e., the model can make predictions on arbitrary labels it wasn’t trained on but that we care about).

To perform political stance detection:

# define pipeline and model
zeroshot_classifier = pipeline("zero-shot-classification", model = "mlburnham/deberta-v3-large-polistance-affect-v1.0")

# specify the text
text = "Many American jobs are shipped to Chinese factories."

# specify the hypothesis and classes 
hypothesis_template = "This text supports trading {} with China"
classes_verbalized = ["more", "less"]

# execute the pipeline 
output = zeroshot_classifier(text, classes_verbalized, hypothesis_template=hypothesis_template, multi_label=False)

print(output)
#> {'sequence': 'Many American jobs are shipped to Chinese factories.', 'labels': ['less', 'more'], 'scores': [0.9999955296516418, 4.500270733842626e-06]}

The classifier looks at the text and performs a kind of hypothesis testing: does the text (based on a “common” understanding of the language) entail one hypothesis (e.g., it supports trading more with China) or the other (e.g., trading less with China)? It assigns a probability to each hypothesis, and the label with the highest probability is chosen (though returning multiple labels is allowed as an option, as sketched below). For example, the classifier correctly identifies the text (“Many American jobs are shipped to Chinese factories.”) as a statement that supports trading less with China.
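
For instance – a minimal sketch reusing the pipeline, text, classes, and hypothesis template defined above – setting multi_label=True scores each hypothesis independently, so the probabilities no longer need to sum to one:

output = zeroshot_classifier(
  text, classes_verbalized,
  hypothesis_template=hypothesis_template,
  multi_label=True  # each class becomes its own yes/no entailment test
)
print(output)

This is the option we will use for the Congress Tweets below, since a single Tweet can raise several concerns at once.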

Enabling GPU: using Congress Tweets as an example

To provide a concrete example and workflow, below is a sample of 360 Tweets by Members of Congress in recent years that talk about China:

import pandas as pd
china_tweets = pd.read_csv("china_tweets_sample.csv")
china_tweets['text_clean'][1]
#> 'Simply acknowledging #HumanRightsDay today isn’t enough. China is undermining freedoms for #HongKongers + others. Saudi Arabia killed a US journalist w/impunity. Our Border Patrol has let children die in custody. Rights are under threat around the globe. We cannot be silent. HTTPURL'

To run large transformer models at scale, we need to enable the GPU for parallel computing – the speed-up can be a hundredfold! For example, I’m using a MacBook with an M1 chip. To enable the Apple Silicon GPU, run:

import torch
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Device: {device}")
#> Device: mps

If you’re using an Nvidia GPU (or renting GPUs on Google Colab), you can enable CUDA by running device = torch.device("cuda" if torch.cuda.is_available() else "cpu") instead.
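
If you want one script that runs anywhere, one option – a hedged sketch, not the only way to do it, with best_device just an illustrative name – is a small helper that prefers CUDA, then MPS, then falls back on the CPU:

import torch

def best_device():
  # prefer Nvidia GPUs (e.g., on Google Colab), then Apple Silicon, then CPU
  if torch.cuda.is_available():
    return torch.device("cuda")
  if torch.backends.mps.is_available():
    return torch.device("mps")
  return torch.device("cpu")

device = best_device()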

Back to the Tweets: we first need to coerce the pandas data frame into a Dataset, the preferred structure when using Hugging Face models:

# coerce the data frame to a Dataset
from datasets import Dataset
china_tweets_ds = Dataset.from_pandas(china_tweets)
print(china_tweets_ds)
#> Dataset({
#>     features: ['tweet_id', 'text_clean', 'label_job', 'label_trade', 'label_tech', 'label_hr', 'label_milSec'],
#>     num_rows: 360
#> })

We then define the pipeline for the classification task:

# define the pipeline
zeroshot_classifier = pipeline("zero-shot-classification", model = "MoritzLaurer/deberta-v3-large-zeroshot-v1.1-all-33", device=device, batch_size=16)

# specify the hypothesis and classes
hypothesis_template = "The author of this text critizies China for {}"
classes_verbalized = [
  "job concerns, for example, China taking away American workers' jobs",
  "trade concerns, for example, China engaging in unfair trade practices",
  "technology concerns, for example, China stealing American technologies or intellectual properties or IP",
  "human rights concerns, for example, China repressing protests and arresting activists",
  "national security concerns, for example, China posing a military threat to the US"
]

# define a function
def classify_tweet(x):
  output = zeroshot_classifier(x['text_clean'], classes_verbalized, hypothesis_template=hypothesis_template, multi_label=True)
  return {"output": output}

Run the function to classify the Tweets using map:

# run the function by using map
china_tweets_classified = china_tweets_ds.map(classify_tweet, batched=True)

#> Map: 100%|##########| 360/360 [02:26<00:00,  2.46 examples/s]
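
Before moving back to R, it is worth eyeballing one classified example – a sketch that assumes, per classify_tweet above, that each row’s output column holds a dict of ranked labels and scores:

first = china_tweets_classified[0]['output']
print(first['labels'][0], first['scores'][0])  # top-ranked concern and its score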

To transport the result back to R for wrangling:

library(tidyverse)

output <- py$china_tweets_classified['output']

output <- output %>%
  bind_rows() %>%
  mutate(labels = str_extract(labels, ".+(?= concerns)")) %>%
  distinct() %>%
  pivot_wider(id_cols = sequence, names_from = labels, values_from = scores)

We can investigate the classification results:

library(DT)
datatable(output, filter = "top", options = list(pageLength = 3))

Conclusion

This blog post serves as an introduction to unlocking the power of Hugging Face and its transformer-based models. There are many topics we don’t have the time and space to explore here. For example:

  • If your text-as-data is moderately large in quantity (or in sequence length), you will likely have to acquire additional GPU computing power via platforms like Google Colab.

  • To improve the performance of a model, you will likely have to fine-tune it on your specific data and use case (see the example script for fine-tuning an NLI model by Michael Burnham).

  • To assess the performance of the model (rather than throwing up your hands and saying “Trust the AI/LLM”), you will likely have to hand-code a sample of the texts and compute some evaluation metrics (e.g., precision, accuracy, etc.) in a standard machine learning workflow – see the sketch below.
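
For that last step – a minimal sketch with hypothetical hand-coded labels (y_true) and binarized model predictions (y_pred, e.g., score >= 0.5), assuming scikit-learn is installed:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0]  # hypothetical hand-coded labels for one class
y_pred = [1, 0, 1, 0, 0]  # hypothetical model predictions at a 0.5 threshold

print(precision_score(y_true, y_pred))  # share of predicted positives that are correct
print(recall_score(y_true, y_pred))     # share of true positives that are recovered
print(accuracy_score(y_true, y_pred))   # overall share of correct predictions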