#####GLOSSARY SESSION 11#####

Common Crawl
Stochastic Parrots
Painting the field
training data
hegemonic views
uncurated datasets
dominant
crawling methodology
hegemony retains or maintains hegemony
biases and harms
Internet text collections
overrepresentation in datasets
GPT-2
GPT-3
Users are men
Women underrepresented
Structural factors
moderation practices
suspension of accounts
harassment practices perpetuated
unwelcome populations
Alternative communities are then excluded from these datasets
anti-ageist frames
age discrimination
filtering
The Colossal Clean Crawled Corpus
curate training datasets
decolonizing Twitter
datasets for the Chinese language
incentivized to publish in international contexts
geographies of datasets
appropriate data