Micro-Tutorial: Quick Text Preprocessing with NLTK

1 min read · Oct 14, 2020

I often forget the exact methods needed to run a quick preprocessing pipeline. And most of the time I just need the bare minimum: remove punctuation and remove stopwords.

First thing, install NLTK, the toolkit we are going to use to handle the preprocessing.

pip install nltk

Give me the code

Here is the quick function, so you can copy and paste it wherever you need it.

A few examples:

The function we just defined removes both punctuation and stopwords.

How does it work?

It’s super easy:

line 11) we lowercase the sentence
line 12) we instantiate an NLTK tokenizer that keeps only words
line 13) we tokenize the sentence, which drops the punctuation, and get a list of tokens
line 14) we remove the stopwords
line 15) we join the remaining words into a new sentence without punctuation or stopwords

Written by Federico Bianchi
