Micro-Tutorial: Quick Text Preprocessing with NLTK

Federico Bianchi
Oct 14, 2020

I often forget the exact methods needed to run a quick preprocessing pipeline. Most of the time I just need the bare minimum: remove punctuation and remove stopwords.

First, install NLTK, the toolkit we are going to use for the preprocessing:

pip install nltk

Give me the code

I will just write this quick function here, so you can copy and paste it wherever you want.

A few examples:

The function we just defined removes both punctuation and stopwords.

How does it work?

It’s super easy:

  • line 11) we lowercase the sentence
  • line 12) we instantiate an NLTK tokenizer that keeps only words
  • line 13) we tokenize the sentence, which also drops the punctuation, and get a list of tokens
  • line 14) we remove the stopwords
  • line 15) we join the remaining words into a new sentence without punctuation and stopwords
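
To see what each step produces, the same sequence can be unrolled on a single sentence. This is only a sketch: `toy_stopwords` is a tiny hand-picked set used so the trace needs no corpus download, not NLTK's English list, and `RegexpTokenizer(r"\w+")` is assumed as the word-only tokenizer.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-in for NLTK's English stopword list, which is much longer.
toy_stopwords = {"the", "is", "on", "a"}

sentence = "The cat is on the table!"

# Step 1: lowercase
lowered = sentence.lower()
print(lowered)  # the cat is on the table!

# Steps 2 and 3: word-only tokenizer, so the "!" is dropped
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(lowered)
print(tokens)   # ['the', 'cat', 'is', 'on', 'the', 'table']

# Step 4: drop stopwords
kept = [w for w in tokens if w not in toy_stopwords]
print(kept)     # ['cat', 'table']

# Step 5: join back into a sentence
result = " ".join(kept)
print(result)   # cat table
```

Lowercasing comes first because the stopword list is lowercase: without it, "The" would survive the filter.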

Written by Federico Bianchi

Stanford University. NLP, Machine Learning and Artificial Intelligence. https://federicobianchi.io
