When the class is first created, we use:
import en_core_web_sm

class Our_Tokenizer:
    def __init__(self):
        # load the spaCy tokenizer/language model
        self.nlp = en_core_web_sm.load()
        self.nlp.max_length = 4500000  # increase max number of characters that spaCy can process (default = 1,000,000)

    def __call__(self, document):
        tokens = self.nlp(document)
        simplified_tokens = [str.lower(token.lemma_) for token in tokens]
        return simplified_tokens
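For reference, because of __call__ an instance can be called directly on a string. A minimal usage sketch (the sample sentence is made up, and the exact lemmas depend on the model version):

tokenizer = Our_Tokenizer()
tokenizer("The cats were running.")  # e.g. ['the', 'cat', 'be', 'run', '.']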
This issue relates to this line:
simplified_tokens = [str.lower(token.lemma_) for token in tokens]
Using a list comprehension like this keeps it shorter, but then we have to explain list comprehensions to learners. Not the worst thing.
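If we keep the comprehension, the brief note could simply show its for-loop equivalent, e.g.:

# equivalent for-loop form of the comprehension above
simplified_tokens = []
for token in tokens:
    simplified_tokens.append(str.lower(token.lemma_))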
However, when we incorporate stop words into the class, we use a for-loop:
simplified_tokens = []
for token in tokens:
    if not token.is_stop and not token.is_punct:
        simplified_tokens.append(str.lower(token.lemma_))
Then we switch back to a more complex list comprehension later:
simplified_tokens = [
    token for token in tokens
    if not token.is_stop
    and not token.is_punct
    and token.pos_ in {"ADJ", "ADV", "INTJ", "NOUN", "VERB"}
]
We should either stick with list comprehension (and include a brief note about what that is) or stick to a for-loop approach throughout this episode.
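If we standardize on for-loops, the final version could read something like this (a sketch keeping the same filters as above):

simplified_tokens = []
for token in tokens:
    if (not token.is_stop
            and not token.is_punct
            and token.pos_ in {"ADJ", "ADV", "INTJ", "NOUN", "VERB"}):
        simplified_tokens.append(token)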