Problem 4
Analyze the words of a text.

We consider not only the content of a word as a string, but also its position in the text. A word can appear in multiple locations in the text. (We consider inflections as different words: “go” and “goes” are considered two different words.)

We define an order on the words. A word x is smaller than a word y if the difference between the first and last occurrence of x is smaller than the corresponding difference for y.

Provide a function to extract words from a text.

Provide a function top to return the largest n words based on the aforementioned order.
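The ordering can be illustrated with a short sketch (the `span` helper below is hypothetical, introduced only for this illustration, not part of the required API):

```python
def span(word: str, tokens: list) -> int:
    """Return the difference between the last and first occurrence of word."""
    positions = [i for i, token in enumerate(tokens) if token == word]
    return positions[-1] - positions[0]

tokens = ["go", "and", "go", "and", "goes"]
# "go" occurs at indices 0 and 2, "and" at 1 and 3, "goes" only at 4.
assert span("go", tokens) == 2
assert span("and", tokens) == 2
assert span("goes", tokens) == 0  # "goes" is smaller than "go" under the order
```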
Data:
- WORD_RE: Express a normalized word of a text.
- TOKEN_RE: Express a token of a text.

Classes:
- Token: Represent a word as a token of the text.
- WordOccurrence: Represent a word occurrence in the text.

Functions:
- tokenize: Tokenize the text into normalized word tokens ignoring the punctuation.
- tokens_to_words
- find_top: Find the limit top occurrences in word_occurrences.
- WORD_RE = re.compile('^[a-z]+(-[a-z])*$')
  Express a normalized word of a text.

- class Token(text: str)
  Represent a word as a token of the text.

  Methods:
  - __new__(cls, text): Enforce the properties on the text of the word.
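One plausible reading of Token, assuming it subclasses str and that __new__ checks the text against WORD_RE (both are assumptions, since only the signature and docstring are documented; the regex is taken from the documentation above):

```python
import re

WORD_RE = re.compile('^[a-z]+(-[a-z])*$')

class Token(str):
    """Represent a word as a token of the text (sketch: a thin str subclass)."""

    def __new__(cls, text: str) -> "Token":
        # Enforce the properties on the text of the word: it must be a
        # normalized, lower-case word.
        if not WORD_RE.match(text):
            raise ValueError(f"Not a normalized word: {text!r}")
        return super().__new__(cls, text)

assert Token("hello") == "hello"
```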
- class WordOccurrence(first: int, last: int, text: Token)
  Represent a word occurrence in the text.

  Methods:
  - __init__(first, last, text): Initialize with the given values.
  - __lt__(other)
  - __le__(other)
  - __repr__(): Represent the word occurrence as a string for easier debugging.

  Attributes:
  - first: Index of the first occurrence
  - last: Index of the last occurrence
  - text: Text of the word
- __init__(first: int, last: int, text: Token) -> None
  Initialize with the given values.

  Requires:
  - last >= 0
  - first >= 0
  - first <= last

- first
  Index of the first occurrence

- last
  Index of the last occurrence

- text
  Text of the word
- __lt__(other: WordOccurrence) -> bool

- __le__(other: WordOccurrence) -> bool
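A minimal sketch of the comparison methods, assuming the order from the problem statement (x is smaller than y iff x's first-to-last span is smaller than y's); the class body here is illustrative, not the reference implementation:

```python
class WordOccurrence:
    """Sketch: a word occurrence ordered by its span last - first."""

    def __init__(self, first: int, last: int, text: str) -> None:
        assert 0 <= first <= last  # mirrors the Requires clauses above
        self.first = first
        self.last = last
        self.text = text

    def __lt__(self, other: "WordOccurrence") -> bool:
        return self.last - self.first < other.last - other.first

    def __le__(self, other: "WordOccurrence") -> bool:
        return self.last - self.first <= other.last - other.first

    def __repr__(self) -> str:
        return (f"WordOccurrence(first={self.first}, "
                f"last={self.last}, text={self.text!r})")

# "goes" at index 4 only (span 0) is smaller than "go" spanning indices 0..2.
assert WordOccurrence(4, 4, "goes") < WordOccurrence(0, 2, "go")
```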
- tokens_to_words(tokens: List[Token]) -> List[WordOccurrence]

  Ensures:
  - len(tokens) > 0 ⇒ len(result) > 0
  - len(result) <= len(tokens)
  - all(tokens[word_occurrence.first] == word_occurrence.text and tokens[word_occurrence.last] == word_occurrence.text for word_occurrence in result)
  - (word_texts := [word_occurrence.text for word_occurrence in result], len(word_texts) == len(set(word_texts)))[1] (No duplicate word occurrences)
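The postconditions suggest a single pass that records the first and last index per distinct token. A sketch of that approach (using plain tuples instead of WordOccurrence so the snippet is self-contained):

```python
from typing import Dict, List, Tuple

def tokens_to_words(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Map each distinct token to its (first, last, text) occurrence record."""
    spans: Dict[str, Tuple[int, int]] = {}
    for i, token in enumerate(tokens):
        first, _ = spans.get(token, (i, i))
        spans[token] = (first, i)
    # One record per distinct token, hence no duplicate word occurrences.
    return [(first, last, text) for text, (first, last) in spans.items()]

assert tokens_to_words(["go", "and", "go"]) == [(0, 2, "go"), (1, 1, "and")]
```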
- TOKEN_RE = re.compile('[a-zA-Z]+(-[a-zA-Z])*')
  Express a token of a text.
- tokenize(text: str) -> List[Token]
  Tokenize the text into normalized word tokens ignoring the punctuation.

  Ensures:
  - sum(len(token) for token in result) <= len(text)
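A sketch of tokenize under the assumption that normalization means lower-casing each match of TOKEN_RE (the regex is taken from the documentation above):

```python
import re

TOKEN_RE = re.compile('[a-zA-Z]+(-[a-zA-Z])*')

def tokenize(text: str) -> list:
    """Extract TOKEN_RE matches and normalize them to lower case."""
    return [match.group().lower() for match in TOKEN_RE.finditer(text)]

assert tokenize("Go, and goes!") == ["go", "and", "goes"]
```

Since every token is a substring of the input with the punctuation dropped, the postcondition `sum(len(token) for token in result) <= len(text)` holds by construction.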
- find_top(word_occurrences: List[WordOccurrence], limit: int) -> List[WordOccurrence]
  Find the limit top occurrences in word_occurrences.

  Requires:
  - limit > 0

  Ensures:
  - len(result) == min(len(word_occurrences), limit)
  - all(result[i] >= result[i + 1] for i in range(len(result) - 1))
  - (word_set := set(word_occurrences), all(word_occurrence in word_set for word_occurrence in result))[1]
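The contract (a descending result drawn from the input, capped at limit) is satisfied by heapq.nlargest; a sketch with plain integers standing in for WordOccurrence values, since only the ordering matters here:

```python
import heapq

def find_top(word_occurrences: list, limit: int) -> list:
    """Return the limit largest occurrences, largest first."""
    assert limit > 0  # mirrors the Requires clause
    return heapq.nlargest(limit, word_occurrences)

assert find_top([3, 1, 4, 1, 5], 3) == [5, 4, 3]
assert find_top([3, 1], 5) == [3, 1]
```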