It is better to put models somewhere else, and notebooks were broken.
add base classifier and global ngrams feature functions
1. rename DEFAULT_TAGSET to EXAMPLE_TAGSET; 2. rename DEFAULT_FEATURES to EXAMPLE_TOKEN_FEATURES; 3. make token_features empty by default in create_wapiti_pipeline.
except model_filename must be kwargs now. Also, this fixes the example from the tutorial.
…v/webstruct into speed_up_text_tokenizer
Speed up text tokenizer
Codecov Report
@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   81.01%   81.14%   +0.13%
==========================================
  Files          40       41       +1
  Lines        2091     2180      +89
==========================================
+ Hits         1694     1769      +75
- Misses        397      411      +14
Uh? Is it complaining because I did not write tests for the new features?
token = HtmlToken('1st')
expected = {'looks_like_day_ordinal': True}
result = looks_like_day_ordinal(token)
assert result == expected
Would you mind cleaning it up a bit? For example, a helper function like assert_looks_like_day_ordinal("1st", True) would reduce copy-paste and make the code clearer.
from webstruct.features import token_lower, token_identity, Pattern


class PatternTest(unittest.TestCase):
webstruct/features/data_features.py
Outdated
    return {'looks_like_date_pattern': True}  # XX/XX/XXXX
if re.search('\d{1,2}\.\d{1,2}\.\d{2,4}', html_token.token):
    return {'looks_like_date_pattern': True}  # XX.XX.XXXX
if re.search('\d{1,2}-\d{1,2}-\d{2,4}', html_token.token):
It matches XX.X.XXX, right? I think it makes sense to exclude 3-digit years from the pattern.
This function also doesn't catch common date variants like YYYY-MM-DD.
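A minimal sketch of how both points could be handled in one pattern. It is not the PR's code: the regex, the `_DATE_RE` name, and the choice to take a plain string (rather than an `HtmlToken`) are assumptions made for illustration. It also assumes the whole token must match, whereas the code under review uses the unanchored `re.search`.

```python
import re

# Illustrative pattern (an assumption, not the PR's implementation).
# Branch 1: day/month first, year of exactly 2 or 4 digits (no 3-digit years).
# Branch 2: 4-digit year first, e.g. YYYY-MM-DD or YYYY.MM.DD.
# The backreferences force the same separator in both positions.
_DATE_RE = re.compile(
    r'^(?:'
    r'\d{1,2}([./-])\d{1,2}\1(?:\d{2}|\d{4})'
    r'|'
    r'\d{4}([./-])\d{1,2}\2\d{1,2}'
    r')$'
)


def looks_like_date_pattern(text):
    """Return the feature dict for a token's text (hypothetical signature)."""
    return {'looks_like_date_pattern': bool(_DATE_RE.match(text))}
```

With this pattern, `12/05/2017`, `12.05.17` and `2017-05-12` are flagged, while `12.5.201` and mixed-separator strings such as `12/05-2017` are not.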
def test_looks_like_ordinal():


def assert_looks_like_ordinal(token, expected):
    assert looks_like_ordinal(token) == expected
If you replace it with
assert looks_like_ordinal(HtmlToken(text)) == {'looks_like_ordinal': expected}
the test code will be smaller and more DRY, and it'd be easier to add more tests. The same applies to test_looks_like_date_pattern.
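The suggested helper could look roughly like the sketch below. To keep it self-contained, `HtmlToken` and `looks_like_ordinal` are simplified stand-ins defined here (assumptions, not webstruct's real classes or the feature from this PR); only the shape of `assert_looks_like_ordinal` is the point.

```python
import re
from collections import namedtuple

# Stand-in for webstruct's HtmlToken: only the .token attribute is modelled.
HtmlToken = namedtuple('HtmlToken', ['token'])

_ORDINAL_RE = re.compile(r'^\d+(st|nd|rd|th)$', re.IGNORECASE)


def looks_like_ordinal(html_token):
    # Hypothetical stand-in for the feature function under review.
    return {'looks_like_ordinal': bool(_ORDINAL_RE.match(html_token.token))}


def assert_looks_like_ordinal(text, expected):
    # The helper wraps token construction and the expected dict,
    # so each test case is a single line.
    assert looks_like_ordinal(HtmlToken(text)) == {'looks_like_ordinal': expected}


def test_looks_like_ordinal():
    assert_looks_like_ordinal('1st', True)
    assert_looks_like_ordinal('22nd', True)
    assert_looks_like_ordinal('3rd', True)
    assert_looks_like_ordinal('first', False)
    assert_looks_like_ordinal('2017', False)
```

Adding a new case is then one more `assert_looks_like_ordinal(...)` line, and the same shape would work for the date-pattern tests.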
# FIXME: there should be a cleaner/faster way
if not all(v == out_value for v in values):
    values = [str(v) for v in values]
This is not correct in Python 2, as you'll be casting unicode features to str (i.e. to bytes).
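One way to avoid the bytes cast is to convert through a text type that is `unicode` on Python 2 and `str` on Python 3. The sketch below is an assumption about how that might look; `text_type` and `to_text` are hypothetical names, and decoding byte strings as UTF-8 is a guess about the data rather than something stated in the PR.

```python
import sys

# `unicode` exists only on Python 2; the conditional keeps Python 3 from evaluating it.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(value):
    """Convert a feature value to text without going through bytes."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # Assumption: byte strings are UTF-8 encoded.
        return value.decode('utf-8')
    return text_type(value)
```

The list comprehension under review would then become `values = [to_text(v) for v in values]`. On Python 2, `str(u'caf\xe9')` raises `UnicodeEncodeError` under the default ASCII codec, which is the failure this avoids.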
- looks_like_date now includes patterns like XXXX.XX.XX and excludes 3-digit years like XX/XX/XXX
I ran some tests to check how much these features help in identifying date objects, and the results were mixed:
Scores were evaluated with 3-fold cross-validation on 45 labelled pages, using a CRF model.
I added the features I created for Fireflax