Add date features by Kebniss · Pull Request #58 · scrapinghub/webstruct

Kebniss · 2018-01-30T15:34:02Z

I added the features I created for Fireflax

codecov · 2018-01-30T15:37:29Z

Codecov Report

Merging #58 into master will increase coverage by 0.13%.
The diff coverage is 84.44%.

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   81.01%   81.14%   +0.13%     
==========================================
  Files          40       41       +1     
  Lines        2091     2180      +89     
==========================================
+ Hits         1694     1769      +75     
- Misses        397      411      +14

Kebniss · 2018-01-30T15:41:35Z

Uh? Is it complaining because I did not write tests for the new features?

kmike · 2018-01-31T11:52:03Z

webstruct/tests/test_date_features.py

+    token = HtmlToken('1st')
+    expected = {'looks_like_day_ordinal': True}
+    result = looks_like_day_ordinal(token)
+    assert result == expected


Would you mind cleaning it up a bit? E.g. you can make a function assert_looks_like_day_ordinal("1st", True) to reduce copy-paste and make the code clearer.

kmike · 2018-01-31T11:52:59Z

webstruct/tests/test_pattern_features.py

-from webstruct.features import token_lower, token_identity, Pattern
-
-
-class PatternTest(unittest.TestCase):


Why is it removed?

kmike · 2018-03-14T11:05:43Z

webstruct/features/data_features.py

+        return {'looks_like_date_pattern': True}  # XX/XX/XXXX
+    if re.search('\d{1,2}\.\d{1,2}\.\d{2,4}', html_token.token):
+        return {'looks_like_date_pattern': True}  # XX.XX.XXXX
+    if re.search('\d{1,2}-\d{1,2}-\d{2,4}', html_token.token):


It matches XX.X.XXX, right? I think it makes sense to exclude 3-digit years from the pattern.

This function also doesn't catch common date variants like YYYY-MM-DD.
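To make the two points above concrete, here is a minimal sketch of a pattern that rejects 3-digit years and also accepts year-first dates. It is not the code from this PR: the regexes and the choice to take a plain string (rather than an HtmlToken) are assumptions made for illustration. It anchors the whole token with `^...$` instead of using `re.search` on a substring, and requires both separators to be the same character.

```python
import re

# Hypothetical patterns for illustration, not the code merged in the PR.
# Day/month first: 1-2 digits, a separator, 1-2 digits, the same separator,
# then a 2- or 4-digit year (3-digit years are rejected).
_DMY = re.compile(r'^\d{1,2}([./-])\d{1,2}\1(?:\d{2}|\d{4})$')
# Year first: YYYY-MM-DD, YYYY.MM.DD or YYYY/MM/DD.
_YMD = re.compile(r'^\d{4}([./-])\d{1,2}\1\d{1,2}$')


def looks_like_date_pattern(text):
    """Return a feature dict saying whether `text` resembles a numeric date."""
    matched = bool(_DMY.match(text) or _YMD.match(text))
    return {'looks_like_date_pattern': matched}
```

With this sketch, `looks_like_date_pattern('12.3.2018')` and `looks_like_date_pattern('2018-03-14')` report a match, while `'12.3.201'` and the mixed-separator `'12/03-2018'` do not.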

kmike · 2018-03-14T11:07:48Z

webstruct/tests/test_date_features.py

+def test_looks_like_ordinal():
+
+    def assert_looks_like_ordinal(token, expected):
+        assert looks_like_ordinal(token) == expected


If you replace it with

assert looks_like_ordinal(HtmlToken(text)) == {'looks_like_ordinal': expected}

the test code will be smaller and more DRY, and it will be easier to add more tests. The same applies to test_looks_like_date_pattern.
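A sketch of how the suggested helper might look once it takes the raw text and the expected boolean. To keep the example runnable on its own, `HtmlToken` and `looks_like_ordinal` below are simplified stand-ins, not webstruct's real implementations; only the shape of the helper is the point.

```python
import re
from collections import namedtuple

# Stand-in for webstruct's HtmlToken; only the `token` attribute is modelled.
HtmlToken = namedtuple('HtmlToken', ['token'])


def looks_like_ordinal(html_token):
    # Placeholder feature for illustration; the PR's implementation may differ.
    text = html_token.token.lower()
    matched = re.match(r'^\d+(st|nd|rd|th)$', text) is not None
    return {'looks_like_ordinal': matched}


def assert_looks_like_ordinal(text, expected):
    # One line per case: wrap the text in a token and compare the feature dict.
    assert looks_like_ordinal(HtmlToken(text)) == {'looks_like_ordinal': expected}


def test_looks_like_ordinal():
    assert_looks_like_ordinal('1st', True)
    assert_looks_like_ordinal('22nd', True)
    assert_looks_like_ordinal('first', False)
    assert_looks_like_ordinal('2018', False)
```

Each new case is then a single call, so neither the token construction nor the expected dict has to be repeated.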

kmike · 2018-03-14T11:09:21Z

webstruct/features/global_features.py


        # FIXME: there should be a cleaner/faster way
        if not all(v == out_value for v in values):
+            values = [str(v) for v in values]


This is not correct in Python 2, as you'll be casting unicode features to str (i.e. to bytes).
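One common way around this is to cast with the interpreter's text type rather than `str`. The sketch below is an assumption about how that could look, not the fix that was eventually committed; `text_type`, `join_values` and the `'/'` separator are hypothetical names and choices.

```python
import sys

# On Python 3 the text type is `str`; on Python 2 it is `unicode`.
# Only the selected branch of the conditional is evaluated, so the name
# `unicode` is never looked up on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def join_values(values, sep='/'):
    """Join feature values of mixed types (bool, int, text) as text."""
    return sep.join(text_type(v) for v in values)
```

On Python 2, byte strings containing non-ASCII data would still need an explicit `.decode(...)` before being passed to `text_type`, since the implicit conversion uses the ASCII codec.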

Kebniss · 2018-05-03T07:00:09Z

I ran some tests to check how much these features help in identifying date objects, and the results were mixed:

- When start and end dates were identified by a single entity, the extra features slightly worsened performance: the F1 scores for B-date and I-date moved from 0.567 and 0.628 to 0.548 and 0.611, respectively. Sequence accuracy remained the same.
- When start and end dates were identified as two separate entities, the extra features slightly improved performance. The F1 score for B-END_DATE moved from 0.591 to 0.625, I-END_DATE from 0.682 to 0.721, B-START_DATE from 0.522 to 0.547, and I-START_DATE from 0.667 to 0.690. Sequence accuracy went from 1.5% to 3.1%.

Scores were computed with 3-fold cross-validation on 45 labelled pages, using a CRF model.

kmike and others added 30 commits April 23, 2014 02:37

Remove example notebooks and models from repo.

98b3ae4

It is better to put models somewhere else, and notebooks were broken.

Merge pull request #10 from tpeng/crfsuite-backend

e3defff

add base classifier and global ngrams feature functions

simplify CombinedFeatures and make it private

40a5415

features.utils -> feature.global_features

45a005f

TST fix tests

4ee7f40

replace Ngram global feature with Pattern

a636a9a

DOC fix autodocs

04eed65

DOC minor fixes

115a5a4

(backwards-incompatible) kill default features:

52759bd

1. rename DEFAULT_TAGSET to EXAMPLE_TAGSET; 2. rename DEFAULT_FEATURES to EXAMPLE_TOKEN_FEATURES; 3. make token_features empty by default in create_wapiti_pipeline.

(backwards-incompatible) rename "transform" to "predict" for estimato…

a91d1c9

…r classes

TST don't require NLTK for tests

ab1b589

simple __repr__ for HtmlToken

9204eec

(backwards-incompatible) all create_wapiti_pipeline wapiti params

829f708

except model_filename must be kwargs now. Also, this fixes the example from the tutorial.

WordTokenizer.tokenize rewritten

e52ab9e

doctests indent

98a2a0b

fix unicode handling for a new tokenizer; add pounds char to rules

989072c

Merge branch 'speed_up_text_tokenizer' of https://github.com/chekunko…

177ad80

…v/webstruct into speed_up_text_tokenizer

Merge pull request #16 from scrapinghub/speed_up_text_tokenizer

5fe04f6

Speed up text tokenizer

small tokenizer cleanup

226e53f

make min_length and max_length arguments required for utils.substrings

24926c5

add crfsuite backend base on python-crfsuite

b6d60f1

DOC: fix crfsuite docstring

e3ef37a

DOC fix style and typo

f96cae1

fix HtmlTokenizer pickling

383f8b7

WapitiCRF.fit returns self

0adaaf2

train_test_split_noshuffle

92553b7

TST runcoverage script

55598e0

python-crfsuite support; tests for NER and crfsuite pipeline

a2111d4

expose CRFsuiteCRF and CCRFsuiteFeatureEncoder

01b0ee6

rename wapiti_kwargs to crf_kwargs for consistency

0f248b6

Kebniss requested review from kmike and whalebot-helmsman January 30, 2018 15:34

Kebniss added 2 commits January 31, 2018 12:46

Remove XX\XX\XXXX from looks_like_date_pattern because regex was not …

539b20c

…working

Add todo list for solving small bugs

9cc8b80

kmike reviewed Jan 31, 2018


Kebniss added 2 commits January 31, 2018 15:40

put test_pattern_features.py back

f838fcc

Remove duplicate code from tests

2e77ec2

Kebniss changed the title ~~Add date features~~ [WIP] Add date features Feb 2, 2018

looks_like_day_ordinal True only for numbers between 0 and 32

d5f7deb

Kebniss changed the title ~~[WIP] Add date features~~ Add date features Feb 2, 2018

Kebniss added 2 commits March 6, 2018 17:15

add looks_like_ordinal and remove looks_like_ordinal_day + modify tests

2910304

force all values to be string in order to join them

c848bfe

kmike reviewed Mar 14, 2018


Kebniss added 10 commits March 14, 2018 16:34

Fix looks_like_date and tests

3aef718

- looks_like_date now includes patterns like XXXX.XX.XX and excludes 3 digit years like XX/XX/XXX

Cast values to string using py2 and py3 compatible method

f68431f

Update .travis.yml

f8fa819

swicth re.fullmatch to anchors for compatibility with py2

4d0fb06

Merge branch 'date-features' of github.com:scrapinghub/webstruct into…

cd82fc7

… date-features

speed looks_like_date + rename looks_like_ordinal_en + fix tests

09c4e0a

Copy w3lib/to_native_str in utils + remove w3lib dependency

c1b71f6

Remove todo

e906ccd

remove to_native_str

2aced50

fix list comprehension

4646095

Gallaecio force-pushed the master branch from 9e46156 to 17c8254 Compare December 19, 2019 13:21

Kebniss wants to merge 450 commits into master from date-features

8 participants
