Matt Dietz
d37c4fd551
Drops Python 2 support
REP-1030
In addition to some python 2 => 3 fixes, this change bumps the scikit-learn
version to latest. The previously pinned version of scikit-learn failed trying
to compile all necessary C modules under python 3.7+ due to included header files
that weren't compatible with the C API implemented in python 3.7+.
Simultaneously, with the restrictive compatibility supported by scikit-learn,
it seemed prudent to drop python 2 support altogether. Otherwise, we'd be stuck
with python 3.4 as the newest possible version we could support.
With this change, tests are currently passing under 3.9.2.
Lastly, imports the original training data. At some point, a new version
of the training data was committed to the repo but no classifier was
trained from it. Using a classifier trained from this new data resulted
in most of the tests failing.
2021-06-10 14:03:25 -05:00
Derrick J. Wippler
1018e88ec1
Now removing namespaces from parsed HTML
2019-05-10 11:16:12 -05:00
Sergey Obukhov
8138ea9a60
fix text with Date: misclassified as quotations splitter
2019-01-18 16:49:39 +03:00
Sergey Obukhov
0e6d5f993c
fix appointments in text
2017-10-23 16:32:42 -07:00
Ezra Pagel
221774c6f8
android_wrote regex was incorrectly iterating characters in 'wrote', resulting in greedy regex that matched many strings with dashes
2017-08-21 12:47:06 -05:00
Hung Nguyen
b8e1894f3b
add test case
2017-07-10 13:28:33 +07:00
Yacine Filali
15e61768f2
Encoding fixes
2017-05-23 16:17:39 -07:00
Yacine Filali
dd0a0f5c4d
Python 2.7 backward compat
2017-05-23 16:10:13 -07:00
Yacine Filali
086f5ba43b
Updated talon for Python 3
2017-05-23 15:39:50 -07:00
Sergey Obukhov
95954a65a0
Merge branch 'master' into polymail_support
2017-04-25 11:30:53 -07:00
Sergey Obukhov
6f159e8959
loosen the encoding requirement for detect_encoding
2017-04-25 11:19:01 -07:00
Ethan Setnik
cca64d3ed1
add test case
2017-04-11 23:36:36 -04:00
Sergey Obukhov
0f5e72623b
add android quotation pattern
2017-04-10 16:33:21 -07:00
smitcona
29f1d21be7
fixed expected markers and incorrect condensed header not matching regex
2017-02-06 15:03:22 +00:00
smitcona
34c5b526c3
Remove the whitespace before the line if the flag is set
2017-02-03 12:57:26 +00:00
smitcona
984c036b6e
Set the marker back to 'm' rather than 't' if it matches the QUOT_PATTERN. Updated test case.
2017-02-01 18:28:19 +00:00
smitcona
a403ecb5c9
Adding two level indentation test
2017-02-01 18:09:35 +00:00
smitcona
a44713409c
Added additional case for testing new functionality of split_emails()
2017-02-01 17:40:59 +00:00
smitcona
b5e3397b88
Updating test to account for --original message-- case
2016-11-22 20:00:31 +00:00
smitcona
adfed748ce
split_emails function added, test added
2016-11-21 12:35:36 +00:00
Sergey Obukhov
534457e713
protect html_to_text as well
2016-09-14 09:58:41 -07:00
Sergey Obukhov
ea82a9730e
restrict html processing to a certain number of tags
2016-09-14 09:33:30 -07:00
Sergey Obukhov
37c95ff97b
fall back to untouched html if we cannot parse the html tree
2016-08-19 11:38:12 -07:00
Sergey Obukhov
bcf97eccfa
use html5lib to parse html
2016-08-15 19:36:21 -07:00
Sergey Obukhov
a9719833e0
html with comment that has no parent crashes html_tree_to_text
2016-08-12 17:40:12 -07:00
Sergey Obukhov
69a44b10a1
Merge branch 'master' into sergey/empty-html
2016-08-11 23:58:11 -07:00
Sergey Obukhov
4b953bcddc
fixes mailgun/talon#103 keep newlines when parsing html quotations
2016-08-11 20:17:37 -07:00
Sergey Obukhov
315eaa7080
if html stripped of quotations has no readable text, fall back to unparsed html
2016-08-11 19:55:23 -07:00
Sergey Obukhov
21e9a31ffe
add test
2016-08-09 17:15:49 -07:00
Sergey Obukhov
a21ccdb21b
consider word capitalized only if it is camel case - not all upper case
2016-07-19 17:37:36 -07:00
Umair Khan
cefbcffd59
Make tests/text_quotations_test.py compatible with Python 3.
2016-07-13 14:45:26 +05:00
Umair Khan
622a98d6d5
Make utils compatible with Python 3.
2016-07-13 13:00:24 +05:00
Umair Khan
555c34d7a8
Make sure html_to_text processes bytes
2016-07-13 11:18:10 +05:00
Umair Khan
da998ddb60
Run modernizer on the code.
2016-07-12 17:25:46 +05:00
Sergey Obukhov
44e70939d6
fixes mailgun/talon#89
2016-05-17 15:31:01 -07:00
Doug Keen
333beb94af
Fix #85 (exception when stripping gmail quotes)
2016-04-04 14:22:50 -07:00
Sergey Obukhov
02adf53ab9
fixes mailgun/talon#12
2016-03-04 13:14:50 -08:00
Sergey Obukhov
31803d41bc
fixes mailgun/talon#18
2016-02-19 19:07:10 -08:00
Sergey Obukhov
999e9c3725
fixes mailgun/talon#19
2016-02-19 17:53:52 -08:00
Sergey Obukhov
ce65ff8fc8
Merge pull request #71 from clara-labs/ms-2010-issue
...
First pass at handling issue with ms outlook 2010 with unenclosed quo…
2015-12-18 19:14:13 -08:00
Sergey Obukhov
3d9ae356ea
add more tests, make standard reply tests more relaxed
2015-12-18 18:56:41 -08:00
Carlos Correa
f688d074b5
First pass at handling issue with ms outlook 2010 with unenclosed quoted text.
2015-12-10 19:16:13 -08:00
Sergey Obukhov
41457d8fbd
fixes mailgun/talon#38 mailgun/talon#20
2015-12-05 00:37:02 -08:00
Sergey Obukhov
2c416ecc0e
Merge pull request #62 from tgwizard/better-support-for-scandinavian-languages
Add better support for Scandinavian languages
2015-10-14 21:48:10 -07:00
Adam Renberg
14e3a0d80b
Add better support for Scandinavian languages
This is a port of https://github.com/tictail/claw/pull/6 by @simonflore.
2015-09-21 21:42:01 +02:00
Adam Renberg
fcd9e2716a
Add fix for Apple Mail email format
Where they have an initial > on the "date line".
2015-09-21 21:33:57 +02:00
Sergey Obukhov
ae508fe0e5
fixes mailgun/talon#26
2015-09-21 09:51:26 -07:00
Sergey Obukhov
d328c9d128
fixes mailgun/talon#43
2015-09-18 05:19:59 -07:00
Sergey Obukhov
ad09b18f3f
fixes mailgun/talon#52
2015-09-18 04:47:23 -07:00
Sergey Obukhov
15976888a0
use precise encoding when converting to unicode
2015-09-11 10:38:28 -07:00