68 Commits

Author SHA1 Message Date
Maxim Vladimirskiy 14f106ee76 Operate on unicode data exclusively 2022-02-04 17:31:53 +03:00
Maxim Vladimirskiy cec5acf58f Remove max tags limit 2022-01-06 14:18:11 +03:00
Matt Dietz d37c4fd551 Drops Python 2 support
REP-1030

In addition to some python 2 => 3 fixes, this change bumps the scikit-learn
version to latest. The previously pinned version of scikit-learn failed trying
to compile all necessary C modules under python 3.7+ due to included header files
that weren't compatible with the C API implemented in python 3.7+.

Simultaneously, with the restrictive compatibility supported by scikit-learn,
it seemed prudent to drop python 2 support altogether. Otherwise, we'd be stuck
with python 3.4 as the newest possible version we could support.

With this change, tests are currently passing under 3.9.2.

Lastly, imports the original training data. At some point, a new version
of the training data was committed to the repo but no classifier was
trained from it. Using a classifier trained from this new data resulted
in most of the tests failing.
2021-06-10 14:03:25 -05:00
Derrick J. Wippler 1018e88ec1 Now removing namespaces from parsed HTML 2019-05-10 11:16:12 -05:00
Sergey Obukhov 8138ea9a60 fix text with Date: misclassified as quotations splitter 2019-01-18 16:49:39 +03:00
Sergey Obukhov 0e6d5f993c fix appointments in text 2017-10-23 16:32:42 -07:00
Ezra Pagel 221774c6f8 android_wrote regex was incorrectly iterating characters in 'wrote', resulting in a greedy regex that
matched many strings with dashes
2017-08-21 12:47:06 -05:00
Hung Nguyen b8e1894f3b add test case 2017-07-10 13:28:33 +07:00
Yacine Filali 15e61768f2 Encoding fixes 2017-05-23 16:17:39 -07:00
Yacine Filali dd0a0f5c4d Python 2.7 backward compat 2017-05-23 16:10:13 -07:00
Yacine Filali 086f5ba43b Updated talon for Python 3 2017-05-23 15:39:50 -07:00
Sergey Obukhov 95954a65a0 Merge branch 'master' into polymail_support 2017-04-25 11:30:53 -07:00
Sergey Obukhov 6f159e8959 loosen the encoding requirement for detect_encoding 2017-04-25 11:19:01 -07:00
Ethan Setnik cca64d3ed1 add test case 2017-04-11 23:36:36 -04:00
Sergey Obukhov 0f5e72623b add android quotation pattern 2017-04-10 16:33:21 -07:00
smitcona 29f1d21be7 fixed expected markers and incorrect condensed header not matching regex 2017-02-06 15:03:22 +00:00
smitcona 34c5b526c3 Remove the whitespace before the line if the flag is set 2017-02-03 12:57:26 +00:00
smitcona 984c036b6e Set the marker back to 'm' rather than 't' if it matches the QUOT_PATTERN. Updated test case. 2017-02-01 18:28:19 +00:00
smitcona a403ecb5c9 Adding two level indentation test 2017-02-01 18:09:35 +00:00
smitcona a44713409c Added additional case for testing new functionality of split_emails() 2017-02-01 17:40:59 +00:00
smitcona b5e3397b88 Updating test to account for --original message-- case 2016-11-22 20:00:31 +00:00
smitcona adfed748ce split_emails function added, test added 2016-11-21 12:35:36 +00:00
Sergey Obukhov 534457e713 protect html_to_text as well 2016-09-14 09:58:41 -07:00
Sergey Obukhov ea82a9730e restrict html processing to a certain number of tags 2016-09-14 09:33:30 -07:00
Sergey Obukhov 37c95ff97b fallback untouched html if we can not parse html tree 2016-08-19 11:38:12 -07:00
Sergey Obukhov bcf97eccfa use html5lib to parse html 2016-08-15 19:36:21 -07:00
Sergey Obukhov a9719833e0 html with comment that has no parent crashes html_tree_to_text 2016-08-12 17:40:12 -07:00
Sergey Obukhov 69a44b10a1 Merge branch 'master' into sergey/empty-html 2016-08-11 23:58:11 -07:00
Sergey Obukhov 4b953bcddc fixes mailgun/talon#103 keep newlines when parsing html quotations 2016-08-11 20:17:37 -07:00
Sergey Obukhov 315eaa7080 if html stripped off quotations does not have readable text fallback to unparsed html 2016-08-11 19:55:23 -07:00
Sergey Obukhov 21e9a31ffe add test 2016-08-09 17:15:49 -07:00
Sergey Obukhov a21ccdb21b consider word capitalized only if it is camel case - not all upper case 2016-07-19 17:37:36 -07:00
Umair Khan cefbcffd59 Make tests/text_quotations_test.py compatible with Python 3. 2016-07-13 14:45:26 +05:00
Umair Khan 622a98d6d5 Make utils compatible with Python 3. 2016-07-13 13:00:24 +05:00
Umair Khan 555c34d7a8 Make sure html_to_text processes bytes 2016-07-13 11:18:10 +05:00
Umair Khan da998ddb60 Run modernizer on the code. 2016-07-12 17:25:46 +05:00
Sergey Obukhov 44e70939d6 fixes mailgun/talon#89 2016-05-17 15:31:01 -07:00
Doug Keen 333beb94af Fix #85 (exception when stripping gmail quotes) 2016-04-04 14:22:50 -07:00
Sergey Obukhov 02adf53ab9 fixes mailgun/talon#12 2016-03-04 13:14:50 -08:00
Sergey Obukhov 31803d41bc fixes mailgun/talon#18 2016-02-19 19:07:10 -08:00
Sergey Obukhov 999e9c3725 fixes mailgun/talon#19 2016-02-19 17:53:52 -08:00
Sergey Obukhov ce65ff8fc8 Merge pull request #71 from clara-labs/ms-2010-issue
First pass at handling issue with ms outlook 2010 with unenclosed quo…
2015-12-18 19:14:13 -08:00
Sergey Obukhov 3d9ae356ea add more tests, make standard reply tests more relaxed 2015-12-18 18:56:41 -08:00
Carlos Correa f688d074b5 First pass at handling issue with ms outlook 2010 with unenclosed quoted text. 2015-12-10 19:16:13 -08:00
Sergey Obukhov 41457d8fbd fixes mailgun/talon#38 mailgun/talon#20 2015-12-05 00:37:02 -08:00
Sergey Obukhov 2c416ecc0e Merge pull request #62 from tgwizard/better-support-for-scandinavian-languages
Add better support for Scandinavian languages
2015-10-14 21:48:10 -07:00
Adam Renberg 14e3a0d80b Add better support for Scandinavian languages
This is a port of https://github.com/tictail/claw/pull/6 by @simonflore.
2015-09-21 21:42:01 +02:00
Adam Renberg fcd9e2716a Add fix for Apple Mail email format
Where they have an initial > on the "date line".
2015-09-21 21:33:57 +02:00
Sergey Obukhov ae508fe0e5 fixes mailgun/talon#26 2015-09-21 09:51:26 -07:00
Sergey Obukhov d328c9d128 fixes mailgun/talon#43 2015-09-18 05:19:59 -07:00