135 Commits

Author SHA1 Message Date
Derrick J. Wippler
3083f86c75 Continue with quotation cut even if html cut throws an exception 2020-02-10 11:40:00 -06:00
Derrick J. Wippler
c575beb27d Test import clean up and pep8 2020-01-30 11:50:41 -06:00
Sergey Obukhov
d9ed7cc6d1 Merge pull request #190 from yoks/master
Add __init__.py into data folder, add data files into MANIFEST.in
2019-07-02 18:56:47 +03:00
Sergey Obukhov
0a0808c0a8 Merge branch 'master' into master 2019-07-01 20:48:46 +03:00
Sergey Obukhov
16354e3528 Merge pull request #191 from mailgun/thrawn/develop
PIP-423: Now removing namespaces from parsed HTML
2019-05-12 11:54:17 +03:00
Derrick J. Wippler
1018e88ec1 Now removing namespaces from parsed HTML 2019-05-10 11:16:12 -05:00
Ivan Anisimov
2916351517 Update setup.py 2019-03-16 22:17:26 +03:00
Ivan Anisimov
46d4b02c81 Update setup.py 2019-03-16 22:15:43 +03:00
Ivan Anisimov
58eac88a10 Update MANIFEST.in 2019-03-16 22:03:40 +03:00
Ivan Anisimov
2ef3d8dfbe Update MANIFEST.in 2019-03-16 22:01:00 +03:00
Ivan Anisimov
7cf4c29340 Create __init__.py 2019-03-16 21:54:09 +03:00
Sergey Obukhov
cdd84563dd Merge pull request #183 from mailgun/sergey/date
fix text with Date: misclassified as quotations splitter
2019-01-18 17:32:10 +03:00
Sergey Obukhov
8138ea9a60 fix text with Date: misclassified as quotations splitter 2019-01-18 16:49:39 +03:00
Sergey Obukhov
c171f9a875 Merge pull request #169 from Savageman/patch-2
Use regex match to detect outlook 2007, 2010, 2013
2018-11-05 10:43:20 +03:00
Sergey Obukhov
3f97a8b8ff Merge branch 'master' into patch-2 2018-11-05 10:42:00 +03:00
Esperat Julian
1147767ff3 Fix regression: windows mail format was left forgotten
A `|` was missing at the end of the regex; adding it makes the following lines (the windows mail pattern) part of the global search again.
2018-11-04 19:42:12 +01:00
Sergey Obukhov
6a304215c3 Merge pull request #177 from mailgun/obukhov-sergey-patch-1
Update Readme with how to retrain on your own data
2018-11-02 15:22:18 +03:00
Sergey Obukhov
31714506bd Update Readme with how to retrain on your own data 2018-11-02 15:21:36 +03:00
Sergey Obukhov
403d80cf3b Merge pull request #161 from glaand/master
Fix: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
2018-11-02 15:03:02 +03:00
Sergey Obukhov
7cf20f2877 Merge branch 'master' into master 2018-11-02 14:52:38 +03:00
Sergey Obukhov
afff08b017 Merge branch 'master' into patch-2 2018-11-02 09:13:42 +03:00
Sergey Obukhov
685abb1905 Merge pull request #171 from gabriellima95/Add-Portuguese-Language
Add Portuguese language to quotations
2018-11-02 09:12:43 +03:00
Sergey Obukhov
41990727a3 Merge branch 'master' into Add-Portuguese-Language 2018-11-02 09:11:07 +03:00
Sergey Obukhov
b113d8ab33 Merge pull request #172 from ad-m/patch-1
Fix catastrophic backtracking in regexp
2018-11-02 09:09:49 +03:00
Adam Dobrawy
7bd0e9cc2f Fix catastrophic backtracking in regexp
Co-Author: @Nipsuli
2018-09-21 22:00:10 +02:00
gabriellima95
1e030a51d4 Add Portuguese language to quotations 2018-09-11 15:27:39 -03:00
Esperat Julian
238a5de5cc Use regex match to detect outlook 2007, 2010, 2013
I encountered a variant of the outlook quotations with a space after the semicolon.

To prevent multiplying the number of rules, I implemented a regex match instead (I found how to here: https://stackoverflow.com/a/34093801/211204).

I documented all the different variants as cleanly as I could.
2018-08-31 12:39:52 +02:00
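The approach this commit describes relies on lxml's EXSLT support, which exposes regular-expression functions to XPath. A minimal sketch of the technique (the styled div below is illustrative, not a real Outlook message):

    from lxml import html

    # EXSLT regular-expressions namespace: adds re:match()/re:test() to XPath
    ns = {"re": "http://exslt.org/regular-expressions"}

    doc = html.fromstring(
        "<div style='border:none; border-top:solid #E1E1E1 1.0pt;"
        "padding:3.0pt 0in 0in 0in'>quoted reply</div>")

    # one regex covers both border colors and the optional space after the semicolon
    hits = doc.xpath(
        "//div[@style[re:match(., 'border:none; ?border-top:solid "
        "#(E1E1E1|B5C4DF) 1.0pt')]]",
        namespaces=ns)
    print(len(hits))  # -> 1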
André Glatzl
53b24ffb3d First cut out some encoding html tags, such as the xml and doctype declarations, to avoid conflicts with unicode decoding 2017-12-19 15:15:10 +01:00
Sergey Obukhov
a7404afbcb Merge pull request #155 from mailgun/sergey/appointment
fix appointments in text
2017-10-23 16:34:08 -07:00
Sergey Obukhov
0e6d5f993c fix appointments in text 2017-10-23 16:32:42 -07:00
Sergey Obukhov
60637ff13a Merge pull request #152 from mailgun/sergey/v1.4.4
bump version
2017-08-24 16:00:05 -07:00
Sergey Obukhov
df8259e3fe bump version 2017-08-24 15:58:53 -07:00
Sergey Obukhov
aab3b1cc75 Merge pull request #150 from ezrapagel/fix_greedy_dash_regex
android_wrote regex incorrectly matching
2017-08-24 15:52:29 -07:00
Sergey Obukhov
9492b39f2d Merge branch 'master' into fix_greedy_dash_regex 2017-08-24 15:39:28 -07:00
Sergey Obukhov
b9ac866ea7 Merge pull request #151 from mailgun/sergey/reshape
reshape data as suggested by sklearn
2017-08-24 12:04:58 -07:00
Sergey Obukhov
678517dd89 reshape data as suggested by sklearn 2017-08-24 12:03:47 -07:00
Ezra Pagel
221774c6f8 android_wrote regex was incorrectly iterating characters in 'wrote', resulting in greedy regex that
matched many strings with dashes
2017-08-21 12:47:06 -05:00
Sergey Obukhov
a2aa345712 Merge pull request #148 from mailgun/sergey/v1.4.2
bump version after adding support for Vietnamese format
2017-07-10 11:44:46 -07:00
Sergey Obukhov
d998beaff3 bump version after adding support for Vietnamese format 2017-07-10 11:42:52 -07:00
Sergey Obukhov
a379bc4e7c Merge pull request #147 from hnx116/master
add support for Vietnamese reply format
2017-07-10 11:40:04 -07:00
Hung Nguyen
b8e1894f3b add test case 2017-07-10 13:28:33 +07:00
Hung Nguyen
0b5a44090f add support for Vietnamese reply format 2017-07-10 11:18:57 +07:00
Sergey Obukhov
b40835eca2 Merge pull request #145 from mailgun/sergey/outlook-2013-version-bump
bump version after merging outlook 2013 support PR
2017-06-18 22:56:16 -07:00
Sergey Obukhov
b38562c7cc bump version after merging outlook 2013 support PR 2017-06-18 22:55:15 -07:00
Sergey Obukhov
70e9fb415e Merge pull request #139 from Savageman/patch-1
Added Outlook 2013 rules
2017-06-18 22:53:18 -07:00
Sergey Obukhov
64612099cd Merge branch 'master' into patch-1 2017-06-18 22:51:46 -07:00
Sergey Obukhov
45c20f979d Merge pull request #144 from mailgun/sergey/python3-support-version-bump
bump version after merging python 3 support PR
2017-06-18 22:49:20 -07:00
Sergey Obukhov
743c76f159 bump version after merging python 3 support PR 2017-06-18 22:48:12 -07:00
Sergey Obukhov
bc5dad75d3 Merge pull request #141 from yfilali/master
Python 3 compatibility up to 3.6.1
2017-06-18 22:44:07 -07:00
Yacine Filali
4acf05cf28 Only use load compat if we can't load the classifier 2017-05-24 13:29:59 -07:00
Yacine Filali
f5f7264077 Can now handle read only classifier data as well 2017-05-24 13:22:24 -07:00
Yacine Filali
4364bebf38 Added exception checking for pickle format conversion 2017-05-24 10:26:33 -07:00
Yacine Filali
15e61768f2 Encoding fixes 2017-05-23 16:17:39 -07:00
Yacine Filali
dd0a0f5c4d Python 2.7 backward compat 2017-05-23 16:10:13 -07:00
Yacine Filali
086f5ba43b Updated talon for Python 3 2017-05-23 15:39:50 -07:00
Esperat Julian
e16dcf629e Added Outlook 2013 rules
Only the border color changes (compared to Outlook 2007, 2010) from `#B5C4DF` to `#E1E1E1`.
2017-04-27 11:34:01 +02:00
Sergey Obukhov
f16ae5110b Merge pull request #138 from mailgun/sergey/v1.3.7
bumped talon version
2017-04-25 11:49:29 -07:00
Sergey Obukhov
ab5cbe5ec3 bumped talon version 2017-04-25 11:43:55 -07:00
Sergey Obukhov
be5da92f16 Merge pull request #135 from esetnik/polymail_support
Polymail Quote Support
2017-04-25 11:34:47 -07:00
Sergey Obukhov
95954a65a0 Merge branch 'master' into polymail_support 2017-04-25 11:30:53 -07:00
Sergey Obukhov
0b55e8fa77 Merge pull request #137 from mailgun/sergey/chardet
loosen the encoding requirement for detect_encoding
2017-04-25 11:29:06 -07:00
Sergey Obukhov
6f159e8959 loosen the encoding requirement for detect_encoding 2017-04-25 11:19:01 -07:00
Ethan Setnik
5c413b4b00 allow more lines since polymail has extra whitespace 2017-04-12 00:07:29 -04:00
Ethan Setnik
cca64d3ed1 add test case 2017-04-11 23:36:36 -04:00
Ethan Setnik
e11eaf6ff8 add support for polymail reply format 2017-04-11 22:38:29 -04:00
Sergey Obukhov
85a4c1d855 Merge pull request #133 from mailgun/sergey/android
add android quotation pattern
2017-04-10 16:37:17 -07:00
Sergey Obukhov
0f5e72623b add android quotation pattern 2017-04-10 16:33:21 -07:00
Sergey Obukhov
061e549ad7 Merge pull request #128 from mailgun/sergey/1.3.4
bump version
2017-02-14 11:17:35 -08:00
Sergey Obukhov
49d1a5d248 bump version 2017-02-14 11:05:50 -08:00
Sergey Obukhov
03d6b00db8 Merge pull request #127 from conalsmith49/mark-splitlines-in-email-quotation-indents
Split_Email(): Mark splitlines for headers indented with spaces or email quotation indents (">")
2017-02-14 11:03:51 -08:00
smitcona
a2eb0f7201 Creating new method which removes initial spaces and marks the message lines. Removing ambiguity introduced to mark_message_lines 2017-02-14 18:19:45 +00:00
smitcona
5c71a0ca07 Split the comment lines so that they are not over 80 characters 2017-02-13 16:45:26 +00:00
Sergey Obukhov
489d16fad9 Merge branch 'master' into mark-splitlines-in-email-quotation-indents 2017-02-09 21:10:16 -08:00
Sergey Obukhov
a458707777 Merge pull request #124 from phanindra-ramesh/issue_123
Fixes issue #123
2017-02-09 20:55:36 -08:00
smitcona
a1d0a86305 Pass ignore_initial_spaces=True as this has better clarity than separate boolean variable 2017-02-07 12:47:33 +00:00
smitcona
29f1d21be7 fixed expected markers and incorrect condensed header not matching regex 2017-02-06 15:03:22 +00:00
smitcona
34c5b526c3 Remove the whitespace before the line if the flag is set 2017-02-03 12:57:26 +00:00
smitcona
3edb6578ba Dividing preprocess method into two methods, split_emails() now calls one without email content being altered. 2017-02-03 11:49:23 +00:00
smitcona
984c036b6e Set the marker back to 'm' rather than 't' if it matches the QUOT_PATTERN. Updated test case. 2017-02-01 18:28:19 +00:00
smitcona
a403ecb5c9 Adding two level indentation test 2017-02-01 18:09:35 +00:00
smitcona
a44713409c Added additional case for testing new functionality of split_emails() 2017-02-01 17:40:59 +00:00
smitcona
567467b8ed Update comment 2017-02-01 17:29:05 +00:00
smitcona
139edd6104 Add new method which marks as splitlines, lines which are splitlines but start with email quotation indents ("> ") 2017-02-01 17:16:30 +00:00
Phanindra Ramesh Challa
e756d55abf Fixes issue #123 2016-12-27 13:53:40 +05:30
Sergey Obukhov
015c8d2a78 Merge pull request #120 from mailgun/sergey/talon-1.3.3
bump talon version
2016-11-30 18:28:39 -08:00
Sergey Obukhov
5af846c13d bump talon version 2016-11-30 12:56:06 -08:00
Sergey Obukhov
e69a9c7a54 Merge pull request #119 from conapart3/master
Addition of new split_email method for issue #115
2016-11-30 12:51:32 -08:00
conapart3
23cb2a9a53 Merge pull request #1 from conapart3/issue-115-date-split-in-headers
split_emails function added, test added
2016-11-22 20:02:54 +00:00
smitcona
b5e3397b88 Updating test to account for --original message-- case 2016-11-22 20:00:31 +00:00
smitcona
5685a4055a Improved algorithm 2016-11-22 19:56:57 +00:00
smitcona
97b72ef767 Adding in_header_block variable for reliability 2016-11-22 19:06:34 +00:00
smitcona
31489848be Remove print lines 2016-11-21 17:36:06 +00:00
smitcona
e5988d447b Add space 2016-11-21 12:48:29 +00:00
smitcona
adfed748ce split_emails function added, test added 2016-11-21 12:35:36 +00:00
Sergey Obukhov
2444ba87c0 Merge pull request #111 from mailgun/sergey/tagscount
restrict html processing to a certain number of tags
2016-09-14 11:06:29 -07:00
Sergey Obukhov
534457e713 protect html_to_text as well 2016-09-14 09:58:41 -07:00
Sergey Obukhov
ea82a9730e restrict html processing to a certain number of tags 2016-09-14 09:33:30 -07:00
Sergey Obukhov
f04b872e14 Merge pull request #108 from mailgun/sergey/html5lib-fix
use new parser each time we parse a document
2016-08-22 18:10:35 -07:00
Sergey Obukhov
e61894e425 bump version 2016-08-22 17:34:18 -07:00
Sergey Obukhov
35fbdaadac use new parser each time we parse a document 2016-08-22 16:25:04 -07:00
Sergey Obukhov
8441bc7328 Merge pull request #106 from mailgun/sergey/html5lib
use html5lib to parse html
2016-08-19 15:58:07 -07:00
Sergey Obukhov
37c95ff97b fallback untouched html if we can not parse html tree 2016-08-19 11:38:12 -07:00
Sergey Obukhov
5b1ca33c57 fix cssselect 2016-08-16 17:11:41 -07:00
Sergey Obukhov
ec8e09b34e fix 2016-08-15 20:31:04 -07:00
Sergey Obukhov
bcf97eccfa use html5lib to parse html 2016-08-15 19:36:21 -07:00
Sergey Obukhov
f53b5cc7a6 Merge pull request #105 from mailgun/sergey/fromstring
html with comment that has no parent crashes html_tree_to_text
2016-08-15 13:40:37 -07:00
Sergey Obukhov
27adde7aa7 bump version 2016-08-15 13:21:10 -07:00
Sergey Obukhov
a9719833e0 html with comment that has no parent crashes html_tree_to_text 2016-08-12 17:40:12 -07:00
Sergey Obukhov
7bf37090ca Merge pull request #101 from mailgun/sergey/empty-html
if html stripped off quotations does not have readable text fallback to unparsed html
2016-08-12 12:18:50 -07:00
Sergey Obukhov
44fcef7123 bump version 2016-08-11 23:59:18 -07:00
Sergey Obukhov
69a44b10a1 Merge branch 'master' into sergey/empty-html 2016-08-11 23:58:11 -07:00
Sergey Obukhov
b085e3d049 Merge pull request #104 from mailgun/sergey/spaces
fixes mailgun/talon#103 keep newlines when parsing html quotations
2016-08-11 23:56:26 -07:00
Sergey Obukhov
4b953bcddc fixes mailgun/talon#103 keep newlines when parsing html quotations 2016-08-11 20:17:37 -07:00
Sergey Obukhov
315eaa7080 if html stripped off quotations does not have readable text fallback to unparsed html 2016-08-11 19:55:23 -07:00
Sergey Obukhov
5a9bc967f1 Merge pull request #100 from mailgun/sergey/restrict
do not parse html quotations if html is longer than a certain threshold
2016-08-11 16:08:03 -07:00
Sergey Obukhov
a0d7236d0b bump version and add a comment 2016-08-11 15:49:09 -07:00
Sergey Obukhov
21e9a31ffe add test 2016-08-09 17:15:49 -07:00
Sergey Obukhov
4ee46c0a97 do not parse html quotations if html is longer than a certain threshold 2016-08-09 17:08:58 -07:00
Sergey Obukhov
10d9a930f9 Merge pull request #99 from mailgun/sergey/capitalized
consider word capitalized only if it is camel case - not all upper case
2016-07-20 16:47:12 -07:00
Sergey Obukhov
a21ccdb21b consider word capitalized only if it is camel case - not all upper case 2016-07-19 17:37:36 -07:00
Sergey Obukhov
7cdd7a8f35 Merge pull request #98 from mailgun/sergey/1.2.11
version bump
2016-07-19 16:22:24 -07:00
Sergey Obukhov
01e03a47e0 version bump 2016-07-19 15:51:46 -07:00
Sergey Obukhov
1b9a71551a Merge pull request #97 from umairwaheed/strip-talon
Strip down Talon
2016-07-19 15:46:56 -07:00
Umair Khan
911efd1db4 Move encoding detection inside if condition. 2016-07-19 09:44:40 +05:00
Umair Khan
e61f0a68c4 Add six library to setup.py 2016-07-19 09:40:03 +05:00
Umair Khan
cefbcffd59 Make tests/text_quotations_test.py compatible with Python 3. 2016-07-13 14:45:26 +05:00
Umair Khan
622a98d6d5 Make utils compatible with Python 3. 2016-07-13 13:00:24 +05:00
Umair Khan
7901f5d1dc Convert msg_body into unicode in preprocess. 2016-07-13 11:18:10 +05:00
Umair Khan
555c34d7a8 Make sure html_to_text processes bytes 2016-07-13 11:18:10 +05:00
Umair Khan
dcc0d1de20 Convert msg_body to bytes in extract_from_html 2016-07-13 11:18:06 +05:00
Umair Khan
7bdf4d622b Only encode if str 2016-07-13 08:01:47 +05:00
Umair Khan
4a7207b0d0 Only convert to unicode if str 2016-07-13 08:01:47 +05:00
Umair Khan
ad9c2ca0e8 Upgrade quotations.py 2016-07-13 08:01:44 +05:00
Umair Khan
da998ddb60 Run modernizer on the code. 2016-07-12 17:25:46 +05:00
Umair Khan
07f68815df Allow installation of ML free version.
Add an option to the install script, `--no-ml`, that when given will
install Talon without ML support.

Fixes #96
2016-07-12 15:08:53 +05:00
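Judging from the setup.py changes later in this diff, the flag hangs off the setuptools install command, so an ML-free source install would presumably be invoked as `python setup.py install --no-ml`.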
29 changed files with 3508 additions and 2907 deletions

.gitignore

@@ -39,6 +39,8 @@ nosetests.xml
 /.emacs.desktop
 /.emacs.desktop.lock
 .elc
+.idea
+.cache
 auto-save-list
 tramp
 .\#*
@@ -51,4 +53,4 @@ tramp
 _trial_temp
 # OSX
 .DS_Store

MANIFEST.in

@@ -1,9 +1,14 @@
-recursive-include tests *
-recursive-include talon *
 recursive-exclude tests *.pyc *~
 recursive-exclude talon *.pyc *~
 include train.data
 include classifier
 include LICENSE
 include MANIFEST.in
 include README.rst
+include talon/signature/data/train.data
+include talon/signature/data/classifier
+include talon/signature/data/classifier_01.npy
+include talon/signature/data/classifier_02.npy
+include talon/signature/data/classifier_03.npy
+include talon/signature/data/classifier_04.npy
+include talon/signature/data/classifier_05.npy

README.rst

@@ -129,6 +129,22 @@ start using it for talon.
 .. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
 .. _forge: https://github.com/mailgun/forge

+Training on your dataset
+------------------------
+
+talon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the `forge`_ project does. Then do:
+
+.. code:: python
+
+    from talon.signature.learning.dataset import build_extraction_dataset
+    from talon.signature.learning import classifier as c
+    build_extraction_dataset("/path/to/your/P/folder", "/path/to/talon/signature/data/train.data")
+    c.train(c.init(), "/path/to/talon/signature/data/train.data", "/path/to/talon/signature/data/classifier")
+
+Note that for signature extraction you need just the folder with the positive samples with annotated signature lines (P folder).
+
+.. _forge: https://github.com/mailgun/forge
+
 Research
 --------

setup.py

@@ -1,8 +1,35 @@
+from __future__ import absolute_import
 from setuptools import setup, find_packages
+from setuptools.command.install import install
+
+
+class InstallCommand(install):
+    user_options = install.user_options + [
+        ('no-ml', None, "Don't install without Machine Learning modules."),
+    ]
+
+    boolean_options = install.boolean_options + ['no-ml']
+
+    def initialize_options(self):
+        install.initialize_options(self)
+        self.no_ml = None
+
+    def finalize_options(self):
+        install.finalize_options(self)
+        if self.no_ml:
+            dist = self.distribution
+            dist.packages = find_packages(exclude=[
+                'tests',
+                'tests.*',
+                'talon.signature',
+                'talon.signature.*',
+            ])
+            for not_required in ['numpy', 'scipy', 'scikit-learn==0.16.1']:
+                dist.install_requires.remove(not_required)
+

 setup(name='talon',
-      version='1.2.10',
+      version='1.4.8',
       description=("Mailgun library "
                    "to extract message quotations and signatures."),
       long_description=open("README.rst").read(),
@@ -10,7 +37,10 @@ setup(name='talon',
       author_email='admin@mailgunhq.com',
       url='https://github.com/mailgun/talon',
       license='APACHE2',
-      packages=find_packages(exclude=['tests']),
+      cmdclass={
+          'install': InstallCommand,
+      },
+      packages=find_packages(exclude=['tests', 'tests.*']),
       include_package_data=True,
       zip_safe=True,
       install_requires=[
@@ -21,7 +51,9 @@ setup(name='talon',
          "scikit-learn==0.16.1",  # pickled versions of classifier, else rebuild
          'chardet>=1.0.1',
          'cchardet>=0.3.5',
-         'cssselect'
+         'cssselect',
+         'six>=1.10.0',
+         'html5lib'
          ],
       tests_require=[
          "mock",

talon/__init__.py

@@ -1,7 +1,13 @@
+from __future__ import absolute_import
 from talon.quotations import register_xpath_extensions
-from talon import signature
+try:
+    from talon import signature
+    ML_ENABLED = True
+except ImportError:
+    ML_ENABLED = False


 def init():
     register_xpath_extensions()
-    signature.initialize()
+    if ML_ENABLED:
+        signature.initialize()
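The try/except around the import is a common optional-dependency guard: if the ML stack behind talon.signature is missing, talon still imports and init() simply skips the signature machinery. A generic sketch of the same pattern (names here are illustrative, not talon's):

    try:
        import numpy  # stand-in for a heavy, optional dependency
        ML_ENABLED = True
    except ImportError:
        ML_ENABLED = False

    def init():
        # cheap setup always runs; ML-only setup runs when available
        if ML_ENABLED:
            print('ML features enabled')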

talon/constants.py

@@ -1,3 +1,4 @@
+from __future__ import absolute_import
 import regex as re

talon/html_quotations.py

@@ -3,8 +3,10 @@ The module's functions operate on message bodies trying to extract original
 messages (without quoted messages) from html
 """

+from __future__ import absolute_import
 import regex as re

+from talon.utils import cssselect

 CHECKPOINT_PREFIX = '#!%!'
 CHECKPOINT_SUFFIX = '!%!#'
@@ -77,7 +79,7 @@ def delete_quotation_tags(html_note, counter, quotation_checkpoints):

 def cut_gmail_quote(html_message):
     ''' Cuts the outermost block element with class gmail_quote. '''
-    gmail_quote = html_message.cssselect('div.gmail_quote')
+    gmail_quote = cssselect('div.gmail_quote', html_message)
     if gmail_quote and (gmail_quote[0].text is None or not RE_FWD.match(gmail_quote[0].text)):
         gmail_quote[0].getparent().remove(gmail_quote[0])
         return True
@@ -85,17 +87,24 @@ def cut_gmail_quote(html_message):

 def cut_microsoft_quote(html_message):
     ''' Cuts splitter block and all following blocks. '''
+    # use EXSLT extensions to have a regex match() function with lxml
+    ns = {"re": "http://exslt.org/regular-expressions"}
+
+    # general pattern: @style='border:none;border-top:solid <color> 1.0pt;padding:3.0pt 0<unit> 0<unit> 0<unit>'
+    # outlook 2007, 2010 (international)  <color=#B5C4DF> <unit=cm>
+    # outlook 2007, 2010 (american)       <color=#B5C4DF> <unit=in>
+    # outlook 2013 (international)        <color=#E1E1E1> <unit=cm>
+    # outlook 2013 (american)             <color=#E1E1E1> <unit=in>
+    # also handles a variant with a space after the semicolon
     splitter = html_message.xpath(
-        # outlook 2007, 2010 (international)
-        "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
-        "padding:3.0pt 0cm 0cm 0cm']|"
-        # outlook 2007, 2010 (american)
-        "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
-        "padding:3.0pt 0in 0in 0in']|"
+        # outlook 2007, 2010, 2013 (international, american)
+        "//div[@style[re:match(., 'border:none; ?border-top:solid #(E1E1E1|B5C4DF) 1.0pt; ?"
+        "padding:3.0pt 0(in|cm) 0(in|cm) 0(in|cm)')]]|"
         # windows mail
         "//div[@style='padding-top: 5px; "
         "border-top-color: rgb(229, 229, 229); "
         "border-top-width: 1px; border-top-style: solid;']"
+        , namespaces=ns
     )

     if splitter:
@@ -134,7 +143,7 @@ def cut_microsoft_quote(html_message):
 def cut_by_id(html_message):
     found = False
     for quote_id in QUOTE_IDS:
-        quote = html_message.cssselect('#{}'.format(quote_id))
+        quote = cssselect('#{}'.format(quote_id), html_message)
         if quote:
             found = True
             quote[0].getparent().remove(quote[0])

talon/quotations.py

@@ -5,20 +5,24 @@ The module's functions operate on message bodies trying to extract
 original messages (without quoted messages)
 """

+from __future__ import absolute_import
 import regex as re
 import logging

 from copy import deepcopy

 from lxml import html, etree

-from talon.utils import get_delimiter, html_to_text
+from talon.utils import (get_delimiter, html_tree_to_text,
+                         html_document_fromstring)
 from talon import html_quotations
+from six.moves import range
+import six

 log = logging.getLogger(__name__)

-RE_FWD = re.compile("^[-]+[ ]*Forwarded message[ ]*[-]+$", re.I | re.M)
+RE_FWD = re.compile("^[-]+[ ]*Forwarded message[ ]*[-]+\s*$", re.I | re.M)

 RE_ON_DATE_SMB_WROTE = re.compile(
     u'(-*[>]?[ ]?({0})[ ].*({1})(.*\n){{0,2}}.*({2}):?-*)'.format(
@@ -34,10 +38,14 @@ RE_ON_DATE_SMB_WROTE = re.compile(
         'Op',
         # German
         'Am',
+        # Portuguese
+        'Em',
         # Norwegian
         u'På',
         # Swedish, Danish
         'Den',
+        # Vietnamese
+        u'Vào',
     )),
     # Date and sender separator
     u'|'.join((
@@ -58,8 +66,12 @@ RE_ON_DATE_SMB_WROTE = re.compile(
         'schreef', 'verzond', 'geschreven',
         # German
         'schrieb',
+        # Portuguese
+        'escreveu',
         # Norwegian, Swedish
         'skrev',
+        # Vietnamese
+        u'đã viết',
     ))
 ))

 # Special case for languages where text is translated like this: 'on {date} wrote {somebody}:'
@@ -127,14 +139,33 @@ RE_ORIGINAL_MESSAGE = re.compile(u'[\s]*[-]+[ ]*({})[ ]*[-]+'.format(
         'Oprindelig meddelelse',
     ))), re.I)

-RE_FROM_COLON_OR_DATE_COLON = re.compile(u'(_+\r?\n)?[\s]*(:?[*]?{})[\s]?:[*]? .*'.format(
+RE_FROM_COLON_OR_DATE_COLON = re.compile(u'((_+\r?\n)?[\s]*:?[*]?({})[\s]?:([^\n$]+\n){{1,2}}){{2,}}'.format(
     u'|'.join((
         # "From" in different languages.
         'From', 'Van', 'De', 'Von', 'Fra', u'Från',
         # "Date" in different languages.
-        'Date', 'Datum', u'Envoyé', 'Skickat', 'Sendt',
+        'Date', '[S]ent', 'Datum', u'Envoyé', 'Skickat', 'Sendt', 'Gesendet',
+        # "Subject" in different languages.
+        'Subject', 'Betreff', 'Objet', 'Emne', u'Ämne',
+        # "To" in different languages.
+        'To', 'An', 'Til', u'À', 'Till'
     ))), re.I | re.M)

+# ---- John Smith wrote ----
+RE_ANDROID_WROTE = re.compile(u'[\s]*[-]+.*({})[ ]*[-]+'.format(
+    u'|'.join((
+        # English
+        'wrote',
+    ))), re.I)
+
+# Support polymail.io reply format
+# On Tue, Apr 11, 2017 at 10:07 PM John Smith
+#
+# <
+# mailto:John Smith <johnsmith@gmail.com>
+# > wrote:
+RE_POLYMAIL = re.compile('On.*\s{2}<\smailto:.*\s> wrote:', re.I)
+
 SPLITTER_PATTERNS = [
     RE_ORIGINAL_MESSAGE,
     RE_ON_DATE_SMB_WROTE,
@@ -142,29 +173,36 @@ SPLITTER_PATTERNS = [
     RE_FROM_COLON_OR_DATE_COLON,
     # 02.04.2012 14:20 пользователь "bob@example.com" <
     # bob@xxx.mailgun.org> написал:
-    re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.S),
+    re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*\s\S+@\S+", re.S),
     # 2014-10-17 11:28 GMT+03:00 Bob <
     # bob@example.com>:
-    re.compile("\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}\s+GMT.*@", re.S),
+    re.compile("\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}\s+GMT.*\s\S+@\S+", re.S),
     # Thu, 26 Jun 2014 14:00:51 +0400 Bob <bob@example.com>:
     re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?'
                '( \S+){3,6}@\S+:'),
     # Sent from Samsung MobileName <address@example.com> wrote:
-    re.compile('Sent from Samsung .*@.*> wrote')
+    re.compile('Sent from Samsung.* \S+@\S+> wrote'),
+    RE_ANDROID_WROTE,
+    RE_POLYMAIL
 ]

 RE_LINK = re.compile('<(http://[^>]*)>')
 RE_NORMALIZED_LINK = re.compile('@@(http://[^>@]*)@@')
 RE_PARENTHESIS_LINK = re.compile("\(https?://")

-SPLITTER_MAX_LINES = 4
+SPLITTER_MAX_LINES = 6
 MAX_LINES_COUNT = 1000
+# extensive research shows that exceeding this limit
+# leads to excessive processing time
+MAX_HTML_LEN = 2794202

 QUOT_PATTERN = re.compile('^>+ ?')
 NO_QUOT_LINE = re.compile('^[^>].*[\S].*')

+# Regular expression to identify if a line is a header.
+RE_HEADER = re.compile(": ")
+

 def extract_from(msg_body, content_type='text/plain'):
     try:
@@ -178,6 +216,19 @@ def extract_from(msg_body, content_type='text/plain'):
     return msg_body


+def remove_initial_spaces_and_mark_message_lines(lines):
+    """
+    Removes the initial spaces in each line before marking message lines.
+
+    This ensures headers can be identified if they are indented with spaces.
+    """
+    i = 0
+    while i < len(lines):
+        lines[i] = lines[i].lstrip(' ')
+        i += 1
+
+    return mark_message_lines(lines)
+
+
 def mark_message_lines(lines):
     """Mark message lines with markers to distinguish quotation lines.

@@ -191,7 +242,7 @@ def mark_message_lines(lines):
     >>> mark_message_lines(['answer', 'From: foo@bar.com', '', '> question'])
     'tsem'
     """
-    markers = bytearray(len(lines))
+    markers = ['e' for _ in lines]
     i = 0
     while i < len(lines):
         if not lines[i].strip():
@@ -207,7 +258,7 @@ def mark_message_lines(lines):
         if splitter:
             # append as many splitter markers as lines in splitter
             splitter_lines = splitter.group().splitlines()
-            for j in xrange(len(splitter_lines)):
+            for j in range(len(splitter_lines)):
                 markers[i + j] = 's'

             # skip splitter lines
@@ -217,7 +268,7 @@ def mark_message_lines(lines):
             markers[i] = 't'
         i += 1

-    return markers
+    return ''.join(markers)
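For reference, a small illustration of the marker strings used throughout this module ('t' text, 's' splitter, 'm' quoted, 'e' empty); this assumes talon's internals are called directly, and the exact output is indicative:

    from talon import quotations

    lines = ['Thanks!',
             '',
             'On Mon, Jan 1, 2018 at 9:00 AM, Bob <bob@example.com> wrote:',
             '> earlier message']
    print(quotations.mark_message_lines(lines))  # expected: 'tesm'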

 def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
@@ -231,6 +282,7 @@ def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
     return_flags = [were_lines_deleted, first_deleted_line,
                     last_deleted_line]
     """
+    markers = ''.join(markers)
     # if there are no splitter there should be no markers
     if 's' not in markers and not re.search('(me*){3}', markers):
         markers = markers.replace('m', 't')
@@ -242,7 +294,7 @@ def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
     # inlined reply
     # use lookbehind assertions to find overlapping entries e.g. for 'mtmtm'
     # both 't' entries should be found
-    for inline_reply in re.finditer('(?<=m)e*((?:t+e*)+)m', markers):
+    for inline_reply in re.finditer('(?<=m)e*(t[te]*)m', markers):
         # long links could break sequence of quotation lines but they shouldn't
         # be considered an inline reply
         links = (
@@ -276,10 +328,27 @@ def preprocess(msg_body, delimiter, content_type='text/plain'):
     Replaces link brackets so that they couldn't be taken for quotation marker.
     Splits line in two if splitter pattern preceded by some text on the same
     line (done only for 'On <date> <person> wrote:' pattern).
+
+    Converts msg_body into a unicode.
     """
-    # normalize links i.e. replace '<', '>' wrapping the link with some symbols
-    # so that '>' closing the link couldn't be mistakenly taken for quotation
-    # marker.
+    msg_body = _replace_link_brackets(msg_body)
+    msg_body = _wrap_splitter_with_newline(msg_body, delimiter, content_type)
+    return msg_body
+
+
+def _replace_link_brackets(msg_body):
+    """
+    Normalize links i.e. replace '<', '>' wrapping the link with some symbols
+    so that '>' closing the link couldn't be mistakenly taken for quotation
+    marker.
+
+    Converts msg_body into a unicode.
+    """
+    if isinstance(msg_body, bytes):
+        msg_body = msg_body.decode('utf8')
+
     def link_wrapper(link):
         newline_index = msg_body[:link.start()].rfind("\n")
         if msg_body[newline_index + 1] == ">":
@@ -288,7 +357,14 @@ def preprocess(msg_body, delimiter, content_type='text/plain'):
         return "@@%s@@" % link.group(1)

     msg_body = re.sub(RE_LINK, link_wrapper, msg_body)
+    return msg_body
+

+def _wrap_splitter_with_newline(msg_body, delimiter, content_type='text/plain'):
+    """
+    Splits line in two if splitter pattern preceded by some text on the same
+    line (done only for 'On <date> <person> wrote:' pattern).
+    """
     def splitter_wrapper(splitter):
         """Wraps splitter with new line"""
         if splitter.start() and msg_body[splitter.start() - 1] != '\n':
@@ -342,28 +418,70 @@ def extract_from_html(msg_body):
     then extracting quotations from text,
     then checking deleted checkpoints,
     then deleting necessary tags.
+
+    Returns a unicode string.
     """
-    if msg_body.strip() == '':
+    if isinstance(msg_body, six.text_type):
+        msg_body = msg_body.encode('utf8')
+    elif not isinstance(msg_body, bytes):
+        msg_body = msg_body.encode('ascii')
+
+    result = _extract_from_html(msg_body)
+    if isinstance(result, bytes):
+        result = result.decode('utf8')
+
+    return result
+
+
+def _extract_from_html(msg_body):
+    """
+    Extract not quoted message from provided html message body
+    using tags and plain text algorithm.
+
+    Cut out first some encoding html tags such as xml and doctype
+    to avoid conflicts with unicode decoding.
+
+    Cut out the 'blockquote', 'gmail_quote' tags.
+    Cut Microsoft quotations.
+
+    Then use plain text algorithm to cut out splitter or
+    leftover quotation.
+    This works by adding checkpoint text to all html tags,
+    then converting html to text,
+    then extracting quotations from text,
+    then checking deleted checkpoints,
+    then deleting necessary tags.
+    """
+    if msg_body.strip() == b'':
         return msg_body

-    msg_body = msg_body.replace('\r\n', '').replace('\n', '')
-    html_tree = html.document_fromstring(
-        msg_body,
-        parser=html.HTMLParser(encoding="utf-8")
-    )
-    cut_quotations = (html_quotations.cut_gmail_quote(html_tree) or
-                      html_quotations.cut_zimbra_quote(html_tree) or
-                      html_quotations.cut_blockquote(html_tree) or
-                      html_quotations.cut_microsoft_quote(html_tree) or
-                      html_quotations.cut_by_id(html_tree) or
-                      html_quotations.cut_from_block(html_tree)
-                      )
+    msg_body = msg_body.replace(b'\r\n', b'\n')
+
+    msg_body = re.sub(r"\<\?xml.+\?\>|\<\!DOCTYPE.+]\>", "", msg_body)
+
+    html_tree = html_document_fromstring(msg_body)
+
+    if html_tree is None:
+        return msg_body
+
+    cut_quotations = False
+    try:
+        cut_quotations = (html_quotations.cut_gmail_quote(html_tree) or
+                          html_quotations.cut_zimbra_quote(html_tree) or
+                          html_quotations.cut_blockquote(html_tree) or
+                          html_quotations.cut_microsoft_quote(html_tree) or
+                          html_quotations.cut_by_id(html_tree) or
+                          html_quotations.cut_from_block(html_tree)
+                          )
+    except Exception:
+        log.exception('during html quotations cut')
+
     html_tree_copy = deepcopy(html_tree)

     number_of_checkpoints = html_quotations.add_checkpoint(html_tree, 0)
     quotation_checkpoints = [False] * number_of_checkpoints
-    msg_with_checkpoints = html.tostring(html_tree)
-    plain_text = html_to_text(msg_with_checkpoints)
+    plain_text = html_tree_to_text(html_tree)
     plain_text = preprocess(plain_text, '\n', content_type='text/html')
     lines = plain_text.splitlines()

@@ -386,25 +504,166 @@ def extract_from_html(msg_body):
     return_flags = []
     process_marked_lines(lines, markers, return_flags)
     lines_were_deleted, first_deleted, last_deleted = return_flags
+
+    if not lines_were_deleted and not cut_quotations:
+        return msg_body
+
     if lines_were_deleted:
         #collect checkpoints from deleted lines
-        for i in xrange(first_deleted, last_deleted):
+        for i in range(first_deleted, last_deleted):
             for checkpoint in line_checkpoints[i]:
                 quotation_checkpoints[checkpoint] = True
-    else:
-        if cut_quotations:
-            return html.tostring(html_tree_copy)
-        else:
-            return msg_body

     # Remove tags with quotation checkpoints
     html_quotations.delete_quotation_tags(
         html_tree_copy, 0, quotation_checkpoints
     )

+    if _readable_text_empty(html_tree_copy):
+        return msg_body
+
+    # NOTE: We remove_namespaces() because we are using an HTML5 parser; HTML
+    # parsers do not recognize namespaces in HTML tags. As such the rendered
+    # HTML tags are no longer recognizable HTML tags. Example: <o:p> becomes
+    # <oU0003Ap>. When we port this to golang we should look into using an
+    # XML parser NOT an HTML5 parser, since we do not know what input a
+    # customer will send us. Switching to a common XML parser in python
+    # opens us up to a host of vulnerabilities.
+    # See https://docs.python.org/3/library/xml.html#xml-vulnerabilities
+    #
+    # The downside to removing the namespaces is that customers might
+    # judge the XML namespaces important. If that is the case then support
+    # should encourage customers to perform XML parsing of the un-stripped
+    # body to get the full unmodified XML payload.
+    #
+    # Alternatives to this approach are
+    # 1. Ignore the U0003A in tag names and let the customer deal with it.
+    #    This is not ideal, as most customers use stripped-html for viewing
+    #    emails sent from a recipient; as such they cannot control the HTML
+    #    provided by a recipient.
+    # 2. Perform a string replace of 'U0003A' to ':' on the rendered HTML
+    #    string. While this would solve the issue simply, it runs the risk
+    #    of replacing data outside the <tag> which might be essential to
+    #    the customer.
+    remove_namespaces(html_tree_copy)
     return html.tostring(html_tree_copy)
+
+
+def remove_namespaces(root):
+    """
+    Given the root of an HTML document iterate through all the elements
+    and remove any namespaces that might have been provided and remove
+    any attributes that contain a namespace.
+
+    <html xmlns:o="urn:schemas-microsoft-com:office:office">
+    becomes
+    <html>
+
+    <o:p>Hi</o:p>
+    becomes
+    <p>Hi</p>
+
+    Start tags do NOT have a namespace; COLON characters have no special meaning.
+    If we don't remove the namespace the parser translates the tag name into a
+    unicode representation. For example <o:p> becomes <oU0003Ap>.
+
+    See https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#start-tags
+    """
+    for child in root.iter():
+        for key, value in child.attrib.items():
+            # If the attribute includes a colon
+            if key.rfind("U0003A") != -1:
+                child.attrib.pop(key)
+
+        # If the tag includes a colon
+        idx = child.tag.rfind("U0003A")
+        if idx != -1:
+            child.tag = child.tag[idx+6:]
+
+    return root
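A small demonstration of the artifact the NOTE above describes, assuming html5lib and lxml are installed (exact tag spelling may vary by version):

    from lxml.html import html5parser

    tree = html5parser.fromstring(
        '<html xmlns:o="urn:schemas-microsoft-com:office:office">'
        '<body><o:p>Hi</o:p></body></html>')

    # the html5lib/lxml tree builder escapes the colon in the Office tag,
    # so it surfaces with a U0003A marker instead of as <o:p>
    print([el.tag for el in tree.iter()])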
+
+
+def split_emails(msg):
+    """
+    Given a message (which may consist of an email conversation thread with
+    multiple emails), mark the lines to identify split lines, content lines
+    and empty lines.
+
+    Correct the split line markers inside header blocks. Header blocks are
+    identified by the regular expression RE_HEADER.
+
+    Return the corrected markers.
+    """
+    msg_body = _replace_link_brackets(msg)
+
+    # don't process too long messages
+    lines = msg_body.splitlines()[:MAX_LINES_COUNT]
+    markers = remove_initial_spaces_and_mark_message_lines(lines)
+
+    markers = _mark_quoted_email_splitlines(markers, lines)
+
+    # we don't want splitlines in header blocks
+    markers = _correct_splitlines_in_headers(markers, lines)
+
+    return markers
+
+
+def _mark_quoted_email_splitlines(markers, lines):
+    """
+    When there are headers indented with '>' characters, this method will
+    attempt to identify if the header is a splitline header. If it is, then we
+    mark it with 's' instead of leaving it as 'm' and return the new markers.
+    """
+    # Create a list of markers to easily alter specific characters
+    markerlist = list(markers)
+    for i, line in enumerate(lines):
+        if markerlist[i] != 'm':
+            continue
+        for pattern in SPLITTER_PATTERNS:
+            matcher = re.search(pattern, line)
+            if matcher:
+                markerlist[i] = 's'
+                break
+
+    return "".join(markerlist)
+
+
+def _correct_splitlines_in_headers(markers, lines):
+    """
+    Corrects markers by removing splitlines deemed to be inside header blocks.
+    """
+    updated_markers = ""
+    i = 0
+    in_header_block = False
+
+    for m in markers:
+        # Only set the in_header_block flag when we hit an 's' and the line
+        # is a header
+        if m == 's':
+            if not in_header_block:
+                if bool(re.search(RE_HEADER, lines[i])):
+                    in_header_block = True
+            else:
+                if QUOT_PATTERN.match(lines[i]):
+                    m = 'm'
+                else:
+                    m = 't'
+
+        # If the line is not a header line, set in_header_block false.
+        if not bool(re.search(RE_HEADER, lines[i])):
+            in_header_block = False
+
+        # Add the marker to the new updated markers string.
+        updated_markers += m
+        i += 1
+
+    return updated_markers
+
+
+def _readable_text_empty(html_tree):
+    return not bool(html_tree_to_text(html_tree).strip())


 def is_splitter(line):
     '''
     Returns Matcher object if provided string is a splitter and
@@ -418,7 +677,7 @@ def is_splitter(line):

 def text_content(context):
     '''XPath Extension function to return a node text content.'''
-    return context.context_node.text_content().strip()
+    return context.context_node.xpath("string()").strip()


 def tail(context):

talon/signature/__init__.py

@@ -20,6 +20,7 @@ trained against, don't forget to regenerate:
     * signature/data/classifier
 """

+from __future__ import absolute_import
 import os

 from . import extraction

talon/signature/bruteforce.py

@@ -1,14 +1,15 @@
+from __future__ import absolute_import
 import logging

 import regex as re

-from talon.utils import get_delimiter
 from talon.signature.constants import (SIGNATURE_MAX_LINES,
                                        TOO_LONG_SIGNATURE_LINE)
+from talon.utils import get_delimiter

 log = logging.getLogger(__name__)

 # regex to fetch signature based on common signature words
 RE_SIGNATURE = re.compile(r'''
 (
@@ -27,7 +28,6 @@ RE_SIGNATURE = re.compile(r'''
 )
 ''', re.I | re.X | re.M | re.S)

-
 # signatures appended by phone email clients
 RE_PHONE_SIGNATURE = re.compile(r'''
 (
@@ -44,7 +44,6 @@ RE_PHONE_SIGNATURE = re.compile(r'''
 )
 ''', re.I | re.X | re.M | re.S)

-
 # see _mark_candidate_indexes() for details
 # c - could be signature line
 # d - line starts with dashes (could be signature or list item)
@@ -63,7 +62,7 @@ RE_SIGNATURE_CANDIDATE = re.compile(r'''

 def extract_signature(msg_body):
-    '''
+    """
     Analyzes message for a presence of signature block (by common patterns)
     and returns tuple with two elements: message text without signature block
     and the signature itself.
@@ -73,7 +72,7 @@ def extract_signature(msg_body):
     >>> extract_signature('Hey man!')
     ('Hey man!', None)
-    '''
+    """
     try:
         # identify line delimiter first
         delimiter = get_delimiter(msg_body)
@@ -111,7 +110,7 @@ def extract_signature(msg_body):
         return (stripped_body.strip(),
                 signature.strip())
-    except Exception, e:
+    except Exception:
         log.exception('ERROR extracting signature')
         return (msg_body, None)
@@ -162,7 +161,7 @@ def _mark_candidate_indexes(lines, candidate):
     'cdc'
     """
     # at first consider everything to be potential signature lines
-    markers = bytearray('c'*len(candidate))
+    markers = list('c' * len(candidate))

     # mark lines starting from bottom up
     for i, line_idx in reversed(list(enumerate(candidate))):
@@ -173,7 +172,7 @@ def _mark_candidate_indexes(lines, candidate):
         if line.startswith('-') and line.strip("-"):
             markers[i] = 'd'

-    return markers
+    return "".join(markers)


 def _process_marked_candidate_indexes(candidate, markers):

talon/signature/data/__init__.py (new file)

@@ -0,0 +1 @@
+

File diff suppressed because it is too large

talon/signature/extraction.py

@@ -1,15 +1,15 @@
 # -*- coding: utf-8 -*-

+from __future__ import absolute_import
 import logging

-import regex as re
 import numpy
+import regex as re

-from talon.signature.learning.featurespace import features, build_pattern
-from talon.utils import get_delimiter
 from talon.signature.bruteforce import get_signature_candidate
+from talon.signature.learning.featurespace import features, build_pattern
 from talon.signature.learning.helpers import has_signature
+from talon.utils import get_delimiter

 log = logging.getLogger(__name__)

@@ -32,7 +32,7 @@ RE_REVERSE_SIGNATURE = re.compile(r'''

 def is_signature_line(line, sender, classifier):
     '''Checks if the line belongs to signature. Returns True or False.'''
-    data = numpy.array(build_pattern(line, features(sender)))
+    data = numpy.array(build_pattern(line, features(sender))).reshape(1, -1)
     return classifier.predict(data) > 0
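The reshape(1, -1) added here is the standard way to hand scikit-learn a single sample, since its estimators expect a 2-D (n_samples, n_features) array; a generic illustration:

    import numpy

    row = numpy.array([0, 1, 0, 1])  # one sample as a 1-D feature vector
    X = row.reshape(1, -1)           # shape (1, 4): one row, columns inferred
    print(X.shape)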
@@ -57,7 +57,7 @@ def extract(body, sender):
             text = delimiter.join(text)
             if text.strip():
                 return (text, delimiter.join(signature))
-    except Exception:
+    except Exception as e:
         log.exception('ERROR when extracting signature with classifiers')
     return (body, None)

@@ -80,7 +80,7 @@ def _mark_lines(lines, sender):
     candidate = get_signature_candidate(lines)

     # at first consider everything to be text no signature
-    markers = bytearray('t'*len(lines))
+    markers = list('t' * len(lines))

     # mark lines starting from bottom up
     # mark only lines that belong to candidate
@@ -95,7 +95,7 @@ def _mark_lines(lines, sender):
         elif is_signature_line(line, sender, EXTRACTOR):
             markers[j] = 's'

-    return markers
+    return "".join(markers)


 def _process_marked_lines(lines, markers):
@@ -110,3 +110,4 @@ def _process_marked_lines(lines, markers):
         return (lines[:-signature.end()], lines[-signature.end():])

     return (lines, None)
+

talon/signature/learning/classifier.py

@@ -5,9 +5,11 @@ The classifier could be used to detect if a certain line of the message
 body belongs to the signature.
 """

+from __future__ import absolute_import
+
 from numpy import genfromtxt
-from sklearn.svm import LinearSVC
 from sklearn.externals import joblib
+from sklearn.svm import LinearSVC


 def init():
@@ -28,4 +30,40 @@ def train(classifier, train_data_filename, save_classifier_filename=None):

 def load(saved_classifier_filename, train_data_filename):
     """Loads saved classifier. """
-    return joblib.load(saved_classifier_filename)
+    try:
+        return joblib.load(saved_classifier_filename)
+    except Exception:
+        import sys
+        if sys.version_info > (3, 0):
+            return load_compat(saved_classifier_filename)
+        raise
+
+
+def load_compat(saved_classifier_filename):
+    import os
+    import pickle
+    import tempfile
+
+    # we need to switch to the data path to properly load the related _xx.npy files
+    cwd = os.getcwd()
+    os.chdir(os.path.dirname(saved_classifier_filename))
+
+    # convert encoding using pickle.load and write to a temp file which we'll tell joblib to use
+    pickle_file = open(saved_classifier_filename, 'rb')
+    classifier = pickle.load(pickle_file, encoding='latin1')
+
+    try:
+        # save our conversion if permissions allow
+        joblib.dump(classifier, saved_classifier_filename)
+    except Exception:
+        # can't write to classifier, use a temp file
+        tmp = tempfile.SpooledTemporaryFile()
+        joblib.dump(classifier, tmp)
+        saved_classifier_filename = tmp
+
+    # important, use joblib.load before switching back to original cwd
+    jb_classifier = joblib.load(saved_classifier_filename)
+
+    os.chdir(cwd)
+
+    return jb_classifier
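The encoding='latin1' argument above is the usual trick for reading Python 2 pickles under Python 3: latin-1 maps every byte 0-255 to the same code point, so old 8-bit str payloads survive unmangled. A standalone sketch (the file name is illustrative):

    import pickle

    with open('classifier', 'rb') as f:          # a pickle written by Python 2
        obj = pickle.load(f, encoding='latin1')  # decode old str fields byte-for-byte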

talon/signature/learning/dataset.py

@@ -16,13 +16,16 @@ suffix and the corresponding sender file has the same name except for the
 suffix which should be `_sender`.
 """

+from __future__ import absolute_import
+
 import os

 import regex as re
+from six.moves import range

 from talon.signature.constants import SIGNATURE_MAX_LINES
 from talon.signature.learning.featurespace import build_pattern, features

 SENDER_SUFFIX = '_sender'
 BODY_SUFFIX = '_body'
@@ -55,9 +58,14 @@ def parse_msg_sender(filename, sender_known=True):
     algorithm:
     >>> parse_msg_sender(filename, False)
     """
+    import sys
+    kwargs = {}
+    if sys.version_info > (3, 0):
+        kwargs["encoding"] = "utf8"
+
     sender, msg = None, None
     if os.path.isfile(filename) and not is_sender_filename(filename):
-        with open(filename) as f:
+        with open(filename, **kwargs) as f:
             msg = f.read()
             sender = u''
             if sender_known:
@@ -144,8 +152,8 @@ def build_extraction_dataset(folder, dataset_filename,
             if not sender or not msg:
                 continue
             lines = msg.splitlines()
-            for i in xrange(1, min(SIGNATURE_MAX_LINES,
-                                   len(lines)) + 1):
+            for i in range(1, min(SIGNATURE_MAX_LINES,
+                                  len(lines)) + 1):
                 line = lines[-i]
                 label = -1
                 if line[:len(SIGNATURE_ANNOTATION)] == \

talon/signature/learning/featurespace.py

@@ -7,9 +7,12 @@ The body and the message sender string are converted into unicode before
 applying features to them.
 """

+from __future__ import absolute_import
 from talon.signature.constants import (SIGNATURE_MAX_LINES,
                                        TOO_LONG_SIGNATURE_LINE)
 from talon.signature.learning.helpers import *
+from six.moves import zip
+from functools import reduce


 def features(sender=''):

talon/signature/learning/helpers.py

@@ -6,6 +6,7 @@

 """

+from __future__ import absolute_import
 import unicodedata

 import regex as re
@@ -184,12 +185,13 @@ def capitalized_words_percent(s):
     s = to_unicode(s, precise=True)
     words = re.split('\s', s)
     words = [w for w in words if w.strip()]
+    words = [w for w in words if len(w) > 2]
     capitalized_words_counter = 0
     valid_words_counter = 0
     for word in words:
         if not INVALID_WORD_START.match(word):
             valid_words_counter += 1
-            if word[0].isupper():
+            if word[0].isupper() and not word[1].isupper():
                 capitalized_words_counter += 1
     if valid_words_counter > 0 and len(words) > 1:
         return 100 * float(capitalized_words_counter) / valid_words_counter

talon/utils.py

@@ -1,13 +1,16 @@
 # coding:utf-8
-import logging
+from __future__ import absolute_import

-from random import shuffle
-import chardet
-import cchardet
-import regex as re
-from lxml import html
+import logging
+from random import shuffle
+
+import cchardet
+import chardet
+import html5lib
+import regex as re
+import six
 from lxml.cssselect import CSSSelector
+from lxml.html import html5parser

 from talon.constants import RE_DELIMITER
@@ -28,7 +31,7 @@ def safe_format(format_string, *args, **kwargs):
     except (UnicodeEncodeError, UnicodeDecodeError):
         format_string = to_utf8(format_string)
         args = [to_utf8(p) for p in args]
-        kwargs = {k: to_utf8(v) for k, v in kwargs.iteritems()}
+        kwargs = {k: to_utf8(v) for k, v in six.iteritems(kwargs)}
         return format_string.format(*args, **kwargs)

     # ignore other errors
@@ -45,9 +48,9 @@ def to_unicode(str_or_unicode, precise=False):
         u'привет'
     If `precise` flag is True, tries to guess the correct encoding first.
     """
-    encoding = quick_detect_encoding(str_or_unicode) if precise else 'utf-8'
-    if isinstance(str_or_unicode, str):
-        return unicode(str_or_unicode, encoding, 'replace')
+    if not isinstance(str_or_unicode, six.text_type):
+        encoding = quick_detect_encoding(str_or_unicode) if precise else 'utf-8'
+        return six.text_type(str_or_unicode, encoding, 'replace')
     return str_or_unicode
@@ -57,11 +60,12 @@ def detect_encoding(string):
     Defaults to UTF-8.
     """
+    assert isinstance(string, bytes)
     try:
         detected = chardet.detect(string)
         if detected:
             return detected.get('encoding') or 'utf-8'
-    except Exception, e:
+    except Exception as e:
         pass
     return 'utf-8'
@@ -72,11 +76,12 @@ def quick_detect_encoding(string):
     Uses cchardet. Fallbacks to detect_encoding.
     """
+    assert isinstance(string, bytes)
     try:
         detected = cchardet.detect(string)
         if detected:
             return detected.get('encoding') or detect_encoding(string)
-    except Exception, e:
+    except Exception as e:
         pass
     return detect_encoding(string)
@@ -87,7 +92,7 @@ def to_utf8(str_or_unicode):
     >>> utils.to_utf8(u'hi')
     'hi'
     """
-    if isinstance(str_or_unicode, unicode):
+    if not isinstance(str_or_unicode, six.text_type):
         return str_or_unicode.encode("utf-8", "ignore")
     return str(str_or_unicode)
@@ -109,32 +114,24 @@ def get_delimiter(msg_body):
     return delimiter

-def html_to_text(string):
-    """
-    Dead-simple HTML-to-text converter:
-    >>> html_to_text("one<br>two<br>three")
-    >>> "one\ntwo\nthree"
-
-    NOTES:
-    1. the string is expected to contain UTF-8 encoded HTML!
-    2. returns utf-8 encoded str (not unicode)
-    """
-    s = _prepend_utf8_declaration(string)
-    s = s.replace("\n", "")
-    tree = html.fromstring(s)
-
+def html_tree_to_text(tree):
     for style in CSSSelector('style')(tree):
         style.getparent().remove(style)

     for c in tree.xpath('//comment()'):
-        c.getparent().remove(c)
+        parent = c.getparent()
+
+        # comment with no parent does not impact produced text
+        if parent is None:
+            continue
+
+        parent.remove(c)

     text = ""
     for el in tree.iter():
         el_text = (el.text or '') + (el.tail or '')
         if len(el_text) > 1:
-            if el.tag in _BLOCKTAGS:
+            if el.tag in _BLOCKTAGS + _HARDBREAKS:
                 text += "\n"
             if el.tag == 'li':
                 text += " * "
@@ -145,17 +142,80 @@ def html_to_text(string):
             if href:
                 text += "(%s) " % href

-        if el.tag in _HARDBREAKS and text and not text.endswith("\n"):
+        if (el.tag in _HARDBREAKS and text and
+                not text.endswith("\n") and not el_text):
             text += "\n"

     retval = _rm_excessive_newlines(text)
     return _encode_utf8(retval)

+def html_to_text(string):
+    """
+    Dead-simple HTML-to-text converter:
+    >>> html_to_text("one<br>two<br>three")
+    >>> "one\ntwo\nthree"
+
+    NOTES:
+    1. the string is expected to contain UTF-8 encoded HTML!
+    2. returns utf-8 encoded str (not unicode)
+    3. if html can't be parsed returns None
+    """
+    if isinstance(string, six.text_type):
+        string = string.encode('utf8')
+
+    s = _prepend_utf8_declaration(string)
+    s = s.replace(b"\n", b"")
+
+    tree = html_fromstring(s)
+    if tree is None:
+        return None
+
+    return html_tree_to_text(tree)
+
+
+def html_fromstring(s):
+    """Parse html tree from string. Return None if the string can't be parsed.
+    """
+    if isinstance(s, six.text_type):
+        s = s.encode('utf8')
+    try:
+        if html_too_big(s):
+            return None
+
+        return html5parser.fromstring(s, parser=_html5lib_parser())
+    except Exception:
+        pass
+
+
+def html_document_fromstring(s):
+    """Parse html tree from string. Return None if the string can't be parsed.
+    """
+    if isinstance(s, six.text_type):
+        s = s.encode('utf8')
+    try:
+        if html_too_big(s):
+            return None
+
+        return html5parser.document_fromstring(s, parser=_html5lib_parser())
+    except Exception:
+        pass
+
+
+def cssselect(expr, tree):
+    return CSSSelector(expr)(tree)
+
+
+def html_too_big(s):
+    if isinstance(s, six.text_type):
+        s = s.encode('utf8')
+    return s.count(b'<') > _MAX_TAGS_COUNT
def _contains_charset_spec(s): def _contains_charset_spec(s):
"""Return True if the first 4KB contain charset spec """Return True if the first 4KB contain charset spec
""" """
return s.lower().find('html; charset=', 0, 4096) != -1 return s.lower().find(b'html; charset=', 0, 4096) != -1
def _prepend_utf8_declaration(s): def _prepend_utf8_declaration(s):
@@ -173,15 +233,32 @@ def _rm_excessive_newlines(s):
 def _encode_utf8(s):
     """Encode in 'utf-8' if unicode
     """
-    return s.encode('utf-8') if isinstance(s, unicode) else s
+    return s.encode('utf-8') if isinstance(s, six.text_type) else s

+def _html5lib_parser():
+    """
+    html5lib is a pure-python library that conforms to the WHATWG HTML spec
+    and is not vulnarable to certain attacks common for XML libraries
+    """
+    return html5lib.HTMLParser(
+        # build lxml tree
+        html5lib.treebuilders.getTreeBuilder("lxml"),
+        # remove namespace value from inside lxml.html.html5paser element tag
+        # otherwise it yields something like "{http://www.w3.org/1999/xhtml}div"
+        # instead of "div", throwing the algo off
+        namespaceHTMLElements=False
+    )

-_UTF8_DECLARATION = ('<meta http-equiv="Content-Type" content="text/html;'
-                     'charset=utf-8">')
-_BLOCKTAGS = ['div', 'p', 'ul', 'li', 'h1', 'h2', 'h3']
+_UTF8_DECLARATION = (b'<meta http-equiv="Content-Type" content="text/html;'
+                     b'charset=utf-8">')
+_BLOCKTAGS = ['div', 'p', 'ul', 'li', 'h1', 'h2', 'h3']
 _HARDBREAKS = ['br', 'hr', 'tr']

 _RE_EXCESSIVE_NEWLINES = re.compile("\n{2,10}")
+
+# an extensive research shows that exceeding this limit
+# might lead to excessive processing time
+_MAX_TAGS_COUNT = 419
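Taken together, the new helpers route all parsing through html5lib, building an lxml tree with namespaceHTMLElements=False (so element tags come out as plain "div" rather than "{http://www.w3.org/1999/xhtml}div"), and refuse any document whose '<' count exceeds _MAX_TAGS_COUNT. A condensed sketch of the same pipeline in isolation; the 419 cap is the value from the hunk above, and the calls are the documented html5lib / lxml.html.html5parser API surface:

    import html5lib
    from lxml.html import html5parser

    MAX_TAGS_COUNT = 419  # value taken from the diff above

    def parse_html(data):
        if isinstance(data, str):
            data = data.encode('utf8')
        if data.count(b'<') > MAX_TAGS_COUNT:  # bail out on huge documents
            return None
        parser = html5lib.HTMLParser(
            html5lib.treebuilders.getTreeBuilder("lxml"),  # build an lxml tree
            namespaceHTMLElements=False)                   # keep tags namespace-free
        try:
            return html5parser.fromstring(data, parser=parser)
        except Exception:
            return None

    print(parse_html("<div><span>Hi</span></div>") is not None)  # True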


@@ -1,5 +1,4 @@
-from nose.tools import *
-from mock import *
+from __future__ import absolute_import

 import talon


@@ -1,12 +1,12 @@
 # -*- coding: utf-8 -*-
-from . import *
-from . fixtures import *
-import regex as re
+from __future__ import absolute_import
+from tests.fixtures import REPLY_QUOTATIONS_SHARE_BLOCK, OLK_SRC_BODY_SECTION, REPLY_SEPARATED_BY_HR
+from nose.tools import eq_, ok_, assert_false, assert_true
 from talon import quotations, utils as u
+from mock import Mock, patch
+import re

 RE_WHITESPACE = re.compile("\s")
 RE_DOUBLE_WHITESPACE = re.compile("\s")
@@ -26,7 +26,7 @@ def test_quotation_splitter_inside_blockquote():
     </blockquote>"""

-    eq_("<html><body><p>Reply</p></body></html>",
+    eq_("<html><head></head><body>Reply</body></html>",
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -43,7 +43,7 @@ def test_quotation_splitter_outside_blockquote():
         </div>
     </blockquote>
     """
-    eq_("<html><body><p>Reply</p></body></html>",
+    eq_("<html><head></head><body>Reply</body></html>",
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -61,7 +61,7 @@ def test_regular_blockquote():
         </div>
     </blockquote>
     """
-    eq_("<html><body><p>Reply</p><blockquote>Regular</blockquote></body></html>",
+    eq_("<html><head></head><body>Reply<blockquote>Regular</blockquote></body></html>",
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -84,6 +84,7 @@ Reply
     reply = """
     <html>
+        <head></head>
         <body>
     Reply
@@ -127,7 +128,7 @@ def test_gmail_quote():
         </div>
     </div>
 </div>"""
-    eq_("<html><body><p>Reply</p></body></html>",
+    eq_("<html><head></head><body>Reply</body></html>",
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -138,7 +139,7 @@ def test_gmail_quote_compact():
                '<div>Test</div>' \
                '</div>' \
                '</div>'
-    eq_("<html><body><p>Reply</p></body></html>",
+    eq_("<html><head></head><body>Reply</body></html>",
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -165,7 +166,7 @@ def test_unicode_in_reply():
     Quote
     </blockquote>""".encode("utf-8")

-    eq_("<html><body><p>Reply&#160;&#160;Text<br></p><div><br></div>"
+    eq_("<html><head></head><body>Reply&#160;&#160;Text<br><div><br></div>"
         "</body></html>",
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -191,6 +192,7 @@ def test_blockquote_disclaimer():
     stripped_html = """
     <html>
+        <head></head>
         <body>
         <div>
         <div>
@@ -222,7 +224,7 @@ def test_date_block():
         </div>
     </div>
 """
-    eq_('<html><body><div>message<br></div></body></html>',
+    eq_('<html><head></head><body><div>message<br></div></body></html>',
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -239,7 +241,7 @@ Subject: You Have New Mail From Mary!<br><br>
     text
 </div></div>
 """
-    eq_('<html><body><div>message<br></div></body></html>',
+    eq_('<html><head></head><body><div>message<br></div></body></html>',
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -257,7 +259,7 @@ def test_reply_shares_div_with_from_block():
     </div>
 </body>'''
-    eq_('<html><body><div>Blah<br><br></div></body></html>',
+    eq_('<html><head></head><body><div>Blah<br><br></div></body></html>',
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
@@ -268,13 +270,13 @@ def test_reply_quotations_share_block():

 def test_OLK_SRC_BODY_SECTION_stripped():
-    eq_('<html><body><div>Reply</div></body></html>',
+    eq_('<html><head></head><body><div>Reply</div></body></html>',
         RE_WHITESPACE.sub(
             '', quotations.extract_from_html(OLK_SRC_BODY_SECTION)))

 def test_reply_separated_by_hr():
-    eq_('<html><body><div>Hi<div>there</div></div></body></html>',
+    eq_('<html><head></head><body><div>Hi<div>there</div></div></body></html>',
         RE_WHITESPACE.sub(
             '', quotations.extract_from_html(REPLY_SEPARATED_BY_HR)))
@@ -295,16 +297,22 @@ Reply
     </div>
 </div>
 '''
-    eq_('<html><body><p>Reply</p><div><hr></div></body></html>',
+    eq_('<html><head></head><body>Reply<div><hr></div></body></html>',
         RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))

 def extract_reply_and_check(filename):
-    f = open(filename)
+    import sys
+    kwargs = {}
+    if sys.version_info > (3, 0):
+        kwargs["encoding"] = "utf8"
+
+    f = open(filename, **kwargs)

     msg_body = f.read()
     reply = quotations.extract_from_html(msg_body)
     plain_reply = u.html_to_text(reply)
+    plain_reply = plain_reply.decode('utf8')

     eq_(RE_WHITESPACE.sub('', "Hi. I am fine.\n\nThanks,\nAlex"),
         RE_WHITESPACE.sub('', plain_reply))
@@ -354,7 +362,8 @@ def test_CRLF():
     assert_false(symbol in extracted)
     eq_('<html></html>', RE_WHITESPACE.sub('', extracted))

-    msg_body = """Reply
+    msg_body = """My
+reply

 <blockquote>
   <div>
@@ -368,9 +377,9 @@ def test_CRLF():
 </blockquote>"""
     msg_body = msg_body.replace('\n', '\r\n')
     extracted = quotations.extract_from_html(msg_body)
     assert_false(symbol in extracted)
-    eq_("<html><body><p>Reply</p></body></html>",
-        RE_WHITESPACE.sub('', extracted))
+    # Keep new lines otherwise "My reply" becomes one word - "Myreply"
+    eq_("<html><head></head><body>My\nreply\n</body></html>", extracted)

 def test_gmail_forwarded_msg():
@@ -378,3 +387,59 @@ def test_gmail_forwarded_msg():
     </div><br></div>"""
     extracted = quotations.extract_from_html(msg_body)
     eq_(RE_WHITESPACE.sub('', msg_body), RE_WHITESPACE.sub('', extracted))
+
+@patch.object(u, '_MAX_TAGS_COUNT', 4)
+def test_too_large_html():
+    msg_body = 'Reply' \
+        '<div class="gmail_quote">' \
+        '<div class="gmail_quote">On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:' \
+        '<div>Test</div>' \
+        '</div>' \
+        '</div>'
+
+    eq_(RE_WHITESPACE.sub('', msg_body),
+        RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
+
+def test_readable_html_empty():
+    msg_body = """
+<blockquote>
+  Reply
+  <div>
+    On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
+  </div>
+  <div>
+    Test
+  </div>
+</blockquote>"""
+
+    eq_(RE_WHITESPACE.sub('', msg_body),
+        RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
+
+@patch.object(quotations, 'html_document_fromstring', Mock(return_value=None))
+def test_bad_html():
+    bad_html = "<html></html>"
+    eq_(bad_html, quotations.extract_from_html(bad_html))
+
+def test_remove_namespaces():
+    msg_body = """
+    <html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40">
+        <body>
+            <o:p>Dear Sir,</o:p>
+            <o:p>Thank you for the email.</o:p>
+            <blockquote>thing</blockquote>
+        </body>
+    </html>
+    """
+    rendered = quotations.extract_from_html(msg_body)
+
+    assert_true("<p>" in rendered)
+    assert_true("xmlns" in rendered)
+
+    assert_true("<o:p>" not in rendered)
+    assert_true("<xmlns:o>" not in rendered)
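The expected strings in these tests change shape for one reason: html5lib always materialises a full document, so a <head></head> element appears and bare body text is no longer wrapped in <p>. A quick round trip mirroring test_gmail_quote_compact above (output shown modulo whitespace):

    from talon import quotations

    msg_body = ('Reply'
                '<div class="gmail_quote">'
                '<div class="gmail_quote">On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:'
                '<div>Test</div>'
                '</div>'
                '</div>')
    print(quotations.extract_from_html(msg_body))
    # per the test above: <html><head></head><body>Reply</body></html>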


@@ -1,9 +1,10 @@
 # -*- coding: utf-8 -*-
-from . import *
-from . fixtures import *
+from __future__ import absolute_import
+from mock import Mock, patch

 from talon import quotations
+from nose.tools import eq_

 @patch.object(quotations, 'extract_from_html')


@@ -1,8 +1,10 @@
 # -*- coding: utf-8 -*-
-from .. import *
+from __future__ import absolute_import
+from nose.tools import eq_
 from talon.signature import bruteforce
+from mock import patch, Mock

 def test_empty_body():


@@ -1,13 +1,15 @@
 # -*- coding: utf-8 -*-
-from .. import *
-import os
-from talon.signature.learning import dataset
-from talon import signature
+from __future__ import absolute_import
+from talon.signature import bruteforce, extraction, extract
 from talon.signature import extraction as e
-from talon.signature import bruteforce
+from talon.signature.learning import dataset
+from nose.tools import eq_
+from .. import STRIPPED, UNICODE_MSG
+from six.moves import range
+from mock import patch
+import os

 def test_message_shorter_SIGNATURE_MAX_LINES():
@@ -16,23 +18,28 @@ def test_message_shorter_SIGNATURE_MAX_LINES():
 Thanks in advance,
 Bob"""
-    text, extracted_signature = signature.extract(body, sender)
+    text, extracted_signature = extract(body, sender)
     eq_('\n'.join(body.splitlines()[:2]), text)
     eq_('\n'.join(body.splitlines()[-2:]), extracted_signature)

 def test_messages_longer_SIGNATURE_MAX_LINES():
+    import sys
+    kwargs = {}
+    if sys.version_info > (3, 0):
+        kwargs["encoding"] = "utf8"
+
     for filename in os.listdir(STRIPPED):
         filename = os.path.join(STRIPPED, filename)
         if not filename.endswith('_body'):
             continue
         sender, body = dataset.parse_msg_sender(filename)
-        text, extracted_signature = signature.extract(body, sender)
+        text, extracted_signature = extract(body, sender)
         extracted_signature = extracted_signature or ''
-        with open(filename[:-len('body')] + 'signature') as ms:
+        with open(filename[:-len('body')] + 'signature', **kwargs) as ms:
             msg_signature = ms.read()
             eq_(msg_signature.strip(), extracted_signature.strip())
-            stripped_msg = body.strip()[:len(body.strip())-len(msg_signature)]
+            stripped_msg = body.strip()[:len(body.strip()) - len(msg_signature)]
             eq_(stripped_msg.strip(), text.strip())
@@ -45,7 +52,7 @@ Thanks in advance,
 some text which doesn't seem to be a signature at all
 Bob"""
-    text, extracted_signature = signature.extract(body, sender)
+    text, extracted_signature = extract(body, sender)
     eq_('\n'.join(body.splitlines()[:2]), text)
     eq_('\n'.join(body.splitlines()[-3:]), extracted_signature)
@@ -58,7 +65,7 @@ Thanks in advance,
 some long text here which doesn't seem to be a signature at all
 Bob"""
-    text, extracted_signature = signature.extract(body, sender)
+    text, extracted_signature = extract(body, sender)
     eq_('\n'.join(body.splitlines()[:-1]), text)
     eq_('Bob', extracted_signature)
@@ -66,13 +73,38 @@ Bob"""
 some *long* text here which doesn't seem to be a signature at all
 """
-    ((body, None), signature.extract(body, "david@example.com"))
+    ((body, None), extract(body, "david@example.com"))

 def test_basic():
     msg_body = 'Blah\r\n--\r\n\r\nSergey Obukhov'
     eq_(('Blah', '--\r\n\r\nSergey Obukhov'),
-        signature.extract(msg_body, 'Sergey'))
+        extract(msg_body, 'Sergey'))

+def test_capitalized():
+    msg_body = """Hi Mary,
+
+Do you still need a DJ for your wedding? I've included a video demo of one of our DJs available for your wedding date.
+
+DJ Doe
+http://example.com
+Password: SUPERPASSWORD
+Would you like to check out more?
+
+At your service,
+
+John Smith
+Doe Inc
+555-531-7967"""
+
+    sig = """John Smith
+Doe Inc
+555-531-7967"""
+    eq_(sig, extract(msg_body, 'Doe')[1])

 def test_over_2_text_lines_after_signature():
@@ -83,25 +115,25 @@ def test_over_2_text_lines_after_signature():
     2 non signature lines in the end
     It's not signature
     """
-    text, extracted_signature = signature.extract(body, "Bob")
+    text, extracted_signature = extract(body, "Bob")
     eq_(extracted_signature, None)

 def test_no_signature():
     sender, body = "bob@foo.bar", "Hello"
-    eq_((body, None), signature.extract(body, sender))
+    eq_((body, None), extract(body, sender))

 def test_handles_unicode():
     sender, body = dataset.parse_msg_sender(UNICODE_MSG)
-    text, extracted_signature = signature.extract(body, sender)
+    text, extracted_signature = extract(body, sender)

-@patch.object(signature.extraction, 'has_signature')
+@patch.object(extraction, 'has_signature')
 def test_signature_extract_crash(has_signature):
     has_signature.side_effect = Exception('Bam!')
     msg_body = u'Blah\r\n--\r\n\r\nСергей'
-    eq_((msg_body, None), signature.extract(msg_body, 'Сергей'))
+    eq_((msg_body, None), extract(msg_body, 'Сергей'))

 def test_mark_lines():
@@ -110,37 +142,37 @@ def test_mark_lines():
     # (starting from the bottom) because we don't count empty line
     eq_('ttset',
         e._mark_lines(['Bob Smith',
                        'Bob Smith',
                        'Bob Smith',
                        '',
                        'some text'], 'Bob Smith'))

     with patch.object(bruteforce, 'SIGNATURE_MAX_LINES', 3):
         # we don't analyse the 1st line because
         # signature cant start from the 1st line
         eq_('tset',
             e._mark_lines(['Bob Smith',
                            'Bob Smith',
                            '',
                            'some text'], 'Bob Smith'))

 def test_process_marked_lines():
     # no signature found
-    eq_((range(5), None), e._process_marked_lines(range(5), 'telt'))
+    eq_((list(range(5)), None), e._process_marked_lines(list(range(5)), 'telt'))

     # signature in the middle of the text
-    eq_((range(9), None), e._process_marked_lines(range(9), 'tesestelt'))
+    eq_((list(range(9)), None), e._process_marked_lines(list(range(9)), 'tesestelt'))

     # long line splits signature
-    eq_((range(7), [7, 8]),
-        e._process_marked_lines(range(9), 'tsslsless'))
+    eq_((list(range(7)), [7, 8]),
+        e._process_marked_lines(list(range(9)), 'tsslsless'))

-    eq_((range(20), [20]),
-        e._process_marked_lines(range(21), 'ttttttstttesllelelets'))
+    eq_((list(range(20)), [20]),
+        e._process_marked_lines(list(range(21)), 'ttttttstttesllelelets'))

     # some signature lines could be identified as text
-    eq_(([0], range(1, 9)), e._process_marked_lines(range(9), 'tsetetest'))
+    eq_(([0], list(range(1, 9))), e._process_marked_lines(list(range(9)), 'tsetetest'))

-    eq_(([], range(5)),
-        e._process_marked_lines(range(5), "ststt"))
+    eq_(([], list(range(5))),
+        e._process_marked_lines(list(range(5)), "ststt"))
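The mechanical list(range(...)) rewrites above are not cosmetic: under Python 3, range() returns a lazy sequence object that never compares equal to a list, so the old equality assertions would fail. A two-line illustration (Python 3 semantics):

    # Python 3: a range object is lazy and never equals a list,
    # hence the list(range(...)) wrapping in the tests above.
    assert list(range(5)) == [0, 1, 2, 3, 4]
    assert range(5) != [0, 1, 2, 3, 4]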


@@ -1,13 +1,13 @@
 # -*- coding: utf-8 -*-
-from ... import *
-import os
-from numpy import genfromtxt
-from talon.signature.learning import dataset as d
+from __future__ import absolute_import
+from ... import EML_MSG_FILENAME, MSG_FILENAME_WITH_BODY_SUFFIX, TMP_DIR, EMAILS_DIR
 from talon.signature.learning.featurespace import features
+from talon.signature.learning import dataset as d
+from nose.tools import eq_, assert_false, ok_
+from numpy import genfromtxt
+import os

 def test_is_sender_filename():


@@ -1,8 +1,10 @@
 # -*- coding: utf-8 -*-
-from ... import *
+from __future__ import absolute_import
 from talon.signature.learning import featurespace as fs
+from nose.tools import eq_, assert_false, ok_
+from mock import patch

 def test_apply_features():


@@ -1,11 +1,13 @@
 # -*- coding: utf-8 -*-
-from ... import *
-import regex as re
+from __future__ import absolute_import
 from talon.signature.learning import helpers as h
-from talon.signature.learning.helpers import *
+from talon.signature.learning.helpers import RE_RELAX_PHONE, RE_NAME
+from nose.tools import eq_, ok_, assert_false, assert_in
+from mock import patch, Mock
+from six.moves import range
+import re

 # First testing regex constants.
 VALID = '''
@@ -154,7 +156,7 @@ def test_extract_names():
     # check that extracted names could be compiled
     try:
         re.compile("|".join(extracted_names))
-    except Exception, e:
+    except Exception as e:
         ok_(False, ("Failed to compile extracted names {}"
                     "\n\nReason: {}").format(extracted_names, e))
     if expected_names:
@@ -190,10 +192,11 @@ def test_punctuation_percent(categories_percent):
 def test_capitalized_words_percent():
     eq_(0.0, h.capitalized_words_percent(''))
     eq_(100.0, h.capitalized_words_percent('Example Corp'))
-    eq_(50.0, h.capitalized_words_percent('Qqq qqq QQQ 123 sss'))
+    eq_(50.0, h.capitalized_words_percent('Qqq qqq Aqs 123 sss'))
     eq_(100.0, h.capitalized_words_percent('Cell 713-444-7368'))
     eq_(100.0, h.capitalized_words_percent('8th Floor'))
     eq_(0.0, h.capitalized_words_percent('(212) 230-9276'))
+    eq_(50.0, h.capitalized_words_percent('Password: REMARKABLE'))

 def test_has_signature():
@@ -204,7 +207,7 @@ def test_has_signature():
                                  'sender@example.com'))
     assert_false(h.has_signature('http://www.example.com/555-555-5555',
                                  'sender@example.com'))
-    long_line = ''.join(['q' for e in xrange(28)])
+    long_line = ''.join(['q' for e in range(28)])
     assert_false(h.has_signature(long_line + ' sender', 'sender@example.com'))
     # wont crash on an empty string
     assert_false(h.has_signature('', ''))
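The xrange rewrite is the same Python 3 story: xrange no longer exists there, and plain range is already lazy. Elsewhere in these diffs the tests import six.moves.range, which supplies the lazy variant on Python 2 as well:

    from six.moves import range  # lazy range on both Python 2 and 3

    long_line = ''.join(['q' for _ in range(28)])
    print(len(long_line))  # 28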


@@ -1,12 +1,15 @@
 # -*- coding: utf-8 -*-
-from . import *
-from . fixtures import *
-import os
-import email.iterators
+from __future__ import absolute_import
+from tests.fixtures import STANDARD_REPLIES
 from talon import quotations
+from six.moves import range
+from nose.tools import eq_
+from mock import patch
+import email.iterators
+import six
+import os

 @patch.object(quotations, 'MAX_LINES_COUNT', 1)
@@ -32,6 +35,20 @@ On 11-Apr-2011, at 6:54 PM, Roman Tkachenko <romant@example.com> wrote:
     eq_("Test reply", quotations.extract_from_plain(msg_body))

+def test_pattern_on_date_polymail():
+    msg_body = """Test reply
+
+On Tue, Apr 11, 2017 at 10:07 PM John Smith
+
+<
+mailto:John Smith <johnsmith@gmail.com>
+> wrote:
+Test quoted data
+"""
+    eq_("Test reply", quotations.extract_from_plain(msg_body))
+
 def test_pattern_sent_from_samsung_smb_wrote():
     msg_body = """Test reply
@@ -50,7 +67,7 @@ def test_pattern_on_date_wrote_somebody():
     """Lorem
 Op 13-02-2014 3:18 schreef Julius Caesar <pantheon@rome.com>:
 Veniam laborum mlkshk kale chips authentic. Normcore mumblecore laboris, fanny pack readymade eu blog chia pop-up freegan enim master cleanse.
 """))
@@ -102,6 +119,38 @@ On 11-Apr-2011, at 6:54 PM, Roman Tkachenko <romant@example.com> sent:
     eq_("Test reply", quotations.extract_from_plain(msg_body))

+def test_appointment():
+    msg_body = """Response
+
+10/19/2017 @ 9:30 am for physical therapy
+Bla
+1517 4th Avenue Ste 300
+London CA 19129, 555-421-6780
+
+John Doe, FCLS
+Mailgun Inc
+555-941-0697
+
+From: from@example.com [mailto:from@example.com]
+Sent: Wednesday, October 18, 2017 2:05 PM
+To: John Doer - SIU <jd@example.com>
+Subject: RE: Claim # 5551188-1
+
+Text"""
+
+    expected = """Response
+
+10/19/2017 @ 9:30 am for physical therapy
+Bla
+1517 4th Avenue Ste 300
+London CA 19129, 555-421-6780
+
+John Doe, FCLS
+Mailgun Inc
+555-941-0697"""
+    eq_(expected, quotations.extract_from_plain(msg_body))
+
 def test_line_starts_with_on():
     msg_body = """Blah-blah-blah
 On blah-blah-blah"""
@@ -138,16 +187,20 @@ def _check_pattern_original_message(original_message_indicator):
 -----{}-----

 Test"""
-    eq_('Test reply', quotations.extract_from_plain(msg_body.format(unicode(original_message_indicator))))
+    eq_('Test reply', quotations.extract_from_plain(
+        msg_body.format(six.text_type(original_message_indicator))))

 def test_english_original_message():
     _check_pattern_original_message('Original Message')
     _check_pattern_original_message('Reply Message')

 def test_german_original_message():
     _check_pattern_original_message(u'Ursprüngliche Nachricht')
     _check_pattern_original_message('Antwort Nachricht')

 def test_danish_original_message():
     _check_pattern_original_message('Oprindelig meddelelse')
@@ -161,6 +214,17 @@ Test reply"""
     eq_("Test reply", quotations.extract_from_plain(msg_body))

+def test_android_wrote():
+    msg_body = """Test reply
+
+---- John Smith wrote ----
+
+> quoted
+> text
+"""
+    eq_("Test reply", quotations.extract_from_plain(msg_body))
+
 def test_reply_wraps_quotations():
     msg_body = """Test reply
@@ -235,14 +299,16 @@ On 04/19/2011 07:10 AM, Roman Tkachenko wrote:
 > Hello"""
     eq_("Hi", quotations.extract_from_plain(msg_body))

 def test_with_indent():
     msg_body = """YOLO salvia cillum kogi typewriter mumblecore cardigan skateboard Austin.

------On 12/29/1987 17:32 PM, Julius Caesar wrote-----

Brunch mumblecore pug Marfa tofu, irure taxidermy hoodie readymade pariatur.
"""
-    eq_("YOLO salvia cillum kogi typewriter mumblecore cardigan skateboard Austin.", quotations.extract_from_plain(msg_body))
+    eq_("YOLO salvia cillum kogi typewriter mumblecore cardigan skateboard Austin.",
+        quotations.extract_from_plain(msg_body))

 def test_short_quotation_with_newline():
@@ -282,6 +348,7 @@ Subject: The manager has commented on your Loop
 Blah-blah-blah
 """))
+
 def test_german_from_block():
     eq_('Allo! Follow up MIME!', quotations.extract_from_plain(
         """Allo! Follow up MIME!
@@ -294,6 +361,7 @@ Betreff: The manager has commented on your Loop
 Blah-blah-blah
 """))
+
 def test_french_multiline_from_block():
     eq_('Lorem ipsum', quotations.extract_from_plain(
         u"""Lorem ipsum
@@ -306,6 +374,7 @@ Objet : Follow Up
 Blah-blah-blah
 """))
+
 def test_french_from_block():
     eq_('Lorem ipsum', quotations.extract_from_plain(
         u"""Lorem ipsum
@@ -314,6 +383,7 @@ Le 23 janv. 2015 à 22:03, Brendan xxx <brendan.xxx@xxx.com<mailto:brendan.xxx@x
 Bonjour!"""))
+
 def test_polish_from_block():
     eq_('Lorem ipsum', quotations.extract_from_plain(
         u"""Lorem ipsum
@@ -324,6 +394,7 @@ napisał:
 Blah!
 """))
+
 def test_danish_from_block():
     eq_('Allo! Follow up MIME!', quotations.extract_from_plain(
         """Allo! Follow up MIME!
@@ -336,6 +407,7 @@ Emne: The manager has commented on your Loop
 Blah-blah-blah
 """))
+
 def test_swedish_from_block():
     eq_('Allo! Follow up MIME!', quotations.extract_from_plain(
         u"""Allo! Follow up MIME!
@@ -347,6 +419,7 @@ Till: Isacson Leiff
 Blah-blah-blah
 """))
+
 def test_swedish_from_line():
     eq_('Lorem', quotations.extract_from_plain(
         """Lorem
@@ -355,6 +428,7 @@ Den 14 september, 2015 02:23:18, Valentino Rudy (valentino@rudy.be) skrev:
 Veniam laborum mlkshk kale chips authentic. Normcore mumblecore laboris, fanny pack readymade eu blog chia pop-up freegan enim master cleanse.
 """))
+
 def test_norwegian_from_line():
     eq_('Lorem', quotations.extract_from_plain(
         u"""Lorem
@@ -363,13 +437,24 @@ På 14 september 2015 på 02:23:18, Valentino Rudy (valentino@rudy.be) skrev:
 Veniam laborum mlkshk kale chips authentic. Normcore mumblecore laboris, fanny pack readymade eu blog chia pop-up freegan enim master cleanse.
 """))
+
 def test_dutch_from_block():
     eq_('Gluten-free culpa lo-fi et nesciunt nostrud.', quotations.extract_from_plain(
         """Gluten-free culpa lo-fi et nesciunt nostrud.

Op 17-feb.-2015, om 13:18 heeft Julius Caesar <pantheon@rome.com> het volgende geschreven:

Small batch beard laboris tempor, non listicle hella Tumblr heirloom.
+"""))
+
+def test_vietnamese_from_block():
+    eq_('Hello', quotations.extract_from_plain(
+        u"""Hello
+
+Vào 14:24 8 tháng 6, 2017, Hùng Nguyễn <hungnguyen@xxx.com> đã viết:
+
+> Xin chào
 """))
@@ -384,7 +469,8 @@ def test_link_closed_with_quotation_marker_on_new_line():
     msg_body = '''8.45am-1pm

 From: somebody@example.com
+Date: Wed, 16 May 2012 00:15:02 -0600

 <http://email.example.com/c/dHJhY2tpbmdfY29kZT1mMDdjYzBmNzM1ZjYzMGIxNT
 > <bob@example.com <mailto:bob@example.com> >
@@ -425,7 +511,9 @@ def test_from_block_starts_with_date():
     msg_body = """Blah

 Date: Wed, 16 May 2012 00:15:02 -0600
-To: klizhentas@example.com"""
+To: klizhentas@example.com
+
+"""
     eq_('Blah', quotations.extract_from_plain(msg_body))
@@ -495,11 +583,12 @@ def test_mark_message_lines():
         # next line should be marked as splitter
         '_____________',
         'From: foo@bar.com',
+        'Date: Wed, 16 May 2012 00:15:02 -0600',
         '',
         '> Hi',
         '',
         'Signature']
-    eq_('tessemet', quotations.mark_message_lines(lines))
+    eq_('tesssemet', quotations.mark_message_lines(lines))

     lines = ['Just testing the email reply',
              '',
@@ -662,6 +751,15 @@ def test_preprocess_postprocess_2_links():
     eq_(msg_body, quotations.extract_from_plain(msg_body))

+def body_iterator(msg, decode=False):
+    for subpart in msg.walk():
+        payload = subpart.get_payload(decode=decode)
+        if isinstance(payload, six.text_type):
+            yield payload
+        else:
+            yield payload.decode('utf8')
+
 def test_standard_replies():
     for filename in os.listdir(STANDARD_REPLIES):
         filename = os.path.join(STANDARD_REPLIES, filename)
@@ -669,8 +767,8 @@ def test_standard_replies():
             continue
         with open(filename) as f:
             message = email.message_from_file(f)
-        body = email.iterators.typed_subpart_iterator(message, subtype='plain').next()
-        text = ''.join(email.iterators.body_line_iterator(body, True))
+        body = next(email.iterators.typed_subpart_iterator(message, subtype='plain'))
+        text = ''.join(body_iterator(body, True))

         stripped_text = quotations.extract_from_plain(text)
         reply_text_fn = filename[:-4] + '_reply_text'
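Both the 'tesssemet' expectation above and the marker string in test_split_email below read as one letter per line, following the library's line-marking convention: 't' for text, 'e' for an empty line, 's' for a splitter line (each line of a From:/Date: header block gets its own 's'), and 'm' for a line starting with a quotation marker. A small illustration; the first two lines of the test's list fall outside the hunk shown above, so 'Hello' here is a hypothetical stand-in:

    from talon import quotations

    lines = ['Hello',                                   # t
             '',                                        # e
             '_____________',                           # s
             'From: foo@bar.com',                       # s
             'Date: Wed, 16 May 2012 00:15:02 -0600',   # s
             '',                                        # e
             '> Hi',                                    # m
             '',                                        # e
             'Signature']                               # t
    print(quotations.mark_message_lines(lines))  # 'tesssemet'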
@@ -683,3 +781,77 @@ def test_standard_replies():
             "'%(reply)s' != %(stripped)s for %(fn)s" % \
             {'reply': reply_text, 'stripped': stripped_text,
              'fn': filename}
+
+def test_split_email():
+    msg = """From: Mr. X
+Date: 24 February 2016
+To: Mr. Y
+Subject: Hi
+Attachments: none
+Goodbye.
+From: Mr. Y
+To: Mr. X
+Date: 24 February 2016
+Subject: Hi
+Attachments: none
+
+Hello.
+
+On 24th February 2016 at 09.32am, Conal wrote:
+
+Hey!
+
+On Mon, 2016-10-03 at 09:45 -0600, Stangel, Dan wrote:
+> Mohan,
+>
+> We have not yet migrated the systems.
+>
+> Dan
+>
+> > -----Original Message-----
+> > Date: Mon, 2 Apr 2012 17:44:22 +0400
+> > Subject: Test
+> > From: bob@xxx.mailgun.org
+> > To: xxx@gmail.com; xxx@hotmail.com; xxx@yahoo.com; xxx@aol.com; xxx@comcast.net; xxx@nyc.rr.com
+> >
+> > Hi
+> >
+> > > From: bob@xxx.mailgun.org
+> > > To: xxx@gmail.com; xxx@hotmail.com; xxx@yahoo.com; xxx@aol.com; xxx@comcast.net; xxx@nyc.rr.com
+> > > Date: Mon, 2 Apr 2012 17:44:22 +0400
+> > > Subject: Test
+> > > Hi
+> > >
+> >
+>
+>
+"""
+    expected_markers = "stttttsttttetesetesmmmmmmsmmmmmmmmmmmmmmmm"
+    markers = quotations.split_emails(msg)
+    eq_(markers, expected_markers)
+
+def test_feedback_below_left_unparsed():
+    msg_body = """Please enter your feedback below. Thank you.
+
+------------------------------------- Enter Feedback Below -------------------------------------
+
+The user experience was unparallelled. Please continue production. I'm sending payment to ensure
+that this line is intact."""
+
+    parsed = quotations.extract_from_plain(msg_body)
+    eq_(msg_body, parsed.decode('utf8'))
+
+def test_appointment():
+    msg_body = """Invitation for an interview:
+
+Date: Wednesday 3, October 2011
+Time: 7 : 00am
+Address: 130 Fox St
+
+Please bring in your ID."""
+
+    parsed = quotations.extract_from_plain(msg_body)
+    eq_(msg_body, parsed.decode('utf8'))
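The body_iterator helper introduced in this file replaces email.iterators.body_line_iterator because on Python 3 get_payload(decode=True) returns bytes; the helper walks every subpart and always yields text. Minimal usage against a trivial message:

    import email
    import six

    # same helper as in the hunk above
    def body_iterator(msg, decode=False):
        for subpart in msg.walk():
            payload = subpart.get_payload(decode=decode)
            if isinstance(payload, six.text_type):
                yield payload
            else:
                yield payload.decode('utf8')

    msg = email.message_from_string('Subject: hi\n\nHello there.\n')
    print(list(body_iterator(msg, decode=True)))  # ['Hello there.\n']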


@@ -1,9 +1,13 @@
 # coding:utf-8
-from . import *
+from __future__ import absolute_import
+from nose.tools import eq_, ok_, assert_false
 from talon import utils as u
+from mock import patch, Mock
 import cchardet
+import six

 def test_get_delimiter():
@@ -13,50 +17,54 @@ def test_get_delimiter():

 def test_unicode():
-    eq_ (u'hi', u.to_unicode('hi'))
-    eq_ (type(u.to_unicode('hi')), unicode )
-    eq_ (type(u.to_unicode(u'hi')), unicode )
-    eq_ (type(u.to_unicode('привет')), unicode )
-    eq_ (type(u.to_unicode(u'привет')), unicode )
-    eq_ (u"привет", u.to_unicode('привет'))
-    eq_ (u"привет", u.to_unicode(u'привет'))
+    eq_(u'hi', u.to_unicode('hi'))
+    eq_(type(u.to_unicode('hi')), six.text_type)
+    eq_(type(u.to_unicode(u'hi')), six.text_type)
+    eq_(type(u.to_unicode('привет')), six.text_type)
+    eq_(type(u.to_unicode(u'привет')), six.text_type)
+    eq_(u"привет", u.to_unicode('привет'))
+    eq_(u"привет", u.to_unicode(u'привет'))
     # some latin1 stuff
-    eq_ (u"Versión", u.to_unicode('Versi\xf3n', precise=True))
+    eq_(u"Versión", u.to_unicode(u'Versi\xf3n'.encode('iso-8859-2'), precise=True))

 def test_detect_encoding():
-    eq_ ('ascii', u.detect_encoding('qwe').lower())
-    eq_ ('iso-8859-2', u.detect_encoding('Versi\xf3n').lower())
-    eq_ ('utf-8', u.detect_encoding('привет').lower())
+    eq_('ascii', u.detect_encoding(b'qwe').lower())
+    ok_(u.detect_encoding(
+        u'Versi\xf3n'.encode('iso-8859-2')).lower() in [
+            'iso-8859-1', 'iso-8859-2'])
+    eq_('utf-8', u.detect_encoding(u'привет'.encode('utf8')).lower())
     # fallback to utf-8
     with patch.object(u.chardet, 'detect') as detect:
         detect.side_effect = Exception
-        eq_ ('utf-8', u.detect_encoding('qwe').lower())
+        eq_('utf-8', u.detect_encoding('qwe'.encode('utf8')).lower())

 def test_quick_detect_encoding():
-    eq_ ('ascii', u.quick_detect_encoding('qwe').lower())
-    eq_ ('windows-1252', u.quick_detect_encoding('Versi\xf3n').lower())
-    eq_ ('utf-8', u.quick_detect_encoding('привет').lower())
+    eq_('ascii', u.quick_detect_encoding(b'qwe').lower())
+    ok_(u.quick_detect_encoding(
+        u'Versi\xf3n'.encode('windows-1252')).lower() in [
+            'windows-1252', 'windows-1250'])
+    eq_('utf-8', u.quick_detect_encoding(u'привет'.encode('utf8')).lower())

 @patch.object(cchardet, 'detect')
 @patch.object(u, 'detect_encoding')
 def test_quick_detect_encoding_edge_cases(detect_encoding, cchardet_detect):
     cchardet_detect.return_value = {'encoding': 'ascii'}
-    eq_('ascii', u.quick_detect_encoding("qwe"))
-    cchardet_detect.assert_called_once_with("qwe")
+    eq_('ascii', u.quick_detect_encoding(b"qwe"))
+    cchardet_detect.assert_called_once_with(b"qwe")

     # fallback to detect_encoding
     cchardet_detect.return_value = {}
     detect_encoding.return_value = 'utf-8'
-    eq_('utf-8', u.quick_detect_encoding("qwe"))
+    eq_('utf-8', u.quick_detect_encoding(b"qwe"))

     # exception
     detect_encoding.reset_mock()
     cchardet_detect.side_effect = Exception()
     detect_encoding.return_value = 'utf-8'
-    eq_('utf-8', u.quick_detect_encoding("qwe"))
+    eq_('utf-8', u.quick_detect_encoding(b"qwe"))
     ok_(detect_encoding.called)
@@ -73,11 +81,11 @@ Haha
 </p>
 </body>"""
     text = u.html_to_text(html)
-    eq_("Hello world! \n\n * One! \n * Two \nHaha", text)
-    eq_("привет!", u.html_to_text("<b>привет!</b>"))
+    eq_(b"Hello world! \n\n * One! \n * Two \nHaha", text)
+    eq_(u"привет!", u.html_to_text("<b>привет!</b>").decode('utf8'))

     html = '<body><br/><br/>Hi</body>'
-    eq_ ('Hi', u.html_to_text(html))
+    eq_(b'Hi', u.html_to_text(html))

     html = """Hi
 <style type="text/css">
@@ -97,11 +105,60 @@ font: 13px 'Lucida Grande', Arial, sans-serif;
 }
 </style>"""
-    eq_ ('Hi', u.html_to_text(html))
+    eq_(b'Hi', u.html_to_text(html))

     html = """<div>
 <!-- COMMENT 1 -->
 <span>TEXT 1</span>
 <p>TEXT 2 <!-- COMMENT 2 --></p>
 </div>"""
-    eq_('TEXT 1 \nTEXT 2', u.html_to_text(html))
+    eq_(b'TEXT 1 \nTEXT 2', u.html_to_text(html))
+
+def test_comment_no_parent():
+    s = b'<!-- COMMENT 1 --> no comment'
+    d = u.html_document_fromstring(s)
+    eq_(b"no comment", u.html_tree_to_text(d))
+
+@patch.object(u.html5parser, 'fromstring', Mock(side_effect=Exception()))
+def test_html_fromstring_exception():
+    eq_(None, u.html_fromstring("<html></html>"))
+
+@patch.object(u, 'html_too_big', Mock())
+@patch.object(u.html5parser, 'fromstring')
+def test_html_fromstring_too_big(fromstring):
+    eq_(None, u.html_fromstring("<html></html>"))
+    assert_false(fromstring.called)
+
+@patch.object(u.html5parser, 'document_fromstring')
+def test_html_document_fromstring_exception(document_fromstring):
+    document_fromstring.side_effect = Exception()
+    eq_(None, u.html_document_fromstring("<html></html>"))
+
+@patch.object(u, 'html_too_big', Mock())
+@patch.object(u.html5parser, 'document_fromstring')
+def test_html_document_fromstring_too_big(document_fromstring):
+    eq_(None, u.html_document_fromstring("<html></html>"))
+    assert_false(document_fromstring.called)
+
+@patch.object(u, 'html_fromstring', Mock(return_value=None))
+def test_bad_html_to_text():
+    bad_html = "one<br>two<br>three"
+    eq_(None, u.html_to_text(bad_html))
+
+@patch.object(u, '_MAX_TAGS_COUNT', 3)
+def test_html_too_big():
+    eq_(False, u.html_too_big("<div></div>"))
+    eq_(True, u.html_too_big("<div><span>Hi</span></div>"))
+
+@patch.object(u, '_MAX_TAGS_COUNT', 3)
+def test_html_to_text():
+    eq_(b"Hello", u.html_to_text("<div>Hello</div>"))
+    eq_(None, u.html_to_text("<div><span>Hi</span></div>"))
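The size guard these tests exercise is deliberately crude: html_too_big() just counts '<' characters against _MAX_TAGS_COUNT, and html_to_text() returns None rather than parse a document over the cap. Patching the cap, as the tests above do, makes the behaviour easy to see:

    from mock import patch
    from talon import utils as u

    print(u.html_too_big("<div></div>"))  # False: 2 '<' vs the default cap of 419

    with patch.object(u, '_MAX_TAGS_COUNT', 3):
        print(u.html_too_big("<div><span>Hi</span></div>"))  # True: 4 '<' > 3
        print(u.html_to_text("<div><span>Hi</span></div>"))  # None: parsing refused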


@@ -1,3 +1,4 @@
+from __future__ import absolute_import
 from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
 from talon.signature.learning.classifier import train, init
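This last file ties the trainer together: with these imports, refitting the bundled signature classifier reduces to one call. A sketch of the expected usage; the exact train(classifier, data_filename, output_filename) signature is an assumption based on the library's retraining flow, grounded only in the imports shown here:

    from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
    from talon.signature.learning.classifier import train, init

    def retrain():
        # init() is assumed to build a fresh classifier; train() is assumed to
        # fit it on the bundled feature vectors (EXTRACTOR_DATA) and persist
        # the result to EXTRACTOR_FILENAME
        train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)

    if __name__ == '__main__':
        retrain()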