65 Commits

Author SHA1 Message Date
Maxim Vladimirskiy
1a5548f171 Merge pull request #222 from mailgun/maxim/develop
PIP-1409: Remove version pins from setup.py [python3]
2021-11-11 16:29:30 +03:00
Maxim Vladimirskiy
53c49b9121 Remove version pins from setup.py 2021-11-11 15:36:50 +03:00
Matt Dietz
bd50872043 Merge pull request #217 from mailgun/dietz/REP-1030
Drops Python 2 support [python3]
2021-06-15 09:46:29 -05:00
Matt Dietz
d37c4fd551 Drops Python 2 support
REP-1030

In addition to some python 2 => 3 fixes, this change bumps the scikit-learn
version to latest. The previously pinned version of scikit-learn failed to
compile its necessary C modules under python 3.7+ because its included header
files weren't compatible with the C API implemented in python 3.7+.

Given the restrictive compatibility range supported by scikit-learn, it seemed
prudent to drop python 2 support altogether. Otherwise, we'd be stuck with
python 3.4 as the newest version we could support.

With this change, tests are currently passing under 3.9.2.

Lastly, imports the original training data. At some point, a new version
of the training data was committed to the repo but no classifier was
trained from it. Using a classifier trained from this new data resulted
in most of the tests failing.
2021-06-10 14:03:25 -05:00
Sergey Obukhov
d9ed7cc6d1 Merge pull request #190 from yoks/master
Add __init__.py into data folder, add data files into MANIFEST.in
2019-07-02 18:56:47 +03:00
Sergey Obukhov
0a0808c0a8 Merge branch 'master' into master 2019-07-01 20:48:46 +03:00
Sergey Obukhov
16354e3528 Merge pull request #191 from mailgun/thrawn/develop
PIP-423: Now removing namespaces from parsed HTML
2019-05-12 11:54:17 +03:00
Derrick J. Wippler
1018e88ec1 Now removing namespaces from parsed HTML 2019-05-10 11:16:12 -05:00
Ivan Anisimov
2916351517 Update setup.py 2019-03-16 22:17:26 +03:00
Ivan Anisimov
46d4b02c81 Update setup.py 2019-03-16 22:15:43 +03:00
Ivan Anisimov
58eac88a10 Update MANIFEST.in 2019-03-16 22:03:40 +03:00
Ivan Anisimov
2ef3d8dfbe Update MANIFEST.in 2019-03-16 22:01:00 +03:00
Ivan Anisimov
7cf4c29340 Create __init__.py 2019-03-16 21:54:09 +03:00
Sergey Obukhov
cdd84563dd Merge pull request #183 from mailgun/sergey/date
fix text with Date: misclassified as quotations splitter
2019-01-18 17:32:10 +03:00
Sergey Obukhov
8138ea9a60 fix text with Date: misclassified as quotations splitter 2019-01-18 16:49:39 +03:00
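The shape of this fix is visible in the diff below: `RE_FROM_COLON_OR_DATE_COLON` now requires at least two consecutive header-style lines, so a lone "Date: ..." sentence inside body text no longer counts as a quotation splitter. A simplified sketch (pattern reduced from the real one; the header-word list here is abbreviated):

```python
import re

# Require {2,} consecutive "Header: value" lines, so a single "Date: ..."
# occurrence in running prose is not mistaken for a forwarded-header block.
HEADER_BLOCK = re.compile(
    r"([\s]*(From|Date|Sent|Subject|To)[\s]?:([^\n]+\n)){2,}",
    re.I | re.M,
)

body_text = "Hello,\nDate: please pick a slot next week.\nThanks\n"
quoted = "From: Bob <bob@example.com>\nSent: Friday\nSubject: Re: hi\n"

assert HEADER_BLOCK.search(body_text) is None   # lone Date: line is kept as text
assert HEADER_BLOCK.search(quoted)              # real header block is detected
```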
Sergey Obukhov
c171f9a875 Merge pull request #169 from Savageman/patch-2
Use regex match to detect outlook 2007, 2010, 2013
2018-11-05 10:43:20 +03:00
Sergey Obukhov
3f97a8b8ff Merge branch 'master' into patch-2 2018-11-05 10:42:00 +03:00
Esperat Julian
1147767ff3 Fix regression: windows mail format was left forgotten
A `|` was missing at the end of the regex, so the following lines (the windows mail pattern) had dropped out of the global alternation; adding it back restores them.
2018-11-04 19:42:12 +01:00
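A minimal sketch of this failure mode (hypothetical patterns, not talon's actual regex): when a multi-line alternation is built by string concatenation and one line loses its trailing `|`, the next alternative silently fuses onto it and neither branch matches on its own.

```python
import re

broken = re.compile(
    "outlook"        # missing "|" here...
    "windows-mail"   # ...so this fuses into the literal "outlookwindows-mail"
)
fixed = re.compile(
    "outlook|"
    "windows-mail"
)

assert broken.search("windows-mail reply") is None   # branch lost
assert fixed.search("windows-mail reply") is not None
```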
Sergey Obukhov
6a304215c3 Merge pull request #177 from mailgun/obukhov-sergey-patch-1
Update Readme with how to retrain on your own data
2018-11-02 15:22:18 +03:00
Sergey Obukhov
31714506bd Update Readme with how to retrain on your own data 2018-11-02 15:21:36 +03:00
Sergey Obukhov
403d80cf3b Merge pull request #161 from glaand/master
Fix: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
2018-11-02 15:03:02 +03:00
Sergey Obukhov
7cf20f2877 Merge branch 'master' into master 2018-11-02 14:52:38 +03:00
Sergey Obukhov
afff08b017 Merge branch 'master' into patch-2 2018-11-02 09:13:42 +03:00
Sergey Obukhov
685abb1905 Merge pull request #171 from gabriellima95/Add-Portuguese-Language
Add Portuguese language to quotations
2018-11-02 09:12:43 +03:00
Sergey Obukhov
41990727a3 Merge branch 'master' into Add-Portuguese-Language 2018-11-02 09:11:07 +03:00
Sergey Obukhov
b113d8ab33 Merge pull request #172 from ad-m/patch-1
Fix catastrophic backtracking in regexp
2018-11-02 09:09:49 +03:00
Adam Dobrawy
7bd0e9cc2f Fix catastrophic backtracking in regexp
Co-Author: @Nipsuli
2018-09-21 22:00:10 +02:00
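The class of fix applied in this PR (the patterns below are the before/after pair from the diff) is to stop letting `.*@` scan arbitrarily far under `re.S` and instead anchor the `@` inside a whitespace-delimited email-like token, which removes the pathological backtracking space:

```python
import re

before = re.compile(r"(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.S)
after = re.compile(r"(\d+/\d+/\d+|\d+\.\d+\.\d+).*\s\S+@\S+", re.S)

# Both still match a typical splitter line; the rewritten pattern only
# accepts "@" inside a \S+@\S+ token preceded by whitespace.
line = '02.04.2012 14:20 user "bob@example.com" wrote:'
assert before.search(line)
assert after.search(line)
```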
gabriellima95
1e030a51d4 Add Portuguese language to quotations 2018-09-11 15:27:39 -03:00
Esperat Julian
238a5de5cc Use regex match to detect outlook 2007, 2010, 2013
I encountered a variant of the outlook quotations with a space after the semicolon.

To avoid multiplying the number of rules, I implemented a regex match instead (I found how to do it here: https://stackoverflow.com/a/34093801/211204).

I documented all the different variants as cleanly as I could.
2018-08-31 12:39:52 +02:00
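The XPath in this PR's diff uses lxml's EXSLT `re:match` extension; the style pattern itself can be exercised standalone with Python's `re` (a sketch, with the dots escaped for strictness, covering the color, unit, and optional-space variants documented above):

```python
import re

STYLE_RE = re.compile(
    r"border:none; ?border-top:solid #(E1E1E1|B5C4DF) 1\.0pt; ?"
    r"padding:3\.0pt 0(in|cm) 0(in|cm) 0(in|cm)"
)

variants = [
    # outlook 2007, 2010 (international / american)
    "border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm",
    "border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in",
    # outlook 2013
    "border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm",
    # variant with a space after the semicolon
    "border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in",
]
assert all(STYLE_RE.match(v) for v in variants)
```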
André Glatzl
53b24ffb3d First cut out encoding-related html tags such as the xml declaration and doctype, to avoid conflicts with unicode decoding 2017-12-19 15:15:10 +01:00
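The stripping step this commit adds can be sketched standalone (the regex is the one that appears in this PR's diff; it operates on bytes, mirroring `_extract_from_html`, because lxml refuses unicode input that still carries an encoding declaration):

```python
import re

# Drop the XML declaration / DOCTYPE-with-internal-subset before parsing.
RE_XML_DECL = re.compile(br"\<\?xml.+\?\>|\<\!DOCTYPE.+]\>")

body = b'<?xml version="1.0" encoding="utf-8"?><html><body>Hi</body></html>'
cleaned = RE_XML_DECL.sub(b"", body)
assert cleaned == b"<html><body>Hi</body></html>"
```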
Sergey Obukhov
a7404afbcb Merge pull request #155 from mailgun/sergey/appointment
fix appointments in text
2017-10-23 16:34:08 -07:00
Sergey Obukhov
0e6d5f993c fix appointments in text 2017-10-23 16:32:42 -07:00
Sergey Obukhov
60637ff13a Merge pull request #152 from mailgun/sergey/v1.4.4
bump version
2017-08-24 16:00:05 -07:00
Sergey Obukhov
df8259e3fe bump version 2017-08-24 15:58:53 -07:00
Sergey Obukhov
aab3b1cc75 Merge pull request #150 from ezrapagel/fix_greedy_dash_regex
android_wrote regex incorrectly matching
2017-08-24 15:52:29 -07:00
Sergey Obukhov
9492b39f2d Merge branch 'master' into fix_greedy_dash_regex 2017-08-24 15:39:28 -07:00
Sergey Obukhov
b9ac866ea7 Merge pull request #151 from mailgun/sergey/reshape
reshape data as suggested by sklearn
2017-08-24 12:04:58 -07:00
Sergey Obukhov
678517dd89 reshape data as suggested by sklearn 2017-08-24 12:03:47 -07:00
Ezra Pagel
221774c6f8 android_wrote regex was incorrectly iterating the characters in 'wrote', resulting in a
greedy regex that matched many strings containing dashes
2017-08-21 12:47:06 -05:00
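A likely minimal reproduction of this bug (the trailing comma added to `'wrote',` in the diff supports this reading): `str.join` iterates its argument, so passing a bare string instead of a one-element tuple splits `'wrote'` into single characters, turning the alternation into `w|r|o|t|e`.

```python
# Buggy: ('wrote') is just the string 'wrote', so join iterates characters.
buggy = '|'.join(('wrote'))
# Intended: a one-element tuple keeps 'wrote' as a single alternative.
intended = '|'.join(('wrote',))

assert buggy == 'w|r|o|t|e'   # matches any of w, r, o, t, e near dashes
assert intended == 'wrote'
```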
Sergey Obukhov
a2aa345712 Merge pull request #148 from mailgun/sergey/v1.4.2
bump version after adding support for Vietnamese format
2017-07-10 11:44:46 -07:00
Sergey Obukhov
d998beaff3 bump version after adding support for Vietnamese format 2017-07-10 11:42:52 -07:00
Sergey Obukhov
a379bc4e7c Merge pull request #147 from hnx116/master
add support for Vietnamese reply format
2017-07-10 11:40:04 -07:00
Hung Nguyen
b8e1894f3b add test case 2017-07-10 13:28:33 +07:00
Hung Nguyen
0b5a44090f add support for Vietnamese reply format 2017-07-10 11:18:57 +07:00
Sergey Obukhov
b40835eca2 Merge pull request #145 from mailgun/sergey/outlook-2013-version-bump
bump version after merging outlook 2013 support PR
2017-06-18 22:56:16 -07:00
Sergey Obukhov
b38562c7cc bump version after merging outlook 2013 support PR 2017-06-18 22:55:15 -07:00
Sergey Obukhov
70e9fb415e Merge pull request #139 from Savageman/patch-1
Added Outlook 2013 rules
2017-06-18 22:53:18 -07:00
Sergey Obukhov
64612099cd Merge branch 'master' into patch-1 2017-06-18 22:51:46 -07:00
Sergey Obukhov
45c20f979d Merge pull request #144 from mailgun/sergey/python3-support-version-bump
bump version after merging python 3 support PR
2017-06-18 22:49:20 -07:00
Sergey Obukhov
743c76f159 bump version after merging python 3 support PR 2017-06-18 22:48:12 -07:00
Sergey Obukhov
bc5dad75d3 Merge pull request #141 from yfilali/master
Python 3 compatibility up to 3.6.1
2017-06-18 22:44:07 -07:00
Yacine Filali
4acf05cf28 Only use load compat if we can't load the classifier 2017-05-24 13:29:59 -07:00
Yacine Filali
f5f7264077 Can now handle read only classifier data as well 2017-05-24 13:22:24 -07:00
Yacine Filali
4364bebf38 Added exception checking for pickle format conversion 2017-05-24 10:26:33 -07:00
Yacine Filali
15e61768f2 Encoding fixes 2017-05-23 16:17:39 -07:00
Yacine Filali
dd0a0f5c4d Python 2.7 backward compat 2017-05-23 16:10:13 -07:00
Yacine Filali
086f5ba43b Updated talon for Python 3 2017-05-23 15:39:50 -07:00
Esperat Julian
e16dcf629e Added Outlook 2013 rules
Only the border color changes (compared to Outlook 2007, 2010) from `#B5C4DF` to `#E1E1E1`.
2017-04-27 11:34:01 +02:00
Sergey Obukhov
f16ae5110b Merge pull request #138 from mailgun/sergey/v1.3.7
bumped talon version
2017-04-25 11:49:29 -07:00
Sergey Obukhov
ab5cbe5ec3 bumped talon version 2017-04-25 11:43:55 -07:00
Sergey Obukhov
be5da92f16 Merge pull request #135 from esetnik/polymail_support
Polymail Quote Support
2017-04-25 11:34:47 -07:00
Sergey Obukhov
95954a65a0 Merge branch 'master' into polymail_support 2017-04-25 11:30:53 -07:00
Ethan Setnik
5c413b4b00 allow more lines since polymail has extra whitespace 2017-04-12 00:07:29 -04:00
Ethan Setnik
cca64d3ed1 add test case 2017-04-11 23:36:36 -04:00
Ethan Setnik
e11eaf6ff8 add support for polymail reply format 2017-04-11 22:38:29 -04:00
22 changed files with 3049 additions and 2535 deletions

20
.build/Dockerfile Normal file

@@ -0,0 +1,20 @@
FROM python:3.9-slim-buster AS deps
RUN apt-get update && \
apt-get install -y build-essential git curl python3-dev libatlas3-base libatlas-base-dev liblapack-dev libxml2 libxml2-dev libffi6 libffi-dev musl-dev libxslt-dev
FROM deps AS testable
ARG REPORT_PATH
VOLUME ["/var/mailgun", "/etc/mailgun/ssl", ${REPORT_PATH}]
ADD . /app
WORKDIR /app
COPY wheel/* /wheel/
RUN mkdir -p ${REPORT_PATH}
RUN python ./setup.py build bdist_wheel -d /wheel && \
pip install --no-deps /wheel/*
ENTRYPOINT ["/bin/sh", "/app/run_tests.sh"]

5
.gitignore vendored

@@ -39,6 +39,8 @@ nosetests.xml
/.emacs.desktop /.emacs.desktop
/.emacs.desktop.lock /.emacs.desktop.lock
.elc .elc
.idea
.cache
auto-save-list auto-save-list
tramp tramp
.\#* .\#*
@@ -52,3 +54,6 @@ _trial_temp
# OSX # OSX
.DS_Store .DS_Store
# vim-backup
*.bak


@@ -5,3 +5,10 @@ include classifier
include LICENSE include LICENSE
include MANIFEST.in include MANIFEST.in
include README.rst include README.rst
include talon/signature/data/train.data
include talon/signature/data/classifier
include talon/signature/data/classifier_01.npy
include talon/signature/data/classifier_02.npy
include talon/signature/data/classifier_03.npy
include talon/signature/data/classifier_04.npy
include talon/signature/data/classifier_05.npy


@@ -129,6 +129,22 @@ start using it for talon.
.. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set .. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
.. _forge: https://github.com/mailgun/forge .. _forge: https://github.com/mailgun/forge
Training on your dataset
------------------------

talon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the `forge`_ project does. Then do:

.. code:: python

    from talon.signature.learning.dataset import build_extraction_dataset
    from talon.signature.learning import classifier as c

    build_extraction_dataset("/path/to/your/P/folder", "/path/to/talon/signature/data/train.data")
    c.train(c.init(), "/path/to/talon/signature/data/train.data", "/path/to/talon/signature/data/classifier")

Note that for signature extraction you need just the folder with the positive samples with annotated signature lines (P folder).

.. _forge: https://github.com/mailgun/forge
Research Research
-------- --------

11
requirements.txt Normal file

@@ -0,0 +1,11 @@
chardet>=1.0.1
cchardet>=0.3.5
cssselect
html5lib
joblib
lxml>=2.3.3
numpy
regex>=1
scikit-learn>=1.0.0
scipy
six>=1.10.0

4
run_tests.sh Executable file

@@ -0,0 +1,4 @@
#!/usr/bin/env bash
set -ex
REPORT_PATH="${REPORT_PATH:-./}"
nosetests --with-xunit --with-coverage --cover-xml --cover-xml-file $REPORT_PATH/coverage.xml --xunit-file=$REPORT_PATH/nosetests.xml --cover-package=talon .


@@ -19,17 +19,17 @@ class InstallCommand(install):
if self.no_ml: if self.no_ml:
dist = self.distribution dist = self.distribution
dist.packages=find_packages(exclude=[ dist.packages=find_packages(exclude=[
'tests', "tests",
'tests.*', "tests.*",
'talon.signature', "talon.signature",
'talon.signature.*', "talon.signature.*",
]) ])
for not_required in ['numpy', 'scipy', 'scikit-learn==0.16.1']: for not_required in ["numpy", "scipy", "scikit-learn==0.24.1"]:
dist.install_requires.remove(not_required) dist.install_requires.remove(not_required)
setup(name='talon', setup(name='talon',
version='1.3.6', version='1.4.9',
description=("Mailgun library " description=("Mailgun library "
"to extract message quotations and signatures."), "to extract message quotations and signatures."),
long_description=open("README.rst").read(), long_description=open("README.rst").read(),
@@ -44,20 +44,21 @@ setup(name='talon',
include_package_data=True, include_package_data=True,
zip_safe=True, zip_safe=True,
install_requires=[ install_requires=[
"lxml>=2.3.3", "lxml",
"regex>=1", "regex",
"numpy", "numpy",
"scipy", "scipy",
"scikit-learn>=0.16.1", # pickled versions of classifier, else rebuild "scikit-learn>=1.0.0",
'chardet>=1.0.1', "chardet",
'cchardet>=0.3.5', "cchardet",
'cssselect', "cssselect",
'six>=1.10.0', "six",
'html5lib' "html5lib",
"joblib",
], ],
tests_require=[ tests_require=[
"mock", "mock",
"nose>=1.2.1", "nose",
"coverage" "coverage"
] ]
) )


@@ -87,17 +87,24 @@ def cut_gmail_quote(html_message):
def cut_microsoft_quote(html_message): def cut_microsoft_quote(html_message):
''' Cuts splitter block and all following blocks. ''' ''' Cuts splitter block and all following blocks. '''
#use EXSLT extensions to have a regex match() function with lxml
ns = {"re": "http://exslt.org/regular-expressions"}
#general pattern: @style='border:none;border-top:solid <color> 1.0pt;padding:3.0pt 0<unit> 0<unit> 0<unit>'
#outlook 2007, 2010 (international) <color=#B5C4DF> <unit=cm>
#outlook 2007, 2010 (american) <color=#B5C4DF> <unit=pt>
#outlook 2013 (international) <color=#E1E1E1> <unit=cm>
#outlook 2013 (american) <color=#E1E1E1> <unit=pt>
#also handles a variant with a space after the semicolon
splitter = html_message.xpath( splitter = html_message.xpath(
#outlook 2007, 2010 (international) #outlook 2007, 2010, 2013 (international, american)
"//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;" "//div[@style[re:match(., 'border:none; ?border-top:solid #(E1E1E1|B5C4DF) 1.0pt; ?"
"padding:3.0pt 0cm 0cm 0cm']|" "padding:3.0pt 0(in|cm) 0(in|cm) 0(in|cm)')]]|"
#outlook 2007, 2010 (american)
"//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
"padding:3.0pt 0in 0in 0in']|"
#windows mail #windows mail
"//div[@style='padding-top: 5px; " "//div[@style='padding-top: 5px; "
"border-top-color: rgb(229, 229, 229); " "border-top-color: rgb(229, 229, 229); "
"border-top-width: 1px; border-top-style: solid;']" "border-top-width: 1px; border-top-style: solid;']"
, namespaces=ns
) )
if splitter: if splitter:


@@ -22,7 +22,7 @@ import six
log = logging.getLogger(__name__) log = logging.getLogger(__name__)
RE_FWD = re.compile("^[-]+[ ]*Forwarded message[ ]*[-]+$", re.I | re.M) RE_FWD = re.compile("^[-]+[ ]*Forwarded message[ ]*[-]+\s*$", re.I | re.M)
RE_ON_DATE_SMB_WROTE = re.compile( RE_ON_DATE_SMB_WROTE = re.compile(
u'(-*[>]?[ ]?({0})[ ].*({1})(.*\n){{0,2}}.*({2}):?-*)'.format( u'(-*[>]?[ ]?({0})[ ].*({1})(.*\n){{0,2}}.*({2}):?-*)'.format(
@@ -38,10 +38,14 @@ RE_ON_DATE_SMB_WROTE = re.compile(
'Op', 'Op',
# German # German
'Am', 'Am',
# Portuguese
'Em',
# Norwegian # Norwegian
u'', u'',
# Swedish, Danish # Swedish, Danish
'Den', 'Den',
# Vietnamese
u'Vào',
)), )),
# Date and sender separator # Date and sender separator
u'|'.join(( u'|'.join((
@@ -62,8 +66,12 @@ RE_ON_DATE_SMB_WROTE = re.compile(
'schreef','verzond','geschreven', 'schreef','verzond','geschreven',
# German # German
'schrieb', 'schrieb',
# Portuguese
'escreveu',
# Norwegian, Swedish # Norwegian, Swedish
'skrev', 'skrev',
# Vietnamese
u'đã viết',
)) ))
)) ))
# Special case for languages where text is translated like this: 'on {date} wrote {somebody}:' # Special case for languages where text is translated like this: 'on {date} wrote {somebody}:'
@@ -131,21 +139,33 @@ RE_ORIGINAL_MESSAGE = re.compile(u'[\s]*[-]+[ ]*({})[ ]*[-]+'.format(
'Oprindelig meddelelse', 'Oprindelig meddelelse',
))), re.I) ))), re.I)
RE_FROM_COLON_OR_DATE_COLON = re.compile(u'(_+\r?\n)?[\s]*(:?[*]?{})[\s]?:[*]?.*'.format( RE_FROM_COLON_OR_DATE_COLON = re.compile(u'((_+\r?\n)?[\s]*:?[*]?({})[\s]?:([^\n$]+\n){{1,2}}){{2,}}'.format(
u'|'.join(( u'|'.join((
# "From" in different languages. # "From" in different languages.
'From', 'Van', 'De', 'Von', 'Fra', u'Från', 'From', 'Van', 'De', 'Von', 'Fra', u'Från',
# "Date" in different languages. # "Date" in different languages.
'Date', 'Datum', u'Envoyé', 'Skickat', 'Sendt', 'Date', '[S]ent', 'Datum', u'Envoyé', 'Skickat', 'Sendt', 'Gesendet',
))), re.I) # "Subject" in different languages.
'Subject', 'Betreff', 'Objet', 'Emne', u'Ämne',
# "To" in different languages.
'To', 'An', 'Til', u'À', 'Till'
))), re.I | re.M)
# ---- John Smith wrote ---- # ---- John Smith wrote ----
RE_ANDROID_WROTE = re.compile(u'[\s]*[-]+.*({})[ ]*[-]+'.format( RE_ANDROID_WROTE = re.compile(u'[\s]*[-]+.*({})[ ]*[-]+'.format(
u'|'.join(( u'|'.join((
# English # English
'wrote' 'wrote',
))), re.I) ))), re.I)
# Support polymail.io reply format
# On Tue, Apr 11, 2017 at 10:07 PM John Smith
#
# <
# mailto:John Smith <johnsmith@gmail.com>
# > wrote:
RE_POLYMAIL = re.compile('On.*\s{2}<\smailto:.*\s> wrote:', re.I)
SPLITTER_PATTERNS = [ SPLITTER_PATTERNS = [
RE_ORIGINAL_MESSAGE, RE_ORIGINAL_MESSAGE,
RE_ON_DATE_SMB_WROTE, RE_ON_DATE_SMB_WROTE,
@@ -153,16 +173,17 @@ SPLITTER_PATTERNS = [
RE_FROM_COLON_OR_DATE_COLON, RE_FROM_COLON_OR_DATE_COLON,
# 02.04.2012 14:20 пользователь "bob@example.com" < # 02.04.2012 14:20 пользователь "bob@example.com" <
# bob@xxx.mailgun.org> написал: # bob@xxx.mailgun.org> написал:
re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.S), re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*\s\S+@\S+", re.S),
# 2014-10-17 11:28 GMT+03:00 Bob < # 2014-10-17 11:28 GMT+03:00 Bob <
# bob@example.com>: # bob@example.com>:
re.compile("\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}\s+GMT.*@", re.S), re.compile("\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}\s+GMT.*\s\S+@\S+", re.S),
# Thu, 26 Jun 2014 14:00:51 +0400 Bob <bob@example.com>: # Thu, 26 Jun 2014 14:00:51 +0400 Bob <bob@example.com>:
re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?' re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?'
'( \S+){3,6}@\S+:'), '( \S+){3,6}@\S+:'),
# Sent from Samsung MobileName <address@example.com> wrote: # Sent from Samsung MobileName <address@example.com> wrote:
re.compile('Sent from Samsung .*@.*> wrote'), re.compile('Sent from Samsung.* \S+@\S+> wrote'),
RE_ANDROID_WROTE RE_ANDROID_WROTE,
RE_POLYMAIL
] ]
RE_LINK = re.compile('<(http://[^>]*)>') RE_LINK = re.compile('<(http://[^>]*)>')
@@ -170,7 +191,7 @@ RE_NORMALIZED_LINK = re.compile('@@(http://[^>@]*)@@')
RE_PARENTHESIS_LINK = re.compile("\(https?://") RE_PARENTHESIS_LINK = re.compile("\(https?://")
SPLITTER_MAX_LINES = 4 SPLITTER_MAX_LINES = 6
MAX_LINES_COUNT = 1000 MAX_LINES_COUNT = 1000
# an extensive research shows that exceeding this limit # an extensive research shows that exceeding this limit
# leads to excessive processing time # leads to excessive processing time
@@ -273,7 +294,7 @@ def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
# inlined reply # inlined reply
# use lookbehind assertions to find overlapping entries e.g. for 'mtmtm' # use lookbehind assertions to find overlapping entries e.g. for 'mtmtm'
# both 't' entries should be found # both 't' entries should be found
for inline_reply in re.finditer('(?<=m)e*((?:t+e*)+)m', markers): for inline_reply in re.finditer('(?<=m)e*(t[te]*)m', markers):
# long links could break sequence of quotation lines but they shouldn't # long links could break sequence of quotation lines but they shouldn't
# be considered an inline reply # be considered an inline reply
links = ( links = (
@@ -417,6 +438,9 @@ def _extract_from_html(msg_body):
Extract not quoted message from provided html message body Extract not quoted message from provided html message body
using tags and plain text algorithm. using tags and plain text algorithm.
First cut out encoding-related html tags such as the xml declaration and doctype
to avoid conflicts with unicode decoding
Cut out the 'blockquote', 'gmail_quote' tags. Cut out the 'blockquote', 'gmail_quote' tags.
Cut Microsoft quotations. Cut Microsoft quotations.
@@ -432,6 +456,9 @@ def _extract_from_html(msg_body):
return msg_body return msg_body
msg_body = msg_body.replace(b'\r\n', b'\n') msg_body = msg_body.replace(b'\r\n', b'\n')
msg_body = re.sub(br"\<\?xml.+\?\>|\<\!DOCTYPE.+]\>", "", msg_body)
html_tree = html_document_fromstring(msg_body) html_tree = html_document_fromstring(msg_body)
if html_tree is None: if html_tree is None:
@@ -489,9 +516,69 @@ def _extract_from_html(msg_body):
if _readable_text_empty(html_tree_copy): if _readable_text_empty(html_tree_copy):
return msg_body return msg_body
# NOTE: We remove_namespaces() because we are using an HTML5 Parser, HTML
# parsers do not recognize namespaces in HTML tags. As such the rendered
# HTML tags are no longer recognizable HTML tags. Example: <o:p> becomes
# <oU0003Ap>. When we port this to golang we should look into using an
# XML Parser, NOT an HTML5 Parser, since we do not know what input a
# customer will send us. Switching to a common XML parser in python
# opens us up to a host of vulnerabilities.
# See https://docs.python.org/3/library/xml.html#xml-vulnerabilities
#
# The downside to removing the namespaces is that customers might
# consider the XML namespaces important. If that is the case then support
# should encourage customers to perform XML parsing of the un-stripped
# body to get the full unmodified XML payload.
#
# Alternatives to this approach are
# 1. Ignore the U0003A in tag names and let the customer deal with it.
# This is not ideal, as most customers use stripped-html for viewing
# emails sent from a recipient, as such they cannot control the HTML
# provided by a recipient.
# 2. Perform a string replace of 'U0003A' to ':' on the rendered HTML
# string. While this would solve the issue simply, it runs the risk
# of replacing data outside the <tag> which might be essential to
# the customer.
remove_namespaces(html_tree_copy)
return html.tostring(html_tree_copy) return html.tostring(html_tree_copy)
def remove_namespaces(root):
"""
Given the root of an HTML document iterate through all the elements
and remove any namespaces that might have been provided and remove
any attributes that contain a namespace
<html xmlns:o="urn:schemas-microsoft-com:office:office">
becomes
<html>
<o:p>Hi</o:p>
becomes
<p>Hi</p>
In HTML, start tags do NOT have namespaces; the COLON character has no special
meaning there. If we don't remove the namespace, the parser translates the tag
name into a unicode representation: for example <o:p> becomes <oU0003Ap>.
See https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#start-tags
"""
for child in root.iter():
for key, value in child.attrib.items():
# If the attribute includes a colon
if key.rfind("U0003A") != -1:
child.attrib.pop(key)
# If the tag includes a colon
idx = child.tag.rfind("U0003A")
if idx != -1:
child.tag = child.tag[idx+6:]
return root
def split_emails(msg): def split_emails(msg):
""" """
Given a message (which may consist of an email conversation thread with Given a message (which may consist of an email conversation thread with
@@ -544,7 +631,6 @@ def _correct_splitlines_in_headers(markers, lines):
updated_markers = "" updated_markers = ""
i = 0 i = 0
in_header_block = False in_header_block = False
for m in markers: for m in markers:
# Only set in_header_block flag when we hit an 's' and line is a header # Only set in_header_block flag when we hit an 's' and line is a header
if m == 's': if m == 's':


@@ -1,15 +1,15 @@
from __future__ import absolute_import from __future__ import absolute_import
import logging import logging
import regex as re import regex as re
from talon.utils import get_delimiter
from talon.signature.constants import (SIGNATURE_MAX_LINES, from talon.signature.constants import (SIGNATURE_MAX_LINES,
TOO_LONG_SIGNATURE_LINE) TOO_LONG_SIGNATURE_LINE)
from talon.utils import get_delimiter
log = logging.getLogger(__name__) log = logging.getLogger(__name__)
# regex to fetch signature based on common signature words # regex to fetch signature based on common signature words
RE_SIGNATURE = re.compile(r''' RE_SIGNATURE = re.compile(r'''
( (
@@ -28,7 +28,6 @@ RE_SIGNATURE = re.compile(r'''
) )
''', re.I | re.X | re.M | re.S) ''', re.I | re.X | re.M | re.S)
# signatures appended by phone email clients # signatures appended by phone email clients
RE_PHONE_SIGNATURE = re.compile(r''' RE_PHONE_SIGNATURE = re.compile(r'''
( (
@@ -45,7 +44,6 @@ RE_PHONE_SIGNATURE = re.compile(r'''
) )
''', re.I | re.X | re.M | re.S) ''', re.I | re.X | re.M | re.S)
# see _mark_candidate_indexes() for details # see _mark_candidate_indexes() for details
# c - could be signature line # c - could be signature line
# d - line starts with dashes (could be signature or list item) # d - line starts with dashes (could be signature or list item)
@@ -112,7 +110,7 @@ def extract_signature(msg_body):
return (stripped_body.strip(), return (stripped_body.strip(),
signature.strip()) signature.strip())
except Exception as e: except Exception:
log.exception('ERROR extracting signature') log.exception('ERROR extracting signature')
return (msg_body, None) return (msg_body, None)
@@ -163,7 +161,7 @@ def _mark_candidate_indexes(lines, candidate):
'cdc' 'cdc'
""" """
# at first consider everything to be potential signature lines # at first consider everything to be potential signature lines
markers = bytearray('c'*len(candidate)) markers = list('c' * len(candidate))
# mark lines starting from bottom up # mark lines starting from bottom up
for i, line_idx in reversed(list(enumerate(candidate))): for i, line_idx in reversed(list(enumerate(candidate))):
@@ -174,7 +172,7 @@ def _mark_candidate_indexes(lines, candidate):
if line.startswith('-') and line.strip("-"): if line.startswith('-') and line.strip("-"):
markers[i] = 'd' markers[i] = 'd'
return markers return "".join(markers)
def _process_marked_candidate_indexes(candidate, markers): def _process_marked_candidate_indexes(candidate, markers):


@@ -0,0 +1 @@

Binary file not shown.

File diff suppressed because it is too large.


@@ -1,16 +1,15 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
from __future__ import absolute_import from __future__ import absolute_import
import logging import logging
import regex as re
import numpy import numpy
import regex as re
from talon.signature.learning.featurespace import features, build_pattern
from talon.utils import get_delimiter
from talon.signature.bruteforce import get_signature_candidate from talon.signature.bruteforce import get_signature_candidate
from talon.signature.learning.featurespace import features, build_pattern
from talon.signature.learning.helpers import has_signature from talon.signature.learning.helpers import has_signature
from talon.utils import get_delimiter
log = logging.getLogger(__name__) log = logging.getLogger(__name__)
@@ -33,7 +32,7 @@ RE_REVERSE_SIGNATURE = re.compile(r'''
def is_signature_line(line, sender, classifier): def is_signature_line(line, sender, classifier):
'''Checks if the line belongs to signature. Returns True or False.''' '''Checks if the line belongs to signature. Returns True or False.'''
data = numpy.array(build_pattern(line, features(sender))) data = numpy.array(build_pattern(line, features(sender))).reshape(1, -1)
return classifier.predict(data) > 0 return classifier.predict(data) > 0
@@ -58,7 +57,7 @@ def extract(body, sender):
text = delimiter.join(text) text = delimiter.join(text)
if text.strip(): if text.strip():
return (text, delimiter.join(signature)) return (text, delimiter.join(signature))
except Exception: except Exception as e:
log.exception('ERROR when extracting signature with classifiers') log.exception('ERROR when extracting signature with classifiers')
return (body, None) return (body, None)
@@ -81,7 +80,7 @@ def _mark_lines(lines, sender):
candidate = get_signature_candidate(lines) candidate = get_signature_candidate(lines)
# at first consider everything to be text no signature # at first consider everything to be text no signature
markers = bytearray('t'*len(lines)) markers = list('t' * len(lines))
# mark lines starting from bottom up # mark lines starting from bottom up
# mark only lines that belong to candidate # mark only lines that belong to candidate
@@ -96,7 +95,7 @@ def _mark_lines(lines, sender):
elif is_signature_line(line, sender, EXTRACTOR): elif is_signature_line(line, sender, EXTRACTOR):
markers[j] = 's' markers[j] = 's'
return markers return "".join(markers)
def _process_marked_lines(lines, markers): def _process_marked_lines(lines, markers):
@@ -111,3 +110,4 @@ def _process_marked_lines(lines, markers):
return (lines[:-signature.end()], lines[-signature.end():]) return (lines[:-signature.end()], lines[-signature.end():])
return (lines, None) return (lines, None)


@@ -6,9 +6,10 @@ body belongs to the signature.
""" """
from __future__ import absolute_import from __future__ import absolute_import
from numpy import genfromtxt from numpy import genfromtxt
import joblib
from sklearn.svm import LinearSVC from sklearn.svm import LinearSVC
from sklearn.externals import joblib
def init(): def init():
@@ -29,4 +30,40 @@ def train(classifier, train_data_filename, save_classifier_filename=None):
def load(saved_classifier_filename, train_data_filename): def load(saved_classifier_filename, train_data_filename):
"""Loads saved classifier. """ """Loads saved classifier. """
try:
return joblib.load(saved_classifier_filename) return joblib.load(saved_classifier_filename)
except Exception:
import sys
if sys.version_info > (3, 0):
return load_compat(saved_classifier_filename)
raise
def load_compat(saved_classifier_filename):
import os
import pickle
import tempfile
# we need to switch to the data path to properly load the related _xx.npy files
cwd = os.getcwd()
os.chdir(os.path.dirname(saved_classifier_filename))
# convert encoding using pickle.load and write to a temp file which we'll tell joblib to use
pickle_file = open(saved_classifier_filename, 'rb')
classifier = pickle.load(pickle_file, encoding='latin1')
try:
# save our conversion if permissions allow
joblib.dump(classifier, saved_classifier_filename)
except Exception:
# can't write to classifier, use a temp file
tmp = tempfile.SpooledTemporaryFile()
joblib.dump(classifier, tmp)
saved_classifier_filename = tmp
# important, use joblib.load before switching back to original cwd
jb_classifier = joblib.load(saved_classifier_filename)
os.chdir(cwd)
return jb_classifier


@@ -17,13 +17,14 @@ suffix which should be `_sender`.
""" """
from __future__ import absolute_import from __future__ import absolute_import
import os import os
import regex as re import regex as re
from six.moves import range
from talon.signature.constants import SIGNATURE_MAX_LINES from talon.signature.constants import SIGNATURE_MAX_LINES
from talon.signature.learning.featurespace import build_pattern, features from talon.signature.learning.featurespace import build_pattern, features
from six.moves import range
SENDER_SUFFIX = '_sender' SENDER_SUFFIX = '_sender'
BODY_SUFFIX = '_body' BODY_SUFFIX = '_body'
@@ -57,9 +58,14 @@ def parse_msg_sender(filename, sender_known=True):
algorithm: algorithm:
>>> parse_msg_sender(filename, False) >>> parse_msg_sender(filename, False)
""" """
import sys
kwargs = {}
if sys.version_info > (3, 0):
kwargs["encoding"] = "utf8"
sender, msg = None, None sender, msg = None, None
if os.path.isfile(filename) and not is_sender_filename(filename): if os.path.isfile(filename) and not is_sender_filename(filename):
with open(filename) as f: with open(filename, **kwargs) as f:
msg = f.read() msg = f.read()
sender = u'' sender = u''
if sender_known: if sender_known:


@@ -1,19 +1,18 @@
 # coding:utf-8
 from __future__ import absolute_import
-import logging
 from random import shuffle
-import chardet
 import cchardet
-import regex as re
-from lxml.html import html5parser
-from lxml.cssselect import CSSSelector
+import chardet
 import html5lib
+import regex as re
+import six
+from lxml.cssselect import CSSSelector
+from lxml.html import html5parser
 from talon.constants import RE_DELIMITER
-import six


 def safe_format(format_string, *args, **kwargs):

@@ -132,7 +131,7 @@ def html_tree_to_text(tree):
     for el in tree.iter():
         el_text = (el.text or '') + (el.tail or '')
         if len(el_text) > 1:
-            if el.tag in _BLOCKTAGS:
+            if el.tag in _BLOCKTAGS + _HARDBREAKS:
                 text += "\n"
             if el.tag == 'li':
                 text += " * "

@@ -143,7 +142,8 @@ def html_tree_to_text(tree):
             if href:
                 text += "(%s) " % href
-        if el.tag in _HARDBREAKS and text and not text.endswith("\n"):
+        if (el.tag in _HARDBREAKS and text and
+                not text.endswith("\n") and not el_text):
             text += "\n"

     retval = _rm_excessive_newlines(text)

@@ -177,6 +177,8 @@ def html_to_text(string):
 def html_fromstring(s):
     """Parse html tree from string. Return None if the string can't be parsed.
     """
+    if isinstance(s, six.text_type):
+        s = s.encode('utf8')
     try:
         if html_too_big(s):
             return None

@@ -189,6 +191,8 @@ def html_fromstring(s):
 def html_document_fromstring(s):
     """Parse html tree from string. Return None if the string can't be parsed.
     """
+    if isinstance(s, six.text_type):
+        s = s.encode('utf8')
     try:
         if html_too_big(s):
             return None

@@ -203,7 +207,9 @@ def cssselect(expr, tree):
 def html_too_big(s):
-    return s.count('<') > _MAX_TAGS_COUNT
+    if isinstance(s, six.text_type):
+        s = s.encode('utf8')
+    return s.count(b'<') > _MAX_TAGS_COUNT


 def _contains_charset_spec(s):

@@ -248,7 +254,6 @@ def _html5lib_parser():
 _UTF8_DECLARATION = (b'<meta http-equiv="Content-Type" content="text/html;'
                      b'charset=utf-8">')

 _BLOCKTAGS = ['div', 'p', 'ul', 'li', 'h1', 'h2', 'h3']
 _HARDBREAKS = ['br', 'hr', 'tr']
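The `html_too_big` change above normalizes text to bytes before counting tags, so one `b'<'` count serves both input types. A minimal sketch of the same pattern follows; the `_MAX_TAGS_COUNT` value here is illustrative, not talon's real limit, and plain `str` stands in for `six.text_type`.

```python
_MAX_TAGS_COUNT = 419430  # illustrative threshold, not talon's actual value


def html_too_big(s):
    """Reject oversized HTML by counting '<' in a bytes view of the input.

    Encoding str input up front means the single bytes-level count works
    whether the caller passes text or bytes.
    """
    if isinstance(s, str):
        s = s.encode("utf8")
    return s.count(b"<") > _MAX_TAGS_COUNT
```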

test-requirements.txt (new file, +3)

@@ -0,0 +1,3 @@
+coverage
+mock
+nose>=1.2.1


@@ -1,13 +1,14 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import
-from . import *
-from . fixtures import *
-import regex as re
+# noinspection PyUnresolvedReferences
+import re
 from talon import quotations, utils as u
+from . import *
+from .fixtures import *
+from lxml import html

 RE_WHITESPACE = re.compile("\s")
 RE_DOUBLE_WHITESPACE = re.compile("\s")

@@ -303,7 +304,12 @@ Reply
 def extract_reply_and_check(filename):
-    f = open(filename)
+    import sys
+    kwargs = {}
+    if sys.version_info > (3, 0):
+        kwargs["encoding"] = "utf8"
+
+    f = open(filename, **kwargs)
     msg_body = f.read()
     reply = quotations.extract_from_html(msg_body)

@@ -419,3 +425,23 @@ def test_readable_html_empty():
 def test_bad_html():
     bad_html = "<html></html>"
     eq_(bad_html, quotations.extract_from_html(bad_html))
+
+
+def test_remove_namespaces():
+    msg_body = """
+    <html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40">
+    <body>
+    <o:p>Dear Sir,</o:p>
+    <o:p>Thank you for the email.</o:p>
+    <blockquote>thing</blockquote>
+    </body>
+    </html>
+    """
+    rendered = quotations.extract_from_html(msg_body)
+
+    assert_true("<p>" in rendered)
+    assert_true("xmlns" in rendered)
+    assert_true("<o:p>" not in rendered)
+    assert_true("<xmlns:o>" not in rendered)
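`test_remove_namespaces` above exercises the namespace removal added in PIP-423. A simplified, regex-based stand-in for the idea is sketched below; talon itself rewrites the parsed lxml tree rather than the raw markup, so this helper is only an illustration of what the assertions check.

```python
import re


def strip_namespaced_tags(markup):
    """Rewrite namespaced tags such as <o:p> to their plain form (<p>).

    Only tag names are touched; xmlns attribute declarations are left
    alone, matching what the test asserts ("xmlns" stays, "<o:p>" goes).
    """
    return re.sub(r"(</?)\w+:(\w+)", r"\1\2", markup)
```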


@@ -1,16 +1,16 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import
-from .. import *
 import os
-from talon.signature.learning import dataset
-from talon import signature
-from talon.signature import extraction as e
-from talon.signature import bruteforce
 from six.moves import range
+from talon.signature import bruteforce, extraction, extract
+from talon.signature import extraction as e
+from talon.signature.learning import dataset
+from .. import *


 def test_message_shorter_SIGNATURE_MAX_LINES():
     sender = "bob@foo.bar"

@@ -18,20 +18,25 @@ def test_message_shorter_SIGNATURE_MAX_LINES():
 Thanks in advance,
 Bob"""
-    text, extracted_signature = signature.extract(body, sender)
+    text, extracted_signature = extract(body, sender)
     eq_('\n'.join(body.splitlines()[:2]), text)
     eq_('\n'.join(body.splitlines()[-2:]), extracted_signature)


 def test_messages_longer_SIGNATURE_MAX_LINES():
+    import sys
+    kwargs = {}
+    if sys.version_info > (3, 0):
+        kwargs["encoding"] = "utf8"
+
     for filename in os.listdir(STRIPPED):
         filename = os.path.join(STRIPPED, filename)
         if not filename.endswith('_body'):
             continue
         sender, body = dataset.parse_msg_sender(filename)
-        text, extracted_signature = signature.extract(body, sender)
+        text, extracted_signature = extract(body, sender)
         extracted_signature = extracted_signature or ''
-        with open(filename[:-len('body')] + 'signature') as ms:
+        with open(filename[:-len('body')] + 'signature', **kwargs) as ms:
             msg_signature = ms.read()
         eq_(msg_signature.strip(), extracted_signature.strip())
         stripped_msg = body.strip()[:len(body.strip()) - len(msg_signature)]

@@ -47,7 +52,7 @@ Thanks in advance,
 some text which doesn't seem to be a signature at all
 Bob"""
-    text, extracted_signature = signature.extract(body, sender)
+    text, extracted_signature = extract(body, sender)
     eq_('\n'.join(body.splitlines()[:2]), text)
     eq_('\n'.join(body.splitlines()[-3:]), extracted_signature)

@@ -60,7 +65,7 @@ Thanks in advance,
 some long text here which doesn't seem to be a signature at all
 Bob"""
-    text, extracted_signature = signature.extract(body, sender)
+    text, extracted_signature = extract(body, sender)
     eq_('\n'.join(body.splitlines()[:-1]), text)
     eq_('Bob', extracted_signature)

@@ -68,13 +73,13 @@ Bob"""
 some *long* text here which doesn't seem to be a signature at all
 """
-    ((body, None), signature.extract(body, "david@example.com"))
+    ((body, None), extract(body, "david@example.com"))


 def test_basic():
     msg_body = 'Blah\r\n--\r\n\r\nSergey Obukhov'
     eq_(('Blah', '--\r\n\r\nSergey Obukhov'),
-        signature.extract(msg_body, 'Sergey'))
+        extract(msg_body, 'Sergey'))


 def test_capitalized():

@@ -99,7 +104,7 @@ Doe Inc
 Doe Inc
 555-531-7967"""
-    eq_(sig, signature.extract(msg_body, 'Doe')[1])
+    eq_(sig, extract(msg_body, 'Doe')[1])


 def test_over_2_text_lines_after_signature():

@@ -110,25 +115,25 @@ def test_over_2_text_lines_after_signature():
 2 non signature lines in the end
 It's not signature
 """
-    text, extracted_signature = signature.extract(body, "Bob")
+    text, extracted_signature = extract(body, "Bob")
     eq_(extracted_signature, None)


 def test_no_signature():
     sender, body = "bob@foo.bar", "Hello"
-    eq_((body, None), signature.extract(body, sender))
+    eq_((body, None), extract(body, sender))


 def test_handles_unicode():
     sender, body = dataset.parse_msg_sender(UNICODE_MSG)
-    text, extracted_signature = signature.extract(body, sender)
+    text, extracted_signature = extract(body, sender)


-@patch.object(signature.extraction, 'has_signature')
+@patch.object(extraction, 'has_signature')
 def test_signature_extract_crash(has_signature):
     has_signature.side_effect = Exception('Bam!')
     msg_body = u'Blah\r\n--\r\n\r\nСергей'
-    eq_((msg_body, None), signature.extract(msg_body, 'Сергей'))
+    eq_((msg_body, None), extract(msg_body, 'Сергей'))


 def test_mark_lines():


@@ -35,6 +35,19 @@ On 11-Apr-2011, at 6:54 PM, Roman Tkachenko <romant@example.com> wrote:
     eq_("Test reply", quotations.extract_from_plain(msg_body))


+def test_pattern_on_date_polymail():
+    msg_body = """Test reply
+
+On Tue, Apr 11, 2017 at 10:07 PM John Smith
+
+<
+mailto:John Smith <johnsmith@gmail.com>
+> wrote:
+
+Test quoted data
+"""
+    eq_("Test reply", quotations.extract_from_plain(msg_body))
+
+
 def test_pattern_sent_from_samsung_smb_wrote():
     msg_body = """Test reply

@@ -106,6 +119,38 @@ On 11-Apr-2011, at 6:54 PM, Roman Tkachenko <romant@example.com> sent:
     eq_("Test reply", quotations.extract_from_plain(msg_body))


+def test_appointment():
+    msg_body = """Response
+
+10/19/2017 @ 9:30 am for physical therapy
+Bla
+1517 4th Avenue Ste 300
+London CA 19129, 555-421-6780
+
+John Doe, FCLS
+Mailgun Inc
+555-941-0697
+
+From: from@example.com [mailto:from@example.com]
+Sent: Wednesday, October 18, 2017 2:05 PM
+To: John Doer - SIU <jd@example.com>
+Subject: RE: Claim # 5551188-1
+
+Text"""
+
+    expected = """Response
+
+10/19/2017 @ 9:30 am for physical therapy
+Bla
+1517 4th Avenue Ste 300
+London CA 19129, 555-421-6780
+
+John Doe, FCLS
+Mailgun Inc
+555-941-0697"""
+    eq_(expected, quotations.extract_from_plain(msg_body))
+
+
 def test_line_starts_with_on():
     msg_body = """Blah-blah-blah
 On blah-blah-blah"""

@@ -388,6 +433,14 @@ Op 17-feb.-2015, om 13:18 heeft Julius Caesar <pantheon@rome.com> het volgende g
 Small batch beard laboris tempor, non listicle hella Tumblr heirloom.
 """))


+def test_vietnamese_from_block():
+    eq_('Hello', quotations.extract_from_plain(
+        u"""Hello
+
+Vào 14:24 8 tháng 6, 2017, Hùng Nguyễn <hungnguyen@xxx.com> đã viết:
+
+> Xin chào
+"""))
+
+
 def test_quotation_marker_false_positive():
     msg_body = """Visit us now for assistance...

@@ -400,6 +453,7 @@ def test_link_closed_with_quotation_marker_on_new_line():
     msg_body = '''8.45am-1pm

 From: somebody@example.com
+Date: Wed, 16 May 2012 00:15:02 -0600

 <http://email.example.com/c/dHJhY2tpbmdfY29kZT1mMDdjYzBmNzM1ZjYzMGIxNT
 > <bob@example.com <mailto:bob@example.com> >

@@ -441,7 +495,9 @@ def test_from_block_starts_with_date():
     msg_body = """Blah

 Date: Wed, 16 May 2012 00:15:02 -0600
-To: klizhentas@example.com"""
+To: klizhentas@example.com
+"""

     eq_('Blah', quotations.extract_from_plain(msg_body))

@@ -511,11 +567,12 @@ def test_mark_message_lines():
              # next line should be marked as splitter
              '_____________',
              'From: foo@bar.com',
+             'Date: Wed, 16 May 2012 00:15:02 -0600',
              '',
              '> Hi',
              '',
              'Signature']
-    eq_('tessemet', quotations.mark_message_lines(lines))
+    eq_('tesssemet', quotations.mark_message_lines(lines))

     lines = ['Just testing the email reply',
              '',

@@ -754,6 +811,31 @@ def test_split_email():
 >
 >
 """
-    expected_markers = "stttttsttttetesetesmmmmmmssmmmmmmsmmmmmmmm"
+    expected_markers = "stttttsttttetesetesmmmmmmsmmmmmmmmmmmmmmmm"
     markers = quotations.split_emails(msg)
     eq_(markers, expected_markers)
+
+
+def test_feedback_below_left_unparsed():
+    msg_body = """Please enter your feedback below. Thank you.
+
+------------------------------------- Enter Feedback Below -------------------------------------
+
+The user experience was unparallelled. Please continue production. I'm sending payment to ensure
+that this line is intact."""
+
+    parsed = quotations.extract_from_plain(msg_body)
+    eq_(msg_body, parsed)
+
+
+def test_appointment_2():
+    msg_body = """Invitation for an interview:
+
+Date: Wednesday 3, October 2011
+Time: 7 : 00am
+Address: 130 Fox St
+
+Please bring in your ID."""
+
+    parsed = quotations.extract_from_plain(msg_body)
+    eq_(msg_body, parsed)
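The marker strings asserted in these tests classify each line: 't' text, 's' splitter, 'e' empty, 'm' quotation marker. A toy classifier along those lines is sketched below; the patterns are deliberately simplified stand-ins, and the real `quotations.mark_message_lines` uses talon's much richer regexes plus lookahead over neighboring lines.

```python
import re

# simplified stand-ins for talon's splitter / quotation patterns
RE_SPLITTER = re.compile(r"^(From:|Date:|On .* wrote:|_{5,})")
RE_QUOT_MARKER = re.compile(r"^\s*>")


def mark_lines(lines):
    """Classify each line as (e)mpty, (s)plitter, (m)arker, or (t)ext."""
    markers = []
    for line in lines:
        if not line.strip():
            markers.append("e")
        elif RE_SPLITTER.match(line):
            markers.append("s")
        elif RE_QUOT_MARKER.match(line):
            markers.append("m")
        else:
            markers.append("t")
    return "".join(markers)
```

Against the list in `test_mark_message_lines` above, this toy version would mark the added `'Date: ...'` line 's' as well, which is exactly why the expected string grows from `'tessemet'` to `'tesssemet'`.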


@@ -1,12 +1,12 @@
 # coding:utf-8
 from __future__ import absolute_import
-from . import *
-from talon import utils as u
 import cchardet
 import six
+from lxml import html
+from talon import utils as u
+from . import *


 def test_get_delimiter():

@@ -115,15 +115,16 @@ font: 13px 'Lucida Grande', Arial, sans-serif;
 def test_comment_no_parent():
-    s = "<!-- COMMENT 1 --> no comment"
+    s = b'<!-- COMMENT 1 --> no comment'
     d = u.html_document_fromstring(s)
-    eq_("no comment", u.html_tree_to_text(d))
+    eq_(b"no comment", u.html_tree_to_text(d))


 @patch.object(u.html5parser, 'fromstring', Mock(side_effect=Exception()))
 def test_html_fromstring_exception():
     eq_(None, u.html_fromstring("<html></html>"))


 @patch.object(u, 'html_too_big', Mock())
 @patch.object(u.html5parser, 'fromstring')
 def test_html_fromstring_too_big(fromstring):

@@ -158,5 +159,5 @@ def test_html_too_big():
 @patch.object(u, '_MAX_TAGS_COUNT', 3)
 def test_html_to_text():
-    eq_("Hello", u.html_to_text("<div>Hello</div>"))
+    eq_(b"Hello", u.html_to_text("<div>Hello</div>"))
     eq_(None, u.html_to_text("<div><span>Hi</span></div>"))