initial commit

Sergey Obukhov
2014-07-23 21:12:54 -07:00
commit 170f11038b
80 changed files with 7481 additions and 0 deletions

51
.gitignore vendored Normal file

@@ -0,0 +1,51 @@
*.py[cod]
# C extensions
*.so
# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
# Installer logs
pip-log.txt
# Unit test / coverage reports
.coverage
.tox
nosetests.xml
# Translations
*.mo
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
.elc
auto-save-list
tramp
.\#*
# Org-mode
.org-id-locations
*_archive
# Trial temp
_trial_temp

202
LICENSE Normal file

@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

97
README.md Normal file

@@ -0,0 +1,97 @@
talon
=====
Mailgun library to extract message quotations and signatures.
If you have ever tried to parse message quotations or signatures, you know that the absence of any formatting standards in this area
can make the task a nightmare. Hopefully this library will make your life much easier. The name of the project is
inspired by TALON, a multipurpose robot designed to perform missions ranging from reconnaissance to combat and to operate in
a number of hostile environments. That's what a good quotation and signature parser should be like :smile:
Usage
-----
Here's how you initialize the library and extract a reply from a text message:
```python
import talon
from talon import quotations
talon.init()
text = """Reply
-----Original Message-----
Quote"""
reply = quotations.extract_from(text, 'text/plain')
# or call the plain-text extractor directly:
reply = quotations.extract_from_plain(text)
# reply == "Reply"
```
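To give a feel for how plain-text extraction works, here is a much-simplified, self-contained sketch of the line-marking idea used in `talon/quotations.py`. It is not talon's implementation: the two patterns and the helper names `mark_lines` and `strip_after_splitter` are illustrative assumptions, and only the `Original Message` / `Reply Message` splitter is handled.

```python
import re

# Simplified stand-ins for talon's patterns (assumptions for illustration).
QUOTE_MARKER = re.compile(r'^>+ ?')
SPLITTER = re.compile(r'\s*-+ *(Original|Reply) Message *-+', re.I)


def mark_lines(lines):
    """Return one marker per line: e=empty, m=quoted, s=splitter, t=text."""
    markers = []
    for line in lines:
        if not line.strip():
            markers.append('e')
        elif QUOTE_MARKER.match(line):
            markers.append('m')
        elif SPLITTER.match(line):
            markers.append('s')
        else:
            markers.append('t')
    return ''.join(markers)


def strip_after_splitter(text):
    """Drop everything from the first splitter line onward."""
    lines = text.splitlines()
    cut = mark_lines(lines).find('s')
    if cut == -1:
        return text.strip()
    return '\n'.join(lines[:cut]).strip()
```

The real library then runs regular expressions over the marker string itself, which is what lets it tell an inline reply apart from a trailing quotation.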
To extract a reply from html:
```python
html = """Reply
<blockquote>
<div>
On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
</div>
<div>
Quote
</div>
</blockquote>"""
reply = quotations.extract_from(html, 'text/html')
reply = quotations.extract_from_html(html)
# reply == "<html><body><p>Reply</p></body></html>"
```
Often the best way is the easiest one. Here's how you can extract a signature from an email message without any
fancy machine learning:
```python
from talon.signature.bruteforce import extract_signature
message = """Wow. Awesome!
--
Bob Smith"""
text, signature = extract_signature(message)
# text == "Wow. Awesome!"
# signature == "--\nBob Smith"
```
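One way to picture the brute-force approach is as a scan over the last few lines of a message for a line that typically opens a signature. The sketch below is a minimal illustration under stated assumptions: `split_signature`, `SIGNATURE_START` and the `max_lines` default of 11 are hypothetical, and talon's own patterns cover more cases (phone signatures, for example).

```python
import re

# Hypothetical, reduced pattern: a dash delimiter or a common sign-off word
# standing alone on its line.
SIGNATURE_START = re.compile(
    r'^\s*(?:-{2,}|thanks|regards|cheers)[\s,!]*$', re.I)


def split_signature(message, max_lines=11):
    """Return (body, signature); signature is None when nothing matches."""
    lines = message.strip().splitlines()
    # Only the tail of the message is considered a signature candidate.
    start = max(0, len(lines) - max_lines)
    for i in range(start, len(lines)):
        if SIGNATURE_START.match(lines[i]):
            body = '\n'.join(lines[:i]).strip()
            return body, '\n'.join(lines[i:])
    return message.strip(), None
```

Limiting the scan to the tail keeps a stray "Thanks," near the top of a long message from being mistaken for the start of a signature.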
It's quick and works like a charm 90% of the time. For the other 10% you can use the power of machine learning algorithms:
```python
from talon import signature
message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.
John Doe
via mobile"""
text, signature = signature.extract(message, sender='john.doe@example.com')
# text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
# signature == "John Doe\nvia mobile"
```
For machine learning, talon currently uses the [PyML](http://pyml.sourceforge.net/) library to build SVM classifiers. The core of the machine learning algorithm lies in the ``talon.signature.learning`` package. It defines a set of features to apply to a message (``featurespace.py``), how data sets are built (``dataset.py``), and the classifier's interface (``classifier.py``).
The data used for training is taken from our personal email conversations and from the [ENRON](https://www.cs.cmu.edu/~enron/) dataset. As a result of applying our set of features to the dataset, we provide the files ``classifier`` and ``train.data``, which don't contain any personal information but can be used to load the trained classifier. Those files should be regenerated every time the feature set or data set is changed.
Research
--------
The library is inspired by the following research papers and projects:
* http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
* http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf

2
setup.cfg Normal file

@@ -0,0 +1,2 @@
[metadata]
description-file = README.md

105
setup.py Normal file

@@ -0,0 +1,105 @@
import os
import sys
import contextlib
from distutils.spawn import find_executable
from setuptools import setup, find_packages
setup(name='talon',
version='1.0',
description=("Mailgun library "
"to extract message quotations and signatures."),
long_description=open("README.md").read(),
author='Mailgun Inc.',
author_email='admin@mailgunhq.com',
url='https://github.com/mailgun/talon',
license='APACHE2',
packages=find_packages(exclude=['tests']),
include_package_data=True,
zip_safe=True,
install_requires=[
"lxml==2.3.3",
"regex==0.1.20110315",
"chardet==1.0.1",
"dnspython==1.11.1",
"html2text",
"nose==1.2.1",
"mock",
"coverage"
]
)


def install_pyml():
    '''
    Downloads and installs PyML
    '''
    try:
        import PyML
    except ImportError:
        pass
    else:
        # PyML is already installed
        return

    # install numpy first
    pip('install numpy==1.6.1 --upgrade')

    pyml_tarball = (
        'http://09cce49df173f6f6e61f-fd6930021b51685920a6fa76529ee321'
        '.r45.cf2.rackcdn.com/PyML-0.7.9.tar.gz')
    pyml_srcdir = 'PyML-0.7.9'

    # see if PyML tarball needs to be fetched:
    if not dir_exists(pyml_srcdir):
        run("curl %s | tar -xz" % pyml_tarball)

    # compile&install:
    with cd(pyml_srcdir):
        python('setup.py build')
        python('setup.py install')


def run(command):
    if os.system(command) != 0:
        raise Exception("Failed '{}'".format(command))
    else:
        return 0


def python(command):
    command = '{} {}'.format(sys.executable, command)
    run(command)


def enforce_executable(name, install_info):
    if os.system("which {}".format(name)) != 0:
        raise Exception(
            '{} utility is missing.\nTo install, run:\n\n{}\n'.format(
                name, install_info))


def pip(command):
    command = '{} {}'.format(find_executable('pip'), command)
    run(command)


def dir_exists(path):
    return os.path.isdir(path)


@contextlib.contextmanager
def cd(directory):
    curdir = os.getcwd()
    try:
        os.chdir(directory)
        yield {}
    finally:
        # always return to the original directory
        os.chdir(curdir)


if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] in ['develop', 'install']:
        enforce_executable('curl', 'sudo aptitude install curl')
        install_pyml()

7
talon/__init__.py Normal file

@@ -0,0 +1,7 @@
from talon.quotations import register_xpath_extensions
from talon import signature


def init():
    register_xpath_extensions()
    signature.initialize()

4
talon/constants.py Normal file

@@ -0,0 +1,4 @@
import regex as re
RE_DELIMITER = re.compile('\r?\n')

174
talon/html_quotations.py Normal file

@@ -0,0 +1,174 @@
"""
The module's functions operate on message bodies trying to extract original
messages (without quoted messages) from html
"""
import regex as re
CHECKPOINT_PREFIX = '#!%!'
CHECKPOINT_SUFFIX = '!%!#'
CHECKPOINT_PATTERN = re.compile(CHECKPOINT_PREFIX + '\d+' + CHECKPOINT_SUFFIX)
# HTML quote indicators (tag ids)
QUOTE_IDS = ['OLK_SRC_BODY_SECTION']


def add_checkpoint(html_note, counter):
    """Recursively adds checkpoints to html tree.
    """
    if html_note.text:
        html_note.text = (html_note.text + CHECKPOINT_PREFIX +
                          str(counter) + CHECKPOINT_SUFFIX)
    else:
        html_note.text = (CHECKPOINT_PREFIX + str(counter) +
                          CHECKPOINT_SUFFIX)
    counter += 1

    for child in html_note.iterchildren():
        counter = add_checkpoint(child, counter)

    if html_note.tail:
        html_note.tail = (html_note.tail + CHECKPOINT_PREFIX +
                          str(counter) + CHECKPOINT_SUFFIX)
    else:
        html_note.tail = (CHECKPOINT_PREFIX + str(counter) +
                          CHECKPOINT_SUFFIX)
    counter += 1

    return counter


def delete_quotation_tags(html_note, counter, quotation_checkpoints):
    """Deletes tags with quotation checkpoints from html tree.
    """
    tag_in_quotation = True

    if quotation_checkpoints[counter]:
        html_note.text = ''
    else:
        tag_in_quotation = False
    counter += 1

    quotation_children = []  # Children tags which are in quotation.
    for child in html_note.iterchildren():
        counter, child_tag_in_quotation = delete_quotation_tags(
            child, counter,
            quotation_checkpoints
        )
        if child_tag_in_quotation:
            quotation_children.append(child)

    if quotation_checkpoints[counter]:
        html_note.tail = ''
    else:
        tag_in_quotation = False
    counter += 1

    if tag_in_quotation:
        return counter, tag_in_quotation
    else:
        # Remove quotation children.
        for child in quotation_children:
            html_note.remove(child)
        return counter, tag_in_quotation


def cut_gmail_quote(html_message):
    ''' Cuts the outermost block element with class gmail_quote. '''
    gmail_quote = html_message.cssselect('.gmail_quote')
    if gmail_quote:
        gmail_quote[0].getparent().remove(gmail_quote[0])
        return True


def cut_microsoft_quote(html_message):
    ''' Cuts splitter block and all following blocks. '''
    splitter = html_message.xpath(
        # outlook 2007, 2010
        "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
        "padding:3.0pt 0cm 0cm 0cm']|"
        # windows mail
        "//div[@style='padding-top: 5px; "
        "border-top-color: rgb(229, 229, 229); "
        "border-top-width: 1px; border-top-style: solid;']"
    )

    if splitter:
        splitter = splitter[0]
        # outlook 2010
        if splitter == splitter.getparent().getchildren()[0]:
            splitter = splitter.getparent()
    else:
        # outlook 2003
        splitter = html_message.xpath(
            "//div"
            "/div[@class='MsoNormal' and @align='center' "
            "and @style='text-align:center']"
            "/font"
            "/span"
            "/hr[@size='3' and @width='100%' and @align='center' "
            "and @tabindex='-1']"
        )
        if len(splitter):
            splitter = splitter[0]
            splitter = splitter.getparent().getparent()
            splitter = splitter.getparent().getparent()

    if len(splitter):
        parent = splitter.getparent()
        after_splitter = splitter.getnext()
        while after_splitter is not None:
            parent.remove(after_splitter)
            after_splitter = splitter.getnext()
        parent.remove(splitter)
        return True

    return False


def cut_by_id(html_message):
    found = False
    for quote_id in QUOTE_IDS:
        quote = html_message.cssselect('#{}'.format(quote_id))
        if quote:
            found = True
            quote[0].getparent().remove(quote[0])
    return found


def cut_blockquote(html_message):
    ''' Cuts blockquote with wrapping elements. '''
    quote = html_message.find('.//blockquote')
    if quote is not None:
        quote.getparent().remove(quote)
        return True


def cut_from_block(html_message):
    """Cuts div tag which wraps block starting with "From:"."""
    # handle the case when From: block is enclosed in some tag
    block = html_message.xpath(
        ("//*[starts-with(mg:text_content(), 'From:')]|"
         "//*[starts-with(mg:text_content(), 'Date:')]"))

    if block:
        block = block[-1]
        while block.getparent() is not None:
            if block.tag == 'div':
                block.getparent().remove(block)
                return True
            else:
                block = block.getparent()
    else:
        # handle the case when From: block goes right after e.g. <hr>
        # and not enclosed in some tag
        block = html_message.xpath(
            ("//*[starts-with(mg:tail(), 'From:')]|"
             "//*[starts-with(mg:tail(), 'Date:')]"))
        if block:
            block = block[0]
            while block.getnext() is not None:
                block.getparent().remove(block.getnext())
            block.getparent().remove(block)
            return True

376
talon/quotations.py Normal file

@@ -0,0 +1,376 @@
# -*- coding: utf-8 -*-
"""
The module's functions operate on message bodies trying to extract
original messages (without quoted messages)
"""
import regex as re
import logging
from copy import deepcopy
from lxml import html, etree
import html2text
from talon.constants import RE_DELIMITER
from talon.utils import random_token, get_delimiter
from talon import html_quotations
log = logging.getLogger(__name__)
RE_FWD = re.compile("^[-]+[ ]*Forwarded message[ ]*[-]+$", re.I | re.M)
RE_ON_DATE_SMB_WROTE = re.compile(
r'''
(
-* # could include dashes
[ ]?On[ ].*, # date part ends with comma
(.*\n){0,2} # splitter takes 4 lines at most
.*(wrote|sent):
)
''', re.VERBOSE)
RE_QUOTATION = re.compile(
r'''
(
# quotation border: splitter line or a number of quotation marker lines
(?:
s
|
(?:me*){2,}
)
# quotation lines could be marked as splitter or text, etc.
.*
# but we expect it to end with a quotation marker line
me*
)
# after quotations should be text only or nothing at all
[te]*$
''', re.VERBOSE)
RE_EMPTY_QUOTATION = re.compile(
r'''
(
# quotation border: splitter line or a number of quotation marker lines
(?:
s
|
(?:me*){2,}
)
)
e*
''', re.VERBOSE)
SPLITTER_PATTERNS = [
# ------Original Message------ or ---- Reply Message ----
re.compile("[\s]*[-]+[ ]*(Original|Reply) Message[ ]*[-]+", re.I),
# <date> <person>
re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.VERBOSE),
RE_ON_DATE_SMB_WROTE,
re.compile('(_+\r?\n)?[\s]*(:?[*]?From|Date):[*]? .*'),
re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?'
'( \S+){3,6}@\S+:')
]
RE_LINK = re.compile('<(http://[^>]*)>')
RE_NORMALIZED_LINK = re.compile('@@(http://[^>@]*)@@')
RE_PARANTHESIS_LINK = re.compile("\(https?://")
SPLITTER_MAX_LINES = 4
MAX_LINES_COUNT = 1000
QUOT_PATTERN = re.compile('^>+ ?')
NO_QUOT_LINE = re.compile('^[^>].*[\S].*')


def extract_from(msg_body, content_type='text/plain'):
    try:
        if content_type == 'text/plain':
            return extract_from_plain(msg_body)
        elif content_type == 'text/html':
            return extract_from_html(msg_body)
    except Exception:
        log.exception('ERROR extracting message')

    # fall back to the unmodified body on errors or unknown content types
    return msg_body


def mark_message_lines(lines):
    """Mark message lines with markers to distinguish quotation lines.

    Markers:

    * e - empty line
    * m - line that starts with quotation marker '>'
    * s - splitter line
    * f - '---- Forwarded message ----' line
    * t - presumably lines from the last message in the conversation

    >>> mark_message_lines(['answer', 'From: foo@bar.com', '', '> question'])
    'tsem'
    """
    markers = bytearray(len(lines))
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            markers[i] = 'e'  # empty line
        elif QUOT_PATTERN.match(lines[i]):
            markers[i] = 'm'  # line with quotation marker
        elif RE_FWD.match(lines[i]):
            markers[i] = 'f'  # ---- Forwarded message ----
        else:
            # in case splitter is spread across several lines
            splitter = is_splitter('\n'.join(lines[i:i + SPLITTER_MAX_LINES]))

            if splitter:
                # append as many splitter markers as lines in splitter
                splitter_lines = splitter.group().splitlines()
                for j in xrange(len(splitter_lines)):
                    markers[i + j] = 's'

                # skip splitter lines
                i += len(splitter_lines) - 1
            else:
                # probably the line from the last message in the conversation
                markers[i] = 't'
        i += 1

    return markers


def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
    """Run regexes against message's marked lines to strip quotations.

    Return only last message lines.
    >>> process_marked_lines(['Hello', 'From: foo@bar.com', '', '> Hi'], 'tsem')
    ['Hello']

    Also returns return_flags.
    return_flags = [were_lines_deleted, first_deleted_line,
                    last_deleted_line]
    """
    # if there is no splitter, a few stray quotation markers are treated
    # as plain text
    if 's' not in markers and not re.search('(me*){3}', markers):
        markers = markers.replace('m', 't')

    if re.match('[te]*f', markers):
        return_flags[:] = [False, -1, -1]
        return lines

    # inlined reply
    # use lookbehind assertions to find overlapping entries e.g. for 'mtmtm'
    # both 't' entries should be found
    for inline_reply in re.finditer('(?<=m)e*((?:t+e*)+)m', markers):
        # long links could break sequence of quotation lines but they shouldn't
        # be considered an inline reply
        links = (
            RE_PARANTHESIS_LINK.search(lines[inline_reply.start() - 1]) or
            RE_PARANTHESIS_LINK.match(lines[inline_reply.start()].strip()))
        if not links:
            return_flags[:] = [False, -1, -1]
            return lines

    # cut out text lines coming after splitter if there are no markers there
    quotation = re.search('(se*)+((t|f)+e*)+', markers)
    if quotation:
        return_flags[:] = [True, quotation.start(), len(lines)]
        return lines[:quotation.start()]

    # handle the case with markers
    quotation = (RE_QUOTATION.search(markers) or
                 RE_EMPTY_QUOTATION.search(markers))

    if quotation:
        return_flags[:] = True, quotation.start(1), quotation.end(1)
        return lines[:quotation.start(1)] + lines[quotation.end(1):]

    return_flags[:] = [False, -1, -1]
    return lines


def preprocess(msg_body, delimiter, content_type='text/plain'):
    """Prepares msg_body for being stripped.

    Replaces link brackets so that they can't be taken for a quotation marker.
    Splits a line in two if a splitter pattern is preceded by some text on
    the same line (done only for the 'On <date> <person> wrote:' pattern).
    """
    # normalize links i.e. replace '<', '>' wrapping the link with some symbols
    # so that '>' closing the link couldn't be mistakenly taken for quotation
    # marker.
    def link_wrapper(link):
        newline_index = msg_body[:link.start()].rfind("\n")
        if msg_body[newline_index + 1] == ">":
            return link.group()
        else:
            return "@@%s@@" % link.group(1)

    msg_body = re.sub(RE_LINK, link_wrapper, msg_body)

    def splitter_wrapper(splitter):
        """Wraps splitter with a new line"""
        if splitter.start() and msg_body[splitter.start() - 1] != '\n':
            return '%s%s' % (delimiter, splitter.group())
        else:
            return splitter.group()

    if content_type == 'text/plain':
        msg_body = re.sub(RE_ON_DATE_SMB_WROTE, splitter_wrapper, msg_body)

    return msg_body


def postprocess(msg_body):
    """Make up for changes done at preprocessing message.

    Replace link brackets back to '<' and '>'.
    """
    return re.sub(RE_NORMALIZED_LINK, r'<\1>', msg_body).strip()


def extract_from_plain(msg_body):
    """Extracts a non quoted message from provided plain text."""
    stripped_text = msg_body

    delimiter = get_delimiter(msg_body)
    msg_body = preprocess(msg_body, delimiter)
    lines = msg_body.splitlines()

    # don't process too long messages
    if len(lines) > MAX_LINES_COUNT:
        return stripped_text

    markers = mark_message_lines(lines)
    lines = process_marked_lines(lines, markers)

    # concatenate lines, change links back, strip and return
    msg_body = delimiter.join(lines)
    msg_body = postprocess(msg_body)
    return msg_body


def extract_from_html(msg_body):
    """
    Extract not quoted message from provided html message body
    using tags and plain text algorithm.

    Cut out the 'blockquote', 'gmail_quote' tags.
    Cut Microsoft quotations.

    Then use plain text algorithm to cut out splitter or
    leftover quotation.
    This works by adding checkpoint text to all html tags,
    then converting html to text,
    then extracting quotations from text,
    then checking deleted checkpoints,
    then deleting necessary tags.
    """
    if msg_body.strip() == '':
        return msg_body

    html_tree = html.document_fromstring(
        msg_body,
        parser=html.HTMLParser(encoding="utf-8")
    )

    cut_quotations = (html_quotations.cut_gmail_quote(html_tree) or
                      html_quotations.cut_blockquote(html_tree) or
                      html_quotations.cut_microsoft_quote(html_tree) or
                      html_quotations.cut_by_id(html_tree) or
                      html_quotations.cut_from_block(html_tree)
                      )

    html_tree_copy = deepcopy(html_tree)

    number_of_checkpoints = html_quotations.add_checkpoint(html_tree, 0)
    quotation_checkpoints = [False for i in xrange(number_of_checkpoints)]
    msg_with_checkpoints = html.tostring(html_tree)

    h = html2text.HTML2Text()
    h.body_width = 0  # generate plain text without wrap

    # html2text adds unnecessary star symbols. Remove them.
    # Mask star symbols
    msg_with_checkpoints = msg_with_checkpoints.replace('*', '3423oorkg432')
    plain_text = h.handle(msg_with_checkpoints)
    # Remove created star symbols
    plain_text = plain_text.replace('*', '')
    # Unmask saved star symbols
    plain_text = plain_text.replace('3423oorkg432', '*')

    delimiter = get_delimiter(plain_text)

    plain_text = preprocess(plain_text, delimiter, content_type='text/html')
    lines = plain_text.splitlines()

    # Don't process too long messages
    if len(lines) > MAX_LINES_COUNT:
        return msg_body

    # Collect checkpoints on each line
    line_checkpoints = [
        [int(i[4:-4])  # Only checkpoint number
         for i in re.findall(html_quotations.CHECKPOINT_PATTERN, line)]
        for line in lines]

    # Remove checkpoints
    lines = [re.sub(html_quotations.CHECKPOINT_PATTERN, '', line)
             for line in lines]

    # Use plain text quotation extracting algorithm
    markers = mark_message_lines(lines)
    return_flags = []
    process_marked_lines(lines, markers, return_flags)
    lines_were_deleted, first_deleted, last_deleted = return_flags

    if lines_were_deleted:
        # collect checkpoints from deleted lines
        for i in xrange(first_deleted, last_deleted):
            for checkpoint in line_checkpoints[i]:
                quotation_checkpoints[checkpoint] = True
    else:
        if cut_quotations:
            return html.tostring(html_tree_copy)
        else:
            return msg_body

    # Remove tags with quotation checkpoints
    html_quotations.delete_quotation_tags(
        html_tree_copy, 0, quotation_checkpoints
    )

    return html.tostring(html_tree_copy)


def is_splitter(line):
    '''
    Returns Matcher object if provided string is a splitter and
    None otherwise.
    '''
    for pattern in SPLITTER_PATTERNS:
        matcher = re.match(pattern, line)
        if matcher:
            return matcher


def text_content(context):
    '''XPath Extension function to return a node text content.'''
    return context.context_node.text_content().strip()


def tail(context):
    '''XPath Extension function to return a node tail text.'''
    return context.context_node.tail or ''


def register_xpath_extensions():
    ns = etree.FunctionNamespace("http://mailgun.net")
    ns.prefix = 'mg'
    ns['text_content'] = text_content
    ns['tail'] = tail


@@ -0,0 +1,48 @@
"""The package exploits machine learning for parsing message signatures.
The public interface consists of only one `extract` function:
>>> (body, signature) = extract(body, sender)
Where body is the original message `body` and `sender` corresponds to a person
who sent the message.
When the package is imported, classifier instances are loaded,
so each process will have its own classifiers in memory.
The import of the package and the call to the `extract` function should be
enclosed in a try-except block in case they fail.
.. warning:: When making changes to features or emails the classifier is
trained against, don't forget to regenerate:
* signature/data/train.data and
* signature/data/classifier
"""
import os
import sys
from cStringIO import StringIO
from . import extraction
from . extraction import extract
from . learning import classifier
DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')
EXTRACTOR_FILENAME = os.path.join(DATA_DIR, 'classifier')
EXTRACTOR_DATA = os.path.join(DATA_DIR, 'train.data')


def initialize():
    # redirect output while the classifier loads
    so, sys.stdout = sys.stdout, StringIO()
    try:
        extraction.EXTRACTOR = classifier.load(EXTRACTOR_FILENAME,
                                               EXTRACTOR_DATA)
    except Exception as e:
        raise Exception(
            "Failed initializing signature parsing with classifiers", e)
    finally:
        # restore stdout even if loading fails
        sys.stdout = so


@@ -0,0 +1,188 @@
import logging
import regex as re
from talon.utils import get_delimiter
from talon.signature.constants import (SIGNATURE_MAX_LINES,
TOO_LONG_SIGNATURE_LINE)
log = logging.getLogger(__name__)
# regex to fetch signature based on common signature words
RE_SIGNATURE = re.compile(r'''
(
(?:
^[\s]*--*[\s]*[a-z \.]*$
|
^thanks[\s,!]*$
|
^regards[\s,!]*$
|
^cheers[\s,!]*$
|
^best[ a-z]*[\s,!]*$
)
.*
)
''', re.I | re.X | re.M | re.S)
# signatures appended by phone email clients
RE_PHONE_SIGNATURE = re.compile(r'''
(
(?:
^sent[ ]{1}from[ ]{1}my[\s,!\w]*$
|
^sent[ ]from[ ]Mailbox[ ]for[ ]iPhone.*$
|
^sent[ ]([\S]*[ ])?from[ ]my[ ]BlackBerry.*$
|
^Enviado[ ]desde[ ]mi[ ]([\S]+[ ]){0,2}BlackBerry.*$
)
.*
)
''', re.I | re.X | re.M | re.S)
# see _mark_candidate_indexes() for details
# c - could be signature line
# d - line starts with dashes (could be signature or list item)
# l - long line
RE_SIGNATURE_CANDIDAATE = re.compile(r'''
(?P<candidate>c+d)[^d]
|
(?P<candidate>c+d)$
|
(?P<candidate>c+)
|
(?P<candidate>d)[^d]
|
(?P<candidate>d)$
''', re.I | re.X | re.M | re.S)
def extract_signature(msg_body):
'''
Analyzes the message for the presence of a signature block (by common
patterns) and returns a tuple with two elements: the message text without
the signature block and the signature itself.
>>> extract_signature('Hey man! How r u?\n\n--\nRegards,\nRoman')
('Hey man! How r u?', '--\nRegards,\nRoman')
>>> extract_signature('Hey man!')
('Hey man!', None)
'''
try:
# identify line delimiter first
delimiter = get_delimiter(msg_body)
# make an assumption
stripped_body = msg_body.strip()
phone_signature = None
# strip off phone signature; search the stripped body so that the match
# offsets line up with the slice below
phone_signature = RE_PHONE_SIGNATURE.search(stripped_body)
if phone_signature:
stripped_body = stripped_body[:phone_signature.start()]
phone_signature = phone_signature.group()
# decide on signature candidate
lines = stripped_body.splitlines()
candidate = get_signature_candidate(lines)
candidate = delimiter.join(candidate)
# try to extract signature
signature = RE_SIGNATURE.search(candidate)
if not signature:
return (stripped_body.strip(), phone_signature)
else:
signature = signature.group()
# splitlines() followed by join() can drop a trailing new line;
# this already happened when the candidate was identified,
# so do the same for stripped_body to keep the lengths consistent
stripped_body = delimiter.join(lines)
stripped_body = stripped_body[:-len(signature)]
if phone_signature:
signature = delimiter.join([signature, phone_signature])
return (stripped_body.strip(),
signature.strip())
except Exception, e:
log.exception('ERROR extracting signature')
return (msg_body, None)
def get_signature_candidate(lines):
"""Return lines that could hold signature
The lines should:
* be among last SIGNATURE_MAX_LINES non-empty lines.
* not include first line
* be shorter than TOO_LONG_SIGNATURE_LINE
* not include more than one line that starts with dashes
"""
# non empty lines indexes
non_empty = [i for i, line in enumerate(lines) if line.strip()]
# if message is empty or just one line then there is no signature
if len(non_empty) <= 1:
return []
# we don't expect signature to start at the 1st line
candidate = non_empty[1:]
# signature shouldn't be longer than SIGNATURE_MAX_LINES
candidate = candidate[-SIGNATURE_MAX_LINES:]
markers = _mark_candidate_indexes(lines, candidate)
candidate = _process_marked_candidate_indexes(candidate, markers)
# get actual lines for the candidate instead of indexes
if candidate:
candidate = lines[candidate[0]:]
return candidate
return []
def _mark_candidate_indexes(lines, candidate):
"""Mark candidate indexes with markers
Markers:
* c - line that could be a signature line
* l - long line
* d - line that starts with dashes but has other chars as well
>>> str(_mark_candidate_indexes(['Some text', '', '- item', 'Bob'], [0, 2, 3]))
'cdc'
"""
# at first consider everything to be potential signature lines
markers = bytearray('c'*len(candidate))
# mark lines starting from bottom up
for i, line_idx in reversed(list(enumerate(candidate))):
if len(lines[line_idx].strip()) > TOO_LONG_SIGNATURE_LINE:
markers[i] = 'l'
else:
line = lines[line_idx].strip()
if line.startswith('-') and line.strip("-"):
markers[i] = 'd'
return markers
def _process_marked_candidate_indexes(candidate, markers):
"""
Run regexes against candidate's marked indexes to strip
signature candidate.
>>> _process_marked_candidate_indexes([9, 12, 14, 15, 17], 'clddc')
[15, 17]
"""
match = RE_SIGNATURE_CANDIDAATE.match(markers[::-1])
return candidate[-match.end('candidate'):] if match else []

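The marking step above can be sketched in Python 3 as follows. This is an illustrative rewrite, not the module's code: it builds a plain `str` instead of a `bytearray`, and `mark_candidate_indexes` is a hypothetical name. Note that a line consisting only of dashes stays `c`, because `line.strip('-')` is empty for it.

```python
TOO_LONG_SIGNATURE_LINE = 60  # same value as in signature/constants.py


def mark_candidate_indexes(lines, candidate):
    """Return one marker per candidate index: 'c', 'd' or 'l'."""
    markers = []
    for idx in candidate:
        line = lines[idx].strip()
        if len(line) > TOO_LONG_SIGNATURE_LINE:
            markers.append('l')  # long line
        elif line.startswith('-') and line.strip('-'):
            markers.append('d')  # dashes followed by other characters
        else:
            markers.append('c')  # could be a signature line
    return ''.join(markers)


dash_only = mark_candidate_indexes(['Some text', '', '-', 'Bob'], [0, 2, 3])
dash_item = mark_candidate_indexes(['Some text', '', '- item', 'Bob'], [0, 2, 3])
```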

@@ -0,0 +1,2 @@
SIGNATURE_MAX_LINES = 11
TOO_LONG_SIGNATURE_LINE = 60

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large


@@ -0,0 +1,116 @@
# -*- coding: utf-8 -*-
import os
import logging
import regex as re
from PyML import SparseDataSet
from talon.constants import RE_DELIMITER
from talon.signature.constants import (SIGNATURE_MAX_LINES,
TOO_LONG_SIGNATURE_LINE)
from talon.signature.learning.featurespace import features, build_pattern
from talon.utils import get_delimiter
from talon.signature.bruteforce import get_signature_candidate
from talon.signature.learning.helpers import has_signature
log = logging.getLogger(__name__)
EXTRACTOR = None
# regex signature pattern for reversed lines
# assumes that all long lines have been excluded
RE_REVERSE_SIGNATURE = re.compile(r'''
# signature should consist of blocks like this
(?:
# it could end with empty line
e*
# there could be text lines but no more than 2 in a row
(te*){,2}
# every block should end with signature line
s
)+
''', re.I | re.X | re.M | re.S)
def is_signature_line(line, sender, classifier):
'''Checks if the line belongs to signature. Returns True or False.'''
data = SparseDataSet([build_pattern(line, features(sender))])
return classifier.decisionFunc(data, 0) > 0
def extract(body, sender):
"""Strips signature from the body of the message.
Returns stripped body and signature as a tuple.
If no signature is found the corresponding returned value is None.
"""
try:
delimiter = get_delimiter(body)
body = body.strip()
if has_signature(body, sender):
lines = body.splitlines()
markers = _mark_lines(lines, sender)
text, signature = _process_marked_lines(lines, markers)
if signature:
text = delimiter.join(text)
if text.strip():
return (text, delimiter.join(signature))
except Exception, e:
log.exception('ERROR when extracting signature with classifiers')
return (body, None)
def _mark_lines(lines, sender):
"""Mark message lines with markers to distinguish signature lines.
Markers:
* e - empty line
* s - line identified as signature
* t - other i.e. ordinary text line
>>> _mark_lines(['Some text', '', 'Bob'], 'Bob')
'tes'
"""
global EXTRACTOR
candidate = get_signature_candidate(lines)
# at first consider everything to be text, not signature
markers = bytearray('t'*len(lines))
# mark lines starting from bottom up
# mark only lines that belong to candidate
# no need to mark all lines of the message
for i, line in reversed(list(enumerate(candidate))):
# markers correspond to lines not candidate
# so we need to recalculate our index to be
# relative to lines not candidate
j = len(lines) - len(candidate) + i
if not line.strip():
markers[j] = 'e'
elif is_signature_line(line, sender, EXTRACTOR):
markers[j] = 's'
return markers
def _process_marked_lines(lines, markers):
"""Run regexes against message's marked lines to strip signature.
>>> _process_marked_lines(['Some text', '', 'Bob'], 'tes')
(['Some text', ''], ['Bob'])
"""
# reverse lines and match signature pattern for reversed lines
signature = RE_REVERSE_SIGNATURE.match(markers[::-1])
if signature:
return (lines[:-signature.end()], lines[-signature.end():])
return (lines, None)

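The reversed-marker matching used by `_process_marked_lines` can be tried in isolation. The Python 3 sketch below uses the standard `re` module rather than `regex`, writes the pattern without verbose-mode comments, and takes the marker string as given instead of calling the classifier; `process_marked_lines` is an illustrative name.

```python
import re

# Blocks read bottom-up: optional empty lines, at most two text lines
# (each optionally followed by empty lines), then a signature line.
RE_REVERSE_SIGNATURE = re.compile(r'(?:e*(?:te*){0,2}s)+')


def process_marked_lines(lines, markers):
    """Split lines into (text, signature) using the reversed marker string."""
    match = RE_REVERSE_SIGNATURE.match(markers[::-1])
    if match:
        # match.end() is the number of trailing lines in the signature
        return lines[:-match.end()], lines[-match.end():]
    return lines, None


text, signature = process_marked_lines(['Some text', '', 'Bob'], 'tes')
```

Because every block must end with an `s` marker, a message whose last lines are all plain text yields no match and is returned unchanged.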

@@ -0,0 +1,36 @@
# -*- coding: utf-8 -*-
"""The module's functions can init, train, save and load a classifier.
The classifier can be used to detect whether a certain line of the message
body belongs to the signature.
"""
import os
import sys
from PyML import SparseDataSet, SVM
def init():
'''Inits classifier with optimal options.'''
return SVM(C=10, optimization='liblinear')
def train(classifier, train_data_filename, save_classifier_filename=None):
'''Trains and saves classifier so that it could be easily loaded later.'''
data = SparseDataSet(train_data_filename, labelsColumn=-1)
classifier.train(data)
if save_classifier_filename:
classifier.save(save_classifier_filename)
return classifier
def load(saved_classifier_filename, train_data_filename):
"""Loads saved classifier.
Classifier should be loaded with the same data it was trained against
"""
train_data = SparseDataSet(train_data_filename, labelsColumn=-1)
classifier = init()
classifier.load(saved_classifier_filename, train_data)
return classifier


@@ -0,0 +1,161 @@
# -*- coding: utf-8 -*-
"""The module's functions build datasets to train/assess classifiers.
For signature detection the input should be a folder with two directories
that contain emails with and without signatures.
For signature extraction the input should be a folder with annotated emails.
To indicate that a line is a signature line use #sig# at the start of the line.
A sender of an email could be specified in the same file as
the message body e.g. when .eml format is used or in a separate file.
In the latter case it is assumed that a body filename ends with the `_body`
suffix and the corresponding sender file has the same name except for the
suffix which should be `_sender`.
"""
import os
import regex as re
from talon.signature.constants import SIGNATURE_MAX_LINES
from talon.signature.learning.featurespace import build_pattern, features
SENDER_SUFFIX = '_sender'
BODY_SUFFIX = '_body'
SIGNATURE_ANNOTATION = '#sig#'
REPLY_ANNOTATION = '#reply#'
ANNOTATIONS = [SIGNATURE_ANNOTATION, REPLY_ANNOTATION]
def is_sender_filename(filename):
"""Checks if the file could contain message sender's name."""
return filename.endswith(SENDER_SUFFIX)
def build_sender_filename(msg_filename):
"""Given the message filename, returns the expected sender's filename."""
return msg_filename[:-len(BODY_SUFFIX)] + SENDER_SUFFIX
def parse_msg_sender(filename, sender_known=True):
"""Given a filename returns the sender and the message.
Here the message is assumed to be a whole MIME message or just
message body.
>>> sender, msg = parse_msg_sender('msg.eml')
>>> sender, msg = parse_msg_sender('msg_body')
If you don't want to consider the sender's name in your classification
algorithm:
>>> parse_msg_sender(filename, False)
"""
sender, msg = None, None
if os.path.isfile(filename) and not is_sender_filename(filename):
with open(filename) as f:
msg = f.read()
sender = u''
if sender_known:
sender_filename = build_sender_filename(filename)
if os.path.exists(sender_filename):
with open(sender_filename) as sender_file:
sender = sender_file.read().strip()
else:
# if sender isn't found then the next line fails
# and it is ok
lines = msg.splitlines()
for line in lines:
match = re.match('From:(.*)', line)
if match:
sender = match.group(1)
break
return (sender, msg)
def build_detection_class(folder, dataset_filename,
label, sender_known=True):
"""Builds signature detection class.
Signature detection dataset includes patterns for two classes:
* class for positive patterns (goes with label 1)
* class for negative patterns (goes with label -1)
The patterns are built from emails in `folder` and appended to
the dataset file.
>>> build_detection_class('emails/P', 'train.data', 1)
"""
with open(dataset_filename, 'a') as dataset:
for filename in os.listdir(folder):
filename = os.path.join(folder, filename)
sender, msg = parse_msg_sender(filename, sender_known)
if sender is None or msg is None:
continue
msg = re.sub('|'.join(ANNOTATIONS), '', msg)
X = build_pattern(msg, features(sender))
X.append(label)
labeled_pattern = ','.join([str(e) for e in X])
dataset.write(labeled_pattern + '\n')
def build_detection_dataset(folder, dataset_filename,
sender_known=True):
"""Builds signature detection dataset using emails from folder.
folder should have the following structure:
x-- folder
| x-- P
| | | -- positive sample email 1
| | | -- positive sample email 2
| | | -- ...
| x-- N
| | | -- negative sample email 1
| | | -- negative sample email 2
| | | -- ...
If the dataset file already exists, it is rewritten.
"""
if os.path.exists(dataset_filename):
os.remove(dataset_filename)
# pass sender_known through so the caller's choice is not ignored
build_detection_class(os.path.join(folder, u'P'),
dataset_filename, 1, sender_known)
build_detection_class(os.path.join(folder, u'N'),
dataset_filename, -1, sender_known)
def build_extraction_dataset(folder, dataset_filename,
sender_known=True):
"""Builds signature extraction dataset using emails in the `folder`.
The emails in the `folder` should be annotated i.e. signature lines
should be marked with `#sig#`.
"""
if os.path.exists(dataset_filename):
os.remove(dataset_filename)
with open(dataset_filename, 'a') as dataset:
for filename in os.listdir(folder):
filename = os.path.join(folder, filename)
sender, msg = parse_msg_sender(filename, sender_known)
if not sender or not msg:
continue
lines = msg.splitlines()
for i in xrange(1, min(SIGNATURE_MAX_LINES,
len(lines)) + 1):
line = lines[-i]
label = -1
if line[:len(SIGNATURE_ANNOTATION)] == \
SIGNATURE_ANNOTATION:
label = 1
line = line[len(SIGNATURE_ANNOTATION):]
elif line[:len(REPLY_ANNOTATION)] == REPLY_ANNOTATION:
line = line[len(REPLY_ANNOTATION):]
X = build_pattern(line, features(sender))
X.append(label)
labeled_pattern = ','.join([str(e) for e in X])
dataset.write(labeled_pattern + '\n')

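The per-line labelling done inside `build_extraction_dataset` can be shown on its own. The Python 3 sketch below strips the annotation and returns the label; `label_line` is a hypothetical helper introduced only for this illustration.

```python
SIGNATURE_ANNOTATION = '#sig#'
REPLY_ANNOTATION = '#reply#'


def label_line(line):
    """Return (line without annotation, label); label is 1 for signature lines."""
    if line.startswith(SIGNATURE_ANNOTATION):
        return line[len(SIGNATURE_ANNOTATION):], 1
    if line.startswith(REPLY_ANNOTATION):
        # reply lines are stripped of the marker but stay negative samples
        return line[len(REPLY_ANNOTATION):], -1
    return line, -1


annotated = ["Hi there", "#reply#> earlier text", "#sig#--", "#sig#Bob"]
labeled = [label_line(line) for line in annotated]
```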

@@ -0,0 +1,73 @@
# -*- coding: utf-8 -*-
""" The module provides functions for converting a message body/body lines
into the classifier's feature space.
The body and the message sender string are converted into unicode before
applying features to them.
"""
from talon.signature.constants import SIGNATURE_MAX_LINES
from talon.signature.learning.helpers import *
def features(sender=''):
'''Returns a list of signature features.'''
return [
# This one isn't from paper.
# Meant to match companies names, sender's names, address.
many_capitalized_words,
# This one is not from paper.
# Line is too long.
# This one is less aggressive than `Line is too short`
lambda line: 1 if len(line) > 60 else 0,
# Line contains email pattern.
binary_regex_search(RE_EMAIL),
# Line contains url.
binary_regex_search(RE_URL),
# Line contains phone number pattern.
binary_regex_search(RE_RELAX_PHONE),
# Line matches the regular expression "^[\s]*---*[\s]*$".
binary_regex_match(RE_SEPARATOR),
# Line has a sequence of 10 or more special characters.
binary_regex_search(RE_SPECIAL_CHARS),
# Line contains any typical signature words.
binary_regex_search(RE_SIGNATURE_WORDS),
# Line contains a pattern like Vitor R. Carvalho or William W. Cohen.
binary_regex_search(RE_NAME),
# Percentage of punctuation symbols in the line is larger than 50%
lambda line: 1 if punctuation_percent(line) > 50 else 0,
# Percentage of punctuation symbols in the line is larger than 90%
lambda line: 1 if punctuation_percent(line) > 90 else 0,
contains_sender_names(sender)
]
def apply_features(body, features):
'''Applies features to message body lines.
Returns a list of lists. Each inner list corresponds to a body line
and holds the feature occurrences (0 or 1) for that line.
E.g. if element j of list i equals 1 this means that
feature j occurred in line i (counting from the last line of the body).
'''
# collect all non empty lines
lines = [line for line in body.splitlines() if line.strip()]
# take the last SIGNATURE_MAX_LINES
last_lines = lines[-SIGNATURE_MAX_LINES:]
# apply features, fallback to zeros
return ([[f(line) for f in features] for line in last_lines] or
[[0 for f in features]])
def build_pattern(body, features):
'''Converts body into a pattern i.e. a point in the features space.
Applies features to the body lines and sums up the results.
Elements of the pattern indicate how many times a certain feature occurred
in the last lines of the body.
'''
line_patterns = apply_features(body, features)
return reduce(lambda x, y: [i + j for i, j in zip(x, y)], line_patterns)

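How `apply_features` and `build_pattern` turn a body into a point in the feature space can be seen with a reduced feature list. The Python 3 sketch below uses three simplified stand-in features (the real list is in `features()` above) and imports `reduce` from `functools`, which Python 2 provides as a builtin.

```python
import re
from functools import reduce

SIGNATURE_MAX_LINES = 11  # same value as in signature/constants.py


def demo_features():
    # simplified stand-ins for illustration only
    return [
        lambda line: 1 if '@' in line else 0,                      # email-like
        lambda line: 1 if re.search(r'https?://', line) else 0,    # url
        lambda line: 1 if len(line) > 60 else 0,                   # long line
    ]


def apply_features(body, feats):
    lines = [line for line in body.splitlines() if line.strip()]
    last_lines = lines[-SIGNATURE_MAX_LINES:]
    # fall back to a single all-zero row for an empty body
    return ([[f(line) for f in feats] for line in last_lines] or
            [[0 for f in feats]])


def build_pattern(body, feats):
    # element-wise sum of the per-line feature rows
    return reduce(lambda x, y: [i + j for i, j in zip(x, y)],
                  apply_features(body, feats))


pattern = build_pattern("Hi\n\nBob\nbob@example.com\nhttp://example.com",
                        demo_features())
```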

@@ -0,0 +1,233 @@
# -*- coding: utf-8 -*-
""" The module provides:
* functions used when evaluating signature's features
* regexp constants used when evaluating signature's features
"""
import unicodedata
import regex as re
from talon.utils import to_unicode
from talon.signature.constants import SIGNATURE_MAX_LINES
rc = re.compile
RE_EMAIL = rc('@')
RE_RELAX_PHONE = rc('.*(\(? ?[\d]{2,3} ?\)?.{,3}){2,}')
RE_URL = rc(r'''https?://|www\.[\S]+\.[\S]''')
# Taken from:
# http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
# Line matches the regular expression "^[\s]*---*[\s]*$".
RE_SEPARATOR = rc('^[\s]*---*[\s]*$')
# Taken from:
# http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
# Line has a sequence of 10 or more special characters.
RE_SPECIAL_CHARS = rc(('^[\s]*([\*]|#|[\+]|[\^]|-|[\~]|[\&]|[\$]|_|[\!]|'
'[\/]|[\%]|[\:]|[\=]){10,}[\s]*$'))
RE_SIGNATURE_WORDS = rc(('(T|t)hank.*,|(B|b)est|(R|r)egards|'
'^sent[ ]{1}from[ ]{1}my[\s,!\w]*$|BR|(S|s)incerely|'
'(C|c)orporation|Group'))
# Taken from:
# http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
# Line contains a pattern like Vitor R. Carvalho or William W. Cohen.
RE_NAME = rc('[A-Z][a-z]+\s\s?[A-Z][\.]?\s\s?[A-Z][a-z]+')
# Pattern to match if e.g. 'Sender:' header field has sender names.
SENDER_WITH_NAME_PATTERN = '([\s]*[\S]+,?)+[\s]*<.*>.*'
RE_SENDER_WITH_NAME = rc(SENDER_WITH_NAME_PATTERN)
# Reply line clue line endings, as in regular expression:
# " wrote:$" or " writes:$"
RE_CLUE_LINE_END = rc('.*(W|w)r(ote|ites):$')
INVALID_WORD_START = rc('\(|\+|[\d]')
BAD_SENDER_NAMES = [
# known mail domains
'hotmail', 'gmail', 'yandex', 'mail', 'yahoo', 'mailgun', 'mailgunhq',
'example',
# first level domains
'com', 'org', 'net', 'ru',
# bad words
'mailto'
]
def binary_regex_search(prog):
'''Returns a function that returns 1 or 0 depending on regex search result.
If regular expression compiled into prog is present in a string
the result of calling the returned function with the string will be 1
and 0 otherwise.
>>> import regex as re
>>> binary_regex_search(re.compile("12"))("12")
1
>>> binary_regex_search(re.compile("12"))("34")
0
'''
return lambda s: 1 if prog.search(s) else 0
def binary_regex_match(prog):
'''Returns a function that returns 1 or 0 depending on regex match result.
If a string matches regular expression compiled into prog
the result of calling the returned function with the string will be 1
and 0 otherwise.
>>> import regex as re
>>> binary_regex_match(re.compile("12"))("12 3")
1
>>> binary_regex_match(re.compile("12"))("3 12")
0
'''
return lambda s: 1 if prog.match(s) else 0
def flatten_list(list_to_flatten):
"""Simple list comprehension to flatten a list.
>>> flatten_list([[1, 2], [3, 4, 5]])
[1, 2, 3, 4, 5]
>>> flatten_list([[1], [[2]]])
[1, [2]]
>>> flatten_list([1, [2]])
Traceback (most recent call last):
...
TypeError: 'int' object is not iterable
"""
return [e for sublist in list_to_flatten for e in sublist]
def contains_sender_names(sender):
'''Returns a function that searches for the sender's name or a part of it.
>>> feature = contains_sender_names("Sergey N. Obukhov <xxx@example.com>")
>>> feature("Sergey Obukhov")
1
>>> feature("BR, Sergey N.")
1
>>> feature("Sergey")
1
>>> contains_sender_names("<serobnic@mail.ru>")("Serobnic")
1
>>> contains_sender_names("<serobnic@mail.ru>")("serobnic")
1
'''
names = '( |$)|'.join(flatten_list([[e, e.capitalize()]
for e in extract_names(sender)]))
names = names or sender
if names != '':
return binary_regex_search(re.compile(names))
return lambda s: 0
def extract_names(sender):
"""Tries to extract sender's names from `From:` header.
It could extract not only the actual names but e.g.
the name of the company, parts of email, etc.
>>> extract_names('Sergey N. Obukhov <serobnic@mail.ru>')
['Sergey', 'Obukhov', 'serobnic']
>>> extract_names('')
[]
"""
sender = to_unicode(sender)
# Remove non-alphabetical characters
sender = "".join([char if char.isalpha() else ' ' for char in sender])
# Remove too short words and words from "black" list i.e.
# words like `ru`, `gmail`, `com`, `org`, etc.
sender = [word for word in sender.split() if len(word) > 1 and
not word in BAD_SENDER_NAMES]
# Remove duplicates
names = list(set(sender))
return names
def categories_percent(s, categories):
'''Returns the percentage of characters that belong to the given categories.
>>> categories_percent("qqq ggg hhh", ["Po"])
0.0
>>> categories_percent("q,w.", ["Po"])
50.0
>>> categories_percent("qqq ggg hhh", ["Nd"])
0.0
>>> categories_percent("q5", ["Nd"])
50.0
>>> categories_percent("s.s,5s", ["Po", "Nd"])
50.0
'''
count = 0
s = to_unicode(s)
for c in s:
if unicodedata.category(c) in categories:
count += 1
return 100 * float(count) / len(s) if len(s) else 0
def punctuation_percent(s):
'''Returns punctuation percent.
>>> punctuation_percent("qqq ggg hhh")
0.0
>>> punctuation_percent("q,w.")
50.0
'''
return categories_percent(s, ['Po'])
def capitalized_words_percent(s):
'''Returns capitalized words percent.'''
s = to_unicode(s)
words = re.split('\s', s)
words = [w for w in words if w.strip()]
capitalized_words_counter = 0
valid_words_counter = 0
for word in words:
if not INVALID_WORD_START.match(word):
valid_words_counter += 1
if word[0].isupper():
capitalized_words_counter += 1
if valid_words_counter > 0 and len(words) > 1:
return 100 * float(capitalized_words_counter) / valid_words_counter
return 0
def many_capitalized_words(s):
"""Checks the percentage of capitalized words in the string.
Returns 1 if the percentage is greater than 66% and 0 otherwise.
"""
return 1 if capitalized_words_percent(s) > 66 else 0
def has_signature(body, sender):
'''Checks if the body has signature. Returns True or False.'''
non_empty = [line for line in body.splitlines() if line.strip()]
candidate = non_empty[-SIGNATURE_MAX_LINES:]
upvotes = 0
for line in candidate:
# we check lines for sender's name, phone, email and url;
# such signature lines are not longer than 27 characters
if len(line.strip()) > 27:
continue
elif contains_sender_names(sender)(line):
return True
elif (binary_regex_search(RE_RELAX_PHONE)(line) +
binary_regex_search(RE_EMAIL)(line) +
binary_regex_search(RE_URL)(line) == 1):
upvotes += 1
if upvotes > 1:
return True

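The category-based percentage above relies on `unicodedata.category`. A Python 3 sketch of the same computation follows (no `to_unicode` call is needed because Python 3 strings are already Unicode). One consequence worth noting: the hyphen belongs to category `Pd`, not `Po`, so a line of dashes has a punctuation percentage of zero under this definition.

```python
import unicodedata


def categories_percent(s, categories):
    """Percentage of characters in s whose Unicode category is in categories."""
    if not s:
        return 0.0
    count = sum(1 for c in s if unicodedata.category(c) in categories)
    return 100.0 * count / len(s)


def punctuation_percent(s):
    # 'Po' is the Unicode category "Punctuation, other" (e.g. ',', '.', '!')
    return categories_percent(s, ['Po'])


mixed = categories_percent("s.s,5s", ["Po", "Nd"])
```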
76
talon/utils.py Normal file

@@ -0,0 +1,76 @@
# coding:utf-8
import logging
from random import shuffle
from talon.constants import RE_DELIMITER
log = logging.getLogger(__name__)
def safe_format(format_string, *args, **kwargs):
"""
Helper: formats string with any combination of bytestrings/unicode
strings without raising exceptions
"""
try:
if not args and not kwargs:
return format_string
else:
return format_string.format(*args, **kwargs)
# catch encoding errors and transform everything into utf-8 string
# before logging:
except (UnicodeEncodeError, UnicodeDecodeError):
format_string = to_utf8(format_string)
args = [to_utf8(p) for p in args]
kwargs = {k: to_utf8(v) for k, v in kwargs.iteritems()}
return format_string.format(*args, **kwargs)
# ignore other errors
except:
return u''
def to_unicode(str_or_unicode, precise=False):
"""
Safely returns a unicode version of a given string
>>> utils.to_unicode('привет')
u'привет'
>>> utils.to_unicode(u'привет')
u'привет'
If `precise` flag is True, tries to guess the correct encoding first.
"""
encoding = detect_encoding(str_or_unicode) if precise else 'utf-8'
if isinstance(str_or_unicode, str):
return unicode(str_or_unicode, encoding, 'replace')
return str_or_unicode
def to_utf8(str_or_unicode):
"""
Safely returns a UTF-8 version of a given string
>>> utils.to_utf8(u'hi')
'hi'
"""
if isinstance(str_or_unicode, unicode):
return str_or_unicode.encode("utf-8", "ignore")
return str(str_or_unicode)
def random_token(length=7):
vals = ("a b c d e f g h i j k l m n o p q r s t u v w x y z "
"0 1 2 3 4 5 6 7 8 9").split(' ')
shuffle(vals)
return ''.join(vals[:length])
def get_delimiter(msg_body):
delimiter = RE_DELIMITER.search(msg_body)
if delimiter:
delimiter = delimiter.group()
else:
delimiter = '\n'
return delimiter

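`get_delimiter` depends on `RE_DELIMITER` from `talon.constants`, which is not shown in this diff. The Python 3 sketch below assumes a pattern that matches either `\r\n` or `\n`; that pattern is an assumption made for illustration and may differ from the real constant.

```python
import re

# Assumption: the real RE_DELIMITER lives in talon/constants.py and is not
# visible here; this pattern is a plausible stand-in.
RE_DELIMITER = re.compile(r'\r?\n')


def get_delimiter(msg_body):
    """Return the first line delimiter found in the body, or '\\n' if none."""
    match = RE_DELIMITER.search(msg_body)
    return match.group() if match else '\n'


windows_style = get_delimiter("Hi\r\nBob")
```

Detecting the delimiter first lets the extraction code call `splitlines()` and later re-join the lines with the same line ending the message arrived with.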
18
tests/__init__.py Normal file

@@ -0,0 +1,18 @@
from nose.tools import *
from mock import *
import talon
EML_MSG_FILENAME = "tests/fixtures/standard_replies/yahoo.eml"
MSG_FILENAME_WITH_BODY_SUFFIX = ("tests/fixtures/signature/emails/P/"
"johndoeexamplecom_body")
EMAILS_DIR = "tests/fixtures/signature/emails"
TMP_DIR = "tests/fixtures/signature/tmp"
STRIPPED = "tests/fixtures/signature/emails/stripped/"
UNICODE_MSG = ("tests/fixtures/signature/emails/P/"
"unicode_msg")
talon.init()


@@ -0,0 +1,16 @@
<html>
<body>
<div>Reply</div>
<span id="OLK_SRC_BODY_SECTION">
<div>
<span>From: </span>Bob &lt;<a href="mailto:bob@example.com">bob@example.com</a>&gt;<br>
<span>Date: </span>Tue, 01 Nov 2011 18:54:39 -0700<br>
<span>To: </span>Rob &lt;<a href="mailto:rob@example.com">rob@example.com</a>&gt;<br>
<span>Subject: </span>Test<br>
</div>
<div>
Hi
</div>
</span>
</body>
</html>

10
tests/fixtures/__init__.py vendored Normal file

@@ -0,0 +1,10 @@
STANDARD_REPLIES = "tests/fixtures/standard_replies"
with open("tests/fixtures/reply-quotations-share-block.eml") as f:
REPLY_QUOTATIONS_SHARE_BLOCK = f.read()
with open("tests/fixtures/OLK_SRC_BODY_SECTION.html") as f:
OLK_SRC_BODY_SECTION = f.read()
with open("tests/fixtures/reply-separated-by-hr.html") as f:
REPLY_SEPARATED_BY_HR = f.read()


@@ -0,0 +1,6 @@
<div dir="ltr"><div class="gmail_default"><div class="gmail_default" style>Hi. I am fine.</div><div class="gmail_default" style><br></div><div class="gmail_default" style>Thanks,</div><div class="gmail_default" style>Alex</div>
</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Jun 26, 2014 at 2:14 PM, Alexander L <span dir="ltr">&lt;<a href="mailto:abc@example.com" target="_blank">a@example.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-size:small"><div class="gmail_default" style="font-family:arial,sans-serif">
Hello! How are you?</div><div class="gmail_default" style="font-family:arial,sans-serif"><br>
</div><div class="gmail_default" style="font-family:arial,sans-serif">Thanks,</div><div class="gmail_default" style="font-family:arial,sans-serif">Sasha.</div></div></div>
</blockquote></div><br></div>


@@ -0,0 +1,17 @@
<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 12pt;
font-family:Calibri
}
--></style></head>
<body class='hmmessage'><div dir='ltr'>Hi. I am fine.<div><br></div><div>Thanks,</div><div>Alex<br><br><div><hr id="stopSpelling">Date: Thu, 26 Jun 2014 13:53:45 +0400<br>Subject: Test message<br>From: abc@example.com<br>To: alex.l@example.com<br><br><div dir="ltr"><div class="ecxgmail_default" style="font-size:small;">Hello! How are you?</div><div class="ecxgmail_default" style="font-size:small;"><br></div><div class="ecxgmail_default" style="font-size:small;">Thanks,</div><div class="ecxgmail_default" style="font-size:small;">
Sasha.</div></div></div></div> </div></body>
</html>


@@ -0,0 +1,57 @@
<HTML><BODY><p>Hi. I am fine.</p><p>Thanks,<br>Alex</p><br><br><br>Thu, 26 Jun 2014 14:00:51 +0400 от Alexander L &lt;abc@example.com&gt;:<br>
<blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;">
<div id="">
<div class="js-helper js-readmsg-msg">
<style type="text/css"></style>
<div>
<base target="_self" href="https://e.mail.ru/">
<div id="style_14037768550000001020_BODY"><div dir="ltr"><div style="font-size:small"><div style="font-family:arial,sans-serif">Hello! How are you?</div><div style="font-family:arial,sans-serif"><br>
</div><div style="font-family:arial,sans-serif">Thanks,</div><div style="font-family:arial,sans-serif">Sasha.</div></div></div>
</div>
<base target="_self" href="https://e.mail.ru/">
</div>
</div>
</div>
</blockquote>
<br></BODY></HTML>


@@ -0,0 +1,134 @@
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 11 (filtered medium)">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-reply;
font-family:Arial;
color:navy;}
@page Section1
{size:595.3pt 841.9pt;
margin:2.0cm 42.5pt 2.0cm 3.0cm;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang=RU link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal><font size=2 color=navy face=Arial><span lang=EN-US
style='font-size:10.0pt;font-family:Arial;color:navy'>Hi. I am fine.<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 color=navy face=Arial><span lang=EN-US
style='font-size:10.0pt;font-family:Arial;color:navy'><o:p>&nbsp;</o:p></span></font></p>
<p class=MsoNormal><font size=2 color=navy face=Arial><span lang=EN-US
style='font-size:10.0pt;font-family:Arial;color:navy'>Thanks,<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 color=navy face=Arial><span lang=EN-US
style='font-size:10.0pt;font-family:Arial;color:navy'>Alex<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 color=navy face=Arial><span style='font-size:
10.0pt;font-family:Arial;color:navy'><o:p>&nbsp;</o:p></span></font></p>
<div>
<div class=MsoNormal align=center style='text-align:center'><font size=3
face="Times New Roman"><span lang=EN-US style='font-size:12.0pt'>
<hr size=3 width="100%" align=center tabindex=-1>
</span></font></div>
<p class=MsoNormal><b><font size=2 face=Tahoma><span lang=EN-US
style='font-size:10.0pt;font-family:Tahoma;font-weight:bold'>From:</span></font></b><font
size=2 face=Tahoma><span lang=EN-US style='font-size:10.0pt;font-family:Tahoma'>
Alexander L [mailto:abc@example.com] <br>
<b><span style='font-weight:bold'>Sent:</span></b> Friday, June 27, 2014 12:06
PM<br>
<b><span style='font-weight:bold'>To:</span></b> Alexander<br>
<b><span style='font-weight:bold'>Subject:</span></b> Test message</span></font><span
lang=EN-US><o:p></o:p></span></p>
</div>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'><o:p>&nbsp;</o:p></span></font></p>
<div>
<div>
<div>
<p class=MsoNormal><font size=3 face=Arial><span style='font-size:12.0pt;
font-family:Arial'>Hello! How are you?<o:p></o:p></span></font></p>
</div>
<div>
<p class=MsoNormal><font size=3 face=Arial><span style='font-size:12.0pt;
font-family:Arial'><o:p>&nbsp;</o:p></span></font></p>
</div>
<div>
<p class=MsoNormal><font size=3 face=Arial><span style='font-size:12.0pt;
font-family:Arial'>Thanks,<o:p></o:p></span></font></p>
</div>
<div>
<p class=MsoNormal><font size=3 face=Arial><span style='font-size:12.0pt;
font-family:Arial'>Sasha.<o:p></o:p></span></font></p>
</div>
</div>
</div>
</div>
</body>
</html>


@@ -0,0 +1,42 @@
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-reply;
font-family:"Calibri","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:2.0cm 42.5pt 2.0cm 3.0cm;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Hi. I am fine.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'> <o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Thanks,<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Alex<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm'><p class=MsoNormal><b><span lang=RU style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span lang=RU style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> Alexander L [mailto:abc@example.com] <br><b>Sent:</b> Thursday, July 03, 2014 3:50 PM<br><b>To:</b> alex.l@example.com<br><b>Subject:</b> Test message<o:p></o:p></span></p></div><p class=MsoNormal><o:p>&nbsp;</o:p></p><div><div><div><p class=MsoNormal><span style='font-family:"Arial","sans-serif"'>Hello! How are you?<o:p></o:p></span></p></div><div><p class=MsoNormal><span style='font-family:"Arial","sans-serif"'><o:p>&nbsp;</o:p></span></p></div><div><p class=MsoNormal><span style='font-family:"Arial","sans-serif"'>Thanks,<o:p></o:p></span></p></div><div><p class=MsoNormal><span style='font-family:"Arial","sans-serif"'>Sasha.<o:p></o:p></span></p></div></div></div></div></body></html>

View File

@@ -0,0 +1,32 @@
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi. I am fine.<br>
<br>
Thanks,<br>
Alex<br>
<div class="moz-cite-prefix">On 26.06.2014 14:41, Alexander L
wrote:<br>
</div>
<blockquote
cite="mid:CA+jEWTKBU6qc4OnH5m=-0sfwkAzZhcy0rd+ean2W6bFUVXaO7A@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_default" style="font-size:small">
<div class="gmail_default"
style="font-family:arial,sans-serif">Hello! How are you?</div>
<div class="gmail_default"
style="font-family:arial,sans-serif"><br>
</div>
<div class="gmail_default"
style="font-family:arial,sans-serif">Thanks,</div>
<div class="gmail_default"
style="font-family:arial,sans-serif">Sasha.</div>
</div>
</div>
</blockquote>
<br>
</body>
</html>

View File

@@ -0,0 +1,33 @@
<html>
<head>
<meta name="generator" content="Windows Mail 17.5.9600.20498">
<style data-externalstyle="true"><!--
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph {
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
}
p.MsoNormal, li.MsoNormal, div.MsoNormal {
margin:0in;
margin-bottom:.0001pt;
}
p.MsoListParagraphCxSpFirst, li.MsoListParagraphCxSpFirst, div.MsoListParagraphCxSpFirst,
p.MsoListParagraphCxSpMiddle, li.MsoListParagraphCxSpMiddle, div.MsoListParagraphCxSpMiddle,
p.MsoListParagraphCxSpLast, li.MsoListParagraphCxSpLast, div.MsoListParagraphCxSpLast {
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
line-height:115%;
}
--></style></head>
<body dir="ltr">
<div data-externalstyle="false" dir="ltr" style="font-family: 'Calibri', 'Segoe UI', 'Meiryo', 'Microsoft YaHei UI', 'Microsoft JhengHei UI', 'Malgun Gothic', 'sans-serif';font-size:12pt;"><div>Hi. I am fine.</div><div><br></div><div>Thanks,</div><div>Alex<br></div><div data-signatureblock="true"><div><br></div><div><br></div></div><div style="padding-top: 5px; border-top-color: rgb(229, 229, 229); border-top-width: 1px; border-top-style: solid;"><div><font face=" 'Calibri', 'Segoe UI', 'Meiryo', 'Microsoft YaHei UI', 'Microsoft JhengHei UI', 'Malgun Gothic', 'sans-serif'" style='line-height: 15pt; letter-spacing: 0.02em; font-family: "Calibri", "Segoe UI", "Meiryo", "Microsoft YaHei UI", "Microsoft JhengHei UI", "Malgun Gothic", "sans-serif"; font-size: 12pt;'><b>От:</b>&nbsp;<a href="mailto:abc@example.com" target="_parent">Alexander L</a><br><b>Отправлено:</b>&nbsp;‎четверг‎, 26 ‎июня‎ 2014 г. 15:05<br><b>Кому:</b>&nbsp;<a href="mailto:alex-ninja@example.com" target="_parent">Alex</a></font></div></div><div><br></div><div dir=""><div dir="ltr"><div class="gmail_default" style="font-size: small;"><div class="gmail_default" style="font-family: arial,sans-serif;">Hello! How are you?</div><div class="gmail_default" style="font-family: arial,sans-serif;"><br>
</div><div class="gmail_default" style="font-family: arial,sans-serif;">Thanks,</div><div class="gmail_default" style="font-family: arial,sans-serif;">Sasha.</div></div></div>
</div></div>
</body>
</html>

View File

@@ -0,0 +1 @@
<p>Hi. I am fine.<br /><br />Thanks,<br />Alex<br /><br />26.06.2014, 14:41, "Alexander L" &lt;<a href="mailto:abc@example.com">abc@example.com</a>&gt;:</p><blockquote> Hello! How are you?<br /><br /> Thanks,<br /> Sasha.</blockquote>

View File

@@ -0,0 +1,22 @@
Content-Type: multipart/alternative;
boundary="===============6853056845739363347=="
MIME-Version: 1.0
Date: Wed, 4 Apr 2012 22:22:42 -0700 (PDT)
From: Joe Doe <xxx@example.com>
Subject: Re: You've got a new booking inquiry!
--===============6853056845739363347==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
SGkgS2F0aGFyaW5lLsKgIFNvdW5kcyBncmVhdC7CoCBBcmUgdGhlcmUgYW5kIGRpZXRyeSByZXN0cmljdGlvbnMgb3IgdGhpbmdzIHlvdXIgaHVzYmFuZCBkb2VzL2RvZXNuJ3QgbGlrZSB0byBlYXQ/wqAgV291bGQgeW91IGxpa2UgdG8gZG8gYSBmZXcgaG9ycyBkIG9ldXZyZXMgYW5kIHRoZW4gYcKgMyBvciA0wqBjb3Vyc2UgZGlubmVyP8KgIExldCBtZSBrbm93IHdoYXQgeW91IHRoaW5rIHdpbGwgd29yayBiZXN0IGFuZCBJIHdpbGwgc3RhcnQgd29ya2luZyBvbiBhIG1lbnUgYW5kIHByb3Bvc2FsLsKgIFRoYW5rcyBzbyBtdWNoIGFuZCBsb29rIGZvcndhcmQgdG8gaGVhcmluZyBmcm9tIHlvdSBzb29uLgrCoApKb2UgWFhYCgotLS0gT24gV2VkLCA0LzQvMTIsIHh4eEBleGFtcGxlLmNvbSA8eHh4QGV4YW1wbGUuY29tPiB3cm90ZToKCgpGcm9tOiB4eHhAZXhhbXBsZS5jb20gPHh4eEBleGFtcGxlLmNvbT4KU3ViamVjdDogWW91J3ZlIGdvdCBhIG5ldyBib29raW5nIGlucXVpcnkhClRvOiB4eHhAeWFob28uY29tCkRhdGU6IFdlZG5lc2RheSwgQXByaWwgNCwgMjAxMiwgMTA6MjMgUE0KCk5ldyBCb29raW5nIElucXVpcnkKCg==
--===============6853056845739363347==
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: base64
PHRhYmxlPjx0cj48dGQ+PERJVj5IaSBLYXRoYXJpbmUuJm5ic3A7IFNvdW5kcyBncmVhdC4mbmJzcDsgQXJlIHRoZXJlIGFuZCBkaWV0cnkgcmVzdHJpY3Rpb25zIG9yIHRoaW5ncyB5b3VyIGh1c2JhbmQgZG9lcy9kb2Vzbid0IGxpa2UgdG8gZWF0PyZuYnNwOyBXb3VsZCB5b3UgbGlrZSB0byBkbyBhIGZldyBob3JzIGQgb2V1dnJlcyBhbmQgdGhlbiBhJm5ic3A7MyBvciA0Jm5ic3A7Y291cnNlIGRpbm5lcj8mbmJzcDsgTGV0IG1lIGtub3cgd2hhdCB5b3UgdGhpbmsgd2lsbCB3b3JrIGJlc3QgYW5kIEkgd2lsbCBzdGFydCB3b3JraW5nIG9uIGEgbWVudSBhbmQgcHJvcG9zYWwuJm5ic3A7IFRoYW5rcyBzbyBtdWNoIGFuZCBsb29rIGZvcndhcmQgdG8gaGVhcmluZyBmcm9tIHlvdSBzb29uLjwvRElWPgo8RElWPiZuYnNwOzwvRElWPgo8RElWPkpob24gRG9lPEJSPjxCUj4tLS0gT24gPEI+V2VkLCA0LzQvMTIsIHh4eEBleGFtcGxlLmNvbSA8ST4mbHQ7eHh4QGV4YW1wbGUuY29tJmd0OzwvST48L0I+IHdyb3RlOjxCUj48L0RJVj4KPEJMT0NLUVVPVEU+PEJSPkZyb206IHh4eEBleGFtcGxlLmNvbSAmbHQ7eHh4QGV4YW1wbGUuY29tJmd0OzxCUj5TdWJqZWN0OiBZb3UndmUgZ290IGEgbmV3IGJvb2tpbmcgaW5xdWlyeSE8QlI+VG86IHh4eEB5YWhvby5jb208QlI+RGF0ZTogV2VkbmVzZGF5LCBBcHJpbCA0LCAyMDEyLCAxMDoyMyBQTTxCUj48QlI+CjxESVY+CjxESVY+CjxDRU5URVI+CjxUQUJMRT4KPFRCT0RZPgo8VFI+CjxURD4KPFRBQkxFPgo8VEJPRFk+CjxUUj4KPFREPgo8VEFCTEU+CjxUQk9EWT4KPFRSPgo8VEQ+CjxESVY+TmV3IEJvb2tpbmcgSW5xdWlyeSA8L0RJVj48L1REPgo8VEQ+CjxESVY+WW91ciBwbGFjZSBpcyB0aGUgaG9tZSBvZiBiZXNwb2tlIGRpbmluZyA8L0RJVj48L1REPjwvVFI+PC9UQk9EWT48L1RBQkxFPjwvVEQ+PC9UUj48L1RCT0RZPjwvVEFCTEU+CjxUQUJMRT4KPFRCT0RZPgo8VFI+CjxURD4KPFRBQkxFPgo8VEJPRFk+CjxUUj4KPFREPiA8L1REPjwvVFI+PC9UQk9EWT48L1RBQkxFPjwvVEQ+PC9UUj4KPFRSPgo8VEQ+CjxUQUJMRT4KPFRCT0RZPgo8VFI+CjxURD4KPFRBQkxFPgo8VEJPRFk+CjxUUj4KPFREPgo8RElWPjxCUj5Hb29kIE5ld3MhPEJSPjxCUj4KPFA+RXZlbnQgRGV0YWlsczwvRElWPkRhdGU6IEFwcmlsIDI4LCAyMDEyPEJSPkxvY2F0aW9uOiB4eHg8QlI+SGVhZGNvdW50OiA2IHRvIDg8QlI+VGFyZ2V0IEJ1ZGdldDogJDUwIHBlciBwZXJzb248QlI+PEJSPkJlc3QgRGVzY3JpcHRpb24gb2YgVGFyZ2V0IEJ1ZGdldDogSSdkIGxvdmUgdG8gaGVhciB3aGF0IHRoZSBjaGVmIHRoaW5rcyBpcyBiZXN0IGZvciBteSBldmVudCwgcHJvdmlkZWQgd2Ugc3RheSBjbG9zZSB0byB0aGlzIGJ1ZGdldCA8QlI+PEJSPkV2ZW50IERlc2NyaXB0aW9uOiBJIGFtIHdhbnRpbmcgdG8gc3VycHJpc2UgbXkgaHVzYmFuZCB3aXRoIGEgY2FzdWFsIGRpbm5lciBwYXJ0eSBpbiBvdXIgaG9tZSBpbiB4eHgu
IFdlIGhhdmUgYW4gYW1hemluZyBraXRjaGVuICh0aGF0IEkgZG9uJ3QgZG8ganVzdGljZSB0byBidXQgSSBiZXQgeW91IGNvdWxkISksIGFuZCBhIHJlYWxseSBuaWNlIGdhcmRlbiBmb3IgZGluaW5nLiBJIGFtIGZseWluZyBzb21lIG9mIGhpcyBiZXN0IGZyaWVuZHMgaW4gdG8gY2VsZWJyYXRlIGhpbS4gV2UgaGF2ZSBzbWFsbCBraWRzICh3aG8gd2lsbCBiZSBzbGVlcGluZyEpLCBzbyBJJ20gaG9waW5nIGZvciBhIGNhc3VhbCBidXQgcm9tYW50aWMgZGlubmVyIHBhcnR5LiA8QlI+PEJSPlZpZXcgbW9yZSBpbnF1aXJ5IGRldGFpbHMgb24geW91ciBFdmVudCBEYXNoYm9hcmQuIElmIHlvdSBsaWtlIHdoYXQgeW91IHNlZSwgcGxlYXNlIGNyZWF0ZSBhIHByb3Bvc2FsIGZvciB0aGUgZXZlbnQuIDxCUj48QlI+SWYgeW91IGRvIG5vdCBoYXZlIHRoZSB0aW1lIHRvIG1ha2UgYSBmdWxsIHByb3Bvc2FsIHJpZ2h0IG5vdywgd2UgZW5jb3VyYWdlIHlvdSB0byBhdCBsZWFzdCByZXNwb25kIHRvIHRoZSBob3N0IHdpdGggYSBxdWljayBtZXNzYWdlIHRvIGNvbmZpcm0gdGhhdCB5b3UndmUgZ290dGVuIHRoaXMgaW5xdWlyeSBhbmQgaGF2ZSBiZWd1biB0aGlua2luZyBhYm91dCB0aGUgZXZlbnQuIDxCUj48QlI+PFNUUk9ORz5Zb3UgY2FuIHJlcGx5IGRpcmVjdGx5IHRvIHRoaXMgZW1haWwgYW5kIHlvdXIgbWVzc2FnZSB3aWxsIGdvIHRvIHRoZSBob3N0IG9uIHRoZSBldmVudCBkYXNoYm9hcmQuPC9TVFJPTkc+IDxCUj48QlI+UmVtZW1iZXIsIHlvdSBoYXZlIGV4Y2x1c2l2ZSBhY2Nlc3MgdG8gdGhpcyBpbnF1aXJ5IGZvciB0aGUgbmV4dCAyNCBob3Vycy4gUGxlYXNlIG1ha2UgYSBwcm9wb3NhbCBvciBzZW5kIGEgbWVzc2FnZSB0byB0aGUgaG9zdCBpbiB0aGF0IHRpbWUuIElmIHRoZSBob3N0IGhhcyBub3QgaGVhcmQgYW55dGhpbmcgZnJvbSB5b3UgaW4gMjQgaG91cnMsIHdlIHdpbGwKIGZvcndhcmQgdGhlIGhvc3RzIGlucXVpcnkgdG8gYSBzbWFsbCBudW1iZXIgb2YgYWRkaXRpb25hbCBjaGVmcywgYW5kIHRoZXkgd2lsbCBoYXZlIHRoZSBvcHBvcnR1bml0eSB0byBtYWtlIGEgcHJvcG9zYWwuIFdlIGRvIHRoaXMgYXMgYSBjb3VydGVzeSB0byB0aGUgaG9zdHMuIDxCUj48QlI+SWYgeW91IGNhbm5vdCBhY2NlcHQgdGhpcyBib29raW5nIG9yIGRvIG5vdCB3YW50IHRvIGZvciBhbnkgcmVhc29uLCBwbGVhc2UgdGFrZSB0aGUgdGltZSB0byBkZWNsaW5lIG9uIHRoZSBFdmVudCBEYXNoYm9hcmQuIDxCUj48QlI+VGltZSB0byBnZXQgY29va2luJyA8QlI+PEJSPjwvRElWPjwvVEQ+PC9UUj48L1RCT0RZPjwvVEFCTEU+PC9URD48L1RSPjwvVEJPRFk+PC9UQUJMRT48L1REPjwvVFI+CjxUUj4KPFREPgo8VEFCTEU+CjxUQk9EWT4KPFRSPgo8VEQ+CjxUQUJMRT4KPFRCT0RZPgo8VFI+CjxURD4KPERJVj4mbmJzcDs8QSBocmVmPSJodHRwOi8vZXhhbXBsZS5jb20iPmZvbGxvdyBvbiBUd2l0dGVyPC9BPiB8IDxBIGhyZWY9Imh0dHA6Ly94
eHgiPmZyaWVuZCBvbiBGYWNlYm9vazwvQT4gfCA8QQogaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tIj5Gb3J3YXJkIHRvIGEgRnJpZW5kPC9BPiZndDsmbmJzcDsgPC9ESVY+PC9URD48L1RSPgo8VFI+CjxURD4KPERJVj48RU0+Q29weXJpZ2g8L0VNPiA8L0RJVj48L1REPjwvVFI+PC9UQk9EWT48L1RBQkxFPjwvVEQ+PC9UUj48L1RCT0RZPjwvVEFCTEU+PC9URD48L1RSPjwvVEJPRFk+PC9UQUJMRT48QlI+PC9URD48L1RSPjwvVEJPRFk+PC9UQUJMRT48L0NFTlRFUj48SU1HIGFsdD0iIiBzcmM9Imh0dHA6Ly9leGFtcGxlLmNvbSI+IDwvRElWPjwvRElWPjwvQkxPQ0tRVU9URT48L3RkPjwvdHI+PC90YWJsZT4K
--===============6853056845739363347==--

View File

@@ -0,0 +1,21 @@
<html>
<body>
<div>
Hi
<div>
there
</div>
<div>
Bob
<hr>
<b>From: </b>bob@example.com<br>
<b>To: </b>xxx@comcast.net<br>
<b>Sent: </b>Friday, July 22, 2011 6:20:01 PM<br>
<b>Subject: </b>Hello<br><br>
<p>
Hello
</p>
</div>
</div>
</body>
</html>

24
tests/fixtures/signature/emails/P/102682_R_S vendored Executable file
View File

@@ -0,0 +1,24 @@
From: doe@example.com (John Doe)
Subject: Hello
Date: 7 Apr 94 17:35:09 GMT
#reply#rickc@example.com (xxx xxx) writes:
#reply#>In article <xxx.xxx.xxx@xxx-xxx>, xxx@example.com
#reply#>writes:
#reply#>|> I just wanted to let everyone know that I have lost what little respect
#reply#>|> I have
#reply#>|> for xxx xxx after seeing today's xxx game.
#reply#>|> A dishard xxx fan
#reply#>Yes, I also wonder if they can win with this manager.
#reply#>I never believed managers had that much to do with winning
#reply#>until I saw how much they had to do with losing....
I like the xxx a lot, but my heart belongs to the xxx...You can imagine
my frustration when I saw the xxx nabbing xxx...ARHGGHRGHH!
#sig# -John Doe
#sig#
#sig# doe@example.com

View File

@@ -0,0 +1,34 @@
(Please accept our apologies if you've already completed the Survey. Send a reply with "Did it" in the subject line to avoid future reminder messages.)
Dear Executive,
YOUR INPUT IS VERY VALUABLE. Over the past week or so, you've been invited to participate in a very important survey that will significantly improve the information products available to you in the power and energy industry.
Because we haven't heard from you yet, we're adding more prizes to encourage participation. There are now 12 prizes you could win; we've added 4 more inducements to the original 8 prizes.
YOUR CHANCE OF WINNING IS HUGE. We're hoping to draw a total of 200 respondents from this sector of the industry. When we receive your fully completed questionnaire, your e-mail address will one among 30 in a drawing that could bring you one of the 12 prizes. Those odds aren't bad at all! The prizes include:
-- FOUR $100 gift certificates
-- EIGHT $50 gift certificates
IF YOU WIN, you can choose from the following list where you'd like to spend your gift certificate:
-- Amazon.com,
-- REI.com,
-- GOLFDISCOUNT.com,
-- CABELLAS.com,
-- fogdog.com, or
-- a general American Express gift certificate.
Just click on the long blue URL listed in the section below this letter to connect to the welcome page for the survey.
THE DEADLINE FOR SUBMITTING YOUR SURVEY IS FRIDAY, SEPTEMBER 7.
Thanks for your participation, and we wish you the best of luck in the
drawing!
#sig#John E. Doe, Ph.D.
#sig#President
#sig#Xxx Research
#sig#john@example.com
#sig#www.example.com

View File

@@ -0,0 +1 @@
john@example.com

View File

@@ -0,0 +1,10 @@
From: Сергей Обухов
Добрый день.
Мы являемся стартапом работающим над созданием платформы
почтовых электронных сообщений. Нам бы хотелось использовать Ваш продукт
для решения задач парсинга сообщений. Если Вы заинтересованы, пожалуйста
ответьте на это письмо.
С уважением,
Сергей Обухов

View File

@@ -0,0 +1,9 @@
Martin, can we get:
1 L Male
1 M Male
1 M Female
to 111 Xxxxxx ST Xxx Xxxxxxxxx XX 94133
That'd be awesome! Really cool shirts!

View File

@@ -0,0 +1,5 @@
Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.
Stavros Xxxxxx
via mobile

View File

@@ -0,0 +1 @@
Stavros Xxxxxx

View File

@@ -0,0 +1,2 @@
Stavros Xxxxxx
via mobile

View File

@@ -0,0 +1,34 @@
(Please accept our apologies if you've already completed the Survey. Send a reply with "Did it" in the subject line to avoid future reminder messages.)
Dear Executive,
YOUR INPUT IS VERY VALUABLE. Over the past week or so, you've been invited to participate in a very important survey that will significantly improve the information products available to you in the power and energy industry.
Because we haven't heard from you yet, we're adding more prizes to encourage participation. There are now 12 prizes you could win; we've added 4 more inducements to the original 8 prizes.
YOUR CHANCE OF WINNING IS HUGE. We're hoping to draw a total of 200 respondents from this sector of the industry. When we receive your fully completed questionnaire, your e-mail address will one among 30 in a drawing that could bring you one of the 12 prizes. Those odds aren't bad at all! The prizes include:
-- FOUR $100 gift certificates
-- EIGHT $50 gift certificates
IF YOU WIN, you can choose from the following list where you'd like to spend your gift certificate:
-- Amazon.com,
-- REI.com,
-- GOLFDISCOUNT.com,
-- CABELLAS.com,
-- fogdog.com, or
-- a general American Express gift certificate.
Just click on the long blue URL listed in the section below this letter to connect to the welcome page for the survey.
THE DEADLINE FOR SUBMITTING YOUR SURVEY IS FRIDAY, SEPTEMBER 7.
Thanks for your participation, and we wish you the best of luck in the
drawing!
#sig#John E. Doe, Ph.D.
#sig#President
#sig#Xxx Research
#sig#john@example.com
#sig#www.example.com

View File

@@ -0,0 +1 @@
john@example.com

View File

@@ -0,0 +1,5 @@
#sig#John E. Doe, Ph.D.
#sig#President
#sig#Xxx Research
#sig#john@example.com
#sig#www.example.com

View File

@@ -0,0 +1,8 @@
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce nec est enim. Vestibulum vel enim urna, sed facilisis augue. Vestibulum dui nibh, pulvinar id adipiscing id, congue id turpis.
Suspendisse non posuere erat. Ut porta luctus augue, laoreet accumsan sem auctor quis. Fusce feugiat elit et dolor tempor lobortis. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed molestie gravida mi, id faucibus risus tempus vel. Mauris dictum enim nec lectus iaculis ac eleifend libero vestibulum. Morbi imperdiet lobortis erat non molestie. Sed non aliquam lacus.
Gina Xxxxxxxxx
gina@example.com
(555) 346-9947
www.example.com

View File

@@ -0,0 +1 @@
Gina Xxxxxxxxx

View File

@@ -0,0 +1,4 @@
Gina Xxxxxxxxx
gina@example.com
(555) 346-9947
www.example.com

View File

@@ -0,0 +1,6 @@
Simone,
It is 'example.com'. Please let me know what you see.
Thank you,
Noam

View File

@@ -0,0 +1 @@
Noam

View File

@@ -0,0 +1,2 @@
Thank you,
Noam

View File

@@ -0,0 +1,23 @@
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce nec est enim. Vestibulum vel enim urna, sed facilisis augue. Vestibulum dui nibh, pulvinar id adipiscing id, congue id turpis.
Suspendisse non posuere erat. Ut porta luctus augue, laoreet accumsan sem auctor quis. Fusce feugiat elit et dolor tempor lobortis. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed molestie gravida mi, id faucibus risus tempus vel. Mauris dictum enim nec lectus iaculis ac eleifend libero vestibulum. Morbi imperdiet lobortis erat non molestie. Sed non aliquam lacus.
----------------
AM_LOGO_CLR
Me John Doe
Internet xxx / xxxxx d'internet
Xxxxxxx du Québec
t. 555-931-0702
skype: xxx
f. 555-875-7611
http://example.com/
http://www.example.com/in/xxx

View File

@@ -0,0 +1 @@
John Doe

View File

@@ -0,0 +1,19 @@
----------------
AM_LOGO_CLR
Me John Doe
Internet xxx / xxxxx d'internet
Xxxxxxx du Québec
t. 555-931-0702
skype: xxx
f. 555-875-7611
http://example.com/
http://www.example.com/in/xxx

View File

@@ -0,0 +1 @@
*.data

View File

@@ -0,0 +1,24 @@
Content-Type: multipart/alternative;
boundary="===============0934372227844987316=="
MIME-Version: 1.0
Date: Mon, 2 Apr 2012 18:22:10 +0400
Message-Id: <CAEAsyCZ-sCHxZtoKyM3JmT5gSYpZd5GwY-cVNiV8H329zgJT4g@mail.gmail.com>
Subject: Re: Test
From: Sergey Obykhov <bob@example.com>
To: "bob@xxx.mailgun.org" <bob@xxx.mailgun.org>
--===============0934372227844987316==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
SGVsbG8KMDIuMDQuMjAxMiAxNDoyMCDQv9C+0LvRjNC30L7QstCw0YLQtdC70YwgImJvYkB4eHgubWFpbGd1bi5vcmciIDwKYm9iQHh4eC5tYWlsZ3VuLm9yZz4g0L3QsNC/0LjRgdCw0Ls6Cgo+IEhpCj4KCg==
--===============0934372227844987316==
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: base64
PHA+SGVsbG88L3A+CjxkaXYgY2xhc3M9ImdtYWlsX3F1b3RlIj4wMi4wNC4yMDEyIDE0OjIwINC/0L7Qu9GM0LfQvtCy0LDRgtC10LvRjCAmcXVvdDs8YSBocmVmPSJtYWlsdG86Ym9iQHh4eC5tYWlsZ3VuLm9yZyI+Ym9iQHh4eC5tYWlsZ3VuLm9yZzwvYT4mcXVvdDsgJmx0OzxhIGhyZWY9Im1haWx0bzpib2JAeHh4Lm1haWxndW4ub3JnIj5ib2JAeHh4Lm1haWxndW4ub3JnPC9hPiZndDsg0L3QsNC/0LjRgdCw0Ls6PGJyIHR5cGU9ImF0dHJpYnV0aW9uIj4KPGJsb2NrcXVvdGUgY2xhc3M9ImdtYWlsX3F1b3RlIiBzdHlsZT0ibWFyZ2luOjAgMCAwIC44ZXg7Ym9yZGVyLWxlZnQ6MXB4ICNjY2Mgc29saWQ7cGFkZGluZy1sZWZ0OjFleCI+SGk8YnI+CjwvYmxvY2txdW90ZT48L2Rpdj4KCg==
--===============0934372227844987316==--

65
tests/fixtures/standard_replies/aol.eml vendored Normal file
View File

@@ -0,0 +1,65 @@
Content-Type: multipart/alternative;
boundary="===============7429987408351918371=="
MIME-Version: 1.0
To: bob@example.com
Subject: Re: Test
From: Megan Odin <xxx@aol.com>
Message-Id: <8CEDEEFBEF4733B-1E5C-73DF@webmail-d070.sysops.aol.com>
Date: Mon, 2 Apr 2012 09:57:58 -0400 (EDT)
--===============7429987408351918371==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Hello
-----Original Message-----
From: bob <bob@example.com>
To: xxx <xxx@gmail.com>; xxx <xxx@hotmail.com>; xxx <xxx@yahoo.com>; xxx <xxx@aol.com>; xxx <xxx@comcast.net>; xxx <xxx@nyc.rr.com>
Sent: Mon, Apr 2, 2012 5:49 pm
Subject: Test
Hi
--===============7429987408351918371==
Content-Type: text/html; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
<font color='black' size='2' face='arial'>Hello<br>
<br>
<br>
<div style="font-family:arial,helvetica;font-size:10pt;color:black">-----Original Message-----<br>
From: bob &lt;bob@example.com&gt;<br>
To: xxx &lt;xxx@gmail.com&gt;; xxx &lt;xxx@hotmail.com&gt;; xxx &lt;xxx@yahoo.com&gt;; xxx &lt;xxx@aol.com&gt;; xxx &lt;xxx@comcast.net&gt;; xxx &lt;xxx@nyc.rr.com&gt;<br>
Sent: Mon, Apr 2, 2012 5:49 pm<br>
Subject: Test<br>
<br>
<div id="AOLMsgPart_0_4d68a632-fe65-4f6d-ace2-292ac1b91f1f" style="margin: 0px;font-family: Tahoma, Verdana, Arial, Sans-Serif;font-size: 12px;color: #000;background-color: #fff;">
<pre style="font-size: 9pt;"><tt>Hi
</tt></pre>
</div>
<!-- end of AOLMsgPart_0_4d68a632-fe65-4f6d-ace2-292ac1b91f1f -->
</div>
</font>
--===============7429987408351918371==--

View File

@@ -0,0 +1,15 @@
Content-Type: text/plain; charset=iso-8859-1
Mime-Version: 1.0 (Apple Message framework v1257)
Subject: Re: Test
From: xxx <xxx@gmail.com>
Date: Tue, 3 Apr 2012 16:55:26 +0400
Content-Transfer-Encoding: 7bit
Message-Id: <9A1EA6A5-4FD3-4AD0-8DFD-2420E670DB53@gmail.com>
To: bob <bob@example.com>
X-Mailer: Apple Mail (2.1257)
Hello
On Apr 3, 2012, at 4:19 PM, bob wrote:
> Hi

View File

@@ -0,0 +1,33 @@
Content-Type: multipart/alternative;
boundary="===============3552566137977633461=="
MIME-Version: 1.0
Date: Mon, 2 Apr 2012 13:56:12 +0000 (UTC)
From: xxx@comcast.net
To: bob@xxx.mailgun.org
Message-Id: <650787974.741595.1333374972389.JavaMail.root@sz0152a.westchester.pa.mail.comcast.net>
Subject: Re: Test
X-Mailer: Zimbra 6.0.13_GA_2944 (ZimbraWebClient - SAF3 (Linux)/6.0.13_GA_2944)
--===============3552566137977633461==
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Hello
----- Original Message -----
From: bob@xxx.mailgun.org
To: xxx@gmail.com, xxx@hotmail.com, xxx@yahoo.com, xxx@aol.com, xxx@comcast.net, lsloan6@nyc.rr.com
Sent: Monday, April 2, 2012 5:44:22 PM
Subject: Test
Hi
--===============3552566137977633461==
MIME-Version: 1.0
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: 7bit
<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: Arial; font-size: 12pt; color: #000000'>Hello<br><br><hr id="zwchr"><b>From: </b>bob@xxx.mailgun.org<br><b>To: </b>xxx@gmail.com, xxx@hotmail.com, xxx@yahoo.com, xxx@aol.com, xxx@comcast.net, lsloan6@nyc.rr.com<br><b>Sent: </b>Monday, April 2, 2012 5:44:22 PM<br><b>Subject: </b>Test<br><br>Hi<br></div></body></html>
--===============3552566137977633461==--

View File

@@ -0,0 +1,31 @@
Content-Type: multipart/alternative;
boundary="===============3455449757443551301=="
MIME-Version: 1.0
Date: Mon, 2 Apr 2012 20:21:52 +0400
Message-Id: <CAKsfaBW4hj0Gek6TwbR3erng4P1y0CZzJ0d=pXtCNnYnbe7PLg@mail.gmail.com>
Subject: Re: Test
From: Megan One <xxx@gmail.com>
To: bob@example.com
--===============3455449757443551301==
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Hello
On Mon, Apr 2, 2012 at 6:26 PM, Megan One <xxx@gmail.com> wrote:
> Hi
--===============3455449757443551301==
MIME-Version: 1.0
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Hello<br><br><div class="gmail_quote">On Mon, Apr 2, 2012 at 6:26 PM, Megan One <span dir="ltr">&lt;<a href="mailto:xxx@gmail.com">xxx@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi
</blockquote></div><br>
--===============3455449757443551301==--

View File

@@ -0,0 +1,50 @@
Content-Type: multipart/alternative;
boundary="===============5499446768842282638=="
MIME-Version: 1.0
Message-Id: <DUB102-W192C6E94759954C4885B92B14C0@phx.gbl>
From: Alexey Q <xxx@hotmail.com>
To: <bob@xxx.mailgun.org>
Subject: RE: Test
Date: Mon, 2 Apr 2012 21:47:37 +0800
X-Originalarrivaltime: 02 Apr 2012 13:47:37.0935 (UTC)
FILETIME=[2A6C0DF0:01CD10D7]
--===============5499446768842282638==
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Hello
> Subject: Test
> From: bob@xxx.mailgun.org
> To: xxx@gmail.com; xxx@hotmail.com; xxx@yahoo.com; xxx@aol.com; xxx@comcast.net; xxx@nyc.rr.com
> Date: Mon, 2 Apr 2012 17:44:22 +0400
>
> Hi
--===============5499446768842282638==
MIME-Version: 1.0
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: 7bit
<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 10pt;
font-family:Tahoma
}
--></style></head>
<body class='hmmessage'><div dir='ltr'>
Hello<br><br><div><div id="SkyDrivePlaceholder"></div>&gt; Subject: Test<br>&gt; From: bob@xxx.mailgun.org<br>&gt; To: xxx@gmail.com; xxx@hotmail.com; xxx@yahoo.com; xxx@aol.com; xxx@comcast.net; xxx@nyc.rr.com<br>&gt; Date: Mon, 2 Apr 2012 17:44:22 +0400<br>&gt; <br>&gt; Hi<br></div> </div></body>
</html>
--===============5499446768842282638==--

View File

@@ -0,0 +1,19 @@
Subject: Re: Test
From: xxx <xxx@gmail.com>
Content-Type: text/plain;
charset=us-ascii
X-Mailer: iPhone Mail (9B176)
Message-Id: <06C90B12-13B9-4C5F-A9EF-4A809D94C078@gmail.com>
Date: Tue, 3 Apr 2012 16:23:59 +0400
To: bob <bob@example.com>
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0 (1.0)
hello
Sent from my iPhone
On Apr 3, 2012, at 4:19 PM, bob <bob@example.com> wr=
ote:
> Hi

View File

@@ -0,0 +1,85 @@
Subject: Test
From: me@example.com
To: you@example.com
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=0016364c440b2e8b63049acd5370
X-Mailgun-Tag: tag
X-Mailgun-Mailing-List-Id: 1q
--0016364c440b2e8b63049acd5370
Content-Type: text/plain; charset=ISO-8859-1
Hello
From: xxx@xxx.mailgun.org [mailto:xxx@xxx.mailgun.org]
Sent: March-09-12 4:22 PM
To: Dan Le
Subject: The manager has commented on your Loop
Hi dan.le@example.com<mailto:dan.le@example.com>,
The manager's comment:
"Hello Allan! Did you ask for some MIME? "
Loop details:
xxx at Dan
I'm not happy
""
Your Loop is here<http://dev.xxx.com/loop/view/4f50f20e160839c95a000bb3?_uid=4f3541a7ac63e655040008e3>.
We will be in touch again with any further updates,
xxx
If you did not sign up to receive emails from us you can use the link below to unsubscribe. We apologize for any inconvenience.
Unsubscribe<http://dev.xxx.com/user/unsubscribe/dan.le@example.com?verify=4a400554148256338956101abdf06406>
--0016364c440b2e8b63049acd5370
Content-Type: text/html; charset=ISO-8859-1
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-reply;
font-family:"Calibri","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-CA link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Allo! Follow up MIME!<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> xxx@xxx.mailgun.org [mailto:xxx@xxx.mailgun.org] <br><b>Sent:</b> March-09-12 4:22 PM<br><b>To:</b> Dan Le<br><b>Subject:</b> The manager has commented on your Loop<o:p></o:p></span></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Hi <a href="mailto:dan.le@example.com">dan.le@example.com</a>,<br><br>The manager's comment:<br>&quot;Hello Allan! Did you ask for some MIME? &quot;<br><br>Loop details:<br><br>xxx at Dan<br>I'm not happy<br>&quot;&quot;<br><br>Your Loop is <a href="http://dev.xxx.com/loop/view/4f50f20e160839c95a000bb3?_uid=4f3541a7ac63e655040008e3">here</a>.<br><br>We will be in touch again with any further updates,<br><br>xxx<br><br>If you did not sign up to receive emails from us you can use the link below to unsubscribe. We apologize for any inconvenience.<br><br><a href="http://dev.xxx.com/user/unsubscribe/dan.le@example.com?verify=4a400554148256338956101abdf06406">Unsubscribe</a> <o:p></o:p></p></div></body></html>
--0016364c440b2e8b63049acd5370--

View File

@@ -0,0 +1,61 @@
Date: Tue, 3 Apr 2012 16:58:35 +0400
From: xxx <xxx@gmail.com>
To: bob <bob@example.com>
Message-ID: <5BB86EF4B6E24E4C9DA4BBEF59DA9809@gmail.com>
Subject: Re: Test
X-Mailer: sparrow 1.5 (build 1043)
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="4f7af3fb_749abb43_300"
--4f7af3fb_749abb43_300
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Hello
--
xxx
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Tuesday, April 3, 2012 at 4:55 PM, xxx wrote:
> Hello
>
> On Apr 3, 2012, at 4:19 PM, bob wrote:
>
> > Hi
--4f7af3fb_749abb43_300
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
<div>
<span style=3D=22font-size: 12px;=22>Hello</span>
</div>
<div><div><br></div><div>--&nbsp;</div><div>xx=
x</div><div>Sent with <a href=3D=22http://www.sparrowmailapp.com/=3Fsig=22=
>Sparrow</a></div><div><br></div></div>
=20
<p style=3D=22color: =23A0A0A8;=22>On Tuesday, April 3, 2=
012 at 4:55 PM, xxx wrote:</p>
<blockquote type=3D=22cite=22 style=3D=22border-left-styl=
e:solid;border-width:1px;margin-left:0px;padding-left:10px;=22>
<span><div><div><div>Hello</div><div><br></div><div>O=
n Apr 3, 2012, at 4:19 PM, bob wrote:</div><div><br></div><blo=
ckquote type=3D=22cite=22><div>Hi</div></blockquote></div></div></span>
=20
=20
=20
=20
</blockquote>
=20
<div>
<br>
</div>
--4f7af3fb_749abb43_300--


@@ -0,0 +1,5 @@
Hello
--
xxx
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


@@ -0,0 +1,15 @@
MIME-Version: 1.0
Message-Id: <4F79B73C.9030506@xxx.mailgun.org>
Date: Mon, 02 Apr 2012 18:27:08 +0400
From: bob <bob@xxx.mailgun.org>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US;
 rv:1.9.2.28) Gecko/20120313 Thunderbird/3.1.20
To: Megan One <xxx@gmail.com>
Subject: Re: Test
Sender: bob@xxx.mailgun.org
Content-Type: text/plain; charset="us-ascii"; format="flowed"
Content-Transfer-Encoding: 7bit

On 04/02/2012 06:26 PM, Megan One wrote:
> Hi
Hello


@@ -0,0 +1,22 @@
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
X-Mailer: YahooMailWebService/0.8.117.340979
Message-Id: <1333374330.68772.YahooMailNeo@web114411.mail.gq1.yahoo.com>
Date: Mon, 2 Apr 2012 06:45:30 -0700 (PDT)
From: Alex Q <xxx@yahoo.com>
Subject: Re: Test
To: "bob@xxx.mailgun.org" <bob@xxx.mailgun.org>
In-Reply-To: <1333374262.7063.15.camel@mg5>
Content-Transfer-Encoding: 7bit

Hello
----- Original Message -----
From: "bob@xxx.mailgun.org" <bob@xxx.mailgun.org>
To: xxx@gmail.com; xxx@hotmail.com; xxx@yahoo.com; xxx@aol.com; xxx@comcast.net; xxx@nyc.rr.com
Cc:
Sent: Monday, April 2, 2012 5:44 PM
Subject: Test
Hi


@@ -0,0 +1,298 @@
# -*- coding: utf-8 -*-
from . import *
from . fixtures import *
import regex as re
from flanker import mime
from talon import quotations
import html2text
RE_WHITESPACE = re.compile(r"\s")
RE_DOUBLE_WHITESPACE = re.compile(r"\s")
def test_quotation_splitter_inside_blockquote():
msg_body = """Reply
<blockquote>
<div>
On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
</div>
<div>
Test
</div>
</blockquote>"""
eq_("<html><body><p>Reply</p></body></html>",
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_quotation_splitter_outside_blockquote():
msg_body = """Reply
<div>
On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
</div>
<blockquote>
<div>
Test
</div>
</blockquote>
"""
eq_("<html><body><p>Reply</p><div></div></body></html>",
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_no_blockquote():
msg_body = """
<html>
<body>
Reply
<div>
On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
</div>
<div>
Test
</div>
</body>
</html>
"""
reply = """
<html>
<body>
Reply
</body></html>"""
eq_(RE_WHITESPACE.sub('', reply),
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_empty_body():
eq_('', quotations.extract_from_html(''))
def test_validate_output_html():
msg_body = """Reply
<div>
On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
<blockquote>
<div>
Test
</div>
</blockquote>
</div>
<div/>
"""
out = quotations.extract_from_html(msg_body)
ok_('<html>' in out and '</html>' in out,
'Invalid HTML - <html>/</html> tag not present')
ok_('<div/>' not in out,
'Invalid HTML output - <div/> element is not valid')
def test_gmail_quote():
msg_body = """Reply
<div class="gmail_quote">
<div class="gmail_quote">
On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
<div>
Test
</div>
</div>
</div>"""
eq_("<html><body><p>Reply</p></body></html>",
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_unicode_in_reply():
msg_body = u"""Reply \xa0 \xa0 Text<br>
<div>
<br>
</div>
<blockquote class="gmail_quote">
Quote
</blockquote>""".encode("utf-8")
eq_("<html><body><p>Reply&#160;&#160;Text<br></p><div><br></div>"
"</body></html>",
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_blockquote_disclaimer():
msg_body = """
<html>
<body>
<div>
<div>
message
</div>
<blockquote>
Quote
</blockquote>
</div>
<div>
disclaimer
</div>
</body>
</html>
"""
stripped_html = """
<html>
<body>
<div>
<div>
message
</div>
</div>
<div>
disclaimer
</div>
</body>
</html>
"""
eq_(RE_WHITESPACE.sub('', stripped_html),
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_date_block():
msg_body = """
<div>
message<br>
<div>
<hr>
Date: Fri, 23 Mar 2012 12:35:31 -0600<br>
To: <a href="mailto:bob@example.com">bob@example.com</a><br>
From: <a href="mailto:rob@example.com">rob@example.com</a><br>
Subject: You Have New Mail From Mary!<br><br>
text
</div>
</div>
"""
eq_('<html><body><div>message<br></div></body></html>',
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_from_block():
msg_body = """<div>
message<br>
<div>
<hr>
From: <a href="mailto:bob@example.com">bob@example.com</a><br>
Date: Fri, 23 Mar 2012 12:35:31 -0600<br>
To: <a href="mailto:rob@example.com">rob@example.com</a><br>
Subject: You Have New Mail From Mary!<br><br>
text
</div></div>
"""
eq_('<html><body><div>message<br></div></body></html>',
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_reply_shares_div_with_from_block():
msg_body = '''
<body>
<div>
Blah<br><br>
<hr>Date: Tue, 22 May 2012 18:29:16 -0600<br>
To: xx@hotmail.ca<br>
From: quickemail@ashleymadison.com<br>
Subject: You Have New Mail From x!<br><br>
</div>
</body>'''
eq_('<html><body><div>Blah<br><br></div></body></html>',
RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
def test_reply_quotations_share_block():
msg = mime.from_string(REPLY_QUOTATIONS_SHARE_BLOCK)
html_part = list(msg.walk())[1]
assert html_part.content_type == 'text/html'
stripped_html = quotations.extract_from_html(html_part.body)
ok_(stripped_html)
ok_('From' not in stripped_html)
def test_OLK_SRC_BODY_SECTION_stripped():
eq_('<html><body><div>Reply</div></body></html>',
RE_WHITESPACE.sub(
'', quotations.extract_from_html(OLK_SRC_BODY_SECTION)))
def test_reply_separated_by_hr():
eq_('<html><body><div>Hi<div>there</div></div></body></html>',
RE_WHITESPACE.sub(
'', quotations.extract_from_html(REPLY_SEPARATED_BY_HR)))
RE_REPLY = re.compile(r"^Hi\. I am fine\.\s*\n\s*Thanks,\s*\n\s*Alex\s*$")
def extract_reply_and_check(filename):
    # read the fixture and decode it so non-ASCII characters compare correctly
    with open(filename) as f:
        msg_body = f.read().decode("utf-8")

    reply = quotations.extract_from_html(msg_body)

    h = html2text.HTML2Text()
    h.body_width = 0
    plain_reply = h.handle(reply)

    # replace non-breaking spaces (&nbsp;) with regular spaces
    plain_reply = plain_reply.replace(u'\xa0', u' ')

    # on a mismatch, compare against the expected text to get a readable diff
    if not RE_REPLY.match(plain_reply):
        eq_("Hi. I am fine.\n\nThanks,\nAlex", plain_reply)
def test_gmail_reply():
extract_reply_and_check("tests/fixtures/html_replies/gmail.html")
def test_mail_ru_reply():
extract_reply_and_check("tests/fixtures/html_replies/mail_ru.html")
def test_hotmail_reply():
extract_reply_and_check("tests/fixtures/html_replies/hotmail.html")
def test_ms_outlook_2003_reply():
extract_reply_and_check("tests/fixtures/html_replies/ms_outlook_2003.html")
def test_ms_outlook_2007_reply():
extract_reply_and_check("tests/fixtures/html_replies/ms_outlook_2007.html")
def test_thunderbird_reply():
extract_reply_and_check("tests/fixtures/html_replies/thunderbird.html")
def test_windows_mail_reply():
extract_reply_and_check("tests/fixtures/html_replies/windows_mail.html")
def test_yandex_ru_reply():
extract_reply_and_check("tests/fixtures/html_replies/yandex_ru.html")

33
tests/quotations_test.py Normal file

@@ -0,0 +1,33 @@
# -*- coding: utf-8 -*-
from . import *
from . fixtures import *
from flanker import mime
from talon import quotations
@patch.object(quotations, 'extract_from_html')
@patch.object(quotations, 'extract_from_plain')
def test_extract_from_respects_content_type(extract_from_plain,
extract_from_html):
msg_body = "Hi there"
quotations.extract_from(msg_body, 'text/plain')
extract_from_plain.assert_called_with(msg_body)
quotations.extract_from(msg_body, 'text/html')
extract_from_html.assert_called_with(msg_body)
eq_(msg_body, quotations.extract_from(msg_body, 'text/blah'))
@patch.object(quotations, 'extract_from_plain', Mock(side_effect=Exception()))
def test_crash_inside_extract_from():
msg_body = "Hi there"
eq_(msg_body, quotations.extract_from(msg_body, 'text/plain'))
def test_empty_body():
eq_('', quotations.extract_from_plain(''))


@@ -0,0 +1,238 @@
# -*- coding: utf-8 -*-
from .. import *
import os
from flanker import mime
from talon.signature import bruteforce
def test_empty_body():
eq_(('', None), bruteforce.extract_signature(''))
def test_no_signature():
msg_body = 'Hey man!'
eq_((msg_body, None), bruteforce.extract_signature(msg_body))
def test_signature_only():
msg_body = '--\nRoman'
eq_((msg_body, None), bruteforce.extract_signature(msg_body))
def test_signature_separated_by_dashes():
msg_body = '''Hey man! How r u?
---
Roman'''
eq_(('Hey man! How r u?', '---\nRoman'),
bruteforce.extract_signature(msg_body))
msg_body = '''Hey!
-roman'''
eq_(('Hey!', '-roman'), bruteforce.extract_signature(msg_body))
msg_body = '''Hey!
- roman'''
eq_(('Hey!', '- roman'), bruteforce.extract_signature(msg_body))
msg_body = '''Wow. Awesome!
--
Bob Smith'''
eq_(('Wow. Awesome!', '--\nBob Smith'),
bruteforce.extract_signature(msg_body))
def test_signature_words():
msg_body = '''Hey!
Thanks!
Roman'''
eq_(('Hey!', 'Thanks!\nRoman'),
bruteforce.extract_signature(msg_body))
msg_body = '''Hey!
--
Best regards,

Roman'''
eq_(('Hey!', '--\nBest regards,\n\nRoman'),
bruteforce.extract_signature(msg_body))
msg_body = '''Hey!
--
--
Regards,
Roman'''
eq_(('Hey!', '--\n--\nRegards,\nRoman'),
bruteforce.extract_signature(msg_body))
def test_iphone_signature():
msg_body = '''Hey!
Sent from my iPhone!'''
eq_(('Hey!', 'Sent from my iPhone!'),
bruteforce.extract_signature(msg_body))
def test_mailbox_for_iphone_signature():
msg_body = """Blah
Sent from Mailbox for iPhone"""
eq_(("Blah", "Sent from Mailbox for iPhone"),
bruteforce.extract_signature(msg_body))
def test_line_starts_with_signature_word():
msg_body = '''Hey man!
Thanks for your attention.
--
Thanks!
Roman'''
eq_(('Hey man!\nThanks for your attention.', '--\nThanks!\nRoman'),
bruteforce.extract_signature(msg_body))
def test_line_starts_with_dashes():
msg_body = '''Hey man!
Look at this:

--> one
--> two
--
Roman'''
eq_(('Hey man!\nLook at this:\n\n--> one\n--> two', '--\nRoman'),
bruteforce.extract_signature(msg_body))
def test_blank_lines_inside_signature():
msg_body = '''Blah.
-Lev.

Sent from my HTC smartphone!'''
eq_(('Blah.', '-Lev.\n\nSent from my HTC smartphone!'),
bruteforce.extract_signature(msg_body))
msg_body = '''Blah
--

John Doe'''
eq_(('Blah', '--\n\nJohn Doe'), bruteforce.extract_signature(msg_body))
def test_blackberry_signature():
msg_body = """Heeyyoooo.
Sent wirelessly from my BlackBerry device on the Bell network.
Envoyé sans fil par mon terminal mobile BlackBerry sur le réseau de Bell."""
eq_(('Heeyyoooo.', msg_body[len('Heeyyoooo.\n'):]),
bruteforce.extract_signature(msg_body))
msg_body = u"""Blah
Enviado desde mi oficina móvil BlackBerry® de Telcel"""
eq_(('Blah', u'Enviado desde mi oficina móvil BlackBerry® de Telcel'),
bruteforce.extract_signature(msg_body))
@patch.object(bruteforce, 'get_delimiter', Mock(side_effect=Exception()))
def test_crash_in_extract_signature():
msg_body = '''Hey!
-roman'''
eq_((msg_body, None), bruteforce.extract_signature(msg_body))
def test_signature_cant_start_from_first_line():
msg_body = """Thanks,

Blah
regards

John Doe"""
eq_(('Thanks,\n\nBlah', 'regards\n\nJohn Doe'),
bruteforce.extract_signature(msg_body))
@patch.object(bruteforce, 'SIGNATURE_MAX_LINES', 2)
def test_signature_max_lines_ignores_empty_lines():
msg_body = """Thanks,
Blah
regards


John Doe"""
eq_(('Thanks,\nBlah', 'regards\n\n\nJohn Doe'),
bruteforce.extract_signature(msg_body))
def test_get_signature_candidate():
# if there aren't at least 2 non-empty lines there should be no signature
for lines in [], [''], ['', ''], ['abc']:
eq_([], bruteforce.get_signature_candidate(lines))
# first line never included
lines = ['text', 'signature']
eq_(['signature'], bruteforce.get_signature_candidate(lines))
# test when message is shorter than SIGNATURE_MAX_LINES
with patch.object(bruteforce, 'SIGNATURE_MAX_LINES', 3):
lines = ['text', '', '', 'signature']
eq_(['signature'], bruteforce.get_signature_candidate(lines))
# test when message is longer than SIGNATURE_MAX_LINES
with patch.object(bruteforce, 'SIGNATURE_MAX_LINES', 2):
lines = ['text1', 'text2', 'signature1', '', 'signature2']
eq_(['signature1', '', 'signature2'],
bruteforce.get_signature_candidate(lines))
# test that long lines are not included
with patch.object(bruteforce, 'TOO_LONG_SIGNATURE_LINE', 3):
lines = ['BR,', 'long', 'Bob']
eq_(['Bob'], bruteforce.get_signature_candidate(lines))
# test list (with dashes as bullet points) not included
lines = ['List:,', '- item 1', '- item 2', '--', 'Bob']
eq_(['--', 'Bob'], bruteforce.get_signature_candidate(lines))
def test_mark_candidate_indexes():
with patch.object(bruteforce, 'TOO_LONG_SIGNATURE_LINE', 3):
# spaces are not considered when checking line length
eq_('clc',
bruteforce._mark_candidate_indexes(
['BR, ', 'long', 'Bob'],
[0, 1, 2]))
# only candidate lines are marked
# if line has only dashes it's a candidate line
eq_('ccdc',
bruteforce._mark_candidate_indexes(
['-', 'long', '-', '- i', 'Bob'],
[0, 2, 3, 4]))
def test_process_marked_candidate_indexes():
eq_([2, 13, 15],
bruteforce._process_marked_candidate_indexes(
[2, 13, 15], 'dcc'))
eq_([15],
bruteforce._process_marked_candidate_indexes(
[2, 13, 15], 'ddc'))
eq_([13, 15],
bruteforce._process_marked_candidate_indexes(
[13, 15], 'cc'))
eq_([15],
bruteforce._process_marked_candidate_indexes(
[15], 'lc'))
eq_([15],
bruteforce._process_marked_candidate_indexes(
[13, 15], 'ld'))


@@ -0,0 +1,148 @@
# -*- coding: utf-8 -*-
from .. import *
import os
from PyML import SparseDataSet
from talon.signature.learning import dataset
from talon import signature
from talon.signature import extraction as e
from talon.signature import bruteforce
def test_message_shorter_SIGNATURE_MAX_LINES():
sender = "bob@foo.bar"
body = """Call me ASAP, please.This is about the last changes you deployed.

Thanks in advance,
Bob"""
text, extracted_signature = signature.extract(body, sender)
eq_('\n'.join(body.splitlines()[:2]), text)
eq_('\n'.join(body.splitlines()[-2:]), extracted_signature)
def test_messages_longer_SIGNATURE_MAX_LINES():
for filename in os.listdir(STRIPPED):
filename = os.path.join(STRIPPED, filename)
if not filename.endswith('_body'):
continue
sender, body = dataset.parse_msg_sender(filename)
text, extracted_signature = signature.extract(body, sender)
extracted_signature = extracted_signature or ''
with open(filename[:-len('body')] + 'signature') as ms:
msg_signature = ms.read()
eq_(msg_signature.strip(), extracted_signature.strip())
stripped_msg = body.strip()[:len(body.strip())-len(msg_signature)]
eq_(stripped_msg.strip(), text.strip())
def test_text_line_in_signature():
# test signature should consist of one solid part
sender = "bob@foo.bar"
body = """Call me ASAP, please.This is about the last changes you deployed.

Thanks in advance,
some text which doesn't seem to be a signature at all
Bob"""
text, extracted_signature = signature.extract(body, sender)
eq_('\n'.join(body.splitlines()[:2]), text)
eq_('\n'.join(body.splitlines()[-3:]), extracted_signature)
def test_long_line_in_signature():
sender = "bob@foo.bar"
body = """Call me ASAP, please.This is about the last changes you deployed.
Thanks in advance,
some long text here which doesn't seem to be a signature at all
Bob"""
text, extracted_signature = signature.extract(body, sender)
eq_('\n'.join(body.splitlines()[:-1]), text)
eq_('Bob', extracted_signature)
body = """Thanks David,
some *long* text here which doesn't seem to be a signature at all
"""
eq_((body, None), signature.extract(body, "david@example.com"))
def test_basic():
msg_body = 'Blah\r\n--\r\n\r\nSergey Obukhov'
eq_(('Blah', '--\r\n\r\nSergey Obukhov'),
signature.extract(msg_body, 'Sergey'))
def test_over_2_text_lines_after_signature():
body = """Blah
Bob,
If there are more than
2 non signature lines in the end
It's not signature
"""
text, extracted_signature = signature.extract(body, "Bob")
eq_(extracted_signature, None)
def test_no_signature():
sender, body = "bob@foo.bar", "Hello"
eq_((body, None), signature.extract(body, sender))
def test_handles_unicode():
sender, body = dataset.parse_msg_sender(UNICODE_MSG)
text, extracted_signature = signature.extract(body, sender)
@patch.object(signature.extraction, 'has_signature')
def test_signature_extract_crash(has_signature):
has_signature.side_effect = Exception('Bam!')
msg_body = u'Blah\r\n--\r\n\r\nСергей'
eq_((msg_body, None), signature.extract(msg_body, 'Сергей'))
def test_mark_lines():
with patch.object(bruteforce, 'SIGNATURE_MAX_LINES', 2):
# we analyse the 2nd line as well, though it's the 6th line
# (counting from the bottom), because we don't count empty lines
eq_('ttset',
e._mark_lines(['Bob Smith',
'Bob Smith',
'Bob Smith',
'',
'some text'], 'Bob Smith'))
with patch.object(bruteforce, 'SIGNATURE_MAX_LINES', 3):
# we don't analyse the 1st line because
# a signature can't start from the 1st line
eq_('tset',
e._mark_lines(['Bob Smith',
'Bob Smith',
'',
'some text'], 'Bob Smith'))
def test_process_marked_lines():
# no signature found
eq_((range(5), None), e._process_marked_lines(range(5), 'telt'))
# signature in the middle of the text
eq_((range(9), None), e._process_marked_lines(range(9), 'tesestelt'))
# long line splits signature
eq_((range(7), [7, 8]),
e._process_marked_lines(range(9), 'tsslsless'))
eq_((range(20), [20]),
e._process_marked_lines(range(21), 'ttttttstttesllelelets'))
# some signature lines could be identified as text
eq_(([0], range(1, 9)), e._process_marked_lines(range(9), 'tsetetest'))
eq_(([], range(5)),
e._process_marked_lines(range(5), "ststt"))


@@ -0,0 +1,51 @@
# -*- coding: utf-8 -*-
from ... import *
import os
from PyML import SparseDataSet
from talon.utils import to_unicode
from talon.signature.learning import dataset as d
from talon.signature.learning.featurespace import features
def test_is_sender_filename():
assert_false(d.is_sender_filename("foo/bar"))
assert_false(d.is_sender_filename("foo/bar_body"))
ok_(d.is_sender_filename("foo/bar_sender"))
def test_build_sender_filename():
eq_("foo/bar_sender", d.build_sender_filename("foo/bar_body"))
def test_parse_msg_sender():
sender, msg = d.parse_msg_sender(EML_MSG_FILENAME)
# if the message is in eml format
with open(EML_MSG_FILENAME) as f:
eq_(sender,
" Alex Q <xxx@yahoo.com>")
eq_(msg, f.read())
# if the message sender is stored in a separate file
sender, msg = d.parse_msg_sender(MSG_FILENAME_WITH_BODY_SUFFIX)
with open(MSG_FILENAME_WITH_BODY_SUFFIX) as f:
eq_(sender, u"john@example.com")
eq_(msg, f.read())
def test_build_extraction_dataset():
if os.path.exists(os.path.join(TMP_DIR, 'extraction.data')):
os.remove(os.path.join(TMP_DIR, 'extraction.data'))
d.build_extraction_dataset(os.path.join(EMAILS_DIR, 'P'),
os.path.join(TMP_DIR,
'extraction.data'), 1)
test_data = SparseDataSet(os.path.join(TMP_DIR, 'extraction.data'),
labelsColumn=-1)
# the result is a loadable signature extraction dataset
# 32 comes from 3 emails in emails/P folder, 11 lines checked to be
# a signature, one email has only 10 lines
eq_(test_data.size(), 32)
eq_(len(features('')), test_data.numFeatures)


@@ -0,0 +1,44 @@
# -*- coding: utf-8 -*-
from ... import *
from talon.signature.learning import featurespace as fs
def test_apply_features():
s = '''John Doe
VP Research and Development, Xxxx Xxxx Xxxxx
555-226-2345
john@example.com'''
sender = 'John <john@example.com>'
features = fs.features(sender)
result = fs.apply_features(s, features)
# note that we don't consider the first line because signatures don't
# usually take all the text, empty lines are not considered
eq_(result, [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
with patch.object(fs, 'SIGNATURE_MAX_LINES', 4):
features = fs.features(sender)
new_result = fs.apply_features(s, features)
# result remains the same because we don't consider empty lines
eq_(result, new_result)
def test_build_pattern():
s = '''John Doe
VP Research and Development, Xxxx Xxxx Xxxxx
555-226-2345
john@example.com'''
sender = 'John <john@example.com>'
features = fs.features(sender)
result = fs.build_pattern(s, features)
eq_(result, [2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1])


@@ -0,0 +1,236 @@
# -*- coding: utf-8 -*-
from ... import *
import regex as re
from talon.signature.learning import helpers as h
from talon.signature.learning.helpers import *
# First testing regex constants.
VALID = '''
15615552323
1-561-555-1212
5613333
18008793262
800-879-3262
0-800.879.3262
04 3452488
04 -3452488
04 - 3452499
(610) 310-5555 x5555
533-1123
(021)1234567
(021)123456
(000)000000
+7 920 34 57 23
+7(920) 34 57 23
+7(920)345723
+7920345723
8920345723
21143
2-11-43
2 - 11 - 43
'''
VALID_PHONE_NUMBERS = [e.strip() for e in VALID.splitlines() if e.strip()]
def test_match_phone_numbers():
for phone in VALID_PHONE_NUMBERS:
ok_(RE_RELAX_PHONE.match(phone), "{} should be matched".format(phone))
def test_match_names():
names = ['John R. Doe']
for name in names:
ok_(RE_NAME.match(name), "{} should be matched".format(name))
def test_sender_with_name():
ok_lines = ['Sergey Obukhov <serobnic@example.com>',
'\tSergey <serobnic@example.com>',
('"Doe, John (TX)"'
'<DowJ@example.com>@EXAMPLE'
'<IMCEANOTES-+22Doe+2C+20John+20'
'+28TX+29+22+20+3CDoeJ+40example+2Ecom+3E'
'+40EXAMPLE@EXAMPLE.com>'),
('Company Sleuth <csleuth@email.xxx.com>'
'@EXAMPLE <XXX-Company+20Sleuth+20+3Ccsleuth'
'+40email+2Exxx+2Ecom+3E+40EXAMPLE@EXAMPLE.com>'),
('Doe III, John '
'</O=EXAMPLE/OU=NA/CN=RECIPIENTS/CN=jDOE5>')]
for line in ok_lines:
ok_(RE_SENDER_WITH_NAME.match(line),
'{} should be matched'.format(line))
nok_lines = ['', '<serobnic@xxx.ru>', 'Sergey serobnic@xxx.ru']
for line in nok_lines:
assert_false(RE_SENDER_WITH_NAME.match(line),
'{} should not be matched'.format(line))
# Now test helper functions
def test_binary_regex_search():
eq_(1, h.binary_regex_search(re.compile("12"))("12"))
eq_(0, h.binary_regex_search(re.compile("12"))("34"))
def test_binary_regex_match():
eq_(1, h.binary_regex_match(re.compile("12"))("12 3"))
eq_(0, h.binary_regex_match(re.compile("12"))("3 12"))
def test_flatten_list():
eq_([1, 2, 3, 4, 5], h.flatten_list([[1, 2], [3, 4, 5]]))
@patch.object(h.re, 'compile')
def test_contains_sender_names(re_compile):
with patch.object(h, 'extract_names',
Mock(return_value=['bob', 'smith'])) as extract_names:
has_sender_names = h.contains_sender_names("bob.smith@example.com")
extract_names.assert_called_with("bob.smith@example.com")
for name in ["bob", "Bob", "smith", "Smith"]:
ok_(has_sender_names(name))
extract_names.return_value = ''
has_sender_names = h.contains_sender_names("bob.smith@example.com")
# if no names could be extracted fallback to the email address
ok_(has_sender_names('bob.smith@example.com'))
# don't crash if there is no sender
extract_names.return_value = ''
has_sender_names = h.contains_sender_names("")
assert_false(has_sender_names(''))
def test_extract_names():
senders_names = {
# from example dataset
('Jay Rickerts <eCenter@example.com>@EXAMPLE <XXX-Jay+20Rickerts'
'+20+3CeCenter+40example+2Ecom+3E+40EXAMPLE@EXAMPLE.com>'):
['Jay', 'Rickerts'],
# if `,` is used in sender's name
'Williams III, Bill </O=EXAMPLE/OU=NA/CN=RECIPIENTS/CN=BWILLIA5>':
['Williams', 'III', 'Bill'],
# if somehow `'` or `"` are used in sender's name
'Laura" "Goldberg <laura.goldberg@example.com>':
['Laura', 'Goldberg'],
# extract from sender's email address
'<sergey@xxx.ru>': ['sergey'],
# extract from sender's email address
# if dots are used in the email address
'<sergey.obukhov@xxx.ru>': ['sergey', 'obukhov'],
# extract from sender's email address
# if dashes are used in the email address
'<sergey-obukhov@xxx.ru>': ['sergey', 'obukhov'],
# extract from sender's email address
# if `_` are used in the email address
'<sergey_obukhov@xxx.ru>': ['sergey', 'obukhov'],
# old style From field, found in jangada dataset
'wcl@example.com (Wayne Long)': ['Wayne', 'Long'],
# if only sender's name provided
'Wayne Long': ['Wayne', 'Long'],
# if middle name is shortened with dot
'Sergey N. Obukhov <serobnic@xxx.ru>': ['Sergey', 'Obukhov'],
# not only spaces could be used as name splitters
' Sergey Obukhov <serobnic@xxx.ru>': ['Sergey', 'Obukhov'],
# finally normal example
'Sergey <serobnic@xxx.ru>': ['Sergey'],
# if middle name is shortened with `,`
'Sergey N, Obukhov': ['Sergey', 'Obukhov'],
# if mailto used with email address and sender's name is specified
'Sergey N, Obukhov [mailto: serobnic@xxx.ru]': ['Sergey', 'Obukhov'],
# when only email address is given
'serobnic@xxx.ru': ['serobnic'],
# when nothing is given
'': [],
# if phone is specified in the `From:` header
'wcl@example.com (Wayne Long +7 920 -256 - 35-09)': ['Wayne', 'Long'],
# from crash reports `nothing to repeat`
'* * * * <the_pod1@example.com>': ['the', 'pod'],
'"**Bobby B**" <copymycashsystem@example.com>':
['Bobby', 'copymycashsystem'],
# from crash reports `bad escape`
'"M Ali B Azlan \(GHSE/PETH\)" <aliazlan@example.com>':
['Ali', 'Azlan'],
('"Ridthauddin B A Rahim \(DD/PCSB\)"'
' <ridthauddin_arahim@example.com>'): ['Ridthauddin', 'Rahim'],
('"Boland, Patrick \(Global Xxx Group, Ireland \)"'
' <Patrick.Boland@example.com>'): ['Boland', 'Patrick'],
'"Mates Rate \(Wine\)" <amen@example.com.com>':
['Mates', 'Rate', 'Wine'],
('"Morgan, Paul \(Business Xxx RI, Xxx Xxx Group\)"'
' <paul.morgan@example.com>'): ['Morgan', 'Paul'],
'"David DECOSTER \(Domicile\)" <decosterdavid@xxx.be>':
['David', 'DECOSTER', 'Domicile']
}
for sender, expected_names in senders_names.items():
extracted_names = h.extract_names(sender)
# check that extracted names could be compiled
try:
re.compile("|".join(extracted_names))
except Exception as e:
ok_(False, ("Failed to compile extracted names {}"
"\n\nReason: {}").format(extracted_names, e))
if expected_names:
for name in expected_names:
assert_in(name, extracted_names)
else:
eq_(expected_names, extracted_names)
# words like `ru`, `gmail`, `com`, `org`, etc. are not considered
# sender's names
for word in h.BAD_SENDER_NAMES:
eq_(h.extract_names(word), [])
# duplicates are not allowed
eq_(h.extract_names("sergey <sergey@example.com>"), ["sergey"])
def test_categories_percent():
eq_(0.0, h.categories_percent("qqq ggg hhh", ["Po"]))
eq_(50.0, h.categories_percent("q,w.", ["Po"]))
eq_(0.0, h.categories_percent("qqq ggg hhh", ["Nd"]))
eq_(50.0, h.categories_percent("q5", ["Nd"]))
eq_(50.0, h.categories_percent("s.s,5s", ["Po", "Nd"]))
eq_(0.0, h.categories_percent("", ["Po", "Nd"]))
@patch.object(h, 'categories_percent')
def test_punctuation_percent(categories_percent):
h.punctuation_percent("qqq")
categories_percent.assert_called_with("qqq", ['Po'])
def test_capitalized_words_percent():
eq_(0.0, h.capitalized_words_percent(''))
eq_(100.0, h.capitalized_words_percent('Example Corp'))
eq_(50.0, h.capitalized_words_percent('Qqq qqq QQQ 123 sss'))
eq_(100.0, h.capitalized_words_percent('Cell 713-444-7368'))
eq_(100.0, h.capitalized_words_percent('8th Floor'))
eq_(0.0, h.capitalized_words_percent('(212) 230-9276'))
def test_has_signature():
ok_(h.has_signature('sender', 'sender@example.com'))
ok_(h.has_signature('http://www.example.com\n555 555 5555',
'sender@example.com'))
ok_(h.has_signature('http://www.example.com\naddress@example.com',
'sender@example.com'))
assert_false(h.has_signature('http://www.example.com/555-555-5555',
'sender@example.com'))
long_line = 'q' * 28
assert_false(h.has_signature(long_line + ' sender', 'sender@example.com'))
# won't crash on an empty string
assert_false(h.has_signature('', ''))
# don't consider empty strings when analysing signature
with patch.object(h, 'SIGNATURE_MAX_LINES', 1):
ok_(h.has_signature('sender\n\n', 'sender@example.com'))


@@ -0,0 +1,534 @@
# -*- coding: utf-8 -*-
from . import *
from . fixtures import *
import os
from flanker import mime
from talon import quotations
@patch.object(quotations, 'MAX_LINES_COUNT', 1)
def test_too_many_lines():
msg_body = """Test reply
-----Original Message-----
Test"""
eq_(msg_body, quotations.extract_from_plain(msg_body))
def test_pattern_on_date_somebody_wrote():
msg_body = """Test reply
On 11-Apr-2011, at 6:54 PM, Roman Tkachenko <romant@example.com> wrote:
>
> Test
>
> Roman"""
eq_("Test reply", quotations.extract_from_plain(msg_body))
def test_pattern_on_date_somebody_wrote_date_with_slashes():
msg_body = """Test reply
On 04/19/2011 07:10 AM, Roman Tkachenko wrote:
>
> Test.
>
> Roman"""
eq_("Test reply", quotations.extract_from_plain(msg_body))
def test_pattern_on_date_somebody_wrote_allows_space_in_front():
msg_body = """Thanks Thanmai
On Mar 8, 2012 9:59 AM, "Example.com" <
r+7f1b094ceb90e18cca93d53d3703feae@example.com> wrote:
>**
> Blah-blah-blah"""
eq_("Thanks Thanmai", quotations.extract_from_plain(msg_body))
def test_pattern_on_date_somebody_sent():
msg_body = """Test reply
On 11-Apr-2011, at 6:54 PM, Roman Tkachenko <romant@example.com> sent:
>
> Test
>
> Roman"""
eq_("Test reply", quotations.extract_from_plain(msg_body))
def test_line_starts_with_on():
msg_body = """Blah-blah-blah
On blah-blah-blah"""
eq_(msg_body, quotations.extract_from_plain(msg_body))
def test_reply_and_quotation_splitter_share_line():
# reply lines and 'On <date> <person> wrote:' splitter pattern
# are on the same line
msg_body = """reply On Wed, Apr 4, 2012 at 3:59 PM, bob@example.com wrote:
> Hi"""
eq_('reply', quotations.extract_from_plain(msg_body))
# test pattern '--- On <date> <person> wrote:' with reply text on
# the same line
msg_body = """reply--- On Wed, Apr 4, 2012 at 3:59 PM, me@domain.com wrote:
> Hi"""
eq_('reply', quotations.extract_from_plain(msg_body))
# test pattern '--- On <date> <person> wrote:' with reply text containing
# '-' symbol
msg_body = """reply
bla-bla - bla--- On Wed, Apr 4, 2012 at 3:59 PM, me@domain.com wrote:
> Hi"""
reply = """reply
bla-bla - bla"""
eq_(reply, quotations.extract_from_plain(msg_body))
def test_pattern_original_message():
msg_body = """Test reply
-----Original Message-----
Test"""
eq_("Test reply", quotations.extract_from_plain(msg_body))
msg_body = """Test reply
-----Original Message-----
Test"""
eq_("Test reply", quotations.extract_from_plain(msg_body))
def test_reply_after_quotations():
msg_body = """On 04/19/2011 07:10 AM, Roman Tkachenko wrote:
>
> Test
Test reply"""
eq_("Test reply", quotations.extract_from_plain(msg_body))
def test_reply_wraps_quotations():
msg_body = """Test reply
On 04/19/2011 07:10 AM, Roman Tkachenko wrote:
>
> Test
Regards, Roman"""
reply = """Test reply
Regards, Roman"""
eq_(reply, quotations.extract_from_plain(msg_body))
def test_reply_wraps_nested_quotations():
msg_body = """Test reply
On 04/19/2011 07:10 AM, Roman Tkachenko wrote:
>Test test
>On 04/19/2011 07:10 AM, Roman Tkachenko wrote:
>
>>
>> Test.
>>
>> Roman
Regards, Roman"""
reply = """Test reply
Regards, Roman"""
eq_(reply, quotations.extract_from_plain(msg_body))
def test_quotation_separator_takes_2_lines():
msg_body = """Test reply
On Fri, May 6, 2011 at 6:03 PM, Roman Tkachenko from Hacker News
<roman@definebox.com> wrote:
> Test.
>
> Roman
Regards, Roman"""
reply = """Test reply
Regards, Roman"""
eq_(reply, quotations.extract_from_plain(msg_body))
def test_quotation_separator_takes_3_lines():
msg_body = """Test reply
On Nov 30, 2011, at 12:47 PM, Somebody <
416ffd3258d4d2fa4c85cfa4c44e1721d66e3e8f4@somebody.domain.com>
wrote:
Test message
"""
eq_("Test reply", quotations.extract_from_plain(msg_body))
def test_short_quotation():
msg_body = """Hi
On 04/19/2011 07:10 AM, Roman Tkachenko wrote:
> Hello"""
eq_("Hi", quotations.extract_from_plain(msg_body))


def test_pattern_date_email_with_unicode():
    msg_body = """Replying ok
2011/4/7 Nathan \xd0\xb8ova <support@example.com>
> Cool beans, scro"""
    eq_("Replying ok", quotations.extract_from_plain(msg_body))


def test_pattern_from_block():
    msg_body = """Allo! Follow up MIME!
From: somebody@example.com
Sent: March-19-11 5:42 PM
To: Somebody
Subject: The manager has commented on your Loop
Blah-blah-blah
"""
    eq_("Allo! Follow up MIME!", quotations.extract_from_plain(msg_body))


def test_quotation_marker_false_positive():
    msg_body = """Visit us now for assistance...
>>> >>> http://www.domain.com <<<
Visit our site by clicking the link above"""
    eq_(msg_body, quotations.extract_from_plain(msg_body))


def test_link_closed_with_quotation_marker_on_new_line():
    msg_body = '''8.45am-1pm
From: somebody@example.com
<http://email.example.com/c/dHJhY2tpbmdfY29kZT1mMDdjYzBmNzM1ZjYzMGIxNT
> <bob@example.com <mailto:bob@example.com> >
Requester: '''
    eq_('8.45am-1pm', quotations.extract_from_plain(msg_body))


def test_link_breaks_quotation_markers_sequence():
    # link starts and ends on the same line
    msg_body = """Blah
On Thursday, October 25, 2012 at 3:03 PM, life is short. on Bob wrote:
>
> Post a response by replying to this email
>
(http://example.com/c/YzOTYzMmE) >
> life is short. (http://example.com/c/YzMmE)
>
"""
    eq_("Blah", quotations.extract_from_plain(msg_body))

    # link starts after some text on one line and ends on another
    msg_body = """Blah
On Monday, 24 September, 2012 at 3:46 PM, bob wrote:
> [Ticket #50] test from bob
>
> View ticket (http://example.com/action
_nonce=3dd518)
>
"""
    eq_("Blah", quotations.extract_from_plain(msg_body))


def test_from_block_starts_with_date():
    msg_body = """Blah
Date: Wed, 16 May 2012 00:15:02 -0600
To: klizhentas@example.com"""
    eq_('Blah', quotations.extract_from_plain(msg_body))


def test_bold_from_block():
    msg_body = """Hi
*From:* bob@example.com [mailto:
bob@example.com]
*Sent:* Wednesday, June 27, 2012 3:05 PM
*To:* travis@example.com
*Subject:* Hello
"""
    eq_("Hi", quotations.extract_from_plain(msg_body))


def test_weird_date_format_in_date_block():
    msg_body = """Blah
Date: Fri=2C 28 Sep 2012 10:55:48 +0000
From: tickets@example.com
To: bob@example.com
Subject: [Ticket #8] Test
"""
    eq_('Blah', quotations.extract_from_plain(msg_body))


def test_dont_parse_quotations_for_forwarded_messages():
    msg_body = """FYI
---------- Forwarded message ----------
From: bob@example.com
Date: Tue, Sep 4, 2012 at 1:35 PM
Subject: Two
line subject
To: rob@example.com
Text"""
    eq_(msg_body, quotations.extract_from_plain(msg_body))


def test_forwarded_message_in_quotations():
    msg_body = """Blah
-----Original Message-----
FYI
---------- Forwarded message ----------
From: bob@example.com
Date: Tue, Sep 4, 2012 at 1:35 PM
Subject: Two
line subject
To: rob@example.com
"""
    eq_("Blah", quotations.extract_from_plain(msg_body))


def test_mark_message_lines():
    # e - empty line
    # s - splitter line
    # m - line starting with quotation marker '>'
    # t - the rest
    lines = ['Hello', '',
             # next line should be marked as splitter
             '_____________',
             'From: foo@bar.com',
             '',
             '> Hi',
             '',
             'Signature']
    eq_('tessemet', quotations.mark_message_lines(lines))

    lines = ['Just testing the email reply',
             '',
             'Robert J Samson',
             'Sent from my iPhone',
             '',
             # all 3 next lines should be marked as splitters
             'On Nov 30, 2011, at 12:47 PM, Skapture <',
             ('416ffd3258d4d2fa4c85cfa4c44e1721d66e3e8f4'
              '@skapture-staging.mailgun.org>'),
             'wrote:',
             '',
             'Tarmo Lehtpuu has posted the following message on']
    eq_('tettessset', quotations.mark_message_lines(lines))
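

# Illustration only, not talon's implementation: a toy marker function that
# mimics the alphabet described in test_mark_message_lines (e, s, m, t).
# The name `toy_mark_lines` and its single-line splitter pattern are
# assumptions made for this sketch. The real quotations.mark_message_lines
# also recognizes splitters that span several lines, which this sketch does
# not attempt.
import re

_TOY_SPLITTER = re.compile(r'^(_{5,}|-+\s*Original Message\s*-+|From:\s)')


def toy_mark_lines(lines):
    markers = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            markers.append('e')   # empty line
        elif stripped.startswith('>'):
            markers.append('m')   # line starting with quotation marker '>'
        elif _TOY_SPLITTER.match(stripped):
            markers.append('s')   # splitter line (single-line patterns only)
        else:
            markers.append('t')   # the rest
    return ''.join(markers)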


def test_process_marked_lines():
    # quotations and last message lines are mixed
    # consider all to be a last message
    markers = 'tsemmtetm'
    lines = [str(i) for i in range(len(markers))]
    eq_(lines, quotations.process_marked_lines(lines, markers))

    # no splitter => no markers
    markers = 'tmm'
    lines = ['1', '2', '3']
    eq_(['1', '2', '3'], quotations.process_marked_lines(lines, markers))

    # text after splitter without markers is quotation
    markers = 'tst'
    lines = ['1', '2', '3']
    eq_(['1'], quotations.process_marked_lines(lines, markers))

    # message + quotation + signature
    markers = 'tsmt'
    lines = ['1', '2', '3', '4']
    eq_(['1', '4'], quotations.process_marked_lines(lines, markers))

    # message + <quotation without markers> + nested quotation
    markers = 'tstsmt'
    lines = ['1', '2', '3', '4', '5', '6']
    eq_(['1'], quotations.process_marked_lines(lines, markers))

    # test links wrapped in parentheses
    # link starts on the marker line
    markers = 'tsmttem'
    lines = ['text',
             'splitter',
             '>View (http://example.com',
             '/abc',
             ')',
             '',
             '> quote']
    eq_(lines[:1], quotations.process_marked_lines(lines, markers))

    # link starts on the new line
    markers = 'tmmmtm'
    lines = ['text',
             '>',
             '>',
             '>',
             '(http://example.com) > ',
             '> life is short. (http://example.com) '
             ]
    eq_(lines[:1], quotations.process_marked_lines(lines, markers))

    # check all "inline" replies
    markers = 'tsmtmtm'
    lines = ['text',
             'splitter',
             '>',
             '(http://example.com)',
             '>',
             'inline reply',
             '>']
    eq_(lines, quotations.process_marked_lines(lines, markers))

    # inline reply with link not wrapped in parentheses
    markers = 'tsmtm'
    lines = ['text',
             'splitter',
             '>',
             'inline reply with link http://example.com',
             '>']
    eq_(lines, quotations.process_marked_lines(lines, markers))

    # inline reply with link wrapped in parentheses
    markers = 'tsmtm'
    lines = ['text',
             'splitter',
             '>',
             'inline reply (http://example.com)',
             '>']
    eq_(lines, quotations.process_marked_lines(lines, markers))


def test_preprocess():
    msg = ('Hello\n'
           'See <http://google.com\n'
           '> for more\n'
           'information On Nov 30, 2011, at 12:47 PM, Somebody <\n'
           '416ffd3258d4d2fa4c85cfa4c44e1721d66e3e8f4\n'
           '@example.com>'
           'wrote:\n'
           '\n'
           '> Hi')

    # test the link is rewritten
    # 'On <date> <person> wrote:' pattern starts from a new line
    prepared_msg = ('Hello\n'
                    'See @@http://google.com\n'
                    '@@ for more\n'
                    'information\n'
                    ' On Nov 30, 2011, at 12:47 PM, Somebody <\n'
                    '416ffd3258d4d2fa4c85cfa4c44e1721d66e3e8f4\n'
                    '@example.com>'
                    'wrote:\n'
                    '\n'
                    '> Hi')
    eq_(prepared_msg, quotations.preprocess(msg, '\n'))

    msg = """
> <http://teemcl.mailgun.org/u/**aD1mZmZiNGU5ODQwMDNkZWZlMTExNm**
> MxNjQ4Y2RmOTNlMCZyPXNlcmdleS5v**YnlraG92JTQwbWFpbGd1bmhxLmNvbS**
> Z0PSUyQSZkPWUwY2U<http://example.org/u/aD1mZmZiNGU5ODQwMDNkZWZlMTExNmMxNjQ4Y>
"""
    eq_(msg, quotations.preprocess(msg, '\n'))

    # 'On <date> <person> wrote' shouldn't be spread across too many lines
    msg = ('Hello\n'
           'How are you? On Nov 30, 2011, at 12:47 PM,\n '
           'Example <\n'
           '416ffd3258d4d2fa4c85cfa4c44e1721d66e3e8f4\n'
           '@example.org>'
           'wrote:\n'
           '\n'
           '> Hi')
    eq_(msg, quotations.preprocess(msg, '\n'))

    msg = ('Hello On Nov 30, smb wrote:\n'
           'Hi\n'
           'On Nov 29, smb wrote:\n'
           'hi')
    prepared_msg = ('Hello\n'
                    ' On Nov 30, smb wrote:\n'
                    'Hi\n'
                    'On Nov 29, smb wrote:\n'
                    'hi')
    eq_(prepared_msg, quotations.preprocess(msg, '\n'))
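

# Illustration only: a minimal sketch of the link rewriting that
# test_preprocess checks. It assumes links look like '<http://...>' and that
# a link on a line already starting with '>' is left untouched.
# `toy_rewrite_links` is a hypothetical helper, not part of talon, and it
# does not cover the 'On <date> <person> wrote:' line splitting.
import re

_TOY_LINK = re.compile(r'<(http://[^>]*)>')


def toy_rewrite_links(msg_body):
    def wrap(match):
        # locate the start of the line that contains the opening '<'
        line_start = msg_body.rfind('\n', 0, match.start()) + 1
        if msg_body[line_start:line_start + 1] == '>':
            return match.group()           # already a quoted line: keep as is
        # replace '<' and '>' so the closing '>' cannot look like a marker
        return '@@%s@@' % match.group(1)
    return _TOY_LINK.sub(wrap, msg_body)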


def test_preprocess_postprocess_2_links():
    msg_body = "<http://link1> <http://link2>"
    eq_(msg_body, quotations.extract_from_plain(msg_body))


def test_standard_replies():
    for filename in os.listdir(STANDARD_REPLIES):
        filename = os.path.join(STANDARD_REPLIES, filename)
        if os.path.isdir(filename):
            continue
        with open(filename) as f:
            msg = f.read()
            m = mime.from_string(msg)
            for part in m.walk():
                if part.content_type == 'text/plain':
                    text = part.body
                    stripped_text = quotations.extract_from_plain(text)
                    reply_text_fn = filename[:-4] + '_reply_text'
                    if os.path.isfile(reply_text_fn):
                        with open(reply_text_fn) as f:
                            reply_text = f.read()
                    else:
                        reply_text = 'Hello'
                    eq_(reply_text, stripped_text,
                        "'%(reply)s' != %(stripped)s for %(fn)s" %
                        {'reply': reply_text, 'stripped': stripped_text,
                         'fn': filename})

9
tests/utils_test.py Normal file

@@ -0,0 +1,9 @@
from . import *

from talon import utils


def test_get_delimiter():
    eq_('\r\n', utils.get_delimiter('abc\r\n123'))
    eq_('\n', utils.get_delimiter('abc\n123'))
    eq_('\n', utils.get_delimiter('abc'))
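

# Illustration only: one way the behaviour asserted in test_get_delimiter
# could be implemented, assuming the first line break found in the message
# decides the delimiter and '\n' is the fallback when there is none.
# `toy_get_delimiter` is a hypothetical name, not talon's API.
import re


def toy_get_delimiter(msg_body):
    found = re.search(r'\r?\n', msg_body)
    return found.group() if found else '\n'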