Merge pull request #62 from tgwizard/better-support-for-scandinavian-languages

Add better support for Scandinavian languages
Merge pull request #65 from mailgun/sergey/cssselect
2015-10-14 21:48:10 -07:00 · 2015-10-14 20:34:02 -07:00 · 2015-10-14 20:31:26 -07:00 · 2015-10-14 12:38:06 -07:00 · 2015-09-21 21:42:01 +02:00 · 2015-09-21 21:33:57 +02:00
35 changed files with 689 additions and 381 deletions
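The Scandinavian-language support in this change set amounts to building the "On <date>, <sender> wrote:" splitter from per-language word lists instead of hard-coding English. The sketch below illustrates the idea under some assumptions: it uses the standard `re` module and a shortened word list, and `LINE_STARTS`, `LINE_ENDS`, and `find_splitter` are illustrative names that do not exist in talon.

```python
import re

# Hypothetical, shortened word lists modelled on the ones in quotations.py.
LINE_STARTS = ('On', 'Le', 'Op', 'Am', 'På', 'Den')
LINE_ENDS = ('wrote', 'sent', 'a écrit', 'schreef', 'schrieb', 'skrev')

# "<start> <date>, <sender> <end>:", possibly spread over up to three lines.
RE_ON_DATE_SMB_WROTE_SKETCH = re.compile(
    '(-*[>]?[ ]?({0})[ ].*,(.*\n){{0,2}}.*({1}):?-*)'.format(
        '|'.join(LINE_STARTS), '|'.join(LINE_ENDS)))


def find_splitter(text):
    """Return the matched reply header, or None when nothing matches."""
    match = RE_ON_DATE_SMB_WROTE_SKETCH.search(text)
    return match.group(0) if match else None
```

Adding a language then only means appending its opening and closing words to the two tuples; the rest of the pattern stays untouched.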
--- a/.gitignore
+++ b/.gitignore
@@ -48,4 +48,7 @@ tramp
 *_archive
 # Trial temp
-_trial_temp
+_trial_temp
 # OSX
 .DS_Store
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -0,0 +1,9 @@
 recursive-include tests *
 recursive-include talon *
 recursive-exclude tests *.pyc *~
 recursive-exclude talon *.pyc *~
 include train.data
 include classifier
 include LICENSE
 include MANIFEST.in
 include README.rst
--- a/README.md
+++ b/README.md
@@ -1,97 +0,0 @@
 talon
 =====
 Mailgun library to extract message quotations and signatures.
 If you ever tried to parse message quotations or signatures you know that absense of any formatting standards in this area
 could make this task a nightmare. Hopefully this library will make your life much eathier. The name of the project is
 inspired by TALON - multipurpose robot  designed to perform missions ranging from reconnaissance to combat and operate in
 a number of hostile environments. That's what a good quotations and signature parser should be like :smile:
 Usage
 -----
 Here's how you initialize the library and extract a reply from a text message:
 ```python
 import talon
 from talon import quotations
 talon.init()
 text =  """Reply
 -----Original Message-----
 Quote"""
 reply = quotations.extract_from(text, 'text/plain')
 reply = quotations.extract_from_plain(text)
 # reply == "Reply"
 ```
 To extract a reply from html:
 ```python
 html = """Reply
 <blockquote>
  <div>
    On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
  </div>
  <div>
    Quote
  </div>
 </blockquote>"""
 reply = quotations.extract_from(html, 'text/html')
 reply = quotations.extract_from_html(html)
 # reply == "<html><body><p>Reply</p></body></html>"
 ```
 Often the best way is the easiest one. Here's how you can extract signature from email message without any
 machine learning fancy stuff:
 ```python
 from talon.signature.bruteforce import extract_signature
 message = """Wow. Awesome!
 --
 Bob Smith"""
 text, signature = extract_signature(message)
 # text == "Wow. Awesome!"
 # signature == "--\nBob Smith"
 ```
 Quick and works like a charm 90% of the time. For other 10% you can use the power of machine learning algorithms:
 ```python
 from talon import signature
 message = """Thanks Sasha, I can't go any higher and is why I limited it to the
 homepage.
 John Doe
 via mobile"""
 text, signature = signature.extract(message, sender='john.doe@example.com')
 # text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
 # signature == "John Doe\nvia mobile"
 ```
 For machine learning talon currently uses [PyML](http://pyml.sourceforge.net/) library to build SVM classifiers. The core of machine learning algorithm lays in ``talon.signature.learning package``. It defines a set of features to apply to a message (``featurespace.py``), how data sets are built (``dataset.py``), classifier's interface (``classifier.py``).
 The data used for training is taken from our personal email conversations and from [ENRON](https://www.cs.cmu.edu/~enron/) dataset. As a result of applying our set of features to the dataset we provide files ``classifier`` and ``train.data`` that don't have any personal information but could be used to load trained classifier. Those files should be regenerated every time the feature/data set is changed.
 Research
 --------
 The library is inspired by the following research papers and projects:
 * http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
 * http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
--- a/README.rst
+++ b/README.rst
@@ -0,0 +1,128 @@
 talon
 =====
 Mailgun library to extract message quotations and signatures.
 If you have ever tried to parse message quotations or signatures, you know that the absence of any formatting standards in this area can make the task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON, a multipurpose robot designed to perform missions ranging from reconnaissance to combat and to operate in a number of hostile environments. That’s what a good quotations and signature parser should be like :smile:
 Usage
 -----
 Here’s how you initialize the library and extract a reply from a text
 message:
 .. code:: python
    import talon
    from talon import quotations
    talon.init()
    text =  """Reply
    -----Original Message-----
    Quote"""
    reply = quotations.extract_from(text, 'text/plain')
    reply = quotations.extract_from_plain(text)
    # reply == "Reply"
 To extract a reply from HTML:
 .. code:: python
    html = """Reply
    <blockquote>
      <div>
        On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
      </div>
      <div>
        Quote
      </div>
    </blockquote>"""
    reply = quotations.extract_from(html, 'text/html')
    reply = quotations.extract_from_html(html)
    # reply == "<html><body><p>Reply</p></body></html>"
 Often the best way is the easiest one. Here’s how you can extract a
 signature from an email message without any
 fancy machine learning:
 .. code:: python
    from talon.signature.bruteforce import extract_signature
    message = """Wow. Awesome!
    --
    Bob Smith"""
    text, signature = extract_signature(message)
    # text == "Wow. Awesome!"
    # signature == "--\nBob Smith"
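To give an intuition for what a delimiter-based approach does, here is a deliberately simplified sketch. It is not talon's implementation, which applies more patterns than this; `split_on_dash_delimiter` is a hypothetical helper written only for this illustration.

```python
def split_on_dash_delimiter(message):
    """Split a message at the last line that consists only of dashes.

    Illustrative only: returns (text, signature), or (message, None)
    when no dashed delimiter line is found.
    """
    lines = message.splitlines()
    # Scan from the bottom, because a signature sits at the end of a message.
    for index in range(len(lines) - 1, -1, -1):
        stripped = lines[index].strip()
        if len(stripped) >= 2 and set(stripped) == {'-'}:
            text = '\n'.join(lines[:index]).rstrip()
            signature = '\n'.join(lines[index:])
            return text, signature
    return message, None
```

Called on the message above, this sketch yields the same `text` and `signature` values as shown in the comments.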
 This is quick and works like a charm 90% of the time. For the other 10% you can use
 the power of machine learning algorithms:
 .. code:: python
    import talon
    # don't forget to init the library first
    # it loads machine learning classifiers
    talon.init()
    from talon import signature
    message = """Thanks Sasha, I can't go any higher and is why I limited it to the
    homepage.
    John Doe
    via mobile"""
    text, signature = signature.extract(message, sender='john.doe@example.com')
    # text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
    # signature == "John Doe\nvia mobile"
 For machine learning talon currently uses the `scikit-learn`_ library to build SVM
 classifiers. The core of the machine learning algorithm lies in the
 ``talon.signature.learning`` package. It defines a set of features to
 apply to a message (``featurespace.py``), how data sets are built
 (``dataset.py``), and the classifier’s interface (``classifier.py``).
 The data used for training is taken from our personal email
 conversations and from the `ENRON`_ dataset. As a result of applying our set
 of features to the dataset we provide the files ``classifier`` and
 ``train.data``, which don’t contain any personal information but can be
 used to load a trained classifier. Those files should be regenerated every
 time the feature/data set is changed.
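The idea of "applying a set of features to a message" can be sketched roughly as follows. This is an assumption-laden illustration, not the code in ``featurespace.py``: the three features, the threshold of 60 characters, and the `max_lines=3` default are stand-ins chosen for the example.

```python
import re

# Hypothetical stand-ins for feature functions: each maps a line to 0 or 1.
FEATURES = [
    lambda line: 1 if len(line) > 60 else 0,             # line is too long
    lambda line: 1 if re.search(r'\S@\S', line) else 0,  # line contains an email
    lambda line: 1 if re.search(r'https?://|www\.\S+\.\S', line) else 0,  # url
]


def apply_features(body, features, max_lines=3):
    """Apply every feature to the last non-empty lines, bottom line first."""
    lines = [line for line in body.splitlines() if line.strip()]
    last_lines = lines[-max_lines:][::-1]
    return [[feature(line) for feature in features] for line in last_lines]


def build_pattern(body, features):
    """Sum the per-line results into one vector: a point in feature space."""
    line_patterns = apply_features(body, features)
    return [sum(column) for column in zip(*line_patterns)]
```

Each element of the resulting vector counts how often one feature fired in the last lines of the body, and it is vectors of this kind that a classifier is trained on.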
 To regenerate the model files, you can run
 .. code:: sh
    python train.py
 or
 .. code:: python
    from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
    from talon.signature.learning.classifier import train, init
    train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)
 .. _scikit-learn: http://scikit-learn.org
 .. _ENRON: https://www.cs.cmu.edu/~enron/
 Research
 --------
 The library is inspired by the following research papers and projects:
 -  http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
 -  http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,2 +0,0 @@
 [metadata]
 description-file = README.md
--- a/setup.py
+++ b/setup.py
@@ -1,16 +1,11 @@
 import os
 import sys
 import contextlib
 from distutils.spawn import find_executable
 from setuptools import setup, find_packages
 setup(name='talon',
-      version='1.0',
+      version='1.0.9',
      description=("Mailgun library "
                   "to extract message quotations and signatures."),
-      long_description=open("README.md").read(),
+      long_description=open("README.rst").read(),
      author='Mailgun Inc.',
      author_email='admin@mailgunhq.com',
      url='https://github.com/mailgun/talon',
@@ -19,87 +14,19 @@ setup(name='talon',
      include_package_data=True,
      zip_safe=True,
      install_requires=[
-          "lxml==2.3.3",
+          "lxml>=2.3.3",
-          "regex==0.1.20110315",
+          "regex>=1",
          "chardet==1.0.1",
          "dnspython==1.11.1",
          "html2text",
-          "nose==1.2.1",
+          "numpy",
          "scipy",
          "scikit-learn==0.16.1", # pickled versions of classifier, else rebuild
          'chardet>=1.0.1',
          'cchardet>=0.3.5',
          'cssselect'
          ],
      tests_require=[
          "mock",
          "nose>=1.2.1",
          "coverage"
          ]
      )
 def install_pyml():
    '''
    Downloads and installs PyML
    '''
    try:
        import PyML
    except:
        pass
    else:
        return
    # install numpy first
    pip('install numpy==1.6.1 --upgrade')
    pyml_tarball = (
        'http://09cce49df173f6f6e61f-fd6930021b51685920a6fa76529ee321'
        '.r45.cf2.rackcdn.com/PyML-0.7.9.tar.gz')
    pyml_srcidr = 'PyML-0.7.9'
    # see if PyML tarball needs to be fetched:
    if not dir_exists(pyml_srcidr):
        run("curl %s | tar -xz" % pyml_tarball)
    # compile&install:
    with cd(pyml_srcidr):
        python('setup.py build')
        python('setup.py install')
 def run(command):
    if os.system(command) != 0:
        raise Exception("Failed '{}'".format(command))
    else:
        return 0
 def python(command):
    command = '{} {}'.format(sys.executable, command)
    run(command)
 def enforce_executable(name, install_info):
    if os.system("which {}".format(name)) != 0:
        raise Exception(
            '{} utility is missing.\nTo install, run:\n\n{}\n'.format(
                name, install_info))
 def pip(command):
    command = '{} {}'.format(find_executable('pip'), command)
    run(command)
 def dir_exists(path):
    return os.path.isdir(path)
@contextlib.contextmanager
 def cd(directory):
    curdir = os.getcwd()
    try:
        os.chdir(directory)
        yield {}
    finally:
        os.chdir(curdir)
 if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] in ['develop', 'install']:
        enforce_executable('curl', 'sudo aptitude install curl')
        install_pyml()
--- a/talon/html_quotations.py
+++ b/talon/html_quotations.py
@@ -138,9 +138,10 @@ def cut_by_id(html_message):
 def cut_blockquote(html_message):
-    ''' Cuts blockquote with wrapping elements. '''
+    ''' Cuts the last non-nested blockquote with wrapping elements. '''
-    quote = html_message.find('.//blockquote')
+    quote = html_message.xpath('(.//blockquote)[not(ancestor::blockquote)][last()]')
-    if quote is not None:
+    if quote:
        quote = quote[0]
        quote.getparent().remove(quote)
        return True
--- a/talon/quotations.py
+++ b/talon/quotations.py
@@ -12,8 +12,7 @@ from copy import deepcopy
 from lxml import html, etree
 import html2text
-from talon.constants import RE_DELIMITER
+from talon.utils import get_delimiter
 from talon.utils import random_token, get_delimiter
 from talon import html_quotations
@@ -23,14 +22,65 @@ log = logging.getLogger(__name__)
 RE_FWD = re.compile("^[-]+[ ]*Forwarded message[ ]*[-]+$", re.I | re.M)
 RE_ON_DATE_SMB_WROTE = re.compile(
-    r'''
+    u'(-*[>]?[ ]?({0})[ ].*({1})(.*\n){{0,2}}.*({2}):?-*)'.format(
-    (
+        # Beginning of the line
-        -*  # could include dashes
+        u'|'.join((
-        [ ]?On[ ].*,  # date part ends with comma
+            # English
-        (.*\n){0,2}  # splitter takes 4 lines at most
+            'On',
-        .*(wrote|sent):
+            # French
            'Le',
            # Polish
            'W dniu',
            # Dutch
            'Op',
            # German
            'Am',
            # Norwegian
            u'På',
            # Swedish, Danish
            'Den',
        )),
        # Date and sender separator
        u'|'.join((
            # most languages separate date and sender address by comma
            ',',
            # polish date and sender address separator
            u'użytkownik'
        )),
        # Ending of the line
        u'|'.join((
            # English
            'wrote', 'sent',
            # French
            u'a écrit',
            # Polish
            u'napisał',
            # Dutch
            'schreef','verzond','geschreven',
            # German
            'schrieb',
            # Norwegian, Swedish
            'skrev',
        ))
    ))
 # Special case for languages where text is translated like this: 'on {date} wrote {somebody}:'
 RE_ON_DATE_WROTE_SMB = re.compile(
    u'(-*[>]?[ ]?({0})[ ].*(.*\n){{0,2}}.*({1})[ ]*.*:)'.format(
        # Beginning of the line
        u'|'.join((
            # Dutch
            'Op',
            # German
            'Am'
        )),
        # Ending of the line
        u'|'.join((
            # Dutch
            'schreef','verzond','geschreven',
            # German
            'schrieb'
        ))
    )
    )
    ''', re.VERBOSE)
 RE_QUOTATION = re.compile(
    r'''
@@ -66,13 +116,33 @@ RE_EMPTY_QUOTATION = re.compile(
    e*
    ''', re.VERBOSE)
 # ------Original Message------ or ---- Reply Message ----
 # With variations in other languages.
 RE_ORIGINAL_MESSAGE = re.compile(u'[\s]*[-]+[ ]*({})[ ]*[-]+'.format(
    u'|'.join((
        # English
        'Original Message', 'Reply Message',
        # German
        u'Ursprüngliche Nachricht', 'Antwort Nachricht',
        # Danish
        'Oprindelig meddelelse',
    ))), re.I)
 RE_FROM_COLON_OR_DATE_COLON = re.compile(u'(_+\r?\n)?[\s]*(:?[*]?{})[\s]?:[*]? .*'.format(
    u'|'.join((
        # "From" in different languages.
        'From', 'Van', 'De', 'Von', 'Fra', u'Från',
        # "Date" in different languages.
        'Date', 'Datum', u'Envoyé', 'Skickat', 'Sendt',
    ))), re.I)
 SPLITTER_PATTERNS = [
-    # ------Original Message------ or ---- Reply Message ----
+    RE_ORIGINAL_MESSAGE,
    re.compile("[\s]*[-]+[ ]*(Original|Reply) Message[ ]*[-]+", re.I),
    # <date> <person>
    re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.VERBOSE),
    RE_ON_DATE_SMB_WROTE,
-    re.compile('(_+\r?\n)?[\s]*(:?[*]?From|Date):[*]? .*'),
+    RE_ON_DATE_WROTE_SMB,
    RE_FROM_COLON_OR_DATE_COLON,
    re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?'
               '( \S+){3,6}@\S+:')
    ]
@@ -81,7 +151,7 @@ SPLITTER_PATTERNS = [
 RE_LINK = re.compile('<(http://[^>]*)>')
 RE_NORMALIZED_LINK = re.compile('@@(http://[^>@]*)@@')
-RE_PARANTHESIS_LINK = re.compile("\(https?://")
+RE_PARENTHESIS_LINK = re.compile("\(https?://")
 SPLITTER_MAX_LINES = 4
 MAX_LINES_COUNT = 1000
@@ -96,7 +166,7 @@ def extract_from(msg_body, content_type='text/plain'):
            return extract_from_plain(msg_body)
        elif content_type == 'text/html':
            return extract_from_html(msg_body)
-    except Exception, e:
+    except Exception:
        log.exception('ERROR extracting message')
    return msg_body
@@ -127,6 +197,7 @@ def mark_message_lines(lines):
        else:
            # in case splitter is spread across several lines
            splitter = is_splitter('\n'.join(lines[i:i + SPLITTER_MAX_LINES]))
            if splitter:
                # append as many splitter markers as lines in splitter
                splitter_lines = splitter.group().splitlines()
@@ -169,8 +240,8 @@ def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
        # long links could break sequence of quotation lines but they shouldn't
        # be considered an inline reply
        links = (
-            RE_PARANTHESIS_LINK.search(lines[inline_reply.start() - 1]) or
+            RE_PARENTHESIS_LINK.search(lines[inline_reply.start() - 1]) or
-            RE_PARANTHESIS_LINK.match(lines[inline_reply.start()].strip()))
+            RE_PARENTHESIS_LINK.match(lines[inline_reply.start()].strip()))
        if not links:
            return_flags[:] = [False, -1, -1]
            return lines
@@ -197,7 +268,7 @@ def preprocess(msg_body, delimiter, content_type='text/plain'):
    """Prepares msg_body for being stripped.
    Replaces link brackets so that they couldn't be taken for quotation marker.
-    Splits line in two if splitter pattern preceeded by some text on the same
+    Splits line in two if splitter pattern preceded by some text on the same
    line (done only for 'On <date> <person> wrote:' pattern).
    """
    # normalize links i.e. replace '<', '>' wrapping the link with some symbols
@@ -213,7 +284,7 @@ def preprocess(msg_body, delimiter, content_type='text/plain'):
    msg_body = re.sub(RE_LINK, link_wrapper, msg_body)
    def splitter_wrapper(splitter):
-        """Wrapps splitter with new line"""
+        """Wraps splitter with new line"""
        if splitter.start() and msg_body[splitter.start() - 1] != '\n':
            return '%s%s' % (delimiter, splitter.group())
        else:
@@ -239,12 +310,8 @@ def extract_from_plain(msg_body):
    delimiter = get_delimiter(msg_body)
    msg_body = preprocess(msg_body, delimiter)
    lines = msg_body.splitlines()
    # don't process too long messages
-    if len(lines) > MAX_LINES_COUNT:
+    lines = msg_body.splitlines()[:MAX_LINES_COUNT]
        return stripped_text
    markers = mark_message_lines(lines)
    lines = process_marked_lines(lines, markers)
@@ -254,7 +321,7 @@ def extract_from_plain(msg_body):
    return msg_body
-def extract_from_html(msg_body):
+def extract_from_html(s):
    """
    Extract not quoted message from provided html message body
    using tags and plain text algorithm.
@@ -268,11 +335,15 @@ def extract_from_html(msg_body):
    then converting html to text,
    then extracting quotations from text,
    then checking deleted checkpoints,
-    then deleting neccessary tags.
+    then deleting necessary tags.
    """
-    if msg_body.strip() == '':
+    if s.strip() == '':
-        return msg_body
+        return s
    # replace CRLF with LF temporarily otherwise CR will be converted to '&#13;'
    # when doing deepcopy on html tree
    msg_body, replaced = _CRLF_to_LF(s)
    html_tree = html.document_fromstring(
        msg_body,
@@ -289,7 +360,7 @@ def extract_from_html(msg_body):
    html_tree_copy = deepcopy(html_tree)
    number_of_checkpoints = html_quotations.add_checkpoint(html_tree, 0)
-    quotation_checkpoints = [False for i in xrange(number_of_checkpoints)]
+    quotation_checkpoints = [False] * number_of_checkpoints
    msg_with_checkpoints = html.tostring(html_tree)
    h = html2text.HTML2Text()
@@ -303,15 +374,12 @@ def extract_from_html(msg_body):
    plain_text = plain_text.replace('*', '')
    # Unmask saved star symbols
    plain_text = plain_text.replace('3423oorkg432', '*')
-
+    plain_text = preprocess(plain_text, '\n', content_type='text/html')
    delimiter = get_delimiter(plain_text)
    plain_text = preprocess(plain_text, delimiter, content_type='text/html')
    lines = plain_text.splitlines()
    # Don't process too long messages
    if len(lines) > MAX_LINES_COUNT:
-        return msg_body
+        return s
    # Collect checkpoints on each line
    line_checkpoints = [
@@ -336,9 +404,9 @@ def extract_from_html(msg_body):
                quotation_checkpoints[checkpoint] = True
    else:
        if cut_quotations:
-            return html.tostring(html_tree_copy)
+            return _restore_CRLF(html.tostring(html_tree_copy), replaced)
        else:
-            return msg_body
+            return s
    # Remove tags with quotation checkpoints
    html_quotations.delete_quotation_tags(
@@ -374,3 +442,37 @@ def register_xpath_extensions():
    ns.prefix = 'mg'
    ns['text_content'] = text_content
    ns['tail'] = tail
 def _restore_CRLF(s, replaced=True):
    """Restore CRLF if previously CRLF was replaced with LF
    >>> _restore_CRLF('a\nb')
    'a\r\nb'
    >>> _restore_CRLF('a\nb', replaced=False)
    'a\nb'
    """
    if replaced:
        return s.replace('\n', '\r\n')
    return s
 def _CRLF_to_LF(s):
    """Replace CRLF with LF
    >>> s, changed = _CRLF_to_LF('a\r\nb')
    >>> s
    'a\nb'
    >>> changed
    True
    >>> s, changed = _CRLF_to_LF('a\nb')
    >>> s
    'a\nb'
    >>> changed
    False
    """
    delimiter = get_delimiter(s)
    if delimiter == '\r\n':
        return s.replace(delimiter, '\n'), True
    return s, False
--- a/talon/signature/init.py
+++ b/talon/signature/init.py
@@ -21,11 +21,9 @@ trained against, don't forget to regenerate:
 """
 import os
 import sys
 from cStringIO import StringIO
 from . import extraction
-from . extraction import extract
+from . extraction import extract  #noqa
 from . learning import classifier
@@ -36,13 +34,5 @@ EXTRACTOR_DATA = os.path.join(DATA_DIR, 'train.data')
 def initialize():
-    try:
+    extraction.EXTRACTOR = classifier.load(EXTRACTOR_FILENAME,
-        # redirect output
+                                           EXTRACTOR_DATA)
        so, sys.stdout = sys.stdout, StringIO()
        extraction.EXTRACTOR = classifier.load(EXTRACTOR_FILENAME,
                                               EXTRACTOR_DATA)
        sys.stdout = so
    except Exception, e:
        raise Exception(
            "Failed initializing signature parsing with classifiers", e)
--- a/talon/signature/bruteforce.py
+++ b/talon/signature/bruteforce.py
@@ -49,7 +49,7 @@ RE_PHONE_SIGNATURE = re.compile(r'''
 # c - could be signature line
 # d - line starts with dashes (could be signature or list item)
 # l - long line
-RE_SIGNATURE_CANDIDAATE = re.compile(r'''
+RE_SIGNATURE_CANDIDATE = re.compile(r'''
    (?P<candidate>c+d)[^d]
    |
    (?P<candidate>c+d)$
@@ -184,5 +184,5 @@ def _process_marked_candidate_indexes(candidate, markers):
    >>> _process_marked_candidate_indexes([9, 12, 14, 15, 17], 'clddc')
    [15, 17]
    """
-    match = RE_SIGNATURE_CANDIDAATE.match(markers[::-1])
+    match = RE_SIGNATURE_CANDIDATE.match(markers[::-1])
    return candidate[-match.end('candidate'):] if match else []
--- a/talon/signature/data/classifier
+++ b/talon/signature/data/classifier
--- a/talon/signature/data/classifier_01.npy
+++ b/talon/signature/data/classifier_01.npy
--- a/talon/signature/data/classifier_02.npy
+++ b/talon/signature/data/classifier_02.npy
--- a/talon/signature/data/classifier_03.npy
+++ b/talon/signature/data/classifier_03.npy
--- a/talon/signature/data/classifier_04.npy
+++ b/talon/signature/data/classifier_04.npy
--- a/talon/signature/data/classifier_05.npy
+++ b/talon/signature/data/classifier_05.npy
--- a/talon/signature/extraction.py
+++ b/talon/signature/extraction.py
@@ -1,14 +1,10 @@
 # -*- coding: utf-8 -*-
 import os
 import logging
 import regex as re
-from PyML import SparseDataSet
+import numpy
 from talon.constants import RE_DELIMITER
 from talon.signature.constants import (SIGNATURE_MAX_LINES,
                                       TOO_LONG_SIGNATURE_LINE)
 from talon.signature.learning.featurespace import features, build_pattern
 from talon.utils import get_delimiter
 from talon.signature.bruteforce import get_signature_candidate
@@ -36,8 +32,8 @@ RE_REVERSE_SIGNATURE = re.compile(r'''
 def is_signature_line(line, sender, classifier):
    '''Checks if the line belongs to signature. Returns True or False.'''
-    data = SparseDataSet([build_pattern(line, features(sender))])
+    data = numpy.array(build_pattern(line, features(sender)))
-    return classifier.decisionFunc(data, 0) > 0
+    return classifier.predict(data) > 0
 def extract(body, sender):
@@ -61,7 +57,7 @@ def extract(body, sender):
                text = delimiter.join(text)
                if text.strip():
                    return (text, delimiter.join(signature))
-    except Exception, e:
+    except Exception:
        log.exception('ERROR when extracting signature with classifiers')
    return (body, None)
--- a/talon/signature/learning/classifier.py
+++ b/talon/signature/learning/classifier.py
@@ -5,32 +5,27 @@ The classifier could be used to detect if a certain line of the message
 body belongs to the signature.
 """
-import os
+from numpy import genfromtxt
-import sys
+from sklearn.svm import LinearSVC
-
+from sklearn.externals import joblib
 from PyML import SparseDataSet, SVM
 def init():
-    '''Inits classifier with optimal options.'''
+    """Inits classifier with optimal options."""
-    return SVM(C=10, optimization='liblinear')
+    return LinearSVC(C=10.0)
 def train(classifier, train_data_filename, save_classifier_filename=None):
-    '''Trains and saves classifier so that it could be easily loaded later.'''
+    """Trains and saves classifier so that it could be easily loaded later."""
-    data = SparseDataSet(train_data_filename, labelsColumn=-1)
+    file_data = genfromtxt(train_data_filename, delimiter=",")
-    classifier.train(data)
+    train_data, labels = file_data[:, :-1], file_data[:, -1]
    classifier.fit(train_data, labels)
    if save_classifier_filename:
-        classifier.save(save_classifier_filename)
+        joblib.dump(classifier, save_classifier_filename)
    return classifier
 def load(saved_classifier_filename, train_data_filename):
-    """Loads saved classifier.
+    """Loads saved classifier. """
-
+    return joblib.load(saved_classifier_filename)
    Classifier should be loaded with the same data it was trained against
    """
    train_data = SparseDataSet(train_data_filename, labelsColumn=-1)
    classifier = init()
    classifier.load(saved_classifier_filename, train_data)
    return classifier
--- a/talon/signature/learning/featurespace.py
+++ b/talon/signature/learning/featurespace.py
@@ -1,13 +1,14 @@
 # -*- coding: utf-8 -*-
-""" The module provides functions for convertion of a message body/body lines
+""" The module provides functions for conversion of a message body/body lines
 into classifiers features space.
 The body and the message sender string are converted into unicode before
 applying features to them.
 """
-from talon.signature.constants import SIGNATURE_MAX_LINES
+from talon.signature.constants import (SIGNATURE_MAX_LINES,
                                       TOO_LONG_SIGNATURE_LINE)
 from talon.signature.learning.helpers import *
@@ -20,7 +21,7 @@ def features(sender=''):
        # This one is not from paper.
        # Line is too long.
        # This one is less aggressive than `Line is too short`
-        lambda line: 1 if len(line) > 60 else 0,
+        lambda line: 1 if len(line) > TOO_LONG_SIGNATURE_LINE else 0,
        # Line contains email pattern.
        binary_regex_search(RE_EMAIL),
        # Line contains url.
@@ -47,9 +48,9 @@ def apply_features(body, features):
    '''Applies features to message body lines.
    Returns list of lists. Each of the lists corresponds to the body line
-    and is constituted by the numbers of features occurances (0 or 1).
+    and is constituted by the numbers of features occurrences (0 or 1).
    E.g. if element j of list i equals 1 this means that
-    feature j occured in line i (counting from the last line of the body).
+    feature j occurred in line i (counting from the last line of the body).
    '''
    # collect all non empty lines
    lines = [line for line in body.splitlines() if line.strip()]
@@ -66,7 +67,7 @@ def build_pattern(body, features):
    '''Converts body into a pattern i.e. a point in the features space.
    Applies features to the body lines and sums up the results.
-    Elements of the pattern indicate how many times a certain feature occured
+    Elements of the pattern indicate how many times a certain feature occurred
    in the last lines of the body.
    '''
    line_patterns = apply_features(body, features)
--- a/talon/signature/learning/helpers.py
+++ b/talon/signature/learning/helpers.py
@@ -16,8 +16,8 @@ from talon.signature.constants import SIGNATURE_MAX_LINES
 rc = re.compile
-RE_EMAIL = rc('@')
+RE_EMAIL = rc('\S@\S')
-RE_RELAX_PHONE = rc('.*(\(? ?[\d]{2,3} ?\)?.{,3}){2,}')
+RE_RELAX_PHONE = rc('(\(? ?[\d]{2,3} ?\)?.{,3}?){2,}')
 RE_URL = rc(r'''https?://|www\.[\S]+\.[\S]''')
 # Taken from:
@@ -40,14 +40,6 @@ RE_SIGNATURE_WORDS = rc(('(T|t)hank.*,|(B|b)est|(R|r)egards|'
 # Line contains a pattern like Vitor R. Carvalho or William W. Cohen.
 RE_NAME = rc('[A-Z][a-z]+\s\s?[A-Z][\.]?\s\s?[A-Z][a-z]+')
 # Pattern to match if e.g. 'Sender:' header field has sender names.
 SENDER_WITH_NAME_PATTERN = '([\s]*[\S]+,?)+[\s]*<.*>.*'
 RE_SENDER_WITH_NAME = rc(SENDER_WITH_NAME_PATTERN)
 # Reply line clue line endings, as in regular expression:
 # " wrote:$" or " writes:$"
 RE_CLUE_LINE_END = rc('.*(W|w)rotes?:$')
 INVALID_WORD_START = rc('\(|\+|[\d]')
 BAD_SENDER_NAMES = [
@@ -94,7 +86,7 @@ def binary_regex_match(prog):
 def flatten_list(list_to_flatten):
-    """Simple list comprehesion to flatten list.
+    """Simple list comprehension to flatten list.
    >>> flatten_list([[1, 2], [3, 4, 5]])
    [1, 2, 3, 4, 5]
@@ -128,7 +120,7 @@ def contains_sender_names(sender):
    names = names or sender
    if names != '':
        return binary_regex_search(re.compile(names))
-    return lambda s: False
+    return lambda s: 0
 def extract_names(sender):
@@ -142,7 +134,7 @@ def extract_names(sender):
    >>> extract_names('')
    []
    """
-    sender = to_unicode(sender)
+    sender = to_unicode(sender, precise=True)
    # Remove non-alphabetical characters
    sender = "".join([char if char.isalpha() else ' ' for char in sender])
    # Remove too short words and words from "black" list i.e.
@@ -155,7 +147,7 @@ def extract_names(sender):
 def categories_percent(s, categories):
-    '''Returns category characters persent.
+    '''Returns category characters percent.
    >>> categories_percent("qqq ggg hhh", ["Po"])
    0.0
@@ -169,7 +161,7 @@ def categories_percent(s, categories):
    50.0
    '''
    count = 0
-    s = to_unicode(s)
+    s = to_unicode(s, precise=True)
    for c in s:
        if unicodedata.category(c) in categories:
            count += 1
@@ -177,7 +169,7 @@ def categories_percent(s, categories):
 def punctuation_percent(s):
-    '''Returns punctuation persent.
+    '''Returns punctuation percent.
    >>> punctuation_percent("qqq ggg hhh")
    0.0
@@ -189,7 +181,7 @@ def punctuation_percent(s):
 def capitalized_words_percent(s):
    '''Returns capitalized words percent.'''
-    s = to_unicode(s)
+    s = to_unicode(s, precise=True)
    words = re.split('\s', s)
    words = [w for w in words if w.strip()]
    capitalized_words_counter = 0
--- a/talon/utils.py
+++ b/talon/utils.py
@@ -2,13 +2,12 @@
 import logging
 from random import shuffle
 import chardet
 import cchardet
 from talon.constants import RE_DELIMITER
 log = logging.getLogger(__name__)
 def safe_format(format_string, *args, **kwargs):
    """
    Helper: formats string with any combination of bytestrings/unicode
@@ -42,12 +41,44 @@ def to_unicode(str_or_unicode, precise=False):
        u'привет'
    If `precise` flag is True, tries to guess the correct encoding first.
    """
-    encoding = detect_encoding(str_or_unicode) if precise else 'utf-8'
+    encoding = quick_detect_encoding(str_or_unicode) if precise else 'utf-8'
    if isinstance(str_or_unicode, str):
        return unicode(str_or_unicode, encoding, 'replace')
    return str_or_unicode
 def detect_encoding(string):
    """
    Tries to detect the encoding of the passed string.
    Defaults to UTF-8.
    """
    try:
        detected = chardet.detect(string)
        if detected:
            return detected.get('encoding') or 'utf-8'
    except Exception:
        log.exception('chardet failed to detect the encoding')
    return 'utf-8'
 def quick_detect_encoding(string):
    """
    Tries to detect the encoding of the passed string.
    Uses cchardet. Falls back to detect_encoding.
    """
    try:
        detected = cchardet.detect(string)
        if detected:
            return detected.get('encoding') or detect_encoding(string)
    except Exception, e:
        pass
    return detect_encoding(string)
 def to_utf8(str_or_unicode):
    """
    Safely returns a UTF-8 version of a given string
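The `quick_detect_encoding` / `detect_encoding` pair above forms a fallback chain: try the fast detector, then the slower one, then default to UTF-8. A minimal Python 3 sketch of that idea follows. The helper names `detect_with_fallback` and `to_text` and the two stand-in detectors are hypothetical and not part of talon; real code would pass `cchardet.detect` and `chardet.detect` as the detectors.

```python
def detect_with_fallback(data, detectors, default='utf-8'):
    """Return the first encoding name reported by any detector.

    `detectors` is an ordered sequence of callables shaped like
    `chardet.detect`: each takes bytes and returns a dict that may hold
    an 'encoding' key. A detector that raises or reports nothing is
    skipped. (Hypothetical helper, not talon's API.)
    """
    for detect in detectors:
        try:
            result = detect(data) or {}
        except Exception:
            continue  # a broken detector must not break decoding
        encoding = result.get('encoding')
        if encoding:
            return encoding
    return default


def to_text(data, detectors=()):
    """Decode bytes using the detected encoding; pass str through."""
    if isinstance(data, bytes):
        return data.decode(detect_with_fallback(data, detectors), 'replace')
    return data


# Stand-in detectors, used only to exercise the chain.
def failing_detector(data):
    raise RuntimeError('detector unavailable')


def latin1_detector(data):
    return {'encoding': 'latin-1'}


if __name__ == '__main__':
    print(to_text(b'Versi\xf3n', [failing_detector, latin1_detector]))
```

Keeping the detectors as plain callables makes each step of the chain easy to replace with a mock, which is roughly what `test_quick_detect_encoding_edge_cases` does with `patch.object`.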
--- a/tests/fixtures/html_replies/hotmail.html
+++ b/tests/fixtures/html_replies/hotmail.html
@@ -1,3 +1,4 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <html>
 <head>
 <style><!--
--- a/tests/fixtures/standard_replies/apple_mail_2.eml
+++ b/tests/fixtures/standard_replies/apple_mail_2.eml
@@ -0,0 +1,19 @@
 Content-Type: text/plain;
 	charset=us-ascii
 Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\))
 Subject: Re: Hello there
 X-Universally-Unique-Identifier: 85B1075D-5841-46A9-8565-FCB287A93AC4
 From: Adam Renberg <adam@tictail.com>
 In-Reply-To: <CABzQGhkMXDxUt_tSVQcg=43aniUhtsVfCZVzu-PG0kwS_uzqMw@mail.gmail.com>
 Date: Sat, 22 Aug 2015 19:22:20 +0200
 Content-Transfer-Encoding: 7bit
 X-Smtp-Server: smtp.gmail.com:adam@tictail.com
 Message-Id: <68001B29-8EA4-444C-A894-0537D2CA5208@tictail.com>
 References: <CABzQGhkMXDxUt_tSVQcg=43aniUhtsVfCZVzu-PG0kwS_uzqMw@mail.gmail.com>
 To: Adam Renberg <tgwizard@gmail.com>
 Hello
 > On 22 Aug 2015, at 19:21, Adam Renberg <tgwizard@gmail.com> wrote:
 >
 > Hi there!
--- a/tests/fixtures/standard_replies/iphone.eml
+++ b/tests/fixtures/standard_replies/iphone.eml
@@ -9,11 +9,11 @@ To: bob <bob@example.com>
 Content-Transfer-Encoding: quoted-printable
 Mime-Version: 1.0 (1.0)
-hello
+Hello
 Sent from my iPhone
 On Apr 3, 2012, at 4:19 PM, bob <bob@example.com> wr=
 ote:
-> Hi
+> Hi
--- a/tests/fixtures/standard_replies/iphone_reply_text
+++ b/tests/fixtures/standard_replies/iphone_reply_text
@@ -0,0 +1,3 @@
 Hello
 Sent from my iPhone
--- a/tests/html_quotations_test.py
+++ b/tests/html_quotations_test.py
@@ -4,7 +4,6 @@ from . import *
 from . fixtures import *
 import regex as re
 from flanker import mime
 from talon import quotations
@@ -29,8 +28,8 @@ def test_quotation_splitter_inside_blockquote():
 </blockquote>"""
-    eq_("<html><body><p>Reply</p></body></html>",
-        RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
+    eq_("<html><body><p>Reply\n</p></body></html>",
+        quotations.extract_from_html(msg_body))
 def test_quotation_splitter_outside_blockquote():
@@ -50,6 +49,24 @@ def test_quotation_splitter_outside_blockquote():
        RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
 def test_regular_blockquote():
    msg_body = """Reply
 <blockquote>Regular</blockquote>
 <div>
  On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
 </div>
 <blockquote>
  <div>
    <blockquote>Nested</blockquote>
  </div>
 </blockquote>
 """
    eq_("<html><body><p>Reply</p><blockquote>Regular</blockquote><div></div></body></html>",
        RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
 def test_no_blockquote():
    msg_body = """
 <html>
@@ -224,10 +241,7 @@ def test_reply_shares_div_with_from_block():
 def test_reply_quotations_share_block():
-    msg = mime.from_string(REPLY_QUOTATIONS_SHARE_BLOCK)
-    html_part = list(msg.walk())[1]
-    assert html_part.content_type == 'text/html'
-    stripped_html = quotations.extract_from_html(html_part.body)
+    stripped_html = quotations.extract_from_plain(REPLY_QUOTATIONS_SHARE_BLOCK)
    ok_(stripped_html)
    ok_('From' not in stripped_html)
@@ -250,7 +264,7 @@ RE_REPLY = re.compile(r"^Hi\. I am fine\.\s*\n\s*Thanks,\s*\n\s*Alex\s*$")
 def extract_reply_and_check(filename):
    f = open(filename)
-    msg_body = f.read().decode("utf-8")
+    msg_body = f.read()
    reply = quotations.extract_from_html(msg_body)
    h = html2text.HTML2Text()
@@ -296,3 +310,25 @@ def test_windows_mail_reply():
 def test_yandex_ru_reply():
    extract_reply_and_check("tests/fixtures/html_replies/yandex_ru.html")
 def test_CRLF():
    """CR is not converted to '&#13;'
    """
    eq_('<html>\r\n</html>', quotations.extract_from_html('<html>\r\n</html>'))
    msg_body = """Reply
 <blockquote>
  <div>
    On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
  </div>
  <div>
    Test
  </div>
 </blockquote>"""
    msg_body = msg_body.replace('\n', '\r\n')
    eq_("<html><body><p>Reply\r\n</p></body></html>",
        quotations.extract_from_html(msg_body))
--- a/tests/quotations_test.py
+++ b/tests/quotations_test.py
@@ -3,8 +3,6 @@
 from . import *
 from . fixtures import *
 from flanker import mime
 from talon import quotations
@@ -31,3 +29,15 @@ def test_crash_inside_extract_from():
 def test_empty_body():
    eq_('', quotations.extract_from_plain(''))
 def test__CRLF_to_LF():
    eq_(('\n\r', True), quotations._CRLF_to_LF('\r\n\r'))
    eq_(('\n\r', False), quotations._CRLF_to_LF('\n\r'))
 def test__restore_CRLF():
    eq_('\n', quotations._restore_CRLF('\n', replaced=False))
    eq_('\r\n', quotations._restore_CRLF('\n', replaced=True))    
    # default
    eq_('\r\n', quotations._restore_CRLF('\n'))
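The two tests above pin down a round trip: CRLF line endings are normalized to LF before processing, and restored afterwards only if a replacement actually happened. A small Python 3 sketch that satisfies the same expectations is shown here. The function names `crlf_to_lf` and `restore_crlf` are hypothetical stand-ins, and the bodies are an assumption about one way to meet the tests, not talon's actual `_CRLF_to_LF` / `_restore_CRLF` code.

```python
def crlf_to_lf(text):
    """Replace every CRLF with LF.

    Returns the converted text and a flag telling whether any CRLF
    was present, so the caller can undo the conversion later.
    """
    replaced = '\r\n' in text
    return text.replace('\r\n', '\n'), replaced


def restore_crlf(text, replaced=True):
    """Turn LF back into CRLF if the earlier conversion replaced anything."""
    if replaced:
        return text.replace('\n', '\r\n')
    return text


if __name__ == '__main__':
    body, flag = crlf_to_lf('Reply\r\n> quoted\r\n')
    # ... quotation extraction would run on `body` here ...
    print(repr(restore_crlf(body, flag)))
```

A lone `\r` is left untouched in both directions, which is why the first test expects `'\n\r'` for the input `'\r\n\r'`.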
--- a/tests/signature/bruteforce_test.py
+++ b/tests/signature/bruteforce_test.py
@@ -2,10 +2,6 @@
 from .. import *
 import os
 from flanker import mime
 from talon.signature import bruteforce
--- a/tests/signature/extraction_test.py
+++ b/tests/signature/extraction_test.py
@@ -4,8 +4,6 @@ from .. import *
 import os
 from PyML import SparseDataSet
 from talon.signature.learning import dataset
 from talon import signature
 from talon.signature import extraction as e
--- a/tests/signature/learning/dataset_test.py
+++ b/tests/signature/learning/dataset_test.py
@@ -3,9 +3,8 @@
 from ... import *
 import os
-from PyML import SparseDataSet
+from numpy import genfromtxt
 from talon.utils import to_unicode
 from talon.signature.learning import dataset as d
 from talon.signature.learning.featurespace import features
@@ -42,10 +41,13 @@ def test_build_extraction_dataset():
    d.build_extraction_dataset(os.path.join(EMAILS_DIR, 'P'),
                               os.path.join(TMP_DIR,
                                            'extraction.data'), 1)
-    test_data = SparseDataSet(os.path.join(TMP_DIR, 'extraction.data'),
-                              labelsColumn=-1)
+
+    filename = os.path.join(TMP_DIR, 'extraction.data')
+    file_data = genfromtxt(filename, delimiter=",")
+    test_data = file_data[:, :-1]
    # the result is a loadable signature extraction dataset
    # 32 comes from 3 emails in emails/P folder, 11 lines checked to be
    # a signature, one email has only 10 lines
-    eq_(test_data.size(), 32)
+    eq_(test_data.shape[0], 32)
-    eq_(len(features('')), test_data.numFeatures)
+    eq_(len(features('')), test_data.shape[1])
--- a/tests/signature/learning/featurespace_test.py
+++ b/tests/signature/learning/featurespace_test.py
@@ -6,7 +6,9 @@ from talon.signature.learning import featurespace as fs
 def test_apply_features():
-    s = '''John Doe
+    s = '''This is John Doe
 Tuesday @3pm suits. I'll chat to you then.
 VP Research and Development, Xxxx Xxxx Xxxxx
@@ -19,11 +21,12 @@ john@example.com'''
    # note that we don't consider the first line because signatures don't
    # usually take all the text, empty lines are not considered
    eq_(result, [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
                 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
                 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
-    with patch.object(fs, 'SIGNATURE_MAX_LINES', 4):
+    with patch.object(fs, 'SIGNATURE_MAX_LINES', 5):
        features = fs.features(sender)
        new_result = fs.apply_features(s, features)
        # result remains the same because we don't consider empty lines
--- a/tests/signature/learning/helpers_test.py
+++ b/tests/signature/learning/helpers_test.py
@@ -43,7 +43,7 @@ VALID_PHONE_NUMBERS = [e.strip() for e in VALID.splitlines() if e.strip()]
 def test_match_phone_numbers():
    for phone in VALID_PHONE_NUMBERS:
-        ok_(RE_RELAX_PHONE.match(phone), "{} should be matched".format(phone))
+        ok_(RE_RELAX_PHONE.search(phone), "{} should be matched".format(phone))
 def test_match_names():
@@ -52,29 +52,6 @@ def test_match_names():
        ok_(RE_NAME.match(name), "{} should be matched".format(name))
 def test_sender_with_name():
    ok_lines = ['Sergey Obukhov <serobnic@example.com>',
                '\tSergey  <serobnic@example.com>',
                ('"Doe, John (TX)"'
                 '<DowJ@example.com>@EXAMPLE'
                 '<IMCEANOTES-+22Doe+2C+20John+20'
                 '+28TX+29+22+20+3CDoeJ+40example+2Ecom+3E'
                 '+40EXAMPLE@EXAMPLE.com>'),
                ('Company Sleuth <csleuth@email.xxx.com>'
                 '@EXAMPLE <XXX-Company+20Sleuth+20+3Ccsleuth'
                 '+40email+2Exxx+2Ecom+3E+40EXAMPLE@EXAMPLE.com>'),
                ('Doe III, John '
                 '</O=EXAMPLE/OU=NA/CN=RECIPIENTS/CN=jDOE5>')]
    for line in ok_lines:
        ok_(RE_SENDER_WITH_NAME.match(line),
            '{} should be matched'.format(line))
    nok_lines = ['', '<serobnic@xxx.ru>', 'Sergey serobnic@xxx.ru']
    for line in nok_lines:
        assert_false(RE_SENDER_WITH_NAME.match(line),
                     '{} should not be matched'.format(line))
 # Now test helpers functions
 def test_binary_regex_search():
    eq_(1, h.binary_regex_search(re.compile("12"))("12"))
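The change from `RE_RELAX_PHONE.match` to `RE_RELAX_PHONE.search` in `test_match_phone_numbers` matters whenever the number is not at the very start of the line: `match` only tries position 0, while `search` scans the whole string. The sketch below illustrates the difference with a hypothetical pattern `PHONE`; it is a simplified stand-in and not talon's `RE_RELAX_PHONE`.

```python
import re

# Hypothetical, simplified phone pattern used only for illustration.
PHONE = re.compile(r'\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}')

line = 'Tel: (555) 123-4567'

# match() anchors at the start of the string, so the "Tel: " prefix
# prevents a hit.
anchored = PHONE.match(line)

# search() scans forward and finds the number after the prefix.
found = PHONE.search(line)

if __name__ == '__main__':
    print(anchored)        # None
    print(found.group(0))  # (555) 123-4567
```

Signature lines often carry a label such as "Tel:" or "Phone:" before the number, which is the kind of input where `search` is the safer choice.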
--- a/tests/text_quotations_test.py
+++ b/tests/text_quotations_test.py
@@ -5,19 +5,18 @@ from . fixtures import *
 import os
-from flanker import mime
+import email.iterators
 from talon import quotations
@patch.object(quotations, 'MAX_LINES_COUNT', 1)
 def test_too_many_lines():
    msg_body = """Test reply
-
+Hi
 -----Original Message-----
 Test"""
-    eq_(msg_body, quotations.extract_from_plain(msg_body))
+    eq_("Test reply", quotations.extract_from_plain(msg_body))
 def test_pattern_on_date_somebody_wrote():
@@ -33,6 +32,16 @@ On 11-Apr-2011, at 6:54 PM, Roman Tkachenko <romant@example.com> wrote:
    eq_("Test reply", quotations.extract_from_plain(msg_body))
 def test_pattern_on_date_wrote_somebody():
    eq_('Lorem', quotations.extract_from_plain(
    """Lorem
 Op 13-02-2014 3:18 schreef Julius Caesar <pantheon@rome.com>:
 Veniam laborum mlkshk kale chips authentic. Normcore mumblecore laboris, fanny pack readymade eu blog chia pop-up freegan enim master cleanse.
 """))
 def test_pattern_on_date_somebody_wrote_date_with_slashes():
    msg_body = """Test reply
@@ -98,22 +107,24 @@ bla-bla - bla"""
    eq_(reply, quotations.extract_from_plain(msg_body))
-def test_pattern_original_message():
-    msg_body = """Test reply
-
------Original Message-----
-
-Test"""
-
-    eq_("Test reply", quotations.extract_from_plain(msg_body))
-
-    msg_body = """Test reply
-
- -----Original Message-----
-
-Test"""
-
-    eq_("Test reply", quotations.extract_from_plain(msg_body))
+def _check_pattern_original_message(original_message_indicator):
+    msg_body = u"""Test reply
+
+-----{}-----
+
+Test"""
+    eq_('Test reply', quotations.extract_from_plain(msg_body.format(unicode(original_message_indicator))))
+
+def test_english_original_message():
+    _check_pattern_original_message('Original Message')
+    _check_pattern_original_message('Reply Message')
+
+def test_german_original_message():
+    _check_pattern_original_message(u'Ursprüngliche Nachricht')
+    _check_pattern_original_message('Antwort Nachricht')
+
+def test_danish_original_message():
+    _check_pattern_original_message('Oprindelig meddelelse')
 def test_reply_after_quotations():
@@ -199,6 +210,33 @@ On 04/19/2011 07:10 AM, Roman Tkachenko wrote:
 > Hello"""
    eq_("Hi", quotations.extract_from_plain(msg_body))
 def test_with_indent():
    msg_body = """YOLO salvia cillum kogi typewriter mumblecore cardigan skateboard Austin.
 ------On 12/29/1987 17:32 PM, Julius Caesar wrote-----
 Brunch mumblecore pug Marfa tofu, irure taxidermy hoodie readymade pariatur. 
    """
    eq_("YOLO salvia cillum kogi typewriter mumblecore cardigan skateboard Austin.", quotations.extract_from_plain(msg_body))
 def test_short_quotation_with_newline():
    msg_body = """Btw blah blah...
 On Tue, Jan 27, 2015 at 12:42 PM -0800, "Company" <christine.XXX@XXX.com> wrote:
 Hi Mark,
 Blah blah? 
 Thanks,Christine 
 On Jan 27, 2015, at 11:55 AM, Mark XXX <mark@XXX.com> wrote:
 Lorem ipsum?
 Mark
 Sent from Acompli"""
    eq_("Btw blah blah...", quotations.extract_from_plain(msg_body))
 def test_pattern_date_email_with_unicode():
    msg_body = """Replying ok
@@ -208,8 +246,8 @@ def test_pattern_date_email_with_unicode():
    eq_("Replying ok", quotations.extract_from_plain(msg_body))
-def test_pattern_from_block():
+def test_english_from_block():
-    msg_body = """Allo! Follow up MIME!
+    eq_('Allo! Follow up MIME!', quotations.extract_from_plain("""Allo! Follow up MIME!
 From: somebody@example.com
 Sent: March-19-11 5:42 PM
@@ -217,8 +255,97 @@ To: Somebody
 Subject: The manager has commented on your Loop
 Blah-blah-blah
-"""
+"""))
-    eq_("Allo! Follow up MIME!", quotations.extract_from_plain(msg_body))
+
 def test_german_from_block():
    eq_('Allo! Follow up MIME!', quotations.extract_from_plain(
    """Allo! Follow up MIME!
 Von: somebody@example.com
 Gesendet: Dienstag, 25. November 2014 14:59
 An: Somebody
 Betreff: The manager has commented on your Loop
 Blah-blah-blah
 """))
 def test_french_multiline_from_block():
    eq_('Lorem ipsum', quotations.extract_from_plain(
    u"""Lorem ipsum
 De : Brendan xxx [mailto:brendan.xxx@xxx.com]
 Envoyé : vendredi 23 janvier 2015 16:39
 À : Camille XXX
 Objet : Follow Up
 Blah-blah-blah
 """))
 def test_french_from_block():
    eq_('Lorem ipsum', quotations.extract_from_plain(
    u"""Lorem ipsum
 Le 23 janv. 2015 à 22:03, Brendan xxx <brendan.xxx@xxx.com<mailto:brendan.xxx@xxx.com>> a écrit:
 Bonjour!"""))
 def test_polish_from_block():
    eq_('Lorem ipsum', quotations.extract_from_plain(
    u"""Lorem ipsum
 W dniu 28 stycznia 2015 01:53 użytkownik Zoe xxx <zoe.xxx@xxx.com>
 napisał:
 Blah!
 """))
 def test_danish_from_block():
    eq_('Allo! Follow up MIME!', quotations.extract_from_plain(
    """Allo! Follow up MIME!
 Fra: somebody@example.com
 Sendt: 19. march 2011 12:10
 Til: Somebody
 Emne: The manager has commented on your Loop
 Blah-blah-blah
 """))
 def test_swedish_from_block():
    eq_('Allo! Follow up MIME!', quotations.extract_from_plain(
    u"""Allo! Follow up MIME!
 Från: Anno Sportel [mailto:anno.spoel@hsbcssad.com]
 Skickat: den 26 augusti 2015 14:45
 Till: Isacson Leiff
 Ämne: RE: Week 36
 Blah-blah-blah
 """))
 def test_swedish_from_line():
    eq_('Lorem', quotations.extract_from_plain(
    """Lorem
 Den 14 september, 2015 02:23:18, Valentino Rudy (valentino@rudy.be) skrev:
 Veniam laborum mlkshk kale chips authentic. Normcore mumblecore laboris, fanny pack readymade eu blog chia pop-up freegan enim master cleanse.
 """))
 def test_norwegian_from_line():
    eq_('Lorem', quotations.extract_from_plain(
    u"""Lorem
 På 14 september 2015 på 02:23:18, Valentino Rudy (valentino@rudy.be) skrev:
 Veniam laborum mlkshk kale chips authentic. Normcore mumblecore laboris, fanny pack readymade eu blog chia pop-up freegan enim master cleanse.
 """))
 def test_dutch_from_block():
    eq_('Gluten-free culpa lo-fi et nesciunt nostrud.', quotations.extract_from_plain(
    """Gluten-free culpa lo-fi et nesciunt nostrud. 
 Op 17-feb.-2015, om 13:18 heeft Julius Caesar <pantheon@rome.com> het volgende geschreven:
 Small batch beard laboris tempor, non listicle hella Tumblr heirloom. 
 """))
 def test_quotation_marker_false_positive():
@@ -513,22 +640,21 @@ def test_preprocess_postprocess_2_links():
 def test_standard_replies():
    for filename in os.listdir(STANDARD_REPLIES):
        filename = os.path.join(STANDARD_REPLIES, filename)
-        if os.path.isdir(filename):
+        if not filename.endswith('.eml') or os.path.isdir(filename):
            continue
        with open(filename) as f:
-            msg = f.read()
-            m = mime.from_string(msg)
-            for part in m.walk():
-                if part.content_type == 'text/plain':
-                    text = part.body
-                    stripped_text = quotations.extract_from_plain(text)
-                    reply_text_fn = filename[:-4] + '_reply_text'
-                    if os.path.isfile(reply_text_fn):
-                        with open(reply_text_fn) as f:
-                            reply_text = f.read()
-                    else:
-                        reply_text = 'Hello'
-                    eq_(reply_text, stripped_text,
-                        "'%(reply)s' != %(stripped)s for %(fn)s" %
-                        {'reply': reply_text, 'stripped': stripped_text,
-                         'fn': filename})
+            message = email.message_from_file(f)
+            body = email.iterators.typed_subpart_iterator(message, subtype='plain').next()
+            text = ''.join(email.iterators.body_line_iterator(body, True))
+
+            stripped_text = quotations.extract_from_plain(text)
+            reply_text_fn = filename[:-4] + '_reply_text'
+            if os.path.isfile(reply_text_fn):
+                with open(reply_text_fn) as f:
+                    reply_text = f.read().strip()
+            else:
+                reply_text = 'Hello'
+            yield eq_, reply_text, stripped_text, \
+                "'%(reply)s' != %(stripped)s for %(fn)s" % \
+                {'reply': reply_text, 'stripped': stripped_text,
+                 'fn': filename}
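`test_standard_replies` now reads each `.eml` fixture with the standard-library `email` package instead of flanker. The diff itself is Python 2 (`.next()`, `body_line_iterator(body, True)`); the sketch below shows a comparable extraction in Python 3, where it is simpler to decode the payload directly. The function name `first_plain_text_body` is hypothetical, and this is one possible approach rather than the project's code.

```python
import email
from email.iterators import typed_subpart_iterator


def first_plain_text_body(raw_message):
    """Return the decoded body of the first text/plain part, or None."""
    message = email.message_from_string(raw_message)
    part = next(typed_subpart_iterator(message, 'text', 'plain'), None)
    if part is None:
        return None
    # decode=True undoes the Content-Transfer-Encoding
    # (quoted-printable, base64) and returns bytes.
    payload = part.get_payload(decode=True)
    charset = part.get_content_charset() or 'utf-8'
    return payload.decode(charset, 'replace')


if __name__ == '__main__':
    raw = (
        'Content-Type: text/plain; charset=us-ascii\n'
        'Content-Transfer-Encoding: quoted-printable\n'
        '\n'
        'Hello\n'
        '\n'
        'On Apr 3, 2012, at 4:19 PM, bob <bob@example.com> wr=\n'
        'ote:\n'
        '> Hi\n'
    )
    print(first_plain_text_body(raw))
```

Decoding the transfer encoding first matters for fixtures like `iphone.eml`, where the quoted-printable soft line break splits "wrote:" across two lines and would otherwise hide the quotation marker from the splitter.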
--- a/tests/utils_test.py
+++ b/tests/utils_test.py
@@ -1,9 +1,60 @@
 # coding:utf-8
 from . import *
-from talon import utils
+from talon import utils as u
 import cchardet
 def test_get_delimiter():
-    eq_('\r\n', utils.get_delimiter('abc\r\n123'))
+    eq_('\r\n', u.get_delimiter('abc\r\n123'))
-    eq_('\n', utils.get_delimiter('abc\n123'))
+    eq_('\n', u.get_delimiter('abc\n123'))
-    eq_('\n', utils.get_delimiter('abc'))
+    eq_('\n', u.get_delimiter('abc'))
 def test_unicode():
    eq_ (u'hi', u.to_unicode('hi'))
    eq_ (type(u.to_unicode('hi')), unicode )
    eq_ (type(u.to_unicode(u'hi')), unicode )
    eq_ (type(u.to_unicode('привет')), unicode )
    eq_ (type(u.to_unicode(u'привет')), unicode )
    eq_ (u"привет", u.to_unicode('привет'))
    eq_ (u"привет", u.to_unicode(u'привет'))
    # some latin1 stuff
    eq_ (u"Versión", u.to_unicode('Versi\xf3n', precise=True))
 def test_detect_encoding():
    eq_ ('ascii', u.detect_encoding('qwe').lower())
    eq_ ('iso-8859-2', u.detect_encoding('Versi\xf3n').lower())
    eq_ ('utf-8', u.detect_encoding('привет').lower())
    # fallback to utf-8
    with patch.object(u.chardet, 'detect') as detect:
        detect.side_effect = Exception
        eq_ ('utf-8', u.detect_encoding('qwe').lower())
 def test_quick_detect_encoding():
    eq_ ('ascii', u.quick_detect_encoding('qwe').lower())
    eq_ ('windows-1252', u.quick_detect_encoding('Versi\xf3n').lower())
    eq_ ('utf-8', u.quick_detect_encoding('привет').lower())
@patch.object(cchardet, 'detect')
@patch.object(u, 'detect_encoding')
 def test_quick_detect_encoding_edge_cases(detect_encoding, cchardet_detect):
    cchardet_detect.return_value = {'encoding': 'ascii'}
    eq_('ascii', u.quick_detect_encoding("qwe"))
    cchardet_detect.assert_called_once_with("qwe")
    # fallback to detect_encoding
    cchardet_detect.return_value = {}
    detect_encoding.return_value = 'utf-8'
    eq_('utf-8', u.quick_detect_encoding("qwe"))
    # exception
    detect_encoding.reset_mock()
    cchardet_detect.side_effect = Exception()
    detect_encoding.return_value = 'utf-8'
    eq_('utf-8', u.quick_detect_encoding("qwe"))
    ok_(detect_encoding.called)
--- a/train.py
+++ b/train.py
@@ -0,0 +1,10 @@
 from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
 from talon.signature.learning.classifier import train, init
 def train_model():
    """ retrain model and persist """
    train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)
 if __name__ == "__main__":
    train_model()
Author	SHA1	Message	Date
Sergey Obukhov	2c416ecc0e	Merge pull request #62 from tgwizard/better-support-for-scandinavian-languages Add better support for Scandinavian languages	2015-10-14 21:48:10 -07:00
Sergey Obukhov	3ab33c557b	Merge pull request #65 from mailgun/sergey/cssselect add cssselect to dependencies	2015-10-14 20:34:02 -07:00
Sergey Obukhov	8db05f4950	add cssselect to dependencies	2015-10-14 20:31:26 -07:00
Sergey Obukhov	3d5bc82a03	Merge pull request #61 from tgwizard/fix-for-apple-mail Add fix for Apple Mail email format	2015-10-14 12:38:06 -07:00
Adam Renberg	14e3a0d80b	Add better support for Scandinavian languages This is a port of https://github.com/tictail/claw/pull/6 by @simonflore.	2015-09-21 21:42:01 +02:00
Adam Renberg	fcd9e2716a	Add fix for Apple Mail email format Where they have an initial > on the "date line".	2015-09-21 21:33:57 +02:00
Sergey Obukhov	d62d633215	bump up version	2015-09-21 09:55:51 -07:00
Sergey Obukhov	3b0c9273c1	Merge pull request #60 from mailgun/sergey/26 fixes mailgun/talon#26	2015-09-21 09:54:35 -07:00
Sergey Obukhov	e4c1c11845	remove print	2015-09-21 09:52:47 -07:00
Sergey Obukhov	ae508fe0e5	fixes mailgun/talon#26	2015-09-21 09:51:26 -07:00
Sergey Obukhov	2cb9b5399c	bump up version	2015-09-18 05:23:29 -07:00
Sergey Obukhov	134c47f515	Merge pull request #59 from mailgun/sergey/43 fixes mailgun/talon#43	2015-09-18 05:20:51 -07:00
Sergey Obukhov	d328c9d128	fixes mailgun/talon#43	2015-09-18 05:19:59 -07:00
Sergey Obukhov	77b62b0fef	Merge pull request #58 from mailgun/sergey/52 fixes mailgun/talon#52	2015-09-18 04:48:50 -07:00
Sergey Obukhov	ad09b18f3f	fixes mailgun/talon#52	2015-09-18 04:47:23 -07:00
Sergey Obukhov	b5af9c03a5	bump up version	2015-09-11 10:42:26 -07:00
Sergey Obukhov	176c7e7532	Merge pull request #57 from mailgun/sergey/to_unicode use precise encoding when converting to unicode	2015-09-11 10:40:52 -07:00
Sergey Obukhov	15976888a0	use precise encoding when converting to unicode	2015-09-11 10:38:28 -07:00
Sergey Obukhov	9bee502903	bump up version	2015-09-11 06:27:12 -07:00
Sergey Obukhov	e3cb8dc3e6	Merge pull request #56 from mailgun/sergey/1000+German+NL process first 1000 lines for long messages, support for German and Dutch	2015-09-11 06:20:34 -07:00
Sergey Obukhov	385285e5de	process first 1000 lines for long messages, support for German and Dutch	2015-09-11 06:17:14 -07:00
Sergey Obukhov	127771dac9	bump up version	2015-09-11 04:51:39 -07:00
Sergey Obukhov	cc98befba5	Merge pull request #50 from Easy-D/preserve-regular-blockquotes Preserve regular blockquotes	2015-09-11 04:49:36 -07:00
Sergey Obukhov	567549cba4	bump up talon version	2015-09-10 10:47:16 -07:00
Sergey Obukhov	76c4f49be8	Merge pull request #55 from mailgun/sergey/lxml unpin lxml version	2015-09-10 10:44:59 -07:00
Sergey Obukhov	d9d89dc250	unpin lxml version	2015-09-10 10:44:05 -07:00
Sergey Obukhov	9358db6cee	bump up talon version	2015-09-03 11:03:01 -07:00
Sergey Obukhov	08c9d7db03	Merge pull request #45 from AlexRiina/master Replace PyML with sklearn and clean up dependencies	2015-09-03 10:56:18 -07:00
Easy-D	390b0a6dc9	preserve regular blockquotes	2015-07-16 21:31:41 +02:00
Easy-D	ed6b861a47	add failing test that shows how regular blockquotes are removed	2015-07-16 21:24:49 +02:00
Alex Riina	85c7ee980c	add script to regenerate ml model	2015-07-02 21:49:09 -04:00
Oliver Song	7ea773e6a9	Fix iphone test	2015-07-02 21:49:09 -04:00
Scott MacVicar	e3c4ff38fe	move test stuff out to its own section	2015-07-02 21:49:09 -04:00
Scott MacVicar	8b1f87b1c0	Get this building and passing tests Changes: * add .DS_Store to .gitignore * Decode base64 encoded emails for tests * Pick a version of scikit since the pickled clasifiers are based on that * Add missing numpy and scipy dependencies	2015-07-02 21:49:09 -04:00
Alex Riina	c5e4cd9ab4	dont be too restrictive on the test library version	2015-07-02 21:49:09 -04:00
Alex Riina	215e36e9ed	allow higher version of regex library	2015-07-02 21:49:09 -04:00
Alex Riina	e3ef622031	remove unused regex	2015-07-02 21:49:09 -04:00
Alex Riina	f16760c466	Remove flanker and replace PyML with scikit-learn I never was actually able to successfully install PyML but the source-forge distribution and lack of python3 support convinced me that scikit-learn would be a fine substitute. Flanker was also difficult for me to install and seemed only to be used in the tests, so I removed it as well to get into a position where I could run the tests. As of this commit, only one is not passing (test_standard_replies with android.eml) though I'm not familiar with the `email` library yet.	2015-07-02 21:49:09 -04:00
Alex Riina	b36287e573	clean up style and extra imports	2015-07-02 21:49:09 -04:00
Alex Riina	4df7aa284b	remove extra imports	2015-07-02 21:49:09 -04:00
Jeremy Schlatter	3a37d8b649	Merge pull request #41 from simonflore/master New splitter pattern for Dutch mail replies	2015-04-22 12:17:39 -07:00
Simon	f9f428f4c3	Revert "Change of behavior when msg_body has more then 1000 lines" This reverts commit `84a83e865e`.	2015-04-16 13:26:17 +02:00
Simon	84a83e865e	Change of behavior when msg_body has more then 1000 lines	2015-04-16 13:22:18 +02:00
Simon	b4c180b9ff	Extra spaces check in RE_ON_DATE_WROTE_SMB reggae	2015-04-15 13:55:59 +02:00
Simon	072a440837	Test cases for new patterns	2015-04-15 13:55:17 +02:00
Simon	105d16644d	For patterns like this '---- On {date} {name} {mail} wrote ---- '	2015-04-14 18:52:45 +02:00
Simon	df3338192a	Another submission to a dutch variation	2015-04-14 18:49:26 +02:00
Simon	f0ed5d6c07	New splitter pattern for Dutch mail replies	2015-04-14 18:22:48 +02:00
Sergey Obukhov	790463821f	Merge pull request #31 from tsheasha/patch-1 Utilising the Constants	2015-03-02 14:48:41 -08:00
Sergey Obukhov	763d3b308e	Merge pull request #35 from futuresimple/more_formats Support some polish and french formats	2015-03-02 14:25:26 -08:00
szymonsobczak	3c9ef4653f	some more french fromats	2015-02-24 12:18:54 +01:00
szymonsobczak	b16060261a	support some polish and french formats	2015-02-24 11:39:12 +01:00
Tarek Sheasha	13dc43e960	Utilising the Constants Checking for the length of a line to determine if it is possibly a signature or not could be done in a more generic way by determining the maximum size of the line via a constant. Hence advocating the spirit of the modifying the code in only one place and propagating that change everywhere. This exact approach has already been used at:	2015-01-21 15:54:57 +01:00
Jeremy Schlatter	3768d7ba31	make a separate test function for each language	2014-12-30 14:41:20 -08:00
Jeremy Schlatter	613d1fc815	Add extra splitter expressions and tests for German and Danish. Also some refactoring to make it a bit easier to add more languages.	2014-12-23 15:44:04 -08:00
Sergey Obukhov	52505bba8a	Update README.rst Clarified that some signature extraction methods require initializing the lib first.	2014-09-14 09:03:10 -07:00
Sergey Obukhov	79cd4fcc52	Merge pull request #15 from willemdelbare/master added extra splitter expressions for Dutch, French, German	2014-09-14 08:38:39 -07:00
Willem Delbare	a4f156b174	added extra splitter expressions for Dutch, French, German	2014-09-13 15:33:08 +02:00
Sergey Obukhov	1789ccf3c8	Merge branch 'master' of github.com:mailgun/talon	2014-07-24 20:37:47 -07:00
Sergey Obukhov	7a42ab3b28	fix #4 add flanker to setup.py	2014-07-24 20:37:33 -07:00
Sergey Obukhov	12b0e88a01	Merge pull request #5 from pborreli/typos Fixed typos	2014-07-24 20:32:57 -07:00
Pascal Borreli	8b78da5977	Fixed typos	2014-07-25 02:40:37 +00:00
Sergey Obukhov	b299feab1e	Merge branch 'master' of github.com:mailgun/talon	2014-07-24 15:43:11 -07:00
Sergey Obukhov	95182dcfc4	add train data, classifier and fixtures to MANIFEST.in	2014-07-24 15:42:50 -07:00
Ashish Gandhi	f9fe412fa4	Merge pull request #2 from ivuk/master Fix a typo in README.rst	2014-07-24 15:27:05 -07:00
Igor Vuk	00a8db2e3e	Fix a typo in README.rst	2014-07-25 00:18:13 +02:00
Sergey Obukhov	71ae26ccd1	added MANIFEST.in and bump up the version	2014-07-24 15:09:01 -07:00
Sergey Obukhov	b0851d5363	bump up the version	2014-07-24 14:59:32 -07:00
Sergey Obukhov	ac4f5201bb	Update README.rst	2014-07-24 07:22:25 -07:00
Sergey Obukhov	81e88d9222	use reStructuredText instead of Markdown	2014-07-24 06:56:10 -07:00