Compare commits
15 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 52505bba8a | |
| | 79cd4fcc52 | |
| | a4f156b174 | |
| | 1789ccf3c8 | |
| | 7a42ab3b28 | |
| | 12b0e88a01 | |
| | 8b78da5977 | |
| | b299feab1e | |
| | 95182dcfc4 | |
| | f9fe412fa4 | |
| | 00a8db2e3e | |
| | 71ae26ccd1 | |
| | b0851d5363 | |
| | ac4f5201bb | |
| | 81e88d9222 | |
MANIFEST.in (new file, 9 lines)

```diff
@@ -0,0 +1,9 @@
+recursive-include tests *
+recursive-include talon *
+recursive-exclude tests *.pyc *~
+recursive-exclude talon *.pyc *~
+include train.data
+include classifier
+include LICENSE
+include MANIFEST.in
+include README.rst
```
README.md (deleted file, 97 lines)

````diff
@@ -1,97 +0,0 @@
-talon
-=====
-
-Mailgun library to extract message quotations and signatures.
-
-If you ever tried to parse message quotations or signatures you know that absense of any formatting standards in this area
-could make this task a nightmare. Hopefully this library will make your life much eathier. The name of the project is
-inspired by TALON - multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in
-a number of hostile environments. That's what a good quotations and signature parser should be like :smile:
-
-Usage
------
-
-Here's how you initialize the library and extract a reply from a text message:
-
-```python
-import talon
-from talon import quotations
-
-talon.init()
-
-text = """Reply
-
------Original Message-----
-
-Quote"""
-
-reply = quotations.extract_from(text, 'text/plain')
-reply = quotations.extract_from_plain(text)
-# reply == "Reply"
-```
-
-To extract a reply from html:
-
-```python
-html = """Reply
-<blockquote>
-
-<div>
-On 11-Apr-2011, at 6:54 PM, Bob <bob@example.com> wrote:
-</div>
-
-<div>
-Quote
-</div>
-
-</blockquote>"""
-
-reply = quotations.extract_from(html, 'text/html')
-reply = quotations.extract_from_html(html)
-# reply == "<html><body><p>Reply</p></body></html>"
-```
-
-Often the best way is the easiest one. Here's how you can extract signature from email message without any
-machine learning fancy stuff:
-
-```python
-from talon.signature.bruteforce import extract_signature
-
-
-message = """Wow. Awesome!
---
-Bob Smith"""
-
-text, signature = extract_signature(message)
-# text == "Wow. Awesome!"
-# signature == "--\nBob Smith"
-```
-
-Quick and works like a charm 90% of the time. For other 10% you can use the power of machine learning algorithms:
-
-```python
-from talon import signature
-
-
-message = """Thanks Sasha, I can't go any higher and is why I limited it to the
-homepage.
-
-John Doe
-via mobile"""
-
-text, signature = signature.extract(message, sender='john.doe@example.com')
-# text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
-# signature == "John Doe\nvia mobile"
-```
-
-For machine learning talon currently uses [PyML](http://pyml.sourceforge.net/) library to build SVM classifiers. The core of machine learning algorithm lays in ``talon.signature.learning package``. It defines a set of features to apply to a message (``featurespace.py``), how data sets are built (``dataset.py``), classifier's interface (``classifier.py``).
-
-The data used for training is taken from our personal email conversations and from [ENRON](https://www.cs.cmu.edu/~enron/) dataset. As a result of applying our set of features to the dataset we provide files ``classifier`` and ``train.data`` that don't have any personal information but could be used to load trained classifier. Those files should be regenerated every time the feature/data set is changed.
-
-Research
---------
-
-The library is inspired by the following research papers and projects:
-
-* http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
-* http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
````
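The plain-text example in the README above can be illustrated without the library at all. The following is a minimal sketch of the idea only (talon itself matches a full `SPLITTER_PATTERNS` list, not a single regex; `naive_extract_from_plain` and `RE_ORIGINAL_MESSAGE` are hypothetical names introduced here):

```python
import re

# Hedged sketch: cut the message at the first "-----Original Message-----"
# splitter line, keeping only the reply above it.
RE_ORIGINAL_MESSAGE = re.compile(r'^\s*-+\s*Original Message\s*-+\s*$',
                                 re.I | re.M)

def naive_extract_from_plain(text):
    """Return the text above the first splitter line, or the whole text."""
    m = RE_ORIGINAL_MESSAGE.search(text)
    return text[:m.start()].rstrip() if m else text

text = """Reply

-----Original Message-----

Quote"""
reply = naive_extract_from_plain(text)
# reply == "Reply"
```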
README.rst (new file, 114 lines)

```diff
@@ -0,0 +1,114 @@
+talon
+=====
+
+Mailgun library to extract message quotations and signatures.
+
+If you ever tried to parse message quotations or signatures you know that absence of any formatting standards in this area could make this task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON - multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in a number of hostile environments. That’s what a good quotations and signature parser should be like :smile:
+
+Usage
+-----
+
+Here’s how you initialize the library and extract a reply from a text
+message:
+
+.. code:: python
+
+    import talon
+    from talon import quotations
+
+    talon.init()
+
+    text = """Reply
+
+    -----Original Message-----
+
+    Quote"""
+
+    reply = quotations.extract_from(text, 'text/plain')
+    reply = quotations.extract_from_plain(text)
+    # reply == "Reply"
+
+To extract a reply from html:
+
+.. code:: python
+
+    html = """Reply
+    <blockquote>
+
+    <div>
+    On 11-Apr-2011, at 6:54 PM, Bob <bob@example.com> wrote:
+    </div>
+
+    <div>
+    Quote
+    </div>
+
+    </blockquote>"""
+
+    reply = quotations.extract_from(html, 'text/html')
+    reply = quotations.extract_from_html(html)
+    # reply == "<html><body><p>Reply</p></body></html>"
+
+Often the best way is the easiest one. Here’s how you can extract
+signature from email message without any
+machine learning fancy stuff:
+
+.. code:: python
+
+    from talon.signature.bruteforce import extract_signature
+
+
+    message = """Wow. Awesome!
+    --
+    Bob Smith"""
+
+    text, signature = extract_signature(message)
+    # text == "Wow. Awesome!"
+    # signature == "--\nBob Smith"
+
+Quick and works like a charm 90% of the time. For other 10% you can use
+the power of machine learning algorithms:
+
+.. code:: python
+
+    import talon
+    # don't forget to init the library first
+    # it loads machine learning classifiers
+    talon.init()
+
+    from talon import signature
+
+
+    message = """Thanks Sasha, I can't go any higher and is why I limited it to the
+    homepage.
+
+    John Doe
+    via mobile"""
+
+    text, signature = signature.extract(message, sender='john.doe@example.com')
+    # text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
+    # signature == "John Doe\nvia mobile"
+
+For machine learning talon currently uses `PyML`_ library to build SVM
+classifiers. The core of machine learning algorithm lays in
+``talon.signature.learning package``. It defines a set of features to
+apply to a message (``featurespace.py``), how data sets are built
+(``dataset.py``), classifier’s interface (``classifier.py``).
+
+The data used for training is taken from our personal email
+conversations and from `ENRON`_ dataset. As a result of applying our set
+of features to the dataset we provide files ``classifier`` and
+``train.data`` that don’t have any personal information but could be
+used to load trained classifier. Those files should be regenerated every
+time the feature/data set is changed.
+
+.. _PyML: http://pyml.sourceforge.net/
+.. _ENRON: https://www.cs.cmu.edu/~enron/
+
+Research
+--------
+
+The library is inspired by the following research papers and projects:
+
+- http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
+- http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
```
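The brute-force signature extraction documented in the README can be sketched in a few lines. This is an illustration of the idea only, not talon's `extract_signature` (which applies more heuristics); `naive_extract_signature` and `SIG_DELIMITER` are names introduced here for the example:

```python
import re

# Hedged sketch: treat the trailing block that starts at the last
# "--" delimiter line as the signature.
SIG_DELIMITER = re.compile(r'^\s*--\s*$')

def naive_extract_signature(message):
    """Split a message into (text, signature) at the last '--' line."""
    lines = message.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if SIG_DELIMITER.match(lines[i]):
            return '\n'.join(lines[:i]).rstrip(), '\n'.join(lines[i:])
    return message, None

message = """Wow. Awesome!
--
Bob Smith"""
text, signature = naive_extract_signature(message)
# text == "Wow. Awesome!"
# signature == "--\nBob Smith"
```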
setup.py (7 lines changed)

```diff
@@ -7,10 +7,10 @@ from setuptools import setup, find_packages


 setup(name='talon',
-      version='1.0',
+      version='1.0.2',
       description=("Mailgun library "
                    "to extract message quotations and signatures."),
-      long_description=open("README.md").read(),
+      long_description=open("README.rst").read(),
       author='Mailgun Inc.',
       author_email='admin@mailgunhq.com',
       url='https://github.com/mailgun/talon',
@@ -26,7 +26,8 @@ setup(name='talon',
           "html2text",
           "nose==1.2.1",
           "mock",
-          "coverage"
+          "coverage",
+          "flanker"
           ]
      )
```
```diff
@@ -73,6 +73,9 @@ SPLITTER_PATTERNS = [
     re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.VERBOSE),
     RE_ON_DATE_SMB_WROTE,
     re.compile('(_+\r?\n)?[\s]*(:?[*]?From|Date):[*]? .*'),
+    re.compile('(_+\r?\n)?[\s]*(:?[*]?Van|Datum):[*]? .*'),
+    re.compile('(_+\r?\n)?[\s]*(:?[*]?De|Date):[*]? .*'),
+    re.compile('(_+\r?\n)?[\s]*(:?[*]?Von|Datum):[*]? .*'),
     re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?'
                '( \S+){3,6}@\S+:')
     ]
@@ -81,7 +84,7 @@ SPLITTER_PATTERNS = [
 RE_LINK = re.compile('<(http://[^>]*)>')
 RE_NORMALIZED_LINK = re.compile('@@(http://[^>@]*)@@')

-RE_PARANTHESIS_LINK = re.compile("\(https?://")
+RE_PARENTHESIS_LINK = re.compile("\(https?://")

 SPLITTER_MAX_LINES = 4
 MAX_LINES_COUNT = 1000
@@ -169,8 +172,8 @@ def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
         # long links could break sequence of quotation lines but they shouldn't
         # be considered an inline reply
         links = (
-            RE_PARANTHESIS_LINK.search(lines[inline_reply.start() - 1]) or
-            RE_PARANTHESIS_LINK.match(lines[inline_reply.start()].strip()))
+            RE_PARENTHESIS_LINK.search(lines[inline_reply.start() - 1]) or
+            RE_PARENTHESIS_LINK.match(lines[inline_reply.start()].strip()))
         if not links:
             return_flags[:] = [False, -1, -1]
             return lines
@@ -197,7 +200,7 @@ def preprocess(msg_body, delimiter, content_type='text/plain'):
     """Prepares msg_body for being stripped.

     Replaces link brackets so that they couldn't be taken for quotation marker.
-    Splits line in two if splitter pattern preceeded by some text on the same
+    Splits line in two if splitter pattern preceded by some text on the same
     line (done only for 'On <date> <person> wrote:' pattern).
     """
     # normalize links i.e. replace '<', '>' wrapping the link with some symbols
@@ -213,7 +216,7 @@ def preprocess(msg_body, delimiter, content_type='text/plain'):
     msg_body = re.sub(RE_LINK, link_wrapper, msg_body)

     def splitter_wrapper(splitter):
-        """Wrapps splitter with new line"""
+        """Wraps splitter with new line"""
         if splitter.start() and msg_body[splitter.start() - 1] != '\n':
             return '%s%s' % (delimiter, splitter.group())
         else:
@@ -268,7 +271,7 @@ def extract_from_html(msg_body):
     then converting html to text,
     then extracting quotations from text,
     then checking deleted checkpoints,
-    then deleting neccessary tags.
+    then deleting necessary tags.
     """

     if msg_body.strip() == '':
```
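The first hunk above extends splitter detection beyond English `From:`/`Date:` headers to Dutch (`Van`/`Datum`), French (`De`/`Date`), and German (`Von`/`Datum`) reply headers. A quick standalone check of the three added regexes against hypothetical header lines (the sample addresses are made up for the test):

```python
import re

# The three patterns added in the diff, compiled in isolation.
patterns = [
    re.compile(r'(_+\r?\n)?[\s]*(:?[*]?Van|Datum):[*]? .*'),  # Dutch
    re.compile(r'(_+\r?\n)?[\s]*(:?[*]?De|Date):[*]? .*'),    # French
    re.compile(r'(_+\r?\n)?[\s]*(:?[*]?Von|Datum):[*]? .*'),  # German
]

headers = [
    'Van: Bob <bob@example.com>',  # Dutch "From:"
    'De: Bob <bob@example.com>',   # French "From:"
    'Von: Bob <bob@example.com>',  # German "From:"
]

# Each header should be matched by at least one of the new patterns.
matches = [any(p.match(h) for p in patterns) for h in headers]
# matches == [True, True, True]
```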
```diff
@@ -49,7 +49,7 @@ RE_PHONE_SIGNATURE = re.compile(r'''
 # c - could be signature line
 # d - line starts with dashes (could be signature or list item)
 # l - long line
-RE_SIGNATURE_CANDIDAATE = re.compile(r'''
+RE_SIGNATURE_CANDIDATE = re.compile(r'''
     (?P<candidate>c+d)[^d]
     |
     (?P<candidate>c+d)$
@@ -184,5 +184,5 @@ def _process_marked_candidate_indexes(candidate, markers):
     >>> _process_marked_candidate_indexes([9, 12, 14, 15, 17], 'clddc')
     [15, 17]
     """
-    match = RE_SIGNATURE_CANDIDAATE.match(markers[::-1])
+    match = RE_SIGNATURE_CANDIDATE.match(markers[::-1])
     return candidate[-match.end('candidate'):] if match else []
```
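For context, `RE_SIGNATURE_CANDIDATE` consumes a marker string with one character per trailing line (`c`, `d`, `l` per the comments in the hunk), reversed so the regex anchors at the message end. How those markers get produced is not shown in this diff; the sketch below is a hypothetical reduction (the threshold and rules are assumptions introduced here, not talon's actual line classification):

```python
# Hedged sketch of per-line marker generation:
# 'd' for dash-prefixed lines, 'l' for long lines, 'c' otherwise.
TOO_LONG_SIGNATURE_LINE = 60  # assumed threshold, not from the diff

def mark_line(line):
    if line.lstrip().startswith('-'):
        return 'd'
    if len(line) > TOO_LONG_SIGNATURE_LINE:
        return 'l'
    return 'c'

lines = ['Thanks!', '--', 'Bob Smith']
markers = ''.join(mark_line(l) for l in lines)
# markers == 'cdc'
```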
```diff
@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-

-""" The module provides functions for convertion of a message body/body lines
+""" The module provides functions for conversion of a message body/body lines
 into classifiers features space.

 The body and the message sender string are converted into unicode before
@@ -47,9 +47,9 @@ def apply_features(body, features):
     '''Applies features to message body lines.

     Returns list of lists. Each of the lists corresponds to the body line
-    and is constituted by the numbers of features occurances (0 or 1).
+    and is constituted by the numbers of features occurrences (0 or 1).
     E.g. if element j of list i equals 1 this means that
-    feature j occured in line i (counting from the last line of the body).
+    feature j occurred in line i (counting from the last line of the body).
     '''
     # collect all non empty lines
     lines = [line for line in body.splitlines() if line.strip()]
@@ -66,7 +66,7 @@ def build_pattern(body, features):
     '''Converts body into a pattern i.e. a point in the features space.

     Applies features to the body lines and sums up the results.
-    Elements of the pattern indicate how many times a certain feature occured
+    Elements of the pattern indicate how many times a certain feature occurred
     in the last lines of the body.
     '''
     line_patterns = apply_features(body, features)
```
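The docstrings corrected here describe the `apply_features`/`build_pattern` pair: each line becomes a 0/1 vector per feature, and the pattern is the column-wise sum. An illustrative sketch of that behaviour with stand-in feature functions (the `_sketch` names and the two features are inventions for this example, not talon's real feature set):

```python
import re

# Stand-in binary features: "mentions a mobile device" and "starts with --".
features = [
    lambda line: 1 if re.search(r'\b(mobile|sent from)\b', line, re.I) else 0,
    lambda line: 1 if re.match(r'--', line) else 0,
]

def apply_features_sketch(body):
    """One 0/1 vector per non-empty body line, one element per feature."""
    lines = [line for line in body.splitlines() if line.strip()]
    return [[f(line) for f in features] for line in lines]

def build_pattern_sketch(body):
    """Column-wise sum of the per-line vectors."""
    line_patterns = apply_features_sketch(body)
    return [sum(col) for col in zip(*line_patterns)]

body = "See you then\n--\nBob\nvia mobile"
# apply_features_sketch(body) == [[0, 0], [0, 1], [0, 0], [1, 0]]
# build_pattern_sketch(body) == [1, 1]
```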
```diff
@@ -94,7 +94,7 @@ def binary_regex_match(prog):


 def flatten_list(list_to_flatten):
-    """Simple list comprehesion to flatten list.
+    """Simple list comprehension to flatten list.

     >>> flatten_list([[1, 2], [3, 4, 5]])
     [1, 2, 3, 4, 5]
@@ -155,7 +155,7 @@ def extract_names(sender):


 def categories_percent(s, categories):
-    '''Returns category characters persent.
+    '''Returns category characters percent.

     >>> categories_percent("qqq ggg hhh", ["Po"])
     0.0
@@ -177,7 +177,7 @@ def categories_percent(s, categories):


 def punctuation_percent(s):
-    '''Returns punctuation persent.
+    '''Returns punctuation percent.

     >>> punctuation_percent("qqq ggg hhh")
     0.0
```
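The corrected docstrings agree with their doctests: `categories_percent` reports what share of a string's characters fall in the given Unicode categories (`"Po"` is "Punctuation, other"). A sketch consistent with those doctests (an assumption inferred from the docstrings, not necessarily talon's exact implementation; the `_sketch` names are introduced here):

```python
import unicodedata

def categories_percent_sketch(s, categories):
    """Percent of characters whose Unicode category is in `categories`."""
    if not s:
        return 0.0
    count = sum(1 for c in s if unicodedata.category(c) in categories)
    return 100 * float(count) / len(s)

def punctuation_percent_sketch(s):
    """Percent of 'Punctuation, other' characters."""
    return categories_percent_sketch(s, ['Po'])

# punctuation_percent_sketch("qqq ggg hhh") == 0.0   (letters and spaces only)
# categories_percent_sketch("x.y.", ["Po"]) == 50.0  (2 dots out of 4 chars)
```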