Compare commits

15 Commits

| Author | SHA1 | Date |
|---|---|---|
| | 52505bba8a | |
| | 79cd4fcc52 | |
| | a4f156b174 | |
| | 1789ccf3c8 | |
| | 7a42ab3b28 | |
| | 12b0e88a01 | |
| | 8b78da5977 | |
| | b299feab1e | |
| | 95182dcfc4 | |
| | f9fe412fa4 | |
| | 00a8db2e3e | |
| | 71ae26ccd1 | |
| | b0851d5363 | |
| | ac4f5201bb | |
| | 81e88d9222 | |
MANIFEST.in (new file, +9 lines)

@@ -0,0 +1,9 @@
+recursive-include tests *
+recursive-include talon *
+recursive-exclude tests *.pyc *~
+recursive-exclude talon *.pyc *~
+include train.data
+include classifier
+include LICENSE
+include MANIFEST.in
+include README.rst
README.md (deleted, -97 lines)

@@ -1,97 +0,0 @@
-talon
-=====
-
-Mailgun library to extract message quotations and signatures.
-
-If you ever tried to parse message quotations or signatures you know that absense of any formatting standards in this area
-could make this task a nightmare. Hopefully this library will make your life much eathier. The name of the project is
-inspired by TALON - multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in
-a number of hostile environments. That's what a good quotations and signature parser should be like :smile:
-
-Usage
------
-
-Here's how you initialize the library and extract a reply from a text message:
-
-```python
-import talon
-from talon import quotations
-
-talon.init()
-
-text = """Reply
-
------Original Message-----
-
-Quote"""
-
-reply = quotations.extract_from(text, 'text/plain')
-reply = quotations.extract_from_plain(text)
-# reply == "Reply"
-```
-
-To extract a reply from html:
-
-```python
-html = """Reply
-<blockquote>
-
-  <div>
-    On 11-Apr-2011, at 6:54 PM, Bob <bob@example.com> wrote:
-  </div>
-
-  <div>
-    Quote
-  </div>
-
-</blockquote>"""
-
-reply = quotations.extract_from(html, 'text/html')
-reply = quotations.extract_from_html(html)
-# reply == "<html><body><p>Reply</p></body></html>"
-```
-
-Often the best way is the easiest one. Here's how you can extract signature from email message without any
-machine learning fancy stuff:
-
-```python
-from talon.signature.bruteforce import extract_signature
-
-
-message = """Wow. Awesome!
---
-Bob Smith"""
-
-text, signature = extract_signature(message)
-# text == "Wow. Awesome!"
-# signature == "--\nBob Smith"
-```
-
-Quick and works like a charm 90% of the time. For other 10% you can use the power of machine learning algorithms:
-
-```python
-from talon import signature
-
-
-message = """Thanks Sasha, I can't go any higher and is why I limited it to the
-homepage.
-
-John Doe
-via mobile"""
-
-text, signature = signature.extract(message, sender='john.doe@example.com')
-# text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
-# signature == "John Doe\nvia mobile"
-```
-
-For machine learning talon currently uses [PyML](http://pyml.sourceforge.net/) library to build SVM classifiers. The core of machine learning algorithm lays in ``talon.signature.learning package``. It defines a set of features to apply to a message (``featurespace.py``), how data sets are built (``dataset.py``), classifier's interface (``classifier.py``).
-
-The data used for training is taken from our personal email conversations and from [ENRON](https://www.cs.cmu.edu/~enron/) dataset. As a result of applying our set of features to the dataset we provide files ``classifier`` and ``train.data`` that don't have any personal information but could be used to load trained classifier. Those files should be regenerated every time the feature/data set is changed.
-
-Research
---------
-
-The library is inspired by the following research papers and projects:
-
-* http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
-* http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
README.rst (new file, +114 lines)

@@ -0,0 +1,114 @@
+talon
+=====
+
+Mailgun library to extract message quotations and signatures.
+
+If you ever tried to parse message quotations or signatures you know that absence of any formatting standards in this area could make this task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON - multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in a number of hostile environments. That's what a good quotations and signature parser should be like :smile:
+
+Usage
+-----
+
+Here's how you initialize the library and extract a reply from a text
+message:
+
+.. code:: python
+
+    import talon
+    from talon import quotations
+
+    talon.init()
+
+    text = """Reply
+
+    -----Original Message-----
+
+    Quote"""
+
+    reply = quotations.extract_from(text, 'text/plain')
+    reply = quotations.extract_from_plain(text)
+    # reply == "Reply"
+
+To extract a reply from html:
+
+.. code:: python
+
+    html = """Reply
+    <blockquote>
+
+      <div>
+        On 11-Apr-2011, at 6:54 PM, Bob <bob@example.com> wrote:
+      </div>
+
+      <div>
+        Quote
+      </div>
+
+    </blockquote>"""
+
+    reply = quotations.extract_from(html, 'text/html')
+    reply = quotations.extract_from_html(html)
+    # reply == "<html><body><p>Reply</p></body></html>"
+
+Often the best way is the easiest one. Here's how you can extract
+signature from email message without any
+machine learning fancy stuff:
+
+.. code:: python
+
+    from talon.signature.bruteforce import extract_signature
+
+
+    message = """Wow. Awesome!
+    --
+    Bob Smith"""
+
+    text, signature = extract_signature(message)
+    # text == "Wow. Awesome!"
+    # signature == "--\nBob Smith"
+
+Quick and works like a charm 90% of the time. For other 10% you can use
+the power of machine learning algorithms:
+
+.. code:: python
+
+    import talon
+    # don't forget to init the library first
+    # it loads machine learning classifiers
+    talon.init()
+
+    from talon import signature
+
+
+    message = """Thanks Sasha, I can't go any higher and is why I limited it to the
+    homepage.
+
+    John Doe
+    via mobile"""
+
+    text, signature = signature.extract(message, sender='john.doe@example.com')
+    # text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
+    # signature == "John Doe\nvia mobile"
+
+For machine learning talon currently uses `PyML`_ library to build SVM
+classifiers. The core of machine learning algorithm lays in
+``talon.signature.learning package``. It defines a set of features to
+apply to a message (``featurespace.py``), how data sets are built
+(``dataset.py``), classifier's interface (``classifier.py``).
+
+The data used for training is taken from our personal email
+conversations and from `ENRON`_ dataset. As a result of applying our set
+of features to the dataset we provide files ``classifier`` and
+``train.data`` that don't have any personal information but could be
+used to load trained classifier. Those files should be regenerated every
+time the feature/data set is changed.
+
+.. _PyML: http://pyml.sourceforge.net/
+.. _ENRON: https://www.cs.cmu.edu/~enron/
+
+Research
+--------
+
+The library is inspired by the following research papers and projects:
+
+- http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
+- http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
setup.py

@@ -7,10 +7,10 @@ from setuptools import setup, find_packages


 setup(name='talon',
-      version='1.0',
+      version='1.0.2',
       description=("Mailgun library "
                    "to extract message quotations and signatures."),
-      long_description=open("README.md").read(),
+      long_description=open("README.rst").read(),
       author='Mailgun Inc.',
       author_email='admin@mailgunhq.com',
       url='https://github.com/mailgun/talon',
@@ -26,7 +26,8 @@ setup(name='talon',
           "html2text",
           "nose==1.2.1",
           "mock",
-          "coverage"
+          "coverage",
+          "flanker"
           ]
       )
@@ -73,6 +73,9 @@ SPLITTER_PATTERNS = [
     re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.VERBOSE),
     RE_ON_DATE_SMB_WROTE,
     re.compile('(_+\r?\n)?[\s]*(:?[*]?From|Date):[*]? .*'),
+    re.compile('(_+\r?\n)?[\s]*(:?[*]?Van|Datum):[*]? .*'),
+    re.compile('(_+\r?\n)?[\s]*(:?[*]?De|Date):[*]? .*'),
+    re.compile('(_+\r?\n)?[\s]*(:?[*]?Von|Datum):[*]? .*'),
     re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?'
                '( \S+){3,6}@\S+:')
     ]
@@ -81,7 +84,7 @@ SPLITTER_PATTERNS = [
 RE_LINK = re.compile('<(http://[^>]*)>')
 RE_NORMALIZED_LINK = re.compile('@@(http://[^>@]*)@@')

-RE_PARANTHESIS_LINK = re.compile("\(https?://")
+RE_PARENTHESIS_LINK = re.compile("\(https?://")

 SPLITTER_MAX_LINES = 4
 MAX_LINES_COUNT = 1000
@@ -169,8 +172,8 @@ def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
         # long links could break sequence of quotation lines but they shouldn't
         # be considered an inline reply
         links = (
-            RE_PARANTHESIS_LINK.search(lines[inline_reply.start() - 1]) or
-            RE_PARANTHESIS_LINK.match(lines[inline_reply.start()].strip()))
+            RE_PARENTHESIS_LINK.search(lines[inline_reply.start() - 1]) or
+            RE_PARENTHESIS_LINK.match(lines[inline_reply.start()].strip()))
         if not links:
             return_flags[:] = [False, -1, -1]
             return lines
@@ -197,7 +200,7 @@ def preprocess(msg_body, delimiter, content_type='text/plain'):
     """Prepares msg_body for being stripped.

     Replaces link brackets so that they couldn't be taken for quotation marker.
-    Splits line in two if splitter pattern preceeded by some text on the same
+    Splits line in two if splitter pattern preceded by some text on the same
     line (done only for 'On <date> <person> wrote:' pattern).
     """
     # normalize links i.e. replace '<', '>' wrapping the link with some symbols
@@ -213,7 +216,7 @@ def preprocess(msg_body, delimiter, content_type='text/plain'):
     msg_body = re.sub(RE_LINK, link_wrapper, msg_body)

     def splitter_wrapper(splitter):
-        """Wrapps splitter with new line"""
+        """Wraps splitter with new line"""
         if splitter.start() and msg_body[splitter.start() - 1] != '\n':
             return '%s%s' % (delimiter, splitter.group())
         else:
@@ -268,7 +271,7 @@ def extract_from_html(msg_body):
     then converting html to text,
     then extracting quotations from text,
     then checking deleted checkpoints,
-    then deleting neccessary tags.
+    then deleting necessary tags.
     """

     if msg_body.strip() == '':
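The three patterns added to SPLITTER_PATTERNS above extend splitter detection beyond English to Dutch ("Van:"/"Datum:"), French ("De:"/"Date:") and German ("Von:"/"Datum:") reply headers. A standalone sketch of what they match (the patterns are compiled here in isolation, outside the library):

```python
import re

# The three splitter patterns added in the hunk above, compiled standalone
# for illustration; raw strings are used here for clean escaping.
localized_splitters = [
    re.compile(r'(_+\r?\n)?[\s]*(:?[*]?Van|Datum):[*]? .*'),   # Dutch
    re.compile(r'(_+\r?\n)?[\s]*(:?[*]?De|Date):[*]? .*'),     # French
    re.compile(r'(_+\r?\n)?[\s]*(:?[*]?Von|Datum):[*]? .*'),   # German
]

headers = [
    'Van: Bob <bob@example.com>',   # Dutch forwarded-message header
    'De: Bob <bob@example.com>',    # French
    'Von: Bob <bob@example.com>',   # German
]

for header in headers:
    assert any(p.match(header) for p in localized_splitters)
```

Each header line is recognized by at least one of the new patterns, so quotations introduced by localized mail clients can now be split off.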
@@ -49,7 +49,7 @@ RE_PHONE_SIGNATURE = re.compile(r'''
 # c - could be signature line
 # d - line starts with dashes (could be signature or list item)
 # l - long line
-RE_SIGNATURE_CANDIDAATE = re.compile(r'''
+RE_SIGNATURE_CANDIDATE = re.compile(r'''
    (?P<candidate>c+d)[^d]
    |
    (?P<candidate>c+d)$
@@ -184,5 +184,5 @@ def _process_marked_candidate_indexes(candidate, markers):
     >>> _process_marked_candidate_indexes([9, 12, 14, 15, 17], 'clddc')
     [15, 17]
     """
-    match = RE_SIGNATURE_CANDIDAATE.match(markers[::-1])
+    match = RE_SIGNATURE_CANDIDATE.match(markers[::-1])
     return candidate[-match.end('candidate'):] if match else []
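The renamed RE_SIGNATURE_CANDIDATE is matched against the *reversed* marker string, which anchors the search at the last lines of the message, where signatures live. A simplified single-branch sketch of that idea (the real pattern has further alternatives, and the marker string below is a made-up example):

```python
import re

# One marker character per body line: 'c' = could be a signature line,
# 'd' = line starts with dashes, 'l' = long line.  This single-branch
# stand-in only covers the 'c+d' shape from the hunk above.
RE_CANDIDATE_SKETCH = re.compile(r'(?P<candidate>c+d)')

markers = 'ldcc'            # long line, dashes, then two candidate lines
m = RE_CANDIDATE_SKETCH.match(markers[::-1])   # reversed: 'ccdl'
n = m.end('candidate')      # 3 -> the last three lines form the candidate

line_indexes = [10, 11, 12, 13]   # hypothetical line indexes for the body
print(line_indexes[-n:])          # [11, 12, 13]
```

Reversing before matching is what lets a left-anchored `match` pick candidates off the end of the message, exactly as `_process_marked_candidate_indexes` does with `markers[::-1]`.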
@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-

-""" The module provides functions for convertion of a message body/body lines
+""" The module provides functions for conversion of a message body/body lines
 into classifiers features space.

 The body and the message sender string are converted into unicode before
@@ -47,9 +47,9 @@ def apply_features(body, features):
     '''Applies features to message body lines.

     Returns list of lists. Each of the lists corresponds to the body line
-    and is constituted by the numbers of features occurances (0 or 1).
+    and is constituted by the numbers of features occurrences (0 or 1).
     E.g. if element j of list i equals 1 this means that
-    feature j occured in line i (counting from the last line of the body).
+    feature j occurred in line i (counting from the last line of the body).
     '''
     # collect all non empty lines
     lines = [line for line in body.splitlines() if line.strip()]
@@ -66,7 +66,7 @@ def build_pattern(body, features):
     '''Converts body into a pattern i.e. a point in the features space.

     Applies features to the body lines and sums up the results.
-    Elements of the pattern indicate how many times a certain feature occured
+    Elements of the pattern indicate how many times a certain feature occurred
     in the last lines of the body.
     '''
     line_patterns = apply_features(body, features)
@@ -94,7 +94,7 @@ def binary_regex_match(prog):


 def flatten_list(list_to_flatten):
-    """Simple list comprehesion to flatten list.
+    """Simple list comprehension to flatten list.

     >>> flatten_list([[1, 2], [3, 4, 5]])
     [1, 2, 3, 4, 5]
@@ -155,7 +155,7 @@ def extract_names(sender):


 def categories_percent(s, categories):
-    '''Returns category characters persent.
+    '''Returns category characters percent.

     >>> categories_percent("qqq ggg hhh", ["Po"])
     0.0
@@ -177,7 +177,7 @@ def categories_percent(s, categories):


 def punctuation_percent(s):
-    '''Returns punctuation persent.
+    '''Returns punctuation percent.

     >>> punctuation_percent("qqq ggg hhh")
     0.0
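The docstring fixes above describe the feature-space layout: `apply_features` maps each non-empty body line to a 0/1 vector of feature occurrences, and `build_pattern` sums those vectors into a single point in the feature space. A minimal self-contained sketch of that shape, using two hypothetical features (not talon's real feature set):

```python
import re

# Two hypothetical features for illustration only.
features = [
    lambda line: 1 if re.match(r'--', line) else 0,   # line starts with dashes
    lambda line: 1 if len(line) > 60 else 0,          # long line
]

def apply_features(body, features):
    """One 0/1 vector per non-empty body line (feature occurrences)."""
    lines = [line for line in body.splitlines() if line.strip()]
    return [[f(line) for f in features] for line in lines]

def build_pattern(body, features):
    """Sum the per-line vectors into a point in the features space."""
    return [sum(col) for col in zip(*apply_features(body, features))]

body = "Wow. Awesome!\n--\nBob Smith"
print(apply_features(body, features))   # [[0, 0], [1, 0], [0, 0]]
print(build_pattern(body, features))    # [1, 0]
```

The summed pattern is what gets fed to the classifier; only the aggregate counts, not the raw lines, reach the SVM.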