Merge branch 'master' of github.com:mailgun/talon

add train data, classifier and fixtures to MANIFEST.in
Merge pull request #2 from ivuk/master
5 changed files with 120 additions and 101 deletions
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -0,0 +1,9 @@
 recursive-include tests *
 recursive-include talon *
 recursive-exclude tests *.pyc *~
 recursive-exclude talon *.pyc *~
 include train.data
 include classifier
 include LICENSE
 include MANIFEST.in
 include README.rst
--- a/README.md
+++ b/README.md
@@ -1,97 +0,0 @@
 talon
 =====
 Mailgun library to extract message quotations and signatures.
 If you ever tried to parse message quotations or signatures you know that absense of any formatting standards in this area
 could make this task a nightmare. Hopefully this library will make your life much eathier. The name of the project is
 inspired by TALON - multipurpose robot  designed to perform missions ranging from reconnaissance to combat and operate in
 a number of hostile environments. That's what a good quotations and signature parser should be like :smile:
 Usage
 -----
 Here's how you initialize the library and extract a reply from a text message:
 ```python
 import talon
 from talon import quotations
 talon.init()
 text =  """Reply
 -----Original Message-----
 Quote"""
 reply = quotations.extract_from(text, 'text/plain')
 reply = quotations.extract_from_plain(text)
 # reply == "Reply"
 ```
 To extract a reply from html:
 ```python
 html = """Reply
 <blockquote>
  <div>
    On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
  </div>
  <div>
    Quote
  </div>
 </blockquote>"""
 reply = quotations.extract_from(html, 'text/html')
 reply = quotations.extract_from_html(html)
 # reply == "<html><body><p>Reply</p></body></html>"
 ```
 Often the best way is the easiest one. Here's how you can extract signature from email message without any
 machine learning fancy stuff:
 ```python
 from talon.signature.bruteforce import extract_signature
 message = """Wow. Awesome!
 --
 Bob Smith"""
 text, signature = extract_signature(message)
 # text == "Wow. Awesome!"
 # signature == "--\nBob Smith"
 ```
 Quick and works like a charm 90% of the time. For other 10% you can use the power of machine learning algorithms:
 ```python
 from talon import signature
 message = """Thanks Sasha, I can't go any higher and is why I limited it to the
 homepage.
 John Doe
 via mobile"""
 text, signature = signature.extract(message, sender='john.doe@example.com')
 # text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
 # signature == "John Doe\nvia mobile"
 ```
 For machine learning talon currently uses [PyML](http://pyml.sourceforge.net/) library to build SVM classifiers. The core of machine learning algorithm lays in ``talon.signature.learning package``. It defines a set of features to apply to a message (``featurespace.py``), how data sets are built (``dataset.py``), classifier's interface (``classifier.py``).
 The data used for training is taken from our personal email conversations and from [ENRON](https://www.cs.cmu.edu/~enron/) dataset. As a result of applying our set of features to the dataset we provide files ``classifier`` and ``train.data`` that don't have any personal information but could be used to load trained classifier. Those files should be regenerated every time the feature/data set is changed.
 Research
 --------
 The library is inspired by the following research papers and projects:
 * http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
 * http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
--- a/README.rst
+++ b/README.rst
@@ -0,0 +1,109 @@
 talon
 =====
 Mailgun library to extract message quotations and signatures.
 If you have ever tried to parse message quotations or signatures, you know that the absence of any formatting standards in this area can make the task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON, a multipurpose robot designed to perform missions ranging from reconnaissance to combat and to operate in a number of hostile environments. That’s what a good quotation and signature parser should be like :smile:
 Usage
 -----
 Here’s how you initialize the library and extract a reply from a text
 message:
 .. code:: python
    import talon
    from talon import quotations
    talon.init()
    text =  """Reply
    -----Original Message-----
    Quote"""
    reply = quotations.extract_from(text, 'text/plain')
    reply = quotations.extract_from_plain(text)
    # reply == "Reply"
 To extract a reply from html:
 .. code:: python
    html = """Reply
    <blockquote>
      <div>
        On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
      </div>
      <div>
        Quote
      </div>
    </blockquote>"""
    reply = quotations.extract_from(html, 'text/html')
    reply = quotations.extract_from_html(html)
    # reply == "<html><body><p>Reply</p></body></html>"
 Often the best way is the easiest one. Here’s how you can extract a
 signature from an email message without any fancy machine learning:
 .. code:: python
    from talon.signature.bruteforce import extract_signature
    message = """Wow. Awesome!
    --
    Bob Smith"""
    text, signature = extract_signature(message)
    # text == "Wow. Awesome!"
    # signature == "--\nBob Smith"
 This is quick and works like a charm 90% of the time. For the other 10%,
 you can use the power of machine learning algorithms:
 .. code:: python
    from talon import signature
    message = """Thanks Sasha, I can't go any higher and is why I limited it to the
    homepage.
    John Doe
    via mobile"""
    text, signature = signature.extract(message, sender='john.doe@example.com')
    # text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
    # signature == "John Doe\nvia mobile"
 For machine learning, talon currently uses the `PyML`_ library to build SVM
 classifiers. The core of the machine learning algorithm lies in the
 ``talon.signature.learning`` package. It defines the set of features to
 apply to a message (``featurespace.py``), how data sets are built
 (``dataset.py``), and the classifier’s interface (``classifier.py``).
 The data used for training is taken from our personal email
 conversations and from the `ENRON`_ dataset. As a result of applying our
 set of features to the dataset, we provide the files ``classifier`` and
 ``train.data``, which don’t contain any personal information but can be
 used to load the trained classifier. Those files should be regenerated
 every time the feature set or data set is changed.
 .. _PyML: http://pyml.sourceforge.net/
 .. _ENRON: https://www.cs.cmu.edu/~enron/
 Research
 --------
 The library is inspired by the following research papers and projects:
 -  http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
 -  http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,2 +0,0 @@
 [metadata]
 description-file = README.md
--- a/setup.py
+++ b/setup.py
@@ -7,10 +7,10 @@ from setuptools import setup, find_packages
 setup(name='talon',
-      version='1.0',
+      version='1.0.2',
      description=("Mailgun library "
                   "to extract message quotations and signatures."),
-      long_description=open("README.md").read(),
+      long_description=open("README.rst").read(),
      author='Mailgun Inc.',
      author_email='admin@mailgunhq.com',
      url='https://github.com/mailgun/talon',
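Note on the setup.py hunk above: `open("README.rst").read()` never closes the file handle explicitly, and on Python 3 it decodes with the platform's default encoding. The README.rst added in this diff contains non-ASCII apostrophes (`’`), so the read could fail under a non-UTF-8 locale. Below is a minimal sketch of a more defensive read, assuming the file is UTF-8 encoded. `read_long_description` is a hypothetical helper name and is not part of talon.

```python
import io


def read_long_description(path="README.rst", encoding="utf-8"):
    """Return the README text for use as setup()'s long_description.

    Hypothetical helper, not part of talon. Assumes the file is UTF-8.
    """
    # io.open accepts an explicit encoding on both Python 2 and Python 3,
    # and the with-block closes the file handle as soon as the read is done.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()
```

In setup.py the result could then be passed as `long_description=read_long_description()`.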
Author	SHA1	Message	Date
Sergey Obukhov	b299feab1e	Merge branch 'master' of github.com:mailgun/talon	2014-07-24 15:43:11 -07:00
Sergey Obukhov	95182dcfc4	add train data, classifier and fixtures to MANIFEST.in	2014-07-24 15:42:50 -07:00
Ashish Gandhi	f9fe412fa4	Merge pull request #2 from ivuk/master Fix a typo in README.rst	2014-07-24 15:27:05 -07:00
Igor Vuk	00a8db2e3e	Fix a typo in README.rst	2014-07-25 00:18:13 +02:00
Sergey Obukhov	71ae26ccd1	added MANIFEST.in and bump up the version	2014-07-24 15:09:01 -07:00
Sergey Obukhov	b0851d5363	bump up the version	2014-07-24 14:59:32 -07:00
Sergey Obukhov	ac4f5201bb	Update README.rst	2014-07-24 07:22:25 -07:00
Sergey Obukhov	81e88d9222	use reStructuredText instead of Markdown	2014-07-24 06:56:10 -07:00