Merge pull request #95 from mailgun/sergey/forge

open-sourcing email dataset
3 changed files with 17 additions and 3 deletions
--- a/README.rst
+++ b/README.rst
@@ -95,7 +95,7 @@ classifiers. The core of machine learning algorithm lays in
 apply to a message (``featurespace.py``), how data sets are built
 (``dataset.py``), classifier’s interface (``classifier.py``).

-The data used for training is taken from our personal email
+Currently the data used for training is taken from our personal email
 conversations and from `ENRON`_ dataset. As a result of applying our set
 of features to the dataset we provide files ``classifier`` and
 ``train.data`` that don’t have any personal information but could be
@@ -116,8 +116,19 @@ or
    from talon.signature.learning.classifier import train, init
    train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)

+Open-source Dataset
+-------------------
+
+Recently we started a `forge`_ project to create an open-source, annotated dataset of raw emails. In the project we
+used a subset of `ENRON`_ data, cleansed of private, health and financial information by `EDRM`_. At the moment over 190
+emails are annotated. Any contribution and collaboration on the project are welcome. Once the dataset is ready we plan to
+start using it for talon.
+
 .. _scikit-learn: http://scikit-learn.org
 .. _ENRON: https://www.cs.cmu.edu/~enron/
+.. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
+.. _forge: https://github.com/mailgun/forge
+

 Research
 --------
--- a/setup.py
+++ b/setup.py
@@ -2,7 +2,7 @@ from setuptools import setup, find_packages


 setup(name='talon',
-      version='1.2.7',
+      version='1.2.10',
      description=("Mailgun library "
                   "to extract message quotations and signatures."),
      long_description=open("README.rst").read(),
--- a/talon/html_quotations.py
+++ b/talon/html_quotations.py
@@ -86,9 +86,12 @@ def cut_gmail_quote(html_message):
 def cut_microsoft_quote(html_message):
    ''' Cuts splitter block and all following blocks. '''
    splitter = html_message.xpath(
-        #outlook 2007, 2010
+        #outlook 2007, 2010 (international)
        "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
        "padding:3.0pt 0cm 0cm 0cm']|"
+        #outlook 2007, 2010 (american)
+        "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
+        "padding:3.0pt 0in 0in 0in']|"
        #windows mail
        "//div[@style='padding-top: 5px; "
        "border-top-color: rgb(229, 229, 229); "
Author	SHA1	Message	Date
Sergey Obukhov	35645f9ade	Merge pull request #95 from mailgun/sergey/forge open-sourcing email dataset	2016-06-10 15:45:29 -07:00
Sergey Obukhov	7c3d91301c	open-sourcing email dataset	2016-06-10 14:10:53 -07:00
Sergey Obukhov	5bcf7403ad	Merge pull request #94 from mailgun/obukhov-sergey-patch-1 Update README.rst	2016-05-31 20:16:13 -07:00
Sergey Obukhov	2d6c092b65	bump version	2016-05-31 18:42:47 -07:00
Sergey Obukhov	6d0689cad6	Update README.rst	2016-05-31 18:39:07 -07:00
Sergey Obukhov	3f80e93ee0	Merge pull request #93 from mailgun/sergey/version-bump bump	2016-05-31 18:15:28 -07:00
Sergey Obukhov	1b18abab1d	bump	2016-05-31 16:53:41 -07:00
Sergey Obukhov	03dd5af5ab	Merge pull request #91 from KevinCathcart/patch-1 Support outlook 2007/2010 running in en-us locale	2016-05-31 16:50:35 -07:00
Sergey Obukhov	dfba82b07c	Merge pull request #92 from mailgun/obukhov-sergey-kuntzcamera Update README.rst	2016-05-31 15:42:34 -07:00
Sergey Obukhov	08ca02c87f	Update README.rst	2016-05-31 15:14:32 -07:00
Kevin Cathcart	b61f4ec095	Support outlook 2007/2010 running in en-us locale My American English copy of outlook 2007 is using inches in the reply separator rather than centimeters. The separator is otherwise identical. What a strange thing to localize. I'm guessing it uses whatever it thinks the preferred units for page margins are.	2016-05-23 17:23:53 -04:00
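
The two Outlook 2007/2010 separator styles in the `cut_microsoft_quote` hunk differ only in the padding unit (`cm` vs `in`). A minimal sketch of matching both variants follows. It is not talon's implementation: the patched code calls `html_message.xpath(...)` with the alternatives joined by `|`, while this sketch uses the standard library's `xml.etree.ElementTree`, which has no `|` union, so it runs one query per style. The names `OUTLOOK_SEPARATOR_STYLES` and `find_outlook_separators` are hypothetical, and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Style strings taken from the diff; only the padding unit differs.
# "cm" is the variant talon already handled, "in" is the en-us variant.
OUTLOOK_SEPARATOR_STYLES = [
    "border:none;border-top:solid #B5C4DF 1.0pt;"
    "padding:3.0pt 0{u} 0{u} 0{u}".format(u=unit)
    for unit in ("cm", "in")
]


def find_outlook_separators(root):
    """Return <div> elements whose style equals either separator variant.

    Hypothetical helper: the match is exact, like the @style='...'
    predicates in the diff, so any other whitespace or unit is not found.
    """
    found = []
    for style in OUTLOOK_SEPARATOR_STYLES:
        # ElementTree supports [@attr='value'] predicates but not unions,
        # so each variant is queried separately.
        found.extend(root.findall(".//div[@style='%s']" % style))
    return found


# Made-up, well-formed sample of a reply written in an en-us Outlook.
sample = ET.fromstring(
    '<html><body><p>Thanks!</p>'
    '<div style="%s"><p>From: someone</p></div>'
    '</body></html>' % OUTLOOK_SEPARATOR_STYLES[1]
)
print(len(find_outlook_separators(sample)))
```

Because the predicates compare the whole `style` attribute, each new locale or unit needs its own entry, which is why the diff adds a second alternative rather than loosening the first.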