Update README.rst

2016-05-31 15:14:32 -07:00
parent 9dbe6a494b
commit 08ca02c87f
1 changed files with 11 additions and 1 deletions
--- a/README.rst
+++ b/README.rst
@@ -95,7 +95,7 @@ classifiers. The core of machine learning algorithm lays in
 apply to a message (``featurespace.py``), how data sets are built
 (``dataset.py``), classifier’s interface (``classifier.py``).

-The data used for training is taken from our personal email
+Currently, the data used for training is taken from our personal email
 conversations and from `ENRON`_ dataset. As a result of applying our set
 of features to the dataset we provide files ``classifier`` and
 ``train.data`` that don’t have any personal information but could be
@@ -116,8 +116,18 @@ or
    from talon.signature.learning.classifier import train, init
    train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)

+Open-source Dataset
+-------------------
+
+Recently we started the `kuntzcamera`_ project to create an open-source, annotated dataset of raw emails. The project
+uses a subset of the `ENRON`_ data, cleansed of private, health and financial information by `EDRM`_. At the moment over 190
+emails are annotated. Contributions to and collaboration on the project are welcome. Once the dataset is ready we plan to
+start using it for talon.
+
 .. _scikit-learn: http://scikit-learn.org
 .. _ENRON: https://www.cs.cmu.edu/~enron/
+.. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
+.. _kuntzcamera: https://github.com/mailgun/kuntzcamera

 Research
 --------