diff --git a/README.rst b/README.rst
index 2517450..ddb4efc 100644
--- a/README.rst
+++ b/README.rst
@@ -95,7 +95,7 @@ classifiers. The core of machine learning algorithm lays in
 apply to a message (``featurespace.py``), how data sets are built
 (``dataset.py``), classifier’s interface (``classifier.py``).
 
-The data used for training is taken from our personal email
+Currently the data used for training is taken from our personal email
 conversations and from `ENRON`_ dataset. As a result of applying our set
 of features to the dataset we provide files ``classifier`` and
 ``train.data`` that don’t have any personal information but could be
@@ -116,8 +116,18 @@ or
     from talon.signature.learning.classifier import train, init
     train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)
 
+Open-source Dataset
+-------------------------
+
+Recently we started a `kuntzcamera`_ project to create an open-source, annotated dataset of raw emails. In the project we
+used a subset of `ENRON`_ data, cleansed of private, health and financial information by `EDRM`_. At the moment over 190
+emails are annotated. Any contribution and collaboration on the project are welcome. Once the dataset is ready we plan to
+start using it for talon.
+
 .. _scikit-learn: http://scikit-learn.org
 .. _ENRON: https://www.cs.cmu.edu/~enron/
+.. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
+.. _kuntzcamera: https://github.com/mailgun/kuntzcamera
 
 Research
 --------