Compare commits
6 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 3f80e93ee0 | |
| | 1b18abab1d | |
| | 03dd5af5ab | |
| | dfba82b07c | |
| | 08ca02c87f | |
| | b61f4ec095 | |
README.rst (12 lines changed)
@@ -95,7 +95,7 @@ classifiers. The core of machine learning algorithm lays in
 apply to a message (``featurespace.py``), how data sets are built
 (``dataset.py``), classifier’s interface (``classifier.py``).

-The data used for training is taken from our personal email
+Currently the data used for training is taken from our personal email
 conversations and from `ENRON`_ dataset. As a result of applying our set
 of features to the dataset we provide files ``classifier`` and
 ``train.data`` that don’t have any personal information but could be
@@ -116,8 +116,18 @@ or
     from talon.signature.learning.classifier import train, init
     train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)

+Open-source Dataset
+-------------------------
+
+Recently we started a `kuntzcamera`_ project to create an open-source, annotated dataset of raw emails. In the project we
+used a subset of `ENRON`_ data, cleansed of private, health and financial information by `EDRM`_. At the moment over 190
+emails are annotated. Any contribution and collaboration on the project are welcome. Once the dataset is ready we plan to
+start using it for talon.
+
 .. _scikit-learn: http://scikit-learn.org
 .. _ENRON: https://www.cs.cmu.edu/~enron/
+.. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
+.. _kuntzcamera: https://github.com/mailgun/kuntzcamera

 Research
 --------
setup.py (2 lines changed)
@@ -2,7 +2,7 @@ from setuptools import setup, find_packages


 setup(name='talon',
-      version='1.2.7',
+      version='1.2.8',
       description=("Mailgun library "
                    "to extract message quotations and signatures."),
       long_description=open("README.rst").read(),
@@ -86,9 +86,12 @@ def cut_gmail_quote(html_message):
 def cut_microsoft_quote(html_message):
     ''' Cuts splitter block and all following blocks. '''
     splitter = html_message.xpath(
-        #outlook 2007, 2010
+        #outlook 2007, 2010 (international)
         "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
         "padding:3.0pt 0cm 0cm 0cm']|"
+        #outlook 2007, 2010 (american)
+        "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
+        "padding:3.0pt 0in 0in 0in']|"
         #windows mail
         "//div[@style='padding-top: 5px; "
         "border-top-color: rgb(229, 229, 229); "
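The last hunk adds a second Outlook splitter style that differs from the existing one only in its padding units (`0in` instead of `0cm`). The following is a minimal sketch of that matching idea, not the project's code: it uses the standard-library `xml.etree.ElementTree` rather than the `xpath` call seen in the hunk, and because ElementTree's XPath subset has no `|` union operator, it tries each style in turn. The names `OUTLOOK_SPLITTER_STYLES` and `find_outlook_splitter` are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Style strings copied from the hunk: same border, different padding units.
OUTLOOK_SPLITTER_STYLES = (
    # outlook 2007, 2010 (international)
    "border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm",
    # outlook 2007, 2010 (american)
    "border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in",
)


def find_outlook_splitter(root):
    """Return a descendant <div> whose style equals a known splitter style.

    Hypothetical helper. Styles are tried in the order listed above, so this
    is not strictly equivalent to an XPath union, which returns matches in
    document order. Returns None when no style matches.
    """
    for style in OUTLOOK_SPLITTER_STYLES:
        match = root.find(".//div[@style='%s']" % style)
        if match is not None:
            return match
    return None


if __name__ == "__main__":
    message = ET.fromstring(
        "<body><p>Reply text</p>"
        "<div style='border:none;border-top:solid #B5C4DF 1.0pt;"
        "padding:3.0pt 0in 0in 0in'>From: someone</div></body>"
    )
    splitter = find_outlook_splitter(message)
    print(splitter.text if splitter is not None else "no splitter found")
```

The comparison is an exact string match on the `style` attribute, as in the hunk, so any difference in whitespace or units produces a miss. That is presumably why the change lists the `0in` variant as a separate alternative.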