Merge pull request #177 from mailgun/obukhov-sergey-patch-1

Update Readme with how to retrain on your own data
2018-11-02 15:22:18 +03:00
2 changed files with 27 additions and 1 deletion
--- a/README.rst
+++ b/README.rst
@@ -129,6 +129,22 @@ start using it for talon.
 .. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
 .. _forge: https://github.com/mailgun/forge

+Training on your dataset
+------------------------
+
+talon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the `forge`_ project does. Then do:
+
+.. code:: python
+
+    from talon.signature.learning.dataset import build_extraction_dataset
+    from talon.signature.learning import classifier as c 
+    
+    build_extraction_dataset("/path/to/your/P/folder", "/path/to/talon/signature/data/train.data")
+    c.train(c.init(), "/path/to/talon/signature/data/train.data", "/path/to/talon/signature/data/classifier")
+
+Note that for signature extraction you need just the folder with the positive samples with annotated signature lines (P folder).
+
+.. _forge: https://github.com/mailgun/forge

 Research
 --------
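
Review note on the README addition: a minimal sketch of how the two documented calls could be wrapped in one helper. The names `training_paths` and `retrain` and the idea of a single `data_dir` are hypothetical; only `build_extraction_dataset`, `classifier.init` and `classifier.train` come from the README text. talon is imported lazily so the path handling can be exercised without it installed.

```python
from pathlib import Path


def training_paths(data_dir):
    """Return (train.data path, classifier path) under data_dir as strings."""
    data_dir = Path(data_dir)
    return str(data_dir / "train.data"), str(data_dir / "classifier")


def retrain(p_folder, data_dir):
    """Rebuild the extraction dataset from p_folder, then retrain the classifier."""
    p_folder = Path(p_folder)
    if not p_folder.is_dir():
        raise FileNotFoundError("P folder not found: %s" % p_folder)

    # Imported here so the helpers above work without talon installed.
    from talon.signature.learning.dataset import build_extraction_dataset
    from talon.signature.learning import classifier as c

    train_data, classifier_file = training_paths(data_dir)
    build_extraction_dataset(str(p_folder), train_data)
    c.train(c.init(), train_data, classifier_file)
    return classifier_file
```

Failing early on a missing P folder keeps a typo in the path from overwriting an existing `train.data` with an empty dataset.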
--- a/talon/quotations.py
+++ b/talon/quotations.py
@@ -38,6 +38,8 @@ RE_ON_DATE_SMB_WROTE = re.compile(
            'Op',
            # German
            'Am',
+            # Portuguese
+            'Em',
            # Norwegian
            u'På',
            # Swedish, Danish
@@ -64,6 +66,8 @@ RE_ON_DATE_SMB_WROTE = re.compile(
            'schreef','verzond','geschreven',
            # German
            'schrieb',
+            # Portuguese
+            'escreveu',
            # Norwegian, Swedish
            'skrev',
            # Vietnamese
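
Review note on the Portuguese entries: a simplified sketch of how a reply header such as "Em <date>, <somebody> escreveu:" is recognised once 'Em' and 'escreveu' are in the word lists. The pattern below is a stand-in written for illustration, not the actual `RE_ON_DATE_SMB_WROTE` expression, and the word lists and sample lines are assumptions.

```python
import re

# Hypothetical, trimmed-down word lists; the real lists cover more languages.
BEGINNINGS = ("On", "Op", "Am", "Em")
ENDINGS = ("wrote", "schreef", "schrieb", "escreveu")

# Simplified stand-in: opener word, anything, closing verb, optional colon.
SPLITTER = re.compile(
    r"^(?:{0})\s.*\b(?:{1}):?\s*$".format("|".join(BEGINNINGS), "|".join(ENDINGS))
)


def is_splitter(line):
    """Return True if the line looks like a reply header."""
    return SPLITTER.match(line) is not None


print(is_splitter("Em 11 de set de 2018, Fulano <fulano@example.com> escreveu:"))
print(is_splitter("Em breve eu respondo."))
```

The second sample starts with 'Em' but has no closing verb, which is why both word lists are needed: the opener alone is a common Portuguese preposition.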
@@ -286,7 +290,7 @@ def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
    # inlined reply
    # use lookbehind assertions to find overlapping entries e.g. for 'mtmtm'
    # both 't' entries should be found
-    for inline_reply in re.finditer('(?<=m)e*((?:t+e*)+)m', markers):
+    for inline_reply in re.finditer('(?<=m)e*(t[te]*)m', markers):
        # long links could break sequence of quotation lines but they shouldn't
        # be considered an inline reply
        links = (
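
Review note on the regexp change: the old group `(?:t+e*)+` nests one quantifier inside another, so a long run of `t` markers with no closing `m` can be split in exponentially many ways before the match fails. `t[te]*` accepts the same strings (a `t` followed by any mix of `t` and `e`) with a single quantifier. A small check of that claim, using marker strings made up for illustration:

```python
import re

OLD = re.compile(r'(?<=m)e*((?:t+e*)+)m')   # nested quantifiers
NEW = re.compile(r'(?<=m)e*(t[te]*)m')      # single quantifier, same strings


def inline_replies(pattern, markers):
    """Return (start, end, text) of group 1 for every match in markers."""
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(markers)]


samples = ['mtmtm', 'metetem', 'mttem', 'tmem', 'mm']
for markers in samples:
    # Both patterns agree on these short inputs.
    assert inline_replies(OLD, markers) == inline_replies(NEW, markers)

# A long unterminated run of 't' is the shape that makes OLD degrade;
# only NEW is run on it here.
long_run = 'm' + 't' * 5000 + 'e'
assert inline_replies(NEW, long_run) == []
```

The lookbehind is unchanged, so overlapping entries such as both `t` runs in `'mtmtm'` are still found.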
@@ -430,6 +434,9 @@ def _extract_from_html(msg_body):
    Extract not quoted message from provided html message body
    using tags and plain text algorithm.

+    Cut out first some encoding html tags such as xml and doctype
+    for avoiding conflict with unicode decoding
+
    Cut out the 'blockquote', 'gmail_quote' tags.
    Cut Microsoft quotations.

@@ -445,6 +452,9 @@ def _extract_from_html(msg_body):
        return msg_body

    msg_body = msg_body.replace(b'\r\n', b'\n')
+
+    msg_body = re.sub(r"\<\?xml.+\?\>|\<\!DOCTYPE.+]\>", "", msg_body)
+
    html_tree = html_document_fromstring(msg_body)

    if html_tree is None:
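
Review note on the declaration stripping: `msg_body` is a byte string at this point (the preceding line replaces `b'\r\n'`), and on Python 3 `re.sub` refuses to combine a str pattern with bytes input. A minimal sketch of a bytes-only variant follows; `strip_declarations` is a hypothetical helper name and the sample body is made up. Also worth noting: the DOCTYPE branch only matches declarations that end in `]>`, so a plain `<!DOCTYPE html>` is left in place by the expression as written.

```python
import re

# Same expression as in the patch, compiled as a bytes pattern.
RE_DECLARATIONS = re.compile(br"\<\?xml.+\?\>|\<\!DOCTYPE.+]\>")


def strip_declarations(msg_body):
    """Remove XML and DOCTYPE declarations from a bytes HTML body."""
    return RE_DECLARATIONS.sub(b"", msg_body)


body = b'<?xml version="1.0" encoding="utf-8"?>\n<html><body>Hi</body></html>'
cleaned = strip_declarations(body)

# On Python 3 a str pattern against the same bytes input raises TypeError.
try:
    re.sub(r"\<\?xml.+\?\>", "", body)
    str_pattern_ok = True
except TypeError:
    str_pattern_ok = False
```

Because `.` does not match a newline and `.+` is greedy, the expression removes everything up to the last `?>` on the same line, which is something to keep in mind for single-line documents.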
Author	SHA1	Message	Date
Sergey Obukhov	6a304215c3	Merge pull request #177 from mailgun/obukhov-sergey-patch-1 Update Readme with how to retrain on your own data	2018-11-02 15:22:18 +03:00
Sergey Obukhov	31714506bd	Update Readme with how to retrain on your own data	2018-11-02 15:21:36 +03:00
Sergey Obukhov	403d80cf3b	Merge pull request #161 from glaand/master Fix: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.	2018-11-02 15:03:02 +03:00
Sergey Obukhov	7cf20f2877	Merge branch 'master' into master	2018-11-02 14:52:38 +03:00
Sergey Obukhov	685abb1905	Merge pull request #171 from gabriellima95/Add-Portuguese-Language Add Portuguese language to quotations	2018-11-02 09:12:43 +03:00
Sergey Obukhov	41990727a3	Merge branch 'master' into Add-Portuguese-Language	2018-11-02 09:11:07 +03:00
Sergey Obukhov	b113d8ab33	Merge pull request #172 from ad-m/patch-1 Fix catastrophic backtracking in regexp	2018-11-02 09:09:49 +03:00
Adam Dobrawy	7bd0e9cc2f	Fix catastrophic backtracking in regexp Co-Author: @Nipsuli	2018-09-21 22:00:10 +02:00
gabriellima95	1e030a51d4	Add Portuguese language to quotations	2018-09-11 15:27:39 -03:00
André Glatzl	53b24ffb3d	Cut out first some encoding html tags such as xml and doctype for avoiding conflict with unicode decoding	2017-12-19 15:15:10 +01:00
Sergey Obukhov	a7404afbcb	Merge pull request #155 from mailgun/sergey/appointment fix appointments in text	2017-10-23 16:34:08 -07:00