11 Commits

Author SHA1 Message Date
Sergey Obukhov
6a304215c3 Merge pull request #177 from mailgun/obukhov-sergey-patch-1
Update Readme with how to retrain on your own data
2018-11-02 15:22:18 +03:00
Sergey Obukhov
31714506bd Update Readme with how to retrain on your own data 2018-11-02 15:21:36 +03:00
Sergey Obukhov
403d80cf3b Merge pull request #161 from glaand/master
Fix: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
2018-11-02 15:03:02 +03:00
Sergey Obukhov
7cf20f2877 Merge branch 'master' into master 2018-11-02 14:52:38 +03:00
Sergey Obukhov
685abb1905 Merge pull request #171 from gabriellima95/Add-Portuguese-Language
Add Portuguese language to quotations
2018-11-02 09:12:43 +03:00
Sergey Obukhov
41990727a3 Merge branch 'master' into Add-Portuguese-Language 2018-11-02 09:11:07 +03:00
Sergey Obukhov
b113d8ab33 Merge pull request #172 from ad-m/patch-1
Fix catastrophic backtracking in regexp
2018-11-02 09:09:49 +03:00
Adam Dobrawy
7bd0e9cc2f Fix catastrophic backtracking in regexp
Co-Author: @Nipsuli
2018-09-21 22:00:10 +02:00
gabriellima95
1e030a51d4 Add Portuguese language to quotations 2018-09-11 15:27:39 -03:00
André Glatzl
53b24ffb3d Cut out first some encoding html tags such as xml and doctype for avoiding conflict with unicode decoding 2017-12-19 15:15:10 +01:00
Sergey Obukhov
a7404afbcb Merge pull request #155 from mailgun/sergey/appointment
fix appointments in text
2017-10-23 16:34:08 -07:00
2 changed files with 27 additions and 1 deletions

View File

@@ -129,6 +129,22 @@ start using it for talon.
.. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
.. _forge: https://github.com/mailgun/forge
Training on your dataset
------------------------
talon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the `forge`_ project does. Then do:
.. code:: python
from talon.signature.learning.dataset import build_extraction_dataset
from talon.signature.learning import classifier as c
build_extraction_dataset("/path/to/your/P/folder", "/path/to/talon/signature/data/train.data")
c.train(c.init(), "/path/to/talon/signature/data/train.data", "/path/to/talon/signature/data/classifier")
Note that for signature extraction you need just the folder with the positive samples with annotated signature lines (P folder).
.. _forge: https://github.com/mailgun/forge
Research
--------

View File

@@ -38,6 +38,8 @@ RE_ON_DATE_SMB_WROTE = re.compile(
'Op',
# German
'Am',
# Portuguese
'Em',
# Norwegian
u'',
# Swedish, Danish
@@ -64,6 +66,8 @@ RE_ON_DATE_SMB_WROTE = re.compile(
'schreef','verzond','geschreven',
# German
'schrieb',
# Portuguese
'escreveu',
# Norwegian, Swedish
'skrev',
# Vietnamese
@@ -286,7 +290,7 @@ def process_marked_lines(lines, markers, return_flags=[False, -1, -1]):
# inlined reply
# use lookbehind assertions to find overlapping entries e.g. for 'mtmtm'
# both 't' entries should be found
for inline_reply in re.finditer('(?<=m)e*((?:t+e*)+)m', markers):
for inline_reply in re.finditer('(?<=m)e*(t[te]*)m', markers):
# long links could break sequence of quotation lines but they shouldn't
# be considered an inline reply
links = (
@@ -430,6 +434,9 @@ def _extract_from_html(msg_body):
Extract not quoted message from provided html message body
using tags and plain text algorithm.
Cut out first some encoding html tags such as xml and doctype
for avoiding conflict with unicode decoding
Cut out the 'blockquote', 'gmail_quote' tags.
Cut Microsoft quotations.
@@ -445,6 +452,9 @@ def _extract_from_html(msg_body):
return msg_body
msg_body = msg_body.replace(b'\r\n', b'\n')
msg_body = re.sub(r"\<\?xml.+\?\>|\<\!DOCTYPE.+]\>", "", msg_body)
html_tree = html_document_fromstring(msg_body)
if html_tree is None: