Merge pull request #94 from mailgun/obukhov-sergey-patch-1

Update README.rst
bump version
4 changed files with 42 additions and 5 deletions
--- a/README.rst
+++ b/README.rst
@@ -95,7 +95,7 @@ classifiers. The core of machine learning algorithm lays in
 apply to a message (``featurespace.py``), how data sets are built
 (``dataset.py``), classifier’s interface (``classifier.py``).

-The data used for training is taken from our personal email
+Currently the data used for training is taken from our personal email
 conversations and from `ENRON`_ dataset. As a result of applying our set
 of features to the dataset we provide files ``classifier`` and
 ``train.data`` that don’t have any personal information but could be
--- a/setup.py
+++ b/setup.py
@@ -2,7 +2,7 @@ from setuptools import setup, find_packages


 setup(name='talon',
-      version='1.2.6',
+      version='1.2.9',
      description=("Mailgun library "
                   "to extract message quotations and signatures."),
      long_description=open("README.rst").read(),
--- a/talon/html_quotations.py
+++ b/talon/html_quotations.py
@@ -86,9 +86,12 @@ def cut_gmail_quote(html_message):
 def cut_microsoft_quote(html_message):
    ''' Cuts splitter block and all following blocks. '''
    splitter = html_message.xpath(
-        #outlook 2007, 2010
+        #outlook 2007, 2010 (international)
        "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
        "padding:3.0pt 0cm 0cm 0cm']|"
+        #outlook 2007, 2010 (american)
+        "//div[@style='border:none;border-top:solid #B5C4DF 1.0pt;"
+        "padding:3.0pt 0in 0in 0in']|"
        #windows mail
        "//div[@style='padding-top: 5px; "
        "border-top-color: rgb(229, 229, 229); "
@@ -175,7 +178,21 @@ def cut_from_block(html_message):
                len(maybe_body.getchildren()) == 1)

            if not parent_div_is_all_content:
-                block.getparent().remove(block)
+                parent = block.getparent()
+                next_sibling = block.getnext()
+
+                # remove all tags after found From block
+                # (From block and quoted message are in separate divs)
+                while next_sibling is not None:
+                    parent.remove(block)
+                    block = next_sibling
+                    next_sibling = block.getnext()
+
+                # remove the last sibling (or the
+                # From block if no siblings)
+                if block is not None:
+                    parent.remove(block)
+
                return True
        else:
            return False
--- a/tests/html_quotations_test.py
+++ b/tests/html_quotations_test.py
@@ -279,6 +279,26 @@ def test_reply_separated_by_hr():
            '', quotations.extract_from_html(REPLY_SEPARATED_BY_HR)))


+def test_from_block_and_quotations_in_separate_divs():
+    msg_body = '''
+Reply
+<div>
+  <hr/>
+  <div>
+    <font>
+      <b>From: bob@example.com</b>
+      <b>Date: Thu, 24 Mar 2016 08:07:12 -0700</b>
+    </font>
+  </div>
+  <div>
+    Quoted message
+  </div>
+</div>
+'''
+    eq_('<html><body><p>Reply</p><div><hr></div></body></html>',
+        RE_WHITESPACE.sub('', quotations.extract_from_html(msg_body)))
+
+
 def extract_reply_and_check(filename):
    f = open(filename)
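
The `cut_from_block` change above removes the From block together with every sibling that follows it, which covers the case where the quoted message sits in its own div. The sketch below illustrates that idea only. It swaps lxml for the standard library's `xml.etree.ElementTree`, which has no `getparent()` or `getnext()`, so it looks the parent up through a map; the helper name `remove_block_and_following_siblings` and the sample markup are assumptions made for illustration and are not part of talon.

```python
import xml.etree.ElementTree as ET


def remove_block_and_following_siblings(root, block):
    """Detach `block` and every sibling after it from their shared parent.

    Hypothetical helper, not talon's code. `block` must not be `root`,
    because the root element has no parent to remove it from.
    """
    # ElementTree elements do not know their parent, so build a lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of[block]
    siblings = list(parent)
    # Remove the block itself and everything that comes after it.
    for element in siblings[siblings.index(block):]:
        parent.remove(element)


# Same shape as the new test case: the From block and the quoted
# message are separate sibling divs under one wrapper div.
SAMPLE = (
    "<body><p>Reply</p><div><hr/>"
    "<div><b>From: bob@example.com</b></div>"
    "<div>Quoted message</div>"
    "</div></body>"
)

body = ET.fromstring(SAMPLE)
from_block = body.find("./div/div")  # first inner div holds the From line
remove_block_and_following_siblings(body, from_block)
remaining = [child.tag for child in body.find("div")]
```

After the call, the wrapper div keeps only the `hr` element, which matches what the new test expects from `extract_from_html`.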
Author	SHA1	Message	Date
Sergey Obukhov	5bcf7403ad	Merge pull request #94 from mailgun/obukhov-sergey-patch-1 Update README.rst	2016-05-31 20:16:13 -07:00
Sergey Obukhov	2d6c092b65	bump version	2016-05-31 18:42:47 -07:00
Sergey Obukhov	6d0689cad6	Update README.rst	2016-05-31 18:39:07 -07:00
Sergey Obukhov	3f80e93ee0	Merge pull request #93 from mailgun/sergey/version-bump bump	2016-05-31 18:15:28 -07:00
Sergey Obukhov	1b18abab1d	bump	2016-05-31 16:53:41 -07:00
Sergey Obukhov	03dd5af5ab	Merge pull request #91 from KevinCathcart/patch-1 Support outlook 2007/2010 running in en-us locale	2016-05-31 16:50:35 -07:00
Sergey Obukhov	dfba82b07c	Merge pull request #92 from mailgun/obukhov-sergey-kuntzcamera Update README.rst	2016-05-31 15:42:34 -07:00
Sergey Obukhov	08ca02c87f	Update README.rst	2016-05-31 15:14:32 -07:00
Kevin Cathcart	b61f4ec095	Support outlook 2007/2010 running in en-us locale. My American English copy of outlook 2007 is using inches in the reply separator rather than centimeters. The separator is otherwise identical. What a strange thing to localize. I'm guessing it uses whatever it thinks the preferred units for page margins are.	2016-05-23 17:23:53 -04:00
Sergey Obukhov	9dbe6a494b	Merge pull request #90 from mailgun/sergey/89 fixes mailgun/talon#89	2016-05-17 16:01:56 -07:00
Sergey Obukhov	44e70939d6	fixes mailgun/talon#89	2016-05-17 15:31:01 -07:00
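
Commit b61f4ec095 notes that the Outlook 2007/2010 reply separator differs between locales only in the padding unit (`0cm` versus `0in`). A minimal sketch of matching both variants follows. It uses the standard library's `xml.etree.ElementTree` instead of the lxml XPath union that `cut_microsoft_quote` uses, and the names `OUTLOOK_SEPARATOR_STYLES` and `find_outlook_separator` are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Shared prefix of the separator style, copied from the diff above.
_BORDER = "border:none;border-top:solid #B5C4DF 1.0pt;"

# The two known variants: centimeters (international) and inches (en-US).
OUTLOOK_SEPARATOR_STYLES = frozenset([
    _BORDER + "padding:3.0pt 0cm 0cm 0cm",
    _BORDER + "padding:3.0pt 0in 0in 0in",
])


def find_outlook_separator(root):
    """Return the first div whose style is a known separator, else None.

    Hypothetical helper, not talon's API. The style must match exactly,
    as it does in the XPath expressions in the diff.
    """
    for div in root.iter("div"):
        if div.get("style") in OUTLOOK_SEPARATOR_STYLES:
            return div
    return None


def _sample(unit):
    # Build a tiny well-formed document with one separator div.
    style = _BORDER + "padding:3.0pt 0{u} 0{u} 0{u}".format(u=unit)
    return ET.fromstring(
        "<body><p>Reply</p><div style='%s'>From: bob</div></body>" % style
    )


found_cm = find_outlook_separator(_sample("cm"))
found_in = find_outlook_separator(_sample("in"))
found_px = find_outlook_separator(_sample("px"))  # unknown unit
```

Keeping the variants in one collection means a further locale-specific unit, if one turns up, is a one-line addition rather than another XPath branch.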